# Monitoring and Observability - Skynet v8

## Monitoring Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Skynet v8 Services                      │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│   │ skynet-core  │   │ skynet-api   │   │ skynet-db    │    │
│   └──────────────┘   └──────────────┘   └──────────────┘    │
│          │                  │                  │            │
│          └──────────────────┼──────────────────┘            │
│                             │                               │
│          ┌──────────────────┴──────────────────┐            │
│          ▼                  ▼                  ▼            │
│   ┌─────────────────────────────────────────────────────┐   │
│   │               Exporters / Collectors                │   │
│   │  ┌──────────────┐  ┌──────────────┐  ┌────────────┐ │   │
│   │  │  Prometheus  │  │  Node Expt   │  │  Filebeat  │ │   │
│   │  │   Exporter   │  │   (system)   │  │   (logs)   │ │   │
│   │  └──────────────┘  └──────────────┘  └────────────┘ │   │
│   └─────────────────────────────────────────────────────┘   │
└──────────────────────┬──────────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
   ┌─────────┐    ┌──────────┐  ┌──────────────┐
   │ Grafana │    │   ELK    │  │  Alert Mgr   │
   │  (Viz)  │    │  (Logs)  │  │   (Alerts)   │
   └─────────┘    └──────────┘  └──────────────┘
        │              │              │
        └──────────────┼──────────────┘
                       ▼
              ┌─────────────────┐
              │  Notification   │
              │    Channels     │
              │  (Slack/Email)  │
              └─────────────────┘
```

---

## Metrics to Monitor

### 1. Application Metrics (skynet-core)

#### Processing and Throughput

```
Metric               | Normal range | Warning   | Critical
---------------------|--------------|-----------|----------
requests_per_second  | 100-500      | > 750     | > 1000
response_time_p50    | < 100ms      | > 200ms   | > 500ms
response_time_p95    | < 500ms      | > 1000ms  | > 2000ms
response_time_p99    | < 1000ms     | > 2000ms  | > 5000ms
error_rate           | < 0.1%       | > 0.5%    | > 1%
success_rate         | > 99.9%      | < 99.5%   | < 99%
```

#### Prometheus Examples

```
# Request rate (requests/sec)
rate(http_requests_total[5m])

# Response time percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate (fraction of all requests that returned 5xx).
# Both sides are summed first: dividing the raw series would only match
# identical label sets and always return 1 for the 5xx series.
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Active connections
skynet_active_connections
```

### 2. Database Metrics (skynet-database)

#### Performance

```
Metric                      | Normal range | Warning  | Critical
----------------------------|--------------|----------|----------
query_execution_time_p95    | < 100ms      | > 500ms  | > 2000ms
transactions_per_second     | 100-1000     | > 2000   | > 5000
active_connections          | 10-50        | > 100    | > 150
connection_utilization_pct  | < 50%        | > 80%    | > 95%
replication_lag_bytes       | < 1MB        | > 100MB  | > 1GB
```

#### Prometheus Examples

```
# Query latency
histogram_quantile(0.95, rate(pg_slow_queries_duration_seconds_bucket[5m]))

# Transactions/sec
rate(pg_transactions_total[1m])

# Active connections
pg_stat_activity_count{state="active"}

# Replication lag in MB
pg_replication_lag_bytes / 1024 / 1024

# Connection utilization (%). The two metrics carry different labels,
# so the connection counts are summed per instance before dividing.
sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections * 100
```

### 3. System Metrics (Node Exporter)

#### CPU and Memory

```
Metric                | Normal range | Warning      | Critical
----------------------|--------------|--------------|-----------
cpu_usage_percent     | < 70%        | > 80%        | > 95%
load_average_1min     | < cores      | > cores*1.5  | > cores*2
memory_usage_percent  | < 80%        | > 85%        | > 95%
swap_usage_percent    | < 10%        | > 30%        | > 50%
```

#### Disk and I/O

```
Metric                | Normal range | Warning    | Critical
----------------------|--------------|------------|----------
disk_usage_percent    | < 70%        | > 80%      | > 95%
disk_io_read_mb_sec   | variable     | > 500MB/s  | > 1GB/s
disk_io_write_mb_sec  | variable     | > 500MB/s  | > 1GB/s
inode_usage_percent   | < 70%        | > 80%      | > 95%
```

#### Network

```
Metric                | Normal range | Warning    | Critical
----------------------|--------------|------------|-----------
network_in_mbps       | variable     | > 900Mbps  | > 980Mbps
network_out_mbps      | variable     | > 900Mbps  | > 980Mbps
packet_loss_percent   | < 0.1%       | > 0.5%     | > 1%
tcp_connections       | < 10000      | > 20000    | > 30000
tcp_time_wait         | < 5000       | > 10000    | > 20000
```

#### Prometheus Examples

```
# CPU usage (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage (%)
(node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

# Outbound network throughput in Mbit/s, the unit used in the Network table
rate(node_network_transmit_bytes_total{device="eth0"}[5m]) * 8 / 1e6
```

### 4. Security Metrics

#### Access and Authentication

```
Metric                         | Normal range        | Alert
-------------------------------|---------------------|----------------
failed_login_attempts_per_min  | < 5                 | > 20
ssh_login_failures             | < 3/10min           | > 10/10min
sudo_usage_anomaly             | historical baseline | +50% deviation
unauthorized_api_calls         | < 1/min             | > 10/min
```

#### Data Integrity

```
Metric                        | Action
------------------------------|---------------------------------
database_checksum_errors      | Alert + investigate immediately
file_integrity_changes        | Log + weekly review
unauthorized_user_creation    | Critical alert
privilege_escalation_attempt  | Critical alert + forensic log
```

#### Example: Anomaly Detection

```bash
# Look for unauthorized changes to critical files
aide --check --config=/etc/aide/aide.conf

# Audit permission changes: list the active audit rules,
# then show the latest events tagged with the "privileged" key
auditctl -l
ausearch -k privileged | tail -100

# Monitor login attempts
faillog -a      # Failure summary for all users
lastb -n 10     # Last 10 failed login sessions
```

---

## Configured Alerts

### Alerting System

File: `/etc/prometheus/alert_rules.yml`

#### P1 Alerts (CRITICAL - 1 minute)

```yaml
- alert: ServiceDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} DOWN"
    action: "Check immediately, run the restart runbook"

- alert: DiskFull
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Disk over 95% full on {{ $labels.instance }}"
    action: "Run the storage scale-up procedure"

- alert: HighErrorRate
  # Ratio of 5xx responses to all responses, per job
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error rate > 5% on {{ $labels.job }}"
    action: "Investigate logs, possible DDoS or application failure"

- alert: DatabaseDown
  expr: pg_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "PostgreSQL DOWN on {{ $labels.instance }}"
    action: "Run the failover or restart runbook"

- alert: ReplicationLagCritical
  expr: (pg_replication_lag_bytes / 1024 / 1024) > 1000  # value in MB, roughly 1GB
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Replication lag {{ $value }}MB"
    action: "Check the network, increase bandwidth, or promote a replica"
```

#### P2 Alerts (HIGH - 5 minutes)

```yaml
- alert: HighCPU
  # Averaged per instance so that the instance label used in the summary survives
  expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "CPU > 80% on {{ $labels.instance }}"
    action: "Investigate processes, consider scaling"

- alert: HighMemory
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Memory > 85% on {{ $labels.instance }}"
    action: "Investigate memory leaks, add RAM"

- alert: SlowQueries
  expr: histogram_quantile(0.95, rate(pg_slow_queries_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Slow queries, p95 {{ $value }}s"
    action: "Analyze with EXPLAIN ANALYZE, optimize indexes"

- alert: HighConnectionCount
  # Expressed as a percentage so that {{ $value }} matches the summary text
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections * 100 > 80
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "DB connections at {{ $value }}%"
    action: "Investigate connection leaks, raise max_connections"

- alert: DDoSDetected
  # Total request rate across all series
  expr: sum(rate(http_requests_total[5m])) > 10000
  for: 2m
  labels:
    severity: high
  annotations:
    summary: "Possible DDoS: {{ $value }} req/sec"
    action: "Enable rate limiting, contact the ISP"
```

#### P3 Alerts (MEDIUM - 15 minutes)

```yaml
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
  for: 15m
  labels:
    severity: medium
  annotations:
    summary: "Less than 15% disk space free on {{ $labels.instance }}"
    action: "Schedule a storage scale-up"

- alert: BackupFailed
  expr: time() - backup_last_successful_timestamp > 86400  # 24h
  for: 1h
  labels:
    severity: medium
  annotations:
    summary: "No successful backup in > 24h"
    action: "Check the backup process and its logs"

- alert: CertificateExpiring
  expr: (ssl_cert_expire_timestamp - time()) / 86400 < 30
  labels:
    severity: medium
  annotations:
    summary: "SSL cert expires in {{ $value }} days"
    action: "Renew the certificate"
```

---

## Available Dashboards

### Dashboard 1: System Overview

**URL**: `http://grafana.skynet.ttzr/d/overview`

**Panels**:
- Service status (green/red)
- Request rate (last 24h)
- Error rate (trending)
- CPU/Memory (gauges)
- Disk space (stacked)
- Network I/O (dual axis)
- Database connections
- Top errors (table)

### Dashboard 2: Application Performance

**URL**: `http://grafana.skynet.ttzr/d/app-performance`

**Panels**:
- Response time p50/p95/p99
- Throughput (requests/sec)
- Error rate by endpoint
- Hot endpoints (top 10)
- Request distribution (by method)
- Slowest endpoints
- Error distribution by type
- Request size distribution

### Dashboard 3: Database Performance

**URL**: `http://grafana.skynet.ttzr/d/db-performance`

**Panels**:
- Transactions/sec
- Query execution time percentiles
- Slow queries (top 10)
- Table sizes
- Index usage
- Cache hit ratio
- Replication lag
- Active connections
- Lock contention
- Autovacuum status

### Dashboard 4: Infrastructure and Hardware

**URL**: `http://grafana.skynet.ttzr/d/infrastructure`

**Panels**:
- CPU utilization (per core)
- Memory breakdown (used/buffers/cache)
- Load average
- Disk I/O (read/write MB/s)
- Disk space by mount point
- Network throughput (in/out)
- Network errors/dropped packets
- TCP connection states
- Inode usage

### Dashboard 5: Security and Logs

**URL**: `http://grafana.skynet.ttzr/d/security`

**Panels**:
- Failed login attempts
- SSH connection attempts
- Firewall blocks (top IPs)
- Privilege escalation attempts
- Unauthorized API calls
- File integrity changes
- Process anomalies
- Unusual network connections

---

## Important Logs

### Log Locations

```
/var/log/skynet/
├── core.log            # Main application logs
├── api.log             # API server logs
├── database.log        # DB connection logs
└── error.log           # Consolidated errors

/var/log/
├── auth.log            # Logins, sudo
├── syslog              # System and kernel messages
├── fail2ban.log        # Failed access attempts
└── postgresql/
    └── postgresql.log  # PostgreSQL logs

/var/lib/postgresql/
└── pg_log/             # Detailed DB logs
```

### Log Analysis Queries

#### ELK / Kibana

```json
# Top 10 endpoints by error count
{
  "aggs": {
    "endpoints": {
      "terms": {
        "field": "endpoint.keyword",
        "size": 10,
        "order": {"errors": "desc"}
      },
      "aggs": {
        "errors": {
          "filter": {"range": {"status": {"gte": 500}}}
        }
      }
    }
  }
}

# Response time trend
# (on Elasticsearch 7.2 and later, use "fixed_interval" instead of "interval")
{
  "aggs": {
    "time_trend": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m"
      },
      "aggs": {
        "p95_latency": {
          "percentiles": {
            "field": "response_time_ms",
            "percents": [95]
          }
        }
      }
    }
  }
}

# Failed logins by IP
{
  "query": {
    "term": {"event": "failed_login"}
  },
  "aggs": {
    "by_ip": {
      "terms": {"field": "source_ip", "size": 20}
    }
  }
}
```

### Useful Analysis Commands

```bash
# Top errors in error.log, grouped by the text between "ERROR" and the next colon
grep ERROR /var/log/skynet/error.log | grep -o "ERROR [^:]*" | \
  sort | uniq -c | sort -rn | head -10

# Slowest endpoints (assumes field 5 holds the response time)
grep "response_time" /var/log/skynet/api.log | \
  awk '{print $5, $1}' | sort -n | tail -10

# Failed access attempts by source IP
grep "Failed" /var/log/auth.log | \
  grep -o "from [0-9.]*" | sort | uniq -c | sort -rn

# Errors per hour (assumes each line starts with an ISO 8601 timestamp,
# so everything before the first colon is the date plus the hour)
grep ERROR /var/log/skynet/error.log | \
  cut -d' ' -f1 | cut -d: -f1 | uniq -c
```

---

## Instrumentation and Exporters

### Prometheus Node Exporter

```bash
# Check that it is running
systemctl status prometheus-node-exporter

# Available metrics
curl -s http://localhost:9100/metrics | head -50

# Configuration
cat /etc/prometheus/node_exporter_args
# Typically: --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
```

### Custom Exporter (skynet-exporter)

```bash
# Custom Skynet metrics
curl -s http://localhost:9200/metrics | grep skynet_

# Examples:
# skynet_requests_total{endpoint="/api/v1/data"}
# skynet_response_time_seconds_bucket{endpoint="/api/v1/data", le="0.1"}
# skynet_database_connections{state="active"}
# skynet_cache_hit_ratio
```

### PostgreSQL Exporter

```bash
# Check the PostgreSQL exporter
curl -s http://localhost:9187/metrics | grep pg_

# Main metrics:
# pg_stat_database_tup_fetched
# pg_stat_database_tup_inserted
# pg_stat_database_tup_updated
# pg_stat_database_tup_deleted
# pg_up (1 if the DB is up, 0 if it is down)
```

---

## Daily Monitoring Checklist

- [ ] Review the status of all dashboards
- [ ] Verify that there are no unresolved P1 alerts
- [ ] Analyze performance trends (is anything getting worse?)
- [ ] Verify that the backup completed successfully
- [ ] Review the access log for anomalies
- [ ] Confirm that replication is up to date
- [ ] Check disk usage (< 85%)
- [ ] Review SSL certificates (> 30 days until expiry)
- [ ] Write an incident report if there was a P1/P2
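### Example: Scripting the Disk Usage Check

The disk usage item in the daily checklist can be checked from a shell when Grafana is not at hand. The following is a minimal sketch, not part of the Skynet tooling: the function names `classify_disk_usage` and `report_disk_usage`, the `--report` flag, and the `OK`/`WARN`/`CRITICAL` labels are all hypothetical. It applies the `disk_usage_percent` thresholds from the Disk and I/O table (warning above 80%, critical above 95%) to the output of `df -P`.

```bash
#!/usr/bin/env bash
# Sketch only: function names, labels and the --report flag are illustrative
# assumptions, not part of any existing Skynet script.

# Map an integer usage percentage to a severity label, using the
# disk_usage_percent thresholds (warning > 80, critical > 95).
classify_disk_usage() {
    pct=$1
    if [ "$pct" -gt 95 ]; then
        echo "CRITICAL"
    elif [ "$pct" -gt 80 ]; then
        echo "WARN"
    else
        echo "OK"
    fi
}

# Print "<mount point> <usage>% <severity>" for each filesystem.
# In `df -P` output, column 5 is the capacity (e.g. "42%") and column 6
# is the mount point; rows without a numeric capacity are skipped.
report_disk_usage() {
    df -P | awk 'NR > 1 { sub(/%$/, "", $5); if ($5 ~ /^[0-9]+$/) print $5, $6 }' |
    while read -r pct mount; do
        printf '%s %s%% %s\n' "$mount" "$pct" "$(classify_disk_usage "$pct")"
    done
}

# Only produce the report when explicitly asked to.
if [ "${1:-}" = "--report" ]; then
    report_disk_usage
fi
```

Saved under any name and run with `--report`, the script prints one line per mounted filesystem. Mount points that contain spaces are not handled, and the stricter 85% figure from the checklist would need its own branch if it is the intended daily limit.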