Monitoring and Observability - Skynet v8

Monitoring Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Skynet v8 Services                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ skynet-core  │  │ skynet-api   │  │ skynet-db    │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│         │                  │                   │             │
│         └──────────────────┼───────────────────┘             │
│                            │                                 │
│         ┌──────────────────┴──────────────────┐              │
│         ▼                  ▼                   ▼              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Exporters / Collectors                   │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌────────────┐ │   │
│  │  │ Prometheus   │  │  Node Expt   │  │ Filebeat   │ │   │
│  │  │  Exporter    │  │  (system)    │  │ (logs)     │ │   │
│  │  └──────────────┘  └──────────────┘  └────────────┘ │   │
│  └─────────────────────────────────────────────────────┘   │
└──────────────────────┬──────────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
   ┌─────────┐  ┌──────────┐   ┌──────────────┐
   │ Grafana │  │ ELK      │   │ Alert Mgr    │
   │ (Viz)   │  │ (Logs)   │   │ (Alerts)     │
   └─────────┘  └──────────┘   └──────────────┘
        │              │               │
        └──────────────┼───────────────┘
                       ▼
              ┌─────────────────┐
              │ Notification    │
              │ Channels        │
              │ (Slack/Email)   │
              └─────────────────┘

Metrics to Monitor

1. Application Metrics (skynet-core)

Processing and Throughput

Metric                     | Normal Range  | Warning     | Critical
---------------------------|---------------|-------------|----------------
requests_per_second        | 100-500       | > 750       | > 1000
response_time_p50          | < 100ms       | > 200ms     | > 500ms
response_time_p95          | < 500ms       | > 1000ms    | > 2000ms
response_time_p99          | < 1000ms      | > 2000ms    | > 5000ms
error_rate                 | < 0.1%        | > 0.5%      | > 1%
success_rate               | > 99.9%       | < 99.5%     | < 99%
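
As a rough illustration of how the thresholds in this table might be applied, the Python sketch below classifies a single reading as normal, warning, or critical. The `THRESHOLDS` dict and the `classify` helper are hypothetical names introduced here, not part of Skynet; the numbers are copied from the table.

```python
# Hypothetical helper: thresholds copied from the table above.
# Each entry is (warning, critical, direction); "high" means larger is worse.
THRESHOLDS = {
    "requests_per_second": (750, 1000, "high"),
    "response_time_p95_ms": (1000, 2000, "high"),
    "error_rate_pct": (0.5, 1.0, "high"),
    "success_rate_pct": (99.5, 99.0, "low"),
}


def classify(metric: str, value: float) -> str:
    """Return 'normal', 'warning', or 'critical' for one reading."""
    warn, crit, direction = THRESHOLDS[metric]
    if direction == "low":
        # Flip signs so that "larger is worse" holds in both directions.
        value, warn, crit = -value, -warn, -crit
    if value > crit:
        return "critical"
    if value > warn:
        return "warning"
    return "normal"
```

For example, `classify("error_rate_pct", 0.7)` returns `"warning"`, and `classify("success_rate_pct", 98.0)` returns `"critical"`.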

Prometheus Examples

# Request rate (requests/sec)
rate(http_requests_total[5m])

# Response time percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate (share of 5xx responses). Both sides are summed first,
# otherwise the division finds no series with matching status labels.
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))

# Active connections
skynet_active_connections
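
The percentile queries above rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation. The Python sketch below is a simplified illustration of that idea, not Prometheus's actual implementation; the bucket values in the usage note are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified sketch: the highest bucket is expected to be +Inf and
    0 < q <= 1 is required.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]`, the p95 estimate lands inside the 0.5 to 1.0 second bucket (about 0.81 s), which shows why bucket boundaries limit the precision of the reported percentiles.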

2. Database Metrics (skynet-database)

Performance

Metric                       | Normal Range  | Warning | Critical
-----------------------------|---------------|---------|----------
query_execution_time_p95     | < 100ms       | > 500ms | > 2000ms
transactions_per_second      | 100-1000      | > 2000  | > 5000
active_connections           | 10-50         | > 100   | > 150
connection_utilization_pct   | < 50%         | > 80%   | > 95%
replication_lag_bytes        | < 1MB         | > 100MB | > 1GB

Prometheus Examples

# Query latency
histogram_quantile(0.95, rate(pg_slow_queries_duration_seconds_bucket[5m]))

# Transactions/sec
rate(pg_transactions_total[1m])

# Active connections
pg_stat_activity_count{state="active"}

# Replication lag
pg_replication_lag_bytes / 1024 / 1024  # MiB

# Connection utilization (%): sum all connection states per instance,
# then match on instance so the label sets line up
sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections * 100

3. System Metrics (Node Exporter)

CPU and Memory

Metric                     | Normal Range  | Warning     | Critical
---------------------------|---------------|-------------|----------
cpu_usage_percent          | < 70%         | > 80%       | > 95%
load_average_1min          | < cores       | > cores*1.5 | > cores*2
memory_usage_percent       | < 80%         | > 85%       | > 95%
swap_usage_percent         | < 10%         | > 30%       | > 50%

Disk and I/O

Metric                     | Normal Range  | Warning     | Critical
---------------------------|---------------|-------------|----------
disk_usage_percent         | < 70%         | > 80%       | > 95%
disk_io_read_mb_sec        | variable      | > 500MB/s   | > 1GB/s
disk_io_write_mb_sec       | variable      | > 500MB/s   | > 1GB/s
inode_usage_percent        | < 70%         | > 80%       | > 95%

Network

Metric                     | Normal Range  | Warning     | Critical
---------------------------|---------------|-------------|----------
network_in_mbps            | variable      | > 900Mbps   | > 980Mbps
network_out_mbps           | variable      | > 900Mbps   | > 980Mbps
packet_loss_percent        | < 0.1%        | > 0.5%      | > 1%
tcp_connections            | < 10000       | > 20000     | > 30000
tcp_time_wait              | < 5000        | > 10000     | > 20000

Prometheus Examples

# CPU usage (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage (%)
(node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_avail_bytes) /
node_filesystem_size_bytes * 100

# Outbound network throughput in Mbps (bytes/s * 8 / 10^6), the unit used by network_out_mbps
rate(node_network_transmit_bytes_total{device="eth0"}[5m]) * 8 / 1000 / 1000
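
To make the CPU query less opaque: the idle counter grows by up to one second per second per core, so its rate is the fraction of time a core was idle. The Python sketch below reproduces that arithmetic for one instance under simplifying assumptions (no counter resets, exactly two samples per core); the function name is hypothetical.

```python
def cpu_usage_percent(idle_start: list[float], idle_end: list[float], window_s: float) -> float:
    """CPU usage for one instance from per-core idle-second counters.

    Mirrors 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[w])) * 100:
    each core's idle rate is the fraction of the window it spent idle.
    """
    idle_rates = [(end - start) / window_s for start, end in zip(idle_start, idle_end)]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100
```

For two cores that were idle for 150 s and 270 s of a 300 s window, the result is 30% usage.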

4. Security Metrics

Access and Authentication

Metric                        | Normal Range  | Alert
------------------------------|---------------|--------
failed_login_attempts_per_min | < 5           | > 20
ssh_login_failures            | < 3/10min     | > 10/10min
sudo_usage_anomaly            | historical    | +50% deviation
unauthorized_api_calls        | < 1/min       | > 10/min

Data Integrity

Metric                       | Action
-----------------------------|------
database_checksum_errors     | Alert + investigate immediately
file_integrity_changes       | Log + weekly review
unauthorized_user_creation   | Critical alert
privilege_escalation_attempt | Critical alert + forensic log

Example: Anomaly Detection

# Look for unauthorized changes to critical files
aide --check --config=/etc/aide/aide.conf

# Audit permission changes
auditctl -l
ausearch -k privileged | tail -100

# Monitor login attempts
faillog -a   # Summary for all users
lastb -n 10  # Last 10 failed login attempts
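
The `sudo_usage_anomaly` row compares current usage with a historical value and alerts at +50%. The document does not say how the baseline is computed, so the Python sketch below assumes a plain mean of past counts; the function name and the sample numbers are hypothetical.

```python
from statistics import mean


def deviates_from_baseline(history: list[float], current: float, max_increase: float = 0.5) -> bool:
    """True when `current` exceeds the historical mean by more than max_increase (0.5 = +50%)."""
    baseline = mean(history)
    if baseline == 0:
        # No past activity: any usage at all counts as a deviation.
        return current > 0
    return (current - baseline) / baseline > max_increase
```

With a history of 10, 12, and 8 sudo invocations per day (mean 10), a day with 16 invocations is flagged, while a day with 14 is not.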

Configured Alerts

Alerting System

File: /etc/prometheus/alert_rules.yml

P1 Alerts (CRITICAL - 1 minute)

- alert: ServiceDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} DOWN"
    action: "Verify immediately, run the restart runbook"

- alert: DiskFull
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Disk at 95% on {{ $labels.instance }}"
    action: "Run the storage scale-up procedure"

- alert: HighErrorRate
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error rate > 5% on {{ $labels.job }}"
    action: "Investigate logs; possible DDoS or application failure"

- alert: DatabaseDown
  expr: pg_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "PostgreSQL DOWN on {{ $labels.instance }}"
    action: "Run the failover or restart runbook"

- alert: ReplicationLagCritical
  expr: (pg_replication_lag_bytes / 1024 / 1024) > 1024  # 1 GiB
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Replication lag {{ $value }} MiB"
    action: "Check the network, increase bandwidth, or promote a replica"
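
Each rule above carries a `for:` clause: the expression must stay true for that long before the alert fires, and in the meantime the alert is pending. The Python sketch below models that state machine over a series of rule evaluations. It is a simplified illustration with hypothetical names, not Prometheus code, and it expresses the `for:` duration as a number of evaluation intervals.

```python
def alert_states(condition: list[bool], for_steps: int) -> list[str]:
    """State per evaluation: 'inactive', 'pending', or 'firing'.

    `condition` holds the result of the alert expression at each evaluation;
    `for_steps` is the `for:` duration divided by the evaluation interval.
    """
    states = []
    since = None  # index of the evaluation where the condition became true
    for i, ok in enumerate(condition):
        if not ok:
            # Any false evaluation resets the timer.
            since = None
            states.append("inactive")
            continue
        if since is None:
            since = i
        states.append("firing" if i - since >= for_steps else "pending")
    return states
```

With a 15-second evaluation interval, `for: 1m` corresponds to `for_steps=4`, so the alert fires on the fifth consecutive true evaluation. A single false evaluation in between restarts the wait.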

P2 Alerts (HIGH - 5 minutes)

- alert: HighCPU
  expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "CPU > 80% on {{ $labels.instance }}"
    action: "Investigate processes, consider scaling"

- alert: HighMemory
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Memory > 85% on {{ $labels.instance }}"
    action: "Investigate memory leaks, add RAM"

- alert: SlowQueries
  expr: histogram_quantile(0.95, rate(pg_slow_queries_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Slow queries, p95 {{ $value }}s"
    action: "Analyze with EXPLAIN ANALYZE, optimize indexes"

- alert: HighConnectionCount
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "DB connections at {{ $value | humanizePercentage }}"
    action: "Investigate connection leaks, raise max_connections"

- alert: DDoSDetected
  expr: sum by (job) (rate(http_requests_total[5m])) > 10000
  for: 2m
  labels:
    severity: high
  annotations:
    summary: "Possible DDoS: {{ $value }} req/sec"
    action: "Enable rate limiting, contact the ISP"

P3 Alerts (MEDIUM - 15 minutes)

- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
  for: 15m
  labels:
    severity: medium
  annotations:
    summary: "Free disk space < 15% on {{ $labels.instance }}"
    action: "Schedule a storage scale-up"

- alert: BackupFailed
  expr: time() - backup_last_successful_timestamp > 86400  # 24h
  for: 1h
  labels:
    severity: medium
  annotations:
    summary: "No successful backup in > 24h"
    action: "Check the backup process and its logs"

- alert: CertificateExpiring
  expr: (ssl_cert_expire_timestamp - time()) / 86400 < 30
  labels:
    severity: medium
  annotations:
    summary: "SSL cert expires in {{ $value }} days"
    action: "Renew the certificate"
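
The last two rules are plain timestamp arithmetic: both compare a Unix timestamp with the current time in units of 86400 seconds (one day). The Python sketch below restates the two expressions so they can be checked by hand; the function names are hypothetical and `now` can be injected for testing.

```python
import time
from typing import Optional

DAY_S = 86400  # seconds per day, as used in both alert expressions


def backup_overdue(last_success_ts: float, now: Optional[float] = None) -> bool:
    """Mirror of: time() - backup_last_successful_timestamp > 86400."""
    now = time.time() if now is None else now
    return now - last_success_ts > DAY_S


def cert_days_left(expire_ts: float, now: Optional[float] = None) -> float:
    """Mirror of: (ssl_cert_expire_timestamp - time()) / 86400."""
    now = time.time() if now is None else now
    return (expire_ts - now) / DAY_S
```

A backup that finished 25 hours ago is overdue, and a certificate that expires in 10 days yields `10.0`, which is below the 30-day threshold.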

Available Dashboards

Dashboard 1: System Overview

URL: http://grafana.skynet.ttzr/d/overview

Panels:

  • Service status (green/red)
  • Request rate (last 24h)
  • Error rate (trending)
  • CPU/Memoria (gauges)
  • Disk space (stacked)
  • Network I/O (dual axis)
  • Database connections
  • Top errors (tabla)

Dashboard 2: Application Performance

URL: http://grafana.skynet.ttzr/d/app-performance

Panels:

  • Response time p50/p95/p99
  • Throughput (requests/sec)
  • Error rate by endpoint
  • Hot endpoints (top 10)
  • Request distribution (by method)
  • Slowest endpoints
  • Error distribution by type
  • Request size distribution

Dashboard 3: Database Performance

URL: http://grafana.skynet.ttzr/d/db-performance

Panels:

  • Transactions/sec
  • Query execution time percentiles
  • Slow queries (top 10)
  • Table sizes
  • Index usage
  • Cache hit ratio
  • Replication lag
  • Active connections
  • Lock contention
  • Autovacuum status

Dashboard 4: Infrastructure and Hardware

URL: http://grafana.skynet.ttzr/d/infrastructure

Panels:

  • CPU utilization (per core)
  • Memory breakdown (used/buffers/cache)
  • Load average
  • Disk I/O (read/write MB/s)
  • Disk space by mount point
  • Network throughput (in/out)
  • Network errors/dropped packets
  • TCP connection states
  • Inode usage

Dashboard 5: Security and Logs

URL: http://grafana.skynet.ttzr/d/security

Panels:

  • Failed login attempts
  • SSH connection attempts
  • Firewall blocks (top IPs)
  • Privilege escalation attempts
  • Unauthorized API calls
  • File integrity changes
  • Process anomalies
  • Unusual network connections

Important Logs

Log Locations

/var/log/skynet/
├── core.log              # Main application logs
├── api.log               # API server logs
├── database.log          # DB connection logs
└── error.log             # Consolidated errors

/var/log/
├── auth.log              # Logins, sudo
├── syslog                # System and kernel messages
├── fail2ban.log          # Failed access attempts
└── postgresql/
    └── postgresql.log    # PostgreSQL logs

/var/lib/postgresql/
└── pg_log/               # Detailed DB logs

Log Analysis Queries

ELK / Kibana

# Top 10 endpoints by error count
{
  "aggs": {
    "endpoints": {
      "terms": {
        "field": "endpoint.keyword",
        "size": 10,
        "order": {"errors": "desc"}
      },
      "aggs": {
        "errors": {
          "filter": {"range": {"status": {"gte": 500}}}
        }
      }
    }
  }
}

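# Response time trend
# (fixed_interval replaces the interval parameter, which was deprecated in Elasticsearch 7.x)
{
  "aggs": {
    "time_trend": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1m"
      },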
      "aggs": {
        "p95_latency": {
          "percentiles": {
            "field": "response_time_ms",
            "percents": [95]
          }
        }
      }
    }
  }
}

# Failed logins by IP
{
  "query": {
    "term": {"event": "failed_login"}
  },
  "aggs": {
    "by_ip": {
      "terms": {"field": "source_ip", "size": 20}
    }
  }
}
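
To show what the first query computes, the Python sketch below performs a comparable aggregation in memory: it counts documents with `status >= 500` per endpoint and returns the highest counts first. It is an illustration only. The field names `endpoint` and `status` come from the query, while the function name and sample documents are hypothetical. Unlike the Elasticsearch `terms` aggregation, it omits endpoints with zero errors.

```python
from collections import Counter


def top_error_endpoints(docs: list[dict], size: int = 10) -> list[tuple[str, int]]:
    """Count documents with status >= 500 per endpoint, highest count first."""
    errors = Counter(d["endpoint"] for d in docs if d["status"] >= 500)
    return errors.most_common(size)


# Hypothetical sample documents shaped like the indexed API log entries.
sample_docs = [
    {"endpoint": "/api/v1/data", "status": 500},
    {"endpoint": "/api/v1/data", "status": 503},
    {"endpoint": "/login", "status": 502},
    {"endpoint": "/login", "status": 200},
]
```

Calling `top_error_endpoints(sample_docs)` ranks `/api/v1/data` (2 errors) ahead of `/login` (1 error).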

Useful Analysis Commands

# Top 10 error types (scans the whole file, no time filter)
grep ERROR /var/log/skynet/error.log | grep -o "ERROR [^:]*" | \
  sort | uniq -c | sort -rn | head -10

# Slow endpoints
grep "response_time" /var/log/skynet/api.log | \
  awk '{print $5, $1}' | sort -n | tail -10

# Failed access attempts by source IP
grep "Failed" /var/log/auth.log | \
  grep -o "from [0-9.]*" | sort | uniq -c | sort -rn

# Errors per time bucket (assumes the first field is an HH:MM:SS-style
# timestamp; -f1-2 groups by minute, -f1 groups by hour)
grep ERROR /var/log/skynet/error.log | \
  cut -d' ' -f1 | cut -d: -f1-2 | uniq -c
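
When the shell pipeline for failed access attempts needs to run inside a script, a comparable count can be done in Python. The sketch below assumes sshd-style lines of the form `Failed password for ... from <IPv4> port ...`; the regular expression, function name, and sample lines are illustrative assumptions, and other log formats would need a different pattern.

```python
import re
from collections import Counter

# Assumed line shape: "... Failed password for <user> from <IPv4> port <n> ssh2"
FAILED_FROM = re.compile(r"Failed .* from (\d{1,3}(?:\.\d{1,3}){3})")


def failed_logins_by_ip(lines) -> list[tuple[str, int]]:
    """Count failed-login lines per source IP, most frequent first."""
    counts = Counter()
    for line in lines:
        match = FAILED_FROM.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()


# Hypothetical sample lines (documentation-range IP addresses).
sample_lines = [
    "Jan  1 10:00:01 host sshd[123]: Failed password for root from 203.0.113.5 port 22 ssh2",
    "Jan  1 10:00:03 host sshd[123]: Failed password for invalid user admin from 203.0.113.5 port 22 ssh2",
    "Jan  1 10:00:09 host sshd[124]: Failed password for root from 198.51.100.7 port 4242 ssh2",
    "Jan  1 10:00:12 host sshd[125]: Accepted publickey for deploy from 192.0.2.10 port 5000 ssh2",
]
```

To process the real file, pass an open handle instead of the sample list, for example `failed_logins_by_ip(open("/var/log/auth.log"))`.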

Instrumentation and Exporters

Prometheus Node Exporter

# Check that it is running
systemctl status prometheus-node-exporter

# Available metrics
curl -s http://localhost:9100/metrics | head -50

# Configuration
cat /etc/prometheus/node_exporter_args
# Typically: --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

Custom Exporter (skynet-exporter)

# Custom Skynet metrics
curl -s http://localhost:9200/metrics | grep skynet_

# Examples:
# skynet_requests_total{endpoint="/api/v1/data"}
# skynet_response_time_seconds_bucket{endpoint="/api/v1/data", le="0.1"}
# skynet_database_connections{state="active"}
# skynet_cache_hit_ratio

PostgreSQL Exporter

# Check the PostgreSQL exporter
curl -s http://localhost:9187/metrics | grep pg_

# Main metrics:
# pg_stat_database_tup_fetched
# pg_stat_database_tup_inserted
# pg_stat_database_tup_updated
# pg_stat_database_tup_deleted
# pg_up (1 if the DB is up, 0 if down)
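
The exporter endpoints return plain text in the Prometheus exposition format, one sample per line. For scripted health checks, the Python sketch below parses such text into a dictionary. It is a deliberately simplified parser with a hypothetical name: it skips comment lines and ignores optional timestamps. The sample payload is invented for illustration.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified sketch: skips blank and comment lines and ignores
    optional trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labeled series: everything up to the closing brace is the key.
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            series, _, rest = line.partition(" ")
        samples[series] = float(rest.split()[0])
    return samples


# Invented payload resembling exporter output.
sample_payload = """\
# HELP pg_up Whether the database is reachable
# TYPE pg_up gauge
pg_up 1
skynet_cache_hit_ratio 0.93
skynet_database_connections{state="active"} 12
"""
```

A check such as `parse_metrics(sample_payload)["pg_up"] == 1` then tells a script whether the exporter sees the database as up.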

Daily Monitoring Checklist

  • Review the status of all dashboards
  • Verify there are no unresolved P1 alerts
  • Analyze performance trends (getting worse?)
  • Verify the backup completed successfully
  • Review the access log for anomalies
  • Confirm replication is up to date
  • Check disk usage (< 85%)
  • Review SSL certificates (> 30 days until expiry)
  • Document an incident report if there was a P1/P2
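
Some checklist items lend themselves to a quick script. As one example, the Python sketch below checks the disk usage item against the 85% limit using only the standard library. The function names are hypothetical. `shutil.disk_usage` reports usage for the filesystem containing the given path, so the figure may differ slightly from what `df` shows.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Used space of the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def disk_ok(percent: float, limit: float = 85.0) -> bool:
    """True when usage is below the checklist limit (85% by default)."""
    return percent < limit
```

A daily job could call `disk_ok(disk_usage_percent("/"))` for each mount point of interest and report any that return `False`.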