# Monitoring and Observability - Skynet v8

## Monitoring Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Skynet v8 Services                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ skynet-core  │  │ skynet-api   │  │ skynet-db    │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│         │                  │                  │             │
│         └──────────────────┼──────────────────┘             │
│                            │                                │
│         ┌──────────────────┴──────────────────┐             │
│         ▼                  ▼                  ▼             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │           Exporters / Collectors                    │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌────────────┐ │    │
│  │  │ Prometheus   │  │ Node Expt    │  │ Filebeat   │ │    │
│  │  │ Exporter     │  │ (system)     │  │ (logs)     │ │    │
│  │  └──────────────┘  └──────────────┘  └────────────┘ │    │
│  └─────────────────────────────────────────────────────┘    │
└──────────────────────┬──────────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
   ┌─────────┐   ┌──────────┐   ┌──────────────┐
   │ Grafana │   │   ELK    │   │  Alert Mgr   │
   │  (Viz)  │   │  (Logs)  │   │   (Alerts)   │
   └─────────┘   └──────────┘   └──────────────┘
        │              │              │
        └──────────────┼──────────────┘
                       ▼
              ┌─────────────────┐
              │  Notification   │
              │    Channels     │
              │  (Slack/Email)  │
              └─────────────────┘
```

---

## Metrics to Monitor

### 1. Application Metrics (skynet-core)

#### Processing and Throughput
```
Metric                     | Normal Range  | Warning     | Critical
---------------------------|---------------|-------------|----------------
requests_per_second        | 100-500       | > 750       | > 1000
response_time_p50          | < 100ms       | > 200ms     | > 500ms
response_time_p95          | < 500ms       | > 1000ms    | > 2000ms
response_time_p99          | < 1000ms      | > 2000ms    | > 5000ms
error_rate                 | < 0.1%        | > 0.5%      | > 1%
success_rate               | > 99.9%       | < 99.5%     | < 99%
```

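The thresholds in the table above can be encoded so that a script or dashboard helper reports the same severity the table implies. The following is a minimal sketch, not part of the Skynet tooling: `Threshold` and `classify` are hypothetical names, and the values are copied from the table (latencies in milliseconds, rates in percent).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float
    critical: float
    higher_is_worse: bool = True


# Copied from the table above; names and structure are illustrative only.
THRESHOLDS = {
    "requests_per_second": Threshold(750, 1000),
    "response_time_p50": Threshold(200, 500),
    "response_time_p95": Threshold(1000, 2000),
    "response_time_p99": Threshold(2000, 5000),
    "error_rate": Threshold(0.5, 1.0),
    "success_rate": Threshold(99.5, 99.0, higher_is_worse=False),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning' or 'critical' for a metric value."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warn:
            return "warning"
    else:
        if value < t.critical:
            return "critical"
        if value < t.warn:
            return "warning"
    return "ok"
```

Values that sit between the normal range and the warning threshold (for example a p50 of 150 ms) are reported as `ok` here, because the table defines no alert for them.
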
#### Prometheus Examples
```
# Request rate (requests/sec)
rate(http_requests_total[5m])

# Response time percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate (fraction of requests answered with 5xx)
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))

# Active connections
skynet_active_connections
```

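To make the percentile queries above less opaque, here is a small sketch of the idea behind `histogram_quantile`: find the bucket that contains the requested rank and interpolate linearly inside it. It is a simplified model written for illustration, not Prometheus's actual code, and the bucket counts in `EXAMPLE_BUCKETS` are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs; the last bucket is
    expected to have upper_bound = +Inf, as Prometheus histograms do.
    """
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Invented sample: 100 requests, bucket bounds in seconds.
EXAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, EXAMPLE_BUCKETS)
```

One practical consequence: the estimate can never be more precise than the bucket boundaries, so buckets should be chosen around the thresholds that matter (100 ms, 500 ms, 1 s and so on).
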
### 2. Database Metrics (skynet-database)

#### Performance
```
Metric                       | Normal Range  | Warning | Critical
-----------------------------|---------------|---------|----------
query_execution_time_p95     | < 100ms       | > 500ms | > 2000ms
transactions_per_second      | 100-1000      | > 2000  | > 5000
active_connections           | 10-50         | > 100   | > 150
connection_utilization_pct   | < 50%         | > 80%   | > 95%
replication_lag_bytes        | < 1MB         | > 100MB | > 1GB
```

#### Prometheus Examples
```
# Query latency
histogram_quantile(0.95, rate(pg_slow_queries_duration_seconds_bucket[5m]))

# Transactions/sec
rate(pg_transactions_total[1m])

# Active connections
pg_stat_activity_count{state="active"}

# Replication lag
pg_replication_lag_bytes / 1024 / 1024  # MiB

# Connection utilization (%)
sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections * 100
```

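Any of the expressions above can also be evaluated from a script through the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes Prometheus listens on `http://localhost:9090`, which this document does not state; `instant_query_url` and `first_value` are hypothetical helper names. The network call itself is left out so the snippet has no side effects.

```python
from urllib.parse import urlencode

# Assumption: address of the Prometheus server (not given in this document).
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for an instant query against the HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def first_value(payload: dict) -> float:
    """Extract the first sample value from an instant-query JSON response."""
    result = payload["data"]["result"]
    if not result:
        raise LookupError("query returned no series")
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


# Shape of a successful instant-query response (values are invented).
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "db1"}, "value": [1700000000.0, "12.5"]}
        ],
    },
}
lag_mib = first_value(SAMPLE_RESPONSE)
```

Fetching the URL with `urllib.request.urlopen` and passing the decoded JSON to `first_value` would complete the round trip.
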
### 3. System Metrics (Node Exporter)

#### CPU and Memory
```
Metric                     | Normal Range  | Warning     | Critical
---------------------------|---------------|-------------|----------
cpu_usage_percent          | < 70%         | > 80%       | > 95%
load_average_1min          | < cores       | > cores*1.5 | > cores*2
memory_usage_percent       | < 80%         | > 85%       | > 95%
swap_usage_percent         | < 10%         | > 30%       | > 50%
```

#### Disk and I/O
```
Metric                     | Normal Range  | Warning     | Critical
---------------------------|---------------|-------------|----------
disk_usage_percent         | < 70%         | > 80%       | > 95%
disk_io_read_mb_sec        | variable      | > 500MB/s   | > 1GB/s
disk_io_write_mb_sec       | variable      | > 500MB/s   | > 1GB/s
inode_usage_percent        | < 70%         | > 80%       | > 95%
```

#### Network
```
Metric                     | Normal Range  | Warning     | Critical
---------------------------|---------------|-------------|----------
network_in_mbps            | variable      | > 900Mbps   | > 980Mbps
network_out_mbps           | variable      | > 900Mbps   | > 980Mbps
packet_loss_percent        | < 0.1%        | > 0.5%      | > 1%
tcp_connections            | < 10000       | > 20000     | > 30000
tcp_time_wait              | < 5000        | > 10000     | > 20000
```

#### Prometheus Examples
```
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage
(node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_avail_bytes) /
node_filesystem_size_bytes * 100

# Network throughput
rate(node_network_transmit_bytes_total{device="eth0"}[5m]) / 1024 / 1024  # MiB/s (multiply by 8 for Mbit/s)
```

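The CPU and memory expressions above are plain arithmetic over counters and gauges. As a worked illustration (hypothetical function names, invented sample numbers), the same calculation can be written out as follows: the per-core rate of idle seconds is averaged and subtracted from 100.

```python
def cpu_usage_percent(idle_samples, window_seconds):
    """CPU usage over a window, from per-core idle-time counters.

    idle_samples: one (idle_seconds_start, idle_seconds_end) pair per core,
    mirroring node_cpu_seconds_total{mode="idle"} at both ends of the window.
    """
    if not idle_samples or window_seconds <= 0:
        raise ValueError("need at least one core and a positive window")
    # rate(): idle seconds gained per second of wall-clock time, per core.
    idle_rates = [(end - start) / window_seconds for start, end in idle_samples]
    # avg by (instance): average the per-core idle fractions.
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Same formula as the MemAvailable / MemTotal expression above."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


# Two cores over a 5-minute window: 240 s and 180 s of idle time.
usage = cpu_usage_percent([(1000.0, 1240.0), (2000.0, 2180.0)], 300)
```

With these sample numbers the cores were idle 80% and 60% of the time, so the averaged usage comes out at about 30%.
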
### 4. Security Metrics

#### Access and Authentication
```
Metric                        | Normal Range        | Alert
------------------------------|---------------------|----------------
failed_login_attempts_per_min | < 5                 | > 20
ssh_login_failures            | < 3/10min           | > 10/10min
sudo_usage_anomaly            | historical baseline | +50% deviation
unauthorized_api_calls        | < 1/min             | > 10/min
```

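Rate-style thresholds such as `failed_login_attempts_per_min > 20` are usually evaluated over a sliding window. A minimal sketch of that idea, with a hypothetical `FailedLoginMonitor` class and the threshold taken from the table above:

```python
from collections import deque


class FailedLoginMonitor:
    """Counts failed logins inside a sliding time window."""

    def __init__(self, threshold: int = 20, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()

    def record(self, timestamp: float) -> bool:
        """Register one failure; return True if the alert threshold is exceeded."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold
```

Timestamps are expected in non-decreasing order, which holds when the events are read sequentially from a log such as `auth.log`.
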
#### Data Integrity
```
Metric                       | Action
-----------------------------|------
database_checksum_errors     | Alert + investigate immediately
file_integrity_changes       | Log + weekly review
unauthorized_user_creation   | Critical alert
privilege_escalation_attempt | Critical alert + forensic log
```

#### Example: Anomaly Detection
```bash
# Look for unauthorized changes to critical files
aide --check --config=/etc/aide/aide.conf

# Audit permission changes
auditctl -l
ausearch -k privileged | tail -100

# Monitor login attempts
faillog -a      # Summary for all users
lastb -n 10     # Last 10 failed login attempts
```

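Conceptually, the file-integrity check above compares current file digests against a stored baseline. The sketch below shows only that idea using the standard library. It is not how AIDE is implemented, and `build_baseline` and `changed_files` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_baseline(paths) -> dict:
    """Record the current digest of every monitored file."""
    return {str(p): sha256_of(p) for p in paths}


def changed_files(baseline: dict) -> list:
    """Return monitored files that were modified or removed since the baseline."""
    changed = []
    for path, digest in baseline.items():
        if not Path(path).exists() or sha256_of(path) != digest:
            changed.append(path)
    return sorted(changed)
```

A real deployment also has to protect the baseline itself from tampering, which is one reason to rely on a dedicated tool rather than a script like this.
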
---

## Configured Alerts

### Alerting System

File: `/etc/prometheus/alert_rules.yml`

#### P1 Alerts (CRITICAL - 1 minute)
```yaml
- alert: ServiceDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} DOWN"
    action: "Check immediately and run the restart runbook"

- alert: DiskFull
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Disk at 95% on {{ $labels.instance }}"
    action: "Run the storage scale-up procedure"

- alert: HighErrorRate
  # Ratio of 5xx responses to all responses, per job
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error rate > 5% on {{ $labels.job }}"
    action: "Investigate logs; possible DDoS or application failure"

- alert: DatabaseDown
  expr: pg_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "PostgreSQL DOWN on {{ $labels.instance }}"
    action: "Run the failover or restart runbook"

- alert: ReplicationLagCritical
  expr: (pg_replication_lag_bytes / 1024 / 1024) > 1000  # ~1 GB (value is in MiB)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Replication lag {{ $value }}MB"
    action: "Check the network, increase bandwidth, or promote the replica"
```

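The `for:` clause in the rules above means the expression must stay true for that long before the alert fires; until then the alert is only pending. A small sketch of that state machine, written for illustration (the `AlertRule` class is hypothetical and does not mirror Prometheus internals):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    for_seconds: float
    active_since: Optional[float] = None

    def evaluate(self, condition_true: bool, now: float) -> str:
        """Return 'inactive', 'pending' or 'firing' for one evaluation cycle."""
        if not condition_true:
            # Any false evaluation resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


rule = AlertRule(name="ServiceDown", for_seconds=60)
states = [
    rule.evaluate(True, now=0),    # condition first seen
    rule.evaluate(True, now=30),   # still within the 1m window
    rule.evaluate(True, now=60),   # held for 1m
    rule.evaluate(False, now=75),  # recovered, timer resets
    rule.evaluate(True, now=90),   # starts over
]
```

This is why a short `for:` suits `ServiceDown`, while flappier signals such as disk or connection usage use longer windows.
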
#### P2 Alerts (HIGH - 5 minutes)
```yaml
- alert: HighCPU
  expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "CPU > 80% on {{ $labels.instance }}"
    action: "Investigate processes, consider scaling"

- alert: HighMemory
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Memory > 85% on {{ $labels.instance }}"
    action: "Investigate memory leaks, add RAM"

- alert: SlowQueries
  expr: histogram_quantile(0.95, rate(pg_slow_queries_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Slow queries, p95 {{ $value }}s"
    action: "Analyze with EXPLAIN ANALYZE, optimize indexes"

- alert: HighConnectionCount
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "DB connections at {{ $value | humanizePercentage }}"
    action: "Investigate connection leaks, raise max_connections"

- alert: DDoSDetected
  expr: sum(rate(http_requests_total[5m])) > 10000
  for: 2m
  labels:
    severity: high
  annotations:
    summary: "Possible DDoS: {{ $value }} req/sec"
    action: "Enable rate limiting, contact the ISP"
```

#### P3 Alerts (MEDIUM - 15 minutes)
```yaml
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
  for: 15m
  labels:
    severity: medium
  annotations:
    summary: "Less than 15% disk space free on {{ $labels.instance }}"
    action: "Schedule a storage scale-up"

- alert: BackupFailed
  expr: time() - backup_last_successful_timestamp > 86400  # 24h
  for: 1h
  labels:
    severity: medium
  annotations:
    summary: "No successful backup in > 24h"
    action: "Check the backup process and its logs"

- alert: CertificateExpiring
  expr: (ssl_cert_expire_timestamp - time()) / 86400 < 30
  labels:
    severity: medium
  annotations:
    summary: "SSL cert expires in {{ $value }} days"
    action: "Renew the certificate"
```

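The `BackupFailed` and `CertificateExpiring` expressions above are timestamp arithmetic: seconds since the last success, and seconds until expiry divided by 86400 to get days. Written out as a sketch with hypothetical function names:

```python
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(expire_timestamp: float, now: float = None) -> float:
    """(ssl_cert_expire_timestamp - time()) / 86400"""
    now = time.time() if now is None else now
    return (expire_timestamp - now) / SECONDS_PER_DAY


def backup_is_stale(last_success_timestamp: float, now: float = None,
                    max_age_seconds: float = SECONDS_PER_DAY) -> bool:
    """time() - backup_last_successful_timestamp > 86400"""
    now = time.time() if now is None else now
    return now - last_success_timestamp > max_age_seconds
```

Passing `now` explicitly keeps the functions deterministic, which makes them easy to check against known timestamps.
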
---

## Available Dashboards

### Dashboard 1: System Overview

**URL**: `http://grafana.skynet.ttzr/d/overview`

**Panels**:
- Service status (green/red)
- Request rate (last 24h)
- Error rate (trending)
- CPU/Memory (gauges)
- Disk space (stacked)
- Network I/O (dual axis)
- Database connections
- Top errors (table)

### Dashboard 2: Application Performance

**URL**: `http://grafana.skynet.ttzr/d/app-performance`

**Panels**:
- Response time p50/p95/p99
- Throughput (requests/sec)
- Error rate by endpoint
- Hot endpoints (top 10)
- Request distribution (by method)
- Slowest endpoints
- Error distribution by type
- Request size distribution

### Dashboard 3: Database Performance

**URL**: `http://grafana.skynet.ttzr/d/db-performance`

**Panels**:
- Transactions/sec
- Query execution time percentiles
- Slow queries (top 10)
- Table sizes
- Index usage
- Cache hit ratio
- Replication lag
- Active connections
- Lock contention
- Autovacuum status

### Dashboard 4: Infrastructure and Hardware

**URL**: `http://grafana.skynet.ttzr/d/infrastructure`

**Panels**:
- CPU utilization (per core)
- Memory breakdown (used/buffers/cache)
- Load average
- Disk I/O (read/write MB/s)
- Disk space by mount point
- Network throughput (in/out)
- Network errors/dropped packets
- TCP connection states
- Inode usage

### Dashboard 5: Security and Logs

**URL**: `http://grafana.skynet.ttzr/d/security`

**Panels**:
- Failed login attempts
- SSH connection attempts
- Firewall blocks (top IPs)
- Privilege escalation attempts
- Unauthorized API calls
- File integrity changes
- Process anomalies
- Unusual network connections

---

## Important Logs

### Log Locations

```
/var/log/skynet/
├── core.log           # Main application logs
├── api.log            # API server logs
├── database.log       # DB connection logs
└── error.log          # Consolidated errors

/var/log/
├── auth.log           # Logins, sudo
├── syslog             # System and kernel messages
├── fail2ban.log       # Failed access attempts
└── postgresql/
    └── postgresql.log # PostgreSQL logs

/var/lib/postgresql/
└── pg_log/            # Detailed DB logs
```

### Log Analysis Queries

#### ELK / Kibana

```json
# Top 10 endpoints by error count
{
  "aggs": {
    "endpoints": {
      "terms": {
        "field": "endpoint.keyword",
        "size": 10,
        "order": {"errors": "desc"}
      },
      "aggs": {
        "errors": {
          "filter": {"range": {"status": {"gte": 500}}}
        }
      }
    }
  }
}

# Response time trend
# (use "interval" instead of "fixed_interval" on Elasticsearch < 7.2)
{
  "aggs": {
    "time_trend": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1m"
      },
      "aggs": {
        "p95_latency": {
          "percentiles": {
            "field": "response_time_ms",
            "percents": [95]
          }
        }
      }
    }
  }
}

# Failed logins by IP
{
  "query": {
    "term": {"event": "failed_login"}
  },
  "aggs": {
    "by_ip": {
      "terms": {"field": "source_ip", "size": 20}
    }
  }
}
```

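When these queries are run from a script rather than from Kibana, the interesting part is reading the aggregation buckets out of the response. A minimal sketch for the first query follows; `top_error_endpoints` is a hypothetical helper and `SAMPLE_RESPONSE` contains invented values that only show the shape Elasticsearch returns for a `terms` aggregation with a `filter` sub-aggregation.

```python
def top_error_endpoints(response: dict) -> list:
    """Return (endpoint, error_count) pairs from the 'endpoints' aggregation."""
    buckets = response["aggregations"]["endpoints"]["buckets"]
    # Each terms bucket carries its key plus the nested 'errors' filter count.
    return [(bucket["key"], bucket["errors"]["doc_count"]) for bucket in buckets]


# Invented response fragment, for illustration only.
SAMPLE_RESPONSE = {
    "aggregations": {
        "endpoints": {
            "buckets": [
                {"key": "/api/v1/data", "doc_count": 1200, "errors": {"doc_count": 37}},
                {"key": "/api/v1/status", "doc_count": 800, "errors": {"doc_count": 4}},
            ]
        }
    }
}
ranking = top_error_endpoints(SAMPLE_RESPONSE)
```

Adding `"size": 0` to the request body is a common way to skip the document hits when only the aggregations are needed.
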
### Useful Analysis Commands

```bash
# Top 10 error types in the log
grep ERROR /var/log/skynet/error.log | grep -o "ERROR [^:]*" | \
  sort | uniq -c | sort -rn | head -10

# Slowest endpoints (assumes the response time is field 5 of each line)
grep "response_time" /var/log/skynet/api.log | \
  awk '{print $5, $1}' | sort -n | tail -10

# Failed access attempts by source IP
grep "Failed" /var/log/auth.log | \
  grep -o "from [0-9.]*" | sort | uniq -c | sort -rn

# Errors per hour (assumes the first field is a timestamp such as 2024-01-10T10:23:45)
grep ERROR /var/log/skynet/error.log | \
  cut -d' ' -f1 | cut -d: -f1 | uniq -c
```

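The failed-access pipeline above can also be expressed in Python when the result needs further processing. This sketch assumes the usual OpenSSH message format ("Failed password for ... from <IP> port ..."). The function name is hypothetical and only IPv4 addresses are matched.

```python
import re
from collections import Counter

# Matches e.g. "Failed password for invalid user admin from 203.0.113.5 port 22 ssh2"
FAILED_RE = re.compile(r"Failed \S+ for .* from (\d{1,3}(?:\.\d{1,3}){3})")


def failed_logins_by_ip(lines) -> list:
    """Return (ip, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()
```

Typical usage would be `failed_logins_by_ip(open("/var/log/auth.log", errors="replace"))`, which streams the file line by line instead of loading it whole.
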
---

## Instrumentation and Exporters

### Prometheus Node Exporter
```bash
# Verify that it is running
systemctl status prometheus-node-exporter

# Available metrics
curl http://localhost:9100/metrics | head -50

# Configuration
cat /etc/prometheus/node_exporter_args
# Typically: --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
```

### Custom Exporter (skynet-exporter)
```bash
# Custom Skynet metrics
curl http://localhost:9200/metrics | grep skynet_

# Examples:
# skynet_requests_total{endpoint="/api/v1/data"}
# skynet_response_time_seconds_bucket{endpoint="/api/v1/data", le="0.1"}
# skynet_database_connections{state="active"}
# skynet_cache_hit_ratio
```

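The `/metrics` output shown above uses the Prometheus text exposition format: one sample per line, optional labels in braces, and `#` lines for HELP/TYPE metadata. A simplified parser sketch (hypothetical `parse_metrics`; it handles the common cases only, not every corner of the format) shows how the `skynet_` samples could be pulled out in a script:

```python
import re

# name, optional {labels}, value, optional integer timestamp
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str, prefix: str = "skynet_") -> list:
    """Return (name, labels, value) tuples for samples whose name starts with prefix."""
    samples = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        samples.append((match.group(1), labels, float(match.group(3))))
    return samples
```

For anything beyond quick checks, an existing client library parser is a safer choice than a hand-written regular expression.
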
### PostgreSQL Exporter
```bash
# Check the PostgreSQL exporter
curl http://localhost:9187/metrics | grep pg_

# Main metrics:
# pg_stat_database_tup_fetched
# pg_stat_database_tup_inserted
# pg_stat_database_tup_updated
# pg_stat_database_tup_deleted
# pg_up (1 if the DB is up, 0 if down)
```

---

## Daily Monitoring Checklist

- [ ] Review the status of all dashboards
- [ ] Verify there are no unresolved P1 alerts
- [ ] Analyze performance trends (getting worse?)
- [ ] Verify the backup completed successfully
- [ ] Review access logs for anomalies
- [ ] Confirm replication is up to date
- [ ] Check disk usage (< 85%)
- [ ] Check SSL certificates (> 30 days until expiry)
- [ ] Write an incident report if there was a P1/P2

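Some checklist items lend themselves to a quick script. As one example, the disk usage item could be checked with the standard library as sketched below; the function names are hypothetical and the 85% limit comes from the checklist.

```python
import shutil

DISK_LIMIT_PERCENT = 85.0  # limit from the checklist above


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def disk_within_limit(percent: float, limit: float = DISK_LIMIT_PERCENT) -> bool:
    """True while usage stays strictly below the limit."""
    return percent < limit
```

The figure can differ slightly from what `df` reports, since `df` accounts for blocks reserved for root.
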