docs: Add RunPod Serverless configuration and troubleshooting

- Document working REST API configuration (vs broken GraphQL)
- Add endpoint IDs for GRACE, PENNY, FACTORY
- Include troubleshooting for workers not starting
- Document Docker image rebuild process

New file: `05_OPERACIONES/runpod-serverless.md` (169 lines)

# RunPod Serverless - GPU Services

## Overview

RunPod Serverless provides on-demand GPUs for the GRACE, PENNY, and FACTORY services.

**Account**: rpd@tzr.systems
**Balance**: ~$69 USD
**Max quota**: 5 concurrent workers

---

## Active Endpoints

| Service | Endpoint ID | Max Workers | Modules |
|---------|-------------|-------------|---------|
| **GRACE** | `rfltzijgn1jno4` | 2 | ASR, OCR, TTS, Face, Embeddings, Avatar |
| **PENNY** | `zsu7eah0fo7xt6` | 2 | TTS (voice) |
| **FACTORY** | `hffu4q5pywjzng` | 1 | Embeddings, processing |

**Supported GPUs**: RTX 3090, RTX 4090, RTX A4000, NVIDIA L4
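
For the examples below it is convenient to keep the endpoint IDs in shell variables. This is only a helper sketch; the variable names are ours, not part of any existing tooling, and the IDs are copied from the table above.

```bash
# Convenience variables for the endpoints listed above (names are ours).
GRACE_ENDPOINT_ID="rfltzijgn1jno4"
PENNY_ENDPOINT_ID="zsu7eah0fo7xt6"
FACTORY_ENDPOINT_ID="hffu4q5pywjzng"

# Base URL pattern used by the serverless job API.
GRACE_API="https://api.runpod.ai/v2/${GRACE_ENDPOINT_ID}"
```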

---

## Critical Configuration

### Problem Solved (2025-12-27)

Serverless workers would not start when using the GraphQL API. The solution was to use the **REST API** with the specific parameters shown below.

### Correct API: REST (not GraphQL)

```bash
# CORRECT - REST API
curl -X POST "https://rest.runpod.io/v1/endpoints" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mi-endpoint",
    "templateId": "TEMPLATE_ID",
    "gpuTypeIds": ["NVIDIA GeForce RTX 3090", "NVIDIA GeForce RTX 4090"],
    "scalerType": "QUEUE_DELAY",
    "scalerValue": 4,
    "workersMin": 0,
    "workersMax": 2,
    "idleTimeout": 60,
    "executionTimeoutMs": 600000,
    "flashboot": true
  }'
```

**Required fields** (a quick payload sanity check follows below):
- `gpuTypeIds`: array of strings (NOT a comma-separated string)
- `scalerType`: "QUEUE_DELAY"
- `scalerValue`: 4 (seconds of queue delay before scaling up)
- `flashboot`: true (fast startup)
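
A minimal way to catch the most common mistake (passing `gpuTypeIds` as a string instead of an array) before calling the API. This is only a sketch: it assumes the request body has been saved to a hypothetical local file `endpoint.json` and that `jq` is installed.

```bash
# Sanity-check the request body before sending it (endpoint.json is a
# hypothetical local file holding the JSON payload shown above).
jq -e '(.gpuTypeIds | type == "array")
       and .scalerType == "QUEUE_DELAY"
       and .flashboot == true' endpoint.json \
  && echo "payload looks OK" \
  || echo "payload is missing or mistyping a required field"
```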

---

## API Usage

### Submit a Job

```bash
curl -X POST "https://api.runpod.ai/v2/{ENDPOINT_ID}/run" \
  -H "Authorization: $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"module": "ASR_ENGINE", "audio_base64": "..."}}'
```

### Check Status

```bash
curl "https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}" \
  -H "Authorization: $RUNPOD_API_KEY"
```
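
Putting the two calls together: a small sketch that submits a job and polls its status until it finishes. The `.id` and `.status` fields and the `COMPLETED`/`FAILED` values follow RunPod's documented serverless responses, but treat them as assumptions to verify; `AUDIO_B64` is a hypothetical variable holding base64-encoded audio.

```bash
# Submit a job to GRACE and poll until it reaches a terminal state.
ENDPOINT_ID="rfltzijgn1jno4"   # GRACE, from the table above
AUDIO_B64="..."                # base64-encoded audio (placeholder)

JOB_ID=$(curl -s -X POST "https://api.runpod.ai/v2/${ENDPOINT_ID}/run" \
  -H "Authorization: $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"input\": {\"module\": \"ASR_ENGINE\", \"audio_base64\": \"${AUDIO_B64}\"}}" \
  | jq -r '.id')

while true; do
  STATUS=$(curl -s "https://api.runpod.ai/v2/${ENDPOINT_ID}/status/${JOB_ID}" \
    -H "Authorization: $RUNPOD_API_KEY" | jq -r '.status')
  echo "job ${JOB_ID}: ${STATUS}"
  if [ "$STATUS" = "COMPLETED" ] || [ "$STATUS" = "FAILED" ]; then
    break
  fi
  sleep 2
done
```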

### Health Check

```bash
curl "https://api.runpod.ai/v2/{ENDPOINT_ID}/health" \
  -H "Authorization: $RUNPOD_API_KEY"
```

Expected response:
```json
{
  "jobs": {"completed": 5, "failed": 0, "inQueue": 0},
  "workers": {"idle": 1, "ready": 1, "running": 0}
}
```
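
Based on the response shape above, a quick scripted check for worker availability might look like this (a sketch; it assumes `jq` and reuses the `ENDPOINT_ID` variable from the previous example).

```bash
# Warn if the endpoint has no idle or ready workers.
curl -s "https://api.runpod.ai/v2/${ENDPOINT_ID}/health" \
  -H "Authorization: $RUNPOD_API_KEY" \
  | jq -e '(.workers.idle + .workers.ready) > 0' >/dev/null \
  && echo "workers available" \
  || echo "no workers available - see Troubleshooting"
```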

---

## GRACE Modules

| Module | Description | Model |
|--------|-------------|-------|
| `ASR_ENGINE` | Speech-to-text | Faster Whisper Large V3 |
| `OCR_CORE` | Text recognition | GOT-OCR 2.0 |
| `TTS` | Text-to-speech | XTTS-v2 |
| `FACE_VECTOR` | Face vectors | InsightFace |
| `EMBEDDINGS` | Text embeddings | BGE-Large |
| `AVATAR_GEN` | Avatar generation | SDXL |
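
All GRACE jobs select their module through the `input.module` field, as in the ASR example above. The sketch below shows the generic payload shape; the module-specific input fields (here a hypothetical `text` field for `EMBEDDINGS`) are assumptions that depend on each handler.

```bash
# Generic payload shape for a GRACE module. Only "module" is documented
# above; "text" is an assumed input field for EMBEDDINGS.
MODULE="EMBEDDINGS"
PAYLOAD=$(jq -n --arg m "$MODULE" --arg t "example text" \
  '{input: {module: $m, text: $t}}')

curl -s -X POST "https://api.runpod.ai/v2/${GRACE_ENDPOINT_ID}/run" \
  -H "Authorization: $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```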

---

## Docker Image

**Temporary registry**: `ttl.sh/tzzr-grace:24h` (expires after 24h)

### Rebuild

```bash
# On the deck server (72.62.1.113)
cd /tmp/docker-grace
docker build -t ttl.sh/tzzr-grace:24h .
docker push ttl.sh/tzzr-grace:24h
```
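
Because ttl.sh tags expire, it can be worth confirming that the freshly pushed image is actually retrievable before pointing RunPod at it. A minimal check from any machine with Docker:

```bash
# Pull the image back to confirm it is still available in the registry.
docker pull ttl.sh/tzzr-grace:24h
```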

### Key Dockerfile lines

```dockerfile
FROM runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04

# Copy the dependency list into the image (needed for the install below)
COPY requirements.txt .

# Fix blinker before installing requirements
RUN pip install --no-cache-dir --ignore-installed blinker
RUN pip install --no-cache-dir -r requirements.txt
```

---

## Troubleshooting

### Workers don't start (0 workers)

1. Use the REST API, not GraphQL
2. Verify that `gpuTypeIds` is an array
3. Include `flashboot: true`
4. Verify the worker quota is not exceeded (see the sketch below)
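
For step 4, a rough way to see how much of the 5-worker account quota is in use is to sum the worker counts reported by each endpoint's health check. This is only a sketch: whether idle workers count against the quota is an assumption, and the field names are taken from the health-check response above.

```bash
# Rough quota check: add up idle/ready/running workers across endpoints.
TOTAL=0
for EP in rfltzijgn1jno4 zsu7eah0fo7xt6 hffu4q5pywjzng; do
  N=$(curl -s "https://api.runpod.ai/v2/${EP}/health" \
        -H "Authorization: $RUNPOD_API_KEY" \
      | jq '[.workers.idle, .workers.ready, .workers.running] | add')
  TOTAL=$((TOTAL + N))
done
echo "workers in use: ${TOTAL} / 5"
```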

### Job stuck in the queue indefinitely

```bash
# Purge the queue
curl -X POST "https://api.runpod.ai/v2/{ENDPOINT_ID}/purge-queue" \
  -H "Authorization: $RUNPOD_API_KEY"
```

### Model error (TOS)

Add the environment variable `COQUI_TOS_AGREED=1`.

---

## Credentials

Location: `/home/orchestrator/.secrets/runpod_api_key`

```bash
export RUNPOD_API_KEY=$(cat ~/.secrets/runpod_api_key)
```
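
A quick way to confirm the exported key is accepted is to hit one of the documented health endpoints and check the HTTP status code (the GRACE endpoint ID comes from the table above).

```bash
# Sanity check: expect "200" if the API key is valid.
export RUNPOD_API_KEY=$(cat ~/.secrets/runpod_api_key)
curl -s -o /dev/null -w "%{http_code}\n" \
  "https://api.runpod.ai/v2/rfltzijgn1jno4/health" \
  -H "Authorization: $RUNPOD_API_KEY"
```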

---

## References

- [RunPod REST API](https://docs.runpod.io/api-reference)
- [Serverless Endpoints](https://docs.runpod.io/serverless/endpoints/manage-endpoints)
- [Status Page](https://uptime.runpod.io)