Consolidate cron job docs into single README.md

2026-04-28 17:26:21 +00:00
parent 30f8ca3863
commit 7ce0e46670
2 changed files with 74 additions and 346 deletions
--- a/assets/ai-optimizer/README.md
+++ b/assets/ai-optimizer/README.md
@@ -0,0 +1,214 @@
+# AI Model Optimization Cron Job
+
+**Purpose:** Automatically find optimal ollama/llama.cpp configurations for maximum context size and hardware utilization.
+
+**Schedule:** Every hour
+
+**Hardware:**
+- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
+- 128GB system RAM
+- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1
+
+---
+
+## File Locations
+
+```
+STATE:   /opt/data/infra/assets/ai-optimizer/state.json
+RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
+REPO:    /opt/data/infra (persistent - do not reclone)
+```
+
+---
+
+## Model Queues
+
+### GPU Track (Coding - prioritize speed + context on GPU)
+1. `devstral-small-2:24b`
+2. `qwen2.5-coder:32b`
+3. `codellama:34b-instruct`
+
+### RAM Track (Knowledge - prioritize max context)
+1. `qwen2.5:72b`
+2. `nemotron-3-nano:30b`
+3. `mixtral:8x7b-instruct`
+
+---
+
+## Context Steps (in order)
+```
+[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
+```
+
+---
+
+## Optimization Strategy
+
+### GPU Track (Coding)
+- Start: num_ctx=32768, num_gpu=99, flash_attn=true
+- Increase context until OOM or tokens/sec < 5
+- Record the last working config before the failure
+- Target: >10 tokens/sec with max context
+
+### RAM Track (Knowledge)
+- Start: num_ctx=65536, num_gpu=50, flash_attn=true
+- Allow heavy RAM offload (up to 100GB system RAM)
+- Increase context until OOM
+- Speed secondary to context size
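+
+The stop rules for both tracks can be sketched as a small predicate. This is a minimal illustration, not the job's actual code: the function name, its parameters, and the `MIN_TOKENS_PER_SEC` constant are hypothetical, and only the thresholds come from the strategy above.
+
+```python
+from typing import Optional
+
+MIN_TOKENS_PER_SEC = 5.0  # GPU-track floor taken from the strategy above
+
+
+def should_stop_scaling(track: str, oom: bool, tokens_per_sec: Optional[float]) -> bool:
+    """Return True when context scaling should stop for the current model."""
+    if oom:
+        return True  # both tracks stop on out-of-memory
+    if track == "gpu" and tokens_per_sec is not None:
+        return tokens_per_sec < MIN_TOKENS_PER_SEC
+    return False  # RAM track: speed is secondary, only OOM stops it
+```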
+
+---
+
+## Each Run - Step by Step
+
+### 1. Read State
+```bash
+cd /opt/data/infra
+cat assets/ai-optimizer/state.json
+```
+
+### 2. Determine Next Test
+- Read `track` (gpu or ram)
+- Get `current_model` from queue at `model_index`
+- Get `current_config` for parameters to test
+- Select next context step from `context_steps`
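+
+A possible sketch of this lookup, assuming the state file has the structure shown under "State File Structure" and that `current_config` already holds the context size to test. The helper name `next_test` is hypothetical.
+
+```python
+def next_test(state: dict) -> dict:
+    """Pick the model and parameters for the next run from a parsed state.json."""
+    queue = state[f"{state['track']}_queue"]  # "gpu_queue" or "ram_queue"
+    model = queue[state["model_index"]]
+    # current_config carries num_ctx, num_gpu and flash_attn for this test
+    return {"model": model, **state["current_config"]}
+```
+
+An out-of-range `model_index` raises `IndexError`, which a caller could treat as "this track is finished".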
+
+### 3. Pull Model (if needed)
+```bash
+docker exec ollama ollama list | grep -qF "${model}" || docker exec ollama ollama pull "${model}"
+```
+
+### 4. Create Test Modelfile
+```bash
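+# num_ctx, num_gpu and flash_attn are shell variables read from current_config in state.json;
+# bash cannot expand dotted names such as ${current_config.num_ctx}.
+# Flash attention is normally enabled server-side via OLLAMA_FLASH_ATTENTION=1;
+# check that your Ollama version accepts it as a Modelfile PARAMETER.
+docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
+FROM ${model}
+PARAMETER num_ctx ${num_ctx}
+PARAMETER num_gpu ${num_gpu}
+PARAMETER flash_attn ${flash_attn}
+PARAMETER num_predict 4096
+PARAMETER num_keep 1024
+PARAMETER repeat_penalty 1.1
+EOF"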
+
+docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.modelfile
+```
+
+### 5. Run Benchmark
+```bash
+# Warm up
+docker exec ollama ollama run test-model "Hello" > /dev/null
+
+# Coding prompt (--verbose prints timing statistics, including the eval rate in tokens/s)
+docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."
+
+# Knowledge prompt
+docker exec ollama ollama run --verbose test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
+```
+
+### 6. Measure VRAM (if possible)
+```bash
+# Try host first
+rocm-smi --showmeminfo vram 2>/dev/null || \
+# Try via docker
+docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null || \
+echo "VRAM unavailable"
+```
+
+### 7. Record Results
+- Parse tokens/sec from the `eval rate` line that `ollama run --verbose` prints
+- Record VRAM/RAM usage
+- Update `best_configs` if improved
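+
+Parsing the rate might look like the sketch below. It assumes the benchmark output was captured as text and contains a line of the form `eval rate: 15.20 tokens/s`, as printed by `ollama run --verbose`. The helper name is hypothetical.
+
+```python
+import re
+from typing import Optional
+
+# Anchored at line start so "prompt eval rate: ..." is not matched.
+EVAL_RATE = re.compile(r"^eval rate:\s+([\d.]+)\s+tokens/s", re.MULTILINE)
+
+
+def parse_tokens_per_sec(output: str) -> Optional[float]:
+    """Return the generation speed in tokens/s, or None if no rate line is found."""
+    match = EVAL_RATE.search(output)
+    return float(match.group(1)) if match else None
+```
+
+Returning `None` instead of raising lets the caller record the run as failed when the statistics are missing.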
+
+### 8. Update State
+```python
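+# Assumes `state` is the parsed state.json, and that `test_successful`
+# and `last_good_config` were set by the benchmark step.
+steps = state["context_steps"]
+config = state["current_config"]
+step_index = steps.index(config["num_ctx"])
+
+if test_successful and step_index + 1 < len(steps):
+    # More context steps remain: try the next one on the same model.
+    config["num_ctx"] = steps[step_index + 1]
+else:
+    if not test_successful:
+        # Hit the wall: keep the last configuration that worked.
+        state["best_configs"][state["track"]][state["current_model"]] = last_good_config
+    # Move to the next model and restart from the first context step.
+    state["model_index"] += 1
+    config["num_ctx"] = steps[0]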
+```
+
+### 9. Commit to Repo
+```bash
+cd /opt/data/infra
+git add assets/ai-optimizer/
+git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
+git push
+```
+
+### 10. Matrix Notification (if available)
+```python
+import os
+if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
+    # Send notification
+    pass
+# Else: silent
+```
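+
+The "Send notification" placeholder could be filled in roughly as follows, using the Matrix client-server `send` endpoint. This is a sketch under assumptions: `MATRIX_ROOM_ID` is a hypothetical variable that is not part of the original setup, `MATRIX_HOME_SERVER` is assumed to include the scheme (for example `https://...`), and the function name is illustrative.
+
+```python
+import json
+import os
+import urllib.parse
+import urllib.request
+import uuid
+from typing import Optional
+
+
+def build_matrix_request(message: str) -> Optional[urllib.request.Request]:
+    """Build the PUT request for a text message, or None if Matrix is not configured."""
+    server = os.getenv("MATRIX_HOME_SERVER")
+    token = os.getenv("MATRIX_ACCESS_TOKEN")
+    room = os.getenv("MATRIX_ROOM_ID")  # hypothetical variable, not in the original
+    if not (server and token and room):
+        return None  # stay silent when Matrix is not configured
+    url = (
+        f"{server.rstrip('/')}/_matrix/client/v3/rooms/"
+        f"{urllib.parse.quote(room, safe='')}/send/m.room.message/{uuid.uuid4().hex}"
+    )
+    body = json.dumps({"msgtype": "m.text", "body": message}).encode()
+    return urllib.request.Request(
+        url,
+        data=body,
+        method="PUT",
+        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
+    )
+```
+
+A caller would send it with `urllib.request.urlopen(request, timeout=10)` when the result is not `None`.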
+
+---
+
+## State File Structure
+
+```json
+{
+  "track": "gpu",
+  "current_model": "devstral-small-2:24b",
+  "model_index": 0,
+  "phase": "context_scaling",
+  "backend": "ollama",
+  "current_config": {
+    "num_ctx": 32768,
+    "num_gpu": 99,
+    "flash_attn": true
+  },
+  "best_configs": {
+    "gpu": {},
+    "ram": {}
+  },
+  "completed_models": [],
+  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
+  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
+  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
+  "last_updated": "2026-04-28T17:00:00Z"
+}
+```
+
+---
+
+## Results CSV Format
+
+```csv
+timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
+2026-04-28T17:00:00Z,gpu,devstral-small-2:24b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
+```
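+
+Appending a row in this format could be done along these lines. The column list is copied from the header above, while the function name and its arguments are hypothetical.
+
+```python
+import csv
+from pathlib import Path
+
+FIELDS = [
+    "timestamp", "track", "model", "backend", "phase", "num_ctx", "num_gpu",
+    "flash_attn", "tokens_per_sec", "vram_gb", "ram_gb", "status", "is_best",
+]
+
+
+def append_result(path: Path, row: dict) -> None:
+    """Append one result row, writing the header first if the file is new or empty."""
+    is_new = not path.exists() or path.stat().st_size == 0
+    with path.open("a", newline="") as handle:
+        writer = csv.DictWriter(handle, fieldnames=FIELDS)
+        if is_new:
+            writer.writeheader()
+        writer.writerow(row)
+```
+
+Python writes booleans as `True`/`False`, so pass the strings `"true"`/`"false"` to match the sample row. Keys outside `FIELDS` raise `ValueError`, which catches misspelled column names early.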
+
+---
+
+## Stop Conditions
+
+1. All models in both queues have `best_configs` recorded
+2. Manual intervention needed (the `error` field in state.json is set)
+3. No progress for 3 consecutive runs
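+
+These three conditions could be checked with something like the sketch below. It assumes the state structure shown earlier. `runs_without_progress` is a hypothetical counter that the job would have to track itself, since the state file does not define one.
+
+```python
+def should_stop(state: dict, runs_without_progress: int) -> bool:
+    """Evaluate the three stop conditions against a parsed state.json."""
+    if state.get("error"):
+        return True  # manual intervention needed
+    if runs_without_progress >= 3:
+        return True  # no progress for 3 consecutive runs
+    best = state["best_configs"]
+    # Done when every queued model on both tracks has a recorded best config.
+    return all(m in best["gpu"] for m in state["gpu_queue"]) and all(
+        m in best["ram"] for m in state["ram_queue"]
+    )
+```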
+
+---
+
+## Error Handling
+
+If any step fails:
+1. Log error: `"error": {"message": "...", "timestamp": "..."}`
+2. Do NOT increment model_index (retry next run)
+3. Commit state with error field
+4. Exit gracefully
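+
+Steps 1 and 2 might be implemented roughly like this. It is a sketch that assumes the state file is plain JSON as shown above. The function name is hypothetical, and committing the file is left to the existing git step.
+
+```python
+import json
+from datetime import datetime, timezone
+from pathlib import Path
+
+
+def record_error(state_path: Path, message: str) -> dict:
+    """Write an error entry into state.json without touching model_index."""
+    state = json.loads(state_path.read_text())
+    state["error"] = {
+        "message": message,
+        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+    }
+    # model_index is deliberately left unchanged so the next run retries.
+    state_path.write_text(json.dumps(state, indent=2) + "\n")
+    return state
+```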
+
+---
+
+## Notes
+
+- **No num_parallel**: Removed to avoid limiting other settings
+- **Two tracks**: Complete GPU track first, then RAM track
+- **Backend**: Start with ollama, llama.cpp optional
+- **Host access**: Use docker exec or SSH for rocm-smi
+- **Ask before deploy**: Show diff before `nh os switch`