Add AI model optimizer cron job draft and initial state files

2026-04-28 17:19:45 +00:00
parent 7efba3ac5b
commit 30f8ca3863
4 changed files with 508 additions and 0 deletions
--- a/assets/ai-optimizer/CRON_EXECUTION_PROMPT.md
+++ b/assets/ai-optimizer/CRON_EXECUTION_PROMPT.md
@@ -0,0 +1,203 @@
+# AI Model Optimization Cron Job - EXECUTION PROMPT
+
+**When this cron job runs, follow these instructions exactly:**
+
+---
+
+## Your Role
+
+You are an AI model optimization agent. Your task is to find the best ollama/llama.cpp configuration for maximum context size and hardware utilization.
+
+**Hardware:**
+- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
+- 128GB system RAM
+- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1
+
+---
+
+## File Locations
+
+```
+STATE: /opt/data/infra/assets/ai-optimizer/state.json
+RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
+INFRA_REPO: /opt/data/infra
+```
+
+---
+
+## Model Queues
+
+### GPU Track (Coding - prioritize speed + context on GPU)
+1. `devstral-small-2:24b`
+2. `qwen2.5-coder:32b`
+3. `codellama:34b-instruct`
+
+### RAM Track (Knowledge - prioritize max context)
+1. `qwen2.5:72b`
+2. `nemotron-3-nano:30b`
+3. `mixtral:8x7b-instruct`
+
+---
+
+## Context Steps (in order)
+```
+[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
+```
+
+---
+
+## Each Run - Step by Step
+
+### 1. Read State
+```bash
+cd /opt/data/infra
+cat assets/ai-optimizer/state.json
+```
+
+### 2. Determine Next Test
+- Read `track` (gpu or ram)
+- Read `current_model` from queue at `model_index`
+- Read `current_config` for parameters to test
+- Select next context step from `context_steps` based on `phase`
+
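Step 2 could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `QUEUES` mapping and the `next_test` helper are hypothetical names, and state.json is assumed to hold the `track`, `model_index`, and `current_config` fields shown in the example state transitions.

```python
# Hypothetical helper: resolve the next test from a parsed state.json.
CONTEXT_STEPS = [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]

QUEUES = {
    "gpu": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
    "ram": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
}


def next_test(state):
    """Return (model, num_ctx) for the next run, or None when the track is done."""
    queue = QUEUES[state["track"]]
    index = state["model_index"]
    if index >= len(queue):
        return None  # every model in this track has been tested
    num_ctx = state["current_config"]["num_ctx"]
    if num_ctx not in CONTEXT_STEPS:
        raise ValueError(f"num_ctx {num_ctx} is not a known context step")
    return queue[index], num_ctx


state = {"track": "gpu", "model_index": 0, "current_config": {"num_ctx": 32768}}
print(next_test(state))
```

Returning `None` once `model_index` runs past the queue gives the caller a clear signal to switch tracks or stop.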
+### 3. Pull Model (if needed)
+```bash
+docker exec ollama ollama list | grep -q "<model>" || docker exec ollama ollama pull <model>
+```
+
+### 4. Create Test Modelfile
+```bash
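+# Fill these in from state.json (current_model and current_config)
+model="<model>"
+num_ctx="<current_config.num_ctx>"
+num_gpu="<current_config.num_gpu>"
+flash_attn="<current_config.flash_attn>"
+
+# Note: flash attention is normally toggled server-side via OLLAMA_FLASH_ATTENTION;
+# check that your ollama version accepts flash_attn as a Modelfile PARAMETER.
+# The host shell expands the variables before the heredoc reaches the container.
+docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
+FROM ${model}
+PARAMETER num_ctx ${num_ctx}
+PARAMETER num_gpu ${num_gpu}
+PARAMETER flash_attn ${flash_attn}
+PARAMETER num_predict 4096
+PARAMETER num_keep 1024
+PARAMETER repeat_penalty 1.1
+EOF"
+
+docker exec ollama ollama create test-model -f "/root/.ollama/test_${model}.modelfile"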
+```
+
+### 5. Run Benchmark
+```bash
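+# Warm up (loads the model so load time does not skew the measurement)
+docker exec ollama ollama run test-model "Hello" > /dev/null
+
+# Coding prompt; --verbose prints timing stats, including the eval rate in tokens/s
+# (/tmp/bench.log is an arbitrary log path on the host)
+START=$(date +%s%N)
+docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints." 2>&1 | tee /tmp/bench.log
+END=$(date +%s%N)
+echo "wall time: $(( (END - START) / 1000000 )) ms"
+
+# Tokens/sec: read the "eval rate" lines from the verbose stats
+grep "eval rate" /tmp/bench.log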
+```
+
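To turn the benchmark output into a number, the timing summary can be parsed with a small helper. This sketch assumes the run was invoked with `ollama run --verbose` and that the summary contains a line of the form `eval rate: <number> tokens/s`; the `parse_eval_rate` name and the sample text are illustrative, not captured measurements.

```python
import re

# Illustrative sample of the timing summary (values are made up, layout is assumed).
SAMPLE = """\
prompt eval rate:     412.50 tokens/s
eval count:           298 token(s)
eval rate:            11.20 tokens/s
"""


def parse_eval_rate(text):
    """Return the generation rate in tokens/s, or None if no such line exists."""
    # Anchor at the start of a line so "prompt eval rate:" is not matched.
    match = re.search(r"^eval rate:\s+([0-9.]+)\s+tokens/s", text, re.MULTILINE)
    return float(match.group(1)) if match else None


print(parse_eval_rate(SAMPLE))
```

Returning `None` rather than raising lets the caller treat a missing rate as a failed test and record it in state.json.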
+### 6. Measure VRAM (if possible)
+```bash
+# Try host first
+rocm-smi --showmeminfo vram 2>/dev/null || \
+# Try via docker
+docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null || \
+# Fallback
+echo "VRAM measurement unavailable"
+```
+
+### 7. Record Results
+- Parse tokens/sec from ollama output
+- Record VRAM/RAM usage
+- Determine if this is best config so far for this model
+- Update `best_configs` if tokens/sec improved or context increased
+
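The rule in step 7 ("update `best_configs` if tokens/sec improved or context increased") might look like this in Python. The `update_best` helper is a hypothetical name, the result dict uses the keys from the example `best_configs` entry, and the second sample result is illustrative rather than measured.

```python
def update_best(best_configs, track, model, result):
    """Store result as the best config if it beats the recorded one.

    Returns True when best_configs was updated.
    """
    best = best_configs.setdefault(track, {}).get(model)
    improved = (
        best is None
        or result["num_ctx"] > best["num_ctx"]
        or result["tokens_per_sec"] > best["tokens_per_sec"]
    )
    if improved:
        best_configs[track][model] = result
    return improved


best_configs = {}
first = {"num_ctx": 65536, "num_gpu": 99, "tokens_per_sec": 12.0}  # illustrative
second = {"num_ctx": 98304, "num_gpu": 99, "tokens_per_sec": 11.2}
print(update_best(best_configs, "gpu", "devstral-small-2:24b", first))
print(update_best(best_configs, "gpu", "devstral-small-2:24b", second))
```

Because the rule uses "or", a larger context replaces the previous best even when tokens/sec drops, which matches the goal of maximizing context size.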
+### 8. Update State
+```python
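+# Logic (fields as in state.json; context_steps is the list above):
+step = context_steps.index(current_config["num_ctx"])
+if test_successful:
+    last_good_config = dict(current_config)
+    if step + 1 < len(context_steps):
+        # Same model, next larger context
+        phase = "context_scaling"
+        current_config["num_ctx"] = context_steps[step + 1]
+    else:
+        # Largest context passed - record it and move to next model
+        best_configs[track][current_model] = last_good_config
+        model_index += 1
+        phase = "context_scaling"
+        current_config["num_ctx"] = context_steps[0]
+else:
+    # OOM or error - record last good as best
+    best_configs[track][current_model] = last_good_config
+    model_index += 1
+    phase = "context_scaling"
+    current_config["num_ctx"] = context_steps[0]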
+
+### 9. Commit to Repo
+```bash
+cd /opt/data/infra
+git add assets/ai-optimizer/
+git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
+git push origin master
+```
+
+### 10. Matrix Notification (if available)
+```python
+import os
+if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
+    # Send notification to Matrix room
+    # Room ID from env or config
+    pass
+# Else: silent
+```
+
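The notification stub could be filled in along these lines, using only the standard library and the Matrix client-server "send message event" endpoint. This is a sketch under assumptions: `MATRIX_ROOM_ID` is a hypothetical variable name, `MATRIX_HOME_SERVER` is assumed to be a full base URL including the scheme, and `build_matrix_request` is an illustrative helper.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid


def build_matrix_request(homeserver, token, room_id, text):
    """Build (but do not send) a PUT request that posts a text message to a room."""
    txn_id = uuid.uuid4().hex  # unique per message so the server can deduplicate retries
    url = (
        f"{homeserver.rstrip('/')}/_matrix/client/v3/rooms/"
        f"{urllib.parse.quote(room_id, safe='')}/send/m.room.message/{txn_id}"
    )
    body = json.dumps({"msgtype": "m.text", "body": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


homeserver = os.getenv("MATRIX_HOME_SERVER")
token = os.getenv("MATRIX_ACCESS_TOKEN")
room_id = os.getenv("MATRIX_ROOM_ID")  # hypothetical variable name
if homeserver and token and room_id:
    request = build_matrix_request(homeserver, token, room_id, "ai-optimizer: run finished")
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
# Else: silent
```

Separating request construction from sending keeps the network call optional and makes the URL and payload easy to inspect.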
+---
+
+## Stop Conditions
+
+1. All models in both queues have `best_configs` recorded
+2. Manual intervention is needed (the `error` field in state.json is set)
+3. No progress for 3 consecutive runs (stuck)
+
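The three stop conditions can be expressed as a single check. In this sketch, `stop_reason` and `QUEUES` are hypothetical names, and `stalled_runs` is an assumed counter of consecutive runs without progress; the source does not say where such a counter is stored.

```python
QUEUES = {
    "gpu": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
    "ram": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
}


def stop_reason(state, stalled_runs=0):
    """Return "complete", "error", or "stuck" if the job should stop, else None."""
    best = state.get("best_configs", {})
    if all(
        model in best.get(track, {})
        for track, models in QUEUES.items()
        for model in models
    ):
        return "complete"  # condition 1: every model has a best config
    if state.get("error"):
        return "error"  # condition 2: manual intervention needed
    if stalled_runs >= 3:
        return "stuck"  # condition 3: no progress for 3 consecutive runs
    return None


print(stop_reason({"best_configs": {}}))
```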
+---
+
+## Error Handling
+
+If any step fails (as opposed to the benchmark itself failing, which step 8 handles):
+1. Log error to state.json: `"error": {"message": "...", "timestamp": "..."}`
+2. Do NOT increment model_index (retry next run)
+3. Commit state with error field
+4. Exit gracefully
+
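A minimal sketch of the first two error-handling steps, assuming the state is a plain dict loaded from state.json. The `record_error` helper and its `now` parameter are hypothetical; the `error` object follows the shape given above, and `model_index` is left untouched so the same test is retried on the next run.

```python
import copy
from datetime import datetime, timezone


def record_error(state, message, now=None):
    """Return a copy of state with the error field set; model_index is unchanged."""
    new_state = copy.deepcopy(state)
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    new_state["error"] = {"message": message, "timestamp": timestamp}
    return new_state


state = {"track": "gpu", "model_index": 1}
print(record_error(state, "model pull failed"))
```

Working on a copy means the original state is only replaced once the new one has been written and committed.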
+---
+
+## Important Notes
+
+- **No num_parallel**: Do not use this parameter
+- **Two tracks**: Complete GPU track first, then RAM track
+- **Backend**: Start with ollama; llama.cpp testing is optional (requires uncommenting in compose.yml)
+- **Host access**: Some commands need host access. Use docker exec or SSH if available
+- **Ask before deploy**: If config changes are needed in NixOS modules, show the diff and wait for user confirmation before running `nh os switch`
+
+---
+
+## Example State Transitions
+
+**Start:**
+```json
+{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 32768, ...}}
+```
+
+**After successful test at 32k:**
+```json
+{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 65536, ...}}
+```
+
+**After OOM at 131k:**
+```json
+{
+  "track": "gpu",
+  "model_index": 1,
+  "current_model": "qwen2.5-coder:32b",
+  "best_configs": {
+    "gpu": {
+      "devstral-small-2:24b": {"num_ctx": 98304, "num_gpu": 99, "tokens_per_sec": 11.2}
+    }
+  }
+}
+```