infra/assets/ai-optimizer/README.md

# AI Model Optimization Cron Job

**Purpose:** Automatically find optimal ollama/llama.cpp configurations for maximum context size and hardware utilization.

**Schedule:** Every hour

**Hardware:**
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1

---

## File Locations

```
STATE:   /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
REPO:    /opt/data/infra (persistent - do not reclone)
```

---

## Model Queues

### GPU Track (Coding - prioritize speed + context on GPU)
1. `devstral-small-2:24b`
2. `qwen2.5-coder:32b`
3. `codellama:34b-instruct`

### RAM Track (Knowledge - prioritize max context)
1. `qwen2.5:72b`
2. `nemotron-3-nano:30b`
3. `mixtral:8x7b-instruct`

---

## Context Steps (in order)
```
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
```

---

## Optimization Strategy

### GPU Track (Coding)
- Start: num_ctx=32768, num_gpu=99, flash_attn=true
- Increase context until OOM or throughput drops below 5 tokens/sec
- Record the last config that succeeded before hitting that limit
- Target: >10 tokens/sec with max context

### RAM Track (Knowledge)
- Start: num_ctx=65536, num_gpu=50, flash_attn=true
- Allow heavy RAM offload (up to 100GB system RAM)
- Increase context until OOM
- Speed secondary to context size
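
The stopping rules of the two tracks can be expressed as one small decision function. This is a minimal sketch, not the job's actual code: `should_stop`, its arguments, and the `MIN_TPS_GPU` constant are illustrative names that do not exist in the repo.

```python
from typing import Optional

MIN_TPS_GPU = 5.0  # GPU track floor taken from the strategy above


def should_stop(track: str, oom: bool, tokens_per_sec: Optional[float]) -> bool:
    """Return True when context scaling for the current model should end."""
    if oom:
        return True  # both tracks stop on out-of-memory
    if track == "gpu":
        # The GPU track also stops once generation becomes too slow.
        return tokens_per_sec is not None and tokens_per_sec < MIN_TPS_GPU
    return False  # RAM track: speed is secondary, keep growing the context
```

For example, `should_stop("ram", False, 1.0)` is `False` because the RAM track only stops on OOM, while the same throughput on the GPU track ends the scan.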

---

## Each Run - Step by Step

### 1. Read State
```bash
cd /opt/data/infra
cat assets/ai-optimizer/state.json
```

### 2. Determine Next Test
- Read `track` (gpu or ram)
- Get `current_model` from queue at `model_index`
- Get `current_config` for parameters to test
- Select next context step from `context_steps`
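
A minimal sketch of this selection step, assuming the field names from the state file shown later in this README. It also assumes that `current_config.num_ctx` already holds the context step to test on this run (which is how step 8 leaves it); `next_test` is a hypothetical helper name.

```python
from typing import Optional, Tuple


def next_test(state: dict) -> Optional[Tuple[str, int]]:
    """Return (model, num_ctx) for this run, or None if the track's queue is exhausted."""
    queue = state[f"{state['track']}_queue"]  # "gpu_queue" or "ram_queue"
    index = state["model_index"]
    if index >= len(queue):
        return None  # nothing left to test on this track
    num_ctx = state["current_config"]["num_ctx"]
    if num_ctx not in state["context_steps"]:
        # Guard against a hand-edited state file with an unknown step.
        raise ValueError(f"num_ctx {num_ctx} is not one of context_steps")
    return queue[index], num_ctx
```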

### 3. Pull Model (if needed)
```bash
docker exec ollama ollama list | grep -qF "${model}" || docker exec ollama ollama pull "${model}"
```

### 4. Create Test Modelfile
```bash
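# model, num_ctx, and num_gpu are shell variables read from state.json beforehand
# (bash cannot expand dotted names such as current_config.num_ctx).
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"

# flash_attn is not a Modelfile PARAMETER; flash attention is enabled on the
# Ollama server through the OLLAMA_FLASH_ATTENTION=1 environment variable.
docker exec ollama ollama create test-model -f "/root/.ollama/test_${model}.modelfile"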
```

### 5. Run Benchmark
```bash
# Warm up (loads the model so load time does not skew the timings)
docker exec ollama ollama run test-model "Hello" > /dev/null

# --verbose prints timing statistics, including the eval rate in tokens/s
# Coding prompt
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."

# Knowledge prompt
docker exec ollama ollama run --verbose test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
```

### 6. Measure VRAM (if possible)
```bash
# Try the host first, then the container; report if neither works
rocm-smi --showmeminfo vram 2>/dev/null \
  || docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null \
  || echo "VRAM unavailable"
```

### 7. Record Results
- Parse tokens/sec from the `eval rate` line that `ollama run --verbose` prints
- Record VRAM/RAM usage
- Update `best_configs` if improved
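
Parsing the throughput might look like the sketch below. It assumes the timing summary contains a line of the form `eval rate: 15.20 tokens/s`, as `ollama run --verbose` typically prints; check the exact format against your Ollama version. `parse_tokens_per_sec` is a hypothetical helper name.

```python
import re
from typing import Optional

# Anchored at line start so the separate "prompt eval rate:" line is not matched.
EVAL_RATE = re.compile(r"^[ \t]*eval rate:\s+([0-9.]+)\s+tokens/s", re.MULTILINE)


def parse_tokens_per_sec(verbose_output: str) -> Optional[float]:
    """Extract the generation speed from captured verbose output, or None if absent."""
    match = EVAL_RATE.search(verbose_output)
    return float(match.group(1)) if match else None
```

Returning `None` rather than raising lets the caller treat a missing statistic as a failed measurement and follow the error-handling rules.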

### 8. Update State
```python
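# state is the dict loaded from state.json. test_successful is False on OOM or
# when the speed floor is hit; last_good_config is the most recent passing
# config, or None if even the first step failed.
steps = state["context_steps"]
cfg = state["current_config"]
idx = steps.index(cfg["num_ctx"])

if test_successful and idx + 1 < len(steps):
    # More headroom to try: advance to the next context step.
    cfg["num_ctx"] = steps[idx + 1]
else:
    # Hit the wall, or passed the largest step: record the best config and move on.
    best = dict(cfg) if test_successful else last_good_config
    if best is not None:
        state["best_configs"][state["track"]][state["current_model"]] = best
    state["completed_models"].append(state["current_model"])
    state["model_index"] += 1
    cfg["num_ctx"] = steps[0]  # the next model starts from the first step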

### 9. Commit to Repo
```bash
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push
```

### 10. Matrix Notification (if available)
```python
import os
if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    # Send notification
    pass
# Else: silent
```
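
One way to fill in the notification stub with only the standard library is sketched below. It targets the Matrix client-server `send` endpoint for room messages. `MATRIX_ROOM_ID` is a hypothetical variable that this README does not define, `build_matrix_request` is an illustrative name, and `MATRIX_HOME_SERVER` is assumed to include the scheme (for example `https://...`).

```python
import json
import os
import urllib.parse
import urllib.request
import uuid
from typing import Optional


def build_matrix_request(message: str) -> Optional[urllib.request.Request]:
    """Build the PUT request for a text message, or None if Matrix is not configured."""
    server = os.getenv("MATRIX_HOME_SERVER")
    token = os.getenv("MATRIX_ACCESS_TOKEN")
    room = os.getenv("MATRIX_ROOM_ID")  # hypothetical, not defined in this README
    if not (server and token and room):
        return None  # stay silent when any setting is missing
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        server.rstrip("/"),
        urllib.parse.quote(room, safe=""),  # room IDs contain '!' and ':'
        uuid.uuid4().hex,  # unique transaction ID per message
    )
    body = json.dumps({"msgtype": "m.text", "body": message}).encode("utf-8")
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, method="PUT", headers=headers)
```

To send, pass a non-`None` result to `urllib.request.urlopen(request, timeout=10)` and treat network failures as non-fatal so a Matrix outage does not abort the run.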

---

## State File Structure

```json
{
  "track": "gpu",
  "current_model": "devstral-small-2:24b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 32768,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {},
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
  "last_updated": "2026-04-28T17:00:00Z"
}
```
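
Because the job is re-entered every hour and commits this file, a half-written `state.json` would stall every later run. The sketch below shows one way to read and write it atomically; `load_state` and `save_state` are hypothetical helper names, and refreshing `last_updated` on save is an assumption about how that field is maintained.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def load_state(path: str) -> dict:
    """Read the state file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path: str, state: dict) -> None:
    """Write the state atomically: temp file in the same directory, then rename."""
    state["last_updated"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
        fh.write("\n")
    os.replace(tmp_path, path)  # atomic on POSIX when both paths share a filesystem
```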

---

## Results CSV Format

```csv
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
2026-04-28T17:00:00Z,gpu,devstral-small-2:24b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
```
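
Appending a row in this format could be done as in the sketch below, which writes the header only when the file is new or empty. `append_result` is a hypothetical helper name; booleans are lowercased so they match the `true`/`false` style of the sample row.

```python
import csv
import os

FIELDS = [
    "timestamp", "track", "model", "backend", "phase", "num_ctx", "num_gpu",
    "flash_attn", "tokens_per_sec", "vram_gb", "ram_gb", "status", "is_best",
]


def append_result(path: str, row: dict) -> None:
    """Append one benchmark result; unknown keys raise ValueError via DictWriter."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    # Python would write True/False; the CSV uses lowercase true/false.
    cleaned = {k: str(v).lower() if isinstance(v, bool) else v for k, v in row.items()}
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(cleaned)
```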

---

## Stop Conditions

1. All models in both queues have an entry in `best_configs`
2. Manual intervention needed (error in state.json `error` field)
3. No progress for 3 consecutive runs
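
These checks can be sketched as a single function run at the start of each invocation. The README does not say how "no progress" is tracked, so `stalled_runs` is a hypothetical counter supplied by the caller; `stop_reason` and the returned labels are likewise illustrative.

```python
from typing import Optional

MAX_STALLED_RUNS = 3  # "no progress for 3 consecutive runs"


def stop_reason(state: dict, stalled_runs: int = 0) -> Optional[str]:
    """Return why the job should stop, or None if it should keep going."""
    if state.get("error"):
        return "error"  # manual intervention needed
    if stalled_runs >= MAX_STALLED_RUNS:
        return "no_progress"
    queued = [("gpu", m) for m in state["gpu_queue"]]
    queued += [("ram", m) for m in state["ram_queue"]]
    if all(model in state["best_configs"][track] for track, model in queued):
        return "complete"  # every queued model has a recorded best config
    return None
```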

---

## Error Handling

If any step fails:
1. Log error: `"error": {"message": "...", "timestamp": "..."}`
2. Do NOT increment model_index (retry next run)
3. Commit state with error field
4. Exit gracefully
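
A minimal sketch of steps 1 and 2, assuming the state is held as a plain dict: the error is attached in the shape shown above and `model_index` is left untouched so the next run retries the same test. `record_error` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def record_error(state: dict, message: str) -> dict:
    """Return a copy of the state with an error field; model_index is not changed."""
    updated = dict(state)  # shallow copy so the caller's dict is not mutated
    updated["error"] = {
        "message": message,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return updated
```

The returned dict is what gets written to `state.json` and committed before the run exits.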

---

## Notes

- **No num_parallel**: Removed to avoid limiting other settings
- **Two tracks**: Complete GPU track first, then RAM track
- **Backend**: Start with ollama, llama.cpp optional
- **Host access**: Use docker exec or SSH for rocm-smi
- **Ask before deploy**: Show diff before `nh os switch`