# AI Model Optimization Cron Job
**Purpose:** Automatically find optimal ollama/llama.cpp configurations for maximum context size and hardware utilization.

**Schedule:** Every hour

**Hardware:**
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1
---
## File Locations
```
STATE: /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
REPO: /opt/data/infra (persistent - do not reclone)
```
---
## Model Queues
### GPU Track (Coding - prioritize speed + context on GPU)
1. `devstral-small-2:24b`
2. `qwen2.5-coder:32b`
3. `codellama:34b-instruct`
### RAM Track (Knowledge - prioritize max context)
1. `qwen2.5:72b`
2. `nemotron-3-nano:30b`
3. `mixtral:8x7b-instruct`
---
## Context Steps (in order)
```
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
```
---
## Optimization Strategy
### GPU Track (Coding)
- Start: num_ctx=32768, num_gpu=99, flash_attn=true
- Increase context until OOM or tokens/sec < 5
- Record best config before hitting wall
- Target: >10 tokens/sec with max context
### RAM Track (Knowledge)
- Start: num_ctx=65536, num_gpu=50, flash_attn=true
- Allow heavy RAM offload (up to 100GB system RAM)
- Increase context until OOM
- Speed secondary to context size
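
The scaling rules for both tracks can be sketched as a single decision function. This is a minimal illustration rather than the optimizer's actual code: the function name `next_context` and the constant `MIN_GPU_TOKENS_PER_SEC` are hypothetical, and the sketch assumes the 5 tokens/sec floor applies only to the GPU track, as the lists above suggest.

```python
from typing import Optional

CONTEXT_STEPS = [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
MIN_GPU_TOKENS_PER_SEC = 5.0  # hypothetical name for the GPU-track speed floor


def next_context(track: str, num_ctx: int, oom: bool, tokens_per_sec: float) -> Optional[int]:
    """Return the next context size to test, or None when scaling should stop."""
    if oom:
        # Out of memory ends scaling on both tracks.
        return None
    if track == "gpu" and tokens_per_sec < MIN_GPU_TOKENS_PER_SEC:
        # The GPU track also stops once generation becomes too slow.
        return None
    index = CONTEXT_STEPS.index(num_ctx)
    if index + 1 >= len(CONTEXT_STEPS):
        # Already at the largest configured step.
        return None
    return CONTEXT_STEPS[index + 1]
```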
---
## Each Run - Step by Step
### 1. Read State
```bash
cd /opt/data/infra
cat assets/ai-optimizer/state.json
```
### 2. Determine Next Test
- Read `track` (gpu or ram)
- Get `current_model` from queue at `model_index`
- Get `current_config` for parameters to test
- Select next context step from `context_steps`
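
A minimal sketch of this selection step, assuming the state file layout documented under "State File Structure". The helper names `load_state` and `next_test` are hypothetical, and the sketch assumes `current_config` already holds the context size to test (step 8 advances it after each run).

```python
import json
from pathlib import Path
from typing import Optional

STATE_PATH = Path("/opt/data/infra/assets/ai-optimizer/state.json")


def load_state(path: Path = STATE_PATH) -> dict:
    """Read the optimizer state from disk."""
    return json.loads(path.read_text())


def next_test(state: dict) -> Optional[dict]:
    """Pick the model and parameters for the next run, or None if the queue is done."""
    queue = state[f"{state['track']}_queue"]  # "gpu_queue" or "ram_queue"
    if state["model_index"] >= len(queue):
        return None
    model = queue[state["model_index"]]
    # Merge the model name with the parameters stored in current_config.
    return {"track": state["track"], "model": model, **state["current_config"]}
```

With the sample state file from this README, `next_test(load_state())` would select `devstral-small-2:24b` at `num_ctx=32768`.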
### 3. Pull Model (if needed)
```bash
docker exec ollama ollama list | grep -qF -- "${model}" || docker exec ollama ollama pull "${model}"
```
### 4. Create Test Modelfile
```bash
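# num_ctx, num_gpu and flash_attn hold the values from current_config in state.json
# (bash cannot expand dotted names such as ${current_config.num_ctx}).
# Note: flash attention is usually toggled server-side with OLLAMA_FLASH_ATTENTION=1;
# verify that your Ollama version accepts it as a Modelfile PARAMETER.
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER flash_attn ${flash_attn}
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"
docker exec ollama ollama create test-model -f "/root/.ollama/test_${model}.modelfile"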
```
### 5. Run Benchmark
```bash
# Warm up (loads the model; output discarded)
docker exec ollama ollama run test-model "Hello" > /dev/null
# Coding prompt (--verbose prints timing stats, including the eval rate in tokens/s)
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."
# Knowledge prompt
docker exec ollama ollama run --verbose test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
```
### 6. Measure VRAM (if possible)
```bash
# Try the host first, then the container; report if neither works
rocm-smi --showmeminfo vram 2>/dev/null \
  || docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null \
  || echo "VRAM unavailable"
```
### 7. Record Results
- Parse tokens/sec from ollama output
- Record VRAM/RAM usage
- Update `best_configs` if improved
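
Parsing the generation speed could look like the sketch below. It assumes the benchmark was run with `ollama run --verbose` and that the captured output contains a line of the form `eval rate: 25.31 tokens/s`; the function name `parse_eval_rate` is hypothetical, and the exact output layout may differ between Ollama versions.

```python
import re
from typing import Optional

# Matches "eval rate: <number> tokens/s" at the start of a line, which skips
# the separate "prompt eval rate:" line reported for prompt processing.
EVAL_RATE_PATTERN = re.compile(r"^[ \t]*eval rate:\s+([0-9.]+)\s+tokens/s", re.MULTILINE)


def parse_eval_rate(output: str) -> Optional[float]:
    """Extract the generation speed in tokens/sec, or None if it is not present."""
    match = EVAL_RATE_PATTERN.search(output)
    return float(match.group(1)) if match else None
```

Returning `None` rather than raising lets the caller treat a missing rate as a failed test and follow the error handling rules.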
### 8. Update State
```python
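# Sketch: `state` is the dict loaded from state.json; `max_reached` is
# interpreted here as the last entry of context_steps.
steps = state["context_steps"]
config = state["current_config"]
if test_successful:
    step_index = steps.index(config["num_ctx"])
    if step_index + 1 < len(steps):
        # More context sizes left for this model: try the next one.
        config["num_ctx"] = steps[step_index + 1]
    else:
        # Largest context size reached: move on to the next model.
        state["model_index"] += 1
        config["num_ctx"] = steps[0]
else:
    # Hit the wall: keep the last working config and move on.
    state["best_configs"][state["track"]][state["current_model"]] = last_good_config
    state["model_index"] += 1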
### 9. Commit to Repo
```bash
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push
```
### 10. Matrix Notification (if available)
```python
import os

if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    # Send notification
    pass
# Otherwise: stay silent
```
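One way to fill in the notification is the Matrix client-server API's message-send endpoint, sketched below with the standard library only. Several things here are assumptions not stated in this README: the `MATRIX_ROOM_ID` variable is hypothetical, `MATRIX_HOME_SERVER` is assumed to be a full base URL including the scheme, and the helper names are illustrative. Building the request separately from sending it keeps the silent fallback easy to check.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid
from typing import Optional


def build_matrix_request(message: str) -> Optional[urllib.request.Request]:
    """Build the PUT request for a text message, or None if Matrix is not configured."""
    server = os.getenv("MATRIX_HOME_SERVER")
    token = os.getenv("MATRIX_ACCESS_TOKEN")
    room = os.getenv("MATRIX_ROOM_ID")  # hypothetical variable, not defined in this README
    if not (server and token and room):
        return None
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        server.rstrip("/"),
        urllib.parse.quote(room, safe=""),  # room IDs contain "!" and ":"
        uuid.uuid4().hex,  # transaction ID, unique per message
    )
    body = json.dumps({"msgtype": "m.text", "body": message}).encode("utf-8")
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, method="PUT", headers=headers)


def notify(message: str) -> bool:
    """Send the message if configured; return False (silently) otherwise."""
    request = build_matrix_request(message)
    if request is None:
        return False
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```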
---
## State File Structure
```json
{
  "track": "gpu",
  "current_model": "devstral-small-2:24b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 32768,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {},
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
  "last_updated": "2026-04-28T17:00:00Z"
}
```
---
## Results CSV Format
```csv
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
2026-04-28T17:00:00Z,gpu,devstral-small-2:24b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
```
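Appending a row in this format might be done as in the sketch below, which writes the header only when the file is new or empty. The function name `append_result` is hypothetical, and the caller is assumed to pass values already formatted as they should appear in the file (for example the string `"true"` rather than Python's `True`).

```python
import csv
from pathlib import Path

FIELDS = [
    "timestamp", "track", "model", "backend", "phase", "num_ctx", "num_gpu",
    "flash_attn", "tokens_per_sec", "vram_gb", "ram_gb", "status", "is_best",
]


def append_result(path, row: dict) -> None:
    """Append one result row, writing the header first if the file is new or empty."""
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```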
---
## Stop Conditions
1. All models in both queues have `best_configs` recorded
2. Manual intervention needed (error in state.json `error` field)
3. No progress for 3 consecutive runs
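
These three conditions can be checked together, as in the rough sketch below. The counter `runs_without_progress` is hypothetical: the state file shown above has no such field, so the caller would need to track it somewhere. "Recorded" is interpreted here as the model name being a key in `best_configs` for its track.

```python
def should_stop(state: dict, runs_without_progress: int) -> bool:
    """Return True when any of the stop conditions holds."""
    if state.get("error"):
        # An error field in state.json signals that manual intervention is needed.
        return True
    if runs_without_progress >= 3:
        return True
    best = state["best_configs"]
    gpu_done = all(model in best["gpu"] for model in state["gpu_queue"])
    ram_done = all(model in best["ram"] for model in state["ram_queue"])
    return gpu_done and ram_done
```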
---
## Error Handling
If any step fails:
1. Log error: `"error": {"message": "...", "timestamp": "..."}`
2. Do NOT increment model_index (retry next run)
3. Commit state with error field
4. Exit gracefully
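
Steps 1 and 2 could be implemented along these lines. This is a sketch under assumptions: the helper name `record_error` is hypothetical, and the timestamp format is copied from the `last_updated` example in the state file. The function leaves `model_index` untouched so the same test is retried on the next run; committing the file and exiting are left to the caller.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_error(state_path, message: str) -> dict:
    """Store the error in state.json without advancing model_index."""
    path = Path(state_path)
    state = json.loads(path.read_text())
    state["error"] = {
        "message": message,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    path.write_text(json.dumps(state, indent=2) + "\n")
    return state
```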
---
## Notes
- **No num_parallel**: Removed so that parallel request slots do not take memory away from the context size being tested
- **Two tracks**: Complete GPU track first, then RAM track
- **Backend**: Start with ollama, llama.cpp optional
- **Host access**: Use docker exec or SSH for rocm-smi
- **Ask before deploy**: Show diff before `nh os switch`