# AI Model Optimization Cron Job
**Goal:** Find optimal configurations for maximum context size with full hardware utilization.
**Hardware:**
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1
---
## Model Queue
### GPU-Optimized (Coding - prioritize speed + context on GPU)
1. `devstral-small-2:24b` - Best coding model
2. `qwen2.5-coder:32b` - Strong coder, fits on GPU+offload
3. `codellama:34b-instruct` - Legacy but solid
### RAM-Optimized (Knowledge - prioritize max context, accept slower)
1. `qwen2.5:72b` - Best knowledge, needs heavy offload
2. `nemotron-3-nano:30b` - Good general knowledge
3. `mixtral:8x7b-instruct` - MoE, efficient for knowledge
---
## Optimization Strategy
**Two separate tracks:**
### Track A: GPU-Focused (Coding)
```
Baseline: num_ctx=32768, num_gpu=99, flash_attn=true
Steps:
1. Increase context: 32k → 65k → 98k → 131k → 163k
2. At each step, verify VRAM usage < 60GB (leave headroom)
3. If OOM: reduce num_gpu until stable, record best
4. Measure tokens/sec - if < 5 tok/s, consider context too high
```
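The Track A steps can be sketched in Python as a context ladder plus an acceptance check. This is a minimal sketch: the function names are hypothetical, and the 60 GB and 5 tok/s defaults are simply the thresholds listed above.

```python
STEP = 32768  # one "32k" increment, matching the Track A baseline num_ctx


def context_ladder(start: int, steps: int, step: int = STEP) -> list[int]:
    """Return `steps` context sizes beginning at `start`, spaced by `step`."""
    return [start + i * step for i in range(steps)]


def track_a_accepts(vram_used_gb: float, tokens_per_sec: float,
                    vram_limit_gb: float = 60.0, min_tps: float = 5.0) -> bool:
    """True when a run stays under the VRAM headroom limit and is fast enough."""
    return vram_used_gb < vram_limit_gb and tokens_per_sec >= min_tps


print(context_ladder(32768, 5))  # [32768, 65536, 98304, 131072, 163840]
```

The ladder reproduces the 32k → 163k sequence as exact token counts, which is what `num_ctx` expects.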
### Track B: RAM-Focused (Knowledge)
```
Baseline: num_ctx=65536, num_gpu=50, flash_attn=true
Steps:
1. Increase context: 65k → 131k → 196k → 262k → 327k
2. Allow heavy RAM offload (system RAM up to 100GB)
3. If OOM: reduce context or num_gpu
4. Speed less critical - focus on max stable context
```
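One possible reading of the "reduce context or num_gpu" rule, sketched in Python. The order (offload more layers first, shrink context second), the step sizes, and the floor values are all assumptions for illustration; the draft does not specify them.

```python
from typing import Optional


def backoff_on_oom(config: dict, gpu_step: int = 10, min_gpu: int = 0,
                   ctx_step: int = 32768, min_ctx: int = 65536) -> Optional[dict]:
    """Return a smaller config to retry after an OOM, or None if nothing is left to reduce.

    Assumed policy: lower num_gpu first, then lower num_ctx.
    """
    new = dict(config)  # do not mutate the caller's config
    if config["num_gpu"] - gpu_step >= min_gpu:
        new["num_gpu"] = config["num_gpu"] - gpu_step
        return new
    if config["num_ctx"] - ctx_step >= min_ctx:
        new["num_ctx"] = config["num_ctx"] - ctx_step
        return new
    return None


cfg = {"num_ctx": 131072, "num_gpu": 50, "flash_attn": True}
print(backoff_on_oom(cfg))
```

Returning `None` gives the caller an explicit signal to record the last good config and move on.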
---
## Backend-Specific Configs
### Ollama (Modelfile parameters)
```
PARAMETER num_ctx <value>
PARAMETER num_gpu <layers>
PARAMETER flash_attn true/false
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
```
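A small helper could render these lines from a config dict so each test run generates its Modelfile the same way. This is a sketch: `render_modelfile` is a hypothetical name, and it only formats text. It does not check whether Ollama accepts a given parameter name.

```python
def render_modelfile(base_model: str, params: dict) -> str:
    """Render a Modelfile with one PARAMETER line per entry in `params`."""
    lines = [f"FROM {base_model}"]
    for name, value in params.items():
        if isinstance(value, bool):
            value = str(value).lower()  # write true/false, not True/False
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


print(render_modelfile("devstral-small-2:24b",
                       {"num_ctx": 65536, "num_gpu": 99, "flash_attn": True}))
```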
### Llama.cpp (CLI flags)
```
--ctx-size <value>
--n-gpu-layers <layers>
--flash-attn on/off
--n-predict 4096
--batch-size 4096
--ubatch-size 512
--cache-type-k f16
--cache-type-v f16
--split-mode layer
--no-mmap
```
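The same flags can be assembled programmatically, so only the three tuned values change between runs. A minimal sketch with a hypothetical function name; the fixed values are copied from the list above, and the binary name and model path are deliberately left out because the draft does not give them.

```python
import shlex


def llama_cpp_args(ctx_size: int, n_gpu_layers: int, flash_attn: bool) -> list[str]:
    """Build the llama.cpp flag list; only the three arguments vary per test."""
    return [
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--flash-attn", "on" if flash_attn else "off",
        "--n-predict", "4096",
        "--batch-size", "4096",
        "--ubatch-size", "512",
        "--cache-type-k", "f16",
        "--cache-type-v", "f16",
        "--split-mode", "layer",
        "--no-mmap",
    ]


# shlex.join gives a shell-safe string for logging or for a compose command line.
print(shlex.join(llama_cpp_args(65536, 99, True)))
```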
---
## Host Test Instructions
**The cron runs inside the hermes container. Some tests require host access:**
### 1. VRAM Monitoring (HOST)
```bash
# Run on host to check VRAM usage during/after benchmark
sudo rocm-smi --showmeminfo vram
# Or via docker exec if rocm-smi available in container
docker exec --privileged ollama rocm-smi --showmeminfo vram
```
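To turn the VRAM check into a number the cron job can compare against a limit, the output could be parsed in Python. This sketch assumes `rocm-smi --showmeminfo vram --json` prints one JSON object per card with a `VRAM Total Used Memory (B)` field. Both the output shape and the key name are assumptions and should be checked against real output on the host.

```python
import json


def vram_used_gb(rocm_smi_json: str,
                 used_key: str = "VRAM Total Used Memory (B)") -> float:
    """Sum used VRAM across all cards and return it in GiB (bytes / 1024**3).

    `used_key` is an assumed field name; pass a different one if the
    installed rocm-smi version labels it differently.
    """
    data = json.loads(rocm_smi_json)
    total_bytes = sum(
        int(card[used_key])
        for card in data.values()
        if isinstance(card, dict) and used_key in card
    )
    return total_bytes / 1024**3


# Hand-written sample for illustration, not captured from a real run.
sample = json.dumps({
    "card0": {"VRAM Total Used Memory (B)": str(16 * 1024**3)},
    "card1": {"VRAM Total Used Memory (B)": str(8 * 1024**3)},
})
print(vram_used_gb(sample))  # 24.0
```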
### 2. Running Ollama Benchmarks (CONTAINER)
```bash
# Pull model
docker exec ollama ollama pull <model>
# Create custom modelfile
docker exec ollama bash -c 'cat <<EOF > /root/.ollama/test.modelfile
FROM <model>
PARAMETER num_ctx 65536
PARAMETER num_gpu 99
PARAMETER flash_attn true
EOF'
# Create model from modelfile
docker exec ollama ollama create test-model -f /root/.ollama/test.modelfile
# Run benchmark (run once to warm the model, then again for timing; --verbose prints token rates)
docker exec ollama ollama run --verbose test-model "Write a Python async context manager with exponential backoff"
# Cleanup
docker exec ollama ollama rm test-model
```
### 3. Running Llama.cpp Benchmarks (CONTAINER - needs llama.cpp container)
```bash
# Uncomment llama_cpp_devstral in compose.yml first
# Then rebuild: sudo nh os switch --flake .#lazyworkhorse
# Test via HTTP API
curl http://localhost:8300/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral-2-small-llama_cpp",
    "prompt": "Write a Python function",
    "max_tokens": 100
  }'
```
### 4. Deploying Changes (HOST via ai-worker)
```bash
# After optimization, commit results
cd /home/ai-worker/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: new best config for <model>"
git push
# If config changes needed in ollama_init_custom_models.nix:
# 1. Edit the file
# 2. nixpkgs-fmt .
# 3. Show diff to user
# 4. Wait for confirmation
# 5. sudo nh os switch --flake .#lazyworkhorse
```
### 5. Accessing Host from Hermes Container
```bash
# SSH to host as ai-worker (key should be mounted)
ssh -i /path/to/key ai-worker@host.docker.internal
# Or via docker socket if mounted
# (not recommended for security)
```
---
## Benchmark Prompts
### Coding (Track A)
```
"Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints and error handling."
```
### Knowledge (Track B)
```
"Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication. Include bandwidth considerations for each level."
```
### Measurement
- Tokens per second (generation speed)
- Time to first token (latency)
- VRAM usage (via rocm-smi)
- System RAM usage (via free -h)
- Context success (did it complete without OOM?)
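
As an alternative to reading timings off the CLI, the first two measurements can be derived from the timing fields that Ollama's HTTP `/api/generate` endpoint returns when called with `"stream": false` (`eval_count`, `eval_duration`, `load_duration`, `prompt_eval_duration`, all durations in nanoseconds). A minimal sketch: the function name is hypothetical, and time to first token is only approximated as load time plus prompt evaluation time.

```python
def generation_stats(resp: dict) -> dict:
    """Derive tokens/sec and an approximate time-to-first-token from a generate response."""
    ns = 1e9  # Ollama reports durations in nanoseconds
    tokens_per_sec = resp["eval_count"] / (resp["eval_duration"] / ns)
    approx_ttft = (resp.get("load_duration", 0)
                   + resp.get("prompt_eval_duration", 0)) / ns
    return {
        "tokens_per_sec": round(tokens_per_sec, 2),
        "approx_ttft_sec": round(approx_ttft, 3),
    }


# Hand-written sample response for illustration, not a measured result.
resp = {
    "eval_count": 250,
    "eval_duration": 20_000_000_000,
    "load_duration": 1_500_000_000,
    "prompt_eval_duration": 500_000_000,
}
print(generation_stats(resp))  # {'tokens_per_sec': 12.5, 'approx_ttft_sec': 2.0}
```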
---
## State File Structure
`/opt/data/infra/assets/ai-optimizer/state.json`
```json
{
  "track": "gpu",
  "current_model": "devstral-small-2:24b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 65536,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {
      "devstral-small-2:24b": {
        "backend": "ollama",
        "num_ctx": 131072,
        "num_gpu": 99,
        "flash_attn": true,
        "tokens_per_sec": 12.5,
        "vram_used_gb": 58.2,
        "tested_at": "2026-04-28T17:00:00Z"
      }
    },
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"]
}
```
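Because the cron job rewrites this file on every run, an interrupted write could leave it truncated. A minimal sketch of loading and atomically saving the state, with hypothetical helper names: it writes to a temporary file in the same directory and then swaps it in with `os.replace`.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read state.json into a dict."""
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict) -> None:
    """Write the state to a temp file, then atomically replace the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
        fh.write("\n")
    os.replace(tmp, path)  # atomic when source and target share a filesystem


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = Path(d) / "state.json"
        save_state(p, {"track": "gpu", "model_index": 0})
        print(load_state(p))
```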
---
## Results CSV
`/opt/data/infra/assets/ai-optimizer/results.csv`
```csv
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
2026-04-28T17:00:00Z,gpu,devstral-small-2:24b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
```
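Appending a row in this format could look like the sketch below. The column list is taken from the header above; the helper name is hypothetical. Booleans are lowercased so rows match the `true`/`false` style of the sample row.

```python
import csv
from pathlib import Path

FIELDS = ["timestamp", "track", "model", "backend", "phase", "num_ctx", "num_gpu",
          "flash_attn", "tokens_per_sec", "vram_gb", "ram_gb", "status", "is_best"]


def append_result(path: Path, row: dict) -> None:
    """Append one result row, writing the header first if the file is new or empty."""
    write_header = not path.exists() or path.stat().st_size == 0
    clean = {k: str(v).lower() if isinstance(v, bool) else v for k, v in row.items()}
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, lineterminator="\n")
        if write_header:
            writer.writeheader()
        writer.writerow(clean)  # raises ValueError on keys not in FIELDS
```

`csv.DictWriter` rejects unknown column names, which catches typos in field names before they reach the file.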
---
## Cron Job Flow
```
1. Read state.json
2. If both queues empty → STOP (all models tested)
3. Select next model from current track queue
4. Pull model if needed (docker exec ollama ollama pull)
5. Create Modelfile / llama.cpp config with current test params
6. Run benchmark (both prompts)
7. Measure: tokens/sec, VRAM (rocm-smi), RAM (free -h)
8. If successful:
   - Increase context (next step)
   - Update current_config in state
9. If OOM/error:
   - Record last good config as best_configs[track][model]
   - Move to next model in queue
10. Update state.json
11. Append to results.csv
12. Git commit + push (repo at /opt/data/infra)
13. Send Matrix notification if available, else silent
```
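Steps 8 to 10 amount to a state transition, which can be sketched as a pure function over the `state.json` structure. Several details here are assumptions not settled by the draft: finished models are removed from their queue, `model_index` is left untouched, the job switches to the other track once the current queue is empty, and each new model starts from its track's baseline config.

```python
import copy
from typing import Optional

# Baselines copied from the Track A / Track B sections.
BASELINES = {
    "gpu": {"num_ctx": 32768, "num_gpu": 99, "flash_attn": True},
    "ram": {"num_ctx": 65536, "num_gpu": 50, "flash_attn": True},
}


def advance(state: dict, success: bool, last_good: Optional[dict] = None,
            ctx_step: int = 32768) -> dict:
    """Return the next state after one benchmark run, without mutating the input."""
    new = copy.deepcopy(state)
    track, model = new["track"], new["current_model"]
    if success:
        # Step 8: same model, next context size.
        new["current_config"]["num_ctx"] += ctx_step
        return new
    # Step 9: record the last good config and move to the next model.
    if last_good is not None:
        new["best_configs"][track][model] = last_good
    new["completed_models"].append(model)
    queue = new[f"{track}_queue"]
    if model in queue:
        queue.remove(model)
    if not queue:  # assumed: fall through to the other track
        track = "ram" if track == "gpu" else "gpu"
        queue = new[f"{track}_queue"]
    new["track"] = track
    new["current_model"] = queue[0] if queue else None  # None means all models tested
    new["current_config"] = dict(BASELINES[track])
    return new
```

Keeping the transition free of I/O makes it easy to test separately from the Docker and git steps.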
---
## Matrix Notification (Optional)
```python
import os

# If Matrix credentials are available in the environment
if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    # Send completion notification
    # Room: !ai-optimizer:lazyworkhorse.net (or similar)
    pass
# Else: silent, just commit
```
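The placeholder could be filled in with the Matrix client-server API's send-message endpoint, using only the standard library. This is a sketch under assumptions: `MATRIX_HOME_SERVER` is taken to be a base URL such as `https://matrix.example.org`, and `MATRIX_ROOM_ID` is a hypothetical environment variable that the draft does not define.

```python
import json
import os
import time
import urllib.parse
import urllib.request


def build_matrix_request(home_server: str, token: str, room_id: str,
                         text: str, txn_id: str) -> urllib.request.Request:
    """Build the PUT request that sends a plain-text message to a room."""
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        home_server.rstrip("/"),
        urllib.parse.quote(room_id, safe=""),  # room IDs contain '!' and ':'
        urllib.parse.quote(txn_id, safe=""),
    )
    body = json.dumps({"msgtype": "m.text", "body": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )


def notify(text: str) -> bool:
    """Send a notification if credentials are set; otherwise stay silent."""
    hs = os.getenv("MATRIX_HOME_SERVER")
    token = os.getenv("MATRIX_ACCESS_TOKEN")
    room = os.getenv("MATRIX_ROOM_ID")  # hypothetical variable name
    if not (hs and token and room):
        return False
    req = build_matrix_request(hs, token, room, text,
                               f"ai-optimizer-{time.time_ns()}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status == 200
```

Splitting request construction from sending keeps the URL and payload testable without a homeserver.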
---
## Files to Create
```
/opt/data/infra/assets/ai-optimizer/
├── state.json # Current progress
├── results.csv # All test results
├── best_configs.json # Final best configs (human-readable)
└── CRON_JOB_DRAFT.md # This file
```
---
## Notes
- **No num_parallel**: Removed to avoid limiting other settings
- **Two tracks**: GPU (coding/speed) vs RAM (knowledge/context)
- **Both backends**: Test ollama first, then llama.cpp if available
- **Host tests**: rocm-smi must run on host or privileged container
- **Deploy**: ai-worker has sudo for nh/nixos-rebuild, must ask user first