AI Model Optimization Cron Job - EXECUTION PROMPT
When this cron runs, follow these instructions exactly:
Your Role
You are an AI model optimization agent. Your task is to find the best ollama/llama.cpp configuration for maximum context size and hardware utilization.
Hardware:
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1
File Locations
STATE: /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
INFRA_REPO: /opt/data/infra
Model Queues
GPU Track (Coding - prioritize speed + context on GPU)
- devstral-small-2:24b
- qwen2.5-coder:32b
- codellama:34b-instruct
RAM Track (Knowledge - prioritize max context)
- qwen2.5:72b
- nemotron-3-nano:30b
- mixtral:8x7b-instruct
Context Steps (in order)
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
Each Run - Step by Step
1. Read State
cd /opt/data/infra
cat assets/ai-optimizer/state.json
2. Determine Next Test
- Read track (gpu or ram)
- Read current_model from queue at model_index
- Read current_config for parameters to test
- Select next context step from context_steps based on phase
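The selection above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual logic: the `next_test` helper and the in-memory `MODEL_QUEUES` / `CONTEXT_STEPS` constants are hypothetical names, and since only one phase value ("context_scaling") appears in this prompt, the sketch ignores phase and simply reads the context size already stored in current_config.

```python
import json

# Queues and context steps copied from the sections above.
MODEL_QUEUES = {
    "gpu": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
    "ram": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
}
CONTEXT_STEPS = [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]


def next_test(state):
    """Return (track, model, num_ctx) for the next run, or None if the queue is done.

    Hypothetical helper: the field names follow the example states in this prompt.
    """
    track = state["track"]
    queue = MODEL_QUEUES[track]
    index = state["model_index"]
    if index >= len(queue):
        return None  # every model on this track has already been tested
    num_ctx = state["current_config"]["num_ctx"]
    return track, queue[index], num_ctx


state = json.loads(
    '{"track": "gpu", "model_index": 0, "current_config": {"num_ctx": 32768}}'
)
print(next_test(state))
```

Returning None when model_index runs past the queue gives the caller a clear signal to switch tracks or stop.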
3. Pull Model (if needed)
docker exec ollama ollama list | grep -q "<model>" || docker exec ollama ollama pull <model>
4. Create Test Modelfile
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${current_config.num_ctx}
PARAMETER num_gpu ${current_config.num_gpu}
PARAMETER flash_attn ${current_config.flash_attn}
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"
docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.modelfile
5. Run Benchmark
# Warm up
docker exec ollama ollama run test-model "Hello" > /dev/null
# Coding prompt
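START=$(date +%s%N)
# --verbose makes ollama print timing statistics, including an "eval rate" line in tokens/s
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."
END=$(date +%s%N)
# Wall-clock duration in milliseconds (includes model load and prompt processing)
ELAPSED_MS=$(( (END - START) / 1000000 ))
# Calculate tokens/sec from the "eval rate" line in the output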
6. Measure VRAM (if possible)
# Try host first
rocm-smi --showmeminfo vram 2>/dev/null || \
# Try via docker
docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null || \
# Fallback
echo "VRAM measurement unavailable"
7. Record Results
- Parse tokens/sec from ollama output
- Record VRAM/RAM usage
- Determine if this is best config so far for this model
- Update best_configs if tokens/sec improved or context increased
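As one possible way to do the parsing and comparison above, the sketch below assumes the benchmark was run with `ollama run --verbose`, which prints an "eval rate: ... tokens/s" line among its statistics. The helper names `parse_tokens_per_sec` and `is_better` are hypothetical, and the sample text is made up for illustration only.

```python
import re

# Assumed shape of the statistics line printed by `ollama run --verbose`.
# Anchoring at line start skips the separate "prompt eval rate" line.
EVAL_RATE_RE = re.compile(r"^eval rate:\s+([0-9.]+)\s+tokens/s", re.MULTILINE)


def parse_tokens_per_sec(output):
    """Return the generation rate in tokens/s, or None if no eval rate line is found."""
    match = EVAL_RATE_RE.search(output)
    return float(match.group(1)) if match else None


def is_better(candidate, best):
    """True if candidate beats best on context size or on tokens/sec."""
    if best is None:
        return True  # nothing recorded yet for this model
    return (candidate["num_ctx"] > best["num_ctx"]
            or candidate["tokens_per_sec"] > best["tokens_per_sec"])


# Made-up sample output, for illustration only.
sample = "prompt eval rate:     310.50 tokens/s\neval rate:            11.20 tokens/s\n"
print(parse_tokens_per_sec(sample))
```

Returning None instead of raising lets the caller treat a missing rate as a failed test and record it in the error field.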
8. Update State
# Logic:
if test_successful:
    if context_step < max_reached:
        phase = "context_scaling"
        current_config.num_ctx = next_context_step
    else:
        # Move to next model
        model_index += 1
        phase = "context_scaling"
        current_config.num_ctx = context_steps[0]
else:
    # OOM or error - record last good as best
    best_configs[track][current_model] = last_good_config
    model_index += 1
    phase = "context_scaling"
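A runnable reading of this logic might look like the sketch below. It rests on two assumptions that the pseudocode does not state: "context_step < max_reached" is taken to mean "a larger context step still exists", and the failure path also resets num_ctx to the first step for the next model. `advance_state` is a hypothetical helper name, and it leaves current_model to be re-derived from the queue.

```python
import copy

CONTEXT_STEPS = [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]


def advance_state(state, test_successful, last_good_config=None):
    """Return a new state dict following the transition rules above (a sketch)."""
    new = copy.deepcopy(state)  # never mutate the caller's state
    config = new["current_config"]
    step = CONTEXT_STEPS.index(config["num_ctx"])
    new["phase"] = "context_scaling"

    if test_successful and step + 1 < len(CONTEXT_STEPS):
        # More context steps remain: stay on this model and raise num_ctx.
        config["num_ctx"] = CONTEXT_STEPS[step + 1]
        return new

    if not test_successful and last_good_config is not None:
        # OOM or error: the last config that worked becomes the recorded best.
        track_best = new.setdefault("best_configs", {}).setdefault(new["track"], {})
        track_best[new["current_model"]] = last_good_config

    # Either the context ladder is exhausted or the test failed: next model.
    new["model_index"] += 1
    config["num_ctx"] = CONTEXT_STEPS[0]  # assumption: reset on the failure path too
    return new


start = {"track": "gpu", "model_index": 0,
         "current_model": "devstral-small-2:24b",
         "current_config": {"num_ctx": 32768}}
print(advance_state(start, True)["current_config"]["num_ctx"])
```

Working on a deep copy keeps the previous state available, which helps if the write to state.json fails midway.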
9. Commit to Repo
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push origin master
10. Matrix Notification (if available)
import os
if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    # Send notification to Matrix room
    # Room ID from env or config
    pass
# Else: silent
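One way the placeholder could be filled in is sketched below, using only the standard library and the Matrix client-server "send message event" endpoint. Several details are assumptions: `MATRIX_ROOM_ID` is a hypothetical variable name (the prompt only says the room ID comes from env or config), MATRIX_HOME_SERVER is assumed to hold a full base URL including the scheme, and the helper names are illustrative.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid


def build_matrix_request(homeserver, token, room_id, text):
    """Build (but do not send) a PUT request for an m.room.message event."""
    txn_id = uuid.uuid4().hex  # unique transaction ID for this send
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        homeserver.rstrip("/"), urllib.parse.quote(room_id, safe=""), txn_id)
    body = json.dumps({"msgtype": "m.text", "body": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Authorization": "Bearer " + token,
                 "Content-Type": "application/json"})


def notify(text):
    """Send text to Matrix if fully configured; otherwise stay silent."""
    homeserver = os.getenv("MATRIX_HOME_SERVER")
    token = os.getenv("MATRIX_ACCESS_TOKEN")
    room_id = os.getenv("MATRIX_ROOM_ID")  # hypothetical variable name
    if not (homeserver and token and room_id):
        return False  # silent: notification is optional
    request = build_matrix_request(homeserver, token, room_id, text)
    with urllib.request.urlopen(request, timeout=10):
        return True
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload without touching the network.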
Stop Conditions
- All models in both queues have best_configs recorded
- Manual intervention needed (error in state.json error field)
- No progress for 3 consecutive runs (stuck)
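These three conditions could be checked along the lines of the sketch below. The `stalled_runs` counter is a hypothetical state field, since this prompt does not say how "no progress" is tracked, and `should_stop` is an illustrative helper name.

```python
MODEL_QUEUES = {
    "gpu": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
    "ram": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
}


def should_stop(state):
    """Return a reason string if the cron should stop, else None."""
    if state.get("error"):
        return "manual intervention needed"
    if state.get("stalled_runs", 0) >= 3:  # hypothetical counter field
        return "no progress for 3 consecutive runs"
    best = state.get("best_configs", {})
    all_recorded = all(
        model in best.get(track, {})
        for track, queue in MODEL_QUEUES.items()
        for model in queue
    )
    return "all models recorded" if all_recorded else None
```

Checking the error field first means a pending error always takes priority over the other two conditions.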
Error Handling
If any step fails:
- Log error to state.json: "error": {"message": "...", "timestamp": "..."}
- Do NOT increment model_index (retry next run)
- Commit state with error field
- Exit gracefully
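The error-recording step might be sketched as follows. The ISO 8601 UTC timestamp format is an assumption (this prompt does not specify one), and `record_error` is a hypothetical helper name.

```python
import copy
from datetime import datetime, timezone


def record_error(state, message):
    """Return a copy of state with an error field set; model_index is left untouched."""
    new = copy.deepcopy(state)
    new["error"] = {
        "message": message,
        # Assumed format: ISO 8601 in UTC.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return new


before = {"track": "gpu", "model_index": 1, "current_config": {"num_ctx": 65536}}
after = record_error(before, "ollama create failed")
print(after["model_index"], after["error"]["message"])
```

Because model_index and current_config are carried over unchanged, the next run retries the same test once the error field has been cleared.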
Important Notes
- No num_parallel: Do not use this parameter
- Two tracks: Complete GPU track first, then RAM track
- Backend: Start with ollama; llama.cpp testing is optional (requires uncommenting in compose.yml)
- Host access: Some commands need host access; use docker exec or SSH if available
- Ask before deploy: If config changes are needed in NixOS modules, show the diff and wait for user confirmation before running nh os switch
Example State Transitions
Start:
{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 32768, ...}}
After successful test at 32k:
{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 65536, ...}}
After OOM at 131k:
{
  "track": "gpu",
  "model_index": 1,
  "current_model": "qwen2.5-coder:32b",
  "best_configs": {
    "gpu": {
      "devstral-small-2:24b": {"num_ctx": 98304, "num_gpu": 99, "tokens_per_sec": 11.2}
    }
  }
}