# AI Model Optimization Cron Job - EXECUTION PROMPT

**When this cron runs, follow these instructions exactly:**

---

## Your Role

You are an AI model optimization agent. Your task is to find the ollama/llama.cpp configuration that gives the largest context size and the best hardware utilization.

**Hardware:**
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1

---

## File Locations

```
STATE: /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
INFRA_REPO: /opt/data/infra
```

---

## Model Queues

### GPU Track (Coding - prioritize speed + context on GPU)
1. `devstral-small-2:24b`
2. `qwen2.5-coder:32b`
3. `codellama:34b-instruct`

### RAM Track (Knowledge - prioritize max context)
1. `qwen2.5:72b`
2. `nemotron-3-nano:30b`
3. `mixtral:8x7b-instruct`

---

## Context Steps (in order)

```
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
```

---

## Each Run - Step by Step

### 1. Read State

```bash
cd /opt/data/infra
cat assets/ai-optimizer/state.json
```

### 2. Determine Next Test

- Read `track` (gpu or ram)
- Read `current_model` from the track's queue at `model_index`
- Read `current_config` for the parameters to test
- Select the next context step from `context_steps` based on `phase`

### 3. Pull Model (if needed)

```bash
# ${model} is the current_model value read from state.json
docker exec ollama ollama list | grep -qF "${model}" || docker exec ollama ollama pull "${model}"
```

### 4. Create Test Modelfile

```bash
# ${num_ctx}, ${num_gpu} and ${flash_attn} are taken from current_config in state.json
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER flash_attn ${flash_attn}
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"
docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.modelfile
```

### 5. Run Benchmark

```bash
# Warm up
docker exec ollama ollama run test-model "Hello" > /dev/null

# Coding prompt
START=$(date +%s%N)
docker exec ollama ollama run test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."
END=$(date +%s%N)

# Calculate tokens/sec from output
```

### 6. Measure VRAM (if possible)

```bash
# Try the host first, then the container, then fall back to a message
rocm-smi --showmeminfo vram 2>/dev/null ||
  docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null ||
  echo "VRAM measurement unavailable"
```

### 7. Record Results

- Parse tokens/sec from the ollama output
- Record VRAM/RAM usage
- Determine whether this is the best config so far for this model
- Update `best_configs` if tokens/sec improved or context increased

### 8. Update State

```python
# Logic (names mirror the fields in state.json):
if test_successful:
    if context_step < max_reached:
        # More context steps remain - try the next one
        phase = "context_scaling"
        current_config["num_ctx"] = next_context_step
    else:
        # Largest step passed - record it as best and move to next model
        best_configs[track][current_model] = current_config
        model_index += 1
        phase = "context_scaling"
        current_config["num_ctx"] = context_steps[0]
else:
    # OOM or error - record last good as best
    best_configs[track][current_model] = last_good_config
    model_index += 1
    phase = "context_scaling"
    current_config["num_ctx"] = context_steps[0]
```

### 9. Commit to Repo

```bash
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push origin master
```

### 10. Matrix Notification (if available)

```python
import os

if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    # Send notification to Matrix room
    # Room ID from env or config
    pass
# Else: silent
```

---

## Stop Conditions

1. All models in both queues have `best_configs` recorded
2. Manual intervention is needed (the `error` field in state.json is set)
3. No progress for 3 consecutive runs (stuck)

---

## Error Handling

If any step fails:
1. Log the error to state.json: `"error": {"message": "...", "timestamp": "..."}`
2. Do NOT increment `model_index` (retry on the next run)
3. Commit the state with the error field
4. Exit gracefully

---

## Important Notes

- **No num_parallel**: Do not use this parameter
- **Two tracks**: Complete the GPU track first, then the RAM track
- **Backend**: Start with ollama; llama.cpp testing is optional (requires uncommenting in compose.yml)
- **Host access**: Some commands need host access - use docker exec or SSH if available
- **Ask before deploy**: If config changes are needed in NixOS modules, show the diff and wait for user confirmation before `nh os switch`

---

## Example State Transitions

**Start:**
```json
{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 32768, ...}}
```

**After successful test at 32k:**
```json
{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 65536, ...}}
```

**After OOM at 131k:**
```json
{
  "track": "gpu",
  "model_index": 1,
  "current_model": "qwen2.5-coder:32b",
  "best_configs": {
    "gpu": {
      "devstral-small-2:24b": {"num_ctx": 98304, "num_gpu": 99, "tokens_per_sec": 11.2}
    }
  }
}
```
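
The update logic in step 8 and the transitions above can be sketched as a pure function over the state dict. This is a minimal sketch, not a definitive implementation: the function name `advance_state`, the helper fields `last_good_config` and `done`, and the rule for switching from the GPU track to the RAM track once a queue is exhausted are assumptions made for illustration. The queues and context steps are copied from this document.

```python
import copy

# Queues and context steps copied from the sections above.
MODEL_QUEUES = {
    "gpu": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
    "ram": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
}
CONTEXT_STEPS = [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
TRACK_ORDER = ["gpu", "ram"]  # GPU track first, then RAM track


def advance_state(state, test_successful, tokens_per_sec=None):
    """Return the state after one test; the input dict is left unmodified."""
    new = copy.deepcopy(state)
    track, model = new["track"], new["current_model"]
    config = new["current_config"]
    step = CONTEXT_STEPS.index(config["num_ctx"])

    if test_successful:
        # Hypothetical helper field: snapshot of the most recent passing config.
        new["last_good_config"] = dict(config, tokens_per_sec=tokens_per_sec)
        if step + 1 < len(CONTEXT_STEPS):
            # More context steps remain: scale up and keep the same model.
            config["num_ctx"] = CONTEXT_STEPS[step + 1]
            new["phase"] = "context_scaling"
            return new

    # The test failed (OOM or error) or the largest step passed:
    # record the last good config as best (None if nothing ever passed).
    best = new.setdefault("best_configs", {}).setdefault(track, {})
    best[model] = new.pop("last_good_config", None)

    # Move to the next model, switching tracks when the queue is exhausted.
    new["model_index"] += 1
    if new["model_index"] >= len(MODEL_QUEUES[track]):
        remaining = TRACK_ORDER[TRACK_ORDER.index(track) + 1:]
        if not remaining:
            new["done"] = True  # hypothetical flag for stop condition 1
            return new
        new["track"], new["model_index"] = remaining[0], 0
    new["current_model"] = MODEL_QUEUES[new["track"]][new["model_index"]]
    config["num_ctx"] = CONTEXT_STEPS[0]
    new["phase"] = "context_scaling"
    return new
```

Calling `advance_state(state, True, 11.2)` after a passing run, or `advance_state(state, False)` after an OOM, yields the dict to write back to `state.json`. Returning a copy rather than mutating in place keeps the previous state available if the commit in step 9 fails and the run has to be retried.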
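
For the tokens/sec figure in steps 5 and 7, one possible approach is sketched below. It assumes the benchmark is run as `ollama run --verbose`, which this document does not specify, and that the timing summary contains a line of the form `eval rate: 11.20 tokens/s`. The function names are hypothetical. The second function is a fallback that uses the `START` and `END` nanosecond timestamps from step 5 together with a token count obtained elsewhere.

```python
import re

# Matches the generation rate line only; "prompt eval rate:" does not start
# the line with "eval rate:" and is therefore skipped.
EVAL_RATE_RE = re.compile(r"^eval rate:\s+([0-9.]+)\s+tokens/s", re.MULTILINE)


def parse_tokens_per_sec(stats_text):
    """Extract the generation rate from verbose timing output, or None."""
    match = EVAL_RATE_RE.search(stats_text)
    return float(match.group(1)) if match else None


def tokens_per_sec_from_timing(token_count, start_ns, end_ns):
    """Fallback: wall-clock rate from the START/END timestamps in step 5."""
    elapsed_s = (end_ns - start_ns) / 1e9
    if elapsed_s <= 0:
        return None
    return token_count / elapsed_s
```

The wall-clock fallback includes model load and prompt processing time, so it will typically report a lower rate than the parsed value and the two should not be mixed within one results file.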