# AI Model Optimization Cron Job

**Goal:** Find optimal configurations for maximum context size with full hardware utilization.

**Hardware:**
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1

---

## Model Queue

### GPU-Optimized (Coding - prioritize speed + context on GPU)
1. `devstral-small-2:24b` - Best coding model
2. `qwen2.5-coder:32b` - Strong coder, fits on GPU+offload
3. `codellama:34b-instruct` - Legacy but solid

### RAM-Optimized (Knowledge - prioritize max context, accept slower)
1. `qwen2.5:72b` - Best knowledge, needs heavy offload
2. `nemotron-3-nano:30b` - Good general knowledge
3. `mixtral:8x7b-instruct` - MoE, efficient for knowledge

---

## Optimization Strategy

**Two separate tracks:**

### Track A: GPU-Focused (Coding)
```
Baseline: num_ctx=32768, num_gpu=99, flash_attn=true

Steps:
1. Increase context: 32k → 65k → 98k → 131k → 163k
2. At each step, verify VRAM usage < 60GB (leave headroom)
3. If OOM: reduce num_gpu until stable, record best
4. Measure tokens/sec - if < 5 tok/s, consider context too high
```

### Track B: RAM-Focused (Knowledge)
```
Baseline: num_ctx=65536, num_gpu=50, flash_attn=true

Steps:
1. Increase context: 65k → 131k → 200k → 262k → 327k
2. Allow heavy RAM offload (system RAM up to 100GB)
3. If OOM: reduce context or num_gpu
4. Speed less critical - focus on max stable context
```

---

## Backend-Specific Configs

### Ollama (Modelfile parameters)
```
PARAMETER num_ctx <value>
PARAMETER num_gpu <value>
PARAMETER flash_attn true/false
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
```

### Llama.cpp (CLI flags)
```
--ctx-size <n>
--n-gpu-layers <n>
--flash-attn on/off
--n-predict 4096
--batch-size 4096
--ubatch-size 512
--cache-type-k f16
--cache-type-v f16
--split-mode layer
--no-mmap
```

---

## Host Test Instructions

**The cron runs inside the hermes container. Some tests require host access:**

### 1. VRAM Monitoring (HOST)
```bash
# Run on host to check VRAM usage during/after benchmark
sudo rocm-smi --showmeminfo vram

# Or via docker exec if rocm-smi available in container
docker exec --privileged ollama rocm-smi --showmeminfo vram
```

### 2. Running Ollama Benchmarks (CONTAINER)
```bash
# Pull model
docker exec ollama ollama pull <model>

# Create custom modelfile (heredoc is expanded inside the container shell)
docker exec ollama bash -c 'cat <<EOF > /root/.ollama/test.modelfile
FROM <model>
PARAMETER num_ctx 65536
PARAMETER num_gpu 99
PARAMETER flash_attn true
EOF'

# Create model from modelfile
docker exec ollama ollama create test-model -f /root/.ollama/test.modelfile

# Run benchmark (warm model first)
docker exec ollama ollama run test-model "Write a Python async context manager with exponential backoff"

# Cleanup
docker exec ollama ollama rm test-model
```

### 3. Running Llama.cpp Benchmarks (CONTAINER - needs llama.cpp container)
```bash
# Uncomment llama_cpp_devstral in compose.yml first
# Then rebuild: sudo nh os switch --flake .#lazyworkhorse

# Test via HTTP API
curl http://localhost:8300/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral-2-small-llama_cpp",
    "prompt": "Write a Python function",
    "max_tokens": 100
  }'
```

### 4. Deploying Changes (HOST via ai-worker)
```bash
# After optimization, commit results
cd /home/ai-worker/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: new best config for <model>"
git push

# If config changes needed in ollama_init_custom_models.nix:
# 1. Edit the file
# 2. nixpkgs-fmt .
# 3. Show diff to user
# 4. Wait for confirmation
# 5. sudo nh os switch --flake .#lazyworkhorse
```

### 5. Accessing Host from Hermes Container
```bash
# SSH to host as ai-worker (key should be mounted)
ssh -i /path/to/key ai-worker@host.docker.internal

# Or via docker socket if mounted
# (not recommended for security)
```

---

## Benchmark Prompts

### Coding (Track A)
```
"Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints and error handling."
```

### Knowledge (Track B)
```
"Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication. Include bandwidth considerations for each level."
```

### Measurement
- Tokens per second (generation speed)
- Time to first token (latency)
- VRAM usage (via rocm-smi)
- System RAM usage (via free -h)
- Context success (did it complete without OOM?)

---

## State File Structure

`/opt/data/infra/assets/ai-optimizer/state.json`

```json
{
  "track": "gpu",
  "current_model": "devstral-small-2:24b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 65536,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {
      "devstral-small-2:24b": {
        "backend": "ollama",
        "num_ctx": 131072,
        "num_gpu": 99,
        "flash_attn": true,
        "tokens_per_sec": 12.5,
        "vram_used_gb": 58.2,
        "tested_at": "2026-04-28T17:00:00Z"
      }
    },
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"]
}
```

---

## Results CSV

`/opt/data/infra/assets/ai-optimizer/results.csv`

```csv
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
2026-04-28T17:00:00Z,gpu,devstral-small-2:24b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
```

---

## Cron Job Flow

```
1. Read state.json
2. If both queues empty → STOP (all models tested)
3. Select next model from current track queue
4. Pull model if needed (docker exec ollama ollama pull)
5. Create Modelfile / llama.cpp config with current test params
6. Run benchmark (both prompts)
7. Measure: tokens/sec, VRAM (rocm-smi), RAM (free -h)
8. If successful:
   - Increase context (next step)
   - Update current_config in state
9. If OOM/error:
   - Record last good config as best_configs[track][model]
   - Move to next model in queue
10. Update state.json
11. Append to results.csv
12. Git commit + push to /opt/data/infra
13. Send Matrix notification if available, else silent
```

---

## Matrix Notification (Optional)

```python
import os

# If matrix credentials available in environment
if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    # Send completion notification
    # Room: !ai-optimizer:lazyworkhorse.net (or similar)
    pass
# Else: silent, just commit
```

---

## Files to Create

```
/opt/data/infra/assets/ai-optimizer/
├── state.json          # Current progress
├── results.csv         # All test results
├── best_configs.json   # Final best configs (human-readable)
└── CRON_JOB_DRAFT.md   # This file
```

---

## Notes

- **No num_parallel**: Removed to avoid limiting other settings
- **Two tracks**: GPU (coding/speed) vs RAM (knowledge/context)
- **Both backends**: Test ollama first, then llama.cpp if available
- **Host tests**: rocm-smi must run on host or privileged container
- **Deploy**: ai-worker has sudo for nh/nixos-rebuild, must ask user first
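
---

## State Transition Sketch

Steps 8 and 9 of the cron job flow (advance the context on success, move to the next model on OOM/error) could be sketched as a pure function over the `state.json` structure. This is a minimal illustration, not the final implementation. The function name `advance_state` and the `result` dict shape (`ok`, `tokens_per_sec`, `vram_used_gb`) are hypothetical. It also assumes that finished models are removed from their queue, that the "200k" step means 196608 tokens, and that the job switches to the other track once the current queue is empty.

```python
import copy

# Baselines and context ladders copied from Track A ("gpu") and Track B ("ram").
# Assumption: the "200k" step is read as 196608 (192 * 1024).
BASELINE = {
    "gpu": {"num_ctx": 32768, "num_gpu": 99, "flash_attn": True},
    "ram": {"num_ctx": 65536, "num_gpu": 50, "flash_attn": True},
}
CTX_STEPS = {
    "gpu": [32768, 65536, 98304, 131072, 163840],
    "ram": [65536, 131072, 196608, 262144, 327680],
}


def advance_state(state, result):
    """Return a new state after one benchmark run; the input is not mutated.

    `result` is a hypothetical dict: {"ok": bool, "tokens_per_sec": float,
    "vram_used_gb": float}. The metric keys are only read when ok is True.
    """
    new = copy.deepcopy(state)
    track, model = new["track"], new["current_model"]
    cfg = new["current_config"]

    if result["ok"]:
        # Record every successful run, so that after a later OOM
        # best_configs[track][model] already holds the last good config.
        new["best_configs"][track][model] = {
            "backend": new["backend"],
            **cfg,
            "tokens_per_sec": result["tokens_per_sec"],
            "vram_used_gb": result["vram_used_gb"],
        }
        larger = [s for s in CTX_STEPS[track] if s > cfg["num_ctx"]]
        if larger:
            cfg["num_ctx"] = larger[0]  # step 8: next context size
            return new

    # Step 9 (OOM/error), or the context ladder is exhausted: finish this model.
    queue = new[f"{track}_queue"]
    if model in queue:
        queue.remove(model)
    new["completed_models"].append(model)
    if not queue:
        # Assumption: fall through to the other track when this one is done.
        other = "ram" if track == "gpu" else "gpu"
        if new[f"{other}_queue"]:
            track = new["track"] = other
            queue = new[f"{track}_queue"]
    new["current_model"] = queue[0] if queue else None  # None → STOP (step 2)
    new["current_config"] = dict(BASELINE[track])
    return new
```

Keeping the transition free of I/O means the cron wrapper only has to load `state.json`, call the function with the measured result, and write the returned dict back; the logic itself can be unit-tested without a GPU.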