diff --git a/assets/ai-optimizer/README.md b/assets/ai-optimizer/README.md
index b7b2461..a0bf589 100644
--- a/assets/ai-optimizer/README.md
+++ b/assets/ai-optimizer/README.md
@@ -1,13 +1,13 @@
-# AI Model Optimization Cron Job
+# AI Model Optimizer - Manual Skill
 
-**Purpose:** Automatically find optimal ollama/llama.cpp configurations for maximum context size and hardware utilization.
+**Purpose:** Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.
 
-**Schedule:** Every hour
+**Usage:** Run manually via Hermes skill when needed (not automated).
 
 **Hardware:**
 - 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
 - 128GB system RAM
-- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1
+- ROCm: `HSA_OVERRIDE_GFX_VERSION=9.0.6`, `HIP_VISIBLE_DEVICES=0,1`
 
 ---
 
@@ -21,21 +21,33 @@ REPO: /opt/data/infra (persistent - do not reclone)
 
 ---
 
+## Quick Start
+
+```bash
+# From Hermes container or any machine with ollama access
+ollama-test-model --model devstral-small-2:24b --ctx 65536
+```
+
+Or use the full workflow skill for systematic testing.
+
+---
+
 ## Model Queues
 
 ### GPU Track (Coding - prioritize speed + context on GPU)
-1. `devstral-small-2:24b`
-2. `qwen2.5-coder:32b`
-3. `codellama:34b-instruct`
+1. `deepseek-coder-v2:16b` - Best coding model, fits on GPU
+2. `qwen2.5-coder:32b` - Alternative coding model
+3. `codellama:34b-instruct` - Legacy option
 
 ### RAM Track (Knowledge - prioritize max context)
-1. `qwen2.5:72b`
-2. `nemotron-3-nano:30b`
-3. `mixtral:8x7b-instruct`
+1. `qwen2.5:72b` - Large knowledge model
+2. `nemotron-3-nano:30b` - Efficient large model
+3. `mixtral:8x7b-instruct` - MoE architecture
 
 ---
 
 ## Context Steps (in order)
+
 ```
 [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
 ```
@@ -45,45 +57,49 @@ REPO: /opt/data/infra (persistent - do not reclone)
 ## Optimization Strategy
 
 ### GPU Track (Coding)
-- Start: num_ctx=32768, num_gpu=99, flash_attn=true
+- Start: `num_ctx=32768`, `num_gpu=99`, `flash_attn=true`
 - Increase context until OOM or tokens/sec < 5
 - Record best config before hitting wall
 - Target: >10 tokens/sec with max context
 
 ### RAM Track (Knowledge)
-- Start: num_ctx=65536, num_gpu=50, flash_attn=true
+- Start: `num_ctx=65536`, `num_gpu=50`, `flash_attn=true`
 - Allow heavy RAM offload (up to 100GB system RAM)
 - Increase context until OOM
 - Speed secondary to context size
 
 ---
 
-## Each Run - Step by Step
+## Manual Testing Workflow
+
+### 1. Quick Model Test
+
+```bash
+# Test a model at specific context size
+docker exec ollama ollama run : "Your prompt here"
+```
+
+### 2. Check Current State
 
-### 1. Read State
 ```bash
 cd /opt/data/infra
 cat assets/ai-optimizer/state.json
 ```
 
-### 2. Determine Next Test
-- Read `track` (gpu or ram)
-- Get `current_model` from queue at `model_index`
-- Get `current_config` for parameters to test
-- Select next context step from `context_steps`
-
 ### 3. Pull Model (if needed)
+
 ```bash
-docker exec ollama ollama list | grep -q "" || docker exec ollama ollama pull
+docker exec ollama ollama pull :
 ```
 
 ### 4. Create Test Modelfile
+
 ```bash
 docker exec ollama bash -c "cat < /root/.ollama/test_${model}.modelfile
 FROM ${model}
-PARAMETER num_ctx ${current_config.num_ctx}
-PARAMETER num_gpu ${current_config.num_gpu}
-PARAMETER flash_attn ${current_config.flash_attn}
+PARAMETER num_ctx ${num_ctx}
+PARAMETER num_gpu ${num_gpu}
+PARAMETER flash_attn true
 PARAMETER num_predict 4096
 PARAMETER num_keep 1024
 PARAMETER repeat_penalty 1.1
@@ -93,6 +109,7 @@ docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.model
 ```
 
 ### 5. Run Benchmark
+
 ```bash
 # Warm up
 docker exec ollama ollama run test-model "Hello" > /dev/null
@@ -104,7 +121,8 @@ docker exec ollama ollama run test-model "Write a Python async context manager t
 docker exec ollama ollama run test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
 ```
 
-### 6. Measure VRAM (if possible)
+### 6. Measure VRAM
+
 ```bash
 # Try host first
 rocm-smi --showmeminfo vram 2>/dev/null || \
@@ -114,24 +132,14 @@ echo "VRAM unavailable"
 ```
 
 ### 7. Record Results
-- Parse tokens/sec from ollama output
-- Record VRAM/RAM usage
-- Update `best_configs` if improved
 
-### 8. Update State
-```python
-if test_successful:
-    if context_step < max_reached:
-        current_config.num_ctx = next_context_step
-    else:
-        model_index += 1
-        current_config.num_ctx = context_steps[0]
-else:
-    best_configs[track][current_model] = last_good_config
-    model_index += 1
-```
+Update `state.json` and append to `results.csv`:
+- tokens/sec from ollama output
+- VRAM/RAM usage
+- Whether this config is the new best
+
+### 8. Commit Changes
 
-### 9. Commit to Repo
 ```bash
 cd /opt/data/infra
 git add assets/ai-optimizer/
@@ -139,15 +147,6 @@ git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
 git push
 ```
 
-### 10. Matrix Notification (if available)
-```python
-import os
-if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
-    # Send notification
-    pass
-# Else: silent
-```
-
 ---
 
 ## State File Structure
@@ -155,7 +154,7 @@ if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
 ```json
 {
   "track": "gpu",
-  "current_model": "devstral-small-2:24b",
+  "current_model": "deepseek-coder-v2:16b",
   "model_index": 0,
   "phase": "context_scaling",
   "backend": "ollama",
@@ -169,10 +168,10 @@ if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
     "ram": {}
   },
   "completed_models": [],
-  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
+  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
   "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
   "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
-  "last_updated": "2026-04-28T17:00:00Z"
+  "last_updated": "2026-04-30T00:00:00Z"
 }
 ```
 
@@ -182,33 +181,29 @@ if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
 
 ```csv
 timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
-2026-04-28T17:00:00Z,gpu,devstral-small-2:24b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
+2026-04-30T00:00:00Z,gpu,deepseek-coder-v2:16b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
 ```
 
 ---
 
-## Stop Conditions
+## Skill Usage
 
-1. All models in both queues have `best_configs` recorded
-2. Manual intervention needed (error in state.json `error` field)
-3. No progress for 3 consecutive runs
+Once PR #1 (ai-worker-restricted-access) is merged:
 
----
+```bash
+# From Hermes container, SSH to host for direct ollama access
+ssh -i /path/to/key ai-worker@host docker exec ollama ollama run
 
-## Error Handling
-
-If any step fails:
-1. Log error: `"error": {"message": "...", "timestamp": "..."}`
-2. Do NOT increment model_index (retry next run)
-3. Commit state with error field
-4. Exit gracefully
+# Or run the skill directly
+ollama-benchmark --model deepseek-coder-v2:16b --track gpu
+```
 
 ---
 
 ## Notes
 
-- **No num_parallel**: Removed to avoid limiting other settings
-- **Two tracks**: Complete GPU track first, then RAM track
-- **Backend**: Start with ollama, llama.cpp optional
+- **Manual execution only** - No cron job, run when needed
+- **Two tracks**: Complete GPU track first (coding models), then RAM track
+- **Backend**: ollama (llama.cpp optional for advanced users)
 - **Host access**: Use docker exec or SSH for rocm-smi
-- **Ask before deploy**: Show diff before `nh os switch`
+- **Commit results**: Push best configs to repo for reference
diff --git a/assets/ai-optimizer/state.json b/assets/ai-optimizer/state.json
index fff69f9..41b6d04 100644
--- a/assets/ai-optimizer/state.json
+++ b/assets/ai-optimizer/state.json
@@ -1,6 +1,6 @@
 {
   "track": "gpu",
-  "current_model": "devstral-small-2:24b",
+  "current_model": "deepseek-coder-v2:16b",
   "model_index": 0,
   "phase": "context_scaling",
   "backend": "ollama",
@@ -14,8 +14,8 @@
     "ram": {}
   },
   "completed_models": [],
-  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
+  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
   "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
   "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
-  "last_updated": "2026-04-28T17:00:00Z"
+  "last_updated": "2026-04-30T00:00:00Z"
 }
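
With the automated "Update State" step removed, step 7 of the README ("Record Results") leaves it to the operator to update `state.json` and append to `results.csv` by hand. A minimal Python sketch of that bookkeeping could look like the following. The column order is copied from the CSV header in the README. Everything else is an assumption for illustration: the helper names `record_result` and `is_new_best`, the rule that a successful run wins on larger context first and speed second, and the shape of each `best_configs` entry, which the diff does not show.

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path

# Column order copied from the results.csv header shown in the README.
CSV_FIELDS = [
    "timestamp", "track", "model", "backend", "phase", "num_ctx", "num_gpu",
    "flash_attn", "tokens_per_sec", "vram_gb", "ram_gb", "status", "is_best",
]
# Assumed shape of a best_configs entry (not shown in the diff).
BEST_KEYS = ("num_ctx", "num_gpu", "flash_attn", "tokens_per_sec")


def is_new_best(previous, result):
    """Hypothetical rule: successful runs win on context size, then speed."""
    if result["status"] != "success":
        return False
    if previous is None:
        return True
    return (result["num_ctx"], result["tokens_per_sec"]) > (
        previous["num_ctx"], previous["tokens_per_sec"])


def _fmt(value):
    # The README's CSV uses lowercase true/false for booleans.
    return str(value).lower() if isinstance(value, bool) else value


def record_result(state_path, csv_path, result):
    """Append one benchmark row and refresh best_configs in state.json."""
    state_path, csv_path = Path(state_path), Path(csv_path)
    state = json.loads(state_path.read_text())

    track, model = state["track"], state["current_model"]
    best = state.setdefault("best_configs", {}).setdefault(track, {})
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    row = {
        "timestamp": now, "track": track, "model": model,
        "backend": state.get("backend", "ollama"),
        "phase": state.get("phase", "context_scaling"),
        **result,
    }
    row["is_best"] = is_new_best(best.get(model), result)

    # Write the header only when the file is new or empty.
    write_header = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=CSV_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({key: _fmt(row[key]) for key in CSV_FIELDS})

    if row["is_best"]:
        best[model] = {key: result[key] for key in BEST_KEYS}
    state["last_updated"] = now
    state_path.write_text(json.dumps(state, indent=2) + "\n")
    return row
```

A call such as `record_result("assets/ai-optimizer/state.json", "assets/ai-optimizer/results.csv", {"num_ctx": 65536, "num_gpu": 99, "flash_attn": True, "tokens_per_sec": 15.2, "vram_gb": 52.1, "ram_gb": 18.4, "status": "success"})` would then precede the commit in step 8. The measurements themselves still have to be read off the ollama and `rocm-smi` output manually.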