diff --git a/assets/ai-optimizer/README.md b/assets/ai-optimizer/README.md
new file mode 100644
index 0000000..cde9392
--- /dev/null
+++ b/assets/ai-optimizer/README.md
@@ -0,0 +1,194 @@
+# AI Model Optimizer - Ollama GPU Benchmark Plan
+
+**Purpose:** Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.
+
+**Hardware:**
+- 2x AMD MI50 GPUs (32GB VRAM each, 64GB total)
+- 128GB system RAM
+- ROCm: `HSA_OVERRIDE_GFX_VERSION=9.0.6`, `HIP_VISIBLE_DEVICES=0,1`
+
+---
+
+## File Locations
+
+```
+STATE: /opt/data/infra/assets/ai-optimizer/state.json
+RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
+REPO: /opt/data/infra (persistent clone)
+```
+
+---
+
+## Model Queues
+
+### GPU Track (Coding - prioritize speed + context on GPU)
+1. `deepseek-coder-v2:16b` - Best coding model, fits on GPU
+2. `qwen2.5-coder:32b` - Alternative coding model
+3. `codellama:34b-instruct` - Legacy option
+
+### RAM Track (Knowledge - prioritize max context)
+1. `qwen2.5:72b` - Large knowledge model
+2. `nemotron-3-nano:30b` - Efficient large model
+3. `mixtral:8x7b-instruct` - MoE architecture
+
+---
+
+## Context Steps (in order)
+
+```
+[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
+```
+
+---
+
+## Optimization Strategy
+
+### GPU Track (Coding)
+- Start: `num_ctx=32768`, `num_gpu=99`, `flash_attn=true`
+- Increase context until OOM or tokens/sec < 5
+- Record best config before hitting wall
+- Target: >10 tokens/sec with max context
+
+### RAM Track (Knowledge)
+- Start: `num_ctx=65536`, `num_gpu=50`, `flash_attn=true`
+- Allow heavy RAM offload (up to 100GB system RAM)
+- Increase context until OOM
+- Speed secondary to context size
+
+---
+
+## Prerequisites
+
+This PR adds the `ai-worker` user with docker group access. After merge:
+
+```bash
+# SSH from Hermes container to run benchmarks on the host
+ssh -i /path/to/key ai-worker@host docker exec ollama ollama list
+
+# Or if running directly on host
+docker exec ollama ollama list
+```
+
+---
+
+## Manual Testing Workflow
+
+### 1. Quick Model Test
+
+```bash
+docker exec ollama ollama run <model>:<tag> "Your prompt here"
+```
+
+### 2. Check Current State
+
+```bash
+cd /opt/data/infra
+cat assets/ai-optimizer/state.json
+```
+
+### 3. Pull Model (if needed)
+
+```bash
+docker exec ollama ollama pull <model>:<tag>
+```
+
+### 4. Create Test Modelfile
+
+```bash
+docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
+FROM ${model}
+PARAMETER num_ctx ${num_ctx}
+PARAMETER num_gpu ${num_gpu}
+PARAMETER flash_attn true
+PARAMETER num_predict 4096
+PARAMETER num_keep 1024
+PARAMETER repeat_penalty 1.1
+EOF"
+
+docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.modelfile
+```
+
+### 5. Run Benchmark
+
+```bash
+# Warm up
+docker exec ollama ollama run test-model "Hello" > /dev/null
+
+# Coding prompt
+docker exec ollama ollama run test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."
+
+# Knowledge prompt
+docker exec ollama ollama run test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
+```
+
+### 6. Measure VRAM
+
+```bash
+# Try host first
+rocm-smi --showmeminfo vram 2>/dev/null || \
+# Try via docker
+docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null || \
+echo "VRAM unavailable"
+```
+
+### 7. Record Results
+
+Update `state.json` and append to `results.csv`:
+- tokens/sec from ollama output
+- VRAM/RAM usage
+- Whether this config is the new best
+
+### 8. Commit Changes
+
+```bash
+cd /opt/data/infra
+git add assets/ai-optimizer/
+git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
+git push
+```
+
+---
+
+## State File Structure
+
+```json
+{
+  "track": "gpu",
+  "current_model": "deepseek-coder-v2:16b",
+  "model_index": 0,
+  "phase": "context_scaling",
+  "backend": "ollama",
+  "current_config": {
+    "num_ctx": 32768,
+    "num_gpu": 99,
+    "flash_attn": true
+  },
+  "best_configs": {
+    "gpu": {},
+    "ram": {}
+  },
+  "completed_models": [],
+  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
+  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
+  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
+  "last_updated": "2026-04-30T00:00:00Z"
+}
+```
+
+---
+
+## Results CSV Format
+
+```csv
+timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
+```
+
+---
+
+## Notes
+
+- **Manual execution** - Run benchmarks when needed, no automated cron job
+- **Two tracks**: Complete GPU track first (coding models), then RAM track
+- **Backend**: ollama (llama.cpp optional for advanced users)
+- **Host access**: Use docker exec (or SSH via ai-worker) for rocm-smi
+- **Commit results**: Push best configs to repo for reference
diff --git a/assets/ai-optimizer/results.csv b/assets/ai-optimizer/results.csv
new file mode 100644
index 0000000..7e25194
--- /dev/null
+++ b/assets/ai-optimizer/results.csv
@@ -0,0 +1 @@
+timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
diff --git a/assets/ai-optimizer/state.json b/assets/ai-optimizer/state.json
new file mode 100644
index 0000000..08dac90
--- /dev/null
+++ b/assets/ai-optimizer/state.json
@@ -0,0 +1,21 @@
+{
+  "track": "gpu",
+  "current_model": "deepseek-coder-v2:16b",
+  "model_index": 0,
+  "phase": "context_scaling",
+  "backend": "ollama",
+  "current_config": {
+    "num_ctx": 32768,
+    "num_gpu": 99,
+    "flash_attn": true
+  },
+  "best_configs": {
+    "gpu": {},
+    "ram": {}
+  },
+  "completed_models": [],
+  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
+  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
+  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
+  "last_updated": "2026-05-09T00:00:00Z"
+}
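
The context-scaling rule in the README's GPU track (step through `context_steps` in order, stop once throughput drops below 5 tokens/sec) could be scripted roughly as follows. This is a minimal sketch and not part of the PR: the helper names `next_ctx` and `fast_enough` are hypothetical, the step list is copied from `state.json`, and the code assumes only a POSIX shell plus `awk`.

```bash
# Context steps copied from state.json ("context_steps").
CONTEXT_STEPS="32768 65536 98304 131072 163840 200704 262144 327680"

# Hypothetical helper: print the step that follows $1.
# Returns 1 when $1 is the last step or is not in the list.
next_ctx() {
  found=0
  for step in $CONTEXT_STEPS; do
    if [ "$found" -eq 1 ]; then
      echo "$step"
      return 0
    fi
    if [ "$step" = "$1" ]; then
      found=1
    fi
  done
  return 1
}

# Hypothetical helper: GPU-track stop rule. Succeeds while the measured
# tokens/sec ($1) is at or above the README's floor of 5.
fast_enough() {
  awk -v tps="$1" 'BEGIN { exit !(tps >= 5) }'
}
```

A manual run might call `next_ctx "$num_ctx"` after each successful benchmark and move to the next model in the queue when either helper fails; OOM detection is left out because it depends on how the ollama container reports the failure.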