AI Model Optimizer - Ollama GPU Benchmark Plan
Purpose: Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.
Hardware:
- 2x AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1
File Locations
STATE: /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
REPO: /opt/data/infra (persistent clone)
Model Queues
GPU Track (Coding - prioritize speed + context on GPU)
- deepseek-coder-v2:16b - Best coding model, fits on GPU
- qwen2.5-coder:32b - Alternative coding model
- codellama:34b-instruct - Legacy option
RAM Track (Knowledge - prioritize max context)
- qwen2.5:72b - Large knowledge model
- nemotron-3-nano:30b - Efficient large model
- mixtral:8x7b-instruct - MoE architecture
Context Steps (in order)
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
Optimization Strategy
GPU Track (Coding)
- Start: num_ctx=32768, num_gpu=99, flash_attn=true
- Increase context until OOM or tokens/sec < 5
- Record best config before hitting wall
- Target: >10 tokens/sec with max context
RAM Track (Knowledge)
- Start: num_ctx=65536, num_gpu=50, flash_attn=true
- Allow heavy RAM offload (up to 100GB system RAM)
- Increase context until OOM
- Speed secondary to context size
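Both tracks reduce to a stepping rule plus a stop condition. The following is a minimal sketch, assuming the step list from "Context Steps" and the GPU-track cutoff of 5 tokens/sec. The names next_ctx and below_floor are hypothetical helpers for illustration, not part of ollama or this repo.

```shell
#!/bin/sh
# Hypothetical helpers for walking the context steps; not part of the repo.
CONTEXT_STEPS="32768 65536 98304 131072 163840 200704 262144 327680"

# Print the step that follows the given num_ctx.
# Returns 1 at the end of the list or when the value is not a known step.
next_ctx() {
  _found=0
  for _step in $CONTEXT_STEPS; do
    if [ "$_found" -eq 1 ]; then
      echo "$_step"
      return 0
    fi
    if [ "$_step" = "$1" ]; then
      _found=1
    fi
  done
  return 1
}

# Succeed when tokens/sec is below the floor (default 5, the GPU-track cutoff).
# awk does the comparison because shell arithmetic is integer-only.
below_floor() {
  awk -v t="$1" -v f="${2:-5}" 'BEGIN { exit !(t < f) }'
}
```

A run would call next_ctx after each successful test and stop when it returns non-zero, when the model fails to load, or (GPU track only) when below_floor succeeds for the measured speed.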
Prerequisites
This PR adds the ai-worker user with docker group access. After merge:
# SSH from Hermes container to run benchmarks on the host
ssh -i /path/to/key ai-worker@host docker exec ollama ollama list
# Or if running directly on host
docker exec ollama ollama list
Manual Testing Workflow
1. Quick Model Test
docker exec ollama ollama run <model>:<tag> "Your prompt here"
2. Check Current State
cd /opt/data/infra
cat assets/ai-optimizer/state.json
3. Pull Model (if needed)
docker exec ollama ollama pull <model>:<tag>
4. Create Test Modelfile
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
# flash_attn is not a Modelfile PARAMETER; enable it on the ollama server with OLLAMA_FLASH_ATTENTION=1
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"
docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.modelfile
5. Run Benchmark
# Warm up
docker exec ollama ollama run test-model "Hello" > /dev/null
# Coding prompt (--verbose prints timing stats, including the eval rate in tokens/s)
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."
# Knowledge prompt
docker exec ollama ollama run --verbose test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
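To turn a benchmark run into a single number, the timing statistics can be filtered for the generation speed. This is a sketch only: it assumes the prompt is run with ollama run --verbose and that the statistics include a line of the form "eval rate: 45.23 tokens/s". The helper name eval_rate is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ollama run --verbose` output on stdin and print
# the generation speed in tokens/sec.
# Assumes a stats line such as "eval rate:   45.23 tokens/s". Anchoring the
# match at the start of the line skips the separate "prompt eval rate:" line.
eval_rate() {
  awk '/^eval rate:/ { print $3; exit }'
}
```

One possible use is `docker exec ollama ollama run --verbose test-model "Hello" 2>&1 | eval_rate`. The `2>&1` merges both streams so the statistics reach the filter whichever stream they are written to.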
6. Measure VRAM
# Try the host first, then fall back to the ollama container
rocm-smi --showmeminfo vram 2>/dev/null || \
docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null || \
echo "VRAM unavailable"
7. Record Results
Update state.json and append to results.csv:
- tokens/sec: the "eval rate" value that ollama run --verbose reports
- VRAM/RAM usage
- Whether this config is the new best
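Appending a row by hand is error-prone with thirteen columns, so a small helper can keep the order fixed. This sketch assumes the column order given under "Results CSV Format" and a UTC ISO 8601 timestamp like the one in state.json. The name csv_row is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one results.csv row in the header's column order.
# Arguments: track model backend phase num_ctx num_gpu flash_attn
#            tokens_per_sec vram_gb ram_gb status is_best
# The timestamp column is filled in automatically (UTC).
# Assumes no field contains a comma, so no CSV quoting is attempted.
csv_row() {
  if [ "$#" -ne 12 ]; then
    echo "csv_row: expected 12 fields, got $#" >&2
    return 1
  fi
  _ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n' "$_ts" "$@"
}
```

With the measured values held in shell variables, a row could be appended with `csv_row gpu "$model" ollama context_scaling "$num_ctx" "$num_gpu" true "$tps" "$vram_gb" "$ram_gb" "$status" "$is_best" >> /opt/data/infra/assets/ai-optimizer/results.csv`.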
8. Commit Changes
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push
State File Structure
{
"track": "gpu",
"current_model": "deepseek-coder-v2:16b",
"model_index": 0,
"phase": "context_scaling",
"backend": "ollama",
"current_config": {
"num_ctx": 32768,
"num_gpu": 99,
"flash_attn": true
},
"best_configs": {
"gpu": {},
"ram": {}
},
"completed_models": [],
"gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
"ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
"context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
"last_updated": "2026-04-30T00:00:00Z"
}
Results CSV Format
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
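With this layout, the best configuration per model can be read back from results.csv with awk. This is a sketch under stated assumptions: the file starts with the header line above, successful runs carry the status "ok" (the plan does not define status values), and no field contains a comma. The name best_rows is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for each model, print the passing row with the largest
# num_ctx. Usage: best_rows FILE [MIN_TOKENS_PER_SEC]
# Columns used: 3=model, 6=num_ctx, 9=tokens_per_sec, 12=status.
# The status value "ok" is an assumption; adjust it to the values recorded.
best_rows() {
  _file=$1
  _min_tps=${2:-0}
  awk -F, -v min="$_min_tps" '
    NR == 1 { next }   # skip the header line
    $12 == "ok" && $9 + 0 >= min + 0 && $6 + 0 > best[$3] + 0 {
      best[$3] = $6
      row[$3] = $0
    }
    END { for (m in row) print row[m] }
  ' "$_file"
}
```

For the GPU track, `best_rows results.csv 10` applies the target of more than 10 tokens/sec approximately (the sketch uses >=). Omitting the second argument ranks by context size alone, as the RAM track does. The order of the output rows is unspecified.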
Notes
- Manual execution - Run benchmarks when needed, no automated cron job
- Two tracks: Complete GPU track first (coding models), then RAM track
- Backend: ollama (llama.cpp optional for advanced users)
- Host access: Use docker exec (or SSH via ai-worker) for rocm-smi
- Commit results: Push best configs to repo for reference