2026-04-30 16:07:05 +00:00
# AI Model Optimizer - Manual Skill

**Purpose:** Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.

**Usage:** Run manually via Hermes skill when needed (not automated).

**Hardware:**
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: `HSA_OVERRIDE_GFX_VERSION=9.0.6`, `HIP_VISIBLE_DEVICES=0,1`
---
## File Locations
```
STATE: /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
REPO: /opt/data/infra (persistent - do not reclone)
```
---
## Quick Start
```bash
# From Hermes container or any machine with ollama access
ollama-test-model --model devstral-small-2:24b --ctx 65536
```
Or use the full workflow skill for systematic testing.
---
## Model Queues
### GPU Track (Coding - prioritize speed + context on GPU)
1. `deepseek-coder-v2:16b` - Best coding model, fits on GPU
2. `qwen2.5-coder:32b` - Alternative coding model
3. `codellama:34b-instruct` - Legacy option
### RAM Track (Knowledge - prioritize max context)
1. `qwen2.5:72b` - Large knowledge model
2. `nemotron-3-nano:30b` - Efficient large model
3. `mixtral:8x7b-instruct` - MoE architecture
---
## Context Steps (in order)
```
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
```
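Stepping through this list by hand is easy to get wrong, so a small helper can return the step that follows the current one. The sketch below is illustrative only: `CONTEXT_STEPS` and `next_ctx` are hypothetical names, not part of any existing skill.

```bash
# Context steps copied from the list above.
CONTEXT_STEPS="32768 65536 98304 131072 163840 200704 262144 327680"

# next_ctx CURRENT: print the step that follows CURRENT.
# Returns non-zero when CURRENT is the last step or is not in the list.
next_ctx() {
  local current=$1 found=0 step
  for step in $CONTEXT_STEPS; do
    if [ "$found" -eq 1 ]; then
      echo "$step"
      return 0
    fi
    if [ "$step" -eq "$current" ]; then
      found=1
    fi
  done
  return 1
}
```

For example, `next_ctx 65536` prints `98304`, while `next_ctx 327680` prints nothing and returns a non-zero status.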
---
## Optimization Strategy
### GPU Track (Coding)
- Start: `num_ctx=32768`, `num_gpu=99`, `flash_attn=true`
- Increase context until OOM or tokens/sec < 5
- Record best config before hitting wall
- Target: >10 tokens/sec with max context
### RAM Track (Knowledge)
- Start: `num_ctx=65536`, `num_gpu=50`, `flash_attn=true`
- Allow heavy RAM offload (up to 100GB system RAM)
- Increase context until OOM
- Speed secondary to context size
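The stop conditions for both tracks can be written down as a small decision helper. This is a minimal sketch using the thresholds stated above. The function names are hypothetical, and it assumes a run is recorded as `success` when it completes and as anything else (for example `oom`) when it fails.

```bash
# should_continue TRACK STATUS TOKENS_PER_SEC
# Exit status 0 means "try the next context step", non-zero means "stop here".
should_continue() {
  local track=$1 run_status=$2 tps=$3
  # Any failure (for example an OOM) ends the run on both tracks.
  if [ "$run_status" != "success" ]; then
    return 1
  fi
  # RAM track: speed is secondary, so only a failure stops the run.
  if [ "$track" != "gpu" ]; then
    return 0
  fi
  # GPU track: stop once throughput drops below 5 tokens/sec.
  awk -v tps="$tps" 'BEGIN { exit !(tps >= 5) }'
}

# meets_gpu_target TOKENS_PER_SEC: succeed when above the 10 tokens/sec target.
meets_gpu_target() {
  awk -v tps="$1" 'BEGIN { exit !(tps > 10) }'
}
```

A GPU-track run at 15.2 tokens/sec continues and meets the target. One at 4.2 tokens/sec stops, and the last configuration that met the target is the one to record as best.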
---
## Manual Testing Workflow
### 1. Quick Model Test
```bash
# Smoke-test a model with its default context size
# (to test a specific num_ctx, build a test Modelfile as in step 4)
docker exec ollama ollama run <model>:<tag> "Your prompt here"
```
### 2. Check Current State
```bash
cd /opt/data/infra
cat assets/ai-optimizer/state.json
```
### 3. Pull Model (if needed)
```bash
docker exec ollama ollama pull <model>:<tag>
```
### 4. Create Test Modelfile
```bash
# Assumes ${model}, ${num_ctx} and ${num_gpu} are set in the calling shell.
# Note: if `ollama create` rejects the flash_attn parameter, remove that line and
# enable flash attention on the server instead with OLLAMA_FLASH_ATTENTION=1.
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER flash_attn true
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"
docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.modelfile
```
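Model tags contain a colon (and sometimes a slash), which is awkward in file names. The step above also reuses the single name `test-model` for every run. As an optional sketch, a hypothetical `safe_name` helper can derive a per-model file name and test-model name. The naming scheme is an assumption, not something the workflow requires.

```bash
# safe_name MODEL: replace every character outside [A-Za-z0-9._-] with "_".
safe_name() {
  printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

model="qwen2.5-coder:32b"
# Hypothetical per-model names derived from the sanitized tag.
modelfile="/root/.ollama/test_$(safe_name "$model").modelfile"
test_model="test-$(safe_name "$model")"
echo "$modelfile"
echo "$test_model"
```

With separate names, several test models can coexist, at the cost of having to remove them by hand when they are no longer needed.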
### 5. Run Benchmark
```bash
# Warm up (loads the model into memory)
docker exec ollama ollama run test-model "Hello" > /dev/null
# Coding prompt (--verbose prints timing stats, including the eval rate in tokens/sec)
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."
# Knowledge prompt
docker exec ollama ollama run --verbose test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
```
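To record tokens/sec without copying numbers by hand, the timing statistics can be filtered with `awk`. The sketch below assumes the run uses `ollama run --verbose` and that the statistics include a line of the form `eval rate: <number> tokens/s`. The helper name `eval_rate` is hypothetical, and the output format may differ between ollama versions.

```bash
# eval_rate: read `ollama run --verbose` output on stdin and print the
# generation speed in tokens/sec. "prompt eval rate" lines are skipped
# because they do not start with "eval rate:".
eval_rate() {
  awk '/^eval rate:/ { print $3 }'
}

# Example usage (stdout and stderr are merged, since it is an assumption
# which stream the statistics are written to):
#   docker exec ollama ollama run --verbose test-model "Hello" 2>&1 | eval_rate
```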
### 6. Measure VRAM
```bash
# Try the host first, then fall back to the container
rocm-smi --showmeminfo vram 2>/dev/null \
  || docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null \
  || echo "VRAM unavailable"
```
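The `vram_gb` column needs a single number covering both GPUs. The following sketch sums the used-memory lines and converts bytes to GiB. It assumes `rocm-smi --showmeminfo vram` prints one `VRAM Total Used Memory (B): <bytes>` line per GPU. That format varies between ROCm releases, so check it against real output first. `vram_used_gb` is a hypothetical helper name.

```bash
# vram_used_gb: read rocm-smi output on stdin, sum the used VRAM across
# all GPUs, and print the total in GiB with one decimal place.
# Assumed line format: "GPU[0] : VRAM Total Used Memory (B): 27917287424"
vram_used_gb() {
  awk -F': ' '
    /VRAM Total Used Memory \(B\)/ { sum += $NF }
    END { printf "%.1f\n", sum / (1024 * 1024 * 1024) }
  '
}

# Example usage:
#   rocm-smi --showmeminfo vram | vram_used_gb
```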
### 7. Record Results
Update `state.json` and append a row to `results.csv` with:
- tokens/sec (the `eval rate` reported by `ollama run --verbose`)
- VRAM/RAM usage
- Whether this config is the new best
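Appending rows by hand invites column mistakes, so the append step can be wrapped in a function. This sketch uses the column order from the Results CSV Format section. `append_result` is a hypothetical name, and it assumes no field contains a comma.

```bash
# append_result FILE FIELD...: append one comma-joined row to FILE,
# writing the header first when FILE is missing or empty.
append_result() {
  local file=$1
  shift
  if [ ! -s "$file" ]; then
    echo "timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best" > "$file"
  fi
  # "$*" joins the remaining arguments with the first character of IFS.
  local IFS=,
  echo "$*" >> "$file"
}

# Example usage:
#   append_result /opt/data/infra/assets/ai-optimizer/results.csv \
#     "$(date -u +%Y-%m-%dT%H:%M:%SZ)" gpu deepseek-coder-v2:16b ollama \
#     context_scaling 65536 99 true 15.2 52.1 18.4 success false
```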
### 8. Commit Changes
```bash
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push
```
---
## State File Structure
```json
{
  "track": "gpu",
  "current_model": "deepseek-coder-v2:16b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 32768,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {},
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
  "last_updated": "2026-04-30T00:00:00Z"
}
```
---
## Results CSV Format
```csv
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
2026-04-30T00:00:00Z,gpu,deepseek-coder-v2:16b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
```
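With results in this format, the best configuration for a track can be read back with `awk`. The sketch below finds the largest `num_ctx` among successful rows that reach a minimum speed. `best_ctx` is a hypothetical helper, and the 10 tokens/sec floor in the usage example is the GPU-track target from the Optimization Strategy section.

```bash
# best_ctx FILE TRACK MIN_TPS: print the largest num_ctx among rows for
# TRACK with status "success" and tokens_per_sec >= MIN_TPS.
# Prints nothing when no row qualifies.
# Columns used: 2=track, 6=num_ctx, 9=tokens_per_sec, 12=status.
best_ctx() {
  awk -F, -v track="$2" -v min_tps="$3" '
    NR > 1 && $2 == track && $12 == "success" &&
      $9 + 0 >= min_tps + 0 && $6 + 0 > best { best = $6 + 0 }
    END { if (best) print best }
  ' "$1"
}

# Example usage:
#   best_ctx /opt/data/infra/assets/ai-optimizer/results.csv gpu 10
```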
---
## Skill Usage
Once PR #1 (ai-worker-restricted-access) is merged:
```bash
# From Hermes container, SSH to host for direct ollama access
ssh -i /path/to/key ai-worker@host docker exec ollama ollama run <model>
# Or run the skill directly
ollama-benchmark --model deepseek-coder-v2:16b --track gpu
```
---
## Notes
- **Manual execution only** - No cron job, run when needed
- **Two tracks**: Complete GPU track first (coding models), then RAM track
- **Backend**: ollama (llama.cpp optional for advanced users)
- **Host access**: Use docker exec or SSH for rocm-smi
- **Commit results**: Push best configs to repo for reference