infra/assets/ai-optimizer/README.md
Hermes Agent 0ec198dec2 feat: convert ai-optimizer from cron job to manual skill
- Update README.md for manual execution workflow
- Change model queue to deepseek-coder-v2:16b (better coding model)
- Remove automated scheduling references
- Add skill usage instructions for post-PR#1 merge
2026-04-30 16:07:05 +00:00


# AI Model Optimizer - Manual Skill
**Purpose:** Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.
**Usage:** Run manually via Hermes skill when needed (not automated).
**Hardware:**
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: `HSA_OVERRIDE_GFX_VERSION=9.0.6`, `HIP_VISIBLE_DEVICES=0,1`
---
## File Locations
```
STATE: /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
REPO: /opt/data/infra (persistent - do not reclone)
```
---
## Quick Start
```bash
# From Hermes container or any machine with ollama access
ollama-test-model --model devstral-small-2:24b --ctx 65536
```
Or use the full workflow skill for systematic testing.
---
## Model Queues
### GPU Track (Coding - prioritize speed + context on GPU)
1. `deepseek-coder-v2:16b` - Best coding model, fits on GPU
2. `qwen2.5-coder:32b` - Alternative coding model
3. `codellama:34b-instruct` - Legacy option
### RAM Track (Knowledge - prioritize max context)
1. `qwen2.5:72b` - Large knowledge model
2. `nemotron-3-nano:30b` - Efficient large model
3. `mixtral:8x7b-instruct` - MoE architecture
---
## Context Steps (in order)
```
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
```
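Stepping through this list can be scripted. The following is a minimal sketch, not part of the skill itself: `next_ctx` is a hypothetical helper name, and it assumes only the step list above.

```bash
# Context sizes from the list above, in test order.
CONTEXT_STEPS="32768 65536 98304 131072 163840 200704 262144 327680"

# next_ctx CURRENT
# Prints the step that follows CURRENT. Returns 1 (printing nothing) when
# CURRENT is the last step or is not in the list.
next_ctx() {
  _found=0
  for _step in $CONTEXT_STEPS; do
    if [ "$_found" -eq 1 ]; then
      echo "$_step"
      return 0
    fi
    if [ "$_step" = "$1" ]; then
      _found=1
    fi
  done
  return 1
}
```

For example, `next_ctx 65536` prints `98304`, and `next_ctx 327680` returns non-zero because there is no larger step.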
---
## Optimization Strategy
### GPU Track (Coding)
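- Start: `num_ctx=32768`, `num_gpu=99` (offload all layers to the GPUs), `flash_attn=true`
- Step up through the context sizes until a run fails with OOM or drops below 5 tokens/sec
- Record the last working config before that limit as the best
- Target: the largest context that still sustains >10 tokens/sec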
### RAM Track (Knowledge)
- Start: `num_ctx=65536`, `num_gpu=50` (offload 50 layers to the GPUs, the rest run from system RAM), `flash_attn=true`
- Allow heavy RAM offload (up to 100GB system RAM)
- Step up through the context sizes until a run fails with OOM
- Speed is secondary to context size
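The stop conditions for both tracks can be expressed as a single check. This is a sketch under stated assumptions: `should_continue` is a hypothetical helper, and the status string `success` is taken from the sample row in Results CSV Format. Any other status (for example an OOM) is treated as a failure.

```bash
# should_continue TRACK STATUS TOKENS_PER_SEC
# Returns 0 if the next context step should be tried, 1 if scaling should stop.
should_continue() {
  _track=$1
  _status=$2
  _tps=$3

  # Any failed run (OOM or otherwise) ends scaling on both tracks.
  if [ "$_status" != "success" ]; then
    return 1
  fi

  # GPU track: also stop once throughput drops below 5 tokens/sec.
  if [ "$_track" = "gpu" ]; then
    if awk -v tps="$_tps" 'BEGIN { exit (tps + 0 < 5) ? 1 : 0 }'; then
      return 0
    fi
    return 1
  fi

  # RAM track: speed is secondary, so keep going until a run fails.
  return 0
}
```

The floating-point comparison is delegated to `awk` because POSIX shell arithmetic is integer-only.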
---
## Manual Testing Workflow
### 1. Quick Model Test
```bash
# Smoke-test a model at its default context size
# (num_ctx is only set through the test Modelfile in step 4)
docker exec ollama ollama run <model>:<tag> "Your prompt here"
```
### 2. Check Current State
```bash
cd /opt/data/infra
cat assets/ai-optimizer/state.json
```
### 3. Pull Model (if needed)
```bash
docker exec ollama ollama pull <model>:<tag>
```
### 4. Create Test Modelfile
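```bash
# Set these first, e.g. model=deepseek-coder-v2:16b num_ctx=32768 num_gpu=99
# The ${...} variables are expanded by the host shell before the command
# reaches the container.
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"
# flash_attn is not a Modelfile PARAMETER; flash attention is enabled on the
# ollama server with the OLLAMA_FLASH_ATTENTION=1 environment variable.
docker exec ollama ollama create test-model -f "/root/.ollama/test_${model}.modelfile"
```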
### 5. Run Benchmark
```bash
# Warm up (loads the model so load time does not skew the measurement)
docker exec ollama ollama run test-model "Hello" > /dev/null
# Coding prompt (--verbose prints timing stats, including the "eval rate" in tokens/s)
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."
# Knowledge prompt
docker exec ollama ollama run --verbose test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
```
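To pull the tokens/sec figure out of a benchmark run without reading it by eye, a small filter can help. This sketch assumes the run uses `ollama run --verbose` and that the statistics contain a line of the form `eval rate: <number> tokens/s`, as current ollama releases print; `parse_eval_rate` is a hypothetical helper name.

```bash
# parse_eval_rate
# Reads `ollama run --verbose` statistics on stdin and prints the generation
# speed in tokens/sec. Anchoring on the start of the line skips the
# "prompt eval rate:" line, which measures prompt processing instead.
parse_eval_rate() {
  awk '/^eval rate:/ { print $3 }'
}
```

Assuming the statistics are written to stderr, one possible invocation is `docker exec ollama ollama run --verbose test-model "Hello" 2>&1 >/dev/null | parse_eval_rate`.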
### 6. Measure VRAM
```bash
# Try the host first, then fall back to the ollama container
rocm-smi --showmeminfo vram 2>/dev/null \
  || docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null \
  || echo "VRAM unavailable"
```
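The per-GPU byte counts can be folded into the single `vram_gb` figure used in `results.csv`. This is a rough sketch: it assumes `rocm-smi --showmeminfo vram` prints one line per GPU containing `Total Used Memory` with the byte count as the last field, which may differ between ROCm versions. `vram_used_gb` is a hypothetical helper name.

```bash
# vram_used_gb
# Reads `rocm-smi --showmeminfo vram` output on stdin, sums the used bytes
# across all GPUs, and prints the total in GiB with one decimal place.
vram_used_gb() {
  awk '/Total Used Memory/ { sum += $NF }
       END { printf "%.1f\n", sum / 1073741824 }'
}
```

Usage: `rocm-smi --showmeminfo vram | vram_used_gb`. Check the output format on the host before relying on the number.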
### 7. Record Results
Update `state.json` and append a row to `results.csv`:
- tokens/sec (the `eval rate` line printed by `ollama run --verbose`)
- VRAM and system RAM usage in GB
- Whether this config is the new best
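Appending the row by hand is error-prone, so a helper along these lines may be useful. It is a sketch rather than the skill's actual implementation: `append_result` is a hypothetical name, the column order follows the header in Results CSV Format, and the timestamp is generated in UTC.

```bash
RESULTS_HEADER="timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best"

# append_result FILE TRACK MODEL BACKEND PHASE NUM_CTX NUM_GPU FLASH_ATTN \
#               TOKENS_PER_SEC VRAM_GB RAM_GB STATUS IS_BEST
append_result() {
  if [ "$#" -ne 13 ]; then
    echo "append_result: expected 13 arguments, got $#" >&2
    return 1
  fi
  _file=$1
  shift
  # Write the header first if the file is missing or empty.
  if [ ! -s "$_file" ]; then
    echo "$RESULTS_HEADER" > "$_file"
  fi
  _ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  # "$*" joins the remaining arguments with the first character of IFS.
  ( IFS=,; echo "$_ts,$*" ) >> "$_file"
}
```

Example: `append_result /opt/data/infra/assets/ai-optimizer/results.csv gpu deepseek-coder-v2:16b ollama context_scaling 65536 99 true 15.2 52.1 18.4 success false`.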
### 8. Commit Changes
```bash
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push
```
---
## State File Structure
```json
{
"track": "gpu",
"current_model": "deepseek-coder-v2:16b",
"model_index": 0,
"phase": "context_scaling",
"backend": "ollama",
"current_config": {
"num_ctx": 32768,
"num_gpu": 99,
"flash_attn": true
},
"best_configs": {
"gpu": {},
"ram": {}
},
"completed_models": [],
"gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
"ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
"context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
"last_updated": "2026-04-30T00:00:00Z"
}
```
---
## Results CSV Format
```csv
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
2026-04-30T00:00:00Z,gpu,deepseek-coder-v2:16b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
```
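With rows in this format, the best context size for a model can be read back from the file. The sketch below makes one assumption about what "best" means: the largest `num_ctx` among `success` rows whose `tokens_per_sec` exceeds a threshold, defaulting to the GPU track target of 10. `best_ctx` is a hypothetical helper name.

```bash
# best_ctx FILE MODEL [MIN_TPS]
# Prints the largest num_ctx among successful rows for MODEL whose
# tokens_per_sec is above MIN_TPS (default 10). Prints nothing and
# returns 1 if no row qualifies.
best_ctx() {
  awk -F, -v model="$2" -v min_tps="${3:-10}" '
    # Column 3 = model, 6 = num_ctx, 9 = tokens_per_sec, 12 = status.
    NR > 1 && $3 == model && $12 == "success" && $9 + 0 > min_tps + 0 {
      if ($6 + 0 > best) best = $6 + 0
    }
    END {
      if (best > 0) { print best } else { exit 1 }
    }' "$1"
}
```

For the RAM track, where speed is secondary, passing `0` as the third argument effectively removes the throughput filter.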
---
## Skill Usage
Once PR #1 (ai-worker-restricted-access) is merged:
```bash
# From Hermes container, SSH to host for direct ollama access
ssh -i /path/to/key ai-worker@host docker exec ollama ollama run <model>
# Or run the skill directly
ollama-benchmark --model deepseek-coder-v2:16b --track gpu
```
---
## Notes
- **Manual execution only** - No cron job, run when needed
- **Two tracks**: Complete GPU track first (coding models), then RAM track
- **Backend**: ollama (llama.cpp optional for advanced users)
- **Host access**: Use docker exec or SSH for rocm-smi
- **Commit results**: Push best configs to repo for reference