infra/assets/ai-optimizer/README.md
Hermes Agent 0ec198dec2 feat: convert ai-optimizer from cron job to manual skill
- Update README.md for manual execution workflow
- Change model queue to deepseek-coder-v2:16b (better coding model)
- Remove automated scheduling references
- Add skill usage instructions for post-PR#1 merge
2026-04-30 16:07:05 +00:00

AI Model Optimizer - Manual Skill

Purpose: Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.

Usage: Run manually via Hermes skill when needed (not automated).

Hardware:

  • 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
  • 128GB system RAM
  • ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1

File Locations

STATE:   /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
REPO:    /opt/data/infra (persistent - do not reclone)

Quick Start

# From Hermes container or any machine with ollama access
ollama-test-model --model devstral-small-2:24b --ctx 65536

Or use the full workflow skill for systematic testing.


Model Queues

GPU Track (Coding - prioritize speed + context on GPU)

  1. deepseek-coder-v2:16b - Best coding model, fits on GPU
  2. qwen2.5-coder:32b - Alternative coding model
  3. codellama:34b-instruct - Legacy option

RAM Track (Knowledge - prioritize max context)

  1. qwen2.5:72b - Large knowledge model
  2. nemotron-3-nano:30b - Efficient large model
  3. mixtral:8x7b-instruct - MoE architecture

Context Steps (in order)

[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
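Stepping through this list by hand is easy to get wrong late in a session. A minimal sketch of a helper that returns the next step is shown here; `CONTEXT_STEPS` and `next_ctx` are hypothetical names, not part of the skill.

```shell
# Context ladder copied from the list above.
CONTEXT_STEPS="32768 65536 98304 131072 163840 200704 262144 327680"

# next_ctx CURRENT: print the step that follows CURRENT.
# Returns 1 and prints nothing when CURRENT is the last step or not in the list.
next_ctx() {
    found=0
    for step in $CONTEXT_STEPS; do
        if [ "$found" -eq 1 ]; then
            echo "$step"
            return 0
        fi
        if [ "$step" -eq "$1" ]; then
            found=1
        fi
    done
    return 1
}

# Example: num_ctx=$(next_ctx "$num_ctx") || echo "no larger step left"
```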

Optimization Strategy

GPU Track (Coding)

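  • Start: num_ctx=32768, num_gpu=99 (offload all layers to the GPUs), flash_attn=true
  • Increase context one step at a time until OOM or throughput drops below 5 tokens/sec
  • Record the last config that succeeded before hitting that limit
  • Target: >10 tokens/sec at the largest context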
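The stop and target rules above can be sketched as a small decision helper. This is an illustration only: `gpu_verdict` and its three output labels are made up here, and any status other than `success` (for example a hypothetical `oom`) is treated as a reason to stop.

```shell
# gpu_verdict STATUS TOKENS_PER_SEC
# Prints one of: stop, continue_below_target, continue_on_target.
gpu_verdict() {
    run_status=$1
    tps=$2
    if [ "$run_status" != "success" ]; then
        echo "stop"    # OOM or any other failure ends the scan
        return 0
    fi
    # awk does the floating-point comparison that POSIX test cannot.
    awk -v tps="$tps" 'BEGIN {
        if (tps < 5)       print "stop"
        else if (tps > 10) print "continue_on_target"
        else               print "continue_below_target"
    }'
}
```

A run between 5 and 10 tokens/sec keeps the scan going but should not be recorded as meeting the target.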

RAM Track (Knowledge)

  • Start: num_ctx=65536, num_gpu=50 (at most 50 layers on the GPUs, the rest in system RAM), flash_attn=true
  • Allow heavy RAM offload (up to 100GB system RAM)
  • Increase context one step at a time until OOM
  • Speed is secondary to context size
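To keep the two starting points in one place, a sketch like the following may help. `start_config` is a hypothetical helper; the values are the ones listed for each track above.

```shell
# start_config TRACK: print "num_ctx num_gpu flash_attn" for the given track.
start_config() {
    case "$1" in
        gpu) echo "32768 99 true" ;;
        ram) echo "65536 50 true" ;;
        *)   echo "unknown track: $1" >&2; return 1 ;;
    esac
}

# Example: cfg=$(start_config ram) && set -- $cfg
#          num_ctx=$1 num_gpu=$2 flash_attn=$3
```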

Manual Testing Workflow

1. Quick Model Test

# Smoke-test a model (runs at the model's default context size;
# num_ctx is set through the Modelfile in step 4)
docker exec ollama ollama run <model>:<tag> "Your prompt here"

2. Check Current State

cd /opt/data/infra
cat assets/ai-optimizer/state.json

3. Pull Model (if needed)

docker exec ollama ollama pull <model>:<tag>

4. Create Test Modelfile

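# Example values; adjust per test run
model="deepseek-coder-v2:16b"
num_ctx=32768
num_gpu=99

# Write the Modelfile inside the container (the variables expand on the host).
# If `ollama create` rejects flash_attn as an unknown parameter, remove that
# line and set OLLAMA_FLASH_ATTENTION=1 on the ollama server instead.
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER flash_attn true
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"

# Build the test model from that Modelfile
docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.modelfile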
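Generating the Modelfile text on the host and piping it into the container avoids quoting a heredoc inside `bash -c`. The sketch below is one possible way to do that; `render_modelfile` and `modelfile_name` are hypothetical helpers. It leaves out `flash_attn` on the assumption that flash attention is switched on server-side through `OLLAMA_FLASH_ATTENTION`; add the line back if your ollama version accepts it.

```shell
# render_modelfile MODEL NUM_CTX NUM_GPU: print Modelfile text on stdout.
render_modelfile() {
    printf 'FROM %s\n' "$1"
    printf 'PARAMETER num_ctx %s\n' "$2"
    printf 'PARAMETER num_gpu %s\n' "$3"
    printf 'PARAMETER num_predict 4096\n'
    printf 'PARAMETER num_keep 1024\n'
    printf 'PARAMETER repeat_penalty 1.1\n'
}

# modelfile_name MODEL: file name with ':' and '/' replaced by '_'.
modelfile_name() {
    printf 'test_%s.modelfile\n' "$(printf '%s' "$1" | tr ':/' '__')"
}

# Example (needs the running ollama container):
# render_modelfile "$model" "$num_ctx" "$num_gpu" |
#     docker exec -i ollama sh -c "cat > /root/.ollama/$(modelfile_name "$model")"
```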

5. Run Benchmark

# Warm up
docker exec ollama ollama run test-model "Hello" > /dev/null

# Coding prompt (--verbose prints timing stats, including the eval rate in tokens/s)
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."

# Knowledge prompt
docker exec ollama ollama run --verbose test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
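The tokens/sec figure can be pulled out of the statistics that `ollama run --verbose` prints after a response. The sketch below assumes those statistics contain a line beginning with `eval rate:` followed by the number; `eval_rate` is a hypothetical helper name.

```shell
# eval_rate: read `ollama run --verbose` statistics on stdin and print the
# generation speed in tokens/s. The "prompt eval rate" line is skipped
# because it does not start with "eval rate".
eval_rate() {
    awk '/^eval rate:/ { print $3 }'
}

# Example (the statistics go to stderr, so only stderr is piped):
# docker exec ollama ollama run --verbose test-model "$prompt" 2>&1 >/dev/null | eval_rate
```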

6. Measure VRAM

# Try rocm-smi on the host first, then inside the container
rocm-smi --showmeminfo vram 2>/dev/null \
  || docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null \
  || echo "VRAM unavailable"
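The results file stores VRAM as a single figure in GB, while the two cards are reported separately. A minimal sketch of the conversion follows, assuming the per-GPU values you read from rocm-smi are byte counts; `bytes_to_gb` is a hypothetical helper and uses 1 GB = 1024³ bytes.

```shell
# bytes_to_gb BYTES...: sum the byte counts and print the total in GB
# (1024^3 bytes) with one decimal place.
bytes_to_gb() {
    printf '%s\n' "$@" |
        awk '{ sum += $1 } END { printf "%.1f\n", sum / (1024 * 1024 * 1024) }'
}

# Example: vram_gb=$(bytes_to_gb "$gpu0_used_bytes" "$gpu1_used_bytes")
```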

7. Record Results

Update state.json and append to results.csv:

  • tokens/sec (the eval rate printed by ollama run --verbose)
  • VRAM/RAM usage
  • Whether this config is the new best
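Appending rows by hand makes it easy to drop or reorder a column. The sketch below shows one way to guard against that; `append_result` and `RESULTS_HEADER` are hypothetical names, and the column order is taken from the Results CSV Format section.

```shell
RESULTS_HEADER="timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best"

# append_result FILE FIELD...: write the header if FILE is missing or empty,
# then append the 13 fields (in header order) as one comma-separated row.
append_result() {
    file=$1
    shift
    if [ "$#" -ne 13 ]; then
        echo "append_result: expected 13 fields, got $#" >&2
        return 1
    fi
    [ -s "$file" ] || printf '%s\n' "$RESULTS_HEADER" > "$file"
    # "$*" joins the fields with the first character of IFS.
    (IFS=,; printf '%s\n' "$*") >> "$file"
}

# Example:
# append_result /opt/data/infra/assets/ai-optimizer/results.csv \
#     "$(date -u +%Y-%m-%dT%H:%M:%SZ)" gpu "$model" ollama context_scaling \
#     "$num_ctx" "$num_gpu" true "$tps" "$vram_gb" "$ram_gb" success false
```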

8. Commit Changes

cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push

State File Structure

{
  "track": "gpu",
  "current_model": "deepseek-coder-v2:16b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 32768,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {},
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
  "last_updated": "2026-04-30T00:00:00Z"
}

Results CSV Format

timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
2026-04-30T00:00:00Z,gpu,deepseek-coder-v2:16b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
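With this layout, the best GPU-track result for a model can be looked up straight from the CSV. The sketch below treats "best" as the successful row with the largest num_ctx that still exceeds 10 tokens/sec, which is one reading of the GPU-track target; `best_gpu_row` is a hypothetical helper.

```shell
# best_gpu_row FILE MODEL: print the successful gpu-track row for MODEL with
# the largest num_ctx whose tokens_per_sec exceeds 10.
# Prints nothing if no row qualifies.
best_gpu_row() {
    awk -F, -v model="$2" '
        NR == 1 { next }    # skip the header
        $2 == "gpu" && $3 == model && $12 == "success" && $9 + 0 > 10 && $6 + 0 > best {
            best = $6 + 0
            row = $0
        }
        END { if (row != "") print row }
    ' "$1"
}

# Example: best_gpu_row /opt/data/infra/assets/ai-optimizer/results.csv deepseek-coder-v2:16b
```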

Skill Usage

Once PR #1 (ai-worker-restricted-access) is merged:

# From Hermes container, SSH to host for direct ollama access
ssh -i /path/to/key ai-worker@host docker exec ollama ollama run <model>

# Or run the skill directly
ollama-benchmark --model deepseek-coder-v2:16b --track gpu

Notes

  • Manual execution only - No cron job, run when needed
  • Two tracks: Complete GPU track first (coding models), then RAM track
  • Backend: ollama (llama.cpp optional for advanced users)
  • Host access: Use docker exec or SSH for rocm-smi
  • Commit results: Push best configs to repo for reference