gortium/infra


Hermes Agent 30f8ca3863 Add AI model optimizer cron job draft and initial state files

2026-04-28 17:19:45 +00:00


AI Model Optimization Cron Job

Goal: For each model, find the configuration that gives the largest stable context size while making full use of the available GPU VRAM and system RAM.

Hardware:

2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
128GB system RAM
ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1

Model Queue

GPU-Optimized (Coding - prioritize speed + context on GPU)

devstral-small-2:24b - Best coding model
qwen2.5-coder:32b - Strong coder, fits on GPU+offload
codellama:34b-instruct - Legacy but solid

RAM-Optimized (Knowledge - prioritize max context, accept slower)

qwen2.5:72b - Best knowledge, needs heavy offload
nemotron-3-nano:30b - Good general knowledge
mixtral:8x7b-instruct - MoE, efficient for knowledge

Optimization Strategy

Two separate tracks:

Track A: GPU-Focused (Coding)

Baseline: num_ctx=32768, num_gpu=99, flash_attn=true
Steps:
1. Increase context: 32k → 65k → 98k → 131k → 163k
2. At each step, verify VRAM usage < 60GB (leave headroom)
3. If OOM: reduce num_gpu (layers offloaded to GPU) until stable, then record the best working config
4. Measure tokens/sec - if < 5 tok/s, consider context too high
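The Track A steps above can be sketched as a context ladder plus a decision rule. This is a minimal sketch, not the cron job's actual logic: the function names, the `StepResult` type, and the action strings are hypothetical, and the thresholds are taken from the steps as written (60 GB VRAM, 5 tok/s).

```python
from dataclasses import dataclass

VRAM_LIMIT_GB = 60.0    # headroom below the 64 GB total (step 2)
MIN_TOK_PER_SEC = 5.0   # below this the context is considered too high (step 4)


def context_ladder(start: int = 32768, steps: int = 5) -> list[int]:
    """Return the context sizes to try: start, 2*start, ..., steps*start."""
    return [start * i for i in range(1, steps + 1)]


@dataclass
class StepResult:
    oom: bool
    vram_gb: float
    tokens_per_sec: float


def next_action(result: StepResult) -> str:
    """Map one measurement to an action name (the names are illustrative)."""
    if result.oom:
        return "reduce_num_gpu"
    if result.vram_gb >= VRAM_LIMIT_GB:
        return "stop_vram_headroom"
    if result.tokens_per_sec < MIN_TOK_PER_SEC:
        return "stop_too_slow"
    return "increase_context"
```

Calling `context_ladder()` yields the 32k to 163k sequence from step 1; `context_ladder(65536)` yields multiples of 65536 for a Track B style ladder.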

Track B: RAM-Focused (Knowledge)

Baseline: num_ctx=65536, num_gpu=50, flash_attn=true
Steps:
1. Increase context: 65k → 131k → 196k → 262k → 327k (multiples of 65536)
2. Allow heavy RAM offload (system RAM up to 100GB)
3. If OOM: reduce context or num_gpu
4. Speed less critical - focus on max stable context

Backend-Specific Configs

Ollama (Modelfile parameters)

PARAMETER num_ctx <value>
PARAMETER num_gpu <layers>
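# flash_attn: not a Modelfile PARAMETER; Ollama enables flash attention server-side via the OLLAMA_FLASH_ATTENTION=1 environment variable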
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
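One way to generate such a Modelfile per test step is a small renderer. This is a sketch under the assumption that test parameters are held in a plain dict; `render_modelfile` is a hypothetical helper, not part of Ollama.

```python
def render_modelfile(base_model: str, params: dict) -> str:
    """Render a Modelfile with one PARAMETER line per dict entry."""
    lines = [f"FROM {base_model}"]
    for key, value in params.items():
        if isinstance(value, bool):
            # Modelfiles use lowercase true/false, unlike Python's True/False
            value = str(value).lower()
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "devstral-small-2:24b",
    {"num_ctx": 65536, "num_gpu": 99, "num_predict": 4096},
)
```

The renderer does not validate parameter names; whether Ollama accepts a given name is decided when `ollama create` runs.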

Llama.cpp (CLI flags)

--ctx-size <value>
--n-gpu-layers <layers>
--flash-attn on/off
--n-predict 4096
--batch-size 4096
--ubatch-size 512
--cache-type-k f16
--cache-type-v f16
--split-mode layer
--no-mmap
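The same flags can be assembled programmatically so that only the values under test vary. A minimal sketch: `llama_cpp_args` is a hypothetical helper, and the fixed values simply mirror the list above.

```python
def llama_cpp_args(ctx_size: int, n_gpu_layers: int, flash_attn: bool = True) -> list[str]:
    """Build the llama.cpp argument list; only the first three values vary per test."""
    return [
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--flash-attn", "on" if flash_attn else "off",
        "--n-predict", "4096",
        "--batch-size", "4096",
        "--ubatch-size", "512",
        "--cache-type-k", "f16",
        "--cache-type-v", "f16",
        "--split-mode", "layer",
        "--no-mmap",
    ]


args = llama_cpp_args(65536, 99)
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting.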

Host Test Instructions

The cron job runs inside the hermes container. Some tests require host access:

1. VRAM Monitoring (HOST)

# Run on host to check VRAM usage during/after benchmark
sudo rocm-smi --showmeminfo vram

# Or via docker exec if rocm-smi available in container
docker exec --privileged ollama rocm-smi --showmeminfo vram
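For automated checks, the output can be parsed instead of read by eye. The sketch below assumes `rocm-smi --showmeminfo vram --json` is available and emits one object per `cardN` with a "VRAM Total Used Memory (B)" field; verify the key names against the installed ROCm version before relying on this. `vram_used_gb` is a hypothetical helper.

```python
import json


def vram_used_gb(rocm_smi_json: str) -> float:
    """Sum used VRAM across all cards and return it in GiB."""
    data = json.loads(rocm_smi_json)
    used_bytes = 0
    for name, fields in data.items():
        if not name.startswith("card"):
            continue  # skip non-device entries such as system-level sections
        # Assumed key name; values are reported as strings of bytes
        used_bytes += int(fields["VRAM Total Used Memory (B)"])
    return used_bytes / 1024**3


sample = json.dumps({
    "card0": {"VRAM Total Used Memory (B)": str(16 * 1024**3)},
    "card1": {"VRAM Total Used Memory (B)": str(8 * 1024**3)},
})
```

The result can then be compared against the 60 GB headroom limit from Track A.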

2. Running Ollama Benchmarks (CONTAINER)

# Pull model
docker exec ollama ollama pull <model>

# Create custom modelfile
docker exec ollama bash -c 'cat <<EOF > /root/.ollama/test.modelfile
FROM <model>
PARAMETER num_ctx 65536
PARAMETER num_gpu 99
# Flash attention is enabled server-side (OLLAMA_FLASH_ATTENTION=1), not via PARAMETER
EOF'

# Create model from modelfile
docker exec ollama ollama create test-model -f /root/.ollama/test.modelfile

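# Run benchmark; --verbose prints timing stats, including eval rate (tokens/sec).
# Run it twice and keep the second result so that model load time is excluded.
docker exec ollama ollama run --verbose test-model "Write a Python async context manager with exponential backoff"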

# Cleanup
docker exec ollama ollama rm test-model
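As an alternative to `ollama run`, the benchmark can go through Ollama's HTTP API, which reports `eval_count` and `eval_duration` (nanoseconds) in a non-streaming `/api/generate` response. This is a sketch: the host URL assumes the default port 11434 is reachable from where the cron job runs, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str, num_ctx: int, num_gpu: int,
                           host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming /api/generate request with per-request options."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_gpu": num_gpu},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_sec(response: dict) -> float:
    """Generation speed: generated tokens divided by eval time in seconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_benchmark(model: str, prompt: str, num_ctx: int, num_gpu: int,
                  timeout: float = 3600.0) -> float:
    """Send the request and return tokens/sec (performs a network call)."""
    request = build_generate_request(model, prompt, num_ctx, num_gpu)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return tokens_per_sec(json.load(resp))
```

Passing `num_ctx` and `num_gpu` as request options avoids creating and removing a temporary model for every step, at the cost of diverging from the Modelfile-based procedure above.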

3. Running Llama.cpp Benchmarks (CONTAINER - needs llama.cpp container)

# Uncomment llama_cpp_devstral in compose.yml first
# Then rebuild: sudo nh os switch --flake .#lazyworkhorse

# Test via HTTP API
curl http://localhost:8300/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral-2-small-llama_cpp",
    "prompt": "Write a Python function",
    "max_tokens": 100
  }'

4. Deploying Changes (HOST via ai-worker)

# After optimization, commit results
cd /home/ai-worker/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: new best config for <model>"
git push

# If config changes needed in ollama_init_custom_models.nix:
# 1. Edit the file
# 2. nixpkgs-fmt .
# 3. Show diff to user
# 4. Wait for confirmation
# 5. sudo nh os switch --flake .#lazyworkhorse

5. Accessing Host from Hermes Container

# SSH to host as ai-worker (key should be mounted)
ssh -i /path/to/key ai-worker@host.docker.internal

# Or via docker socket if mounted
# (not recommended for security)

Benchmark Prompts

Coding (Track A)

"Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints and error handling."

Knowledge (Track B)

"Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication. Include bandwidth considerations for each level."

Measurement

Tokens per second (generation speed)
Time to first token (latency)
VRAM usage (via rocm-smi)
System RAM usage (via free -h)
Context success (did it complete without OOM?)
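Time to first token and generation speed can both be derived from a streamed response. The sketch below is backend-agnostic: it takes any iterable of streamed chunks and treats each chunk as one token, which is an approximation. `measure_stream` and its injectable `clock` parameter are hypothetical.

```python
import time
from typing import Callable, Iterable


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.monotonic) -> dict:
    """Return time to first chunk and chunks/sec after the first chunk."""
    start = clock()
    first = last = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        # Nothing was generated, e.g. the request failed before any output
        return {"ttft_s": None, "tokens_per_sec": 0.0, "tokens": 0}
    gen_time = last - first
    # The first chunk marks the start of generation, so it is not counted in the rate
    rate = (count - 1) / gen_time if gen_time > 0 else 0.0
    return {"ttft_s": first - start, "tokens_per_sec": rate, "tokens": count}
```

Injecting the clock makes the function testable with fixed timestamps; in production the default `time.monotonic` is used.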

State File Structure

/opt/data/infra/assets/ai-optimizer/state.json

{
  "track": "gpu",
  "current_model": "devstral-small-2:24b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 65536,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {
      "devstral-small-2:24b": {
        "backend": "ollama",
        "num_ctx": 131072,
        "num_gpu": 99,
        "flash_attn": true,
        "tokens_per_sec": 12.5,
        "vram_used_gb": 58.2,
        "tested_at": "2026-04-28T17:00:00Z"
      }
    },
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"]
}
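Because the cron job rewrites this file on every run, an interrupted write could leave it truncated. A minimal sketch of loading and atomically saving the state follows; the helper names are hypothetical, and `record_best` assumes the key layout shown above.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path) -> dict:
    """Read state.json into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
        fh.write("\n")
    os.replace(tmp_name, path)  # atomic on POSIX when on the same filesystem


def record_best(state: dict, measurements: dict) -> None:
    """Store the current config plus measurements under best_configs[track][model]."""
    entry = {"backend": state["backend"], **state["current_config"], **measurements}
    state["best_configs"].setdefault(state["track"], {})[state["current_model"]] = entry
```

Writing to a temporary file first means a crash mid-write leaves the previous state.json intact.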

Results CSV

/opt/data/infra/assets/ai-optimizer/results.csv

timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
2026-04-28T17:00:00Z,gpu,devstral-small-2:24b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
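Appending rows in this format can be sketched with the standard `csv` module. The column order is taken from the header above; the helper names are hypothetical, and booleans are lowercased to match the sample row.

```python
import csv
import io
from pathlib import Path

FIELDS = ["timestamp", "track", "model", "backend", "phase", "num_ctx", "num_gpu",
          "flash_attn", "tokens_per_sec", "vram_gb", "ram_gb", "status", "is_best"]


def row_to_csv_line(row: dict) -> str:
    """Format one result as a CSV line in the documented column order."""
    normalised = {k: (str(v).lower() if isinstance(v, bool) else v) for k, v in row.items()}
    buf = io.StringIO()
    csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n").writerow(normalised)
    return buf.getvalue()


def append_result(path, row: dict) -> None:
    """Append a row, writing the header first if the file is new or empty."""
    path = Path(path)
    needs_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        if needs_header:
            fh.write(",".join(FIELDS) + "\n")
        fh.write(row_to_csv_line(row))


example = {
    "timestamp": "2026-04-28T17:00:00Z", "track": "gpu", "model": "devstral-small-2:24b",
    "backend": "ollama", "phase": "context_scaling", "num_ctx": 65536, "num_gpu": 99,
    "flash_attn": True, "tokens_per_sec": 15.2, "vram_gb": 52.1, "ram_gb": 18.4,
    "status": "success", "is_best": False,
}
```

`DictWriter` raises `ValueError` on unknown keys, which catches column-name typos before they reach the file.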

Cron Job Flow

1. Read state.json
2. If both queues empty → STOP (all models tested)
3. Select next model from current track queue
4. Pull model if needed (docker exec ollama ollama pull)
5. Create Modelfile / llama.cpp config with current test params
6. Run benchmark (both prompts)
7. Measure: tokens/sec, VRAM (rocm-smi), RAM (free -h)
8. If successful:
   - Increase context (next step)
   - Update current_config in state
9. If OOM/error:
   - Record last good config as best_configs[track][model]
   - Move to next model in queue
10. Update state.json
11. Append to results.csv
12. Git commit + push from /opt/data/infra
13. Send Matrix notification if available, else silent
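Steps 8 and 9 amount to a state transition. The sketch below is one possible reading, not the actual job: it assumes a finished model is removed from its queue (so that "both queues empty" in step 2 can become true), that the job switches to the other track when the current queue runs out, and that the caller supplies the last good config. The `advance` function, the `BASELINE` table, and the `"done"` phase value are hypothetical.

```python
from typing import Optional

# Baselines and context increments taken from Track A and Track B
BASELINE = {
    "gpu": {"num_ctx": 32768, "num_gpu": 99, "flash_attn": True},
    "ram": {"num_ctx": 65536, "num_gpu": 50, "flash_attn": True},
}
CONTEXT_STEP = {"gpu": 32768, "ram": 65536}


def advance(state: dict, success: bool, last_good: Optional[dict] = None) -> dict:
    """Apply step 8 (success) or step 9 (OOM/error) to the state in place."""
    track, model = state["track"], state["current_model"]
    if success:
        state["current_config"]["num_ctx"] += CONTEXT_STEP[track]
        return state
    if last_good is not None:
        state["best_configs"].setdefault(track, {})[model] = last_good
    state["completed_models"].append(model)
    state[f"{track}_queue"].remove(model)
    # Prefer the current track; fall back to the other one when it is exhausted
    for candidate in (track, "ram" if track == "gpu" else "gpu"):
        queue = state[f"{candidate}_queue"]
        if queue:
            state["track"] = candidate
            state["current_model"] = queue[0]
            state["current_config"] = dict(BASELINE[candidate])
            return state
    state["phase"] = "done"  # both queues empty: nothing left to test
    return state
```

Note that this sketch moves on at the first failure, as step 9 describes, and does not implement the "reduce num_gpu until stable" retry from Track A.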

Matrix Notification (Optional)

# If matrix credentials available in environment
import os

# Notify only if both Matrix settings are present
if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    # Send completion notification
    # Room: !ai-optimizer:lazyworkhorse.net (or similar)
    pass
# Else: silent, just commit
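The placeholder above could be filled in with the Matrix client-server API, which sends a message via `PUT /_matrix/client/v3/rooms/{roomId}/send/m.room.message/{txnId}`. This sketch assumes `MATRIX_HOME_SERVER` holds a full base URL including the scheme; the helper names are hypothetical and the room ID is left to the caller.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid
from typing import Optional


def build_matrix_request(homeserver: str, token: str, room_id: str, text: str,
                         txn_id: Optional[str] = None) -> urllib.request.Request:
    """Build the PUT request for a plain-text m.room.message event."""
    txn_id = txn_id or uuid.uuid4().hex  # must be unique per message
    room = urllib.parse.quote(room_id, safe="")  # room IDs contain '!' and ':'
    url = f"{homeserver.rstrip('/')}/_matrix/client/v3/rooms/{room}/send/m.room.message/{txn_id}"
    body = json.dumps({"msgtype": "m.text", "body": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


def notify(text: str, room_id: str) -> bool:
    """Send the message if credentials exist; return False to stay silent."""
    homeserver = os.getenv("MATRIX_HOME_SERVER")
    token = os.getenv("MATRIX_ACCESS_TOKEN")
    if not (homeserver and token):
        return False
    request = build_matrix_request(homeserver, token, room_id, text)
    with urllib.request.urlopen(request, timeout=30):
        return True
```

Network errors propagate as exceptions here; the cron job would likely want to catch them so that a failed notification does not block the commit.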

Files to Create

/opt/data/infra/assets/ai-optimizer/
├── state.json           # Current progress
├── results.csv          # All test results
├── best_configs.json    # Final best configs (human-readable)
└── CRON_JOB_DRAFT.md    # This file

Notes

No num_parallel: Removed so that parallel request slots do not compete with the context size for memory
Two tracks: GPU (coding/speed) vs RAM (knowledge/context)
Both backends: Test ollama first, then llama.cpp if available
Host tests: rocm-smi must run on host or privileged container
Deploy: ai-worker has sudo for nh/nixos-rebuild but must ask the user for confirmation before switching
