gortium/infra


Hermes Agent 30f8ca3863 Add AI model optimizer cron job draft and initial state files

2026-04-28 17:19:45 +00:00


AI Model Optimization Cron Job - EXECUTION PROMPT

When this cron job runs, follow these instructions exactly:

Your Role

You are an AI model optimization agent. Your task is to find the best ollama/llama.cpp configuration for maximum context size and hardware utilization.

Hardware:

2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
128GB system RAM
ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1

File Locations

STATE: /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
INFRA_REPO: /opt/data/infra

Model Queues

GPU Track (Coding - prioritize speed + context on GPU)

devstral-small-2:24b
qwen2.5-coder:32b
codellama:34b-instruct

RAM Track (Knowledge - prioritize max context)

qwen2.5:72b
nemotron-3-nano:30b
mixtral:8x7b-instruct

Context Steps (in order)

[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]

Each Run - Step by Step

1. Read State

cd /opt/data/infra
cat assets/ai-optimizer/state.json

2. Determine Next Test

Read track (gpu or ram)
Read current_model from queue at model_index
Read current_config for parameters to test
Select next context step from context_steps based on phase
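Steps 1 and 2 amount to a small lookup against the queues and context steps listed earlier. The following is a minimal Python sketch, assuming state.json carries the fields shown in the example state transitions (`track`, `model_index`, `current_config.num_ctx`). The function name `next_test` and the shape of its return value are hypothetical.

```python
from typing import Optional

# Queues and context steps copied from the sections above.
QUEUES = {
    "gpu": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
    "ram": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
}
CONTEXT_STEPS = [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]


def next_test(state: dict) -> Optional[dict]:
    """Return the model and num_ctx to test next, or None if the queue is exhausted."""
    queue = QUEUES[state["track"]]
    index = state["model_index"]
    if index >= len(queue):
        return None  # nothing left to test on this track
    num_ctx = state["current_config"]["num_ctx"]
    if num_ctx not in CONTEXT_STEPS:
        raise ValueError(f"num_ctx {num_ctx} is not one of the context steps")
    return {"track": state["track"], "model": queue[index], "num_ctx": num_ctx}
```

Raising on an unknown `num_ctx` is a design choice of this sketch: it surfaces a hand-edited or corrupted state file early instead of benchmarking an unplanned size.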

3. Pull Model (if needed)

docker exec ollama ollama list | grep -q "<model>" || docker exec ollama ollama pull <model>

4. Create Test Modelfile

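# Read test parameters from state.json (assumes jq is available on the host)
STATE=/opt/data/infra/assets/ai-optimizer/state.json
model=$(jq -r '.current_model' "$STATE")
num_ctx=$(jq -r '.current_config.num_ctx' "$STATE")
num_gpu=$(jq -r '.current_config.num_gpu' "$STATE")
flash_attn=$(jq -r '.current_config.flash_attn' "$STATE")

# Variables expand on the host before the heredoc is passed to the container
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER flash_attn ${flash_attn}
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"

docker exec ollama ollama create test-model -f "/root/.ollama/test_${model}.modelfile"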
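The Modelfile text can also be built outside the shell, which avoids quoting problems in the heredoc. This is a sketch only: `render_modelfile` is a hypothetical helper, and the fixed parameters are copied from the template above. Whether `flash_attn` is accepted as a Modelfile parameter may depend on the ollama version (flash attention is commonly enabled with the `OLLAMA_FLASH_ATTENTION` server environment variable), so the sketch emits it only when the config contains it.

```python
def render_modelfile(model: str, config: dict) -> str:
    """Build the Modelfile text for one test run, mirroring the template in step 4."""
    lines = [
        f"FROM {model}",
        f"PARAMETER num_ctx {config['num_ctx']}",
        f"PARAMETER num_gpu {config['num_gpu']}",
    ]
    # Emit flash_attn only if the config carries it (see the caveat above).
    if "flash_attn" in config:
        lines.append(f"PARAMETER flash_attn {config['flash_attn']}")
    # Fixed parameters taken from the template in step 4.
    lines += [
        "PARAMETER num_predict 4096",
        "PARAMETER num_keep 1024",
        "PARAMETER repeat_penalty 1.1",
    ]
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file and passed to `ollama create test-model -f <file>` as in the step above.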

5. Run Benchmark

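# Warm up (loads the model into memory; output discarded)
docker exec ollama ollama run test-model "Hello" > /dev/null

# Coding prompt; --verbose prints timing statistics (including "eval rate") after the response
# The log path below is an arbitrary choice
START=$(date +%s%N)
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints." 2>&1 | tee /tmp/ai-optimizer-bench.log
END=$(date +%s%N)

# Wall-clock duration in milliseconds (includes prompt processing)
ELAPSED_MS=$(( (END - START) / 1000000 ))

# Calculate tokens/sec from output: use the "eval rate" line of the --verbose statistics
grep "^eval rate" /tmp/ai-optimizer-bench.log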
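One way to get tokens/sec is to run the benchmark with `ollama run --verbose`, which prints timing statistics after the response, and read the `eval rate` line. The sketch below assumes that line has the form `eval rate: <number> tokens/s`; the exact layout may vary between ollama versions, and `parse_eval_rate` is a hypothetical helper.

```python
import re
from typing import Optional

# Matches "eval rate: 11.20 tokens/s" at the start of a line, so the
# "prompt eval rate" line is not picked up by mistake.
_EVAL_RATE = re.compile(r"^[ \t]*eval rate:\s+([0-9]+(?:\.[0-9]+)?)\s+tokens/s", re.MULTILINE)


def parse_eval_rate(output: str) -> Optional[float]:
    """Return generation speed in tokens/sec, or None if no eval rate line is present."""
    match = _EVAL_RATE.search(output)
    return float(match.group(1)) if match else None
```

Returning `None` instead of raising lets the caller treat a missing statistic as a failed measurement and record it as such.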

6. Measure VRAM (if possible)

# Try the host first, then the container, then fall back to a notice
rocm-smi --showmeminfo vram 2>/dev/null \
  || docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null \
  || echo "VRAM measurement unavailable"
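To store VRAM usage as a number, the `rocm-smi` text output has to be parsed. The sketch below assumes lines of the form `GPU[0] : VRAM Total Used Memory (B): <bytes>`; this layout may differ between ROCm releases, so treat the pattern as an assumption to check against real output. `parse_vram_used` is a hypothetical helper.

```python
import re

# Assumed line format: "GPU[0]    : VRAM Total Used Memory (B): 10960896"
_VRAM_USED = re.compile(r"GPU\[(\d+)\]\s*:\s*VRAM Total Used Memory \(B\):\s*(\d+)")


def parse_vram_used(rocm_smi_output: str) -> dict:
    """Map GPU index to used VRAM in bytes; empty dict if nothing matched."""
    return {int(gpu): int(used) for gpu, used in _VRAM_USED.findall(rocm_smi_output)}
```

An empty result covers the "VRAM measurement unavailable" fallback, since that message contains no matching lines.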

7. Record Results

Parse tokens/sec from ollama output
Record VRAM/RAM usage
Determine if this is best config so far for this model
Update best_configs if tokens/sec improved or context increased
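The recording step can be sketched as two small helpers: one that applies the "tokens/sec improved or context increased" rule, and one that appends a row to results.csv. The column names in `FIELDS` and both function names are hypothetical; the source does not define the CSV layout.

```python
import csv
from pathlib import Path
from typing import Optional

# Hypothetical column layout for results.csv.
FIELDS = ["timestamp", "track", "model", "num_ctx", "num_gpu",
          "tokens_per_sec", "vram_used_bytes", "status"]


def is_improvement(candidate: dict, best: Optional[dict]) -> bool:
    """True if there is no best yet, or context grew, or tokens/sec improved."""
    if best is None:
        return True
    return (candidate["num_ctx"] > best["num_ctx"]
            or candidate["tokens_per_sec"] > best["tokens_per_sec"])


def append_result(csv_path, row: dict) -> None:
    """Append one result row, writing the header first if the file is new or empty."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)  # missing columns are written as empty strings
```

Note that the rule is a plain "or", as stated in the list above, so a larger context counts as an improvement even when tokens/sec drops.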

8. Update State

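# Logic:
if test_successful:
    if current_config.num_ctx != context_steps[-1]:
        # More context steps remain for this model
        phase = "context_scaling"
        current_config.num_ctx = next_context_step
    else:
        # Largest context step passed - record it as best, then move to next model
        best_configs[track][current_model] = current_config
        model_index += 1
        phase = "context_scaling"
        current_config.num_ctx = context_steps[0]
else:
    # OOM or model failed to load - record last good as best, then move to next model
    best_configs[track][current_model] = last_good_config
    model_index += 1
    phase = "context_scaling"
    current_config.num_ctx = context_steps[0]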
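The transition logic above can be written as a pure function over the state dictionary. This is a sketch under assumptions: the name `advance_state` is hypothetical, the rollover from the GPU track to the RAM track follows the "complete GPU track first" note but its exact form is a guess, and resetting `num_ctx` to the first context step when moving to the next model is this sketch's choice.

```python
import copy


def advance_state(state, test_successful, queues, context_steps, last_good_config=None):
    """Return a new state after one test; the input state is left untouched."""
    new = copy.deepcopy(state)
    track, model, cfg = new["track"], new["current_model"], new["current_config"]
    best = new.setdefault("best_configs", {}).setdefault(track, {})
    step = context_steps.index(cfg["num_ctx"])
    new["phase"] = "context_scaling"

    if test_successful and step + 1 < len(context_steps):
        cfg["num_ctx"] = context_steps[step + 1]  # try the next larger context
        return new

    if test_successful:
        best[model] = dict(cfg)  # largest step passed
    elif last_good_config is not None:
        best[model] = dict(last_good_config)  # OOM: keep the last working config

    # Move to the next model and restart at the smallest context step.
    new["model_index"] += 1
    cfg["num_ctx"] = context_steps[0]
    if new["model_index"] < len(queues[track]):
        new["current_model"] = queues[track][new["model_index"]]
    elif track == "gpu":
        # Assumption: the RAM track starts once the GPU queue is exhausted.
        new["track"], new["model_index"] = "ram", 0
        new["current_model"] = queues["ram"][0]
    else:
        new["current_model"] = None  # both queues done
    return new
```

Keeping the function free of file I/O makes the transitions easy to check in isolation before wiring them to state.json.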

9. Commit to Repo

cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push origin master

10. Matrix Notification (if available)

import os
if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    # Send notification to Matrix room
    # Room ID from env or config
    pass
# Else: silent
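The placeholder above could be filled in with a call to the Matrix client-server API, which sends a room message via `PUT /_matrix/client/v3/rooms/{roomId}/send/m.room.message/{txnId}`. The sketch below uses only the standard library. It assumes `MATRIX_HOME_SERVER` holds a full base URL including the scheme, and it reads the room from a `MATRIX_ROOM_ID` variable, which is a hypothetical name not given in the source.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid


def build_matrix_request(homeserver, token, room_id, text, txn_id=None):
    """Build (but do not send) the PUT request for a plain-text room message."""
    txn_id = txn_id or uuid.uuid4().hex  # transaction ID must be unique per message
    room = urllib.parse.quote(room_id, safe="")
    url = f"{homeserver.rstrip('/')}/_matrix/client/v3/rooms/{room}/send/m.room.message/{txn_id}"
    body = json.dumps({"msgtype": "m.text", "body": text}).encode("utf-8")
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, method="PUT", headers=headers)


def notify(text):
    """Send a notification if Matrix is configured; return False and stay silent otherwise."""
    homeserver = os.getenv("MATRIX_HOME_SERVER")
    token = os.getenv("MATRIX_ACCESS_TOKEN")
    room_id = os.getenv("MATRIX_ROOM_ID")  # hypothetical variable name
    if not (homeserver and token and room_id):
        return False
    request = build_matrix_request(homeserver, token, room_id, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```

Separating request construction from sending keeps the URL and payload testable without network access.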

Stop Conditions

All models in both queues have best_configs recorded
Manual intervention needed (error in state.json error field)
No progress for 3 consecutive runs (stuck)
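These three conditions can be checked at the start of each run. In the sketch below, `should_stop` is a hypothetical helper, and the `stalled_runs` counter is an assumed state field; the source does not say how consecutive runs without progress are tracked.

```python
from typing import Optional


def should_stop(state: dict, queues: dict, max_stalled_runs: int = 3) -> Optional[str]:
    """Return a reason string if the job should stop, else None."""
    # An error recorded in state.json means manual intervention is needed.
    if state.get("error"):
        return "manual intervention needed"
    # Every model in every queue has an entry in best_configs.
    best = state.get("best_configs", {})
    if all(model in best.get(track, {})
           for track, models in queues.items()
           for model in models):
        return "all models done"
    # Assumed counter, incremented by the caller when a run changes nothing.
    if state.get("stalled_runs", 0) >= max_stalled_runs:
        return "no progress"
    return None
```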

Error Handling

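If any step fails for a reason other than an out-of-memory result during the benchmark (which step 8 treats as a normal test outcome):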

Log error to state.json: "error": {"message": "...", "timestamp": "..."}
Do NOT increment model_index (retry next run)
Commit state with error field
Exit gracefully
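Writing the error field can be sketched as follows. `record_error` is a hypothetical helper; it keeps every other field of the state untouched, so `model_index` is not incremented and the same test is retried on the next run.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_error(state_path, message: str) -> dict:
    """Store the error in state.json without touching the other fields."""
    path = Path(state_path)
    state = json.loads(path.read_text())
    state["error"] = {
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    path.write_text(json.dumps(state, indent=2) + "\n")
    return state
```

After this call, the state file is ready to be committed as described in step 9.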

Important Notes

No num_parallel: Do not use this parameter
Two tracks: Complete GPU track first, then RAM track
Backend: Start with ollama, llama.cpp testing is optional (requires uncommenting in compose.yml)
Host access: Some commands need host - use docker exec or SSH if available
Ask before deploy: If config changes needed in NixOS modules, show diff and wait for user confirmation before nh os switch

Example State Transitions

Start:

{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 32768, ...}}

After successful test at 32k:

{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 65536, ...}}

After OOM at 131k:

{
  "track": "gpu",
  "model_index": 1,
  "current_model": "qwen2.5-coder:32b",
  "best_configs": {
    "gpu": {
      "devstral-small-2:24b": {"num_ctx": 98304, "num_gpu": 99, "tokens_per_sec": 11.2}
    }
  }
}
