feat: convert ai-optimizer from cron job to manual skill

- Update README.md for manual execution workflow
- Change model queue to deepseek-coder-v2:16b (better coding model)
- Remove automated scheduling references
- Add skill usage instructions for post-PR#1 merge
2026-04-30 16:07:05 +00:00
parent 7ce0e46670
commit 0ec198dec2
2 changed files with 68 additions and 73 deletions
--- a/assets/ai-optimizer/README.md
+++ b/assets/ai-optimizer/README.md
@@ -1,13 +1,13 @@
-# AI Model Optimization Cron Job
+# AI Model Optimizer - Manual Skill

-**Purpose:** Automatically find optimal ollama/llama.cpp configurations for maximum context size and hardware utilization.
+**Purpose:** Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.

-**Schedule:** Every hour
+**Usage:** Run manually via Hermes skill when needed (not automated).

 **Hardware:**
 - 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
 - 128GB system RAM
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1
+- ROCm: `HSA_OVERRIDE_GFX_VERSION=9.0.6`, `HIP_VISIBLE_DEVICES=0,1`

 ---

@@ -21,21 +21,33 @@ REPO:    /opt/data/infra (persistent - do not reclone)

 ---

+## Quick Start
+
+```bash
+# From Hermes container or any machine with ollama access
+ollama-test-model --model devstral-small-2:24b --ctx 65536
+```
+
+Or use the full workflow skill for systematic testing.
+
+---
+
 ## Model Queues

 ### GPU Track (Coding - prioritize speed + context on GPU)
-1. `devstral-small-2:24b`
-2. `qwen2.5-coder:32b`
-3. `codellama:34b-instruct`
+1. `deepseek-coder-v2:16b` - Preferred coding model, fits on GPU
+2. `qwen2.5-coder:32b` - Alternative coding model
+3. `codellama:34b-instruct` - Legacy option

 ### RAM Track (Knowledge - prioritize max context)
-1. `qwen2.5:72b`
-2. `nemotron-3-nano:30b`
-3. `mixtral:8x7b-instruct`
+1. `qwen2.5:72b` - Large knowledge model
+2. `nemotron-3-nano:30b` - Efficient large model
+3. `mixtral:8x7b-instruct` - MoE architecture

 ---

 ## Context Steps (in order)
+
 ```
 [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
 ```
@@ -45,45 +57,49 @@ REPO:    /opt/data/infra (persistent - do not reclone)
 ## Optimization Strategy

 ### GPU Track (Coding)
- Start: num_ctx=32768, num_gpu=99, flash_attn=true
+- Start: `num_ctx=32768`, `num_gpu=99`, `flash_attn=true`
 - Increase context until OOM or tokens/sec < 5
 - Record best config before hitting wall
 - Target: >10 tokens/sec with max context

 ### RAM Track (Knowledge)
- Start: num_ctx=65536, num_gpu=50, flash_attn=true
+- Start: `num_ctx=65536`, `num_gpu=50`, `flash_attn=true`
 - Allow heavy RAM offload (up to 100GB system RAM)
 - Increase context until OOM
 - Speed secondary to context size

 ---

-## Each Run - Step by Step
+## Manual Testing Workflow
+
+### 1. Quick Model Test
+
+```bash
+# Test a model at specific context size
+docker exec ollama ollama run <model>:<tag> "Your prompt here"
+```
+
+### 2. Check Current State

-### 1. Read State
 ```bash
 cd /opt/data/infra
 cat assets/ai-optimizer/state.json
 ```

-### 2. Determine Next Test
- Read `track` (gpu or ram)
- Get `current_model` from queue at `model_index`
- Get `current_config` for parameters to test
- Select next context step from `context_steps`
-
 ### 3. Pull Model (if needed)
+
 ```bash
-docker exec ollama ollama list | grep -q "<model>" || docker exec ollama ollama pull <model>
+docker exec ollama ollama pull <model>:<tag>
 ```

 ### 4. Create Test Modelfile
+
 ```bash
 docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
 FROM ${model}
-PARAMETER num_ctx ${current_config.num_ctx}
-PARAMETER num_gpu ${current_config.num_gpu}
-PARAMETER flash_attn ${current_config.flash_attn}
+PARAMETER num_ctx ${num_ctx}
+PARAMETER num_gpu ${num_gpu}
+PARAMETER flash_attn true
 PARAMETER num_predict 4096
 PARAMETER num_keep 1024
 PARAMETER repeat_penalty 1.1
@@ -93,6 +109,7 @@ docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.model
 ```

 ### 5. Run Benchmark
+
 ```bash
 # Warm up
 docker exec ollama ollama run test-model "Hello" > /dev/null
@@ -104,7 +121,8 @@ docker exec ollama ollama run test-model "Write a Python async context manager t
 docker exec ollama ollama run test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
 ```

-### 6. Measure VRAM (if possible)
+### 6. Measure VRAM
+
 ```bash
 # Try host first
 rocm-smi --showmeminfo vram 2>/dev/null || \
@@ -114,24 +132,14 @@ echo "VRAM unavailable"
 ```

 ### 7. Record Results
- Parse tokens/sec from ollama output
- Record VRAM/RAM usage
- Update `best_configs` if improved

-### 8. Update State
-```python
-if test_successful:
-    if context_step < max_reached:
-        current_config.num_ctx = next_context_step
-    else:
-        model_index += 1
-        current_config.num_ctx = context_steps[0]
-else:
-    best_configs[track][current_model] = last_good_config
-    model_index += 1
-```
+Update `state.json` and append to `results.csv`:
+- tokens/sec from ollama output
+- VRAM/RAM usage
+- Whether this config is the new best
+
+### 8. Commit Changes

-### 9. Commit to Repo
 ```bash
 cd /opt/data/infra
 git add assets/ai-optimizer/
@@ -139,15 +147,6 @@ git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
 git push
 ```

-### 10. Matrix Notification (if available)
-```python
-import os
-if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
-    # Send notification
-    pass
-# Else: silent
-```
-
 ---

 ## State File Structure
@@ -155,7 +154,7 @@ if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
 ```json
 {
  "track": "gpu",
-  "current_model": "devstral-small-2:24b",
+  "current_model": "deepseek-coder-v2:16b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
@@ -169,10 +168,10 @@ if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    "ram": {}
  },
  "completed_models": [],
-  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
+  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
-  "last_updated": "2026-04-28T17:00:00Z"
+  "last_updated": "2026-04-30T00:00:00Z"
 }
 ```

@@ -182,33 +181,29 @@ if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):

 ```csv
 timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
-2026-04-28T17:00:00Z,gpu,devstral-small-2:24b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
+2026-04-30T00:00:00Z,gpu,deepseek-coder-v2:16b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
 ```

 ---

-## Stop Conditions
+## Skill Usage

-1. All models in both queues have `best_configs` recorded
-2. Manual intervention needed (error in state.json `error` field)
-3. No progress for 3 consecutive runs
+Once PR #1 (ai-worker-restricted-access) is merged:

---
+```bash
+# From Hermes container, SSH to host for direct ollama access
+ssh -i /path/to/key ai-worker@host docker exec ollama ollama run <model>

-## Error Handling
-
-If any step fails:
-1. Log error: `"error": {"message": "...", "timestamp": "..."}`
-2. Do NOT increment model_index (retry next run)
-3. Commit state with error field
-4. Exit gracefully
+# Or run the skill directly
+ollama-benchmark --model deepseek-coder-v2:16b --track gpu
+```

 ---

 ## Notes

- **No num_parallel**: Removed to avoid limiting other settings
- **Two tracks**: Complete GPU track first, then RAM track
- **Backend**: Start with ollama, llama.cpp optional
+- **Manual execution only**: No cron job; run when needed
+- **Two tracks**: Complete GPU track first (coding models), then RAM track
+- **Backend**: ollama (llama.cpp optional for advanced users)
 - **Host access**: Use docker exec or SSH for rocm-smi
- **Ask before deploy**: Show diff before `nh os switch`
+- **Commit results**: Push best configs to repo for reference
--- a/assets/ai-optimizer/state.json
+++ b/assets/ai-optimizer/state.json
@@ -1,6 +1,6 @@
 {
  "track": "gpu",
-  "current_model": "devstral-small-2:24b",
+  "current_model": "deepseek-coder-v2:16b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
@@ -14,8 +14,8 @@
    "ram": {}
  },
  "completed_models": [],
-  "gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
+  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
-  "last_updated": "2026-04-28T17:00:00Z"
+  "last_updated": "2026-04-30T00:00:00Z"
 }
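
Since this commit drops the automated "Update State" pseudocode from the README, the GPU-track rule that remains ("increase context until OOM or tokens/sec < 5", "record best config before hitting wall") now has to be applied by hand. A minimal Python sketch of that rule follows. It is not part of the commit: `find_best_context` and the `run_benchmark` callback are hypothetical names, and the callback is assumed to return tokens/sec as a float, or `None` when the run fails (for example on OOM).

```python
from typing import Callable, Optional

# Context steps copied from state.json in this commit.
CONTEXT_STEPS = [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]


def find_best_context(
    run_benchmark: Callable[[int], Optional[float]],
    steps: list[int] = CONTEXT_STEPS,
    min_tps: float = 5.0,
) -> Optional[dict]:
    """Walk the context steps in order and return the last config that worked.

    run_benchmark is a hypothetical callback: it takes num_ctx and returns
    tokens/sec, or None if the run failed (e.g. out of memory).
    """
    best = None
    for num_ctx in steps:
        tps = run_benchmark(num_ctx)
        # Stop at the first failure or when speed drops below the floor.
        if tps is None or tps < min_tps:
            break
        best = {"num_ctx": num_ctx, "tokens_per_sec": tps}
    return best


if __name__ == "__main__":
    # Made-up measurements purely for illustration; not real benchmark data.
    fake_results = {32768: 20.0, 65536: 12.0, 98304: 4.0}
    print(find_best_context(lambda ctx: fake_results.get(ctx)))
```

In practice the callback would wrap steps 4 to 6 of the README (create the test Modelfile, run the benchmark prompts, read the speed), and the returned dictionary is what would be recorded under `best_configs` in `state.json`.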