Add AI model optimization cron job #2
@@ -1,13 +1,13 @@
|
|||||||
# AI Model Optimization Cron Job
|
# AI Model Optimizer - Manual Skill
|
||||||
|
|
||||||
**Purpose:** Automatically find optimal ollama/llama.cpp configurations for maximum context size and hardware utilization.
|
**Purpose:** Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.
|
||||||
|
|
||||||
**Schedule:** Every hour
|
**Usage:** Run manually via Hermes skill when needed (not automated).
|
||||||
|
|
||||||
**Hardware:**
|
**Hardware:**
|
||||||
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
|
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
|
||||||
- 128GB system RAM
|
- 128GB system RAM
|
||||||
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1
|
- ROCm: `HSA_OVERRIDE_GFX_VERSION=9.0.6`, `HIP_VISIBLE_DEVICES=0,1`
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -21,21 +21,33 @@ REPO: /opt/data/infra (persistent - do not reclone)
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# From Hermes container or any machine with ollama access
|
||||||
|
ollama-test-model --model devstral-small-2:24b --ctx 65536
|
||||||
|
```
|
||||||
|
|
||||||
|
Or use the full workflow skill for systematic testing.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Model Queues
|
## Model Queues
|
||||||
|
|
||||||
### GPU Track (Coding - prioritize speed + context on GPU)
|
### GPU Track (Coding - prioritize speed + context on GPU)
|
||||||
1. `devstral-small-2:24b`
|
1. `deepseek-coder-v2:16b` - Best coding model, fits on GPU
|
||||||
2. `qwen2.5-coder:32b`
|
2. `qwen2.5-coder:32b` - Alternative coding model
|
||||||
3. `codellama:34b-instruct`
|
3. `codellama:34b-instruct` - Legacy option
|
||||||
|
|
||||||
### RAM Track (Knowledge - prioritize max context)
|
### RAM Track (Knowledge - prioritize max context)
|
||||||
1. `qwen2.5:72b`
|
1. `qwen2.5:72b` - Large knowledge model
|
||||||
2. `nemotron-3-nano:30b`
|
2. `nemotron-3-nano:30b` - Efficient large model
|
||||||
3. `mixtral:8x7b-instruct`
|
3. `mixtral:8x7b-instruct` - MoE architecture
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Context Steps (in order)
|
## Context Steps (in order)
|
||||||
|
|
||||||
```
|
```
|
||||||
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
|
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
|
||||||
```
|
```
|
||||||
@@ -45,45 +57,49 @@ REPO: /opt/data/infra (persistent - do not reclone)
|
|||||||
## Optimization Strategy
|
## Optimization Strategy
|
||||||
|
|
||||||
### GPU Track (Coding)
|
### GPU Track (Coding)
|
||||||
- Start: num_ctx=32768, num_gpu=99, flash_attn=true
|
- Start: `num_ctx=32768`, `num_gpu=99`, `flash_attn=true`
|
||||||
- Increase context until OOM or tokens/sec < 5
|
- Increase context until OOM or tokens/sec < 5
|
||||||
- Record best config before hitting wall
|
- Record best config before hitting wall
|
||||||
- Target: >10 tokens/sec with max context
|
- Target: >10 tokens/sec with max context
|
||||||
|
|
||||||
### RAM Track (Knowledge)
|
### RAM Track (Knowledge)
|
||||||
- Start: num_ctx=65536, num_gpu=50, flash_attn=true
|
- Start: `num_ctx=65536`, `num_gpu=50`, `flash_attn=true`
|
||||||
- Allow heavy RAM offload (up to 100GB system RAM)
|
- Allow heavy RAM offload (up to 100GB system RAM)
|
||||||
- Increase context until OOM
|
- Increase context until OOM
|
||||||
- Speed secondary to context size
|
- Speed secondary to context size
|
||||||
|
|
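The two strategies above reduce to one stopping rule per track. The sketch below is one possible reading of that rule, not the skill's actual implementation: the function name `next_context_step` and its arguments are hypothetical, and the thresholds are taken from the bullets above.

```python
from typing import Optional

# Context sizes from the "Context Steps" section, in test order.
CONTEXT_STEPS = [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
MIN_TOKENS_PER_SEC = 5.0  # GPU-track floor: stop once throughput drops below this


def next_context_step(current_ctx: int, oom: bool, tokens_per_sec: float,
                      track: str = "gpu") -> Optional[int]:
    """Return the next num_ctx to test, or None when scaling should stop.

    Hypothetical helper: raises ValueError if current_ctx is not a known step.
    """
    if oom:
        return None  # both tracks stop on out-of-memory
    # Only the GPU track stops on low throughput; the RAM track scales until OOM.
    if track == "gpu" and tokens_per_sec < MIN_TOKENS_PER_SEC:
        return None
    idx = CONTEXT_STEPS.index(current_ctx)
    if idx + 1 >= len(CONTEXT_STEPS):
        return None  # already at the largest step
    return CONTEXT_STEPS[idx + 1]
```

When the function returns `None`, the last successful configuration is the one to record as the best for that model.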
||||||
---
|
---
|
||||||
|
|
||||||
## Each Run - Step by Step
|
## Manual Testing Workflow
|
||||||
|
|
||||||
|
### 1. Quick Model Test
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Test a model at a specific context size
|
||||||
|
docker exec ollama ollama run <model>:<tag> "Your prompt here"
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Check Current State
|
||||||
|
|
||||||
### 1. Read State
|
|
||||||
```bash
|
```bash
|
||||||
cd /opt/data/infra
|
cd /opt/data/infra
|
||||||
cat assets/ai-optimizer/state.json
|
cat assets/ai-optimizer/state.json
|
||||||
```
|
```
|
||||||
|
|
||||||
### 2. Determine Next Test
|
|
||||||
- Read `track` (gpu or ram)
|
|
||||||
- Get `current_model` from queue at `model_index`
|
|
||||||
- Get `current_config` for parameters to test
|
|
||||||
- Select next context step from `context_steps`
|
|
||||||
|
|
||||||
### 3. Pull Model (if needed)
|
### 3. Pull Model (if needed)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker exec ollama ollama list | grep -q "<model>" || docker exec ollama ollama pull <model>
|
docker exec ollama ollama pull <model>:<tag>
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. Create Test Modelfile
|
### 4. Create Test Modelfile
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
|
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
|
||||||
FROM ${model}
|
FROM ${model}
|
||||||
PARAMETER num_ctx ${current_config.num_ctx}
|
PARAMETER num_ctx ${num_ctx}
|
||||||
PARAMETER num_gpu ${current_config.num_gpu}
|
PARAMETER num_gpu ${num_gpu}
|
||||||
PARAMETER flash_attn ${current_config.flash_attn}
|
PARAMETER flash_attn true
|
||||||
PARAMETER num_predict 4096
|
PARAMETER num_predict 4096
|
||||||
PARAMETER num_keep 1024
|
PARAMETER num_keep 1024
|
||||||
PARAMETER repeat_penalty 1.1
|
PARAMETER repeat_penalty 1.1
|
||||||
@@ -93,6 +109,7 @@ docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.model
|
|||||||
```
|
```
|
||||||
|
|
||||||
### 5. Run Benchmark
|
### 5. Run Benchmark
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Warm up
|
# Warm up
|
||||||
docker exec ollama ollama run test-model "Hello" > /dev/null
|
docker exec ollama ollama run test-model "Hello" > /dev/null
|
||||||
@@ -104,7 +121,8 @@ docker exec ollama ollama run test-model "Write a Python async context manager t
|
|||||||
docker exec ollama ollama run test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
|
docker exec ollama ollama run test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
|
||||||
```
|
```
|
||||||
|
|
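To turn a benchmark run into a tokens/sec figure, the timing statistics have to be extracted from the ollama output. The sketch below assumes the benchmark is invoked with `ollama run --verbose` and that the captured text contains an `eval rate: ... tokens/s` line; the helper name `parse_eval_rate` is hypothetical, and the sample text is illustrative rather than a recorded run.

```python
import re
from typing import Optional

# Matches the generation rate line but not "prompt eval rate", because the
# pattern is anchored to the start of a line (re.MULTILINE).
_EVAL_RATE = re.compile(r"^[ \t]*eval rate:\s*([\d.]+)\s*tokens/s", re.MULTILINE)


def parse_eval_rate(output: str) -> Optional[float]:
    """Return generation speed in tokens/sec, or None if no rate line is found."""
    match = _EVAL_RATE.search(output)
    return float(match.group(1)) if match else None


# Illustrative sample only; the numbers are made up for this example.
SAMPLE = """\
prompt eval rate:     120.50 tokens/s
eval count:           298 token(s)
eval rate:            15.20 tokens/s
"""
```

Returning `None` instead of raising lets the caller treat a missing statistic as a failed measurement without aborting the whole run.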
||||||
### 6. Measure VRAM (if possible)
|
### 6. Measure VRAM
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Try host first
|
# Try host first
|
||||||
rocm-smi --showmeminfo vram 2>/dev/null || \
|
rocm-smi --showmeminfo vram 2>/dev/null || \
|
||||||
@@ -114,24 +132,14 @@ echo "VRAM unavailable"
|
|||||||
```
|
```
|
||||||
|
|
||||||
### 7. Record Results
|
### 7. Record Results
|
||||||
- Parse tokens/sec from ollama output
|
|
||||||
- Record VRAM/RAM usage
|
|
||||||
- Update `best_configs` if improved
|
|
||||||
|
|
||||||
### 8. Update State
|
Update `state.json` and append to `results.csv`:
|
||||||
```python
|
- tokens/sec from ollama output
|
||||||
if test_successful:
|
- VRAM/RAM usage
|
||||||
if context_step < max_reached:
|
- Whether this config is the new best
|
||||||
current_config.num_ctx = next_context_step
|
|
||||||
else:
|
### 8. Commit Changes
|
||||||
model_index += 1
|
|
||||||
current_config.num_ctx = context_steps[0]
|
|
||||||
else:
|
|
||||||
best_configs[track][current_model] = last_good_config
|
|
||||||
model_index += 1
|
|
||||||
```
|
|
||||||
|
|
||||||
### 9. Commit to Repo
|
|
||||||
```bash
|
```bash
|
||||||
cd /opt/data/infra
|
cd /opt/data/infra
|
||||||
git add assets/ai-optimizer/
|
git add assets/ai-optimizer/
|
||||||
@@ -139,15 +147,6 @@ git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
|
|||||||
git push
|
git push
|
||||||
```
|
```
|
||||||
|
|
||||||
### 10. Matrix Notification (if available)
|
|
||||||
```python
|
|
||||||
import os
|
|
||||||
if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
|
|
||||||
# Send notification
|
|
||||||
pass
|
|
||||||
# Else: silent
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## State File Structure
|
## State File Structure
|
||||||
@@ -155,7 +154,7 @@ if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
|
|||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"track": "gpu",
|
"track": "gpu",
|
||||||
"current_model": "devstral-small-2:24b",
|
"current_model": "deepseek-coder-v2:16b",
|
||||||
"model_index": 0,
|
"model_index": 0,
|
||||||
"phase": "context_scaling",
|
"phase": "context_scaling",
|
||||||
"backend": "ollama",
|
"backend": "ollama",
|
||||||
@@ -169,10 +168,10 @@ if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
|
|||||||
"ram": {}
|
"ram": {}
|
||||||
},
|
},
|
||||||
"completed_models": [],
|
"completed_models": [],
|
||||||
"gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
|
"gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
|
||||||
"ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
|
"ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
|
||||||
"context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
|
"context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
|
||||||
"last_updated": "2026-04-28T17:00:00Z"
|
"last_updated": "2026-04-30T00:00:00Z"
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
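The state file carries enough information to decide what to test next. The following is a minimal sketch of that transition, assuming the keys shown above plus `current_config` and `best_configs` as referenced in the workflow; `advance_state` is a hypothetical name. Unlike the removed pseudocode, it records the best config on every success rather than only on failure.

```python
import copy


def advance_state(state: dict, test_successful: bool) -> dict:
    """Return a new state after one test; the input state is left unmodified."""
    new = copy.deepcopy(state)
    steps = new["context_steps"]
    queue = new[f"{new['track']}_queue"]
    cfg = new["current_config"]
    idx = steps.index(cfg["num_ctx"])

    if test_successful:
        # Remember the config that just worked as the best so far.
        new["best_configs"][new["track"]][new["current_model"]] = dict(cfg)
        if idx + 1 < len(steps):
            cfg["num_ctx"] = steps[idx + 1]  # try the next context size
            return new

    # Test failed, or the largest step succeeded: move on to the next model.
    new["completed_models"].append(new["current_model"])
    new["model_index"] += 1
    cfg["num_ctx"] = steps[0]
    if new["model_index"] < len(queue):
        new["current_model"] = queue[new["model_index"]]
    return new
```

Working on a deep copy keeps the previous state available, which makes it easier to write the old and new values side by side in a commit message or log.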
||||||
@@ -182,33 +181,29 @@ if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
|
|||||||
|
|
||||||
```csv
|
```csv
|
||||||
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
|
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
|
||||||
2026-04-28T17:00:00Z,gpu,devstral-small-2:24b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
|
2026-04-30T00:00:00Z,gpu,deepseek-coder-v2:16b,ollama,context_scaling,65536,99,true,15.2,52.1,18.4,success,false
|
||||||
```
|
```
|
||||||
|
|
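Appending a row in this format is straightforward with the standard library. The sketch below assumes the column order from the header above; the function name `append_result` is hypothetical, and the header is written only when the file is new or empty.

```python
import csv
from pathlib import Path

# Column order copied from the results.csv header above.
FIELDS = [
    "timestamp", "track", "model", "backend", "phase", "num_ctx", "num_gpu",
    "flash_attn", "tokens_per_sec", "vram_gb", "ram_gb", "status", "is_best",
]


def append_result(path, row: dict) -> None:
    """Append one benchmark result, writing the header if the file is new or empty."""
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        # Missing keys are written as empty cells; unknown keys raise ValueError.
        writer.writerow(row)
```

Using `csv.DictWriter` rather than string formatting keeps model names that contain commas or quotes from corrupting the file.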
||||||
---
|
---
|
||||||
|
|
||||||
## Stop Conditions
|
## Skill Usage
|
||||||
|
|
||||||
1. All models in both queues have `best_configs` recorded
|
Once PR #1 (ai-worker-restricted-access) is merged:
|
||||||
2. Manual intervention needed (error in state.json `error` field)
|
|
||||||
3. No progress for 3 consecutive runs
|
|
||||||
|
|
||||||
---
|
```bash
|
||||||
|
# From Hermes container, SSH to host for direct ollama access
|
||||||
|
ssh -i /path/to/key ai-worker@host docker exec ollama ollama run <model>
|
||||||
|
|
||||||
## Error Handling
|
# Or run the skill directly
|
||||||
|
ollama-benchmark --model deepseek-coder-v2:16b --track gpu
|
||||||
If any step fails:
|
```
|
||||||
1. Log error: `"error": {"message": "...", "timestamp": "..."}`
|
|
||||||
2. Do NOT increment model_index (retry next run)
|
|
||||||
3. Commit state with error field
|
|
||||||
4. Exit gracefully
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- **No num_parallel**: Removed to avoid limiting other settings
|
- **Manual execution only** - No cron job, run when needed
|
||||||
- **Two tracks**: Complete GPU track first, then RAM track
|
- **Two tracks**: Complete GPU track first (coding models), then RAM track
|
||||||
- **Backend**: Start with ollama, llama.cpp optional
|
- **Backend**: ollama (llama.cpp optional for advanced users)
|
||||||
- **Host access**: Use docker exec or SSH for rocm-smi
|
- **Host access**: Use docker exec or SSH for rocm-smi
|
||||||
- **Ask before deploy**: Show diff before `nh os switch`
|
- **Commit results**: Push best configs to repo for reference
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
{
|
{
|
||||||
"track": "gpu",
|
"track": "gpu",
|
||||||
"current_model": "devstral-small-2:24b",
|
"current_model": "deepseek-coder-v2:16b",
|
||||||
"model_index": 0,
|
"model_index": 0,
|
||||||
"phase": "context_scaling",
|
"phase": "context_scaling",
|
||||||
"backend": "ollama",
|
"backend": "ollama",
|
||||||
@@ -14,8 +14,8 @@
|
|||||||
"ram": {}
|
"ram": {}
|
||||||
},
|
},
|
||||||
"completed_models": [],
|
"completed_models": [],
|
||||||
"gpu_queue": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
|
"gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
|
||||||
"ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
|
"ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
|
||||||
"context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
|
"context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
|
||||||
"last_updated": "2026-04-28T17:00:00Z"
|
"last_updated": "2026-04-30T00:00:00Z"
|
||||||
}
|
}
|
||||||
|
|||||||