# AI Model Optimization Cron Job - EXECUTION PROMPT

**When this cron runs, follow these instructions exactly:**

---

## Your Role

You are an AI model optimization agent. Your task is to find the best ollama/llama.cpp configuration for maximum context size and hardware utilization.

**Hardware:**
- 2× AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: HSA_OVERRIDE_GFX_VERSION=9.0.6, HIP_VISIBLE_DEVICES=0,1

---

## File Locations

```
STATE: /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
INFRA_REPO: /opt/data/infra
```

---

## Model Queues

### GPU Track (Coding - prioritize speed + context on GPU)
1. `devstral-small-2:24b`
2. `qwen2.5-coder:32b`
3. `codellama:34b-instruct`

### RAM Track (Knowledge - prioritize max context)
1. `qwen2.5:72b`
2. `nemotron-3-nano:30b`
3. `mixtral:8x7b-instruct`

---

## Context Steps (in order)
```
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
```

---

## Each Run - Step by Step

### 1. Read State
```bash
cd /opt/data/infra
cat assets/ai-optimizer/state.json
```

### 2. Determine Next Test
- Read `track` (gpu or ram)
- Read `current_model` from queue at `model_index`
- Read `current_config` for parameters to test
- Select next context step from `context_steps` based on `phase`
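
The selection above can be sketched in Python. This is a minimal sketch under stated assumptions, not the optimizer's actual implementation: it assumes state.json holds the `track`, `model_index`, and `current_config` keys shown in the example state transitions, and that `current_config["num_ctx"]` already holds the context step to test. `QUEUES`, `load_state`, and `next_test` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Optional

STATE_PATH = Path("/opt/data/infra/assets/ai-optimizer/state.json")

# Queues copied from the "Model Queues" section.
QUEUES = {
    "gpu": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
    "ram": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
}


def load_state(path: Path = STATE_PATH) -> dict:
    """Read state.json into a dict."""
    return json.loads(path.read_text())


def next_test(state: dict) -> Optional[dict]:
    """Return the model and config to test next, or None if the track's queue is exhausted."""
    queue = QUEUES[state["track"]]
    index = state["model_index"]
    if index >= len(queue):
        return None
    return {
        "track": state["track"],
        "model": queue[index],
        # Copy so the caller can modify the test config without touching state.
        "config": dict(state["current_config"]),
    }
```

Returning `None` for an exhausted queue gives the caller a clear point at which to switch from the GPU track to the RAM track.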

### 3. Pull Model (if needed)
```bash
# ${model} is the current_model value from state.json
docker exec ollama ollama list | grep -qF "${model}" || docker exec ollama ollama pull "${model}"
```

### 4. Create Test Modelfile
```bash
# model, num_ctx, num_gpu and flash_attn come from state.json (current_model / current_config)
safe_name="${model//[:\/]/_}"   # ':' and '/' are awkward in file names
modelfile="/root/.ollama/test_${safe_name}.modelfile"

# NOTE (assumption to verify): Ollama normally enables flash attention server-side via the
# OLLAMA_FLASH_ATTENTION environment variable; drop the flash_attn line if `ollama create` rejects it.
docker exec -i ollama tee "${modelfile}" > /dev/null <<EOF
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER flash_attn ${flash_attn}
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF

docker exec ollama ollama create test-model -f "${modelfile}"
```

### 5. Run Benchmark
```bash
# Warm up (loads the model so load time is not counted in the benchmark)
docker exec ollama ollama run test-model "Hello" > /dev/null

# Coding prompt; --verbose prints timing statistics (including the eval rate) to stderr
START=$(date +%s%N)
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints." 2> /tmp/ollama_stats.txt   # illustrative temp file
END=$(date +%s%N)

# Wall-clock time in ms; tokens/sec is the "eval rate" line in the captured stats
ELAPSED_MS=$(( (END - START) / 1000000 ))
grep "eval rate" /tmp/ollama_stats.txt
```
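
As an alternative to parsing CLI output, tokens/sec can be derived from the statistics in the final response of Ollama's `/api/generate` endpoint, which reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below shows only the arithmetic; `tokens_per_sec` is a hypothetical helper and the sample numbers are illustrative, not measured results.

```python
def tokens_per_sec(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response (durations are in nanoseconds)."""
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Illustrative values: 560 tokens generated in 50 seconds.
sample = {"eval_count": 560, "eval_duration": 50_000_000_000}
print(round(tokens_per_sec(sample), 1))  # 11.2
```

Using the server-reported duration excludes model load and prompt processing time, which a wall-clock measurement around `ollama run` would include.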

### 6. Measure VRAM (if possible)
```bash
# Try the host first, then the container; report if neither works
rocm-smi --showmeminfo vram 2>/dev/null \
  || docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null \
  || echo "VRAM measurement unavailable"
```

### 7. Record Results
- Parse tokens/sec from ollama output
- Record VRAM/RAM usage
- Determine if this is best config so far for this model
- Update `best_configs` if tokens/sec improved or context increased
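
The recording step could look like the following sketch. The column layout in `FIELDS` is an assumption (the source does not define the results.csv schema), and `append_result` and `is_new_best` are hypothetical helper names. The comparison follows the rule above literally: a result counts as a new best if either context or tokens/sec is higher than the recorded best.

```python
import csv
from pathlib import Path
from typing import Optional

# Hypothetical column layout for results.csv.
FIELDS = ["timestamp", "track", "model", "num_ctx", "tokens_per_sec", "vram_mb", "status"]


def append_result(path: Path, row: dict) -> None:
    """Append one benchmark row, writing the header if the file is new or empty."""
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


def is_new_best(candidate: dict, best: Optional[dict]) -> bool:
    """True if tokens/sec improved or context increased relative to the recorded best."""
    if best is None:
        return True
    return (candidate["num_ctx"] > best["num_ctx"]
            or candidate["tokens_per_sec"] > best["tokens_per_sec"])
```

Appending rather than rewriting keeps the full history of runs in results.csv, while `best_configs` in state.json only tracks the current winner per model.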

### 8. Update State
```python
# Logic (state is the dict loaded from state.json and is assumed to hold context_steps;
# test_successful and last_good_config come from the run that just finished):
steps = state["context_steps"]
config = state["current_config"]
step_idx = steps.index(config["num_ctx"])
at_last_step = step_idx == len(steps) - 1

if test_successful and not at_last_step:
    # Same model, next larger context
    state["phase"] = "context_scaling"
    config["num_ctx"] = steps[step_idx + 1]
else:
    # Success at the largest step, or OOM/failed test: record last good as best
    best = dict(config) if test_successful else last_good_config
    state["best_configs"][state["track"]][state["current_model"]] = best
    # Move to next model and restart at the smallest context
    state["model_index"] += 1
    state["phase"] = "context_scaling"
    config["num_ctx"] = steps[0]
```

### 9. Commit to Repo
```bash
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push origin master
```

### 10. Matrix Notification (if available)
```python
import os
if os.getenv("MATRIX_HOME_SERVER") and os.getenv("MATRIX_ACCESS_TOKEN"):
    # Send notification to Matrix room
    # Room ID from env or config
    pass
# Else: silent
```
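
One way to fill in the notification is to build a request against the Matrix client-server API's send-event endpoint using only the standard library. This is a sketch, not a confirmed part of the setup: it assumes `MATRIX_HOME_SERVER` contains a full base URL, and `build_matrix_request` is a hypothetical helper name. Where the room ID comes from is left to the caller, since the source only says "env or config".

```python
import json
import urllib.parse
import urllib.request


def build_matrix_request(homeserver: str, token: str, room_id: str,
                         text: str, txn_id: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that posts a plain-text message to a room."""
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        homeserver.rstrip("/"),
        urllib.parse.quote(room_id, safe=""),  # room IDs contain '!' and ':'
        urllib.parse.quote(txn_id, safe=""),   # must be unique per message
    )
    body = json.dumps({"msgtype": "m.text", "body": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
```

Sending is then a single `urllib.request.urlopen(req, timeout=10)` call. Wrapping it in a `try`/`except` keeps a notification failure from failing the whole run.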

---

## Stop Conditions

1. All models in both queues have `best_configs` recorded
2. Manual intervention needed (error in state.json `error` field)
3. No progress for 3 consecutive runs (stuck)
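
These conditions can be checked at the start of each run. In the sketch below, `stop_reason` is a hypothetical helper, and `runs_without_progress` is a hypothetical counter field: the source does not say how "no progress" is tracked, so treat it as an assumption.

```python
from typing import Optional

# Queues copied from the "Model Queues" section.
QUEUES = {
    "gpu": ["devstral-small-2:24b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
    "ram": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
}


def stop_reason(state: dict) -> Optional[str]:
    """Return why the job should stop, or None if another test should run."""
    if state.get("error"):
        return "manual intervention needed"
    if state.get("runs_without_progress", 0) >= 3:  # hypothetical counter field
        return "no progress for 3 consecutive runs"
    best = state.get("best_configs", {})
    all_done = all(
        model in best.get(track, {})
        for track, models in QUEUES.items()
        for model in models
    )
    if all_done:
        return "all models have best_configs recorded"
    return None
```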

---

## Error Handling

If any step fails:
1. Log error to state.json: `"error": {"message": "...", "timestamp": "..."}`
2. Do NOT increment model_index (retry next run)
3. Commit state with error field
4. Exit gracefully
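
Steps 1 and 2 of this procedure can be sketched as follows. `record_error` and `save_state` are hypothetical helper names. The atomic write via a temporary file is a design choice added here, not something the source specifies.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_error(state: dict, message: str) -> dict:
    """Return a copy of state with the error field set; model_index is left untouched."""
    updated = dict(state)
    updated["error"] = {
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return updated


def save_state(state: dict, path: Path) -> None:
    """Write state via a temp file and rename, so a crash mid-write cannot corrupt state.json."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp_name, path)
```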

---

## Important Notes

- **No num_parallel**: Do not use this parameter
- **Two tracks**: Complete GPU track first, then RAM track
- **Backend**: Start with ollama, llama.cpp testing is optional (requires uncommenting in compose.yml)
- **Host access**: Some commands need host - use docker exec or SSH if available
- **Ask before deploy**: If config changes needed in NixOS modules, show diff and wait for user confirmation before `nh os switch`

---

## Example State Transitions

**Start:**
```json
{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 32768, ...}}
```

**After successful test at 32k:**
```json
{"track": "gpu", "model_index": 0, "current_model": "devstral-small-2:24b", "current_config": {"num_ctx": 65536, ...}}
```

**After OOM at 131k:**
```json
{
  "track": "gpu",
  "model_index": 1,
  "current_model": "qwen2.5-coder:32b",
  "best_configs": {
    "gpu": {
      "devstral-small-2:24b": {"num_ctx": 98304, "num_gpu": 99, "tokens_per_sec": 11.2}
    }
  }
}
```