Revert "feat: add ai-optimizer benchmark plan and state tracking for ollama GPU benchmarking"
This reverts commit ff7303cf6a.
@@ -1,194 +0,0 @@
# AI Model Optimizer - Ollama GPU Benchmark Plan

**Purpose:** Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.

**Hardware:**
- 2x AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: `HSA_OVERRIDE_GFX_VERSION=9.0.6`, `HIP_VISIBLE_DEVICES=0,1`

---

## File Locations

```
STATE: /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
REPO: /opt/data/infra (persistent clone)
```

---

## Model Queues

### GPU Track (Coding - prioritize speed + context on GPU)
1. `deepseek-coder-v2:16b` - Best coding model, fits on GPU
2. `qwen2.5-coder:32b` - Alternative coding model
3. `codellama:34b-instruct` - Legacy option

### RAM Track (Knowledge - prioritize max context)
1. `qwen2.5:72b` - Large knowledge model
2. `nemotron-3-nano:30b` - Efficient large model
3. `mixtral:8x7b-instruct` - MoE architecture

---

## Context Steps (in order)

```
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
```

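
Walking the context steps in order can be sketched as a small shell helper. This is a minimal illustration, not part of the reverted plan: the function name `next_ctx` and the `CONTEXT_STEPS` variable are hypothetical, and only the step values come from the list above.

```bash
# Context steps copied from the plan; the variable name is an assumption.
CONTEXT_STEPS="32768 65536 98304 131072 163840 200704 262144 327680"

# next_ctx CURRENT: print the step that follows CURRENT.
# Returns 1 and prints nothing when CURRENT is the last step
# or does not appear in the list.
next_ctx() {
  _found=0
  for _step in $CONTEXT_STEPS; do
    if [ "$_found" -eq 1 ]; then
      echo "$_step"
      return 0
    fi
    if [ "$_step" = "$1" ]; then
      _found=1
    fi
  done
  return 1
}

# Hypothetical usage:
#   num_ctx=$(next_ctx "$num_ctx") || echo "no further context step"
```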
---

## Optimization Strategy

### GPU Track (Coding)
- Start: `num_ctx=32768`, `num_gpu=99`, `flash_attn=true`
- Increase context until OOM or tokens/sec < 5
- Record best config before hitting wall
- Target: >10 tokens/sec with max context

### RAM Track (Knowledge)
- Start: `num_ctx=65536`, `num_gpu=50`, `flash_attn=true`
- Allow heavy RAM offload (up to 100GB system RAM)
- Increase context until OOM
- Speed secondary to context size

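
The stop rules for the two tracks can be expressed as a short decision helper. The sketch below is an illustration under assumptions: the function names are hypothetical, and the status value `ok` is assumed because the plan does not define the allowed values of the `status` column. The thresholds (5 and 10 tokens/sec) are the ones stated above.

```bash
# should_continue TRACK STATUS TOKENS_PER_SEC
# Exit status 0 means "try the next context step", 1 means "stop here".
# The status string "ok" is an assumption, not defined by the plan.
should_continue() {
  _track="$1"; _status="$2"; _tps="$3"
  # Any failed run (for example OOM) ends the scan on both tracks.
  if [ "$_status" != "ok" ]; then
    return 1
  fi
  if [ "$_track" = "gpu" ]; then
    # GPU track: stop once throughput drops below 5 tokens/sec.
    if awk -v tps="$_tps" 'BEGIN { exit !(tps + 0 >= 5) }'; then
      return 0
    else
      return 1
    fi
  fi
  # RAM track: speed is secondary, so only a failed run stops the scan.
  return 0
}

# meets_gpu_target TOKENS_PER_SEC: succeeds when above 10 tokens/sec.
meets_gpu_target() {
  awk -v tps="$1" 'BEGIN { exit !(tps + 0 > 10) }'
}
```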
---

## Prerequisites

This PR adds the `ai-worker` user with docker group access. After merge:

```bash
# SSH from Hermes container to run benchmarks on the host
ssh -i /path/to/key ai-worker@host docker exec ollama ollama list

# Or if running directly on host
docker exec ollama ollama list
```

---

## Manual Testing Workflow

### 1. Quick Model Test

```bash
docker exec ollama ollama run <model>:<tag> "Your prompt here"
```

### 2. Check Current State

```bash
cd /opt/data/infra
cat assets/ai-optimizer/state.json
```

### 3. Pull Model (if needed)

```bash
docker exec ollama ollama pull <model>:<tag>
```

### 4. Create Test Modelfile

```bash
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER flash_attn true
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"

docker exec ollama ollama create test-model -f /root/.ollama/test_${model}.modelfile
```

### 5. Run Benchmark

```bash
# Warm up
docker exec ollama ollama run test-model "Hello" > /dev/null

# Coding prompt
docker exec ollama ollama run test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."

# Knowledge prompt
docker exec ollama ollama run test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
```

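
The benchmark commands do not print timing statistics on their own. One way to obtain a tokens/sec figure, sketched here under assumptions, is to add the `--verbose` flag of `ollama run` and extract the `eval rate:` line from its timing summary. The helper name `eval_rate` is hypothetical, and the exact layout of the summary should be checked against the installed ollama version.

```bash
# eval_rate: read `ollama run --verbose` output on stdin and print the
# generation speed in tokens/sec. The "prompt eval rate:" line is skipped
# because it measures prompt processing, not generation.
# Returns non-zero when no "eval rate:" line is present.
eval_rate() {
  awk '$1 == "eval" && $2 == "rate:" { print $3; found = 1 }
       END { exit !found }'
}

# Hypothetical usage (2>&1 merges the timing summary into the pipe in
# case it is written to stderr):
#   docker exec ollama ollama run --verbose test-model "Hello" 2>&1 | eval_rate
```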
### 6. Measure VRAM

```bash
# Try host first
rocm-smi --showmeminfo vram 2>/dev/null || \
# Try via docker
docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null || \
echo "VRAM unavailable"
```

### 7. Record Results

Update `state.json` and append to `results.csv`:
- tokens/sec from ollama output
- VRAM/RAM usage
- Whether this config is the new best

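
Appending a row to `results.csv` can be sketched as follows. This is an illustrative helper, not part of the plan: the name `csv_row` is hypothetical, and the argument order simply mirrors the 13-column header (`timestamp` first, then the remaining 12 fields). It assumes no field contains a comma.

```bash
# csv_row TRACK MODEL BACKEND PHASE NUM_CTX NUM_GPU FLASH_ATTN \
#         TOKENS_PER_SEC VRAM_GB RAM_GB STATUS IS_BEST
# Prints one results.csv row with a UTC timestamp in the first column.
csv_row() {
  if [ "$#" -ne 12 ]; then
    echo "csv_row: expected 12 fields, got $#" >&2
    return 1
  fi
  _row=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  # Join the timestamp and the 12 arguments with commas.
  for _field in "$@"; do
    _row="$_row,$_field"
  done
  echo "$_row"
}

# Hypothetical usage, with every value supplied by the operator:
#   csv_row "$track" "$model" ollama context_scaling "$num_ctx" "$num_gpu" \
#     true "$tps" "$vram_gb" "$ram_gb" "$status" "$is_best" \
#     >> /opt/data/infra/assets/ai-optimizer/results.csv
```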
### 8. Commit Changes

```bash
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push
```

---

## State File Structure

```json
{
  "track": "gpu",
  "current_model": "deepseek-coder-v2:16b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 32768,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {},
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
  "last_updated": "2026-04-30T00:00:00Z"
}
```

---

## Results CSV Format

```csv
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
```

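
Given this column layout, the largest successful context per model can be read back from the file with a short `awk` pass. The sketch below is an assumption-laden illustration: the helper name `best_ctx` is hypothetical, the status value `ok` is assumed because the plan does not define the allowed values, and fields are assumed to contain no commas.

```bash
# best_ctx FILE: for each track/model pair, print the largest num_ctx among
# rows whose status is "ok". Column positions follow the header:
# $2 = track, $3 = model, $6 = num_ctx, $12 = status.
best_ctx() {
  awk -F, 'NR > 1 && $12 == "ok" {
             key = $2 "," $3
             if (!(key in best) || $6 + 0 > best[key]) best[key] = $6 + 0
           }
           END { for (key in best) print key "," best[key] }' "$1" | sort
}

# Hypothetical usage:
#   best_ctx /opt/data/infra/assets/ai-optimizer/results.csv
```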
---

## Notes

- **Manual execution** - Run benchmarks when needed, no automated cron job
- **Two tracks**: Complete GPU track first (coding models), then RAM track
- **Backend**: ollama (llama.cpp optional for advanced users)
- **Host access**: Use docker exec (or SSH via ai-worker) for rocm-smi
- **Commit results**: Push best configs to repo for reference
@@ -1 +0,0 @@
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
@@ -1,21 +0,0 @@
{
  "track": "gpu",
  "current_model": "deepseek-coder-v2:16b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 32768,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {},
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
  "last_updated": "2026-05-09T00:00:00Z"
}