# AI Model Optimizer - Ollama GPU Benchmark Plan

**Purpose:** Find optimal ollama configurations for maximum context size and GPU utilization on AMD MI50 GPUs.

**Hardware:**
- 2x AMD MI50 GPUs (32GB VRAM each, 64GB total)
- 128GB system RAM
- ROCm: `HSA_OVERRIDE_GFX_VERSION=9.0.6`, `HIP_VISIBLE_DEVICES=0,1`

---

## File Locations

```
STATE:   /opt/data/infra/assets/ai-optimizer/state.json
RESULTS: /opt/data/infra/assets/ai-optimizer/results.csv
REPO:    /opt/data/infra (persistent clone)
```

---

## Model Queues

### GPU Track (Coding - prioritize speed + context on GPU)
1. `deepseek-coder-v2:16b` - Best coding model, fits on GPU
2. `qwen2.5-coder:32b` - Alternative coding model
3. `codellama:34b-instruct` - Legacy option

### RAM Track (Knowledge - prioritize max context)
1. `qwen2.5:72b` - Large knowledge model
2. `nemotron-3-nano:30b` - Efficient large model
3. `mixtral:8x7b-instruct` - MoE architecture

---

## Context Steps (in order)

```
[32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680]
```
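Stepping through this list by hand is easy to get wrong partway through a sweep. A minimal sketch of a helper that returns the next step is below; the name `next_ctx` and the `CONTEXT_STEPS` variable are hypothetical and not part of the repo.

```bash
# Context steps copied from the list above.
CONTEXT_STEPS="32768 65536 98304 131072 163840 200704 262144 327680"

# next_ctx CURRENT (hypothetical helper): print the step that follows CURRENT.
# Returns 1 and prints nothing if CURRENT is the last step or not in the list.
next_ctx() {
  found=0
  for step in $CONTEXT_STEPS; do
    if [ "$found" -eq 1 ]; then
      echo "$step"
      return 0
    fi
    if [ "$step" = "$1" ]; then
      found=1
    fi
  done
  return 1
}
```

For example, `next_ctx 32768` prints `65536`, and `next_ctx 327680` fails because there is no larger step.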

---

## Optimization Strategy

### GPU Track (Coding)
- Start: `num_ctx=32768`, `num_gpu=99` (offload every layer to the GPUs), `flash_attn=true`
- Step up the context until the run hits OOM or drops below 5 tokens/sec
- Record the last config that ran successfully before that limit as the best
- Target: >10 tokens/sec with max context

### RAM Track (Knowledge)
- Start: `num_ctx=65536`, `num_gpu=50` (at most 50 layers on the GPUs, the rest in system RAM), `flash_attn=true`
- Allow heavy RAM offload (up to 100GB system RAM)
- Increase context until OOM
- Speed secondary to context size
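The stop rules for both tracks can be written as a small check. This is only a sketch: the helper names `should_continue` and `meets_target` are hypothetical, and it assumes a successful run is recorded with the status value `ok`, which the plan does not specify.

```bash
# should_continue TRACK STATUS TOKENS_PER_SEC (hypothetical helper)
# Exit status 0 means "try the next context step", 1 means "stop here".
should_continue() {
  track=$1
  status=$2
  tps=$3
  # Any failed run (for example an OOM) ends the sweep on both tracks.
  # Assumption: successful runs carry the status "ok".
  [ "$status" = "ok" ] || return 1
  if [ "$track" = "gpu" ]; then
    # GPU track: stop once throughput drops below 5 tokens/sec.
    awk -v tps="$tps" 'BEGIN { exit !(tps >= 5) }'
    return $?
  fi
  # RAM track: speed is secondary, so keep going until OOM.
  return 0
}

# meets_target TOKENS_PER_SEC: succeeds when the GPU-track target of
# more than 10 tokens/sec is met.
meets_target() {
  awk -v tps="$1" 'BEGIN { exit !(tps > 10) }'
}
```

`awk` handles the comparison because tokens/sec is a decimal value and the shell's `[ ... ]` test only compares integers.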

---

## Prerequisites

This PR adds the `ai-worker` user with docker group access. After merge:

```bash
# SSH from Hermes container to run benchmarks on the host
ssh -i /path/to/key ai-worker@host docker exec ollama ollama list

# Or if running directly on host
docker exec ollama ollama list
```

---

## Manual Testing Workflow

### 1. Quick Model Test

```bash
docker exec ollama ollama run <model>:<tag> "Your prompt here"
```

### 2. Check Current State

```bash
cd /opt/data/infra
cat assets/ai-optimizer/state.json
```

### 3. Pull Model (if needed)

```bash
docker exec ollama ollama pull <model>:<tag>
```

### 4. Create Test Modelfile

```bash
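# Example values from the GPU track; set these for the config under test
model="deepseek-coder-v2:16b"
num_ctx=32768
num_gpu=99

# Flash attention is a server setting (OLLAMA_FLASH_ATTENTION=1 in the ollama
# container's environment). flash_attn is not among the documented Modelfile
# parameters, so it is not set below.
docker exec ollama bash -c "cat <<EOF > /root/.ollama/test_${model}.modelfile
FROM ${model}
PARAMETER num_ctx ${num_ctx}
PARAMETER num_gpu ${num_gpu}
PARAMETER num_predict 4096
PARAMETER num_keep 1024
PARAMETER repeat_penalty 1.1
EOF"

docker exec ollama ollama create test-model -f "/root/.ollama/test_${model}.modelfile"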
```

### 5. Run Benchmark

```bash
# Warm up (loads the model so load time does not skew the timed runs)
docker exec ollama ollama run test-model "Hello" > /dev/null

# Coding prompt (--verbose prints timing stats, including the eval rate in tokens/s)
docker exec ollama ollama run --verbose test-model "Write a Python async context manager that retries a function with exponential backoff, max 5 retries, and logs each attempt using structlog. Include type hints."

# Knowledge prompt
docker exec ollama ollama run --verbose test-model "Explain the complete memory hierarchy in modern GPUs, from registers through L1/L2 caches to VRAM, and how data moves between them during matrix multiplication."
```
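To record tokens/sec, the generation speed has to be pulled out of the timing statistics that `ollama run --verbose` prints. The sketch below assumes those statistics include a line of the form `eval rate: 61.25 tokens/s` alongside a separate `prompt eval rate:` line; the helper name `parse_eval_rate` is hypothetical.

```bash
# parse_eval_rate (hypothetical helper): read `ollama run --verbose` output on
# stdin and print the generation speed in tokens/sec.
parse_eval_rate() {
  # Anchor at the start of the line (allowing leading whitespace) so the
  # "prompt eval rate:" line is skipped. The third field is the number.
  awk '/^[ \t]*eval rate:/ { print $3 }'
}

# Usage (commented out so the snippet has no side effects when sourced):
# docker exec ollama ollama run --verbose test-model "Hello" 2>&1 | parse_eval_rate
```

The `2>&1` merges both output streams, so the statistics are captured regardless of which stream they are written to.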

### 6. Measure VRAM

```bash
# Try the host first, then fall back to the ollama container
rocm-smi --showmeminfo vram 2>/dev/null ||
docker exec --privileged ollama rocm-smi --showmeminfo vram 2>/dev/null ||
echo "VRAM unavailable"
```
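The `vram_gb` column needs a single number rather than raw `rocm-smi` output. A minimal sketch follows, assuming each GPU is reported on a line such as `GPU[0] : VRAM Total Used Memory (B): 12345678`; check this against the output of the installed ROCm version. The helper name `vram_used_gb` is hypothetical.

```bash
# vram_used_gb (hypothetical helper): read `rocm-smi --showmeminfo vram` output
# on stdin and print the used VRAM summed over all GPUs.
# Assumption: the value is in bytes and is the last ": "-separated field.
vram_used_gb() {
  awk -F': ' '
    /VRAM Total Used Memory \(B\)/ { sum += $NF }
    END { printf "%.2f\n", sum / (1024 * 1024 * 1024) }
  '
}

# Usage (commented out so the snippet has no side effects when sourced):
# rocm-smi --showmeminfo vram | vram_used_gb
```

The divisor gives GiB (1024^3 bytes). The plan does not say whether `vram_gb` means GB or GiB, so change the divisor to `1e9` if decimal gigabytes are intended.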

### 7. Record Results

Update `state.json` and append to `results.csv`:
- tokens/sec (the `eval rate` line printed by `ollama run --verbose`)
- VRAM/RAM usage
- Whether this config is the new best
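
Appending rows by hand makes it easy to drop a column. The sketch below writes one row in the column order of the results CSV and adds the header if the file is new. The helper name `append_result` is hypothetical, and the timestamp format is assumed to match `last_updated` in the state file.

```bash
CSV_HEADER="timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best"

# append_result FILE TRACK MODEL BACKEND PHASE NUM_CTX NUM_GPU FLASH_ATTN \
#               TOKENS_PER_SEC VRAM_GB RAM_GB STATUS IS_BEST
# (hypothetical helper) Appends one row to FILE, prefixed with a UTC timestamp.
append_result() {
  file=$1
  shift
  # Write the header first if the file is missing or empty.
  [ -s "$file" ] || echo "$CSV_HEADER" > "$file"
  # Join the remaining 12 arguments with commas after the timestamp.
  row=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  for field in "$@"; do
    row="$row,$field"
  done
  echo "$row" >> "$file"
}
```

The helper does not quote or escape fields, which is fine here because none of the recorded values contain commas.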

### 8. Commit Changes

```bash
cd /opt/data/infra
git add assets/ai-optimizer/
git commit -m "ai-optimizer: tested ${model} at ${num_ctx} ctx - ${status}"
git push
```

---

## State File Structure

```json
{
  "track": "gpu",
  "current_model": "deepseek-coder-v2:16b",
  "model_index": 0,
  "phase": "context_scaling",
  "backend": "ollama",
  "current_config": {
    "num_ctx": 32768,
    "num_gpu": 99,
    "flash_attn": true
  },
  "best_configs": {
    "gpu": {},
    "ram": {}
  },
  "completed_models": [],
  "gpu_queue": ["deepseek-coder-v2:16b", "qwen2.5-coder:32b", "codellama:34b-instruct"],
  "ram_queue": ["qwen2.5:72b", "nemotron-3-nano:30b", "mixtral:8x7b-instruct"],
  "context_steps": [32768, 65536, 98304, 131072, 163840, 200704, 262144, 327680],
  "last_updated": "2026-04-30T00:00:00Z"
}
```

---

## Results CSV Format

```csv
timestamp,track,model,backend,phase,num_ctx,num_gpu,flash_attn,tokens_per_sec,vram_gb,ram_gb,status,is_best
```
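
With this column order, the best context reached by a model can be read back from the CSV as a cross-check on the `is_best` flag. This is a sketch only: `best_ctx` is a hypothetical helper, and it assumes successful runs are recorded with the status `ok`.

```bash
# best_ctx FILE MODEL (hypothetical helper): print the largest num_ctx recorded
# for MODEL among rows whose status is "ok". Prints nothing if there is none.
# Columns used: 3 = model, 6 = num_ctx, 12 = status.
best_ctx() {
  awk -F, -v model="$2" '
    NR > 1 && $3 == model && $12 == "ok" && $6 + 0 > best { best = $6 + 0 }
    END { if (best) print best }
  ' "$1"
}

# Usage (commented out so the snippet has no side effects when sourced):
# best_ctx /opt/data/infra/assets/ai-optimizer/results.csv deepseek-coder-v2:16b
```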

---

## Notes

- **Manual execution** - Run benchmarks when needed, no automated cron job
- **Two tracks**: Complete GPU track first (coding models), then RAM track
- **Backend**: ollama (llama.cpp optional for advanced users)
- **Host access**: Use docker exec (or SSH via ai-worker) for rocm-smi
- **Commit results**: Push best configs to repo for reference

feat: add ai-optimizer benchmark plan and state tracking for ollama GPU benchmarking 2026-05-09 20:13:08 +00:00
