5 changed files with 285 additions and 8 deletions
--- a/.hermes/plans/staging-vm-ci-cd-plan.md
+++ b/.hermes/plans/staging-vm-ci-cd-plan.md
@@ -1,136 +0,0 @@
-# Infrastructure CI/CD + Staging Plan
-
-Date: 2026-05-12
-Status: Draft for review (updated)
-
-## Current State
-
- Gitea Actions workflows exist (PR #21: build-ollama, build-hermes; PR #39: build-nixos)
- act_runner blocked by env var typo (GITEA_RUNNER_REGIS_TOKEN → GITEA_RUNNER_REGISTRATION_TOKEN)
- KVM unavailable currently (VT-x possibly disabled in BIOS)
- NixOS 26.05 on bare metal (Intel Xeon E5-2697 v4, 18 cores, 125GB RAM)
- Docker running: gitea, act_runner, nextcloud, synapse, traefik, etc.
-
-## Architecture Decision: KVM VM (after enabling VT-x in BIOS)
-
-Once Intel VT-x is enabled in BIOS, we run a proper KVM/QEMU virtual machine:
-
-```
-┌─────────────────────────────────────────────────┐
-│ Bare Metal Host (lazyworkhorse)                  │
-│                                                   │
-│  ┌─────────────────┐   ┌─────────────────────┐   │
-│  │ Production       │   │ Staging VM           │   │
-│  │ Docker Compose   │   │ KVM/QEMU             │   │
-│  │ (gitea, nc, ...) │   │ 4 vCPU, 16GB RAM     │   │
-│  │ /mnt/HoardCow/   │   │ 50GB virtual disk    │   │
-│  └─────────────────┘   │ Own NixOS + Docker    │   │
-│                          │ Own volumes (isolated) │   │
-│                          └─────────────────────┘   │
-│                                                   │
-│  ┌─────────────────────────────────────────────┐ │
-│  │ act_runner (Docker)                         │ │
-│  │ → SSH deploy to staging VM                  │ │
-│  │ → Run tests against staging                 │ │
-│  └─────────────────────────────────────────────┘ │
-└─────────────────────────────────────────────────┘
-```
-
-## Data Isolation (Critical)
-
-**Production data is NEVER exposed to staging.**
-
- Staging VM gets its own 50GB virtual disk (QCOW2 image)
- All Docker volumes (DB data, uploads, config) live inside the VM's disk
- Host paths like `/mnt/HoardingCow_docker_data/` are NOT bind-mounted
- VM snapshots before major tests for fast rollback
- Even catastrophic staging failure cannot touch production data
-
-NixOS config approach:
-```nix
-# In hosts/staging/configuration.nix
-let
-  dataRoot = "/var/lib/staging-docker";  # Inside VM disk
-in {
-  virtualisation.oci-containers.containers = {
-    nextcloud = {
-      volumes = [ "${dataRoot}/nextcloud:/var/www/html" ];
-      # Same image, same config, different volume path
-    };
-  };
-}
-```
-
-## Implementation Phases
-
-### Phase 0: Enable KVM
-1. Reboot server, enter BIOS, enable Intel Virtualization Technology (VT-x)
-2. Boot into NixOS
-3. Add to lazyworkhorse configuration.nix:
-   ```nix
-   boot.kernelModules = [ "kvm-intel" "kvm" ];
-   virtualisation.libvirtd.enable = true;
-   users.users.ai-worker.extraGroups = [ "libvirtd" ];
-   ```
-4. nixos-rebuild switch → reboot → verify `ls /dev/kvm`
-
-### Phase 1: Fix CI Runner
-1. Fix env var typo in act_runner config
-2. Merge PR #21 (workflows), #22 (runner), #39 (nixos CI)
-3. Verify runner processes PR builds
-
-### Phase 2: Create Staging VM
-1. Define VM with virsh:
-   - 4 vCPU, 16GB RAM, 50GB QCOW2 disk
-   - Bridge network (192.168.122.0/24 via libvirt default NAT)
-   - Install NixOS via nixos-anywhere or ISO
-2. Deploy NixOS config to staging (imports same modules as production)
-3. Verify Docker and services come up in staging
-
-### Phase 3: CI Deploys to Staging
-1. CI builds config (`nix build .#nixosConfigurations.staging`)
-2. CI deploys: `nixos-rebuild switch --flake .#staging --target-host root@192.168.122.X`
-3. CI runs health checks against staging services
-
-### Phase 4: Accumulate Tests
-1. Create `tests/` directory in infra repo
-2. Each new feature adds its test(s)
-3. All tests run on every PR
-4. Test categories:
-   - Container health (are all services running?)
-   - HTTP response (do endpoints return 200?)
-   - Integration (does feature X still work?)
-   - Regression (did change Y break Z?)
-
-### Phase 5: Auto-Rollback & Deploy
-1. Add auto-rollback to nixos-rebuild:
-   ```nix
-   boot.loader.systemd-boot.autoRollback = true;
-   ```
-2. Or script: switch → health check → rollback on failure
-3. Cron job for automatic nixos-rebuild on merged PRs
-4. Only deploy commits that passed staging CI
-
-## Test Suite Examples
-
-```bash
-# tests/containers_running.sh
-for container in gitea nextcloud synapse traefik; do
-  if ! ssh staging "docker ps --format '{{.Names}}' | grep -q $container"; then
-    echo "FAIL: $container not running"
-    exit 1
-  fi
-done
-
-# tests/endpoints.sh
-curl -sf http://192.168.122.50:3000 > /dev/null || exit 1  # Gitea
-curl -sf http://192.168.122.50:8080 > /dev/null || exit 1  # Nextcloud
-```
-
-## To Be Decided
-
-1. **VM resources**: 4 vCPU / 16GB RAM sufficient?
-2. **Network**: libvirt default NAT (192.168.122.0/24) or dedicated bridge?
-3. **VM disk**: 50GB enough for NixOS + Docker images + volumes?
-4. **Auto-merge**: full auto or with "safe-to-merge" label gate?
-5. **Test runner**: inline bash in Gitea Actions, or separate test script repo?