diff --git a/.hermes/plans/staging-vm-ci-cd-plan.md b/.hermes/plans/staging-vm-ci-cd-plan.md deleted file mode 100644 index aa265d2..0000000 --- a/.hermes/plans/staging-vm-ci-cd-plan.md +++ /dev/null @@ -1,136 +0,0 @@ -# Infrastructure CI/CD + Staging Plan - -Date: 2026-05-12 -Status: Draft for review (updated) - -## Current State - -- Gitea Actions workflows exist (PR #21: build-ollama, build-hermes; PR #39: build-nixos) -- act_runner blocked by env var typo (GITEA_RUNNER_REGIS_TOKEN → GITEA_RUNNER_REGISTRATION_TOKEN) -- KVM unavailable currently (VT-x possibly disabled in BIOS) -- NixOS 26.05 on bare metal (Intel Xeon E5-2697 v4, 18 cores, 125GB RAM) -- Docker running: gitea, act_runner, nextcloud, synapse, traefik, etc. - -## Architecture Decision: KVM VM (after enabling VT-x in BIOS) - -Once Intel VT-x is enabled in BIOS, we run a proper KVM/QEMU virtual machine: - -``` -┌─────────────────────────────────────────────────┐ -│ Bare Metal Host (lazyworkhorse) │ -│ │ -│ ┌─────────────────┐ ┌─────────────────────┐ │ -│ │ Production │ │ Staging VM │ │ -│ │ Docker Compose │ │ KVM/QEMU │ │ -│ │ (gitea, nc, ...) │ │ 4 vCPU, 16GB RAM │ │ -│ │ /mnt/HoardCow/ │ │ 50GB virtual disk │ │ -│ └─────────────────┘ │ Own NixOS + Docker │ │ -│ │ Own volumes (isolated) │ │ -│ └─────────────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────┐ │ -│ │ act_runner (Docker) │ │ -│ │ → SSH deploy to staging VM │ │ -│ │ → Run tests against staging │ │ -│ └─────────────────────────────────────────────┘ │ -└─────────────────────────────────────────────────┘ -``` - -## Data Isolation (Critical) - -**Production data is NEVER exposed to staging.** - -- Staging VM gets its own 50GB virtual disk (QCOW2 image) -- All Docker volumes (DB data, uploads, config) live inside the VM's disk -- Host paths like `/mnt/HoardingCow_docker_data/` are NOT bind-mounted -- VM snapshots before major tests for fast rollback -- Even catastrophic staging failure cannot touch production data - -NixOS config approach: -```nix -# In hosts/staging/configuration.nix -let - dataRoot = "/var/lib/staging-docker"; # Inside VM disk -in { - virtualisation.oci-containers.containers = { - nextcloud = { - volumes = [ "${dataRoot}/nextcloud:/var/www/html" ]; - # Same image, same config, different volume path - }; - }; -} -``` - -## Implementation Phases - -### Phase 0: Enable KVM -1. Reboot server, enter BIOS, enable Intel Virtualization Technology (VT-x) -2. Boot into NixOS -3. Add to lazyworkhorse configuration.nix: - ```nix - boot.kernelModules = [ "kvm-intel" "kvm" ]; - virtualisation.libvirtd.enable = true; - users.users.ai-worker.extraGroups = [ "libvirtd" ]; - ``` -4. nixos-rebuild switch → reboot → verify `ls /dev/kvm` - -### Phase 1: Fix CI Runner -1. Fix env var typo in act_runner config -2. Merge PR #21 (workflows), #22 (runner), #39 (nixos CI) -3. Verify runner processes PR builds - -### Phase 2: Create Staging VM -1. Define VM with virsh: - - 4 vCPU, 16GB RAM, 50GB QCOW2 disk - - Bridge network (192.168.122.0/24 via libvirt default NAT) - - Install NixOS via nixos-anywhere or ISO -2. Deploy NixOS config to staging (imports same modules as production) -3. Verify Docker and services come up in staging - -### Phase 3: CI Deploys to Staging -1. CI builds config (`nix build .#nixosConfigurations.staging`) -2. CI deploys: `nixos-rebuild switch --flake .#staging --target-host root@192.168.122.X` -3. CI runs health checks against staging services - -### Phase 4: Accumulate Tests -1. Create `tests/` directory in infra repo -2. Each new feature adds its test(s) -3. All tests run on every PR -4. Test categories: - - Container health (are all services running?) - - HTTP response (do endpoints return 200?) - - Integration (does feature X still work?) - - Regression (did change Y break Z?) - -### Phase 5: Auto-Rollback & Deploy -1. Add auto-rollback to nixos-rebuild: - ```nix - boot.loader.systemd-boot.autoRollback = true; - ``` -2. Or script: switch → health check → rollback on failure -3. Cron job for automatic nixos-rebuild on merged PRs -4. Only deploy commits that passed staging CI - -## Test Suite Examples - -```bash -# tests/containers_running.sh -for container in gitea nextcloud synapse traefik; do - if ! ssh staging "docker ps --format '{{.Names}}' | grep -q $container"; then - echo "FAIL: $container not running" - exit 1 - fi -done - -# tests/endpoints.sh -curl -sf http://192.168.122.50:3000 > /dev/null || exit 1 # Gitea -curl -sf http://192.168.122.50:8080 > /dev/null || exit 1 # Nextcloud -``` - -## To Be Decided - -1. **VM resources**: 4 vCPU / 16GB RAM sufficient? -2. **Network**: libvirt default NAT (192.168.122.0/24) or dedicated bridge? -3. **VM disk**: 50GB enough for NixOS + Docker images + volumes? -4. **Auto-merge**: full auto or with "safe-to-merge" label gate? -5. **Test runner**: inline bash in Gitea Actions, or separate test script repo?