cleanup-remove-stray-plan-file

2026-05-15 21:14:28 -04:00
parent 9158a0f93b
commit f1b1e5dc4c
1 changed files with 0 additions and 136 deletions
--- a/.hermes/plans/staging-vm-ci-cd-plan.md
+++ b/.hermes/plans/staging-vm-ci-cd-plan.md
@@ -1,136 +0,0 @@
 # Infrastructure CI/CD + Staging Plan
 Date: 2026-05-12
 Status: Draft for review (updated)
 ## Current State
 - Gitea Actions workflows exist (PR #21: build-ollama, build-hermes; PR #39: build-nixos)
 - act_runner blocked by env var typo (GITEA_RUNNER_REGIS_TOKEN → GITEA_RUNNER_REGISTRATION_TOKEN)
 - KVM unavailable currently (VT-x possibly disabled in BIOS)
 - NixOS 26.05 on bare metal (Intel Xeon E5-2697 v4, 18 cores, 125GB RAM)
 - Docker running: gitea, act_runner, nextcloud, synapse, traefik, etc.
 ## Architecture Decision: KVM VM (after enabling VT-x in BIOS)
 Once Intel VT-x is enabled in BIOS, we run a proper KVM/QEMU virtual machine:
 ```
 ┌─────────────────────────────────────────────────┐
 │ Bare Metal Host (lazyworkhorse)                  │
 │                                                   │
 │  ┌─────────────────┐   ┌─────────────────────┐   │
 │  │ Production       │   │ Staging VM           │   │
 │  │ Docker Compose   │   │ KVM/QEMU             │   │
 │  │ (gitea, nc, ...) │   │ 4 vCPU, 16GB RAM     │   │
 │  │ /mnt/HoardCow/   │   │ 50GB virtual disk    │   │
 │  └─────────────────┘   │ Own NixOS + Docker    │   │
 │                          │ Own volumes (isolated) │   │
 │                          └─────────────────────┘   │
 │                                                   │
 │  ┌─────────────────────────────────────────────┐ │
 │  │ act_runner (Docker)                         │ │
 │  │ → SSH deploy to staging VM                  │ │
 │  │ → Run tests against staging                 │ │
 │  └─────────────────────────────────────────────┘ │
 └─────────────────────────────────────────────────┘
 ```
 ## Data Isolation (Critical)
 **Production data is NEVER exposed to staging.**
 - Staging VM gets its own 50GB virtual disk (QCOW2 image)
 - All Docker volumes (DB data, uploads, config) live inside the VM's disk
 - Host paths like `/mnt/HoardingCow_docker_data/` are NOT bind-mounted
 - VM snapshots before major tests for fast rollback
 - Even catastrophic staging failure cannot touch production data
 NixOS config approach:
 ```nix
 # In hosts/staging/configuration.nix
 let
  dataRoot = "/var/lib/staging-docker";  # Inside VM disk
 in {
  virtualisation.oci-containers.containers = {
    nextcloud = {
      volumes = [ "${dataRoot}/nextcloud:/var/www/html" ];
      # Same image, same config, different volume path
    };
  };
 }
 ```
 ## Implementation Phases
 ### Phase 0: Enable KVM
 1. Reboot server, enter BIOS, enable Intel Virtualization Technology (VT-x)
 2. Boot into NixOS
 3. Add to lazyworkhorse configuration.nix:
   ```nix
   boot.kernelModules = [ "kvm-intel" "kvm" ];
   virtualisation.libvirtd.enable = true;
   users.users.ai-worker.extraGroups = [ "libvirtd" ];
   ```
 4. nixos-rebuild switch → reboot → verify `ls /dev/kvm`
 ### Phase 1: Fix CI Runner
 1. Fix env var typo in act_runner config
 2. Merge PR #21 (workflows), #22 (runner), #39 (nixos CI)
 3. Verify runner processes PR builds
 ### Phase 2: Create Staging VM
 1. Define VM with virsh:
   - 4 vCPU, 16GB RAM, 50GB QCOW2 disk
   - Bridge network (192.168.122.0/24 via libvirt default NAT)
   - Install NixOS via nixos-anywhere or ISO
 2. Deploy NixOS config to staging (imports same modules as production)
 3. Verify Docker and services come up in staging
 ### Phase 3: CI Deploys to Staging
 1. CI builds config (`nix build .#nixosConfigurations.staging`)
 2. CI deploys: `nixos-rebuild switch --flake .#staging --target-host root@192.168.122.X`
 3. CI runs health checks against staging services
 ### Phase 4: Accumulate Tests
 1. Create `tests/` directory in infra repo
 2. Each new feature adds its test(s)
 3. All tests run on every PR
 4. Test categories:
   - Container health (are all services running?)
   - HTTP response (do endpoints return 200?)
   - Integration (does feature X still work?)
   - Regression (did change Y break Z?)
 ### Phase 5: Auto-Rollback & Deploy
 1. Add auto-rollback to nixos-rebuild:
   ```nix
   boot.loader.systemd-boot.autoRollback = true;
   ```
 2. Or script: switch → health check → rollback on failure
 3. Cron job for automatic nixos-rebuild on merged PRs
 4. Only deploy commits that passed staging CI
 ## Test Suite Examples
 ```bash
 # tests/containers_running.sh
 for container in gitea nextcloud synapse traefik; do
  if ! ssh staging "docker ps --format '{{.Names}}' | grep -q $container"; then
    echo "FAIL: $container not running"
    exit 1
  fi
 done
 # tests/endpoints.sh
 curl -sf http://192.168.122.50:3000 > /dev/null || exit 1  # Gitea
 curl -sf http://192.168.122.50:8080 > /dev/null || exit 1  # Nextcloud
 ```
 ## To Be Decided
 1. **VM resources**: 4 vCPU / 16GB RAM sufficient?
 2. **Network**: libvirt default NAT (192.168.122.0/24) or dedicated bridge?
 3. **VM disk**: 50GB enough for NixOS + Docker images + volumes?
 4. **Auto-merge**: full auto or with "safe-to-merge" label gate?
 5. **Test runner**: inline bash in Gitea Actions, or separate test script repo?