cleanup-remove-stray-plan-file
@@ -1,136 +0,0 @@
# Infrastructure CI/CD + Staging Plan

Date: 2026-05-12
Status: Draft for review (updated)

## Current State

- Gitea Actions workflows exist (PR #21: build-ollama, build-hermes; PR #39: build-nixos)
- act_runner is blocked by an env var typo (`GITEA_RUNNER_REGIS_TOKEN` should be `GITEA_RUNNER_REGISTRATION_TOKEN`)
- KVM is currently unavailable (VT-x possibly disabled in BIOS)
- NixOS 26.05 on bare metal (Intel Xeon E5-2697 v4, 18 cores, 125GB RAM)
- Docker running: gitea, act_runner, nextcloud, synapse, traefik, etc.

## Architecture Decision: KVM VM (after enabling VT-x in BIOS)

Once Intel VT-x is enabled in BIOS, staging runs in a dedicated KVM/QEMU virtual machine next to the production Docker Compose stack:

```
┌─────────────────────────────────────────────────┐
│ Bare Metal Host (lazyworkhorse)                 │
│                                                 │
│ ┌──────────────────┐ ┌────────────────────────┐ │
│ │ Production       │ │ Staging VM             │ │
│ │ Docker Compose   │ │ KVM/QEMU               │ │
│ │ (gitea, nc, ...) │ │ 4 vCPU, 16GB RAM       │ │
│ │ /mnt/HoardCow/   │ │ 50GB virtual disk      │ │
│ └──────────────────┘ │ Own NixOS + Docker     │ │
│                      │ Own volumes (isolated) │ │
│                      └────────────────────────┘ │
│                                                 │
│ ┌─────────────────────────────────────────────┐ │
│ │ act_runner (Docker)                         │ │
│ │   → SSH deploy to staging VM                │ │
│ │   → Run tests against staging               │ │
│ └─────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
```

## Data Isolation (Critical)

**Production data is NEVER exposed to staging.**

- The staging VM gets its own 50GB virtual disk (QCOW2 image)
- All Docker volumes (DB data, uploads, config) live inside the VM's disk
- Host paths like `/mnt/HoardingCow_docker_data/` are NOT bind-mounted
- VM snapshots are taken before major tests for fast rollback
- Even a catastrophic staging failure cannot touch production data

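The snapshot step could be scripted roughly as follows. This is a minimal sketch: the domain name `staging` and the snapshot name `pre-test` are placeholders, and the `VIRSH` variable is a hypothetical hook for dry runs. Only `virsh snapshot-create-as` and `virsh snapshot-revert` are taken from libvirt itself.

```bash
#!/usr/bin/env bash
# Sketch: snapshot the staging VM before a test run and revert afterwards.
# VIRSH is a hypothetical override so the commands can be printed instead of run.
VIRSH="${VIRSH:-virsh}"

take_snapshot() {   # usage: take_snapshot <domain> <snapshot-name>
  $VIRSH snapshot-create-as "$1" "$2"
}

revert_snapshot() { # usage: revert_snapshot <domain> <snapshot-name>
  $VIRSH snapshot-revert "$1" "$2"
}
```

For a dry run, `VIRSH=echo take_snapshot staging pre-test` prints the arguments that would be passed to virsh instead of executing them.
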
NixOS config approach:
```nix
# In hosts/staging/configuration.nix
let
  dataRoot = "/var/lib/staging-docker"; # Inside VM disk
in {
  virtualisation.oci-containers.containers = {
    nextcloud = {
      volumes = [ "${dataRoot}/nextcloud:/var/www/html" ];
      # Same image, same config, different volume path
    };
  };
}
```

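Because staging reuses the production modules, a guard against accidentally inherited production paths may be useful. The sketch below is one possible check, not part of the plan itself: `forbidden_mounts` and `PROD_PREFIX` are hypothetical names, and the prefix is assumed from the host path mentioned above.

```bash
#!/usr/bin/env bash
# Sketch: flag container mount sources that point at production storage.
# PROD_PREFIX is an assumption taken from the host path named in this plan.
PROD_PREFIX="${PROD_PREFIX:-/mnt/HoardingCow_docker_data}"

# Reads one mount source path per line on stdin, prints offending paths,
# and returns non-zero if any path lies under PROD_PREFIX.
forbidden_mounts() {
  local found=0 path
  while IFS= read -r path; do
    case "$path" in
      "$PROD_PREFIX"|"$PROD_PREFIX"/*)
        echo "FORBIDDEN: $path"
        found=1
        ;;
    esac
  done
  return "$found"
}
```

On the staging VM, the input could come from `docker inspect --format '{{range .Mounts}}{{println .Source}}{{end}}' $(docker ps -q) | forbidden_mounts`.
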
## Implementation Phases

### Phase 0: Enable KVM
1. Reboot the server, enter BIOS, and enable Intel Virtualization Technology (VT-x)
2. Boot into NixOS
3. Add to the lazyworkhorse `configuration.nix`:
   ```nix
   boot.kernelModules = [ "kvm-intel" "kvm" ];
   virtualisation.libvirtd.enable = true;
   users.users.ai-worker.extraGroups = [ "libvirtd" ];
   ```
4. Run `nixos-rebuild switch`, reboot, and verify that the device node exists with `ls /dev/kvm`

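The final verification step could be made slightly stricter than `ls`, since a device node that exists but is not accessible to the calling user would still block libvirt. A minimal sketch, where `kvm_ready` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: confirm that the KVM device node is usable after the rebuild.
# The device path is a parameter only so the check can be exercised elsewhere.
kvm_ready() {
  local dev="${1:-/dev/kvm}"
  if [ ! -c "$dev" ]; then
    echo "FAIL: $dev is missing or not a character device"
    return 1
  fi
  if [ ! -r "$dev" ] || [ ! -w "$dev" ]; then
    echo "FAIL: $dev exists but is not readable and writable by $(id -un)"
    return 1
  fi
  echo "OK: $dev is usable"
}
```

Calling `kvm_ready` with no argument checks `/dev/kvm`. A failure of the first check after the BIOS change suggests that the kernel module did not load.
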
### Phase 1: Fix CI Runner
1. Fix the env var typo in the act_runner config (`GITEA_RUNNER_REGIS_TOKEN` → `GITEA_RUNNER_REGISTRATION_TOKEN`)
2. Merge PR #21 (workflows), #22 (runner), #39 (nixos CI)
3. Verify that the runner processes PR builds

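A small guard could catch the misspelled variable before the runner is restarted. This is a sketch under assumptions: `check_runner_env` is a hypothetical helper, and the file it inspects (an env file, compose file, or similar) must be supplied by the caller because the plan does not name it.

```bash
#!/usr/bin/env bash
# Sketch: detect the misspelled registration-token variable in a config file.
# The file location is an assumption; pass the real path as the argument.
check_runner_env() {
  local file="$1"
  if grep -qF 'GITEA_RUNNER_REGIS_TOKEN' "$file"; then
    echo "FAIL: misspelled GITEA_RUNNER_REGIS_TOKEN found in $file"
    return 1
  fi
  if ! grep -qF 'GITEA_RUNNER_REGISTRATION_TOKEN' "$file"; then
    echo "FAIL: GITEA_RUNNER_REGISTRATION_TOKEN is not set in $file"
    return 1
  fi
  echo "OK: registration token variable is spelled correctly"
}
```

The misspelled name is not a substring of the correct one, so a fixed-string search for each is enough to tell them apart.
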
### Phase 2: Create Staging VM
1. Define the VM with virsh:
   - 4 vCPU, 16GB RAM, 50GB QCOW2 disk
   - Network: libvirt default NAT network (192.168.122.0/24); a dedicated bridge is still an open question
   - Install NixOS via nixos-anywhere or ISO
2. Deploy the NixOS config to staging (imports the same modules as production)
3. Verify that Docker and the services come up in staging

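One way to create a VM of this size is `virt-install`, which is a separate tool from virsh and is swapped in here only for illustration. The sketch prints a candidate command rather than running it. The VM name, the ISO path, and the `generic` OS variant are placeholders, not decisions made in this plan.

```bash
#!/usr/bin/env bash
# Sketch: print a virt-install command matching the sizing in this plan.
# The VM name and ISO path are placeholders; review the output before running it.
staging_vm_command() {
  local name="${1:-staging}" iso="${2:-/path/to/nixos.iso}"
  printf '%s ' virt-install \
    --name "$name" \
    --vcpus 4 \
    --memory 16384 \
    --disk size=50,format=qcow2 \
    --network network=default \
    --cdrom "$iso" \
    --os-variant generic
  printf '\n'
}
```

`--memory` is given in MiB and `--disk size=` in GiB, so the values correspond to 16GB RAM and a 50GB disk. `network=default` selects the libvirt default NAT network.
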
### Phase 3: CI Deploys to Staging
1. CI builds the config (`nix build .#nixosConfigurations.staging.config.system.build.toplevel`)
2. CI deploys: `nixos-rebuild switch --flake .#staging --target-host root@192.168.122.X`
3. CI runs health checks against the staging services

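Services may need a few seconds to come up after a switch, so the health-check step probably wants a retry loop rather than a single probe. A minimal sketch, where `retry` is a hypothetical helper and the attempt count and delay are arbitrary:

```bash
#!/usr/bin/env bash
# Sketch: retry a health check until it passes or the attempts run out.
# usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

In CI this could wrap a probe such as `retry 10 3 curl -sf -o /dev/null http://192.168.122.50:3000`, using the staging address that appears in the test examples of this plan.
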
### Phase 4: Accumulate Tests
1. Create a `tests/` directory in the infra repo
2. Each new feature adds its test(s)
3. All tests run on every PR
4. Test categories:
   - Container health (are all services running?)
   - HTTP response (do endpoints return 200?)
   - Integration (does feature X still work?)
   - Regression (did change Y break Z?)

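Running every accumulated test on each PR needs some kind of driver. The sketch below assumes each test is a standalone `*.sh` file in one directory that exits non-zero on failure. `run_tests` is a hypothetical helper, not an existing script in the repo.

```bash
#!/usr/bin/env bash
# Sketch: run every *.sh file in a directory and count failures.
# usage: run_tests <directory>; returns non-zero if any script fails.
run_tests() {
  local dir="$1" failed=0 script
  for script in "$dir"/*.sh; do
    [ -e "$script" ] || continue   # the directory has no test scripts
    if bash "$script"; then
      echo "PASS: ${script##*/}"
    else
      echo "FAIL: ${script##*/}"
      failed=$((failed + 1))
    fi
  done
  echo "$failed test(s) failed"
  [ "$failed" -eq 0 ]
}
```

Unlike a driver that stops at the first failure, this one runs all scripts, so a single PR run reports every broken check at once.
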
### Phase 5: Auto-Rollback & Deploy
1. Enable automatic rollback at boot, provided the boot loader module in the NixOS release in use supports it:
   ```nix
   # Option name as proposed in this draft; confirm that it exists before relying on it
   boot.loader.systemd-boot.autoRollback = true;
   ```
2. Or use a script: switch → health check → rollback on failure
3. Cron job that runs nixos-rebuild automatically for merged PRs
4. Only deploy commits that passed staging CI

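The scripted alternative (switch, health check, rollback on failure) could look roughly like this. It is a sketch, not the chosen implementation: `deploy_with_rollback` is a hypothetical helper, and the three steps are passed in as command or function names so that the actual deploy, check, and rollback commands stay open.

```bash
#!/usr/bin/env bash
# Sketch: switch, verify, and roll back if verification fails.
# usage: deploy_with_rollback <deploy-cmd> <check-cmd> <rollback-cmd>
# Each argument names a command or shell function that takes no arguments.
deploy_with_rollback() {
  local deploy="$1" check="$2" rollback="$3"
  if ! "$deploy"; then
    echo "deploy failed before the health check"
    return 1
  fi
  if "$check"; then
    echo "deploy healthy"
    return 0
  fi
  echo "health check failed; rolling back"
  "$rollback"
  return 1
}
```

The deploy step might wrap the `nixos-rebuild switch --flake .#staging --target-host ...` call from Phase 3, and the rollback step might wrap `nixos-rebuild switch --rollback`. Whether rollback behaves as needed against a remote target should be verified on staging first.
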
## Test Suite Examples

```bash
# tests/containers_running.sh
# Fails if any expected container is missing on the staging VM.
for container in gitea nextcloud synapse traefik; do
  # grep -q is a substring match, so names that merely contain the service name also pass
  if ! ssh staging "docker ps --format '{{.Names}}' | grep -q '$container'"; then
    echo "FAIL: $container not running"
    exit 1
  fi
done

# tests/endpoints.sh
# -s silences progress output; -f makes curl exit non-zero on HTTP status >= 400
curl -sf http://192.168.122.50:3000 > /dev/null || exit 1  # Gitea
curl -sf http://192.168.122.50:8080 > /dev/null || exit 1  # Nextcloud
```

## To Be Decided

1. **VM resources**: are 4 vCPU / 16GB RAM sufficient?
2. **Network**: libvirt default NAT (192.168.122.0/24) or a dedicated bridge?
3. **VM disk**: is 50GB enough for NixOS + Docker images + volumes?
4. **Auto-merge**: fully automatic, or gated by a "safe-to-merge" label?
5. **Test runner**: inline bash in Gitea Actions, or a separate test script repo?