GCP launch-ready infrastructure

Deploy Quill Medical to Google Cloud Platform (GCP) with three isolated environments (production, staging, teaching), each in its own GCP project. The stack uses Cloud Run for app services, Cloud SQL for databases, a Compute Engine virtual machine (VM) for Fast Healthcare Interoperability Resources (FHIR) and EHRbase, Terraform for infrastructure as code (IaC), and GitHub Actions for continuous integration / continuous deployment (CI/CD), pushing images to GitHub Container Registry (GHCR). Domain Name System (DNS) is delegated from GoDaddy to Cloud DNS. Budget target: £50-200/month.

Architecture

                    GoDaddy (domain registrar)
                           │
                    Cloud DNS (Terraform-managed)
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
     app.quill-medical  staging.    teaching.
              │            │            │
        GCP HTTPS Load Balancer (Google-managed TLS)
              │            │            │
       ┌──────┘     ┌──────┘     ┌──────┘
       ▼             ▼           ▼
   Cloud Run      Cloud Run   Cloud Run
   (prod proj)   (staging)   (teaching)
       │             │           │
   Cloud SQL      Cloud SQL   Cloud SQL
   (3× Postgres) (3× Postgres) (1× Postgres)
       │             │           │
   Compute Engine  Compute Engine  Cloud Storage
   (FHIR+EHRbase) (FHIR+EHRbase)  (images)

quill-medical.com (landing page) is also served via Cloud Run from the existing public_pages build. Teaching has no clinical databases (no FHIR/EHRbase) but needs a single Postgres instance for auth and user scores, plus a Cloud Storage bucket for images.

Estimated monthly cost

Service                                         Prod     Staging  Teaching  Total
Cloud Run (backend + frontend)                  £8-15    £3-8     £3-5      £14-28
Cloud SQL (db-f1-micro; 3× each, 1× teaching)   £25-40   £25-40   £8-13     £58-93
Compute Engine (e2-small)                       £12-15   £12-15   £0        £24-30
Cloud Storage (images)                          £0       £0       £1-3      £1-3
Cloud DNS + Load Balancer                       £16-21   -        -         £16-21
Total                                           £61-91   £40-63   £12-21    £113-175

Staging can be stopped when not in use to reduce costs.
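The totals can be checked with a few lines of arithmetic. The sketch below copies the ranges from the table and compares the sum with the budget target. The "staging stopped" figure assumes a stopped environment costs nothing, which overstates the saving because storage is typically still billed.

```python
# Monthly cost ranges in GBP (low, high), copied from the table above.
COSTS = {
    "Cloud Run": {"prod": (8, 15), "staging": (3, 8), "teaching": (3, 5)},
    "Cloud SQL": {"prod": (25, 40), "staging": (25, 40), "teaching": (8, 13)},
    "Compute Engine": {"prod": (12, 15), "staging": (12, 15), "teaching": (0, 0)},
    "Cloud Storage": {"prod": (0, 0), "staging": (0, 0), "teaching": (1, 3)},
    "Cloud DNS + Load Balancer": {"prod": (16, 21)},  # listed once, under prod
}
BUDGET = (50, 200)  # budget target in GBP per month


def total(costs, skip_env=None):
    """Sum the low and high end of every range, optionally skipping one environment."""
    low = high = 0
    for per_env in costs.values():
        for env, (lo, hi) in per_env.items():
            if env == skip_env:
                continue
            low += lo
            high += hi
    return low, high


print(total(COSTS))                      # (113, 175)
print(total(COSTS, skip_env="staging"))  # (73, 112), assuming stopped staging is free
print(BUDGET[0] <= total(COSTS)[0] and total(COSTS)[1] <= BUDGET[1])  # True
```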

Decisions

Cloud Run for app/backend/public-pages — scales to zero, cheapest for moderate traffic, auto-TLS
Cloud SQL (PostgreSQL 18) for all databases — managed backups, encryption-at-rest, critical for clinical data
Compute Engine (e2-small) for FHIR + EHRbase — always-on Java apps, Docker Compose on VM
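GHCR for container images — built in GitHub Actions; Cloud Run pulls directly only from Artifact Registry or Docker Hub, so GHCR images are pulled through an Artifact Registry remote repository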
Separate GCP projects per environment — strongest isolation boundaries
Cloud DNS — delegated from GoDaddy, Terraform-managed
Google-managed Transport Layer Security (TLS) certificates — free, auto-renewing
Teaching app — part of this monorepo, deployed to its own project, single Postgres DB (auth + scores), Cloud Storage for images
Cloud Storage for teaching images — cheaper and more suitable than storing blobs in a database
Budget target: £50-200/month

Phase 1: Code hardening (no GCP dependency)

Step 1.1: Harden Docker images

Backend (backend/Dockerfile) — already has non-root appuser:

Add HEALTHCHECK instruction for container orchestration
Pin base image digest (not just tag) for reproducibility
Add --no-install-recommends to apt commands if any
Verify USER appuser is applied in prod stage

Frontend (frontend/Dockerfile) — create production multi-stage build:

Stage 1: Build static assets with yarn build
Stage 2: Serve via Caddy-alpine (~25MB image, no Node.js in final image)
Run as non-root user

Step 1.2: Production Caddyfile

Create caddy/prod/Caddyfile:

Security headers: HTTP Strict Transport Security (HSTS), Content Security Policy (CSP), X-Frame-Options, X-Content-Type-Options
Static asset caching (Cache-Control)
Rate limiting on /api/auth/* endpoints — done: slowapi on login (5/min) and register (3/min) in the backend
Health check passthrough
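One way to confirm the header list above is actually being served is a small check that could run in a smoke test. This is a minimal sketch: the function name and the sample header values are illustrative, not taken from the project.

```python
# Response headers the production Caddyfile is expected to set.
REQUIRED_HEADERS = (
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Frame-Options",
    "X-Content-Type-Options",
)


def missing_security_headers(headers):
    """Return the required headers absent from a response, ignoring case."""
    present = {name.lower() for name in headers}
    return [name for name in REQUIRED_HEADERS if name.lower() not in present]


# Example response headers; the values are illustrative, not the project's policy.
sample = {
    "strict-transport-security": "max-age=31536000; includeSubDomains",
    "x-content-type-options": "nosniff",
    "cache-control": "public, max-age=3600",
}
print(missing_security_headers(sample))
# ['Content-Security-Policy', 'X-Frame-Options']
```

HTTP header names are case-insensitive, which is why the comparison lowercases both sides.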

Step 1.3: Production Docker Compose files

~~compose.yml — base service definitions shared between dev/prod~~ skipped: dev and prod configs differ significantly (volume mounts, build targets, services, env vars); separate standalone files are simpler and clearer at this project size
compose.prod.cloud-run.yml — standalone production configuration (no volume mounts; adds resource limits, restart policies, and secrets from the environment)
compose.prod.fhir-openehr.yml — FHIR + EHRbase for the Compute Engine VM

Step 1.4: Harden backend for public exposure

Modify backend/app/config.py, backend/app/main.py, backend/app/push.py:

SECURE_COOKIES=True in production (currently False)
COOKIE_DOMAIN=.quill-medical.com
Cross-origin resource sharing (CORS) allowed origins whitelist (app.quill-medical.com, staging.quill-medical.com)
Rate limiting on login/register endpoints (slowapi) — done: 5/min login, 3/min register
Move push subscriptions from in-memory list → database table — deferred: requires new SQLAlchemy model, Alembic migration, and updating push_send.py; separate piece of work
Create backend/docker/entrypoint.sh to run Alembic migrations before server start
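The cookie and CORS items above could be derived from the environment roughly as follows. This is a sketch only: `WebSettings`, `load_settings`, and the `ENVIRONMENT` and `CORS_ORIGINS` variable names are hypothetical, and the real `backend/app/config.py` may be organised differently. The origins carry an `https://` scheme because browsers send the full origin.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

DEFAULT_ORIGINS = "https://app.quill-medical.com,https://staging.quill-medical.com"


@dataclass(frozen=True)
class WebSettings:
    secure_cookies: bool
    cookie_domain: Optional[str]
    cors_origins: Tuple[str, ...]


def load_settings(env: Mapping[str, str]) -> WebSettings:
    """Derive cookie and CORS settings, treating an unset ENVIRONMENT as production."""
    production = env.get("ENVIRONMENT", "production") == "production"
    origins = env.get("CORS_ORIGINS", DEFAULT_ORIGINS).split(",")
    return WebSettings(
        secure_cookies=production,
        cookie_domain=env.get("COOKIE_DOMAIN", ".quill-medical.com") if production else None,
        cors_origins=tuple(o.strip() for o in origins if o.strip()),
    )


print(load_settings({}))
print(load_settings({"ENVIRONMENT": "development", "CORS_ORIGINS": "http://localhost:5173"}))
```

In the application this would be called once with `os.environ`. Defaulting to the production values means a missing variable fails safe rather than silently disabling secure cookies.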

Step 1.5: Environment-aware frontend

Modify frontend/vite.config.ts, frontend/public/manifest.webmanifest:

Production build with minification + tree-shaking
Progressive Web App (PWA) manifest with correct production start_url and scope

Phase 2: Terraform infrastructure

Scaffold done: all modules created in infra/ (steps 2.2–2.8). Step 2.1 (manual GCP project/API setup) is a prerequisite before terraform apply.

Step 2.1: GCP setup (manual)

Create GCP projects: quill-medical-production, quill-medical-staging, quill-medical-teaching
Enable APIs: Cloud Run, Cloud SQL, Compute Engine, Cloud DNS, Secret Manager, Identity and Access Management (IAM)
Create Terraform service account + store key as GitHub secret

Step 2.2: Terraform project structure

Create infra/ directory:

infra/
├── modules/
│   ├── cloud-run/       # Reusable Cloud Run service
│   ├── cloud-sql/       # Reusable Cloud SQL instance
│   ├── cloud-storage/   # GCS buckets (teaching images)
│   ├── compute-fhir/    # FHIR+EHRbase VM
│   ├── dns/             # Cloud DNS zone + records
│   ├── networking/      # VPC, subnets, firewall
│   ├── load-balancer/   # HTTPS LB + managed certs
│   └── secrets/         # Secret Manager
├── environments/
│   ├── prod/            # terraform.tfvars per env
│   ├── staging/
│   └── teaching/
├── backend.tf           # Remote state in GCS bucket
└── versions.tf

Step 2.3: Networking module

Virtual private cloud (VPC) with private subnets for Cloud SQL and Compute Engine
Cloud network address translation (NAT) for outbound internet from private resources
Firewall rules: Secure Shell (SSH) only from Identity-Aware Proxy (IAP), no direct database access from internet
Private service connection for Cloud SQL (no public IP on databases)
Serverless VPC connector for Cloud Run → Cloud SQL connectivity

Step 2.4: Cloud SQL module

3× PostgreSQL 18 instances for prod/staging (auth, fhir, ehrbase)
1× PostgreSQL 18 instance for teaching (auth + scores, no FHIR/EHRbase)
db-f1-micro tier for cost (can scale up)
Private IP only (no public access)
Backup retention (clinical data in prod):
  Daily backups: 30-day retention (operational recovery)
  Weekly backups: 12-month retention
  Monthly snapshots: 10-year retention (NHS compliance baseline)
  Point-in-time recovery (PITR): enabled, 7-day window (Cloud SQL supports this natively)
Encryption at rest (Google-managed keys)
Maintenance window: Sunday 03:00 UTC
High availability (HA): Not needed initially (development, showcasing, single clinic). Enable HA on production Cloud SQL once the system holds real patient data and serves multiple clinics — HA provides automatic failover to a standby instance in a different zone (~60s recovery). Roughly doubles Cloud SQL cost.
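The three retention tiers above amount to a simple age check per backup kind. The sketch below shows only that policy arithmetic and makes no Cloud SQL API calls. Approximating 12 months as 365 days and 10 years as 3,650 days is an assumption of the example, not a stated policy detail.

```python
from datetime import date, timedelta

# Retention per backup kind, from the list above. Months and years are
# approximated in days, which is an assumption of this sketch.
RETENTION = {
    "daily": timedelta(days=30),
    "weekly": timedelta(days=365),         # 12 months
    "monthly": timedelta(days=365 * 10),   # 10 years, ignoring leap days
}


def is_expired(kind, taken_on, today):
    """True when a backup of the given kind is older than its retention period."""
    return today - taken_on > RETENTION[kind]


today = date(2026, 1, 31)
print(is_expired("daily", date(2025, 12, 31), today))    # True: 31 days old
print(is_expired("weekly", date(2025, 12, 31), today))   # False
print(is_expired("monthly", date(2016, 3, 1), today))    # False: just under 10 years
```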

Step 2.5: Compute Engine module (FHIR + EHRbase)

e2-small instance (2 shared vCPUs, 2 GB RAM)
Container-Optimized OS (Google's container-focused operating system image) or Debian with Docker
Startup script installs Docker Compose, pulls images, starts services
Private IP only, accessed via VPC from Cloud Run
Health check on FHIR (/fhir/metadata) and EHRbase (/ehrbase/rest/status)
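The two health paths above can be probed with the standard library alone. In this sketch the paths come from the plan, while the base URLs, ports, and function names are hypothetical. The HTTP call is injectable so the logic can be exercised without a running VM.

```python
from urllib.request import urlopen

# Health paths from the plan; base URLs are supplied by the caller.
HEALTH_PATHS = {"fhir": "/fhir/metadata", "ehrbase": "/ehrbase/rest/status"}


def http_status(url, timeout=5.0):
    """Return the HTTP status of a GET request, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return None


def check_services(base_urls, fetch=http_status):
    """Map each service name to True when its health path answers 200."""
    return {
        name: fetch(base_urls[name].rstrip("/") + path) == 200
        for name, path in HEALTH_PATHS.items()
    }


# Stand-in for the network call: FHIR healthy, EHRbase unavailable.
def fake_fetch(url):
    return 200 if url.endswith("/fhir/metadata") else 503


# Hypothetical private addresses and ports.
urls = {"fhir": "http://10.0.0.5:8080", "ehrbase": "http://10.0.0.5:8081/"}
print(check_services(urls, fetch=fake_fetch))
# {'fhir': True, 'ehrbase': False}
```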

Step 2.6: Cloud Run module

Deploy backend (FastAPI), frontend (static + Caddy), public pages
Environment variables from Secret Manager
VPC connector for database access
Min instances: 0 (scales to zero when idle), max: 10
Memory: 512MB backend, 256MB frontend
Concurrency: 80 requests per instance
Ingress: internal + load balancer only (no direct public access to Cloud Run URLs)
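The scaling limits above imply a ceiling on concurrent load per service. The sketch below works through that arithmetic. It assumes the instance cap applies to each service separately and treats scaling as a simple division, which is cruder than Cloud Run's real autoscaler.

```python
import math

MAX_INSTANCES = 10      # per service, from the plan
CONCURRENCY = 80        # requests handled concurrently by one instance
MEMORY_MB = {"backend": 512, "frontend": 256}


def instances_needed(concurrent_requests):
    """Fewest instances that can hold the load, capped at MAX_INSTANCES."""
    if concurrent_requests <= 0:
        return 0  # min instances is 0, so an idle service runs nothing
    return min(MAX_INSTANCES, math.ceil(concurrent_requests / CONCURRENCY))


print(MAX_INSTANCES * CONCURRENCY)           # 800 concurrent requests at the cap
print(instances_needed(200))                 # 3
print(MEMORY_MB["backend"] * MAX_INSTANCES)  # 5120 MB if the backend is fully scaled out
```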

Step 2.7: Load Balancer and DNS

Global HTTPS load balancer
Backend services pointing to Cloud Run serverless network endpoint groups (NEGs)
URL map routing:
  app.quill-medical.com/* → prod Cloud Run (app)
  staging.quill-medical.com/* → staging Cloud Run
  teaching.quill-medical.com/* → teaching Cloud Run
  quill-medical.com/* → prod Cloud Run (public pages)
Google-managed Secure Sockets Layer (SSL) certificates for all domains
Cloud DNS zone with A records pointing to load balancer IP
Optional: Cloud Armor web application firewall (WAF) rules (Open Worldwide Application Security Project (OWASP) Top 10 protection)
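The URL map above is host-based routing: every path on a host goes to one backend. A sketch of the lookup follows, with hypothetical backend names. The actual mapping would live in the Terraform load-balancer module, not in Python.

```python
# Host → backend service, mirroring the URL map. Backend names are illustrative.
ROUTES = {
    "app.quill-medical.com": "prod-app",
    "staging.quill-medical.com": "staging-app",
    "teaching.quill-medical.com": "teaching-app",
    "quill-medical.com": "prod-public-pages",
}


def backend_for(host):
    """Return the backend for a Host header value, or None when no rule matches."""
    hostname = host.split(":")[0].lower().rstrip(".")  # drop port and trailing dot
    return ROUTES.get(hostname)


print(backend_for("app.quill-medical.com"))       # prod-app
print(backend_for("Quill-Medical.com:443"))       # prod-public-pages
print(backend_for("unknown.quill-medical.com"))   # None
```

The match is exact, so the apex domain and `app.` resolve to different backends even though both run in the production project.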

Step 2.8: Secret Manager

Store: JWT_SECRET (JSON Web Token secret), database passwords, Voluntary Application Server Identification (VAPID) keys, EHRbase credentials
Cloud Run services reference secrets as env vars
Terraform creates secrets, values set manually or via CI
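Because secret values are set outside Terraform, a service can start with one of them missing. A fail-fast check at startup is one way to catch that. In the sketch below only `JWT_SECRET` is a name from the plan. The other variable names and the `load_secrets` helper are placeholders.

```python
from typing import Dict, Mapping

# JWT_SECRET is named in the plan; the other variable names are placeholders.
REQUIRED_SECRETS = ("JWT_SECRET", "AUTH_DB_PASSWORD", "VAPID_PRIVATE_KEY", "EHRBASE_PASSWORD")


def load_secrets(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the required secrets, raising before startup if any is unset or empty."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        # Report names only so that secret values never reach the logs.
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SECRETS}


try:
    load_secrets({"JWT_SECRET": "example-only"})
except RuntimeError as exc:
    print(exc)  # missing secrets: AUTH_DB_PASSWORD, VAPID_PRIVATE_KEY, EHRBASE_PASSWORD
```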

Phase 3: CI/CD pipeline

Done: deploy-staging.yml (Step 3.1), deploy-production.yml (Step 3.2), terraform.yml (Step 3.4) created. Step 3.3 (release process) is a documented workflow, not code.

Branching strategy

feature/*  ──→  main  ──→  release/*  ──→  clinical-live
                  │                            │
            staging + teaching           production app
            + landing page               (clinical)
            + docs (GitHub Pages)

main — integration branch; all feature branches merge here; deploys staging, teaching, landing page, and docs
release/* — cut from main when ready for production; bug-fixes only, PR to clinical-live
clinical-live — clinical production code; only receives merges from release/* branches

Step 3.1: Deploy to staging and teaching (from `main`)

Workflow .github/workflows/deploy-staging.yml:

Trigger: push to main, skip if only docs changed
Detect which services changed → only build affected images
Tag images as main-<sha>
Authenticate to GCP via Workload Identity Federation (no keys)
Deploy updated Cloud Run services to staging and teaching
Run Alembic migrations against staging DB and teaching DB
Smoke test staging.quill-medical.com/api/health and teaching.quill-medical.com/api/health
Slack notification
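The docs-only skip, change detection, and tagging steps above can be sketched as pure functions over a list of changed paths, such as the output of `git diff --name-only`. The `backend/` and `frontend/` directories appear in this plan. `public_pages/` and `docs/` as top-level directory names are assumptions, and the real workflow may use path filters instead.

```python
from pathlib import PurePosixPath

# Top-level directory → image name. public_pages/ and docs/ are assumed names.
SERVICE_DIRS = {"backend": "backend", "frontend": "frontend", "public_pages": "public-pages"}
DOCS_DIRS = {"docs"}


def top_dir(path):
    """First path component, or an empty string for an empty path."""
    parts = PurePosixPath(path).parts
    return parts[0] if parts else ""


def docs_only(changed_files):
    """True when there are changes and all of them live under a docs directory."""
    return bool(changed_files) and all(top_dir(p) in DOCS_DIRS for p in changed_files)


def affected_services(changed_files):
    """Images that need rebuilding, one per service directory with a changed file."""
    tops = {top_dir(p) for p in changed_files}
    return sorted(SERVICE_DIRS[t] for t in tops if t in SERVICE_DIRS)


def image_tag(branch, sha):
    """Build the '<branch>-<sha>' image tag."""
    return f"{branch}-{sha}"


changed = ["backend/app/main.py", "docs/deploy.md"]
print(docs_only(changed))            # False
print(affected_services(changed))    # ['backend']
print(image_tag("main", "3f2a1c9"))  # main-3f2a1c9
```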

Step 3.2: Deploy to production (from `clinical-live`)

Workflow .github/workflows/deploy-production.yml:

Trigger: merge to clinical-live (via PR from release/* branch)
Detect which services changed → only build affected images
Tag images as clinical-live-<sha> and :latest
Authenticate to GCP via Workload Identity Federation (no keys)
Deploy updated Cloud Run services to production
Run Alembic migrations against prod DB
Smoke test app.quill-medical.com/api/health
Slack notification

Step 3.3: Release process

Cut release/x.y.z branch from main
Only bug-fixes committed to the release branch
Open PR from release/x.y.z → clinical-live
On merge: CI deploys to production (Step 3.2)
Merge clinical-live back into main to sync fixes

Step 3.4: Terraform CI

Workflow .github/workflows/terraform.yml:

On PR: terraform plan, post diff as PR comment
On merge to clinical-live: terraform apply (production infra)
On merge to main: terraform apply (staging/teaching infra)
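The three triggers above reduce to a small decision table. In the sketch below, `pull_request` and `push` are GitHub Actions event names, and a merge arrives as a push to the target branch. Planning all three environments on a PR is an assumption, since the plan does not say which environments a PR plan covers.

```python
def terraform_action(event, branch):
    """Return (command, environments) for a CI event, or None when nothing runs."""
    if event == "pull_request":
        # Assumption: a PR plans every environment so the diff is complete.
        return ("plan", ("production", "staging", "teaching"))
    if event == "push" and branch == "clinical-live":
        return ("apply", ("production",))
    if event == "push" and branch == "main":
        return ("apply", ("staging", "teaching"))
    return None


print(terraform_action("pull_request", "feature/x"))  # ('plan', ('production', 'staging', 'teaching'))
print(terraform_action("push", "clinical-live"))      # ('apply', ('production',))
print(terraform_action("push", "main"))               # ('apply', ('staging', 'teaching'))
print(terraform_action("push", "release/1.2.0"))      # None
```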

Phase 4: Observability (parallel with Phase 3) — Done

Step 4.1: Logging — Done

Cloud Run auto-sends stdout/stderr to Cloud Logging
Structure backend logs as JSON (python-json-logger)
Log request ID, user ID (not protected health information / PHI), response times
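To illustrate the log shape described above, here is a dependency-free sketch using only the standard library. It is a stand-in for python-json-logger, not that library's API. The field names `request_id`, `user_id`, and `duration_ms` and the logger name are illustrative.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    EXTRA_FIELDS = ("request_id", "user_id", "duration_ms")

    def format(self, record):
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Copy only the allow-listed extras, so PHI cannot leak in by accident.
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


handler = logging.StreamHandler()  # writes to stderr, which Cloud Run also collects
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("quill.request")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate lines from the root logger

logger.info(
    "request complete",
    extra={"request_id": "req-123", "user_id": 42, "duration_ms": 18.4},
)
```

The allow-list is the design point: a field that is not named in `EXTRA_FIELDS` never reaches the output, which keeps the "no PHI in logs" rule enforceable in one place.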

Step 4.2: Health checks and uptime monitoring — Done

GCP Uptime Checks on each subdomain (free tier: 6 checks)
Alert policy → email/Slack on downtime
Use existing /api/health endpoint

Step 4.3: Error tracking — Deferred (post-launch)

Sentry or Cloud Error Reporting for frontend + backend
Source maps for frontend error deobfuscation

Verification

Phase 1: docker compose -f compose.prod.cloud-run.yml build succeeds (the shared compose.yml base was skipped in Step 1.3); health checks pass locally
Phase 2: terraform plan shows expected resources; terraform apply creates staging infra
Phase 3: Push to main → images build → staging deploys → curl staging.quill-medical.com/api/health returns healthy
Phase 4: Structured logs visible in Cloud Logging; uptime checks green

Scope

Included: Infrastructure, CI/CD, Docker hardening, security headers, DNS, TLS, basic monitoring

Excluded: Application feature work, clinical data migration, production data seeding, custom domain email, content delivery network (add later), disaster recovery runbook, penetration testing, compliance certifications, staging access control via IAP (add later when staging needs locking down)

Implementation order

Phase 1 and Step 2.1 (manual GCP setup) start in parallel. Steps 2.2 onwards depend on Step 2.1. Phase 3 depends on Phases 1 and 2. Phase 4 runs in parallel with Phase 3. There are roughly 20 distinct implementation tasks, each of which can be walked through individually.
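The ordering above is a small dependency graph, and the standard library's `graphlib` can turn it into waves of work that may run side by side. The node names are illustrative. Giving Phase 4 the same prerequisites as Phase 3 is an assumption, since the plan states only that the two run in parallel.

```python
from graphlib import TopologicalSorter

# Each key depends on the items in its set.
DEPENDS_ON = {
    "phase-1": set(),
    "step-2.1": set(),
    "steps-2.2-2.8": {"step-2.1"},
    "phase-3": {"phase-1", "steps-2.2-2.8"},
    "phase-4": {"phase-1", "steps-2.2-2.8"},  # assumed to match Phase 3
}


def waves(graph):
    """Group nodes into successive waves; everything in one wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


print(waves(DEPENDS_ON))
# [['phase-1', 'step-2.1'], ['steps-2.2-2.8'], ['phase-3', 'phase-4']]
```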

Branching model diagram

                    feature/*
                       │
                       │ (auto-merge after CI passes)
                       ▼
                      main ──────────────────────┐
                       │                          │
                 ┌─────┴──────┐              release/*
                 ▼            ▼                   │
             staging      teaching          (bug-fixes only)
             + landing    environment             │
             + docs                               ▼
             (GitHub                        clinical-live
              Pages)                              │
                                                  ▼
                                            production
                                          (clinical app)