Hazard¶
DO NOT EDIT BELOW THIS LINE UNLESS YOU KNOW WHAT YOU ARE DOING
Hazard name¶
Docker containers lack resource limits
General utility label¶
[2]
Likelihood scoring¶
TBC
Severity scoring¶
TBC
Description¶
Docker Compose file defines resource limits for database containers but not for backend/frontend containers, allowing runaway processes to consume all host resources.
Causes¶
- Backend service has no mem_limit or cpus configuration
- Runaway process (memory leak, infinite loop) can consume all host RAM/CPU
- No OOM killer threshold or CPU quotas
Effect¶
Backend container consumes all host RAM or CPU, causing entire system to freeze or crash.
Hazard¶
Clinical system becomes unavailable until manual restart by IT staff, clinicians cannot access patient data.
Hazard type¶
- Unavailable
Harm¶
Delayed treatment while system is down. Potential patient harm in emergencies where rapid access to clinical information affects treatment decisions. Clinicians forced to use backup paper system causing data fragmentation.
Existing controls¶
None identified during initial analysis.
Assignment¶
Clinical Safety Officer
Labelling¶
TBC (awaiting scoring)
Project¶
Clinical Risk Management
Hazard controls¶
Design controls (manufacturer)¶
- Add resource limits to backend container in compose.dev.yml: deploy.resources.limits.cpus: "2.0", deploy.resources.limits.memory: "2G". Prevents backend from using >2 CPUs or >2GB RAM. Reserves host resources for other services.
- Configure OOM (Out Of Memory) killer behavior: deploy.resources.reservations.memory: "1G" (guaranteed minimum), limits.memory: "2G" (maximum). If backend exceeds 2GB, Docker OOM killer terminates container. Set restart: unless-stopped to auto-restart after OOM kill.
- Add health check with resource monitoring: backend health check endpoint /api/health returns memory usage percentage, CPU usage. Docker marks container unhealthy if health check fails. Orchestrator (Docker Swarm or Kubernetes) can replace unhealthy container.
- Implement memory leak detection: add middleware that tracks Python memory usage (using tracemalloc). Log warning if memory growth >50MB/hour (indicates potential leak). Alert operations team if sustained memory growth detected.
- Add graceful degradation: when memory usage >80% of limit, backend enters degraded mode: disable non-critical features (push notifications, background tasks), return 503 Service Unavailable for some endpoints, display "System experiencing high load" message. Prevents complete failure.
Testing controls (manufacturer)¶
- Memory limit test: Deploy backend with 500MB memory limit. Trigger memory-intensive operation (e.g., load 10,000 patients). Monitor memory usage. Verify backend killed by OOM killer when exceeds limit. Verify container auto-restarts after OOM kill.
- CPU limit test: Deploy backend with 0.5 CPU limit. Trigger CPU-intensive operation (e.g., cryptographic hash computation). Monitor CPU usage. Assert never exceeds 50% of single core (0.5 CPU limit enforced).
- Resource exhaustion test: Deliberately introduce memory leak (accumulate objects in global list). Run load test for 10 minutes. Verify memory leak detection middleware logs warnings. Verify backend enters degraded mode at 80% memory usage.
- Health check test: Backend consumes 2.5GB RAM (exceeds 2GB limit), gets killed by OOM killer. Verify health check fails after restart. Verify orchestrator marks container unhealthy. Simulate orchestrator replacing unhealthy container.
Training controls (deployment)¶
- Train operations team on resource monitoring: check Docker stats for CPU/memory usage, understand when containers approaching limits, interpret OOM killer logs, restart containers manually if needed.
- Document resource limit rationale: explain why limits chosen (based on capacity planning), how to adjust limits if insufficient, when to scale horizontally (add more backend replicas) vs vertically (increase per-container limits).
Business process controls (deployment)¶
- Capacity planning policy: Review container resource usage monthly. If average usage >70% of limits, increase limits by 25%. If sustained >90%, increase immediately. Document capacity changes in change log.
- Monitoring and alerting: Prometheus monitors container CPU and memory usage. Alert if usage >80% of limits for >5 minutes (warning), >90% for >2 minutes (critical). Alert triggers PagerDuty notification to on-call engineer.
- Incident response: During resource exhaustion incident, immediately increase container limits via compose file update and redeploy. Investigate root cause: memory leak, unexpected load spike, inefficient code. Fix underlying issue to prevent recurrence.
- OOM kill analysis: When container killed by OOM killer, capture heap dump (if available), analyze memory usage patterns. Identify memory leak source. Add test case for leak scenario. Fix leak in code. Deploy fix and verify memory usage stabilizes.
- SystemCrash
- SystemUnavailable
Residual hazard risk assessment¶
TBC — awaiting initial controls implementation.
Hazard status¶
Draft from LLM
Code associated with hazard¶
- compose.dev.yml