System Health Scorecard
Architecture Map (Generated from Code)
Generated by static analysis of the deployed codebase as of engagement start. Service boundaries, call graphs, and external dependencies are derived from imports, route definitions, and infrastructure config, not from documentation.
Critical Findings
- Stripe live secret key in services/checkout-svc/src/config/stripe.config.ts:14
- AWS IAM access key with s3:* permissions in infrastructure/scripts/migrate-data.sh:8
- Production PostgreSQL password in services/user-svc/.env.production, committed in a3f2c9b and never rotated
- 19 additional API keys, OAuth secrets, and service account tokens across webhook-svc, admin-svc, and import-svc
- Git history retains all credentials; rotation alone is insufficient.
- Fourteen services consume checkout's current contract directly.
- Nine downstream calls happen synchronously inside the checkout request path with no retry logic, no circuit breaker, and no async queueing.
- Documented incident: April 18 partial outage traced to SendGrid latency spike. Checkout degraded for 47 minutes.
- Refactor sequence: extract notifications to async → extract inventory decrement to event-driven → introduce circuit breakers around Stripe → reduce incoming dependencies via versioned API contract.
- 47,200 lines in service API directories have no inbound route or call site.
- 12 deprecated route handlers remain wired but unused since the Q3 2025 deprecation.
- 31 stale feature flags remain; 19 are permanently on or off.
- Dead code pollutes both human onboarding context and AI agent context windows: 28% of sampled completions suggested dead-code patterns.
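The checkout refactor sequence above calls for circuit breakers around Stripe. As a minimal sketch of the pattern (not the codebase's implementation), the class below opens after a run of consecutive failures, fails fast during a cooldown, and then lets one trial call through. The names `CircuitBreaker` and `CircuitOpenError`, the threshold, and the cooldown are all hypothetical values chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the cooldown can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail fast instead of holding the checkout request open.
                raise CircuitOpenError("dependency circuit is open")
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound dependency call, for example `breaker.call(lambda: send_receipt(order))`, where `send_receipt` stands in for whatever client function the service already uses. A breaker only limits the blast radius of a slow dependency; moving notifications off the request path, as the sequence above proposes first, removes the dependency from checkout latency altogether.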
Detailed under Cloud Cost Analysis (§03) below. Compute right-sizing, RDS downsizing, S3 lifecycle policies, and Elasticsearch reduction together create a recoverable savings pool that covers the engagement and the first phase of the refactor work above.
- Single Redis node, no replication, no automated failover.
- Used as session store, cache, and rate limiter on one instance.
- Failure mode: loss of logged-in sessions, rate-limiting collapse, and cache stampede across dependent services.
- Structured logs absent in 6 of 14 services; access-log retention below 90 days.
- Service account permissions broader than required in 9 cases.
- 23% of production deploys in the last 90 days lacked a PR review record.
- Data in transit between internal services is unencrypted in 4 places.
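For the committed credentials listed at the top of these findings, a working-tree scan of the following shape can catch new leaks before they are committed. This is a rough sketch under stated assumptions: the three patterns are illustrative only (the `sk_live_` prefix for Stripe live secret keys, the `AKIA` prefix for AWS access key IDs, and a generic password assignment), and `scan_text` is a hypothetical helper. A dedicated secret scanner covers far more credential types.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; real scanners ship much larger rule sets.
PATTERNS = {
    "stripe_live_secret_key": re.compile(r"sk_live_[0-9A-Za-z]{10,}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "password_assignment": re.compile(r"\b\w*password\w*\s*[=:]\s*\S+", re.IGNORECASE),
}


def scan_text(path: str, text: str) -> List[Tuple[str, int, str]]:
    """Return (path, line number, pattern name) for each suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((path, lineno, kind))
    return findings
```

A scan like this only sees the files it is given. Because Git history retains every committed credential, history needs to be scanned and rewritten separately, and every exposed key still has to be rotated.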
Cloud Cost Analysis
Current annual cloud spend: $560K. Identified recoverable spend: $242K (43%) with no functional changes.
| Service | Current | Optimized | Savings | Assessment |
|---|---|---|---|---|
| EC2 Compute (8 instances) | $228,000 | $116,000 | $112,000 | 3 instances at <4% CPU. Right-size + Reserved Instances. |
| RDS (db.r6g.4xlarge) | $164,000 | $82,000 | $82,000 | Provisioned for ~3x actual workload. Downsize + read replica strategy. |
| S3 Storage (14 buckets) | $76,000 | $44,000 | $32,000 | No lifecycle policies. Move year-old raw data to Glacier. |
| Elasticsearch (9-node) | $52,000 | $40,000 | $12,000 | Sized for projected document volume that did not materialize. |
| CloudWatch / Other | $40,000 | $36,000 | $4,000 | Minor log retention and metric cleanup. |
| Total | $560,000/yr | $318,000/yr | $242,000/yr | 43% cloud cost reduction opportunity |
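The totals can be recomputed from the line items whenever a figure changes. The sketch below copies the Current and Optimized columns from the table; the names `LINE_ITEMS`, `per_item_savings`, and `savings_summary` are hypothetical, chosen for illustration.

```python
# (current, optimized) annual spend in USD, copied from the table above.
LINE_ITEMS = {
    "EC2 Compute": (228_000, 116_000),
    "RDS": (164_000, 82_000),
    "S3 Storage": (76_000, 44_000),
    "Elasticsearch": (52_000, 40_000),
    "CloudWatch / Other": (40_000, 36_000),
}


def per_item_savings(items):
    """Savings per service: current spend minus optimized spend."""
    return {name: cur - opt for name, (cur, opt) in items.items()}


def savings_summary(items):
    """Column totals plus savings as a percentage of current spend."""
    current = sum(cur for cur, _ in items.values())
    optimized = sum(opt for _, opt in items.values())
    savings = current - optimized
    return {
        "current": current,
        "optimized": optimized,
        "savings": savings,
        "savings_pct": round(100 * savings / current, 1),
    }
```

Deriving the Total row from the line items, rather than maintaining it by hand, keeps the headline figure and the table in agreement if the review becomes a recurring exercise.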
Refactor Backlog & Sequencing
Sequenced so that no two streams contend for the same files. Three engineers can run streams A, B, and C in parallel from week one.
| # | Item | Stream | Effort | Depends On | Engineering Outcome |
|---|---|---|---|---|---|
| 1 | Rotate 22 committed credentials, stand up Secrets Manager, rewrite git history | A — Security | 1 week | None | Eliminates breach liability and unblocks compliance work. |
| 2 | Right-size EC2 + apply Reserved Instances | B — Cost | 2 days | None | $112K/yr recovered. |
| 3 | RDS downsize + S3 lifecycle policies | B — Cost | 1 week | (2) | $114K/yr recovered. |
| 4 | Extract notifications and inventory from checkout to async | C — Architecture | 3 weeks | None | Reduces checkout coupling and enables retry semantics. |
| 5 | Add circuit breakers around Stripe, SendGrid, Redis | C — Architecture | 1 week | (4) | Eliminates cascade failures from external dependencies. |
| 6 | Redis HA migration (ElastiCache Multi-AZ) | C — Architecture | 2 weeks | None | Removes the highest-probability future incident. |
| 7 | Dead code deletion campaign (deprecated routes → unreachable handlers → feature flags) | D — Tech Debt | 6 weeks | (1) | Recovers velocity. Cleans context for AI-assisted development. |
| 8 | Test coverage floor for new code (60%); retroactive coverage on hot paths | D — Tech Debt | Ongoing | (7) | Required before agentic refactoring is enabled. |
| 9 | SOC 2 readiness gaps (logging, access review, change management) | E — Compliance | 4 weeks | None | Pre-audit posture. Parallelizable with all other streams. |
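The Depends On column defines a small dependency graph, and a topological sort shows which items can start immediately. The sketch below uses Python's standard-library `graphlib` (Python 3.9+). The `DEPENDS_ON` mapping is transcribed from the table, and `execution_waves` is a hypothetical helper name. It orders by dependencies only and ignores effort, staffing, and stream assignment.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Item number -> prerequisite item numbers, from the "Depends On" column.
DEPENDS_ON = {
    1: set(), 2: set(), 3: {2}, 4: set(), 5: {4},
    6: set(), 7: {1}, 8: {7}, 9: set(),
}


def execution_waves(depends_on):
    """Group items into waves; every item in a wave has all prerequisites done."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(DEPENDS_ON))  # [[1, 2, 4, 6, 9], [3, 5, 7], [8]]
```

Items 1, 2, 4, 6, and 9 have no prerequisites, which is consistent with streams A, B, and C starting in week one.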
Recommended Engineering Questions
These questions surface context the codebase can't show. Some are for senior engineers, some for your platform lead, and some for your CISO.
- For senior engineers: The 22 committed credentials have been in source for some time. Do you have evidence of unauthorized access, and is there a rotation process today?
- For your platform lead: checkout-svc has 14 inbound dependencies. Was this coupling an organic accumulation or a deliberate architectural choice?
- For your platform lead: Cloud spend has not been reviewed quarterly. Is there organizational appetite to make this recurring, and who would own it?
- For your CISO or security lead: SOC 2 Type II is on the roadmap. What's the target audit window, and is there an existing GRC platform we should align with?
- For senior engineers: What is the current adoption of AI-assisted development tooling, and is there concern about agentic changes before tech debt is reduced?