Backend Software Engineer — Distributed Systems & Event-Driven Architecture
3.7+ years building production-grade backend systems in banking and IoT domains. I propose and drive architectural decisions (Kafka adoption, Saga design, CQRS strategy) within cross-functional teams, reviewed with senior architects. My role goes beyond ticket execution: system design, failure ownership, and tradeoff reasoning.
Payment and order flows needed burst tolerance. REST coupling would fail under traffic spikes because consumer slowness propagates back to producers immediately. Kafka decouples ingestion from processing, giving us backpressure handling, replay, and independent scaling.
Chose orchestrated Saga (central coordinator) over choreography for payment flows. In choreography, debugging a stuck saga requires reconstructing state from events across 6 services — operationally painful. Coordinator gives single source of truth for saga state.
High-frequency reads (customer profiles, order history) were hammering transactional MySQL. Introduced Redis read models — denormalized views updated via Kafka events. Result: <20ms reads, 2K+ req/min offloaded, ~40% Redis memory increase vs ~60% DB CPU reduction.
Standard produce-after-commit risks silent event loss if the service crashes between the DB write and the Kafka publish. The outbox pattern writes the event atomically in the same DB transaction as the business record; a relay service polls the outbox and publishes to Kafka, so no committed event is ever lost.
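A minimal sketch of the outbox flow described above. The in-memory lists stand in for the `orders` and `outbox` tables, and the `relayPoll` method stands in for the relay service; in production both rows are written in one SQL transaction and the relay publishes to Kafka. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class OutboxSketch {
    // Nested record standing in for a row in the outbox table.
    record OutboxEvent(long id, String payload, boolean published) {}

    static final List<String> orders = new ArrayList<>();      // business table
    static final List<OutboxEvent> outbox = new ArrayList<>(); // outbox table
    static long nextId = 1;

    // Step 1: business write + event write in the SAME transaction.
    static synchronized void placeOrder(String order) {
        orders.add(order);                                     // INSERT INTO orders ...
        outbox.add(new OutboxEvent(nextId++, "OrderPlaced:" + order, false)); // INSERT INTO outbox ...
        // commit: either both rows exist or neither does
    }

    // Step 2: relay polls unpublished rows, publishes them, marks them done.
    static synchronized List<String> relayPoll() {
        List<String> publishedNow = new ArrayList<>();
        for (int i = 0; i < outbox.size(); i++) {
            OutboxEvent e = outbox.get(i);
            if (!e.published()) {
                publishedNow.add(e.payload());                 // kafkaProducer.send(...)
                outbox.set(i, new OutboxEvent(e.id(), e.payload(), true));
            }
        }
        return publishedNow;
    }
}
```

Note the relay gives at-least-once publishing: a crash after send but before the mark-published update means the event is republished on the next poll, which is why consumers deduplicate.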
Kafka's at-least-once delivery means any consumer can receive the same event twice on retry or rebalance. Every consumer checks a processed_events deduplication table before processing. On duplicate, skip and ack — no double-processing.
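The dedup check can be sketched as below. The `HashSet` stands in for the `processed_events` table; in production the event id is inserted in the same DB transaction as the business write, so the check survives restarts. The field and method names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentConsumer {
    // Stand-in for the processed_events deduplication table.
    private final Set<String> processedEvents = new HashSet<>();
    private int debits = 0; // stand-in for the real side effect

    // Returns true if the event was processed, false if it was a duplicate.
    public boolean handle(String eventId) {
        if (!processedEvents.add(eventId)) {
            return false;        // duplicate: skip and ack
        }
        debits++;                // the actual business logic runs exactly once
        return true;
    }

    public int debits() { return debits; }
}
```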
Read replicas handle complex SQL queries well but still involve replication lag + query parsing overhead. Chose Redis because access patterns were pure key lookups (customer by ID, order by ID). Sub-millisecond, no query planner overhead.
Legacy monolith handling critical banking flows suffered from tight coupling (slow releases), DB contention (P95 ~800ms), and zero fault isolation (cascading failures under load).
Redis CQRS model increased memory usage by ~40% but reduced MySQL CPU load by ~60%, enabling a tier downgrade — net infra cost reduction. Kafka cluster operational overhead justified at 5M+ events/day (polling-based queue would require heavier DB at same throughput). EKS chosen over EC2 for zero-downtime rolling deployments — operational cost offset by eliminating manual deployment risk.
Design a real-world distributed ordering system: handle multi-step transactions (order → payment → delivery), survive partial failures without inconsistent state, scale independently across domains, and support real-time delivery tracking. Designed and validated end-to-end — not just described.
High-frequency device telemetry (~5M+ events/day) with burst spikes during manufacturing shifts. Synchronous ingestion caused ~5% event drops under load. No replay meant failed events were permanently lost. JVM OOM crashes occurred silently after ~6–8 hours — no alerts before crash.
Manual offset commit adds minor coordination overhead (~5ms) vs auto-commit — fully justified by eliminating event loss. Caffeine cache uses bounded heap memory (<200MB) vs unbounded HashMap that was consuming 2–4GB before OOM. Kafka cluster overhead offset by eliminating the 5% event loss rate — data recovery at that scale would be more expensive.
Kafka consumer retried after timeout. DB write had already succeeded. Same payment processed twice → double debit to customer account.
Assumed "at-least-once delivery" wouldn't cause duplicates in practice. No idempotency enforcement at consumer level.
Traffic spike (~3× normal load). Consumer couldn't keep up — lag grew to 100K+ messages. Downstream systems started serving stale data. On-call woke up to Grafana alerts.
Under-partitioned topic (3 partitions). Kafka caps active consumers in a group at the partition count, so the consumer group couldn't scale horizontally beyond 3 consumers.
Hot keys expired simultaneously. 500+ concurrent requests missed cache and hit MySQL. DB CPU spiked to 95% — latency degraded across all services for ~40 seconds.
Naive TTL: every instance of the same key was set with an identical expiry time, with no protection against concurrent cache misses.
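Two common mitigations for this stampede, sketched with only the JDK: jittered TTLs so hot keys written together don't expire together, and single-flight loading so concurrent misses collapse into one DB query. The constants and names are illustrative; a distributed variant would take a short Redis lock (e.g. `SET NX`) instead of an in-process one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class CacheTtl {
    static final long BASE_TTL_SECONDS = 600;
    static final Map<String, Object> cache = new ConcurrentHashMap<>();

    // (1) Jitter: spread expirations over +/-10% of the base TTL so keys
    // written at the same moment don't all expire in the same second.
    static long jitteredTtlSeconds() {
        long jitter = (long) (BASE_TTL_SECONDS * 0.1);
        return BASE_TTL_SECONDS - jitter
                + ThreadLocalRandom.current().nextLong(2 * jitter + 1);
    }

    // (2) Single-flight: computeIfAbsent runs the loader at most once per key
    // under contention, so 500 concurrent misses issue one DB query, not 500.
    static Object getOrLoad(String key, Supplier<Object> loader) {
        return cache.computeIfAbsent(key, k -> loader.get());
    }
}
```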
Telemetry service ran fine for ~6–8 hours, then OOMKilled. Zero warnings. Service restarted, ran fine again for 6–8 hours, then died again. Cyclic pattern.
Deduplication logic stored processed event IDs in an unbounded HashMap. Memory grew linearly with event volume over time. No eviction, no TTL, no size bound.
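The production fix used Caffeine; this JDK-only sketch shows the same idea behind it: a dedup set with a hard size bound that evicts the least recently seen event ids instead of growing without limit. The class and bound are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedDedup {
    private final int maxEntries;
    private final Map<String, Boolean> seen;

    public BoundedDedup(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder=true -> iteration order is least-recently-used first,
        // and removeEldestEntry enforces the size bound on every insert.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > BoundedDedup.this.maxEntries;
            }
        };
    }

    // Returns true if this id is new (process it), false if a duplicate.
    public synchronized boolean firstTimeSeen(String eventId) {
        return seen.put(eventId, Boolean.TRUE) == null;
    }

    public synchronized int size() { return seen.size(); }
}
```

The tradeoff: an id older than the eviction window can be re-processed, so the bound must be sized to comfortably cover the redelivery horizon.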
Payment succeeded but the Order remained PENDING. Delivery was never dispatched: the customer paid and got no food, and three downstream services held conflicting state.
Saga coordinator held saga progress only in memory, with no durable state. It crashed mid-flow (k8s pod restart); on restart it had no record of where the saga was, so it neither retried nor compensated.
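A sketch of the fix: persist saga progress before executing each step, so a restarted coordinator can resume forward or compensate. The `Map` stands in for a durable `saga_state` table, and the step names and recovery actions are illustrative, not the original system's exact state machine.

```java
import java.util.HashMap;
import java.util.Map;

public class SagaCoordinator {
    enum Step { ORDER_CREATED, PAYMENT_DONE, DELIVERY_DISPATCHED, COMPLETED }

    // Stand-in for a durable saga_state table. The original bug was keeping
    // this only in process memory; persisting it lets a restarted pod resume.
    private final Map<String, Step> sagaStore = new HashMap<>();

    public void recordStep(String sagaId, Step step) {
        sagaStore.put(sagaId, step); // in production: UPDATE saga_state ...
    }

    // On restart, read the last durably recorded step and decide whether to
    // resume the next forward step or run compensation.
    public String recover(String sagaId) {
        Step last = sagaStore.getOrDefault(sagaId, Step.ORDER_CREATED);
        return switch (last) {
            case ORDER_CREATED -> "retry payment or compensate";
            case PAYMENT_DONE -> "dispatch delivery";
            case DELIVERY_DISPATCHED, COMPLETED -> "mark completed";
        };
    }
}
```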
JWT + OAuth2 (Spring Security) across all services. Short-lived access tokens (15min) + refresh token rotation. Service-to-service calls use mTLS within EKS cluster — no JWT for internal traffic (avoids token propagation overhead). Role-based access control (RBAC) enforced at controller layer.
API Gateway-level rate limiting (per-IP + per-user token bucket). Payment endpoints have stricter limits (10 req/min per user) vs read endpoints (100 req/min). Idempotency keys on payment API prevent abuse via rapid retries. Failed auth attempts trigger exponential backoff.
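A minimal token-bucket sketch matching the per-user limits above (e.g. 10 req/min on payment endpoints). Time is passed in explicitly so the refill logic is deterministic and testable; a gateway would keep one bucket per user or IP. All names and numbers are illustrative.

```java
public class TokenBucket {
    private final double capacity;
    private final double refillPerMs;
    private double tokens;
    private long lastRefillMs;

    public TokenBucket(double capacity, double refillPerMinute, long nowMs) {
        this.capacity = capacity;
        this.refillPerMs = refillPerMinute / 60_000.0;
        this.tokens = capacity;          // bucket starts full: bursts allowed
        this.lastRefillMs = nowMs;
    }

    public synchronized boolean tryAcquire(long nowMs) {
        // Refill lazily based on elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (nowMs - lastRefillMs) * refillPerMs);
        lastRefillMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;                    // caller responds 429 Too Many Requests
    }
}
```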
Banking domain: PII masking in logs (account numbers, names masked at logging layer via log4j PatternLayout filter). Sensitive fields encrypted at rest (AES-256). Kafka messages carrying PII use field-level encryption — consumers decrypt only if authorized. No raw card data in application layer.
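A JDK-only sketch of field-level encryption for a PII field carried in a Kafka message, using AES-256 in GCM mode (authenticated, random IV per field). Key management (KMS, rotation, per-consumer authorization) is out of scope here, and the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class FieldCrypto {
    private static final int IV_BYTES = 12;   // 96-bit IV, standard for GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    public static SecretKey newKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(256);
            return kg.generateKey();
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    // Encrypt one field; output is base64(iv || ciphertext+tag).
    public static String encryptField(SecretKey key, String plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ct = c.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    // Only consumers holding the key can recover the field.
    public static String decryptField(SecretKey key, String encoded) {
        try {
            byte[] in = Base64.getDecoder().decode(encoded);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] pt = c.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(pt, StandardCharsets.UTF_8);
        } catch (Exception e) { throw new IllegalStateException(e); }
    }
}
```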
Bean Validation (Jakarta) on all request DTOs — no raw input reaches service layer. JPA Criteria API + parameterized queries — no string concatenation in SQL. Kafka message schema validation at consumer entry point — malformed messages routed to DLQ, not processed. Content-type enforcement on all endpoints.
All financial operations (transfers, account changes) emit structured audit events to a dedicated Kafka topic — append-only, tamper-evident log. Audit events include: user_id, timestamp, action, before/after state. Retained for compliance. Not deletable by application layer — separate consumer writes to cold storage.
mTLS certificate rotation adds operational overhead — managed via cert-manager in k8s. Field-level encryption on Kafka adds ~10ms processing overhead per message — acceptable for PII fields, not applied to all fields. Rate limiting at gateway (not service level) means internal abuse possible — acceptable within mTLS-protected cluster.