csp@backend:~/portfolio$ cat about.txt

Chanda Sai Prakash

Backend Software Engineer — Distributed Systems & Event-Driven Architecture

3.7+ years building production-grade backend systems in banking and IoT domains. I propose and drive architectural decisions — Kafka adoption, Saga design, CQRS strategy — within cross-functional teams, with designs reviewed by senior architects. Not just ticket execution: system design, failure ownership, and tradeoff reasoning.

Open to SDE2 / Senior Backend / Software Engineer roles — Hyderabad / Remote India
3.7+ Years Exp
5M+ Events/Day
800→500ms P95 Latency
<0.2% Data Loss (was 5%)
02.

skills.json

{
  "core": ["Java 21", "Spring Boot 3.x", "Spring Cloud", "Spring Security"],
  "messaging": ["Apache Kafka", "Idempotent Consumers", "DLQ", "Retry Topics", "Outbox Pattern", "Manual Offset Commit"],
  "databases": ["MySQL", "MongoDB", "PostgreSQL"],
  "caching": ["Redis", "Caffeine"],
  "cloud": ["AWS EKS", "EC2", "S3", "IAM", "CloudWatch"],
  "patterns": ["Orchestrated Saga", "CQRS", "Outbox", "Compensating Transactions", "Circuit Breaker", "Bulkhead"],
  "observability": ["OpenTelemetry", "Prometheus", "Grafana", "ELK", "Jaeger"],
  "security": ["JWT / OAuth2", "Spring Security", "Rate Limiting", "PII Masking", "RBAC"]
}
// Languages & Frameworks
Java 21 · Spring Boot · Spring Cloud · Spring Security · OpenFeign
// Messaging & Async
Apache Kafka · Idempotent Consumers · Retry Topics · DLQ · Outbox Pattern
// Databases
MySQL · Redis · MongoDB · PostgreSQL · Caffeine
// Cloud & DevOps
AWS EKS · Docker · Kubernetes · Terraform · Helm · GitHub Actions
// Architecture Patterns
Saga Orchestration · CQRS · Outbox · Resilience4j · Bulkhead · Circuit Breaker
// Observability
OpenTelemetry · Prometheus · Grafana · ELK Stack · Jaeger
03.

architecture/

design-decisions.md — decisions I proposed and drove, reviewed with senior architects
# architecture decisions — 4-member backend team @ Accenture
 
architecture/
 ├── payment-saga-flow/     // orchestration vs choreography — why orchestration won
 ├── kafka-topic-strategy/  // partition design, ordering tradeoffs, exactly-once cost
 ├── cqrs-read-models/      // redis denormalization, memory cost vs DB cost
 ├── outbox-pattern/        // exactly-once publishing, polling overhead
 └── resilience-strategy/   // circuit breaker, bulkhead, timeout tuning
⚡ Kafka over REST — why

Payment and order flows needed burst tolerance. REST coupling would fail under traffic spikes — consumer slowness propagates back to producers immediately. Kafka decouples ingestion from processing, giving us backpressure handling, replay, and independent scaling.

⚖ Tradeoff: accepted eventual consistency on order status (vs immediate). Added polling endpoint for clients needing synchronous confirmation. Kafka cluster adds operational overhead — justified vs DB-queue polling at 5M events/day.
🔀 Orchestration vs Choreography

Chose orchestrated Saga (central coordinator) over choreography for payment flows. In choreography, debugging a stuck saga requires reconstructing state from events across 6 services — operationally painful. Coordinator gives single source of truth for saga state.

⚖ Tradeoff: coordinator is a coupling point and single point of failure. Mitigated by persisting saga state in DB — coordinator can crash and recover. Choreography would have been better if teams owned separate services independently.
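A rough sketch of how a crash-recoverable coordinator can work. All class, enum, and method names here are illustrative, and an in-memory map stands in for the saga_state table; the point is that every transition is persisted before the next step runs, so a restarted coordinator can pick up from the stored state.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of an orchestrated saga whose state survives coordinator crashes.
public class SagaCoordinator {
    enum State { STARTED, PAYMENT_DONE, INVENTORY_RESERVED, COMPLETED, COMPENSATING, FAILED }

    // Stand-in for the saga_state DB table: sagaId -> last persisted state.
    static final Map<String, State> sagaStore = new HashMap<>();

    // Persist the transition FIRST, then act — a crash after persist is recoverable.
    static void transition(String sagaId, State next) {
        sagaStore.put(sagaId, next);   // in production: a transactional DB write
    }

    static State run(String sagaId, boolean paymentOk, boolean inventoryOk) {
        transition(sagaId, State.STARTED);
        if (!paymentOk) { transition(sagaId, State.FAILED); return State.FAILED; }
        transition(sagaId, State.PAYMENT_DONE);
        if (!inventoryOk) {
            transition(sagaId, State.COMPENSATING); // refund payment (compensation is idempotent)
            transition(sagaId, State.FAILED);
            return State.FAILED;
        }
        transition(sagaId, State.INVENTORY_RESERVED);
        transition(sagaId, State.COMPLETED);
        return State.COMPLETED;
    }
}
```

Because state lives in the store rather than coordinator memory, "coordinator crashed mid-flow" becomes "resume from the last persisted state" instead of "reconstruct from events across services."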
📖 CQRS — read/write split

High-frequency reads (customer profiles, order history) were hammering transactional MySQL. Introduced Redis read models — denormalized views updated via Kafka events. Result: <20ms reads, 2K+ req/min offloaded, ~40% Redis memory increase vs ~60% DB CPU reduction.

⚖ Tradeoff: read stale risk (eventual consistency window ~200ms). Mitigated with 5s TTL + cache invalidation on write events. Chosen over DB read replicas because read patterns were key-based — replicas add replication lag without solving the latency problem.
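The read-model update path reduces to an event handler that upserts a denormalized view. A sketch with a plain map standing in for Redis (class and field names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the CQRS read side: a denormalized view kept fresh by Kafka events.
public class OrderReadModel {
    // orderId -> denormalized order summary (in production: a Redis hash with TTL)
    static final Map<String, String> view = new ConcurrentHashMap<>();

    // Called by the Kafka consumer on each order event: upsert the read view.
    static void onOrderEvent(String orderId, String status, String customer) {
        view.put(orderId, customer + "|" + status);  // denormalized: no JOIN at read time
    }

    // Read path: pure key lookup — no query planner, no transactional DB hit.
    // May lag writes by the eventual-consistency window.
    static String get(String orderId) {
        return view.get(orderId);
    }
}
```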
📮 Outbox Pattern — exactly-once

Standard produce-after-commit risks silent event loss on a crash between the DB write and the Kafka publish. Outbox writes the event atomically in the same DB transaction as the business record. A relay service polls and publishes — no committed event is ever lost.

⚖ Tradeoff: polling relay adds ~50–100ms publishing latency. Acceptable for our async flows. Adds operational complexity (relay must be monitored). At-least-once delivery from relay requires idempotent consumers downstream.
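A minimal in-memory sketch of the pattern. Lists stand in for the DB tables and the broker, and all names are illustrative; the invariant shown is that the business write and the outbox write are atomic, while the relay delivers at-least-once.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the outbox pattern with in-memory stand-ins for MySQL and Kafka.
public class OutboxDemo {
    record OutboxRow(long id, String payload, boolean published) {}

    static final List<String> orders = new ArrayList<>();      // business table
    static final List<OutboxRow> outbox = new ArrayList<>();   // outbox table
    static final List<String> broker = new ArrayList<>();      // stand-in for Kafka

    // Business write + event write happen "atomically" (one DB transaction in production).
    static synchronized void placeOrder(String order) {
        orders.add(order);
        outbox.add(new OutboxRow(outbox.size() + 1, "OrderPlaced:" + order, false));
    }

    // Relay: poll unpublished rows, publish, then mark published.
    // A crash between publish and mark => re-publish => at-least-once delivery.
    static synchronized int relayOnce() {
        int sent = 0;
        for (int i = 0; i < outbox.size(); i++) {
            OutboxRow row = outbox.get(i);
            if (!row.published()) {
                broker.add(row.payload());
                outbox.set(i, new OutboxRow(row.id(), row.payload(), true));
                sent++;
            }
        }
        return sent;
    }
}
```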
🔄 Idempotent consumers

Kafka's at-least-once delivery means any consumer can receive the same event twice on retry or rebalance. Every consumer checks a processed_events deduplication table before processing. On duplicate, skip and ack — no double-processing.

⚖ Tradeoff: one extra DB read per message. At 5M events/day that's 5M reads — managed by indexing on event_id and batching. Kafka's exactly-once (transactions API) was considered but rejected: higher broker overhead, and it only covers Kafka-to-Kafka flows — it can't make the external DB write exactly-once, so the dedup table is needed anyway.
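The dedup-on-entry check reduces to a sketch like this, with an in-memory set standing in for the processed_events table (names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of an idempotent consumer: check-then-process, safe under infinite retries.
public class IdempotentConsumer {
    // Stand-in for the processed_events table (indexed by event_id in production).
    static final Set<String> processedEvents = new HashSet<>();
    static int processedCount = 0;

    // Returns true if the event was processed, false if skipped as a duplicate.
    static boolean handle(String eventId) {
        if (!processedEvents.add(eventId)) {
            return false;            // duplicate delivery: skip and ack
        }
        processedCount++;            // real business logic goes here
        return true;
    }
}
```

Redelivering the same event_id any number of times leaves the side effects unchanged, which is exactly what at-least-once delivery requires of consumers.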
🛡️ Redis vs DB read replicas

Read replicas handle complex SQL queries well but still involve replication lag + query parsing overhead. Chose Redis because access patterns were pure key lookups (customer by ID, order by ID). Sub-millisecond, no query planner overhead.

⚖ Tradeoff: Redis memory cost ~40% higher than replica storage for our dataset. Justified: Redis reduced DB P95 from 100ms → <20ms, cutting DB instance cost by ~35% (downsized RDS tier). Net positive on infra spend.
payment-saga-flow — orchestrated saga architecture
[diagram] Client → API Gateway (rate-limit / JWT) → Order Service [Saga Coordinator, saga_state in DB] → Apache Kafka (12 partitions · DLQ) → Payment Svc (idempotent · outbox) · Inventory Svc (reserve / release) · Delivery Svc (dispatch / cancel). Reads: Redis CQRS (<20ms) · MySQL (transactional · saga state). DLQ + Replay: 3 retries, then isolate. Edge types: event flow · kafka publish · failure path · cache/db write.
04.

projects/

Financial Services Modernization Platform
Banking Client @ Accenture — 4-member team — 100K+ req/day
Production

Legacy monolith handling critical banking flows suffered from tight coupling (slow releases), DB contention (P95 ~800ms), and zero fault isolation (cascading failures under load).

Service boundary design → Account, Payment, Reconciliation domains — I defined event contracts and data ownership
Kafka adoption proposal → Presented async-vs-REST tradeoff to team; got buy-in; defined topic + partition strategy
Saga pattern selection → Proposed orchestrated (not choreographed) Saga with explicit compensation logic
Outbox pattern → Identified dual-write risk; proposed and implemented Outbox with relay service
CQRS + Redis read model → Defined denormalization strategy; justified memory cost vs DB load reduction to team
Orchestrated Saga → Central coordinator owns state. Tradeoff: coordinator is coupling point + SPOF. Fixed by persisting saga state — crash-recoverable.
At-least-once Kafka → Chose at-least-once + idempotent consumers over exactly-once (Kafka transactions). Exactly-once adds ~30–40% broker overhead; idempotency check is cheaper at our scale.
Redis eventual consistency → CQRS read models are ~200ms stale. Acceptable for profile reads; not acceptable for payment status — payment queries bypass cache and hit MySQL directly.
Duplicate payments on retry: Kafka retry re-triggered payment step — double debit. Fixed: idempotency key (transaction_id) with unique DB constraint + processed_events deduplication table.
Kafka consumer lag explosion: 3× traffic spike, 3-partition topic couldn't scale. Consumer lag hit 100K+ messages. Fixed: repartitioned 3 → 12, scaled consumer group, added Grafana alert at 10K lag threshold.
Cascading timeout storm: Payment gateway slowdown exhausted Account service threads. Fixed: Resilience4j bulkhead on payment client — downstream degradation now isolated, doesn't bleed.
Silent N+1 query killer: Transaction history API was ~800ms P95 — discovered via MySQL slow query log + execution plan. Single JOIN rewrite: 800ms → 500ms, no cache needed.

Redis CQRS model increased memory usage by ~40% but reduced MySQL CPU load by ~60%, enabling a tier downgrade — net infra cost reduction. Kafka cluster operational overhead justified at 5M+ events/day (polling-based queue would require heavier DB at same throughput). EKS chosen over EC2 for zero-downtime rolling deployments — operational cost offset by eliminating manual deployment risk.

100K+ Req/Day
500ms P95 (was 800ms)
5M+ Events/Day
<0.2% Data Loss (was 5%)
<20ms Redis Read Latency
85%+ Test Coverage
FOODIE — Distributed Food Ordering Platform
Personal Project — end-to-end system design & load tested
Personal Project

Design a real-world distributed ordering system: handle multi-step transactions (order → payment → delivery), survive partial failures without inconsistent state, scale independently across domains, and support real-time delivery tracking. Designed and validated end-to-end — not just described.

Full architecture design → 6 services + API Gateway, event contracts, Kafka topic strategy
Saga orchestrator → Order service as coordinator with persisted saga state machine
Polyglot persistence → MySQL (transactional), MongoDB (catalog), Redis (real-time delivery state)
IaC + CI/CD → Terraform + Helm on EKS, rolling deployments, zero-downtime
Orchestration, not choreography → Centralized control, explicit compensation. Tradeoff: coordinator complexity handled via persistent saga state + recovery scheduler.
Outbox pattern → Solved DB + Kafka dual-write atomicity. Tradeoff: ~50–100ms publish delay from relay polling — acceptable for async order flow.
CQRS + Redis reads → <20ms read latency. Tradeoff: eventual consistency (~200ms window). Payment status bypasses cache — reads directly from MySQL for correctness.
Probabilistic cache expiry → Fixed stampede on hot keys. Tradeoff: slightly stale data on early expiry — acceptable vs DB CPU spike from simultaneous misses.
Duplicate order/payment events: Kafka retries caused double-processing. Fixed: idempotent consumers using event_id + unique DB constraint. Verified by forcing consumer retry in test.
Payment success, order stuck PENDING: Event publish failed after DB write. Fixed: Outbox pattern + retry publisher. Verified by killing service between DB write and publish.
Saga stuck mid-flow on coordinator crash: Stateless coordinator lost progress. Fixed: persisted saga state in DB + recovery scheduler retries stuck sagas on restart.
Cache stampede on hot keys: Simultaneous expiry sent 500+ requests to MySQL. Fixed: probabilistic early expiration + mutex lock on cache miss.
Throughput → ~1,200 req/sec sustained
P95 latency → ~310ms end-to-end order flow
Kafka lag → <500 messages stable under load
Redis read latency → 8–18ms p95
Error rate → ~0.3% (transient retries, not data corruption)
6 Microservices
1,200 rps Load Tested
310ms P95 Latency
<20ms Redis Read
Chaos Failure Tested
→ View on GitHub
IoT Telemetry Ingestion Platform
Manufacturing Client @ Accenture — 5M+ events/day production pipeline
Production

High-frequency device telemetry (~5M+ events/day) with burst spikes during manufacturing shifts. Synchronous ingestion caused ~5% event drops under load. No replay meant failed events were permanently lost. JVM OOM crashes occurred silently after ~6–8 hours — no alerts before crash.

Kafka ingestion buffer design → Decoupled producers from consumers; defined device_id-based partitioning strategy
Manual offset management → Replaced auto-commit; offset committed only after successful processing — no silent loss on crash
DLQ + replay pipeline → Poison messages isolated after 3 retries; replay without impacting main flow
JVM memory leak debugging → Led heap dump analysis + CloudWatch metrics; identified unbounded deduplication map
Kafka as shock absorber → Producers write at device speed; consumers process at controlled rate. Tradeoff: Kafka cluster operational overhead — justified vs event loss under burst.
Partitioning by device_id → Guarantees per-device ordering without cross-device constraints. Tradeoff: hot partition risk on high-frequency devices — monitored via partition lag metrics.
Manual offset commit → Zero silent data loss on consumer crash. Tradeoff: at-least-once delivery — requires idempotent processing for duplicate events.
Caffeine over HashMap → Bounded deduplication cache (TTL + max size). Tradeoff: slightly higher miss rate vs unbounded map — acceptable to eliminate OOM risk entirely.
JVM OOM crash (6–8hr cycle): Deduplication logic stored all processed events in unbounded HashMap. Memory grew linearly until OOMKilled. Fixed: replaced with Caffeine (TTL + size bound). Added heap usage alert at 75% — not 100%.
Consumer lag explosion during shift spike: Processing slower than ingestion rate. Fixed: tuned max.poll.records, scaled consumer group horizontally, added lag-based autoscaling trigger.
Out-of-order replay from DLQ: DLQ replay ignored original sequence_number. Caused ordering violations on replay. Fixed: introduced sequence_number header + ordered replay logic in replay service.
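The ordered-replay fix amounts to re-sorting DLQ records by device and sequence_number before re-publishing. A sketch (the record shape and method names are hypothetical, not the replay service's actual API):

```java
import java.util.Comparator;
import java.util.List;

// Sketch of ordered DLQ replay: restore per-device order via the sequence_number
// header before events are re-published to the main topic.
public class DlqReplay {
    record FailedEvent(String deviceId, long sequenceNumber, String payload) {}

    static List<FailedEvent> orderedReplay(List<FailedEvent> dlq) {
        return dlq.stream()
                .sorted(Comparator.comparing(FailedEvent::deviceId)        // group per device
                        .thenComparingLong(FailedEvent::sequenceNumber))   // original order
                .toList();
    }
}
```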

Manual offset commit adds minor coordination overhead (~5ms) vs auto-commit — fully justified by eliminating event loss. Caffeine cache uses bounded heap memory (<200MB) vs unbounded HashMap that was consuming 2–4GB before OOM. Kafka cluster overhead offset by eliminating the 5% event loss rate — data recovery at that scale would be more expensive.

5M+ Events/Day
<1s Ingestion Latency
<0.2% Data Loss (was 5%)
0 OOM Crashes Post-Fix
05.

production_incidents.log

real failures — what broke, why, and how I fixed it
# Real engineers talk about what failed. Here's mine.
# All incidents below are production. All are resolved.
🚨 INC-001 — Duplicate Payment Processing
Critical · Resolved
What Happened

Kafka consumer retried after timeout. DB write had already succeeded. Same payment processed twice → double debit to customer account.

Root Cause

Assumed "at-least-once delivery" wouldn't cause duplicates in practice. No idempotency enforcement at consumer level.

Fix
  • Introduced idempotency key (transaction_id) with unique DB constraint
  • Added processed_events table to track consumed event IDs per consumer
  • Made consumer logic safe for infinite retries — dedup on entry
Never trust Kafka delivery semantics — design every consumer as idempotent, or you will corrupt financial data. The delivery guarantee is a broker property, not a business guarantee.
🚨 INC-002 — Kafka Consumer Lag Explosion
High · Resolved
What Happened

Traffic spike (~3× normal load). Consumer couldn't keep up — lag grew to 100K+ messages. Downstream systems started serving stale data. On-call woke up to Grafana alerts.

Root Cause

Under-partitioned topic (3 partitions). Consumer group size was locked to partition count — couldn't scale horizontally beyond 3 consumers.

Fix
  • Increased partitions: 3 → 12 (existing keys remap to new partitions, so per-key ordering resets at the changeover)
  • Scaled consumer group to match new partition count
  • Added Grafana lag alert at 10K threshold (not 100K)
Kafka scalability is bounded by partition count — you don't scale consumers, you scale partitions. Design partition count for future peak, not current load. Changing partitions in production is painful.
🚨 INC-003 — Redis Cache Stampede
High · Resolved
What Happened

Hot keys expired simultaneously. 500+ concurrent requests missed cache and hit MySQL. DB CPU spiked to 95% — latency degraded across all services for ~40 seconds.

Root Cause

Naive TTL — all instances of the same key set with identical expiry time. No protection against concurrent cache miss.

Fix
  • Implemented probabilistic early expiration (PER) — expire slightly before TTL based on recomputation cost
  • Added mutex locking on cache miss — only one thread recomputes, rest wait
  • Added jitter to TTL values (±10%) across keys
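The early-expiration check itself is tiny. A sketch following the standard XFetch formula, where delta is the observed recompute cost and beta tunes aggressiveness (parameter names are illustrative):

```java
// Probabilistic early expiration (XFetch): each reader independently decides to
// refresh slightly before the real TTL, so a hot key never expires for everyone at once.
public class EarlyExpiry {
    // deltaMs: recompute cost; beta: aggressiveness (1.0 typical); rand01: uniform (0,1]
    static boolean shouldRefresh(long nowMs, long expiryMs, double deltaMs,
                                 double beta, double rand01) {
        // -log(rand01) is exponentially distributed; it widens the early-refresh
        // window in proportion to how expensive the value is to recompute.
        return nowMs - deltaMs * beta * Math.log(rand01) >= expiryMs;
    }
}
```

Far from expiry the check almost never fires; close to expiry, one lucky reader refreshes early while the rest keep serving the cached value.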
Caching is not just a performance optimization — it's a distributed systems problem. A cache that expires everything at once is a time bomb pointed at your database.
🚨 INC-004 — JVM OOM Crash in Telemetry Pipeline
Critical · Resolved
What Happened

Telemetry service ran fine for ~6–8 hours, then OOMKilled. Zero warnings. Service restarted, ran fine again for 6–8 hours, then died again. Cyclic pattern.

Root Cause

Deduplication logic stored processed event IDs in an unbounded HashMap. Memory grew linearly with event volume over time. No eviction, no TTL, no size bound.

Fix
  • Replaced HashMap with Caffeine cache (TTL=1h + maxSize=500K entries)
  • Added heap usage alert at 75% — not 100% (too late by then)
  • Enabled heap dump on OOM (-XX:+HeapDumpOnOutOfMemoryError) for future debugging
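To illustrate just the bounding idea (Caffeine itself adds TTL and smarter eviction on top), a size-capped dedup map can be sketched with a plain LinkedHashMap:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of the fix's core property: the dedup structure has a hard size
// bound and evicts least-recently-used entries instead of growing forever.
public class BoundedDedup {
    static Map<String, Boolean> newCache(int maxSize) {
        // accessOrder=true => iteration order is LRU, so the eldest entry is the LRU one
        return new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxSize;  // evict instead of growing unboundedly
            }
        };
    }
}
```

The unbounded HashMap version of this is exactly the linear-growth pattern that caused the 6–8 hour OOM cycle.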
If memory usage is linear over time, your system is already dead — you just haven't noticed. Alert on trend, not threshold. A 75% heap alert gives you time to act; 100% means you're reading this post-mortem.
🚨 INC-005 — Saga Stuck in Inconsistent State
Critical · Resolved
What Happened

Payment succeeded but Order remained PENDING. Delivery was never dispatched. Customer paid, got no food. Downstream services had conflicting state across three services.

Root Cause

Saga coordinator was stateless — held saga progress in memory. Coordinator crashed mid-flow (k8s pod restart). On restart, had no memory of where the saga was. Didn't retry; didn't compensate.

Fix
  • Persisted saga state machine in DB — every step transition stored transactionally
  • Added recovery scheduler — on startup, find sagas in non-terminal states and retry
  • Made all compensation steps idempotent (safe to call multiple times)
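The recovery scheduler's core query is "find sagas not in a terminal state." A sketch with an in-memory map standing in for the saga_state table (in production this is a SELECT at startup; names are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the startup recovery scan: collect sagas persisted in a
// non-terminal state so the coordinator can retry or compensate them.
public class SagaRecovery {
    static final Set<String> TERMINAL = Set.of("COMPLETED", "FAILED");

    static List<String> findStuckSagas(Map<String, String> sagaStates) {
        return sagaStates.entrySet().stream()
                .filter(e -> !TERMINAL.contains(e.getValue()))
                .map(Map.Entry::getKey)
                .sorted()           // deterministic retry order
                .toList();
    }
}
```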
Distributed transactions don't fail cleanly. A saga that doesn't persist state is a saga that will get stuck the moment your coordinator restarts. Design for recovery, not just execution.
06.

security.md

security considerations — banking domain
# Banking domain + no security = 🚨. Here's how I think about it.
🔐 Auth Strategy

JWT + OAuth2 (Spring Security) across all services. Short-lived access tokens (15min) + refresh token rotation. Service-to-service calls use mTLS within EKS cluster — no JWT for internal traffic (avoids token propagation overhead). Role-based access control (RBAC) enforced at controller layer.

🚦 Rate Limiting & Abuse Prevention

API Gateway-level rate limiting (per-IP + per-user token bucket). Payment endpoints have stricter limits (10 req/min per user) vs read endpoints (100 req/min). Idempotency keys on payment API prevent abuse via rapid retries. Failed auth attempts trigger exponential backoff.
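A per-user token bucket of the kind described fits in a few lines. This is a sketch, not the gateway's actual implementation, and the parameters are illustrative:

```java
// Minimal token-bucket rate limiter: capacity tokens, refilled continuously.
// E.g. 10 req/min per user => capacity 10, refillPerMs = 10.0 / 60_000.
public class TokenBucket {
    private final double capacity, refillPerMs;
    private double tokens;
    private long lastMs;

    TokenBucket(double capacity, double refillPerMs, long nowMs) {
        this.capacity = capacity;
        this.refillPerMs = refillPerMs;
        this.tokens = capacity;     // start full
        this.lastMs = nowMs;
    }

    // Returns true if the request is admitted, false if rate-limited.
    synchronized boolean tryAcquire(long nowMs) {
        tokens = Math.min(capacity, tokens + (nowMs - lastMs) * refillPerMs);
        lastMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The bucket absorbs short bursts up to capacity while enforcing the average rate, which is why it suits payment endpoints better than a fixed window.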

🛡️ PII & Sensitive Data Handling

Banking domain: PII masking in logs (account numbers, names masked at logging layer via log4j PatternLayout filter). Sensitive fields encrypted at rest (AES-256). Kafka messages carrying PII use field-level encryption — consumers decrypt only if authorized. No raw card data in application layer.
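The masking rule itself can be sketched as a regex that keeps only the last four digits of long numeric runs. This is a stand-in for the real log4j PatternLayout filter, and the pattern shown is illustrative:

```java
// Sketch of log-layer PII masking: account-number-like digit runs reduced to last-4.
public class PiiMasker {
    static String maskAccounts(String logLine) {
        // Mask runs of 8+ digits, keeping only the trailing 4 visible.
        // \d{4,} greedily eats leading digits while the lookahead preserves the last 4.
        return logLine.replaceAll("\\d{4,}(?=\\d{4})", "****");
    }
}
```

Short numeric values (OTPs, amounts under 8 digits) pass through untouched, while anything account-number-length is masked before it reaches the log sink.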

✅ Input Validation & Injection Prevention

Bean Validation (Jakarta) on all request DTOs — no raw input reaches service layer. JPA Criteria API + parameterized queries — no string concatenation in SQL. Kafka message schema validation at consumer entry point — malformed messages routed to DLQ, not processed. Content-type enforcement on all endpoints.

🔍 Audit Trail

All financial operations (transfers, account changes) emit structured audit events to a dedicated Kafka topic — append-only, tamper-evident log. Audit events include: user_id, timestamp, action, before/after state. Retained for compliance. Not deletable by application layer — separate consumer writes to cold storage.

⚠️ Known Gaps & Trade-offs

mTLS certificate rotation adds operational overhead — managed via cert-manager in k8s. Field-level encryption on Kafka adds ~10ms processing overhead per message — acceptable for PII fields, not applied to all fields. Rate limiting at gateway (not service level) means internal abuse possible — acceptable within mTLS-protected cluster.

07.

experience.log

Accenture
Java Backend Engineer — 4-member backend team, Financial Services & IoT
Sep 2022 — Present
Hyderabad, India
Java 21 · Spring Boot · Kafka · Redis · MySQL · AWS EKS · Resilience4j · OpenTelemetry · Saga · CQRS · Outbox
certifications
AZ-400 Microsoft Certified: DevOps Engineer Expert — 2025
AZ-104 Microsoft Certified: Azure Administrator Associate — 2025
AZ-900 Microsoft Certified: Azure Fundamentals — 2024
// Education: B.Tech ECE — Malla Reddy Engineering College (2018–2022) | CGPA: 7.2
08.

contact.sh

# ✅ Open to SDE2 / Senior Backend / Software Engineer roles
# 🎯 Product companies, MNCs — Hyderabad / Remote India
# 💬 Ask me: Kafka design, Saga patterns, distributed systems failures
echo "I own systems end-to-end. Let's build something that actually scales."
⬇ Download Resume (PDF)
// email chandasaiprakash123@gmail.com // linkedin linkedin.com/in/chandasaiprakash // github github.com/Chandasaiprakash // leetcode leetcode.com/u/Chandasaiprakash