Backend Software Engineer — Distributed Systems & Event-Driven Architecture
3.7+ years building production-grade backend systems in banking and IoT domains. I propose and drive architectural decisions (Kafka adoption, Saga design, CQRS strategy) within cross-functional teams, reviewed with senior architects. My role goes beyond ticket execution: system design, failure ownership, and tradeoff reasoning.
Payment and order flows needed burst tolerance. REST coupling would fail under traffic spikes because consumer slowness propagates back to producers immediately. Kafka decouples ingestion from processing, giving us backpressure handling, replay, and independent scaling.
Chose orchestrated Saga (central coordinator) over choreography for payment flows. In choreography, debugging a stuck saga requires reconstructing state from events across 6 services — operationally painful. Coordinator gives single source of truth for saga state.
High-frequency reads (customer profiles, order history) were hammering transactional MySQL. Introduced Redis read models — denormalized views updated via Kafka events. Result: <20ms reads, 2K+ req/min offloaded, ~40% Redis memory increase vs ~60% DB CPU reduction.
Standard produce-after-commit risks silent event loss if the service crashes between the DB write and the Kafka publish. The outbox pattern writes the event atomically in the same DB transaction as the business record; a relay service polls the outbox and publishes to Kafka, so no committed event is ever lost.
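A minimal sketch of the outbox flow described above. The in-memory lists stand in for the `orders` and `outbox` tables, and the `relayPoll` method stands in for the relay service; in production both rows are written in one SQL transaction and the relay publishes to Kafka. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class OutboxSketch {
    // Nested record standing in for a row in the outbox table.
    record OutboxEvent(long id, String payload, boolean published) {}

    static final List<String> orders = new ArrayList<>();      // business table
    static final List<OutboxEvent> outbox = new ArrayList<>(); // outbox table
    static long nextId = 1;

    // Step 1: business write + event write in the SAME transaction.
    static synchronized void placeOrder(String order) {
        orders.add(order);                                     // INSERT INTO orders ...
        outbox.add(new OutboxEvent(nextId++, "OrderPlaced:" + order, false)); // INSERT INTO outbox ...
        // commit: either both rows exist or neither does
    }

    // Step 2: relay polls unpublished rows, publishes them, marks them done.
    static synchronized List<String> relayPoll() {
        List<String> publishedNow = new ArrayList<>();
        for (int i = 0; i < outbox.size(); i++) {
            OutboxEvent e = outbox.get(i);
            if (!e.published()) {
                publishedNow.add(e.payload());                 // kafkaProducer.send(...)
                outbox.set(i, new OutboxEvent(e.id(), e.payload(), true));
            }
        }
        return publishedNow;
    }
}
```

Note the relay gives at-least-once publishing: a crash after send but before the mark-published update means the event is republished on the next poll, which is why consumers deduplicate.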
Kafka's at-least-once delivery means any consumer can receive the same event twice on retry or rebalance. Every consumer checks a processed_events deduplication table before processing. On duplicate, skip and ack — no double-processing.
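The dedup check can be sketched as below. The `HashSet` stands in for the `processed_events` table; in production the event id is inserted in the same DB transaction as the business write, so the check survives restarts. The field and method names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentConsumer {
    // Stand-in for the processed_events deduplication table.
    private final Set<String> processedEvents = new HashSet<>();
    private int debits = 0; // stand-in for the real side effect

    // Returns true if the event was processed, false if it was a duplicate.
    public boolean handle(String eventId) {
        if (!processedEvents.add(eventId)) {
            return false;        // duplicate: skip and ack
        }
        debits++;                // the actual business logic runs exactly once
        return true;
    }

    public int debits() { return debits; }
}
```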
Read replicas handle complex SQL queries well but still involve replication lag + query parsing overhead. Chose Redis because access patterns were pure key lookups (customer by ID, order by ID). Sub-millisecond, no query planner overhead.
Legacy monolith handling critical banking flows suffered from tight coupling (slow releases), DB contention (P95 ~800ms), and zero fault isolation (cascading failures under load).
Redis CQRS model increased memory usage by ~40% but reduced MySQL CPU load by ~60%, enabling a tier downgrade — net infra cost reduction. Kafka cluster operational overhead justified at 5M+ events/day (polling-based queue would require heavier DB at same throughput). EKS chosen over EC2 for zero-downtime rolling deployments — operational cost offset by eliminating manual deployment risk.
Design a real-world distributed ordering system: handle multi-step transactions (order → payment → delivery), survive partial failures without inconsistent state, scale independently across domains, and support real-time delivery tracking. Designed and validated end-to-end — not just described.
High-frequency device telemetry (~5M+ events/day) with burst spikes during manufacturing shifts. Synchronous ingestion caused ~5% event drops under load. No replay meant failed events were permanently lost. JVM OOM crashes occurred silently after ~6–8 hours — no alerts before crash.
Manual offset commit adds minor coordination overhead (~5ms) vs auto-commit — fully justified by eliminating event loss. Caffeine cache uses bounded heap memory (<200MB) vs unbounded HashMap that was consuming 2–4GB before OOM. Kafka cluster overhead offset by eliminating the 5% event loss rate — data recovery at that scale would be more expensive.
Kafka consumer retried after timeout. DB write had already succeeded. Same payment processed twice → double debit to customer account.
Assumed "at-least-once delivery" wouldn't cause duplicates in practice. No idempotency enforcement at consumer level.
Traffic spike (~3× normal load). Consumer couldn't keep up — lag grew to 100K+ messages. Downstream systems started serving stale data. On-call woke up to Grafana alerts.
Under-partitioned topic (3 partitions). Kafka caps active consumers in a group at the partition count, so the consumer group couldn't scale horizontally beyond 3 consumers.
Hot keys expired simultaneously. 500+ concurrent requests missed cache and hit MySQL. DB CPU spiked to 95% — latency degraded across all services for ~40 seconds.
Naive TTL: every instance of the same key was set with an identical expiry time, with no protection against concurrent cache misses.
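Two common mitigations for this stampede, sketched with only the JDK: jittered TTLs so hot keys written together don't expire together, and single-flight loading so concurrent misses collapse into one DB query. The constants and names are illustrative; a distributed variant would take a short Redis lock (e.g. `SET NX`) instead of an in-process one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class CacheTtl {
    static final long BASE_TTL_SECONDS = 600;
    static final Map<String, Object> cache = new ConcurrentHashMap<>();

    // (1) Jitter: spread expirations over +/-10% of the base TTL so keys
    // written at the same moment don't all expire in the same second.
    static long jitteredTtlSeconds() {
        long jitter = (long) (BASE_TTL_SECONDS * 0.1);
        return BASE_TTL_SECONDS - jitter
                + ThreadLocalRandom.current().nextLong(2 * jitter + 1);
    }

    // (2) Single-flight: computeIfAbsent runs the loader at most once per key
    // under contention, so 500 concurrent misses issue one DB query, not 500.
    static Object getOrLoad(String key, Supplier<Object> loader) {
        return cache.computeIfAbsent(key, k -> loader.get());
    }
}
```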
Telemetry service ran fine for ~6–8 hours, then OOMKilled. Zero warnings. Service restarted, ran fine again for 6–8 hours, then died again. Cyclic pattern.
Deduplication logic stored processed event IDs in an unbounded HashMap. Memory grew linearly with event volume over time. No eviction, no TTL, no size bound.
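The production fix used Caffeine; this JDK-only sketch shows the same idea behind it: a dedup set with a hard size bound that evicts the least recently seen event ids instead of growing without limit. The class and bound are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedDedup {
    private final int maxEntries;
    private final Map<String, Boolean> seen;

    public BoundedDedup(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder=true -> iteration order is least-recently-used first,
        // and removeEldestEntry enforces the size bound on every insert.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > BoundedDedup.this.maxEntries;
            }
        };
    }

    // Returns true if this id is new (process it), false if a duplicate.
    public synchronized boolean firstTimeSeen(String eventId) {
        return seen.put(eventId, Boolean.TRUE) == null;
    }

    public synchronized int size() { return seen.size(); }
}
```

The tradeoff: an id older than the eviction window can be re-processed, so the bound must be sized to comfortably cover the redelivery horizon.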
Payment succeeded but the Order remained PENDING. Delivery was never dispatched: the customer paid and got no food, and three downstream services held conflicting state.
Saga coordinator held saga progress only in memory, with no durable state. It crashed mid-flow (k8s pod restart); on restart it had no record of where the saga was, so it neither retried nor compensated.
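A sketch of the fix: persist saga progress before executing each step, so a restarted coordinator can resume forward or compensate. The `Map` stands in for a durable `saga_state` table, and the step names and recovery actions are illustrative, not the original system's exact state machine.

```java
import java.util.HashMap;
import java.util.Map;

public class SagaCoordinator {
    enum Step { ORDER_CREATED, PAYMENT_DONE, DELIVERY_DISPATCHED, COMPLETED }

    // Stand-in for a durable saga_state table. The original bug was keeping
    // this only in process memory; persisting it lets a restarted pod resume.
    private final Map<String, Step> sagaStore = new HashMap<>();

    public void recordStep(String sagaId, Step step) {
        sagaStore.put(sagaId, step); // in production: UPDATE saga_state ...
    }

    // On restart, read the last durably recorded step and decide whether to
    // resume the next forward step or run compensation.
    public String recover(String sagaId) {
        Step last = sagaStore.getOrDefault(sagaId, Step.ORDER_CREATED);
        return switch (last) {
            case ORDER_CREATED -> "retry payment or compensate";
            case PAYMENT_DONE -> "dispatch delivery";
            case DELIVERY_DISPATCHED, COMPLETED -> "mark completed";
        };
    }
}
```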
JWT + OAuth2 (Spring Security) across all services. Short-lived access tokens (15min) + refresh token rotation. Service-to-service calls use mTLS within EKS cluster — no JWT for internal traffic (avoids token propagation overhead). Role-based access control (RBAC) enforced at controller layer.
API Gateway-level rate limiting (per-IP + per-user token bucket). Payment endpoints have stricter limits (10 req/min per user) vs read endpoints (100 req/min). Idempotency keys on payment API prevent abuse via rapid retries. Failed auth attempts trigger exponential backoff.
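A minimal token-bucket sketch matching the per-user limits above (e.g. 10 req/min on payment endpoints). Time is passed in explicitly so the refill logic is deterministic and testable; a gateway would keep one bucket per user or IP. All names and numbers are illustrative.

```java
public class TokenBucket {
    private final double capacity;
    private final double refillPerMs;
    private double tokens;
    private long lastRefillMs;

    public TokenBucket(double capacity, double refillPerMinute, long nowMs) {
        this.capacity = capacity;
        this.refillPerMs = refillPerMinute / 60_000.0;
        this.tokens = capacity;          // bucket starts full: bursts allowed
        this.lastRefillMs = nowMs;
    }

    public synchronized boolean tryAcquire(long nowMs) {
        // Refill lazily based on elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (nowMs - lastRefillMs) * refillPerMs);
        lastRefillMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;                    // caller responds 429 Too Many Requests
    }
}
```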
Banking domain: PII masking in logs (account numbers, names masked at logging layer via log4j PatternLayout filter). Sensitive fields encrypted at rest (AES-256). Kafka messages carrying PII use field-level encryption — consumers decrypt only if authorized. No raw card data in application layer.
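A JDK-only sketch of field-level encryption for a PII field carried in a Kafka message, using AES-256 in GCM mode (authenticated, random IV per field). Key management (KMS, rotation, per-consumer authorization) is out of scope here, and the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class FieldCrypto {
    private static final int IV_BYTES = 12;   // 96-bit IV, standard for GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    public static SecretKey newKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(256);
            return kg.generateKey();
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    // Encrypt one field; output is base64(iv || ciphertext+tag).
    public static String encryptField(SecretKey key, String plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ct = c.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    // Only consumers holding the key can recover the field.
    public static String decryptField(SecretKey key, String encoded) {
        try {
            byte[] in = Base64.getDecoder().decode(encoded);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] pt = c.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(pt, StandardCharsets.UTF_8);
        } catch (Exception e) { throw new IllegalStateException(e); }
    }
}
```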
Bean Validation (Jakarta) on all request DTOs — no raw input reaches service layer. JPA Criteria API + parameterized queries — no string concatenation in SQL. Kafka message schema validation at consumer entry point — malformed messages routed to DLQ, not processed. Content-type enforcement on all endpoints.
All financial operations (transfers, account changes) emit structured audit events to a dedicated Kafka topic — append-only, tamper-evident log. Audit events include: user_id, timestamp, action, before/after state. Retained for compliance. Not deletable by application layer — separate consumer writes to cold storage.
mTLS certificate rotation adds operational overhead — managed via cert-manager in k8s. Field-level encryption on Kafka adds ~10ms processing overhead per message — acceptable for PII fields, not applied to all fields. Rate limiting at gateway (not service level) means internal abuse possible — acceptable within mTLS-protected cluster.