UPI Migration: DC to AWS Cloud
A comprehensive technical deep-dive into migrating India's critical UPI payment infrastructure from PPBL (Paytm Payments Bank) physical data centers to AWS Cloud with zero downtime.
Executive Summary
The entire UPI stack at PPBL (Paytm Payments Bank) required migration from on-premises data centers to AWS Cloud. This initiative demanded meticulous planning across networking, security, data synchronization, deployment orchestration, and compliance certifications (PCI-DSS, ISO 27001, SOC 2) to ensure zero customer impact and regulatory adherence.
The Challenge
Technical Constraints
- Entire UPI stack running on legacy on-premises infrastructure
- Complex networking requirements: VPCs, subnets, peering, NAT/internet gateways, PrivateLink, and Direct Connect for NPCI and bank connectivity
- Large-scale data migration with strict consistency requirements
- Multi-service dependencies and tight coupling
Business Requirements
- Zero downtime tolerance for UPI transactions
- Strict regulatory compliance and audit requirements
- Minimal customer impact during migration
- Rollback capability at every migration phase
- Replacement of obsolete tooling with modern, supported alternatives
- An architecture designed for maximum automation and minimal manual intervention
Migration Approach
Phase 1: Foundation
- AWS account setup with landing zone best practices
- VPC design: Multi-AZ architecture with public/private subnets
- Network connectivity: VPN, Direct Connect, peering configurations
- Security groups, NACLs, and IAM roles/policies
- EKS cluster provisioning with Istio service mesh
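The multi-AZ subnet layout from the VPC design step can be sketched in code. This is a minimal illustration using Python's `ipaddress` module; the CIDR block and AZ names are assumptions for the example, and real sizing would account for service scale and the NPCI/bank connectivity ranges.

```python
import ipaddress

def plan_subnets(vpc_cidr: str, azs: list[str]) -> dict:
    """Carve a VPC CIDR into one public and one private subnet per AZ.

    Illustrative only: CIDR and AZ names below are hypothetical.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    # Two subnets (public + private) per AZ; find the smallest prefix
    # that yields at least that many equal-sized blocks.
    needed = 2 * len(azs)
    prefix = vpc.prefixlen
    while 2 ** (prefix - vpc.prefixlen) < needed:
        prefix += 1
    blocks = list(vpc.subnets(new_prefix=prefix))
    plan = {}
    for i, az in enumerate(azs):
        plan[az] = {
            "public": str(blocks[2 * i]),
            "private": str(blocks[2 * i + 1]),
        }
    return plan

plan = plan_subnets("10.0.0.0/16", ["ap-south-1a", "ap-south-1b", "ap-south-1c"])
```

Public subnets host NAT/IGW-facing resources; private subnets hold the EKS node groups and data stores.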
Phase 2: Data Migration
- Database replication setup (Aerospike and other data stores)
- Kafka topic migration and consumer group synchronization
- S3 bucket creation with lifecycle policies
- Data validation and consistency checks
- Performance baseline establishment
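The data validation step can be sketched as a keyed diff between the source (DC) and target (AWS) stores. The digest-based comparison below is a simplified illustration; in practice the record sets would be streamed from database scans rather than held in dicts, and the field layout shown is hypothetical.

```python
import hashlib

def record_digest(record: dict) -> str:
    """Stable digest of a record for cross-store comparison."""
    blob = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(blob.encode()).hexdigest()

def diff_stores(source: dict, target: dict) -> dict:
    """Compare keyed record sets: keys missing from the target,
    extra keys in the target, and keys whose contents diverge."""
    missing = [k for k in source if k not in target]
    extra = [k for k in target if k not in source]
    mismatched = [
        k for k in source
        if k in target and record_digest(source[k]) != record_digest(target[k])
    ]
    return {"missing": missing, "extra": extra, "mismatched": mismatched}

# Hypothetical sample records keyed by transaction id.
diff = diff_stores(
    {"t1": {"amt": 100}, "t2": {"amt": 50}},
    {"t1": {"amt": 100}, "t2": {"amt": 51}, "t3": {"amt": 5}},
)
```

A cutover gate would require all three lists to be empty before traffic shifts.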
Phase 3: Application Deployment
- Containerization of all UPI microservices
- Helm chart creation and GitOps setup with Argo CD
- Blue/green deployment preparation
- Service mesh configuration (Istio routing rules)
- Observability stack deployment (Prometheus, Grafana)
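The Istio routing rules prepared for blue/green can be pictured as a weighted split between two subsets of the same host. The sketch below renders a VirtualService-shaped dict in Python purely for illustration; the subset names and host are assumptions, and the real manifests were Helm charts reconciled by Argo CD.

```python
def virtual_service(host: str, blue_weight: int) -> dict:
    """Render an Istio VirtualService-style weight split as a plain dict.

    Field names mirror the Istio networking API for illustration only.
    """
    if not 0 <= blue_weight <= 100:
        raise ValueError("weight must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-split"},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [
                    # Traffic is divided between the two deployments.
                    {"destination": {"host": host, "subset": "blue"},
                     "weight": blue_weight},
                    {"destination": {"host": host, "subset": "green"},
                     "weight": 100 - blue_weight},
                ],
            }],
        },
    }

# Hypothetical service name; 90% of traffic stays on the current (blue) stack.
vs = virtual_service("upi-switch", 90)
```

Cutover then reduces to adjusting one weight, which is what makes per-phase rollback cheap.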
Phase 4: Cutover & Validation
- Progressive traffic shift using Argo Rollouts canaries
- Real-time monitoring of latencies, error rates, throughput
- DR runbook execution and failover testing
- Final cutover during low-traffic window
- Post-migration optimization and tuning
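The progressive traffic shift with automated rollback can be sketched as a loop over canary weights gated by an error-rate check. This is a simplification of what Argo Rollouts does with analysis runs; the weights, threshold, and metrics hook below are illustrative stand-ins for the Prometheus queries used in practice.

```python
def run_canary(weights, error_rate_at, threshold=0.01):
    """Walk canary traffic weights in order; abort and roll back if the
    observed error rate at any step exceeds the threshold.

    `error_rate_at` is a hypothetical metrics hook: weight -> error rate.
    Returns ("promoted", 100) or ("rolled_back", 0).
    """
    for weight in weights:
        if error_rate_at(weight) > threshold:
            # Shift all traffic back to the stable version.
            return ("rolled_back", 0)
    return ("promoted", 100)

# Healthy rollout: error rate stays well under the 1% threshold.
ok = run_canary([5, 25, 50, 100], lambda w: 0.001)
# Regression appears once the canary takes 25% of traffic.
bad = run_canary([5, 25, 50, 100], lambda w: 0.05 if w >= 25 else 0.001)
```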
Risks & Mitigations
Risk: Data inconsistency during replication
Mitigation: Implemented a dual-write pattern with reconciliation jobs; automated consistency checks before cutover
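The dual-write-with-reconciliation idea can be sketched in a few lines. The dict-backed stores and key names below are illustrative stand-ins for the on-prem and AWS database clients; the point is that the legacy store stays the source of truth and failed cloud writes are queued and replayed rather than lost.

```python
class DualWriter:
    """Write to the legacy (DC) and new (AWS) stores; queue keys whose
    cloud write failed for an asynchronous reconciliation job."""

    def __init__(self):
        self.legacy, self.cloud, self.recon_queue = {}, {}, []

    def write(self, key, value, cloud_ok=True):
        self.legacy[key] = value          # source of truth until cutover
        if cloud_ok:
            self.cloud[key] = value
        else:
            self.recon_queue.append(key)  # repair later, never drop

    def reconcile(self):
        """Replay queued keys from the legacy store into the cloud store."""
        while self.recon_queue:
            key = self.recon_queue.pop()
            self.cloud[key] = self.legacy[key]

# Hypothetical transactions; the second one's cloud write "fails".
dw = DualWriter()
dw.write("txn-1", {"amt": 100})
dw.write("txn-2", {"amt": 50}, cloud_ok=False)
dw.reconcile()
```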
Risk: Network latency increase
Mitigation: Direct Connect for low-latency connectivity; performance benchmarking at each phase
Risk: Service dependency failures
Mitigation: Circuit breakers, retries with exponential backoff, comprehensive monitoring dashboards
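The two resilience patterns named above can be sketched together. These are minimal textbook implementations, not the production ones; the failure scenario and parameters are illustrative.

```python
import time

def retry_with_backoff(fn, retries=3, base_delay=0.1, sleep=time.sleep):
    """Call fn, retrying on exception with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))

class CircuitBreaker:
    """Trip open after `max_failures` consecutive failures; while open,
    calls fail fast instead of hammering the struggling dependency."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open")
        try:
            result = fn()
            self.failures = 0  # any success resets the breaker
            return result
        except Exception:
            self.failures += 1
            raise

# A transient fault that clears on the third attempt (hypothetical).
calls = []
def succeed_on_third():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "ok"

result = retry_with_backoff(succeed_on_third, sleep=lambda s: None)
```

A production breaker would also include a half-open state that probes the dependency before closing again.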
Risk: Security compliance gaps
Mitigation: IRSA for pod-level IAM, encryption at rest and in transit, audit logging, CIS benchmarks
Risk: Rollback complexity
Mitigation: Automated rollback scripts, fine-grained traffic shifting with Argo Rollouts, tested rollback procedures
Outcomes & Impact
Key Achievements
- Successfully migrated 100% of UPI services without any customer-facing incidents
- Established GitOps practices with Argo CD for declarative infrastructure
- Implemented progressive delivery with canary deployments and automated rollbacks
- Achieved regulatory compliance with comprehensive audit trails and security controls
- Reduced infrastructure costs through right-sizing and efficient resource utilization
- Built comprehensive observability with SLO/SLA dashboards and alerting