How a Service Outage Sparked Growth: The Route 53 Routing Failover Journey

Written by Devops Traveler Tales, 22 November 2024

Today, I’d like to take a reflective approach. As a B2B SaaS infrastructure operator, every system failure carries significant responsibility, potentially affecting numerous clients. Let me walk you through a past incident involving a service outage, the challenges we faced, and how we improved as a result.
From EC2 to EKS: The Start of Cloud Modernization
BACKND's API servers, which provide our B2B game server SaaS, had long operated reliably in an EC2-based container environment.
However, two critical requirements emerged over time:
- Cost optimization through flexible resource allocation
- Enhanced observability for better system monitoring
To address these needs, we initiated a modernization process, transitioning from EC2 to Elastic Kubernetes Service (EKS).
Along the way, we also implemented structural improvements, such as replacing the AWS Application Load Balancer (ALB) with Traefik.
Once the infrastructure was prepared, we planned for a zero-downtime migration to EKS.
Since we provide a game server SaaS, even a momentary service interruption would force all of our clients to ask their gamers to tolerate a temporary halt in game services, a heavy burden for everyone involved.
Planning a Zero-Downtime Migration
To ensure a smooth transition, we adopted the following strategies:
- Operate EC2 and EKS environments simultaneously.
- Use Route 53 weighted routing to gradually shift traffic (e.g., starting with ratios like 99% to EC2 and 1% to EKS).
- Begin with services that have lower traffic to test the transition step by step.
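The gradual shift can be sketched as a Route 53 change batch that UPSERTs one weighted record per environment. The following Python is a minimal illustration, not our production code; the zone ID, record names, and endpoints are hypothetical placeholders, and the commented-out `boto3` call shows where such a batch would be submitted.

```python
# Sketch of a gradual traffic shift using Route 53 weighted records.
# All names and endpoints below are hypothetical placeholders.

def build_weighted_change_batch(record_name: str,
                                targets: dict[str, tuple[str, int]]) -> dict:
    """Build a Route 53 ChangeBatch that UPSERTs one weighted CNAME record
    per target. `targets` maps a SetIdentifier to (endpoint, weight)."""
    changes = []
    for set_id, (endpoint, weight) in targets.items():
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,          # relative weight, 0-255
                "TTL": 60,                 # short TTL so shifts propagate quickly
                "ResourceRecords": [{"Value": endpoint}],
            },
        })
    return {"Comment": "gradual EC2 -> EKS traffic shift", "Changes": changes}

# Start at 99% EC2 / 1% EKS, then raise the EKS weight step by step.
batch = build_weighted_change_batch(
    "api.example.com.",
    {"ec2": ("ec2-alb.example.com.", 99),
     "eks": ("eks-ingress.example.com.", 1)},
)
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z...", ChangeBatch=batch)
```

Raising the EKS weight is then just a matter of re-running the same batch with new numbers, which keeps each step of the migration reviewable and reversible.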
Considering the nature of gaming services, we accounted for traffic spikes during midnight reward times and moved cautiously, beyond merely analyzing weekly traffic patterns. This meticulous approach initially made the transition appear stable.
The Unexpected Service Failure and Immediate Recovery
Despite careful planning, we encountered an unexpected issue:
a sudden HTTP 503 (Service Unavailable) error. What was particularly puzzling was that the error originated from the EC2 instances, which had already been set to a 0% traffic allocation in Route 53.
We took the following actions during the incident:
- Recovery Efforts
- Because traffic was still being routed to EC2, we first restarted the EC2 instances so they could temporarily resume handling traffic and restore the service.
- Root Cause Analysis and Resolution
- Detailed monitoring revealed that the root cause was an overload on the ingress server acting as the load balancer, which caused it to restart.
- To address this, we adjusted the load balancer’s auto-scaling settings, completing the initial system improvements.
However, the fact that traffic flowed to EC2 despite being set to 0% in Route 53 raised further questions, leading to deeper investigations into Route 53’s behavior.
Digging Deeper into Route 53’s Behavior to Prevent Recurrence
The incident prompted BACKND to implement stronger contingency measures for scaling failures.
Concerns were raised about the newly introduced L7 proxy server, which had replaced the AWS load balancer, and we reviewed the additional management hop it introduced.
Two main directions emerged within the team:
- Reverting to AWS Load Balancer
- Abandon the L7 proxy and return to using the AWS load balancer.
- Leverage AWS's proven services to ensure stability.
- Enhancing the L7 Proxy
- Retain the L7 proxy while strengthening its safeguards.
- Highlight the advantages of infrastructure management flexibility and advanced capabilities.
I, Tales, supported the second approach, valuing the L7 proxy's convenience and scalability. However, addressing load balancer scaling failures proved challenging. During these discussions, a fundamental question arose.
Why did Route 53 route traffic to EC2 instances set to 0% traffic allocation?
This question became the starting point for a new round of analysis.
Identifying the Root Cause Through Route 53 Behavior Analysis
At the time of the incident, Route 53 was configured to route all traffic (100%) to EKS. However, traffic was unexpectedly routed to EC2, which had been set to 0% traffic allocation.
I, Tales, recognized that while it was crucial to devise measures to mitigate the incident's impact, understanding the root cause of this unexpected routing behavior was even more critical. This led to an in-depth investigation into how Route 53's mechanism works.
In reviewing AWS Route 53 documentation, we uncovered a critical detail:
When a weighted record group includes health checks, and some records have a weight greater than 0 while others are set to 0, health checks work as they normally do, with two exceptions:
1. Route 53 initially considers only the records with weights greater than 0, if any exist.
2. If every record with a weight greater than 0 is unhealthy, Route 53 falls back to the zero-weighted records.
This discovery not only explained the unexpected traffic routing during the incident but also provided crucial insights for designing future systems.
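The rule above can be captured in a few lines. This is a toy model of the documented selection behavior, written by me to make the incident concrete; it is not AWS code, and it deliberately ignores details such as multivalue answers and per-record health checks on the zero-weighted set.

```python
# Toy model of Route 53's documented rule for weighted records with
# health checks. Simplified; not AWS code.

def eligible_records(records: list[dict]) -> list[dict]:
    """records: [{'id': ..., 'weight': int, 'healthy': bool}, ...]
    Returns the records Route 53 would consider for DNS answers."""
    nonzero = [r for r in records if r["weight"] > 0]
    healthy_nonzero = [r for r in nonzero if r["healthy"]]
    if healthy_nonzero:
        return healthy_nonzero  # normal case: only nonzero-weighted records
    # All nonzero-weighted records failed their health checks, so
    # Route 53 falls back to the zero-weighted records.
    return [r for r in records if r["weight"] == 0]

# Our incident in miniature: EKS (weight 100) became unhealthy while
# EC2 sat at weight 0, so EC2 received the traffic.
records = [
    {"id": "eks", "weight": 100, "healthy": False},
    {"id": "ec2", "weight": 0,   "healthy": True},
]
```

Running `eligible_records` on this input returns only the EC2 record, which is exactly the "impossible" routing we observed during the outage.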
Designing a High-Availability Architecture Through Active-Passive Failover
Understanding Route 53’s behavior inspired a new architectural design. By leveraging the routing characteristics of zero-weighted records in Route 53, we developed a dual-routing structure to enable seamless failover configurations.
Key Features of Route 53 Failover Routing
Route 53's failover routing operates as follows:
- Explicitly specify Primary and Secondary routing targets.
- Under normal conditions, all traffic is routed to the Primary target.
- In case of a health check failure, traffic is automatically routed to the Secondary target.
One particularly noteworthy aspect is that this failover occurs at the DNS level. DNS-level failover offers advantages over load balancer-level mechanisms, such as:
- Handling Availability Zone failures.
- Resilience against Region-level outages.
- Broad DNS-based disaster recovery.
Enhanced High-Availability Architecture
Building on these insights, we developed a more advanced architecture that combines AWS Application Load Balancer (ALB) and Traefik routing, creating a dual-record structure to enhance reliability.
- Primary Routing Layer
- Utilizes Traefik Ingress for service delivery.
- Offers precise routing control and advanced traffic management capabilities.
- Secondary Routing Layer
- Serves as a backup with AWS ALB routing.
- Automatically handles failover in the event of an ingress server failure.
This active-passive failover structure ensures that each layer can serve as a backup for the other, significantly enhancing both system reliability and stability.
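A hedged sketch of what this active-passive pair looks like as Route 53 failover records, with Traefik as PRIMARY and the ALB as SECONDARY. The record names, endpoints, and health-check ID are hypothetical placeholders, not our actual configuration.

```python
# Sketch of an active-passive failover pair in Route 53.
# All identifiers below are hypothetical placeholders.

def build_failover_records(record_name: str, primary: str,
                           secondary: str, health_check_id: str) -> dict:
    """UPSERT a PRIMARY/SECONDARY failover pair for one record name."""
    def record(set_id, failover, endpoint, hc_id=None):
        rrset = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "Failover": failover,  # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": endpoint}],
        }
        if hc_id:
            # The primary carries the health check; when it fails,
            # Route 53 answers with the secondary record instead.
            rrset["HealthCheckId"] = hc_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {"Changes": [
        record("traefik-primary", "PRIMARY", primary, health_check_id),
        record("alb-secondary", "SECONDARY", secondary),
    ]}

batch = build_failover_records(
    "api.example.com.",
    primary="traefik-ingress.example.com.",
    secondary="alb.example.com.",
    health_check_id="hc-1234",  # hypothetical health-check ID
)
```

Because the failover decision happens at the DNS level, the secondary path stays completely independent of the Traefik ingress it protects, which is what lets each layer back up the other.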
Epilogue: Lessons and Insights
#1
The modernization of BACKND's infrastructure has resulted in significant improvements. For example, during a recent game launch by one of our clients that caused a massive traffic surge, other clients experienced minimal disruption—a testament to the system's robustness.
#2
Although we encountered an overload issue with Traefik ingress during the transition, subsequent zero-downtime deployments were executed successfully and with stability. For teams planning zero-downtime infrastructure migrations, we highly recommend leveraging the weighted routing capabilities of Route 53.
#3
Reflecting on this incident, I, Tales, believe it was an opportunity to design a stronger, more resilient architecture. While the service outage was a painful experience 😢, the insights gained from it provided a solid foundation for elevating the system’s reliability to the next level.

© 2024 AFI, INC. All Rights Reserved. All Pictures cannot be copied without permission.