Journey to Optimize Network Usage (Part 2)

Created with Canva AI Image Generator
Written by Devops Traveler Tales, 05 March 2025

In Part 1, we explored when network costs occur and what factors contribute to their high expenses. However, at BACKND, optimizing network usage isn't just about reducing internal infrastructure costs. So, what benefits do customers gain when BACKND optimizes its network usage?


How SaaS Companies’ Network Optimization Benefits Customers

When a SaaS provider optimizes network usage, its customers directly benefit from cost savings.

Photo by Karolina Grabowska on Unsplash

Many SaaS providers pass the network costs incurred during service processing directly on to their customers. These costs are often embedded in API call pricing, and for services that charge based on traffic usage, network optimization translates directly into lower costs for customers.

Beyond cost savings, performance and stability improvements are also significant advantages. Reducing network traffic size leads to lower latency, resulting in faster response times. Additionally, lower traffic volume means a lower chance of throttling during traffic spikes, ultimately enhancing service stability.

Enhanced Security Benefits

As explained in Part 1, network costs increase when traffic exits the cloud. Optimizing traffic to stay in the cloud minimizes external exposure of cloud traffic, reducing both costs and security risks.

More importantly, these optimizations enhance the reliability of a SaaS provider's service, allowing customers to continue using it with confidence for the long term.

BACKND’s Network Optimization Cases

Today's story is about how BACKND approached network cost optimization on a case-by-case basis. Our efforts can be broken down into four key areas:

  1. Optimizing the Game Resource Server’s Network Usage
  2. Authentication Server and DNS Throttling Issues
  3. Optimizing NAT Gateway Network Usage
  4. Eliminating Over-Provisioned Servers

Now, let’s explore each case and examine how we effectively reduced network costs.


Optimizing the Game Resource Server’s Network Usage

In games and mobile applications, resource servers are frequently utilized. While playing a game, you may have encountered an update where the initial screen displays a message like "Downloading resources...", requiring the necessary files to be downloaded before proceeding with the game.

Typically, when a client application is modified, it must go through an app store review process before users can download the update. However, by leveraging a game resource server, developers can bypass this process and update specific files through a simple download, eliminating the need to modify the app itself. This type of system is commonly referred to as a game resource server or a resource patch server, primarily used to deliver game assets such as images and numerical data.
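
As a rough illustration of this pattern, a resource patch client can fetch a manifest of file hashes and download only the assets that changed. This is a minimal sketch; the manifest URL, schema, and paths are hypothetical, not BACKND's actual implementation.

```python
import hashlib
import json
import urllib.request
from pathlib import Path

# Hypothetical endpoint and layout, for illustration only.
MANIFEST_URL = "https://patch.example.com/manifest.json"
LOCAL_DIR = Path("resources")

def file_sha256(path: Path) -> str:
    """Hash a local file so it can be compared against the manifest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def patch_resources() -> None:
    # The (assumed) manifest maps each asset name to its hash and download URL.
    with urllib.request.urlopen(MANIFEST_URL) as resp:
        manifest = json.load(resp)

    for name, entry in manifest.items():
        local_path = LOCAL_DIR / name
        # Skip assets that are already present and unchanged, so clients
        # never re-download files that did not change between patches.
        if local_path.exists() and file_sha256(local_path) == entry["sha256"]:
            continue
        with urllib.request.urlopen(entry["url"]) as download:
            local_path.parent.mkdir(parents=True, exist_ok=True)
            local_path.write_bytes(download.read())

if __name__ == "__main__":
    patch_resources()
```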

In this section, we will explore how network optimization was applied to improve the efficiency of game resource servers.

Initial Challenges We Faced in Implementation

BACKND’s resource server was initially designed to deliver only numerical game data, leading us to assume that network usage would remain low. At the time, we prioritized security considerations and opted not to use a CDN, instead delivering data directly from the API server.

However, several unexpected issues arose:

  1. Text data can still be large in volume. Certain game elements, such as NPC dialogues and in-game announcements, are text-based but can accumulate into substantial data sizes over time.
  2. The resource server generates burst workloads that cause sudden spikes in network traffic. In many cases, all resources must be downloaded at the start of a game session, especially after maintenance periods or during daily midnight (12 a.m.) check-in events, when large traffic surges occur.

As a result, the API server experienced a significant increase in load, leading to higher-than-expected network costs.

Network Optimization Approach

To address this issue, we undertook improvements to optimize the network architecture.

  1. Implemented a CDN with Signed URLs
    • Previously, the API server directly delivered resource data. However, we transitioned to a CDN-based approach using CloudFront.
    • This allowed clients to retrieve resources directly from CloudFront instead of passing through the API server.
    • Additionally, we implemented Signed URLs to sign and authenticate each request, preventing unauthorized access to other customers’ data and enhancing security (a sketch follows this list).

  2. Separated and Optimized the Resource Server API

    • The resource server does not impose a heavy CPU load, but it does generate high bursts of network traffic.
    • To mitigate this, we separated the resource server’s API and allocated dedicated infrastructure resources optimized for handling these traffic spikes.
    • This adjustment distributed network load more efficiently, reducing traffic bottlenecks on specific servers and improving the overall stability of the resource server.
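
Returning to the first point, here is a minimal sketch of generating a CloudFront Signed URL in Python with botocore. The key pair ID, private key file, resource URL, and expiry window are placeholders; BACKND's actual signing policy is not shown in this post.

```python
from datetime import datetime, timedelta, timezone

import rsa  # pip install rsa
from botocore.signers import CloudFrontSigner  # pip install botocore

# Placeholders: use your own CloudFront key pair and distribution domain.
KEY_PAIR_ID = "KXXXXXXXXXXXXX"
PRIVATE_KEY_PATH = "cloudfront_private_key.pem"
RESOURCE_URL = "https://dxxxxxxxxxxxx.cloudfront.net/resources/game-data.json"

def rsa_signer(message: bytes) -> bytes:
    """Sign the CloudFront policy with the distribution's RSA private key."""
    with open(PRIVATE_KEY_PATH, "rb") as f:
        private_key = rsa.PrivateKey.load_pkcs1(f.read())
    return rsa.sign(message, private_key, "SHA-1")  # CloudFront expects SHA-1

signer = CloudFrontSigner(KEY_PAIR_ID, rsa_signer)

# Each URL is valid only for a short window, so a leaked link cannot be
# reused later to pull resources belonging to another customer.
signed_url = signer.generate_presigned_url(
    RESOURCE_URL,
    date_less_than=datetime.now(timezone.utc) + timedelta(minutes=10),
)
print(signed_url)
```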

Through these optimization efforts, we successfully reduced the load on the API server and lowered network costs. Additionally, faster resource downloads improved the overall player experience. Most importantly, we enhanced traffic management efficiency while maintaining security, creating a scalable network architecture that can support future growth.

Authentication Server and DNS Throttling Issues

The burst workloads caused by traffic spikes in the game did not affect only the resource server; they also had an impact on the authentication server.

As mentioned above, a large number of users try to log in at the same time right after maintenance or at midnight when the daily check-in events are reset. This caused a spike in the load on the authentication server.

Challenges Caused by Burst Workloads in Game Servers (Created with Canva AI Image Generator)

Challenges with Authentication Traffic

BACKND’s authentication server handles user authentication and request validation:

  • During login, it issues an Access Token to authenticate users.
  • When a client accesses a specific resource, the server verifies the request before granting access.

During this process, clients need to locate the authentication server via DNS. During burst periods, every server in the cluster sends requests to the authentication server simultaneously, leading to network throttling.

Previously, we scaled up both the authentication and DNS servers to distribute the traffic load, which provided a temporary fix. However, this approach resulted in higher costs and was not a sustainable long-term solution. Ultimately, a more effective, long-term strategy was required.

Solving DNS Throttling: Implementing NodeLocal DNSCache

Scaling up alone could not eliminate the throttling, so additional optimization was required.

First, we implemented NodeLocal DNSCache, as recommended in the Kubernetes documentation, to prevent DNS query throttling.

In our experience operating BACKND, Kubernetes Service resources use ClusterIPs, whose addresses do not change. Based on this, we determined that DNS caching would be beneficial: it would cut the number of unnecessary DNS requests and reduce the load generated by lookups for the authentication server.
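
NodeLocal DNSCache itself ships as a Kubernetes DaemonSet rather than application code, but the reasoning can be illustrated with a minimal application-level cache: because a Service's ClusterIP stays stable, a resolved address can safely be reused instead of querying DNS on every request. The service name below is hypothetical.

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=256)
def resolve_service(hostname: str) -> str:
    """Resolve once and reuse the result: a ClusterIP is stable for the
    Service's lifetime, so repeated DNS queries are unnecessary load."""
    return socket.gethostbyname(hostname)

# Hypothetical in-cluster service name, for illustration only.
auth_ip = resolve_service("auth-server.backend.svc.cluster.local")
```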

However, even after applying DNS caching, network throttling still persisted.

Solution: Optimizing the Authentication Server Structure

BACKND's authentication server validates requests by inspecting the entire request body. As a result, when clients send requests with large body data, the full payload must also travel to the authentication server, significantly increasing network bandwidth consumption.

To address this issue, we explored multiple solutions and ultimately decided to introduce a token server as a sidecar.

Implementation of Authentication Sidecar and Its Benefits

Introducing a token server as a sidecar brought several key advantages:

  1. Independent authentication code management & Rust integration
    • Authentication logic could be managed in a separate repository, reducing dependency on the resource server.
    • Rust was adopted to implement a faster and more secure authentication system.
  2. OS kernel-level processing eliminates network bandwidth usage
    • In the previous setup, authentication requests required communication between physically separate EC2 instances.
    • With the sidecar approach, authentication traffic travels over the local loopback interface, handled entirely within the OS kernel, eliminating usage of the physical network.
  3. Minimal code modification
    • Since existing requests were already routed to the authentication server, redirecting them to the sidecar required minimal changes (see the sketch after this list).
    • This allowed for seamless adoption without modifying the entire service architecture.
  4. Increased modularity and scalability
    • Although the sidecar adds some HTTP communication overhead, the modular authentication logic made maintenance easier and ensured better scalability for future expansion.
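
As a concrete sketch of the minimal-change point, a request that previously traveled to a remote authentication server can instead be validated over loopback. The endpoint names and response schema below are assumptions for illustration; BACKND's actual sidecar is written in Rust.

```python
import json
import urllib.request

# Hypothetical endpoints, for illustration only.
# Before: validation crossed the network to a separate EC2 instance.
#   REMOTE_AUTH = "http://auth.internal.example.com/validate"
# After: the sidecar listens on loopback, so the request never leaves the
# host; the kernel short-circuits 127.0.0.1 traffic past the physical NIC.
SIDECAR_AUTH = "http://127.0.0.1:8080/validate"

def validate_token(access_token: str) -> bool:
    """Ask the local sidecar whether an Access Token is valid."""
    req = urllib.request.Request(
        SIDECAR_AUTH,
        data=json.dumps({"token": access_token}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("valid", False)
```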

Results and Impact

  • Reduction in Network Usage and Cost Savings
    • Since authentication requests were processed at the OS kernel level, network usage on EC2 instances dropped significantly.
    • This reduction allowed us to scale down the number of over-provisioned servers, optimizing resource allocation.
  • Decrease in Network Latency
    • By eliminating physical network communication and handling authentication directly within the kernel layer, we reduced network latency by 5–10 ms.
  • Performance Improvement in P50 Response Time
    • The P50 (median) response time improved consistently by 5–10 ms.
    • The performance gains were especially noticeable in latency-sensitive requests.

As a result of these optimizations, we successfully reduced authentication server network traffic, mitigated DNS throttling issues, and improved overall response speed.


This transformation made BACKND’s authentication system faster, more stable, and highly scalable, ensuring better performance and reliability for our users.

BACKND's authentication system became faster, more stable, and highly scalable. (Created with Canva AI Image Generator)

Optimizing NAT Gateway Network Usage

One day, we noticed unexpectedly high costs associated with our NAT gateway usage.

NAT gateways bridge internal and external networks, primarily incurring costs when data flows from the internal network to the internet. Initially, since we were using Kubernetes, we assumed that the traffic exiting the cluster was simply normal outbound traffic.

However, we needed to determine whether this was indeed legitimate traffic or if unnecessary costs were being incurred. If misconfigurations were causing excessive and avoidable NAT gateway traffic, it was crucial to identify and resolve the issue immediately.

NAT Gateway Traffic Analysis

To accurately measure NAT gateway traffic, we enabled VPC Flow Logs and performed queries using CloudWatch Logs Insights.

  1. Investigating Outbound Traffic from the VPC
    • Filtered traffic where srcAddr (source IP) belonged to the VPC's internal IP range.
    • Sorted results by bytesTransferred (amount of data transferred) in descending order.
    • Examined dstAddr (destination IPs). A runnable query sketch follows this list.

  2. Identifying Unnecessary Public Network Usage
    • The logs revealed that the destination IPs belonged to AWS services.
    • This meant that even when AWS internal systems transferred data to other AWS services, the traffic still went through the NAT gateway and was routed via the public network, resulting in unnecessary costs.
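
As a sketch of this analysis, the query can be run against VPC Flow Logs with CloudWatch Logs Insights via boto3. The log group name and VPC CIDR are placeholders; srcAddr, dstAddr, and bytes are the default Flow Logs field names.

```python
import time

import boto3  # pip install boto3

logs = boto3.client("logs")

# Placeholder log group and VPC CIDR, for illustration only.
LOG_GROUP = "/vpc/flow-logs"
QUERY = """
filter isIpv4InSubnet(srcAddr, "10.0.0.0/16")
| stats sum(bytes) as bytesTransferred by dstAddr
| sort bytesTransferred desc
| limit 20
"""

# Query the last hour of flow logs for the top outbound destinations.
query_id = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=QUERY,
)["queryId"]

# Logs Insights queries run asynchronously, so poll until completion.
result = logs.get_query_results(queryId=query_id)
while result["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    result = logs.get_query_results(queryId=query_id)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```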

Since all AWS services operate within the AWS internal network, they should have communicated directly via the internal VPC network. However, due to our existing configuration, AWS service-to-service communication was going through the internet, leading to avoidable NAT gateway charges.

AWS provides a feature called PrivateLink, which enables direct communication with AWS services within the internal network.

By utilizing PrivateLink, AWS services are assigned dedicated internal IPs, allowing communication without routing through the public network.
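
As a sketch of the setup step, an interface VPC endpoint (the PrivateLink mechanism) can be created with boto3. The region, VPC, subnet, and security group IDs are placeholders, and the service name is just an example.

```python
import boto3  # pip install boto3

ec2 = boto3.client("ec2", region_name="ap-northeast-2")  # placeholder region

# Placeholder IDs, for illustration only.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",  # interface endpoints are powered by PrivateLink
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.ap-northeast-2.ecr.api",  # example AWS service
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    # With private DNS, the service's public hostname resolves to private IPs,
    # so existing SDK calls stay on the VPC network with no code changes.
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```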

As a result:

  1. Security was enhanced by keeping traffic within the AWS internal network, avoiding exposure to the public internet.
  2. After transitioning AWS service traffic to PrivateLink, NAT gateway costs were reduced to one-twentieth of the original amount.
  3. By eliminating unnecessary internet traffic, data transfer speeds between AWS services improved.

Eliminating Over-Provisioned Servers

After undergoing the previous optimization processes, BACKND successfully reduced its network usage significantly. As a result, not only did this lead to lower network costs, but it also allowed us to scale down the number of over-provisioned instances.

Which Servers Were Scaled Down First? Authentication Server and Resource Patch Server

  • Authentication Server
    • Previously, despite low CPU and memory usage, the authentication server was over-provisioned due to high network traffic.
    • By handling authentication traffic through a sidecar approach, network load was reduced.
    • As a result, we were able to allocate an appropriate number of instances.
  • Resource Server
    • By leveraging CloudFront, we improved the setup so that the resource server no longer serves data directly.
    • Reduced network traffic, leading to a decrease in the required number of instances.

Additional Optimization: Reducing DNS Cache Server Load

  • Optimized DNS cache servers within the cluster.
    • Implemented NodeLocal DNSCache to reduce the number of DNS queries.
    • As a result, DNS cache server load decreased, leading to a reduction in the number of instances.

As a result:

  1. Infrastructure cost savings by eliminating unnecessary instances.
  2. Enhanced overall service efficiency through resource optimization.
  3. Improved service stability by reducing network overhead.

This demonstrates that network optimization is not just about cutting traffic costs—it is a crucial process for ensuring an efficient and scalable infrastructure.

Final Thoughts: The Broader Impact of Network Optimization

Through this network optimization initiative, we have not only reduced costs but also established a stronger foundation for service stability and operational efficiency.

By alleviating the load on API servers, refining the network architecture of authentication and resource servers, and eliminating unnecessary traffic, we have gone beyond mere cost reduction. Instead, we have built a more scalable and resilient infrastructure.

More importantly, network optimization is not just a technical effort to cut cloud costs—it is a fundamental process to enhance service performance and user experience. Maintaining a fast and stable network infrastructure is directly tied to delivering a service that customers can trust.

We at BACKND remain committed to continuous optimization, ensuring that game developers can operate their games in a more efficient and stable environment.


Epilogue

However, there are still unresolved challenges ahead.

In the past, we encountered an issue where ElastiCache's network bandwidth hit throttling limits, causing service failures. At the time, rather than attempting right-sizing (proper capacity allocation), we opted for over-provisioning as an immediate fix to mitigate the crisis.

Since then, various optimizations have helped alleviate excessive network bandwidth usage. However, our three-year Reserved Instance (RI) contract remains a lingering reminder of a key lesson learned: resolving network bandwidth issues by simply scaling up instance sizes can be extremely costly.

Although this post does not cover it, we have also faced challenges with rapidly increasing network usage and high costs in our chat and logging systems. To address these issues, we have been implementing various optimizations such as data compression and aggregation.

We hope that our experiences provide insights for others facing similar challenges.
This concludes our story.

References

[1] How do I find the top contributors to NAT gateway traffic in my Amazon VPC?


© 2025 AFI, INC. All Rights Reserved. All Pictures cannot be copied without permission.
