DNS Failover and Load Balancing


## DNS as a Traffic Management Layer

DNS is often described as "just a phone book," but modern DNS infrastructure is a sophisticated traffic management layer. Because every client must resolve a hostname before making a connection, the authoritative DNS server has an opportunity to influence where that traffic goes: to the fastest server, the geographically closest endpoint, or a healthy failover node when the primary is down.

This works because a DNS response does not have to be a single, fixed IP. A nameserver can return different answers based on who is asking, what time it is, what health checks report, and how it wants to balance load across a pool of servers.

## Round-Robin DNS: The Simplest Approach

Round-robin DNS returns multiple A records in rotating order. Each successive query gets the same records but in a different sequence. Clients typically connect to the first IP in the response.

```
; Zone file: round-robin across three servers
www  60  IN  A  203.0.113.1
www  60  IN  A  203.0.113.2
www  60  IN  A  203.0.113.3
```

The short TTL (Time To Live) of 60 seconds means clients re-resolve often and rotate through servers over time. Round-robin is zero-configuration load balancing but has a critical weakness: it does not detect failures. If `203.0.113.2` goes down, the nameserver keeps returning it, and clients that connect to it get errors.

Round-robin also does not guarantee even distribution. Resolvers and clients cache responses; heavy-traffic clients may pin to one IP for the duration of their cache TTL.

## Health-Check-Based Failover

Managed DNS providers (Route 53, Cloudflare, NS1, Dyn) monitor your endpoints and automatically remove unhealthy IPs from DNS responses. The workflow:

1. Define a health check: HTTP GET to `https://203.0.113.1/health`, expect HTTP 200 within 2 seconds
2. Associate the health check with the DNS record
3. The provider's global health-check network probes your endpoint every 10–30 seconds
4. If the endpoint fails N consecutive checks, the provider removes its IP from DNS responses
5. When the endpoint recovers, the IP is automatically re-added

For active-passive failover, the configuration is:

- Primary record: `www.example.com → 203.0.113.1` (with health check)
- Failover record: `www.example.com → 203.0.113.2` (activated when primary is unhealthy)

The TTL on failover-enabled records should be very low (10–60 seconds) so clients re-resolve quickly after a failover event. With a 300-second TTL, clients that cached the unhealthy IP before the failover fires will continue hitting the dead endpoint for up to 5 minutes.

## Weighted Routing

Weighted routing assigns a numeric weight to each record. Responses are returned in proportion to the weights:

```
; Route 80% of traffic to new server, 20% to old
; (weights are a provider-side setting, not standard zone-file syntax)
www  30  IN  A  203.0.113.10  ; weight 80
www  30  IN  A  203.0.113.11  ; weight 20
```

In Route 53, weights are set as integers (0–255). A record with weight 0 receives no traffic (useful for maintenance), unless every record in the group has weight 0, in which case traffic is spread across all of them equally. Weighted routing is excellent for canary deployments: send 1% of traffic to the new version, monitor error rates, then ramp up.

## Latency-Based Routing

Latency-based routing (Route 53's term) returns the record corresponding to the AWS region with the lowest measured latency to the requesting resolver's IP. Amazon maintains a continuously updated latency database mapping IP ranges to regional latency estimates.

For a global application hosted in us-east-1, eu-west-1, and ap-northeast-1:

- A user in Tokyo gets the ap-northeast-1 endpoint
- A user in Frankfurt gets the eu-west-1 endpoint
- A user in Virginia gets the us-east-1 endpoint

This is distinct from GeoDNS location routing (which routes based on geographic IP lookup) and from Anycast DNS (which routes at the BGP layer). Latency-based routing uses measured round-trip time data, which often produces better decisions than pure geography.
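
As a rough illustration of the latency-based lookup, the sketch below picks the region with the lowest estimated latency for a resolver's IP. The table `LATENCY_MS`, its numbers, the resolver networks, and the function `pick_region` are all hypothetical; a real provider's measurement data and matching logic are not public in this form and will differ.

```python
import ipaddress

# Hypothetical latency estimates in milliseconds, keyed by resolver network.
# The networks and numbers are made up for illustration only.
LATENCY_MS = {
    "198.51.100.0/24": {"us-east-1": 12, "eu-west-1": 85, "ap-northeast-1": 160},
    "192.0.2.0/24": {"us-east-1": 90, "eu-west-1": 9, "ap-northeast-1": 230},
}


def pick_region(resolver_ip, table=LATENCY_MS, default="us-east-1"):
    """Return the region with the lowest estimated latency for this resolver."""
    addr = ipaddress.ip_address(resolver_ip)
    for cidr, estimates in table.items():
        if addr in ipaddress.ip_network(cidr):
            # min() over the region names, ordered by their latency estimate
            return min(estimates, key=estimates.get)
    # No estimate for this resolver: fall back to a fixed region (an assumption)
    return default


print(pick_region("192.0.2.53"))    # eu-west-1
print(pick_region("198.51.100.7"))  # us-east-1
```

Note that the decision is keyed on the resolver's address, not the end user's, which is why a user behind a distant resolver can be routed to a less suitable region.
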
See GeoDNS: Location-Based DNS Routing for geographic routing approaches.

## TTL Strategy for DNS Failover

TTL is the single most important tuning parameter for DNS-based failover. It controls how long resolvers and clients cache a DNS answer before re-querying.

| Scenario | Recommended TTL |
|---|---|
| Critical production record with health-check failover | 10–60 seconds |
| Standard production record without failover | 300–600 seconds |
| Static content / CDN endpoint (rarely changes) | 3600–86400 seconds |
| Pre-planned maintenance window | Lower 24h before, restore after |

The trade-off is direct: lower TTLs increase resolver query load (more queries per second to your authoritative servers) and slightly increase user-perceived DNS latency (more cache misses). Higher TTLs reduce query load but slow failover response.

## DNS Load Balancing vs. Application Load Balancing

DNS load balancing is a coarse tool. Because of caching, it cannot distribute load as evenly as a dedicated load balancer (such as AWS ALB or NLB) operating at the request or connection level. DNS load balancing works per client: once a client has an IP, all its connections go there until the cached answer expires.

For high-precision load distribution, use a cloud load balancer (AWS ALB, GCP Load Balancer, Cloudflare Load Balancer) as the target of your DNS record. DNS resolves to the load balancer's address (an anycast IP with some providers); the load balancer then distributes individual requests across a backend pool with millisecond-level health awareness. This combines DNS-layer global routing with application-layer precision.

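
To make the coarseness concrete, the following toy model compares the two approaches within a single TTL window. It is a sketch under simplifying assumptions: the client request counts in `CLIENT_REQUESTS` are invented, each client is assumed to resolve once and stick to the first IP, and the function names are hypothetical.

```python
from itertools import cycle

BACKENDS = ["203.0.113.1", "203.0.113.2", "203.0.113.3"]

# Hypothetical requests sent by each client during one TTL window.
# One heavy client dominates, as often happens behind a shared resolver.
CLIENT_REQUESTS = [1000, 10, 10, 10, 10, 10]


def dns_level(clients, backends):
    """Round-robin DNS: each client is pinned to one IP for the whole window."""
    load = dict.fromkeys(backends, 0)
    rotation = cycle(backends)
    for requests in clients:
        load[next(rotation)] += requests
    return load


def request_level(clients, backends):
    """Load balancer: every individual request is assigned round-robin."""
    load = dict.fromkeys(backends, 0)
    rotation = cycle(backends)
    for requests in clients:
        for _ in range(requests):
            load[next(rotation)] += 1
    return load


print(dns_level(CLIENT_REQUESTS, BACKENDS))
# {'203.0.113.1': 1010, '203.0.113.2': 20, '203.0.113.3': 20}
print(request_level(CLIENT_REQUESTS, BACKENDS))
# {'203.0.113.1': 350, '203.0.113.2': 350, '203.0.113.3': 350}
```

In this model the DNS rotation is perfectly fair per client, yet one backend still carries almost all the load, because the unit being balanced is the client rather than the request.
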
## Combining Multiple Routing Policies

Real-world configurations combine multiple policies:

```
Geographic routing:
  → US users:   weighted round-robin across us-east-1 + us-west-2 (active-active)
  → EU users:   eu-west-1 (active) + us-east-1 (failover)
  → APAC users: ap-northeast-1 (active) + ap-southeast-1 (failover)

Each regional endpoint:
  → Health check on primary
  → Automatic failover to secondary if health check fails
  → 30-second TTL on all records
```

This architecture provides both global traffic distribution and regional fault tolerance. Modern managed DNS providers expose these policies through their APIs, enabling infrastructure-as-code approaches with Terraform or Pulumi.

For the anycast infrastructure underlying these routing decisions, see Anycast DNS: How Global DNS Networks Work. For geographic routing rules that complement health-check failover, see GeoDNS: Location-Based DNS Routing.
