DNS Failover and Load Balancing
## DNS as a Traffic Management Layer
DNS is often described as "just a phone book," but modern DNS infrastructure is a sophisticated traffic management layer. Because every client must resolve a hostname before making a connection, the authoritative DNS server has an opportunity to influence where that traffic goes — directing it to the fastest server, the geographically closest endpoint, or a healthy failover node when the primary is down.
This works because DNS responses are not required to return a single IP. A nameserver can return different answers based on who is asking, what time it is, what health checks report, and how it wants to balance load across a pool of servers.
## Round-Robin DNS: The Simplest Approach
Round-robin DNS returns multiple A records in rotating order. Each successive query gets the same records but in a different sequence. Clients typically connect to the first IP in the response.
```
; Zone file: round-robin across three servers
www 60 IN A 203.0.113.1
www 60 IN A 203.0.113.2
www 60 IN A 203.0.113.3
```
The short TTL (Time To Live) of 60 seconds ensures clients rotate through servers over time. Round-robin is zero-configuration load balancing but has a critical weakness: it does not detect failures. If `203.0.113.2` goes down, queries still land there and users get errors.
Round-robin also does not guarantee even distribution. Resolvers and clients cache responses; heavy-traffic clients may pin to one IP for the duration of their cache TTL.
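The rotation and its blind spot can be sketched with a toy nameserver. This is an illustrative model, not any real server's implementation: the class name and record values are hypothetical.

```python
from collections import deque

class RoundRobinNameserver:
    """Toy authoritative server: returns the full record set,
    rotated one position per query. Illustrative sketch only."""

    def __init__(self, records):
        self._records = deque(records)

    def query(self):
        answer = list(self._records)
        self._records.rotate(-1)  # next query starts at the next IP
        return answer

ns = RoundRobinNameserver(["203.0.113.1", "203.0.113.2", "203.0.113.3"])
print(ns.query()[0])  # 203.0.113.1
print(ns.query()[0])  # 203.0.113.2
# A dead server is still returned -- round-robin has no health awareness:
print("203.0.113.2" in ns.query())  # True
```

Note that nothing in the rotation logic consults endpoint health: a downed IP keeps appearing in answers until an operator removes the record.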
## Health-Check-Based Failover
Managed DNS hosting providers (Route 53, Cloudflare, NS1, Dyn) monitor your endpoints and automatically remove unhealthy IPs from DNS responses.
The workflow:
1. Define a health check: HTTP GET to `https://203.0.113.1/health`, expect HTTP 200 within 2 seconds
2. Associate the health check with the DNS record
3. The provider's global health-check network probes your endpoint every 10–30 seconds
4. If the endpoint fails N consecutive checks, the provider removes its IP from DNS responses
5. When the endpoint recovers, the IP is automatically re-added
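The consecutive-failure logic in steps 4–5 can be sketched as follows. This is a minimal model of provider-side health tracking; the class name, method names, and threshold are illustrative, not any provider's API.

```python
class HealthChecker:
    """Sketch: an endpoint leaves DNS rotation after N consecutive
    failed probes and re-enters on the first successful probe."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = {}

    def record_probe(self, ip, healthy):
        if healthy:
            self.consecutive_failures[ip] = 0  # recovery resets the count
        else:
            self.consecutive_failures[ip] = (
                self.consecutive_failures.get(ip, 0) + 1
            )

    def in_rotation(self, ip):
        return self.consecutive_failures.get(ip, 0) < self.failure_threshold

checker = HealthChecker(failure_threshold=3)
for _ in range(3):
    checker.record_probe("203.0.113.2", healthy=False)
print(checker.in_rotation("203.0.113.2"))  # False: removed after 3 failures
checker.record_probe("203.0.113.2", healthy=True)
print(checker.in_rotation("203.0.113.2"))  # True: re-added on recovery
```

Requiring N consecutive failures (rather than one) avoids flapping the record out of DNS on a single dropped probe.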
For active-passive failover, the configuration is:
- Primary record: `www.example.com → 203.0.113.1` (with health check)
- Failover record: `www.example.com → 203.0.113.2` (activated when primary is unhealthy)
The TTL on failover-enabled records should be very low (10–60 seconds) so clients re-resolve quickly after a failover event. With a 300-second TTL, clients that cached the unhealthy IP before the failover fires will continue hitting the dead endpoint for up to 5 minutes.
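The stale-cache window can be made concrete with a sketch of a resolver cache entry. The function and values below are hypothetical, for illustration only.

```python
def answer(now, cached_at, cached_ip, ttl, live_ip):
    """Return the IP a client uses at time `now` (seconds), given a
    resolver cache entry stored at `cached_at`. Illustrative sketch."""
    if now - cached_at < ttl:
        return cached_ip   # cache hit: may be a dead endpoint
    return live_ip         # cache expired: re-resolve, get failover IP

# Primary failed at t=0; the failover record now points at 203.0.113.2.
# A client that cached the primary just before the failure:
print(answer(now=120, cached_at=0, cached_ip="203.0.113.1", ttl=300,
             live_ip="203.0.113.2"))  # 203.0.113.1 -- still hitting dead IP
print(answer(now=301, cached_at=0, cached_ip="203.0.113.1", ttl=300,
             live_ip="203.0.113.2"))  # 203.0.113.2 -- after TTL expiry
```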
## Weighted Routing
Weighted routing assigns a numeric weight to each record. Responses are returned in proportion to weights:
```
; Route 80% of traffic to new server, 20% to old
www 30 IN A 203.0.113.10 ; weight 80
www 30 IN A 203.0.113.11 ; weight 20
```
In Route 53, weights are set as integers (0–255). A record with weight 0 receives no traffic (useful for maintenance). Weighted routing is excellent for canary deployments: send 1% of traffic to the new version, monitor error rates, then ramp up.
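Weighted selection can be sketched as a weight-proportional random pick. This is an illustrative model, not any provider's selection algorithm; the pool values are hypothetical.

```python
import random

def weighted_pick(records, rng=random):
    """Return one IP with probability proportional to its weight.
    Records with weight 0 are skipped entirely (maintenance mode)."""
    eligible = [(ip, w) for ip, w in records if w > 0]
    ips, weights = zip(*eligible)
    return rng.choices(ips, weights=weights, k=1)[0]

pool = [("203.0.113.10", 80),   # new server: ~80% of answers
        ("203.0.113.11", 20),   # old server: ~20% of answers
        ("203.0.113.12", 0)]    # weight 0: never returned
picks = [weighted_pick(pool) for _ in range(1000)]
share = picks.count("203.0.113.10") / len(picks)  # roughly 0.8
```

Because resolvers cache answers, the realized traffic split is only approximately proportional to the weights, which is why canary ramps should be monitored rather than assumed exact.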
## Latency-Based Routing
Latency-based routing (Route 53's term) returns the record corresponding to the AWS region with the lowest measured latency to the requesting resolver's IP. Amazon maintains a continuously updated latency database mapping IP ranges to regional latency estimates.
For a global application hosted in us-east-1, eu-west-1, and ap-northeast-1:
- A user in Tokyo gets the ap-northeast-1 endpoint
- A user in Frankfurt gets the eu-west-1 endpoint
- A user in Virginia gets the us-east-1 endpoint
This is distinct from GeoDNS (which routes based on geographic IP lookup) and from Anycast DNS (which routes at the BGP layer). Latency-based routing uses measured round-trip time data, which often produces better decisions than pure geography. See GeoDNS: Location-Based DNS Routing for geographic routing approaches.
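The selection logic reduces to a minimum over per-region latency estimates. The latency table and resolver names below are entirely hypothetical; real providers build this mapping from continuous measurement.

```python
# Hypothetical latency estimates (ms) from each resolver network to each region.
LATENCY_MS = {
    "tokyo-isp":     {"us-east-1": 160, "eu-west-1": 230, "ap-northeast-1": 8},
    "frankfurt-isp": {"us-east-1": 95,  "eu-west-1": 12,  "ap-northeast-1": 240},
    "virginia-isp":  {"us-east-1": 6,   "eu-west-1": 80,  "ap-northeast-1": 170},
}

ENDPOINTS = {"us-east-1": "203.0.113.20",
             "eu-west-1": "203.0.113.21",
             "ap-northeast-1": "203.0.113.22"}

def latency_route(resolver_network):
    """Return the endpoint in the region with the lowest estimated latency."""
    latencies = LATENCY_MS[resolver_network]
    region = min(latencies, key=latencies.get)
    return ENDPOINTS[region]

print(latency_route("tokyo-isp"))  # 203.0.113.22 (ap-northeast-1)
```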
## TTL Strategy for DNS Failover
TTL (Time To Live) is the single most important tuning parameter for DNS-based failover. It controls how long resolver caches (and clients) hold a DNS answer before re-querying.
| Scenario | Recommended TTL |
|---|---|
| Critical production record with health-check failover | 10–60 seconds |
| Standard production record without failover | 300–600 seconds |
| Static content / CDN endpoint (rarely changes) | 3600–86400 seconds |
| Pre-planned maintenance window | Lower 24h before, restore after |
The trade-off is direct: lower TTLs increase resolver query load (more queries per second to your authoritative servers) and slightly increase user-perceived DNS latency (more cache misses). Higher TTLs reduce query load but slow failover response.
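The query-load side of the trade-off can be estimated with back-of-envelope arithmetic: in steady state, each caching resolver re-queries roughly once per TTL. This is a rough model that ignores resolver prefetching and negative caching; the resolver count is hypothetical.

```python
def authoritative_qps(active_resolvers, ttl_seconds):
    """Rough steady-state query rate to the authoritative servers:
    each resolver re-queries about once per TTL interval."""
    return active_resolvers / ttl_seconds

# 100,000 resolvers holding the record in cache:
print(authoritative_qps(100_000, 60))   # ~1667 qps at a 60 s TTL
print(authoritative_qps(100_000, 600))  # ~167 qps at a 600 s TTL
```

Dropping the TTL from 600 to 60 seconds multiplies authoritative query load by roughly 10x, which is the price paid for a one-minute failover window.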
## DNS Load Balancing vs. Application Load Balancing
DNS load balancing is a coarse tool. Because of caching, it cannot distribute load as evenly as an application load balancer (ALB/NLB) operating at the connection or request level. DNS load balancing works at the session level — once a client has an IP, all its connections go there until the TTL expires.
For high-precision load distribution, use a cloud load balancer (AWS ALB, GCP Load Balancer, Cloudflare Load Balancer) as the target of your DNS record. DNS resolves to the load balancer's anycast IP; the load balancer then distributes individual requests across a backend pool with millisecond-level health awareness. This combines DNS-layer global routing with application-layer precision.
## Combining Multiple Routing Policies
Real-world configurations combine multiple policies:
```
Geographic routing:
→ US users: weighted round-robin across us-east-1 + us-west-2 (active-active)
→ EU users: eu-west-1 (active) + us-east-1 (failover)
→ APAC users: ap-northeast-1 (active) + ap-southeast-1 (failover)
Each regional endpoint:
→ Health check on primary
→ Automatic failover to secondary if health check fails
→ 30-second TTL on all records
```
This architecture provides both global traffic distribution and regional fault tolerance. Modern Managed DNS providers expose these policies through their APIs, enabling infrastructure-as-code approaches with Terraform or Pulumi. For the anycast infrastructure underlying these routing decisions, see Anycast DNS: How Global DNS Networks Work. For geographic routing rules that complement health-check failover, see GeoDNS: Location-Based DNS Routing.
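The geographic-then-failover decision chain above can be sketched as a two-step lookup. This is an illustrative model only (the US active-active weighted pool is omitted for brevity); region names match the example, everything else is hypothetical.

```python
# Per-geography policy: an active region plus a failover region.
REGIONS = {
    "EU":   {"active": "eu-west-1",      "failover": "us-east-1"},
    "APAC": {"active": "ap-northeast-1", "failover": "ap-southeast-1"},
}

def resolve(client_geo, healthy_endpoints):
    """Step 1: pick the regional policy for the client's geography.
    Step 2: return the active region if its health check passes,
    otherwise fail over to the secondary."""
    policy = REGIONS[client_geo]
    if policy["active"] in healthy_endpoints:
        return policy["active"]
    return policy["failover"]  # health check failed: use secondary

print(resolve("EU", healthy_endpoints={"eu-west-1", "us-east-1"}))  # eu-west-1
print(resolve("EU", healthy_endpoints={"us-east-1"}))               # us-east-1
```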