DNS Performance Optimization at Scale


## Why DNS Performance Matters More Than You Think

DNS is the first step in almost every network connection. A user navigating to your website triggers DNS resolution before the TCP connection, TLS handshake, and HTTP request can even begin. At scale — millions of users per day — even single-digit millisecond improvements in DNS resolution time compound into measurable page load improvements and reduced infrastructure cost.

More critically, DNS performance degrades under the same conditions that stress the rest of your infrastructure: high traffic, DDoS attacks, and sudden traffic spikes. An authoritative server that cannot keep up with query volume becomes a single point of failure for every service it serves. Optimizing DNS performance is not just about speed — it is about resilience.

## TTL: The Most Impactful Knob

TTL (Time To Live) is the most influential DNS performance variable and also the most commonly misunderstood. Every DNS resource record's TTL controls how long resolver caches and clients hold that answer before re-querying.

**Low TTL (10–60 seconds)**: Every client re-queries frequently. Cache hit rate is low. Your authoritative servers receive high query volume. Changes propagate quickly. Appropriate for: health-check-based failover records and records that change frequently.

**High TTL (3600–86400 seconds)**: Cache hit rates are high. Authoritative servers receive low query volume per record. Changes propagate slowly. Appropriate for: stable records (nameservers, MX for established mail servers, CDN endpoint CNAMEs).

**The mathematics**: An A record with TTL=300 queried by 1 million continuously active clients, each re-querying once per TTL window, results in up to approximately 3,333 queries per second against your authoritative server (1,000,000 ÷ 300 s ≈ 3,333/s). The same record with TTL=3600 produces only about 278 queries per second — a 12× reduction in authoritative query load.
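The arithmetic above is easy to reproduce; here is a quick illustrative sketch (`auth_qps` is a hypothetical helper name, not from any DNS library):

```python
def auth_qps(active_clients: int, ttl_seconds: int) -> float:
    """Worst-case authoritative query rate when every client
    re-queries the record once per TTL window."""
    return active_clients / ttl_seconds

# 1M clients, TTL=300 -> ~3,333 queries/second at the authoritative server
print(round(auth_qps(1_000_000, 300)))   # 3333
# Raising the TTL to 3600 cuts that by 12x
print(round(auth_qps(1_000_000, 3600)))  # 278
```

The same back-of-the-envelope calculation works for any record: divide the number of independently caching clients by the TTL to bound the authoritative load.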
**TTL strategy by record type**:

| Record Type | Recommended TTL | Rationale |
|---|---|---|
| NS | 86400 (24h) | Rarely changes; high cache value |
| MX | 3600 (1h) | Changes infrequently |
| A (stable server) | 3600 (1h) | Good balance |
| A (failover record) | 30–60s | Fast failover response |
| CNAME (CDN) | 300–3600s | CDN endpoints are stable |
| TXT (SPF, DKIM) | 3600 (1h) | Rarely changes |
| TXT (verification tokens) | 300s | May need to change during verification |

## Resolver-Side Caching and Prefetching

Understanding how caching and TTL handling work in recursive DNS resolver software is essential for reasoning about real-world performance.

**Prefetching**: Unbound and BIND 9 support prefetch mode, where the resolver proactively refreshes an actively queried cached record when less than 10% of its TTL remains. This prevents cache-miss spikes when popular records expire:

```conf
# Unbound prefetch config
prefetch: yes
prefetch-key: yes
```

With prefetching, a record with TTL=300 that is queried 1,000 times per second will never cause a cache miss — the resolver refreshes it at around second 270 and serves fresh data continuously.

**Negative caching**: NXDOMAIN responses are cached for the SOA Minimum TTL (see SOA Records: Zone Authority Configuration for details). Negative cache entries consume resolver memory and can mask legitimate new records. Monitor negative cache size under high-traffic conditions.

**Cache size tuning**: Resolver cache size limits how many entries fit in memory. An undersized cache evicts entries too aggressively, reducing hit rates.

```conf
# Unbound cache size
msg-cache-size: 256m
rrset-cache-size: 512m
```

## EDNS0 and Large UDP Responses

EDNS0 (Extension Mechanisms for DNS, RFC 6891) extends DNS beyond the original 512-byte UDP payload limit. Modern DNS responses — especially those including DNSSEC signatures, multiple records, or large TXT records — frequently exceed 512 bytes.
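To make the mechanism concrete, here is a minimal stdlib-only sketch of the EDNS0 OPT pseudo-record, whose CLASS field carries the advertised UDP payload size (wire layout per RFC 6891; the helper names are assumptions for illustration):

```python
import struct

def build_opt_record(udp_payload_size: int = 4096) -> bytes:
    """Build an EDNS0 OPT pseudo-record (RFC 6891).

    Wire format: root name (one zero byte), TYPE=41 (OPT),
    CLASS = advertised max UDP payload size,
    TTL = extended RCODE and flags (0 here), RDLEN = 0.
    """
    return struct.pack("!BHHIH", 0, 41, udp_payload_size, 0, 0)

def advertised_payload_size(opt_record: bytes) -> int:
    """Read the payload size back out of the CLASS field."""
    _name, _rrtype, size, _ttl, _rdlen = struct.unpack("!BHHIH", opt_record)
    return size

opt = build_opt_record(4096)
print(advertised_payload_size(opt))  # 4096
print(len(opt))  # 11 (bytes appended to the additional section)
```

Overloading the CLASS field this way is why a plain pre-EDNS0 server simply ignores the OPT record and stays limited to 512-byte responses.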
EDNS0 allows clients and servers to advertise their maximum UDP payload size (typically 4096 bytes today). Without EDNS0, the DNS server falls back to TCP for large responses, adding connection overhead.

```bash
# Check EDNS0 support and advertised payload size
dig example.com +edns=0 +dnssec

# Disable EDNS0 to simulate legacy clients (troubleshooting)
dig example.com +noedns

# Check if a server correctly handles a 4096-byte advertised buffer
dig +bufsize=4096 example.com @ns1.example.com
```

Ensure your authoritative servers and firewalls allow UDP payloads up to 4096 bytes. Firewall rules that fragment or drop large DNS UDP packets are a common source of performance problems and DNSSEC breakage.

## Authoritative Server Performance Tuning

For self-hosted authoritative servers (BIND, PowerDNS — see Running Your Own DNS Server: BIND vs PowerDNS):

**BIND 9 tuning**:

```conf
options {
    // Authoritative-only: disable recursion entirely
    recursion no;

    // TCP client limit
    tcp-clients 150;

    // Minimal responses (omit authority/additional sections unless needed)
    minimal-responses yes;
};
```

Worker thread count in BIND 9 is set at startup with `named -n <threads>`, not in `named.conf`; by default it matches the number of logical CPUs.

**PowerDNS tuning**:

```conf
# Number of distributor threads
distributor-threads=3

# Number of receiver threads
receiver-threads=3

# Maximum lifetime of a TCP connection (seconds)
max-tcp-connection-duration=60

# Query cache TTL (seconds)
query-cache-ttl=20
```

## Benchmarking DNS with dnsperf

`dnsperf` (maintained by DNS-OARC) is the standard tool for DNS load testing:

```bash
# Install
apt install dnsperf

# Create a query file (one "name type" pair per line)
cat > queries.txt <<EOF
example.com A
www.example.com A
example.com MX
example.com TXT
EOF

# Run 30 seconds of load against the authoritative server
dnsperf -s ns1.example.com -d queries.txt -l 30
```

Watch these thresholds in production:

- Query latency > 50ms: investigate resolver or upstream issues
- Cache hit rate < 80%: TTLs may be too low or cache undersized
- SERVFAIL rate > 0.1%: zone configuration or DNSSEC issue

## Anycast DNS and PoP Selection

For operators running Managed DNS, query performance is largely determined by which Anycast DNS PoP serves each query.
Clients in regions with no nearby PoP experience higher latency. When selecting a DNS provider, evaluate:

1. Number and geographic distribution of PoPs
2. PoP presence at major Internet exchange points in your users' regions
3. Published latency data or Catchpoint/Cedexis measurement data

For DIY DNS Hosting, deploying anycast across multiple cloud regions (using BGP anycast with providers like Vultr, Equinix Metal, or Hetzner) is achievable with modest effort and dramatically reduces global latency compared with a single-region deployment.

## TCP Connection Reuse and DNS over TLS

Traditional DNS uses UDP with fallback to TCP for large responses. Each TCP DNS connection carries handshake overhead. DNS over TLS (DoT) and DNS over HTTPS (DoH) maintain persistent connections, amortizing the TLS handshake cost across many queries.

For high-volume DoH/DoT clients (resolvers serving many users), connection pooling is critical:

```python
# Python example: persistent DoH session
# (requires HTTP/2 support: pip install "httpx[http2]")
import httpx

domains = ["example.com", "example.org", "example.net"]  # illustrative list

with httpx.Client(http2=True) as client:
    # All queries reuse the same HTTP/2 connection
    for domain in domains:
        resp = client.get(
            "https://cloudflare-dns.com/dns-query",
            params={"name": domain, "type": "A"},
            headers={"accept": "application/dns-json"},
        )
```

HTTP/2 multiplexing allows hundreds of concurrent DNS queries over a single TLS connection. Without connection reuse, each query pays a full TLS handshake penalty (~50–100ms), making DoH dramatically slower than UDP DNS for naive implementations.
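The amortization argument is simple arithmetic; a quick sketch (75 ms is an assumed midpoint of the 50–100 ms handshake range, and 5 ms an assumed per-query round trip):

```python
def total_time_ms(queries: int, handshake_ms: float, rtt_ms: float,
                  reuse_connection: bool) -> float:
    """Total wall time for N sequential DoH queries, with or
    without reusing a single TLS connection."""
    if reuse_connection:
        # One handshake, amortized across every query
        return handshake_ms + queries * rtt_ms
    # A fresh handshake for every query
    return queries * (handshake_ms + rtt_ms)

# 100 queries, 75 ms TLS handshake, 5 ms per query round trip
print(total_time_ms(100, 75.0, 5.0, reuse_connection=True))   # 575.0
print(total_time_ms(100, 75.0, 5.0, reuse_connection=False))  # 8000.0
```

Under these assumptions, connection reuse is roughly 14× faster for a 100-query batch, which is why pooling dominates DoH client performance.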
