DNS Troubleshooting Guide: Fix Resolution Issues in Minutes
It is Monday morning, 8:47 AM. Half the office cannot open the company's CRM. Slack is working. Google loads fine. But the CRM at crm.yourcompany.com returns "This site can't be reached - DNS_PROBE_FINISHED_NXDOMAIN." The helpdesk queue fills with tickets. Someone suggests rebooting the router. Someone else blames the ISP. Forty-five minutes later, a junior admin discovers that a DNS record was accidentally deleted during Friday's infrastructure maintenance. The fix takes 30 seconds - adding back one A record. But finding the cause took 45 minutes because no one followed a systematic DNS troubleshooting process.
DNS - the Domain Name System - translates human-readable domain names into IP addresses. It is the foundation that every network service depends on. When DNS fails, nothing works - websites, email, authentication, file shares, VPN, and cloud applications all rely on name resolution. Yet DNS troubleshooting is not something most IT teams practice systematically. They know DNS exists, they know it matters, and they scramble when it breaks.
This guide provides a structured approach to diagnosing and fixing DNS issues, from basic client-side problems to complex enterprise DNS architectures.
How DNS Works: A 60-Second Refresher
Understanding the DNS resolution process is essential for troubleshooting. When a user types a domain name into their browser, here is what happens:
- Local cache check. The operating system checks its local DNS cache for a previously resolved entry. If found and not expired (TTL has not elapsed), the cached IP is returned immediately. No network traffic occurs.
- Hosts file check. The OS checks the local hosts file (C:\Windows\System32\drivers\etc\hosts on Windows, /etc/hosts on Linux and macOS). Entries in this file override DNS for the specified hostnames.
- Recursive resolver query. The OS sends a DNS query to its configured recursive resolver - typically your internal DNS server or your ISP's DNS. The resolver checks its own cache first.
- Iterative resolution. If the resolver does not have the answer cached, it performs iterative resolution: querying root name servers, then TLD (top-level domain) servers, then the authoritative name servers for the domain. Each level directs the resolver to the next level until the authoritative server returns the actual IP address.
- Response and caching. The resolver returns the answer to the client and caches it for the duration specified by the record's TTL (Time To Live). The client also caches the result locally.
At any step in this chain, something can go wrong. Systematic troubleshooting means identifying exactly which step is failing.
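To make the caching in the first and last steps concrete, here is a minimal Python sketch of a TTL-bound cache. It is an illustrative model only, not how any operating system's resolver is actually implemented; the class name `TtlCache` and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class TtlCache:
    """Toy DNS cache: each entry expires once its TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be simulated
        self._entries = {}       # lowercase name -> (ip, expiry timestamp)

    def put(self, name, ip, ttl_seconds):
        """Store an answer until ttl_seconds from now."""
        self._entries[name.lower()] = (ip, self._clock() + ttl_seconds)

    def get(self, name):
        """Return the cached IP, or None if it is missing or expired."""
        key = name.lower()
        entry = self._entries.get(key)
        if entry is None:
            return None
        ip, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # expired: the next lookup must hit the resolver
            return None
        return ip
```

A `None` result corresponds to the point in the chain where the client has to fall through to the next step instead of answering from its cache.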
Diagnostic Tools Every IT Admin Needs
nslookup
The most widely available DNS diagnostic tool. It is installed by default on Windows and macOS and is available on most Linux distributions, sometimes through a package such as dnsutils or bind-utils. It queries DNS servers directly and shows you exactly what response you get.
# Basic lookup - uses the system's default DNS server
nslookup example.com

# Query a specific DNS server
nslookup example.com 8.8.8.8

# Query a specific record type
nslookup -type=MX example.com
nslookup -type=TXT example.com
nslookup -type=CNAME app.example.com

# Reverse lookup (IP to hostname)
nslookup 192.168.1.10

# Query with debug output
nslookup -debug example.com
The key troubleshooting technique with nslookup is comparing results from different DNS servers. If nslookup crm.company.com fails (using your internal DNS) but nslookup crm.company.com 8.8.8.8 succeeds (using Google's public DNS), the issue is with your internal DNS server, not with the domain itself.
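The comparison described above can be written down as a small decision function. This is a sketch under stated assumptions: the function name `compare_resolvers` is hypothetical, and each argument is simply the list of IP addresses you read off the nslookup output (or `None` if that lookup failed). The conclusions are starting points for investigation, not definitive diagnoses.

```python
def compare_resolvers(internal_answer, public_answer):
    """Interpret lookups of one name against an internal and a public resolver.

    Each argument is the list of IPs returned, or None if the lookup failed.
    """
    if internal_answer and public_answer:
        if set(internal_answer) == set(public_answer):
            return "both resolve identically: DNS is probably not the problem"
        return "answers differ: may be split DNS, a stale cache, or normal load balancing"
    if public_answer and not internal_answer:
        return "internal DNS is failing: check the internal server, its zones and forwarders"
    if internal_answer and not public_answer:
        return "only internal DNS resolves: likely an internal-only name"
    return "neither resolves: check the record at the authoritative source"
```

For example, `compare_resolvers(None, ["192.0.2.10"])` points at the internal DNS server, which matches the case described in the paragraph above.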
dig (Domain Information Groper)
More powerful than nslookup and preferred by experienced administrators. Available natively on Linux and macOS, and installable on Windows through BIND tools or WSL.
# Basic lookup
dig example.com

# Query specific record type
dig example.com MX
dig example.com TXT

# Query a specific DNS server
dig @8.8.8.8 example.com

# Trace the full resolution path
dig +trace example.com

# Short output (just the answer)
dig +short example.com

# Check SOA record (useful for zone transfer issues)
dig example.com SOA

# Check DNSSEC validation
dig +dnssec example.com
The +trace option is particularly valuable. It shows every step of the resolution process from root servers to the authoritative answer, making it immediately obvious where in the chain the resolution is failing.
PowerShell Resolve-DnsName
On Windows, PowerShell provides a modern alternative to nslookup with richer output formatting:
# Basic resolution
Resolve-DnsName example.com

# Query specific record type
Resolve-DnsName -Name example.com -Type MX

# Query specific DNS server
Resolve-DnsName -Name example.com -Server 8.8.8.8

# Request all record types (many servers return only a minimal answer to ANY queries)
Resolve-DnsName -Name example.com -Type ANY

# DNS cache on the local machine
Get-DnsClientCache | Where-Object Name -like "*example*"
tracert and pathping
When you suspect the issue is network connectivity to the DNS server rather than DNS configuration, tracert (Windows) or traceroute (Linux/macOS) shows the network path to the DNS server. pathping on Windows combines tracert with packet loss statistics at each hop, which is invaluable for identifying intermittent DNS timeout issues caused by lossy network links.
# Trace route to your DNS server
tracert 10.0.0.1

# Pathping for detailed loss statistics
pathping 10.0.0.1
Common DNS Issues and Fixes
NXDOMAIN - Domain Not Found
NXDOMAIN means the authoritative DNS server for the domain reports that the requested name does not exist. This is a definitive "no" - not a timeout or error, but an explicit statement that no record exists.
Common causes and fixes:
- Typo in the domain name. Verify spelling. Check for common mistakes like missing hyphens, doubled letters, or wrong TLDs (.com vs .co, .io vs .com).
- Deleted DNS record. If you manage the DNS zone, check the authoritative DNS server for the missing record. Someone may have deleted it accidentally during maintenance. Check DNS server audit logs for recent changes.
- Expired domain. Run a WHOIS lookup. If the domain registration has expired, DNS records are removed. Renew the domain immediately - most registrars provide a grace period.
- Negative caching. DNS servers cache NXDOMAIN responses (negative caching). If a record was missing temporarily and then added, clients and recursive resolvers may still have the NXDOMAIN cached. Flush the client cache with ipconfig /flushdns and, if you control the recursive resolver, flush its cache for the specific domain. The negative cache TTL is the lesser of the SOA record's own TTL and its minimum TTL field.
- Split DNS mismatch. In environments using split DNS (different responses for internal vs external queries), the client may be querying the wrong DNS server. Internal hostnames queried against external DNS will return NXDOMAIN because the external server has no knowledge of internal zones.
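To estimate how long a cached NXDOMAIN can linger, you can read the relevant values from the zone's SOA record. The sketch below assumes the single-line presentation format that dig prints for an SOA answer and follows RFC 2308, which uses the lesser of the SOA record's TTL and its MINIMUM field. The function name `negative_cache_ttl` and the sample record are hypothetical.

```python
def negative_cache_ttl(soa_line):
    """Return the negative-caching TTL in seconds from one SOA answer line.

    Expected field order (dig presentation format):
    name TTL class SOA mname rname serial refresh retry expire minimum
    """
    fields = soa_line.split()
    if len(fields) != 11 or fields[3].upper() != "SOA":
        raise ValueError("not a single-line SOA record: %r" % soa_line)
    record_ttl = int(fields[1])
    minimum = int(fields[10])
    # RFC 2308: NXDOMAIN is cached for the lesser of these two values.
    return min(record_ttl, minimum)


# Hypothetical SOA line, as it might appear in the output of `dig example.com SOA`
sample = ("example.com. 3600 IN SOA ns1.example.com. hostmaster.example.com. "
          "2024010101 7200 3600 1209600 300")
print(negative_cache_ttl(sample))
```

Individual resolvers may also cap negative caching at their own configured maximum, so treat the result as an upper bound for planning rather than a guarantee.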
DNS Timeout - Server Not Responding
A DNS timeout means the client sent a query but received no response within the timeout period (typically 2-5 seconds per attempt, with 2-3 retry attempts).
Diagnostic steps:
- Verify connectivity to the DNS server. Ping the DNS server IP. If it is unreachable, the issue is most likely network connectivity rather than DNS (unless ICMP is blocked along the path). Check firewall rules - DNS uses UDP port 53 (and TCP port 53 for zone transfers and large responses).
- Check if the DNS service is running. On Windows Server, open the DNS Manager console or run Get-Service DNS in PowerShell. On Linux, check with systemctl status named (BIND) or systemctl status unbound.
- Check DNS server resource utilization. An overloaded DNS server (high CPU, exhausted memory, or maxed-out file descriptors) may not respond to queries even though the service is technically running. Check server performance metrics during the timeout window.
- Check forwarder availability. If your DNS server forwards queries to upstream resolvers, test those forwarders directly. A timeout at the forwarder level causes your server to time out when answering its clients. Run nslookup example.com [forwarder-IP] for each configured forwarder.
- Check for network-level DNS filtering. Some firewalls, proxy servers, or security appliances inspect DNS traffic and may silently drop queries they consider malicious, causing timeouts rather than explicit denials.
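When you need to separate "no answer at all" from "an answer with an error code", a raw UDP probe can help. The following is a minimal sketch using only the Python standard library; `build_query` and `probe` are hypothetical helper names, and the code handles IPv4 over UDP only, with no retries and no TCP fallback.

```python
import secrets
import socket
import struct


def build_query(name, qtype=1, query_id=None):
    """Build a minimal DNS query packet (qtype 1 = A record)."""
    if query_id is None:
        query_id = secrets.randbelow(65536)   # random transaction ID
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT, NSCOUNT, ARCOUNT
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)   # class 1 = IN


def probe(server_ip, name, timeout=3.0):
    """Send one UDP query; return the response code, or None if nothing usable came back."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return None                        # no response within the timeout
    if len(reply) < 12 or reply[:2] != packet[:2]:
        return None                            # truncated header or mismatched ID
    flags = struct.unpack("!H", reply[2:4])[0]
    return flags & 0x000F                      # 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN
```

Calling `probe("10.0.0.1", "example.com")` against your own server and then against each forwarder shows which hop stops answering: `None` suggests a timeout or filtering, while a numeric code means the server is reachable and responding.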
DNS Cache Poisoning
Cache poisoning occurs when an attacker injects fraudulent DNS responses into a resolver's cache, causing the resolver to return incorrect IP addresses for legitimate domain queries. Users type the correct domain name but are redirected to the attacker's server.
Indicators of cache poisoning:
- Multiple users reporting SSL certificate warnings for well-known websites
- DNS responses that resolve known domains to unexpected IP addresses
- The suspicious IP addresses belong to networks unrelated to the legitimate domain owner
- Clearing the DNS cache temporarily fixes the issue, but it recurs
Response steps:
- Immediately flush the DNS server cache to remove poisoned entries
- Enable DNSSEC validation on your recursive resolver to cryptographically verify responses
- Ensure your DNS server uses randomized source ports and transaction IDs (modern DNS software does this by default, but legacy configurations may not)
- Restrict recursive DNS service to internal clients only - do not allow your DNS server to resolve queries from the internet
- Consider deploying DNS-over-HTTPS (DoH) or DNS-over-TLS (DoT) for queries to upstream resolvers to prevent manipulation in transit
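One of the indicators listed above, known domains resolving to unexpected addresses, can be checked mechanically if you maintain a list of networks where each critical name is expected to live. The sketch below assumes you supply that list yourself; the function name `unexpected_answers` and the sample networks (documentation address ranges) are hypothetical. Names hosted on CDNs change addresses legitimately, so a hit is a prompt to investigate, not proof of poisoning.

```python
import ipaddress


def unexpected_answers(resolved_ips, expected_networks):
    """Return the resolved IPs that fall outside every expected network."""
    networks = [ipaddress.ip_network(net) for net in expected_networks]
    suspicious = []
    for ip_text in resolved_ips:
        ip = ipaddress.ip_address(ip_text)
        if not any(ip in net for net in networks):
            suspicious.append(ip_text)
    return suspicious


# Hypothetical example: the CRM is expected to resolve inside 192.0.2.0/24
print(unexpected_answers(["192.0.2.10", "203.0.113.7"], ["192.0.2.0/24"]))
```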
Stale DNS Records
DNS records that point to IP addresses no longer assigned to the correct resource cause intermittent connectivity issues. This is especially common in environments with dynamic IP assignments, cloud infrastructure, or frequent server migrations.
- Enable DNS scavenging. On Windows DNS Server, enable aging and scavenging to automatically remove dynamic DNS records that have not been refreshed. Set the no-refresh interval to 7 days and the refresh interval to 7 days. Configure scavenging to run daily. This prevents stale records from accumulating.
- Audit static records quarterly. Export all static DNS records and verify that each one still points to an active resource. Delete records for decommissioned servers. This is tedious but prevents the gradual accumulation of orphaned records that cause confusing resolution issues months later.
- Use CNAMEs for services that may move. Instead of A records pointing directly to IP addresses, create CNAME records for services that point to the server's hostname A record. When the server's IP changes, you only update one A record instead of a separate A record for every service.
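The scavenging intervals mentioned above add up: with a 7-day no-refresh interval and a 7-day refresh interval, a dynamic record becomes eligible for removal 14 days after its last refresh. The sketch below is a simplified model of that arithmetic, not the Windows implementation; actual removal also depends on zone and server settings and on when the scavenging cycle runs. The function name `is_scavengeable` is hypothetical.

```python
from datetime import datetime, timedelta


def is_scavengeable(last_refresh, now, no_refresh_days=7, refresh_days=7):
    """True if a dynamic record is old enough to be eligible for scavenging."""
    eligible_at = last_refresh + timedelta(days=no_refresh_days + refresh_days)
    return now >= eligible_at


# A record last refreshed on 1 March is eligible from 15 March onward
print(is_scavengeable(datetime(2024, 3, 1), datetime(2024, 3, 14)))
print(is_scavengeable(datetime(2024, 3, 1), datetime(2024, 3, 15)))
```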
Internal DNS vs Public DNS
Most enterprise environments operate two DNS layers: internal DNS servers that resolve both internal and external names, and external DNS that is authoritative for the organization's public domain names. Understanding the distinction is critical for troubleshooting.
Internal DNS
Internal DNS servers (typically Active Directory-integrated DNS on Windows, or BIND/Unbound on Linux) serve two functions: resolving internal hostnames that are not published to the internet, and forwarding external name resolution queries to upstream resolvers on behalf of internal clients.
Common internal DNS issues:
- Zone replication failures. In Active Directory-integrated DNS, zone data replicates as part of AD replication. If AD replication is broken between domain controllers, DNS zones diverge, causing different clients to get different answers depending on which DC they query. Diagnose with repadmin /showrepl and dcdiag /test:dns.
- Missing reverse lookup zones. Internal applications, logging systems, and security tools often perform reverse DNS lookups (IP to hostname). If reverse lookup zones are not configured or maintained, these lookups fail, causing application timeouts and incomplete log entries. Create reverse lookup zones for every internal IP subnet.
- SRV record issues. Active Directory relies heavily on SRV records for locating domain controllers, global catalog servers, and Kerberos KDCs. Corrupted or missing SRV records cause authentication failures, domain join failures, and group policy application problems. Restart the Netlogon service on affected DCs to re-register SRV records.
Public DNS
Public DNS records are managed at your domain registrar or a hosted DNS provider (Cloudflare, AWS Route 53, Azure DNS). Issues with public DNS affect external access to your services - website, email delivery, and externally-facing applications.
- Propagation delays. When you change a public DNS record, the change does not take effect instantly worldwide. DNS resolvers around the world cache the old record until its TTL expires. For time-sensitive changes, lower the TTL to 300 seconds (5 minutes) at least 48 hours before the planned change, make the change, and raise the TTL back to 3600 seconds (1 hour) or higher after confirming the change works.
- Registrar vs hosted DNS mismatch. If your domain is registered at one provider but your authoritative DNS is hosted at another, ensure the registrar's nameserver records (NS records) point to the correct hosted DNS provider. Mismatched NS records are a common cause of "works for some people but not others" DNS issues.
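The TTL-lowering procedure for planned changes comes down to a few dates. The following sketch works them out under the assumption that resolvers honor TTLs; the function name `plan_ttl_change` and its parameters are hypothetical, with defaults taken from the values given above (300-second TTL, 48-hour lead).

```python
from datetime import datetime, timedelta


def plan_ttl_change(change_time, old_ttl_seconds, low_ttl_seconds=300, lead_hours=48):
    """Sketch the timeline for a record change made with a lowered TTL."""
    lead = timedelta(hours=lead_hours)
    return {
        # Latest moment to publish the lowered TTL.
        "lower_ttl_by": change_time - lead,
        # The lead only helps if cached copies carrying the old TTL
        # have all expired before the change is made.
        "lead_is_sufficient": timedelta(seconds=old_ttl_seconds) <= lead,
        # After the change, caches that honor TTLs should have refreshed by then.
        "caches_refreshed_by": change_time + timedelta(seconds=low_ttl_seconds),
    }


plan = plan_ttl_change(datetime(2024, 6, 10, 9, 0), old_ttl_seconds=3600)
print(plan)
```

If `lead_is_sufficient` comes back `False` (for example, with a 7-day TTL on the existing record), lower the TTL earlier than 48 hours ahead.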
Conditional Forwarding
Conditional forwarding directs your DNS server to forward queries for specific domains to designated DNS servers rather than using the default resolution path. This is essential in several enterprise scenarios.
When to Use Conditional Forwarding
- Multi-forest Active Directory. When your organization has multiple AD forests (from acquisitions, mergers, or business unit separation), each forest's DNS server must know how to resolve hostnames in the other forest. Configure conditional forwarders for each forest's DNS domain pointing to that forest's domain controllers.
- Hybrid cloud. Azure Private DNS zones, AWS Route 53 private hosted zones, and GCP Cloud DNS private zones are only resolvable by DNS resolvers within those cloud environments. Configure conditional forwarders on your on-premises DNS to forward queries for cloud private zones to DNS resolver endpoints in those cloud environments.
- Partner or vendor connectivity. When connected to a partner network via VPN or dedicated link, conditional forwarders allow your users to resolve the partner's internal hostnames by forwarding queries for their domain to their DNS servers.
Configuration
On Windows DNS Server:
# PowerShell - add conditional forwarder
Add-DnsServerConditionalForwarderZone `
    -Name "partner.example.com" `
    -MasterServers 10.20.30.1,10.20.30.2 `
    -ReplicationScope "Forest"
On BIND, add to named.conf:
zone "partner.example.com" {
type forward;
forward only;
forwarders { 10.20.30.1; 10.20.30.2; };
};
Always configure at least two target servers for redundancy. Test with nslookup host.partner.example.com from a client to verify the conditional forwarder is working.
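Conceptually, a server with conditional forwarders picks the forwarder list of the most specific configured zone that the queried name falls under, and uses the default path otherwise. The sketch below models that selection as a longest-suffix match. It illustrates the idea only and is not how Windows DNS or BIND implement it; the function name `pick_forwarders` and the sample addresses are assumptions.

```python
def pick_forwarders(query_name, conditional_zones, default_forwarders):
    """Return the forwarders for the most specific zone matching query_name.

    conditional_zones maps a zone name to its list of forwarder IPs.
    """
    name = query_name.rstrip(".").lower()
    matches = []
    for zone in conditional_zones:
        suffix = zone.rstrip(".").lower()
        # Match the zone itself or any name beneath it (on a label boundary).
        if name == suffix or name.endswith("." + suffix):
            matches.append(zone)
    if not matches:
        return default_forwarders
    best = max(matches, key=lambda zone: len(zone.rstrip(".")))
    return conditional_zones[best]


zones = {"partner.example.com": ["10.20.30.1", "10.20.30.2"]}
print(pick_forwarders("host.partner.example.com", zones, ["8.8.8.8"]))
```

The label-boundary check matters: a name such as notpartner.example.com must not be sent to the partner's servers just because it ends with the same characters.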
DNSSEC Basics
DNSSEC adds a layer of cryptographic authentication to DNS responses. Without DNSSEC, a resolver has no way to verify that the response it received actually came from the authoritative DNS server and was not modified in transit.
How DNSSEC Works
DNSSEC uses public key cryptography to sign DNS records. The authoritative DNS server for a zone holds a private key and uses it to sign every record set in the zone. Resolvers use the corresponding public key (published as DNSKEY records) to verify signatures. A chain of trust extends from the root DNS servers through TLD servers to the authoritative server for each domain, with each level signing the keys for the level below.
Enabling DNSSEC Validation
Enabling DNSSEC validation on your recursive resolver is the most impactful step. This tells your DNS server to verify DNSSEC signatures on responses from the internet and reject responses that fail validation.
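On Windows DNS Server (Server 2012 and later), responses are validated only for zones that have a trust anchor configured. Confirm that the trust anchors you expect, including the root if you want validation for internet names, are listed in DNS Manager under Trust Points.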
On Unbound (a common Linux recursive resolver):
# In unbound.conf
server:
auto-trust-anchor-file: "/var/lib/unbound/root.key"
val-clean-additional: yes
On BIND:
# In named.conf options
options {
dnssec-validation auto;
};
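After enabling validation, one way to confirm it is working is to query a signed domain through your resolver with dig +dnssec and look for the `ad` (authenticated data) flag in the response header. The sketch below parses that flags line from captured dig output; the function name `has_ad_flag` is hypothetical and the sample text is an illustrative header, not output captured from a real query.

```python
def has_ad_flag(dig_output):
    """True if the header flags line of dig output includes 'ad'."""
    prefix = ";; flags:"
    for line in dig_output.splitlines():
        if line.startswith(prefix):
            # Flags are listed between the prefix and the first semicolon.
            flags = line[len(prefix):].split(";", 1)[0].split()
            return "ad" in flags
    return False


# Illustrative header line in the format dig prints
sample = ";; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1"
print(has_ad_flag(sample))
```

A missing `ad` flag on a domain you know to be signed suggests the resolver is not validating, or that something between you and the resolver is stripping the flag.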
DNS Monitoring and Alerting
Proactive DNS monitoring catches issues before users report them. Configure monitoring for these metrics:
- Query response time. Monitor the time your DNS server takes to respond to queries. Establish a baseline (typically 1-5 milliseconds for cached responses, 20-100 milliseconds for recursive resolution) and alert when response times exceed double the baseline for more than 5 minutes.
- Query failure rate. Track the percentage of queries that result in SERVFAIL or timeout responses. A baseline failure rate above 1% indicates a configuration issue. A sudden spike in failures indicates an acute problem - zone corruption, forwarder failure, or resource exhaustion.
- DNS server availability. Monitor each DNS server with synthetic queries every 30-60 seconds. If the server fails to respond to 3 consecutive queries, alert immediately. DNS server downtime has a blast radius far beyond the server itself.
- Zone transfer success. For environments using secondary DNS servers with zone transfers, monitor that transfers complete successfully. A failed zone transfer means the secondary server has stale data that will diverge further from the primary over time.
- Cache hit ratio. A healthy DNS resolver should have a cache hit ratio above 80%. A ratio below 50% indicates either an undersized cache, excessively short TTL values, or an unusual query pattern that may indicate malware performing DNS-based C2 communication.
Tools for DNS monitoring include Nagios with check_dns plugins, PRTG DNS Sensor, Zabbix DNS monitoring templates, and purpose-built solutions like BlueCat DNS Edge or Infoblox DNS monitoring. For smaller environments, a simple PowerShell or Bash script that runs Resolve-DnsName or dig against critical internal and external domains every minute and logs response times is surprisingly effective.
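If you go the script route, the alerting thresholds described above (three consecutive failed probes, or response times above double the baseline for more than 5 minutes) can be evaluated with a few lines of logic. This is a minimal sketch; the function name `should_alert` and the shape of the `samples` list are assumptions, and collecting the samples (for example by timing dig or Resolve-DnsName) is left to your script.

```python
def should_alert(samples, baseline_ms, sustained_seconds=300, max_failures=3):
    """Decide whether to alert from a time-ordered list of probe results.

    samples: list of (timestamp_seconds, response_ms), where response_ms
    is None when the probe got no response.
    Returns "unavailable", "slow", or None.
    """
    if not samples:
        return None
    # Rule 1: the most recent probes all failed.
    recent = samples[-max_failures:]
    if len(recent) == max_failures and all(ms is None for _, ms in recent):
        return "unavailable"
    # Rule 2: every probe in the trailing run exceeded double the baseline,
    # and that run spans more than sustained_seconds.
    latest = samples[-1][0]
    slow_since = None
    for timestamp, ms in reversed(samples):
        if ms is None or ms <= 2 * baseline_ms:
            break
        slow_since = timestamp
    if slow_since is not None and latest - slow_since > sustained_seconds:
        return "slow"
    return None
```

Requiring a sustained run rather than a single slow probe keeps one-off spikes from paging anyone, at the cost of a few minutes of detection delay.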
A Systematic DNS Troubleshooting Workflow
When a DNS issue is reported, follow this structured process:
- Scope the issue. Is it one user, one department, one site, or everyone? One domain or all domains? This immediately narrows the problem space.
- Check the client. On the affected machine, run ipconfig /all to verify DNS server configuration. Flush the local cache with ipconfig /flushdns. Check the hosts file for overrides. Try nslookup [domain] [DNS-server-IP] to test resolution directly.
- Test against multiple DNS servers. Query your internal DNS, your ISP's DNS, and a public DNS (8.8.8.8 or 1.1.1.1). This isolates whether the issue is your internal DNS infrastructure, your ISP, or the domain itself.
- Check the DNS server. Is the service running? Are forwarders reachable? Are zone files intact? Is the server overloaded? Check event logs for DNS-related errors.
- Check the authoritative source. If the domain is yours, verify the record exists on the authoritative DNS server. If the domain is external, check propagation tools and the domain's WHOIS record for expiration.
- Document and prevent. After resolving the issue, document the root cause and implement monitoring to catch the same issue proactively next time.