Network Monitoring for IT Teams: Essential Metrics, Tools, and Alert Configuration

Published March 22, 2026 - 19 min read

A regional healthcare provider with 14 clinics and a central hospital discovered their electronic health records system had been experiencing intermittent slowdowns for three weeks. Doctors reported that patient charts took 8 to 12 seconds to load instead of the usual 2 seconds. The IT team investigated application servers, database performance, and storage latency before discovering the root cause: a single core switch at their data center had a failing port channel member, reducing aggregate bandwidth between their server VLAN and the distribution layer by 50 percent. The switch had been logging errors for 22 days. Nobody was monitoring those logs.

Network monitoring is the practice of continuously observing network infrastructure - switches, routers, firewalls, access points, WAN links, and connected services - to detect performance degradation, capacity constraints, and failures before they impact users. Effective monitoring transforms your IT team from reactive firefighters into proactive operators who detect and resolve problems before anyone submits a help desk ticket.

This guide covers what to monitor, how the underlying protocols work, which tools to consider, how to configure alerts that matter, how to design dashboards that provide genuine situational awareness, and how to build escalation workflows that ensure the right person responds at the right time.

Essential Network Metrics: What to Monitor

Network monitoring generates enormous volumes of data. The challenge is not collecting data - modern tools make collection straightforward. The challenge is knowing which metrics matter, what their values mean, and when a change in those values requires attention. Focus on these categories.

Bandwidth Utilization

Bandwidth utilization measures the percentage of available capacity currently in use on a network link. A 1 Gbps uplink running at 750 Mbps is at 75 percent utilization. Monitor bandwidth on WAN links (your connection to the internet, MPLS circuits, SD-WAN tunnels), inter-switch uplinks (trunk links between access and distribution or distribution and core layers), server farm links (connections between servers and the network core), and any link that serves as a single point of connectivity for multiple users or services.

Bandwidth thresholds depend on the link type and traffic pattern. For WAN links, sustained utilization above 70 percent during business hours indicates you are approaching capacity and should plan an upgrade. For LAN uplinks, sustained utilization above 50 percent warrants investigation since LAN links should rarely be the bottleneck. For internet circuits, track both peak and average utilization - peak matters for user experience, average matters for billing on metered connections.

Monitor bandwidth in both directions independently. An uplink may show low overall utilization but have an asymmetric problem - saturated in one direction while nearly idle in the other. This pattern is common with backup traffic, file synchronization, and cloud uploads.
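Monitoring tools derive utilization from interface octet counters polled over SNMP (ifHCInOctets and ifHCOutOctets for the two directions). A minimal sketch of that calculation, with illustrative function names and sample numbers of my own choosing:

```python
# Sketch: per-direction utilization from two samples of an interface's
# 64-bit SNMP octet counters (ifHCInOctets / ifHCOutOctets).
# Function and variable names are illustrative, not from any specific tool.

COUNTER64_MAX = 2**64  # 64-bit counters wrap around at this value

def utilization_pct(prev_octets, curr_octets, interval_s, link_bps):
    """Percent utilization of one direction over one polling interval."""
    delta = (curr_octets - prev_octets) % COUNTER64_MAX  # handles wrap
    bits = delta * 8
    return 100.0 * bits / (link_bps * interval_s)

# Example: 1 Gbps link polled every 300 s moved ~28.1 GB inbound,
# i.e. an average of 750 Mbps = 75 percent utilization
inbound = utilization_pct(1_000_000, 28_126_000_000, 300, 1_000_000_000)

# Poll each direction independently: a link can be saturated inbound
# while nearly idle outbound (backups, file sync, cloud uploads).
```

The modulo makes counter wraparound harmless; with 32-bit counters on fast links, wrap happens often enough that 64-bit counters (and a sane polling interval) are the safer default.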

Latency

Latency is the time it takes for a packet to travel from source to destination and back (round-trip time). It directly affects user experience for every network-dependent application. High latency makes applications feel sluggish, causes video conferencing audio delays, disrupts VoIP call quality, and slows file transfers.

Baseline your latency measurements so you know what normal looks like for your environment:

| Path | Normal Latency | Warning Threshold | Critical Threshold |
|---|---|---|---|
| LAN (same switch) | Under 1 ms | 2-5 ms | Above 10 ms |
| LAN (cross-campus) | 1-5 ms | 10-20 ms | Above 50 ms |
| WAN (regional) | 10-30 ms | 50-80 ms | Above 100 ms |
| WAN (continental) | 30-80 ms | 100-150 ms | Above 200 ms |
| Internet (general) | 20-80 ms | 100-200 ms | Above 300 ms |
| VoIP quality threshold | Under 150 ms | 150-250 ms | Above 250 ms |

Measure latency from multiple points in your network, not just from the monitoring server. A latency measurement from the server room to the internet tells you nothing about the experience of a user on the third floor whose traffic must traverse two access switches and a distribution layer before reaching the core. Deploy lightweight monitoring agents or use synthetic monitoring to measure latency from the user perspective.
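However the samples are collected (ping, synthetic probes, lightweight agents), the summarization step looks roughly like the sketch below. The thresholds come from the table above; the function name and the jitter definition (mean absolute difference of consecutive samples, a common approximation of RFC 3550-style interarrival jitter) are my choices, not any tool's API:

```python
# Sketch: summarize round-trip-time samples from a synthetic probe and
# classify the path against the baseline thresholds. Names illustrative.
import statistics

def classify_latency(rtt_ms_samples, warn_ms, crit_ms):
    """Return (avg, jitter, status) for a batch of RTT samples in ms."""
    avg = statistics.mean(rtt_ms_samples)
    # Jitter as mean absolute difference between consecutive samples -
    # a rough stand-in for RFC 3550 interarrival jitter.
    jitter = statistics.mean(
        abs(b - a) for a, b in zip(rtt_ms_samples, rtt_ms_samples[1:])
    )
    if avg >= crit_ms:
        status = "critical"
    elif avg >= warn_ms:
        status = "warning"
    else:
        status = "ok"
    return avg, jitter, status

# WAN (regional) path: warn at 50 ms, critical at 100 ms
avg, jitter, status = classify_latency([22, 25, 24, 70, 23], 50, 100)
```

Note that the single 70 ms outlier does not trip the warning on its own; averaging (or percentiles, in a fuller implementation) keeps one slow sample from paging anyone.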

Packet Loss

Packet loss occurs when network packets fail to reach their destination. On a healthy LAN, packet loss should be effectively zero - any measurable packet loss on a wired LAN indicates a hardware problem (failing cable, bad port, overloaded switch buffer) or a configuration error (duplex mismatch, MTU mismatch, spanning tree misconfiguration). On wireless networks, packet loss up to 1 to 2 percent may be acceptable depending on environmental conditions.

On WAN links, packet loss above 0.5 percent degrades TCP performance noticeably: TCP retransmits lost packets and backs off its sending rate, so sustained throughput falls sharply as loss increases - roughly in proportion to the inverse square root of the loss rate. VoIP and video conferencing are more sensitive - packet loss above 1 percent causes audible artifacts in voice calls and visible degradation in video.

When packet loss is detected, investigate the physical layer first. Swap cables, check interface error counters for CRC errors (indicating physical layer problems), and verify duplex settings on both ends of the link. Software and configuration problems are less common causes of packet loss than physical issues.

Uptime and Availability

Uptime monitoring is the most fundamental network check: is the device responding or not? Monitor uptime using ICMP ping (most common), SNMP availability checks, or TCP port checks to specific services. Track availability as a percentage over time - a device that experienced two 5-minute outages in a month has 99.98 percent availability, which sounds excellent but represents 10 minutes of downtime that may have affected hundreds of users.

For critical infrastructure, track both device availability and service availability separately. A firewall may be responding to ICMP ping (device is up) while its VPN service has crashed (service is down). Check each critical service running on each device, not just the device itself.
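A TCP connect check is the simplest way to test a service independently of device reachability. The sketch below uses only the standard library; the hostnames and ports in the comments are hypothetical examples, and a real deployment would pair this with ICMP checks (which need raw sockets or a monitoring agent):

```python
# Sketch: service-level availability check via TCP connect.
# Host/port values in the comments are illustrative, not real devices.
import socket

def tcp_service_up(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A firewall may answer ping while a service on it has crashed, so
# check each critical service, not just the device, e.g.:
#   tcp_service_up("fw1.example.net", 443)  # management UI (hypothetical)
#   tcp_service_up("fw1.example.net", 22)   # SSH (hypothetical)
```

Checks like this also catch half-failures such as a service that is listening but hanging, provided you keep the timeout short and alert on slow responses as well as refused ones.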

Device Health: CPU, Memory, and Temperature

Network devices have finite processing resources. A switch forwarding traffic at line rate in hardware uses minimal CPU, but enable features like access control lists, quality of service, NetFlow export, or SNMP polling at high frequency, and CPU utilization climbs. A router performing complex routing calculations, VPN encryption, or deep packet inspection can consume significant CPU and memory.

Monitor CPU utilization with a warning threshold at 70 percent sustained for 5 minutes and a critical threshold at 90 percent sustained for 5 minutes. Memory utilization thresholds should be set at 80 percent warning and 95 percent critical. High CPU on a network device causes delayed packet processing, increased latency, dropped SNMP responses (making your monitoring unreliable exactly when you need it most), and potential control plane instability that can affect routing convergence.

Temperature monitoring matters for devices in non-climate-controlled locations - wiring closets without dedicated cooling, outdoor enclosures, and warehouse environments. Most enterprise network devices operate safely up to 40-45 degrees Celsius ambient temperature. Set alerts at the manufacturer's recommended threshold, typically 5 degrees below the maximum operating temperature.

SNMP Fundamentals: How Network Monitoring Works Under the Hood

SNMP (Simple Network Management Protocol) is the standard protocol that monitoring tools use to collect data from network devices. Understanding SNMP at a conceptual level helps you configure monitoring effectively, troubleshoot collection failures, and make informed decisions about what to monitor.

The Manager-Agent Model

SNMP uses a manager-agent architecture. Your monitoring server acts as the SNMP manager, sending requests to SNMP agents running on your network devices. Every managed switch, router, firewall, access point, UPS, and many servers run an SNMP agent that responds to queries from the manager. The agent maintains a database of device information organized in a tree structure called a MIB (Management Information Base). Each piece of information - an interface traffic counter, a CPU utilization percentage, a temperature reading - has a unique address in the MIB tree called an OID (Object Identifier).

SNMP Operations

Three SNMP operations handle most monitoring tasks. GET retrieves a specific value from the agent, such as the current inbound traffic counter on interface GigabitEthernet0/1. WALK retrieves an entire subtree of values, such as the traffic counters for all interfaces on the device - under the hood it is a sequence of GETNEXT requests (or the more efficient GETBULK, in v2c and later), and it is how monitoring tools discover what interfaces exist and their current status. TRAP is an unsolicited message from the agent to the manager, triggered when a predefined condition occurs - a link goes down, a fan fails, a threshold is exceeded. Traps provide immediate notification without waiting for the next polling cycle.
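The GET/WALK distinction is easiest to see against a concrete slice of the MIB tree. The sketch below is a toy model - a dictionary stands in for the agent, and no packets are sent; a real manager would use an SNMP library over UDP port 161. The OIDs shown are the standard ifDescr (1.3.6.1.2.1.2.2.1.2) and ifInOctets (1.3.6.1.2.1.2.2.1.10) columns; the counter values are made up:

```python
# Toy model of the manager-agent exchange: a dict stands in for the
# agent's MIB, keyed by OID string. This illustrates GET vs WALK
# semantics only - no actual SNMP packets are exchanged.
MIB = {
    "1.3.6.1.2.1.2.2.1.2.1": "GigabitEthernet0/1",  # ifDescr.1
    "1.3.6.1.2.1.2.2.1.2.2": "GigabitEthernet0/2",  # ifDescr.2
    "1.3.6.1.2.1.2.2.1.10.1": 8_412_003,            # ifInOctets.1 (made up)
    "1.3.6.1.2.1.2.2.1.10.2": 112_660,              # ifInOctets.2 (made up)
}

def snmp_get(oid):
    """GET: fetch one exact OID."""
    return MIB.get(oid)

def snmp_walk(prefix):
    """WALK: enumerate every OID under a subtree - in the real protocol
    this is a chain of GETNEXT/GETBULK requests. (String sort is fine
    for this toy; real OIDs order numerically, component by component.)"""
    return {oid: v for oid, v in sorted(MIB.items())
            if oid.startswith(prefix + ".")}

names = snmp_walk("1.3.6.1.2.1.2.2.1.2")  # every interface description
```

Walking ifDescr once tells the tool which interfaces exist; subsequent polling cycles then GET the specific counters it cares about.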

SNMP Versions and Security

SNMPv1 and v2c use community strings (essentially passwords) transmitted in plaintext. Anyone who can capture SNMP traffic on your network can read the community string and use it to query or (worse) modify device configurations. Despite this, many organizations still use SNMPv2c because it is simpler to configure. SNMPv3 adds authentication (verifying the identity of the manager) and encryption (protecting the data in transit). Use SNMPv3 for all production monitoring. Configure unique credentials per device or device group, not a single shared credential across your entire infrastructure.

If you are still using the default SNMP community string "public" on any device in your network, change it immediately. Attackers and automated scanning tools try default community strings as a standard reconnaissance technique. A device responding to the "public" community string reveals its complete configuration, firmware version, interface list, and traffic patterns to anyone on the network.

Network Monitoring Tools: Detailed Comparison

The monitoring tool landscape in 2026 offers options for every budget and team size. Here is a detailed comparison of the tools most IT teams should evaluate.

PRTG Network Monitor

PRTG from Paessler is a Windows-based monitoring platform known for its polished interface and rapid deployment. PRTG uses a sensor-based model where each monitored data point (an interface, a ping check, a CPU metric) counts as one sensor. The free tier includes 100 sensors, which covers a small network of 10-15 devices. Paid tiers scale to thousands of sensors. PRTG excels at auto-discovery - point it at a network range, and it will discover devices, identify their types, and create appropriate sensors automatically. The web interface, desktop application, and mobile apps provide consistent visibility across platforms. Best for teams that want comprehensive monitoring with minimal configuration effort.

Nagios Core and Nagios XI

Nagios is the grandfather of open-source monitoring, first released in 1999 and still actively maintained. Nagios Core is free and open-source, providing a flexible monitoring framework that can monitor virtually anything through its plugin architecture. The trade-off is complexity - Nagios Core is configured through text files, has a dated web interface, and requires significant Linux system administration knowledge to deploy and maintain. Nagios XI is the commercial version with a modern web interface, configuration wizards, and reporting. The Nagios community has produced thousands of plugins for monitoring specific devices, applications, and services. Best for teams with strong Linux skills who want maximum flexibility and customization.

Zabbix

Zabbix is a fully open-source monitoring platform (no paid tiers, no feature restrictions) that competes directly with commercial tools in capability. Zabbix supports SNMP, agent-based monitoring, IPMI, JMX, and custom checks. Its template system allows you to import pre-built monitoring configurations for thousands of device types - import a Cisco switch template, assign it to your switches, and monitoring is configured. Zabbix handles large-scale deployments with proxy servers that distribute the monitoring load. The web interface is functional if not beautiful. Zabbix requires a Linux server and a database (PostgreSQL or MySQL). Best for teams that want enterprise-grade monitoring without licensing costs and have the Linux expertise to deploy and maintain it.

Datadog

Datadog is a cloud-native SaaS monitoring platform that provides unified visibility across infrastructure, applications, logs, and network traffic. Its network monitoring capabilities include SNMP device monitoring, NetFlow analysis, and network performance monitoring (NPM) that maps traffic flows between services. Datadog shines in hybrid and multi-cloud environments where traditional on-premises monitoring tools struggle. The interface is modern, the analytics are powerful, and the integration library covers over 750 technologies. The downside is cost - Datadog charges per host per month, and costs accumulate quickly as you add modules. Best for cloud-heavy organizations that want unified observability and have the budget for a premium SaaS platform.

LibreNMS

LibreNMS is a community-driven, open-source monitoring platform forked from Observium. It provides automatic discovery, SNMP-based monitoring, alerting, and a clean web interface. LibreNMS stands out for its ease of setup compared to Nagios and Zabbix - a Docker deployment can be running in under 30 minutes. It supports over 1,800 device types out of the box, with community-contributed support for new devices added regularly. The alerting system supports email, Slack, PagerDuty, and dozens of other notification channels. Best for small to mid-sized teams that want open-source monitoring with a gentler learning curve than Nagios or Zabbix.

| Tool | Cost | Setup Effort | Scale | Best For |
|---|---|---|---|---|
| PRTG | Free (100 sensors), then paid | Low - auto-discovery | 1-5,000 devices | Quick deployment, polished UI |
| Nagios Core | Free (open-source) | High - text config | Unlimited | Maximum customization |
| Zabbix | Free (open-source) | Medium - templates help | Unlimited | Enterprise-scale, zero cost |
| Datadog | $15-23/host/month | Low - SaaS | Unlimited | Cloud-native, unified observability |
| LibreNMS | Free (open-source) | Low-Medium | 1-10,000 devices | Easy open-source, great SNMP |

Alert Threshold Configuration: Reducing Noise, Catching Real Problems

Poorly configured alerts are worse than no alerts. An inbox flooded with hundreds of non-actionable notifications trains your team to ignore alerts entirely, which means they also ignore the critical alert buried in the noise at 3 AM on a Saturday. Alert configuration is where monitoring either delivers value or becomes background noise.

The Two-Tier Alert Model

Implement two severity levels for every monitored metric. Warning alerts indicate a condition that needs attention during business hours but does not require an immediate response. Critical alerts indicate a condition that is currently affecting users or will affect them imminently and requires immediate action regardless of time of day.

For bandwidth utilization, a warning fires at 70 percent sustained for 15 minutes and a critical fires at 90 percent sustained for 5 minutes. For latency, a warning fires when round-trip time exceeds 100 ms sustained for 5 minutes and a critical fires above 250 ms sustained for 2 minutes. For packet loss, a warning fires above 0.5 percent sustained for 5 minutes and a critical fires above 2 percent sustained for 2 minutes. For device availability, any ping failure on a critical device is an immediate critical alert - there is no warning level for a device that is down.
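The "sustained for N minutes" qualifier is what separates these rules from naive threshold checks: a single sample crossing the line is noise, while every sample in the window crossing it is a trend. A minimal sketch of that evaluation, assuming one sample per minute and using the bandwidth numbers just listed (names are illustrative):

```python
# Sketch of "sustained for N minutes" alert evaluation, assuming
# 1-minute polling. Thresholds follow the bandwidth rules in the text:
# warning = 70% sustained 15 min, critical = 90% sustained 5 min.

def sustained_breach(samples, threshold, window):
    """True if the last `window` samples all meet or exceed threshold."""
    recent = samples[-window:]
    return len(recent) == window and all(s >= threshold for s in recent)

def bandwidth_severity(util_samples):
    if sustained_breach(util_samples, 90, 5):
        return "critical"
    if sustained_breach(util_samples, 70, 15):
        return "warning"
    return "ok"

# A single 95% spike does not page anyone:
bandwidth_severity([40] * 14 + [95])   # → "ok"
bandwidth_severity([75] * 15)          # → "warning"
bandwidth_severity([75] * 10 + [92] * 5)  # → "critical"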

Alert Deduplication and Dependency

When a core switch fails, every device behind it becomes unreachable. Without alert deduplication, you receive separate alerts for every device, every interface, and every service that depends on the failed switch - potentially hundreds of alerts for a single root cause. Configure your monitoring tool to suppress downstream alerts when a parent device fails. This is called dependency mapping or parent-child alerting. Define the physical and logical topology in your monitoring tool so it understands that if the core switch is down, alerts for access switches connected to that core should be suppressed until the core switch recovers.
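The suppression logic itself is simple once the topology is expressed as a child-to-parent map. A sketch with a hypothetical three-device topology (all device names invented):

```python
# Sketch of parent-child alert suppression: given the topology as a
# child -> parent map, alert only on down devices whose upstream path
# is otherwise intact. Device names are invented for illustration.

def suppress_downstream(down, parent_of):
    """Return the subset of down devices that should actually alert."""
    def upstream_is_down(dev):
        p = parent_of.get(dev)
        while p is not None:
            if p in down:
                return True
            p = parent_of.get(p)
        return False
    return {d for d in down if not upstream_is_down(d)}

parent_of = {
    "access-sw-1": "core-sw",
    "access-sw-2": "core-sw",
    "ap-3f-east": "access-sw-1",
}

# Core switch fails: one alert for the root cause, not three
alerts = suppress_downstream({"core-sw", "access-sw-1", "ap-3f-east"},
                             parent_of)
# → {"core-sw"}
```

Real tools walk the same idea with richer topology models (redundant uplinks, logical dependencies on services as well as devices), but the principle is identical: alert at the highest failed point in the dependency chain.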

Maintenance Window Suppression

Schedule maintenance windows in your monitoring tool before performing planned changes. A firmware upgrade on a switch requires a reboot - without a maintenance window, the reboot triggers a device-down alert, which pages the on-call engineer, who investigates, determines it is the planned maintenance, and goes back to sleep irritated. Maintenance windows suppress alerts for specified devices during specified time periods, eliminating this unnecessary noise.

Alert Fatigue Metrics

Track the volume of alerts generated per day, the percentage of alerts that required human action, and the average time to acknowledge alerts. If more than 30 percent of alerts are non-actionable, your thresholds need tuning. If acknowledgment times are increasing, your team is experiencing alert fatigue. Review alert tuning monthly for the first quarter of a new monitoring deployment, then quarterly.

Dashboard Design: Seeing What Matters

A monitoring dashboard should answer one question immediately: is anything wrong right now? If the answer requires scrolling through multiple screens, clicking into subpages, or interpreting ambiguous visualizations, the dashboard is not serving its purpose.

The NOC Dashboard

The primary network operations dashboard should display on a wall-mounted screen or be the default view for your IT team. It should contain a topology map with device status (green for up, red for down, yellow for degraded), a current alerts panel showing all active warnings and critical alerts sorted by severity, a top-N panel showing the top 5 most utilized links and the top 5 highest-CPU devices, and an uptime summary showing the number of devices in each state (up, down, warning, maintenance).

Use color deliberately. Green means normal. Yellow means warning - needs attention but not urgent. Red means critical - requires immediate action. Grey means the device is in a maintenance window or is unmonitored. Do not use other colors that create ambiguity. Every visual element should be interpretable at a glance from across the room.

Capacity Planning Dashboard

A separate dashboard for capacity planning shows historical trends over 30, 90, and 365-day windows. Display bandwidth utilization trends for WAN links (are they growing toward capacity?), device CPU and memory trends (are any devices approaching their processing limits?), wireless client density trends (are access points becoming overloaded?), and port utilization on switches (are you running out of physical ports?). This dashboard is reviewed weekly or monthly, not in real-time. Its purpose is to inform infrastructure investment decisions - when to upgrade a WAN link, add switch capacity, or deploy additional access points.

Escalation Workflows: Ensuring the Right Response

An alert without a defined response is a notification, not a workflow. Every alert severity should map to a specific escalation path that defines who is notified, through what channel, and what action they are expected to take.

Escalation Tier Structure

  1. Tier 1 - Automated response (0-5 minutes): For predefined conditions, trigger automated remediation before human notification. Examples: restart a failed service, clear a full disk by rotating old logs, bounce a port that shows error-disabled status. If automated remediation resolves the issue, log it and close the alert. If it fails, escalate to Tier 2.
  2. Tier 2 - On-call network engineer (5-15 minutes): The on-call engineer receives the alert via SMS and push notification, acknowledges within 10 minutes, and begins investigation. For critical alerts, the engineer has authority to take immediate corrective action including device reboots, failover activation, and traffic rerouting.
  3. Tier 3 - Senior engineer / team lead (15-30 minutes): If the on-call engineer cannot resolve the issue within 15 minutes or if the issue affects multiple systems, escalate to the senior engineer or team lead. This tier coordinates with other teams (server, application, security) when the issue spans multiple domains.
  4. Tier 4 - Management (30-60 minutes): For outages affecting business operations, escalate to IT management for coordination with business stakeholders. This tier manages communication to the organization, coordinates with vendors for hardware failures, and makes decisions about invoking disaster recovery procedures.
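The tier timeline above reduces to a small lookup once you track how long an alert has gone unacknowledged. A sketch with the boundaries from the list (tier names are illustrative labels, not any tool's vocabulary):

```python
# Sketch: map minutes-unacknowledged for a critical alert to the
# escalation tier from the workflow above. Labels are illustrative.

TIERS = [
    (5, "tier1-automation"),        # 0-5 min: automated remediation
    (15, "tier2-oncall-engineer"),  # 5-15 min: on-call engineer
    (30, "tier3-senior-engineer"),  # 15-30 min: senior / team lead
    (60, "tier4-management"),       # 30-60 min: IT management
]

def active_tier(minutes_unacknowledged):
    for limit, tier in TIERS:
        if minutes_unacknowledged < limit:
            return tier
    return "tier4-management"  # beyond 60 min stays with management

active_tier(3)    # → "tier1-automation"
active_tier(12)   # → "tier2-oncall-engineer"
active_tier(20)   # → "tier3-senior-engineer"
```

Dedicated on-call tools implement exactly this escalation clock, plus acknowledgment tracking and schedule-aware routing, which is why most teams buy rather than build it.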

On-Call Rotation

Establish a fair on-call rotation that distributes after-hours responsibility across the team. A typical rotation is one week on-call per engineer, rotating weekly. Provide an on-call stipend or compensatory time to acknowledge the burden of after-hours availability. Document on-call expectations explicitly: maximum acknowledgment time (10 minutes for critical alerts), required access (VPN, monitoring tool access, credentials for critical devices), and escalation authority (what actions the on-call engineer can take without approval).

Use a tool like PagerDuty, Opsgenie, or Splunk On-Call (formerly VictorOps) to manage on-call schedules, alert routing, and escalation tracking. These tools provide automatic escalation if the primary on-call does not acknowledge within the defined window, ensuring that no critical alert goes unaddressed because someone is asleep or unreachable.
