Server Monitoring Metrics That Actually Matter

Most server monitoring dashboards show too much. Here are the metrics that reliably predict trouble before it reaches users.

More metrics is not better signal

It is tempting to over-instrument a fresh server monitoring setup: dozens of graphs, dozens of alert thresholds. Within a week the team is ignoring everything because every alert feels like noise. The goal is not comprehensive coverage; it is early, actionable signal.

These are the metrics that actually predict incidents.

CPU: trend matters more than peaks

Transient CPU spikes are normal. A web server handling a burst of requests, a cron job running nightly indexing, a deployment triggering a cache warm — these cause brief spikes that resolve on their own.

What matters is sustained high CPU: a thread stuck in a tight loop, a runaway process, or gradual degradation as load grows. Alert on averages over 5 or 10 minutes, not instantaneous peaks. Sustained 85% is a real problem. A 30-second spike to 100% usually is not.
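As a minimal sketch of alerting on a windowed average rather than on peaks: the class below keeps a fixed-size window of CPU samples and reports a breach only once the whole window averages above the threshold. The class name, the 10-second sampling interval, and the 30-sample window are assumptions for illustration, not part of any particular monitoring tool.

```python
from collections import deque


class SustainedCpuAlert:
    """Fires only when the rolling average of CPU samples reaches a threshold."""

    def __init__(self, threshold_pct=85.0, window_samples=30):
        # Assumption: one sample every 10 seconds, so 30 samples is about 5 minutes.
        self.threshold_pct = threshold_pct
        self.samples = deque(maxlen=window_samples)

    def observe(self, cpu_pct):
        """Record one sample; return True if the full-window average breaches."""
        self.samples.append(cpu_pct)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet to call anything sustained
        average = sum(self.samples) / len(self.samples)
        return average >= self.threshold_pct
```

With this shape, a 30-second spike to 100% inside an otherwise quiet window barely moves the average, while five minutes at 90% trips the alert.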

Memory: watch the trend, not the number

A server at 90% memory utilization is not necessarily in trouble. Many applications — databases, JVMs, caches — intentionally use all available memory as a performance strategy. The useful signals are:

  • Memory that is consistently growing over hours or days (a leak)
  • Swap utilization increasing (the OS is running short of RAM and has started paging memory out to much slower disk)
  • OOM kills visible in system logs

Set a gradual-growth alert, not just a high-watermark threshold.
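One way to express a gradual-growth alert is to fit a straight line to recent memory samples and alert on the slope instead of the level. The sketch below does that with a plain least-squares fit. The function names, the (hours, used MB) sample format, and the 10 MB per hour cutoff are illustrative assumptions; a sensible cutoff depends on the workload.

```python
def growth_rate_per_hour(samples):
    """Least-squares slope of (hours, used_mb) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0  # all samples share one timestamp; no trend to measure
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def is_leaking(samples, min_mb_per_hour=10.0):
    """Hypothetical rule: flag steady growth at or above the given rate."""
    return growth_rate_per_hour(samples) >= min_mb_per_hour
```

A process sitting flat at 90% of memory produces a slope near zero and stays quiet, while one that climbs steadily over a day is flagged long before it reaches a high-watermark threshold.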

Disk: the quiet killer

Disk fills silently. Unlike CPU or memory pressure, a nearly full disk rarely slows anything down until the moment it is 100% full and the app starts throwing errors it was never designed to handle. Databases can corrupt data. Log rotation fails. Write queues stall.

Alert at 80%, while there is still enough runway to respond. Alert again at 90%. The cost of a false positive here is close to zero; the cost of missing it is high.
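The two-tier threshold can be sketched with the standard library's shutil.disk_usage. The function names and the level labels are assumptions chosen for this example. The used fraction here is computed against total capacity, so it may differ by a few points from what df reports on filesystems with reserved blocks.

```python
import shutil


def disk_alert_level(used_fraction, warn_at=0.80, critical_at=0.90):
    """Map a used-space fraction (0.0 to 1.0) to an alert level."""
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def check_path(path="/"):
    """Check the filesystem that holds the given path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return disk_alert_level(usage.used / usage.total)
```

Running check_path once per mounted filesystem, rather than only for the root, avoids missing a separate log or data volume.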

Load average: the composite health signal

CPU percentage can be misleading on multi-core systems. Load average, the average number of processes that are running or waiting to run (on Linux it also counts processes blocked in uninterruptible I/O), gives a better picture of whether the system is saturated. A 4-core system with a 1-minute load average of 8 has roughly twice as much work queued as its cores can run at once.

Load average is one of the fastest ways to tell if something unexpected is consuming resources, even before you know what it is.
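The comparison against core count is simple arithmetic, sketched below. The function names and the limit of 1.0 per core are assumptions for illustration. os.getloadavg is available on Unix-like systems only and raises OSError if the load average cannot be obtained.

```python
import os


def load_per_core(load_avg, cores):
    """Normalize a load average by the number of CPU cores."""
    return load_avg / max(cores, 1)


def is_saturated(load_avg, cores, limit=1.0):
    """Hypothetical rule: more than `limit` queued tasks per core."""
    return load_per_core(load_avg, cores) > limit


def current_load_per_core():
    """Read the 1-minute load average and normalize it (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return load_per_core(one_minute, cores)
```

For the 4-core example, load_per_core(8, 4) gives 2.0, which is the "twice as much work as the cores can handle" reading.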

Process monitoring: keep the important ones alive

Many services depend on a set of processes that must be running: a database, a message broker, a background worker, an nginx process. Any of these can crash without triggering an HTTP alert, either because the process manager keeps restarting it or because the affected path isn't monitored.

Lightweight agent-based monitoring can watch named processes and alert immediately when they disappear, without waiting for an uptime check to detect the downstream effect.
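A rough sketch of what such a check might do on Linux: list the command names of running processes from /proc and compare them with a required set. This is not how any specific agent works; the function names and the example process names are assumptions, and the /proc layout is Linux-specific.

```python
from pathlib import Path


def running_process_names(proc_root="/proc"):
    """Collect command names from /proc/<pid>/comm (Linux only).

    Note: the kernel truncates comm to 15 characters, so long
    process names must be matched in their truncated form.
    """
    names = set()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            names.add((entry / "comm").read_text().strip())
        except OSError:
            continue  # process exited between listing and reading
    return names


def missing_processes(required, running):
    """Return the required process names that are not currently running."""
    return sorted(set(required) - set(running))
```

A typical call would be missing_processes(["nginx", "postgres"], running_process_names()), alerting whenever the result is non-empty. Checking presence alone does not catch a process stuck in a restart loop, so tracking how often a PID changes is a reasonable next step.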

Start narrow

Pick one server, set four thresholds — CPU trend, memory growth, disk fill, and load average — and let them run for a week before adding anything else. You will learn what normal looks like before you decide what to alert on. A baseline you trust is worth more than a dashboard you do not.