
Cron Job Monitoring: Why Silent Failures Are Worse Than Crashes

A crashed web server triggers an alert. A cron job that silently stopped running three weeks ago does not. That asymmetry is worth fixing.

The problem with scheduled tasks

Uptime monitoring, server agents, and log alerting all work by detecting something that happens: a request fails, a process crashes, a threshold is crossed. Cron job failures often work in reverse: the thing that was supposed to happen did not.

Nothing fires an alert when nothing happens. That is the gap.

What can go wrong silently

Scheduled tasks fail silently in more ways than most teams track:

  • The process exits with code 0 but skips work due to a logic error
  • A database lock causes the task to complete instantly without processing any rows
  • The cron expression is wrong and the task never fires after a server migration
  • A dependency (S3 bucket, external API, database table) changed, and the task now fails quietly because the resulting exception is caught and swallowed
  • A full disk causes the task to exit with an error, but a wrapper script catches the exit code and exits 0

None of these take a server down. None appear in uptime checks. All of them leave data stale, reports wrong, emails unsent, or queues growing without bound.
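The wrapper-script case is easy to reproduce. The sketch below is illustrative only: both function names are hypothetical, and it assumes bash. The first wrapper logs a failure and still reports success, which is the silent-failure pattern described above. The second passes the job's exit status through unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers illustrating a swallowed exit code.

# Anti-pattern: the failure is logged, but the wrapper always exits 0,
# so cron and anything watching the exit status see a success.
run_swallowing() {
  "$@" || echo "job failed (ignored)" >&2
  return 0
}

# Safer: log the failure and return the job's own exit status.
run_propagating() {
  "$@"
  local status=$?
  if [ "$status" -ne 0 ]; then
    echo "job failed with status $status" >&2
  fi
  return "$status"
}
```

Calling `run_swallowing false` returns 0, while `run_propagating false` returns 1.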

Dead man's switch monitoring

The pattern for detecting missed scheduled runs is a dead man's switch (also called a heartbeat or check-in monitor). The scheduled job itself calls a unique monitoring URL at the end of each successful run. The monitoring platform expects that ping within a defined window. If the ping does not arrive, an alert fires.

This inverts the detection model. Instead of detecting a crash, you detect absence.

Setup is minimal:

# At the end of your cron job script
curl -fsS https://your-monitor-url/check-in > /dev/null

If the job runs successfully, the ping arrives. If the job does not run, runs too slowly, or crashes before the ping, the monitor fires.
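To make "ping only on success" explicit, the job can be run through a small wrapper. This is a minimal sketch, assuming bash and curl. `run_with_checkin`, `ping_monitor`, and the `CHECKIN_URL` variable are hypothetical names, and the URL is the same placeholder used above. The timeout and retry values are arbitrary starting points.

```shell
#!/usr/bin/env bash
# Sketch: run a job and check in only if it exited 0.
# CHECKIN_URL is a placeholder; substitute your monitor's check-in URL.
CHECKIN_URL="${CHECKIN_URL:-https://your-monitor-url/check-in}"

ping_monitor() {
  # -f: treat HTTP errors as failures; -sS: quiet, but still print errors.
  # -m 10: cap the request at 10 seconds; --retry 3: retry transient errors.
  curl -fsS -m 10 --retry 3 -o /dev/null "$CHECKIN_URL"
}

run_with_checkin() {
  "$@"
  local status=$?
  if [ "$status" -eq 0 ]; then
    # A failed ping is logged but does not change the job's exit status.
    ping_monitor || echo "check-in ping failed" >&2
  fi
  return "$status"
}
```

A job would then be started as `run_with_checkin /path/to/job.sh` (path hypothetical). If the job exits non-zero, no ping is sent, and the monitor alerts once the window passes.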

Choosing the right window

The check-in window should be slightly larger than the expected job duration plus scheduling variance. A job that normally takes 10 minutes and runs hourly should have a window of 15–20 minutes, not 60 minutes. A 60-minute window only detects a completely missed run; a tighter window catches a job that started but got stuck.

Start with a tight window. It is easier to widen a window that alerts too often than to explain a missed alert after a production incident.
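The arithmetic can be sketched in a few lines. This assumes the window is measured from the scheduled start time, which is an interpretation rather than something every monitoring platform guarantees. The function names are hypothetical, and times are Unix epoch seconds.

```shell
#!/usr/bin/env bash
# Sketch: derive a check-in window and test whether a run is overdue.

# window_minutes EXPECTED_MINUTES VARIANCE_MINUTES
# Prints the expected duration plus scheduling variance.
window_minutes() {
  local expected_minutes=$1 variance_minutes=$2
  echo $(( expected_minutes + variance_minutes ))
}

# is_overdue SCHEDULED_START_EPOCH NOW_EPOCH WINDOW_MINUTES
# Succeeds (returns 0) if more than WINDOW_MINUTES have passed since the
# scheduled start, assuming no check-in has been received in that time.
is_overdue() {
  local start=$1 now=$2 window_min=$3
  [ $(( now - start )) -gt $(( window_min * 60 )) ]
}
```

For the hourly job above, `window_minutes 10 5` prints 15. A job still silent 20 minutes after its scheduled start is overdue under that window, while it would not be flagged for another 40 minutes under a 60-minute one.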

What to monitor

Any scheduled task whose failure has user-visible or data-integrity consequences is a candidate:

  • Payment reconciliation jobs
  • Email sending queues
  • Database backup scripts
  • Report generation
  • Data import/export pipelines
  • Cache warming jobs

Low-stakes cleanup tasks probably do not need alerting. Anything that directly affects what users see or what your business depends on does.

Combine with output checks

A check-in ping confirms that the job ran to completion. It does not confirm that the job did anything useful. For critical jobs, add a secondary check that verifies the downstream result (a file was written, a record was updated, a count changed) to confirm that the job's work actually landed.
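For the "a file was written" case, one possible output check is that the file exists, is non-empty, and was modified recently. The sketch below assumes bash and a `find` that supports `-mmin`, which GNU and BSD find both provide. `output_is_fresh` is a hypothetical name, and the age threshold depends on the job.

```shell
#!/usr/bin/env bash
# Sketch: verify that a job's output file exists, is non-empty,
# and was modified within the last MAX_AGE_MINUTES.

# output_is_fresh FILE MAX_AGE_MINUTES
output_is_fresh() {
  local file=$1 max_age_minutes=$2
  # -s is true only if the file exists and has a size greater than zero.
  [ -s "$file" ] || return 1
  # find prints the path only if the file was modified less than
  # max_age_minutes ago; an empty result means the output is stale.
  [ -n "$(find "$file" -type f -mmin "-$max_age_minutes")" ]
}
```

One way to combine the two checks is to send the check-in ping only when `output_is_fresh` succeeds, so an empty or stale result is reported as a missed check-in.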