by siteRabbit Team

Here's a scenario that plays out quietly in engineering teams everywhere:

A nightly backup job that's been running fine for 18 months encounters a disk space issue, fails silently, and exits with error code 1. Nobody gets an email. Nothing appears in your uptime dashboard. The job just... stops running.

Three weeks later, you need to restore from backup and discover there is nothing recent to restore from. The last successful backup is three weeks old.

This is the cron job monitoring problem, and it's surprisingly hard to solve with traditional uptime tools.

Why traditional uptime monitoring misses cron failures

Standard uptime monitoring works by making requests to your site and checking the response. That works great for web services. But cron jobs don't have a URL. They're not servers — they're processes that run, do something, and exit.

If a cron job fails, nothing changes from the outside. Your HTTP monitor sees a healthy website. Your TCP monitor sees open ports. Your DNS monitor sees valid records. But your backup job hasn't run in a month.

This is the fundamental mismatch: traditional monitoring is designed for services that should always be running, not tasks that should run occasionally.

The heartbeat approach

The solution is to invert the problem. Instead of your monitoring tool checking whether your job is running, your job tells the monitoring tool that it ran.

Here's how it works:

  1. You create a heartbeat monitor in your monitoring tool. It gets a unique URL like https://ping.siterabbit.app/hb/abc123xyz.
  2. You add a single line to the end of your cron script: a ping to that URL.
  3. Your monitoring tool expects to see a ping on a configured schedule. If a ping doesn't arrive by the expected time (plus a grace period), it fires an alert.

Your script goes from:

#!/bin/bash
mysqldump -u root mydb > /backups/db_$(date +%Y%m%d).sql

To:

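#!/bin/bash
set -euo pipefail  # abort at the first failing command so the ping below is skipped
mysqldump -u root mydb > /backups/db_$(date +%Y%m%d).sql
curl -fsS https://ping.siterabbit.app/hb/abc123xyz

Two added lines. With set -e, bash stops as soon as mysqldump fails, so the ping only fires if the job reaches the end successfully. If the job exits early, throws an error, or gets killed, the ping doesn't fire and you get alerted. Without set -e (or chaining the commands with &&), bash would carry on after a failed mysqldump and send the ping anyway.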
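The "ping only on success" rule can also live in a small wrapper rather than in the job script itself. The following is a minimal sketch in Python, not siteRabbit tooling: run_and_ping and send_ping are hypothetical helper names, and the URL is the example one used in this post.

```python
import subprocess
import sys
import urllib.request

PING_URL = "https://ping.siterabbit.app/hb/abc123xyz"  # example URL from this post


def send_ping(url: str = PING_URL) -> None:
    """Send the heartbeat; the timeout keeps a slow network from hanging the job."""
    urllib.request.urlopen(url, timeout=10)


def run_and_ping(command, ping=send_ping) -> int:
    """Run `command` and call `ping` only if it exits with status 0.

    Returns the command's exit status so the caller can propagate it.
    """
    result = subprocess.run(command)
    if result.returncode == 0:
        ping()
    return result.returncode
```

A crontab entry could then call a script that does sys.exit(run_and_ping(["/path/to/backup.sh"])), where the path is a placeholder. The job script needs no changes, and a non-zero exit status still reaches cron.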

What to configure

When setting up a heartbeat monitor, you typically configure:

Period: How often should the job run? Every hour, every day, every week?

Grace period: How late can the ping be before alerting? A job scheduled to run at midnight might not start until 12:02 due to system load, and because the ping is sent when the job finishes, the grace period also has to absorb normal variation in run time. For a short job, a grace period of 5 minutes prevents false alarms.

Alert after N misses: Do you want to alert on the first missed heartbeat, or after two or three consecutive misses? For critical jobs, alert immediately. For jobs that are occasionally delayed, give them a couple of chances.
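To make the interaction between these three settings concrete, here is a rough sketch of the rule a heartbeat monitor might apply. It is an assumption for illustration, not a description of how siteRabbit counts misses internally; missed_periods and should_alert are hypothetical names.

```python
from datetime import datetime, timedelta


def missed_periods(last_ping: datetime, now: datetime,
                   period: timedelta, grace: timedelta) -> int:
    """Count how many expected pings are overdue at `now`.

    The next ping is due at last_ping + period and counts as missed once
    the grace window has also elapsed; each further full period adds one.
    """
    overdue = (now - last_ping) - grace
    return max(0, overdue // period)  # timedelta // timedelta yields an int


def should_alert(last_ping: datetime, now: datetime, period: timedelta,
                 grace: timedelta, alert_after: int = 1) -> bool:
    """Alert once at least `alert_after` consecutive pings have been missed."""
    return missed_periods(last_ping, now, period, grace) >= alert_after
```

With a 24-hour period and a 30-minute grace window, a ping that is 29 minutes late counts as zero misses, while one that is 31 minutes late counts as the first miss.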

Patterns for different use cases

Nightly backups: Period 24h, grace 30 minutes, alert on first miss.

Hourly data sync: Period 1h, grace 10 minutes, alert after 2 consecutive misses (to avoid noise from occasional delays).

Weekly reports: Period 7d, grace 2 hours, alert on first miss.

Continuous queue worker: This is a process, not a cron job — it should be pinging every minute. Alert on first miss.
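For the continuous worker case, the ping has to come from inside the loop rather than from the end of a script. One possible shape, sketched under assumptions: PingThrottle and run_once are hypothetical names, and get_job, handle, and ping stand in for whatever your worker already uses.

```python
import time


class PingThrottle:
    """Tracks when the next heartbeat is due for a long-running worker."""

    def __init__(self, interval_seconds: float = 60.0, clock=time.monotonic):
        self.interval = interval_seconds
        self.clock = clock
        self.last_ping = None  # no ping sent yet

    def due(self) -> bool:
        """Return True (and restart the timer) when it is time to ping again."""
        now = self.clock()
        if self.last_ping is None or now - self.last_ping >= self.interval:
            self.last_ping = now
            return True
        return False


def run_once(get_job, handle, ping, throttle: PingThrottle) -> None:
    """One loop iteration: handle a job if there is one, then ping if due."""
    job = get_job()
    if job is not None:
        handle(job)  # if this raises, the ping below is skipped
    if throttle.due():
        ping()
```

A real worker would call run_once inside a while True loop, sleeping briefly when the queue is empty. Because the ping check runs even on idle iterations, a quiet queue does not look like a dead worker, while a crashed or hung worker stops pinging.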

What you can't catch with basic heartbeats

A basic heartbeat monitor tells you the job ran (at least far enough to send the ping). It doesn't tell you:

  • Whether the job's output was correct
  • Whether the job ran in an acceptable amount of time
  • Whether the backup it produced is actually valid

For those concerns, you need more sophisticated checks — application health monitors that query a database to verify the backup record exists, or custom assertions on your /health endpoint.
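A cheap middle ground is to sanity-check the job's output before sending the ping. The sketch below is illustrative only: backup_looks_valid is a hypothetical helper, and the size and age thresholds are arbitrary assumptions. It confirms that a backup file exists, is not suspiciously small, and was written recently. It does not prove the backup can be restored.

```python
import os
import time


def backup_looks_valid(path: str, min_bytes: int = 1024,
                       max_age_seconds: float = 3600.0) -> bool:
    """Cheap sanity check: the file exists, is not tiny, and is fresh."""
    try:
        info = os.stat(path)
    except OSError:
        return False  # missing or unreadable file
    is_big_enough = info.st_size >= min_bytes
    is_fresh = (time.time() - info.st_mtime) <= max_age_seconds
    return is_big_enough and is_fresh
```

Sending the heartbeat only when this returns True turns an empty or stale dump into a missed ping, so it raises the same alert as a job that never ran.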

But for most teams, knowing the job ran at all is a significant improvement over flying blind.

The practical impact

Teams that add heartbeat monitoring consistently report the same thing: they discover jobs that have been silently failing for weeks, sometimes months. Not because their infrastructure is unusually unreliable, but because these failures were simply invisible before.

Common finds:

  • Backup jobs that fail whenever disk space gets low (happens when backups aren't pruned)
  • Report generation jobs that fail when an API they depend on changes its response format
  • Data sync jobs that fail whenever a third-party rate limit is hit
  • Cleanup jobs that fail on the first of every month due to date formatting edge cases

All of these were failing silently. All of them would have been caught immediately with heartbeat monitoring.

Setting up heartbeat monitoring with siteRabbit

siteRabbit's heartbeat monitors work like this:

  1. Create a new monitor, select type Heartbeat
  2. Configure the expected period and grace window
  3. Copy the unique heartbeat URL
  4. Add curl -fsS YOUR_URL (or the equivalent for your language) to the end of your script

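The curl flags work as follows: -f makes curl exit with a non-zero status when the server returns an HTTP error, -s hides the progress meter, and -S still prints an error message if the request fails. As written, a failed ping therefore makes the script exit non-zero. That is usually not what you want (the job ran, that's what matters), so append || true to ignore ping failures, and consider adding --max-time 10 --retry 3 so a slow network can't hang the job. If you do want a failed ping to fail the script, leave the command as is; adding -o /dev/null discards the response body.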

Language examples:

# Python
import urllib.request
urllib.request.urlopen("https://ping.siterabbit.app/hb/abc123xyz", timeout=10)

// Node.js (18+, built-in fetch)
fetch("https://ping.siterabbit.app/hb/abc123xyz").catch(() => {})

// PHP
file_get_contents("https://ping.siterabbit.app/hb/abc123xyz");

# Ruby
require 'net/http'
Net::HTTP.get(URI("https://ping.siterabbit.app/hb/abc123xyz"))
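The Python one-liner raises an exception if the network is down, which would crash an otherwise successful job. A more defensive variant might look like the sketch below. The ping function, its retry count, and its backoff are assumptions chosen for illustration, not part of any siteRabbit client library.

```python
import time
import urllib.error
import urllib.request


def ping(url: str, attempts: int = 3, timeout: float = 10.0,
         opener=urllib.request.urlopen, sleep=time.sleep) -> bool:
    """Try to send the heartbeat; return True on success, False otherwise."""
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout):
                return True
        except (urllib.error.URLError, OSError):
            # HTTPError is a URLError subclass, so HTTP 4xx/5xx lands here too.
            if attempt < attempts - 1:
                sleep(2 ** attempt)  # back off 1s, 2s, ... between attempts
    return False
```

Returning a boolean instead of raising leaves the decision to the caller: log the failure and carry on, or treat an undelivered ping as a job failure.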

Each heartbeat monitor in siteRabbit shows a history of every ping received, the time between pings, and any missed heartbeats — so you can see both that the job is running and that it's running on schedule.
