Cron Job Monitoring with Heartbeats: a Practical Tutorial

June 19, 2026

Cron jobs that run successfully are easy. Cron jobs that fail loudly are also easy: the error email lands in your inbox. The hard ones are cron jobs that stop running altogether. The server reboots, the container gets evicted, the user account gets disabled, or a config change masks the systemd timer. The job never runs, no error is produced, and you find out three weeks later when the backups you assumed were happening are not there. Heartbeat monitoring catches exactly this failure mode.

How heartbeat monitoring works

A heartbeat monitor inverts the normal probe relationship. Instead of an external prober calling your service, your job calls the monitor. You tell the monitor how often to expect a ping and how much slack to allow (e.g. 'this job should ping every hour, give it five minutes of slack'). When a ping is overdue by more than that grace window, the monitor opens a downtime event and alerts you.

The result is a dead man's switch. As long as the job runs and reports success, the monitor stays silent. Once the pings stop (because the job crashed, because the server is off, because someone disabled the timer), the grace window runs out and the alert fires. Three weeks of silent backup failure becomes a five-minute incident.
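The monitor-side check can be sketched in a few lines. This is a minimal illustration of the idea, not any particular service's implementation; the function name `is_overdue` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def is_overdue(last_ping: datetime, now: datetime,
               interval: timedelta, grace: timedelta) -> bool:
    """Return True once the next expected ping is late by more than the grace window."""
    deadline = last_ping + interval + grace
    return now > deadline


# Hourly job with five minutes of slack, as in the example above.
last_ping = datetime(2026, 6, 19, 2, 0)
hourly = timedelta(hours=1)
slack = timedelta(minutes=5)

# One minute inside the deadline, then one minute past it.
on_time = is_overdue(last_ping, datetime(2026, 6, 19, 3, 4), hourly, slack)
missed = is_overdue(last_ping, datetime(2026, 6, 19, 3, 6), hourly, slack)
print(on_time, missed)
```

A real monitor would run a check like this on a timer and open a downtime event the first time it returns True.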

When to use heartbeats versus regular checks

Heartbeats are not a replacement for HTTP monitoring. They cover a different class of job: anything that runs out-of-band on a schedule.

Backup jobs: nightly dumps, S3 sync, off-site rsync.
Data pipeline runs: ETL jobs, daily report generation, billing reconciliation.
Marketing emails on a schedule: weekly newsletter, drip campaigns.
SSL renewal jobs: certbot, acme.sh.
Anything that runs in cron, systemd timers, GitHub Actions schedules, Cloudflare Workers cron, or AWS EventBridge.

How to add a heartbeat to a bash cron job

The simplest pattern is a single curl call at the end of your script. The heartbeat URL is unique to the monitor; do not check it into version control.

A pattern that handles failures gracefully: `do_the_work && curl -fsS --retry 3 https://your-monitor.example/heartbeat`. The && only fires the heartbeat if the actual work succeeded. The --retry 3 handles transient network blips. The -fsS makes curl quiet on success and noisy on failure, which is what cron wants.

Heartbeats from Python and Node

The same pattern in two common languages. Both wrap the work in a try block (try/except in Python, try/catch in Node) and report success only on a clean run.

Python (compressed onto one line here; in a real script each clause goes on its own line): `try: do_the_work(); requests.post('https://your-monitor.example/heartbeat', timeout=10); except Exception: logger.exception('job failed')`
Node: `try { await doTheWork(); await fetch('https://your-monitor.example/heartbeat', { method: 'POST' }); } catch (err) { console.error('job failed', err); }`
Important: do not send the heartbeat in a finally block. A finally block runs on both success and failure. You want the heartbeat to be a positive signal that the work succeeded, not an 'I ran' signal.
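Written out in full, the Python version might look like the sketch below. It assumes the standard library's `urllib.request` in place of `requests` so that it has no dependencies; `send_heartbeat`, `run_job`, and the `HEARTBEAT_URL` environment variable are hypothetical names chosen for illustration.

```python
import logging
import urllib.request

logger = logging.getLogger("cron_job")


def send_heartbeat(url: str, timeout: float = 10.0) -> None:
    """POST an empty body to the heartbeat URL; raises on network or HTTP errors."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def run_job(work, ping) -> bool:
    """Run work(), then call ping() only if work() returned without raising."""
    try:
        work()
    except Exception:
        logger.exception("job failed")
        return False  # no heartbeat sent: the monitor's silence is the alert
    try:
        ping()
    except Exception:
        # The work itself succeeded; only the report failed.
        logger.exception("job succeeded but heartbeat failed")
    return True


# Example wiring (HEARTBEAT_URL is an assumed variable name, read from the
# environment so the real URL stays out of version control):
#   import os
#   url = os.environ["HEARTBEAT_URL"]
#   run_job(do_the_work, lambda: send_heartbeat(url))
```

Passing the ping in as a callable keeps the heartbeat strictly after the work and outside any finally block, and it makes the success-only behavior easy to test with a fake.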

Setting the grace window

The grace window is the most common knob to get wrong. Too short and you get false alarms when the job runs a few seconds late. Too long and a real failure goes unnoticed for hours.

A good rule of thumb is 2x the normal job duration plus 5 minutes. A nightly backup that normally takes 8 minutes wants a grace window of 8 * 2 + 5 = 21 minutes. A weekly job that takes 2 hours (120 minutes) wants a grace window of 120 * 2 + 5 = 245 minutes, or just over 4 hours. The grace window protects you from harmless variance, not from real failures.
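The rule of thumb is simple enough to express as a small helper. This is a sketch only; the function name is hypothetical, and everything is kept in minutes so the 5-minute constant is never mixed with hours.

```python
def grace_window_minutes(normal_duration_minutes: float) -> float:
    """Rule of thumb from the text: twice the normal duration plus five minutes."""
    return 2 * normal_duration_minutes + 5


nightly_backup = grace_window_minutes(8)    # job that normally takes 8 minutes
weekly_job = grace_window_minutes(2 * 60)   # 2-hour job, converted to minutes
print(nightly_backup, weekly_job)
```

For the 2-hour job this gives 245 minutes, a little over 4 hours.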

Common heartbeat patterns to set up first

If you are starting from zero, three heartbeats catch most of the silent failures that bite people. A nightly backup monitor with a 6-hour grace window. An SSL renewal monitor on the cron that runs certbot or acme.sh, with a one-week grace window. A daily report or billing reconciliation monitor with a 90-minute grace window. Together these three catch the failure modes that have the longest unnoticed runs and the biggest consequences. Once those are in place, add a heartbeat to every other scheduled job over time.
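If you keep monitor definitions in code, the three starter heartbeats could be recorded as plain data, as in the sketch below. The `HeartbeatMonitor` class and the monitor names are hypothetical and not tied to any monitoring service's API; only the grace windows come from the text above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class HeartbeatMonitor:
    """Hypothetical record for one monitor: a label and its grace window."""
    name: str
    grace: timedelta


STARTER_MONITORS = [
    HeartbeatMonitor("nightly-backup", timedelta(hours=6)),
    HeartbeatMonitor("ssl-renewal", timedelta(weeks=1)),
    HeartbeatMonitor("daily-report", timedelta(minutes=90)),
]

for monitor in STARTER_MONITORS:
    print(f"{monitor.name}: grace window {monitor.grace}")
```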