Background Job Monitoring

Know what every async task actually did.

Background jobs run when something else triggers them — a webhook, an upload, a queue message, a user click. Drumbeats tracks each run via `/start` + `/log` + terminal pings tied by `run_id`, surfaces multi-step workflow progress, raises DURATION_HIGH / DURATION_LOW on duration drift, and auto-FAILS orphaned starts via the basic-watchdog.

50 monitors free · Setup in 60 seconds · No SDK required

How it works

Three steps. Less than a minute.

1. Create an event-driven monitor

No schedule needed. Drumbeats records each run individually as you ping it. Set `min_duration_seconds` / `max_duration_seconds` if your job has a healthy time window.

2. Send /start, then /success or /failure

One outbound HTTPS call when the run begins, one when it ends. Pass a `run_id` (your job id, an idempotency key, or anything unique) so the two are linked in the dashboard.
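
The two calls in this step differ only in their path suffix. A minimal TypeScript sketch of a URL builder, assuming the endpoint layout that this page's own samples use (`https://api.drumbeats.io/v1/ping/MONITOR_ID`, with the bare URL acting as the success ping). `pingUrl` and `PingKind` are hypothetical names for illustration, not part of any Drumbeats library:

```typescript
// Hypothetical helper: builds ping URLs in the shape this page's samples use.
type PingKind = "start" | "log" | "success" | "failure";

const PING_BASE = "https://api.drumbeats.io/v1/ping";

function pingUrl(monitorId: string, kind: PingKind, runId: string): string {
  // Assumption: the success ping goes to the bare monitor URL, as in the
  // page's samples; every other kind adds a path suffix.
  const suffix = kind === "success" ? "" : `/${kind}`;
  // Encode the run id so job ids containing "&", "/" or spaces stay intact.
  return `${PING_BASE}/${encodeURIComponent(monitorId)}${suffix}?run_id=${encodeURIComponent(runId)}`;
}
```

Encoding the `run_id` matters once you reuse existing job ids or idempotency keys, which often contain characters that would otherwise break the query string.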

3. Auto-FAILED catches the rest

If a START ping never gets a matching SUCCESS or FAILURE within `max_duration_seconds`, the basic-watchdog closes the run as FAILED. No silent stuck jobs.
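
To make the rule concrete, here is a small TypeScript sketch of the check described above. It illustrates the rule only and is not Drumbeats' actual watchdog code; the `Run` shape and `findOrphanedRuns` are hypothetical names:

```typescript
// Illustration of the auto-FAILED rule only; not the real watchdog.
interface Run {
  runId: string;
  startedAt: number; // epoch seconds of the START ping
  finished: boolean; // true once a SUCCESS or FAILURE ping arrived
}

// Returns the ids of runs that should be closed as FAILED at time `now`:
// still open, and older than max_duration_seconds.
function findOrphanedRuns(runs: Run[], maxDurationSeconds: number, now: number): string[] {
  return runs
    .filter((run) => !run.finished && now - run.startedAt > maxDurationSeconds)
    .map((run) => run.runId);
}
```

A run that has already received its terminal ping is never touched, however long it took; that case is covered by the duration thresholds instead.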

The actual product

Four views into a webhook handler

Real components, sample data: a Stripe webhook handler that fires whenever Stripe sends an event and runs for as long as that event needs.

Ping Analytics

Volume and success rate per 30-minute bucket

The actual chart from the monitor detail page — bucketed pings with success / failure split, hover for per-bucket counts, success rate, and average duration. Bursty webhook traffic is exactly what JOB_BASIC is shaped for.

Stats & thresholds

The duration knob is what makes this useful

Background jobs aren't on a schedule, so MISSED detection doesn't apply. The signals you care about are per-run: did it fail, did it run too long, did it run suspiciously fast. JOB_BASIC ships with three native events, plus the watchdog's Auto-FAILED for runs that never close.

  • FAILED — explicit /failure ping or non-zero exit code
  • DURATION_HIGH — run took longer than max_duration_seconds
  • DURATION_LOW — run finished faster than min_duration_seconds (catches "succeeded with empty output")
  • Auto-FAILED — basic-watchdog closes orphaned START pings past max_duration_seconds
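
The two duration events in the list above reduce to a pair of comparisons. A hedged TypeScript sketch, assuming a threshold that is left unset is simply not checked (`classifyDuration` and `DurationLimits` are hypothetical names, not Drumbeats identifiers):

```typescript
type DurationEvent = "DURATION_LOW" | "DURATION_HIGH" | null;

// Mirrors the two monitor settings; either one may be left unset.
interface DurationLimits {
  minDurationSeconds?: number;
  maxDurationSeconds?: number;
}

// Returns the duration event a finished run would raise, or null if the
// run landed inside the healthy window.
function classifyDuration(durationSeconds: number, limits: DurationLimits): DurationEvent {
  const { minDurationSeconds, maxDurationSeconds } = limits;
  if (maxDurationSeconds !== undefined && durationSeconds > maxDurationSeconds) {
    return "DURATION_HIGH";
  }
  if (minDurationSeconds !== undefined && durationSeconds < minDurationSeconds) {
    return "DURATION_LOW";
  }
  return null;
}
```

For a job that normally takes 5 to 20 seconds, limits of 2 and 30 seconds would flag both the half-second "empty" run and the 45-second stall.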

Workflow monitoring

Track multi-step jobs with /log

Between /start and /success, send a /log ping at each step. Drumbeats keeps them grouped under the same run_id and flags the run if any single step regresses or the whole pipeline goes long.

# Multi-step workflow — same run_id across pings
RUN_ID="invoice_evt_$(date +%s)"
BASE="https://api.drumbeats.io/v1/ping/MONITOR_ID"

curl -s "$BASE/start?run_id=$RUN_ID"
pull_line_items
curl -s "$BASE/log?run_id=$RUN_ID&message=Pulled+48+entries"

render_pdf
curl -s "$BASE/log?run_id=$RUN_ID&message=PDF+rendered"

upload_to_s3
curl -s "$BASE/log?run_id=$RUN_ID&message=Uploaded"

send_receipt && curl -s "$BASE?run_id=$RUN_ID"

  • Each /log entry is one Beat — keep them on actual milestones
  • Optional `message` query param attaches a short label per step
  • Bigger context goes in the body; up to 10 KB stored per ping
  • A regressed step shows up as an unusual gap on the run timeline
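
The same workflow pattern can be wrapped in a small helper. This is a sketch under stated assumptions, not an official client: `runWorkflow`, `Step`, and `Send` are hypothetical names, and `send` is injected (in a real job it would wrap an HTTPS request) so the example makes no network calls itself. The URL shapes follow the shell example above.

```typescript
// Hypothetical helper; `send` is injected so the sketch stays network-free.
type Send = (url: string) => Promise<void>;

interface Step {
  label: string;            // short label sent as the `message` query param
  run: () => Promise<void>; // the actual work for this milestone
}

// Sends /start, one /log per finished step, then a terminal ping.
// Resolves to true on success and false if any step threw.
async function runWorkflow(base: string, runId: string, steps: Step[], send: Send): Promise<boolean> {
  const id = encodeURIComponent(runId);
  await send(`${base}/start?run_id=${id}`);
  try {
    for (const step of steps) {
      await step.run();
      await send(`${base}/log?run_id=${id}&message=${encodeURIComponent(step.label)}`);
    }
    // Success ping: the bare URL, as in the shell example.
    await send(`${base}?run_id=${id}`);
    return true;
  } catch {
    await send(`${base}/failure?run_id=${id}`);
    return false;
  }
}
```

Because each /log ping counts as one Beat, the `steps` array is a natural place to decide which milestones are worth recording.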

Ping history

Every ping, every payload, exportable

START + SUCCESS / FAILURE events tied by run_id. Payloads show the event type and key context; failure messages show the downstream error. CSV export is one click from the dashboard.

Why background jobs are different

Async tasks fail in ways your APM can’t see

Background jobs sit in a coverage gap that almost every monitoring tool misses. APMs (Datadog, New Relic, Honeycomb) instrument the request path beautifully — but a job that runs once an hour from a worker process produces almost no APM signal, and a job that runs ten thousand times an hour produces so much noise that the failures disappear into it. Log aggregators see what your code chose to print; they have nothing to say about jobs that never started, were killed by the OOM killer, or returned exit 0 with empty results.

Drumbeats takes a per-run perspective. Every async task sends a /start ping when it begins (with a `run_id` you choose or that we generate), zero or more /log pings to record progress, and exactly one terminal ping — /success or /failure — when it finishes. The `run_id` ties the lifecycle together. From the dashboard you can drill from "this run" to its full ping timeline, search payload bodies, and see the duration distribution across the last hundred runs at a glance.
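
The lifecycle described above (one START, any number of LOGs, exactly one terminal ping) can be written down as a small check. This TypeScript sketch is purely illustrative; how Drumbeats itself treats out-of-order or repeated pings is not specified here, and `lifecycleStatus` is a hypothetical name:

```typescript
type PingType = "START" | "LOG" | "SUCCESS" | "FAILURE";

// Checks one run's pings, in arrival order, against the lifecycle:
// START, then zero or more LOGs, then exactly one terminal ping.
function lifecycleStatus(pings: PingType[]): "complete" | "open" | "invalid" {
  if (pings.length === 0 || pings[0] !== "START") return "invalid";
  const rest = pings.slice(1);
  const terminalAt = rest.findIndex((p) => p === "SUCCESS" || p === "FAILURE");
  if (terminalAt === -1) {
    // No terminal ping yet: still open, provided only LOGs followed START.
    return rest.every((p) => p === "LOG") ? "open" : "invalid";
  }
  // The terminal ping must come last, with only LOGs before it.
  const logsOnly = rest.slice(0, terminalAt).every((p) => p === "LOG");
  return logsOnly && terminalAt === rest.length - 1 ? "complete" : "invalid";
}
```

An "open" run is the one the watchdog cares about: it has started, nothing has closed it, and the clock is running.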

The pattern works for any async task whose timing isn’t fixed: webhook handlers (Stripe, GitHub, Shopify), upload processors (image, video, PDF), AI / ML inference jobs, GDPR data exports, queue-worker iterations, Lambda / Cloud Function handlers, GitHub Actions deployment steps. The only requirement is outbound HTTPS. There is no SDK, no agent, and no required runtime — which is precisely the point: the failure mode of "we forgot to upgrade the SDK in this one container" is exactly the kind of thing this category of monitoring should never reproduce.

Run started but never finished

A 30-second job that has been "running" for 18 minutes — stuck on a downstream call, OOM-killed, or just returned without sending the close ping. Set `max_duration_seconds` and the basic-watchdog auto-FAILS the run the moment the START ping ages out. No more silent stuck jobs.

Success ping that was actually empty

The job completed cleanly, but processed zero rows because of a silently-broken upstream. Set `min_duration_seconds` and Drumbeats raises DURATION_LOW for "succeeded but suspiciously fast" runs. Or attach the result count in the success-ping payload and assert against it via a webhook channel.

Run that fired three times instead of once

Double-execution from a flapping cron, retries on a stuck queue, an idempotency-key bug. The /start pings show every invocation; the run_id timeline makes duplicate runs immediately visible. Hard to spot in logs, trivial in Drumbeats.
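
If you prefer to check for duplicates programmatically, for example against an exported ping history, counting START pings per `run_id` is enough. A minimal TypeScript sketch; the `PingRecord` shape is an assumption for illustration and does not describe the actual export format:

```typescript
// Hypothetical record shape; the real export columns may differ.
interface PingRecord {
  runId: string;
  type: string; // e.g. "START", "SUCCESS", "FAILURE"
}

// Returns every run id that has more than one START ping, with its count.
function duplicateStarts(history: PingRecord[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const ping of history) {
    if (ping.type === "START") {
      counts[ping.runId] = (counts[ping.runId] ?? 0) + 1;
    }
  }
  const duplicates: Record<string, number> = {};
  for (const runId of Object.keys(counts)) {
    const n = counts[runId] ?? 0;
    if (n > 1) duplicates[runId] = n;
  }
  return duplicates;
}
```

This only works when the `run_id` is derived from the job itself (a job id or idempotency key); a freshly generated id per invocation would hide the duplication.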

Pair the lifecycle pings with public status pages on customer-facing batches (invoice generation, email sends, payout reconciliation) and your async-task observability rivals what most teams pay six figures a year for.

One ping. You're done.

No SDK, no agent, no library. A single HTTP request is all it takes.

webhook-handler.ts
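// Stripe webhook handler: track each event delivery as one run
import { randomUUID } from 'node:crypto';
import express from 'express';

const MONITOR_ID = 'YOUR_MONITOR_ID';
const base = `https://api.drumbeats.io/v1/ping/${MONITOR_ID}`;

// Your own business logic, defined elsewhere.
declare function processStripeEvent(event: unknown): Promise<void>;

const app = express();
app.use(express.json()); // parse the JSON event into req.body

// Send one ping; a failed ping must never break the handler itself.
const ping = (url: string, body?: unknown) =>
  fetch(url, body === undefined ? undefined : {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  }).catch(() => undefined);

app.post('/webhooks/stripe', async (req, res) => {
  // Use the event id as run_id; fall back to a UUID if it is missing.
  const runId = encodeURIComponent(req.body.id ?? randomUUID());
  await ping(`${base}/start?run_id=${runId}`);

  try {
    await processStripeEvent(req.body);
    await ping(`${base}?run_id=${runId}`, { event: req.body.type });
    res.sendStatus(200);
  } catch (err) {
    // `err` is `unknown` in strict TypeScript, so narrow it before use.
    const error = err instanceof Error ? err.message : String(err);
    await ping(`${base}/failure?run_id=${runId}`, { error, event: req.body.type });
    res.sendStatus(500);
  }
});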
Any language · No SDK needed · 30-second setup · Works anywhere

What does it cost?

What 200K Beats actually buys you for async jobs

Each lifecycle-tracked run uses 2 Beats (START + terminal). Free includes 200,000 Beats per month, or about 100,000 fully tracked runs. Even a busy webhook handler at 1,000 events/day uses only 60K Beats/month (30-day month), well inside Free. You only graduate to Pro once you’re past ~3,300 runs/day on a single monitor (100,000 runs spread over a 30-day month).

200,000 Beats / month = $0 forever

Outgrowing Free? Pro at $20/mo gets 1,000,000 Beats — about 500,000 lifecycle-tracked runs. Same plan covers cron, heartbeat, and uptime monitors too.
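
The arithmetic behind these figures is easy to rerun for your own workload. A small TypeScript sketch, using only the numbers stated on this page (2 Beats per run for START + terminal, one extra Beat per /log ping); the function names are hypothetical:

```typescript
// Beats consumed by one run: START + terminal ping, plus one per /log ping.
function beatsPerRun(logPingsPerRun: number = 0): number {
  return 2 + logPingsPerRun;
}

// How many fully tracked runs a monthly Beat allowance covers.
function runsPerMonth(monthlyBeats: number, logPingsPerRun: number = 0): number {
  return Math.floor(monthlyBeats / beatsPerRun(logPingsPerRun));
}

// Beats a steady daily workload consumes over a month.
function beatsPerMonth(runsPerDay: number, logPingsPerRun: number = 0, daysInMonth: number = 30): number {
  return runsPerDay * daysInMonth * beatsPerRun(logPingsPerRun);
}
```

For example, `runsPerMonth(200_000)` gives the 100,000 runs quoted for Free and `runsPerMonth(1_000_000)` the 500,000 quoted for Pro. A workflow that adds three /log pings per run costs 5 Beats per run, so the same Free allowance covers 40,000 such runs.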

Drumbeats vs. APM-shaped async-task monitoring

APM platforms (Sentry, Datadog APM, New Relic) instrument the request path beautifully and the async path almost not at all. Drumbeats is purpose-built for the per-run model — most teams use both.

| | APM (Sentry / Datadog / New Relic) | Drumbeats |
|---|---|---|
| Tracks each individual run | Workers produce few/no spans; visibility depends on a custom transaction wrapper | Native: START + SUCCESS / FAILURE tied by run_id |
| Catches a job that started but never finished | Only via timeout-and-kill heuristics inside the worker | Auto-FAILED: basic-watchdog closes orphaned STARTs past `max_duration_seconds` |
| Catches "succeeded but did nothing" (empty output) | Custom log assertion you author per job | DURATION_LOW threshold (built-in event) |
| Records run duration distribution | Histogram metric + dashboard query you build | Min / Avg / P95 / Max per monitor, automatic |
| Payload alongside each run | Sentry events truncated at ~100 KB | Up to 10 KB stored per ping by default; larger bodies overflow to S3/R2 |
| Setup per new job | 15+ minutes (instrument + dashboard + alert) | Under 60 seconds: start + finish ping in your job entrypoint |
| Team seats | Per-seat APM pricing | Unlimited on every plan |

Did your webhook handler actually finish?

Async tasks run when no one is watching. Drumbeats watches for you — every run, every duration, every payload — and auto-FAILS the ones that quietly hang.

No credit card required · 50 monitors free · Setup in 60 seconds

Keep exploring

What to read next

Free tools

  • AI agent setup

    Generate a paste-ready prompt that wires start / success / failure pings across your codebase.

  • Cron expression generator

    When you need to schedule a one-off job and want to verify the cron line.
