Question 1

How is this different from Sentry / Datadog APM?

Accepted Answer

APM platforms instrument the request path — every HTTP request becomes a span, every span becomes a metric. Background jobs are largely invisible to that model: workers produce few spans, run on machines that may not be in your APM contract, and the failures that matter most (didn’t run, ran but did nothing, ran twice) don’t produce error events. Drumbeats is purpose-built for the per-run pattern. Most teams use both: APM for the request path, Drumbeats for the async layer.

Question 2

What’s a `run_id` and why does it matter?

Accepted Answer

A `run_id` groups multiple pings into one logical execution. When you send `/start`, `/log`, `/log`, `/success` in sequence, the `run_id` ties them together so the dashboard can show "this specific run took 47 seconds, processed 1,230 records, completed at 14:22:11." Pass your own `run_id` (your job’s internal id, an idempotency key, anything unique) or let Drumbeats generate one on the start ping. Without it you still get monitor-level metrics but lose per-run drill-down.
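As a minimal sketch of how one `run_id` ties a sequence of pings together, the snippet below builds the four URLs for a single run. The base URL, the monitor key, and the `run_id` and `message` query parameter names are assumptions made for illustration, not the documented Drumbeats API; use the ping URL shown in your dashboard.

```python
import uuid
from typing import Optional
from urllib.parse import urlencode

# Hypothetical base URL and monitor key (assumption, not the real endpoint).
BASE_URL = "https://drumbeats.example/ping/my-monitor-key"


def ping_url(kind: str, run_id: str, message: Optional[str] = None) -> str:
    """Build the URL for one ping; every ping in a run carries the same run_id."""
    params = {"run_id": run_id}  # parameter name is an assumption
    if message is not None:
        params["message"] = message
    return f"{BASE_URL}/{kind}?{urlencode(params)}"


# Your job's internal id or an idempotency key would work just as well.
run_id = str(uuid.uuid4())

urls = [
    ping_url("start", run_id),
    ping_url("log", run_id, "pulled 1,230 rows"),
    ping_url("log", run_id, "uploaded to S3"),
    ping_url("success", run_id),
]
```

Generating the id once, before the start ping, and passing it to every later call is what lets the four pings be read as one run.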

Question 3

What is /log for?

Accepted Answer

A `/log` ping is for the steps in between `/start` and the terminal `/success` or `/failure`. Send one when a meaningful sub-step completes — "pulled 1,230 rows", "PDF rendered", "uploaded to S3" — with an optional short message. The dashboard groups them under the same `run_id` and you get a per-step timeline on the run detail view. Each `/log` ping costs one Beat, so put them on real milestones, not every iteration of a tight inner loop.
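One way to keep `/log` pings on milestones rather than inside a tight loop is to ping per batch. This is a sketch under assumptions: `send_log` is a hypothetical callable standing in for whatever sends your `/log` ping, and the batch size of 500 is arbitrary.

```python
from typing import Callable, Iterable, List


def process_rows(
    rows: Iterable[int],
    send_log: Callable[[str], None],  # hypothetical: sends one /log ping
    batch_size: int = 500,
) -> int:
    """Process rows, emitting a log ping per completed batch, not per row."""
    processed = 0
    for _row in rows:
        # Per-row work goes here; no ping inside the inner loop.
        processed += 1
        if processed % batch_size == 0:
            send_log(f"processed {processed} rows")
    send_log(f"finished: {processed} rows")
    return processed


# Collect messages in a list instead of sending them, to show what would go out.
sent: List[str] = []
total = process_rows(range(1230), sent.append)
```

For 1,230 rows this produces three log pings instead of 1,230, so three Beats.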

Question 4

What incident events fire for a JOB_BASIC monitor?

Accepted Answer

FAILED: an explicit `/failure` ping or a non-zero exit code (alerts after `failure_tolerance` consecutive failures, default 1).

DURATION_HIGH: the run took longer than `max_duration_seconds`.

DURATION_LOW: the run completed faster than `min_duration_seconds`.

Auto-FAILED: a START ping that goes unmatched past `max_duration_seconds` is closed as FAILED by the basic-watchdog.

JOB_BASIC has no schedule, so MISSED does not apply.
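The rules above can be restated as a small decision function. This is an illustrative model of the described behaviour, not Drumbeats' implementation: the function name, the `outcome` values, and the choice to apply duration checks only to successful runs are all assumptions.

```python
from typing import List, Optional


def incident_events(
    outcome: str,  # "success", "failure", or "unfinished" (START with no terminal ping)
    elapsed_seconds: float,
    consecutive_failures: int = 1,
    failure_tolerance: int = 1,
    max_duration_seconds: Optional[float] = None,
    min_duration_seconds: Optional[float] = None,
) -> List[str]:
    """Return the incident events one run would raise under the rules above."""
    if outcome == "unfinished":
        # Auto-FAILED: an unmatched START is closed once max_duration_seconds passes.
        if max_duration_seconds is not None and elapsed_seconds > max_duration_seconds:
            return ["FAILED"]
        return []
    if outcome == "failure":
        # FAILED alerts only once the consecutive-failure streak reaches the tolerance.
        return ["FAILED"] if consecutive_failures >= failure_tolerance else []
    events: List[str] = []
    if max_duration_seconds is not None and elapsed_seconds > max_duration_seconds:
        events.append("DURATION_HIGH")
    if min_duration_seconds is not None and elapsed_seconds < min_duration_seconds:
        events.append("DURATION_LOW")
    return events
```

There is no MISSED branch because a JOB_BASIC monitor has no schedule to miss.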

Question 5

Can I send a payload with each ping?

Accepted Answer

Yes. POST a body with the ping. By default up to 10 KB is stored per ping (the per-ping store cap). The body parser accepts payloads up to 1 MB, and anything beyond the store cap overflows into S3 / R2 storage when your project has it enabled; the `payload_truncated` flag on the ping shows whether truncation happened. Common uses: error stack traces on `/failure` pings, processing summaries on `/success`, intermediate progress on `/log`. Don’t put secrets or PII in payloads, because they’re visible to all project members.
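If you would rather decide yourself what survives the store cap, you can trim the payload client-side before sending it. The sketch below keeps the tail of the text, on the reasoning that the end of a stack trace usually names the actual error. Reading "10 KB" as 10,240 bytes is an assumption, and `fit_payload` is a hypothetical helper, not part of any Drumbeats client.

```python
STORE_CAP_BYTES = 10 * 1024  # assumption: "10 KB" read as 10,240 bytes


def fit_payload(text: str, cap: int = STORE_CAP_BYTES) -> bytes:
    """Return a UTF-8 payload of at most `cap` bytes, keeping the end of the text."""
    data = text.encode("utf-8")
    if len(data) <= cap:
        return data
    # Cut from the front. errors="ignore" drops a multi-byte character that the
    # cut split in half, so the result stays valid UTF-8 and within the cap.
    return data[-cap:].decode("utf-8", errors="ignore").encode("utf-8")


trace = "x" * 20000 + "ValueError: boom"
body = fit_payload(trace)
```

Whether to keep the head or the tail depends on the payload: a processing summary on a success ping may be better served by keeping the start.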

Question 6

What if my job has variable but legitimate run times?

Accepted Answer

Either set `max_duration_seconds` to the 99th-percentile-acceptable time (so DURATION_HIGH fires only when it really matters), or skip the threshold entirely and rely on FAILED + auto-FAILED for the real signal. For jobs whose duration is informative but not alert-worthy, the duration distribution chart on the monitor detail page surfaces drift without paging.
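To pick a 99th-percentile value for `max_duration_seconds`, you can compute it from durations you have already observed. The sketch below uses the nearest-rank method; the sample durations are made up for illustration, and where you export real durations from is left open.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("need at least one duration")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Illustrative data only: 200 runs lasting 1..200 seconds.
durations = [float(seconds) for seconds in range(1, 201)]
threshold = percentile_nearest_rank(durations, 99)
```

With a threshold chosen this way, roughly one run in a hundred from the same distribution would still exceed it, so add headroom if that is too noisy for paging.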

Question 7

Can I monitor AWS Lambda / GCP Cloud Functions?

Accepted Answer

Yes. Add a single fetch / HTTP call at the top of your handler (start ping) and another at every return path (success or failure). Cold-start cost is negligible (a sub-50 ms HTTPS call). For Lambda specifically, run the success ping after your business logic but before the return so you don’t lose visibility into failures during teardown.
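A minimal sketch of that handler shape in Python, assuming a hypothetical `PING_BASE` URL and a wrapper named `make_handler` (neither comes from Drumbeats). The ping function is injected so the wrapper can be exercised without network access.

```python
import urllib.request
from typing import Any, Callable, Dict

# Hypothetical ping URL (assumption); use the one from your monitor.
PING_BASE = "https://drumbeats.example/ping/my-monitor-key"


def http_ping(kind: str) -> None:
    """Send one ping over HTTPS with a short timeout."""
    urllib.request.urlopen(f"{PING_BASE}/{kind}", timeout=5).close()


def make_handler(
    work: Callable[[Dict[str, Any]], Any],
    ping: Callable[[str], None] = http_ping,
) -> Callable[[Dict[str, Any], Any], Any]:
    """Wrap business logic so every exit path sends exactly one terminal ping."""

    def handler(event: Dict[str, Any], context: Any) -> Any:
        ping("start")
        try:
            result = work(event)
        except Exception:
            ping("failure")
            raise
        ping("success")  # after the business logic, before the return
        return result

    return handler
```

In a deployed function you would expose `handler = make_handler(your_logic)` at module level. The ping is sent synchronously so that it completes before the function returns.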

Question 8

How does Drumbeats handle jobs that retry on failure?

Accepted Answer

Each retry is a new run with its own `run_id`. Set `failure_tolerance` to 2 or 3 if you only want to be paged after N consecutive failures — that way a single transient failure (which the job’s own retry handles) doesn’t escalate, but a poisoned message that fails three times in a row does.
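To see how `failure_tolerance` separates a transient failure from a poisoned message, here is a small model of consecutive-failure counting. It assumes the alert fires once, at the moment the streak reaches the tolerance, and that any success resets the streak; Drumbeats' exact behaviour may differ.

```python
from typing import Iterable, List


def alerting_runs(outcomes: Iterable[bool], failure_tolerance: int = 1) -> List[int]:
    """Return the indexes of runs at which an alert would fire (True means success)."""
    streak = 0
    alerts: List[int] = []
    for index, succeeded in enumerate(outcomes):
        if succeeded:
            streak = 0  # a success breaks the run of consecutive failures
            continue
        streak += 1
        if streak == failure_tolerance:
            alerts.append(index)
    return alerts


# One transient failure followed by a successful retry: no alert at tolerance 3.
transient = alerting_runs([False, True, True], failure_tolerance=3)
# A poisoned message that fails three times in a row: alert on the third run.
poisoned = alerting_runs([False, False, False], failure_tolerance=3)
```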

Question 9

What’s the most common integration mistake?

Accepted Answer

Sending the success ping before the work is actually committed. If your job ends with `ping(success)` followed by `db.commit()` and the commit throws, you’ll record a success for a job that rolled back. Always ping AFTER the commit, and wrap the work in a try / except (try / catch) that pings `/failure` on any unhandled exception. The /integrate AI-agent flow handles this pattern correctly out of the box.
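The safe ordering can be sketched as follows. `run_job`, `do_work`, `commit`, and `ping` are hypothetical names used for illustration; in a real job, `commit` would be something like your database session's commit call.

```python
from typing import Callable, List


def run_job(
    do_work: Callable[[], None],
    commit: Callable[[], None],
    ping: Callable[[str], None],
) -> None:
    """Report success only after the commit has returned without raising."""
    ping("start")
    try:
        do_work()
        commit()  # may raise; nothing has been reported as success yet
    except Exception:
        ping("failure")
        raise
    ping("success")  # reached only when the commit succeeded


def failing_commit() -> None:
    raise RuntimeError("commit failed, transaction rolled back")


pings: List[str] = []
try:
    run_job(lambda: None, failing_commit, pings.append)
except RuntimeError:
    pass  # the job's own error handling would go here
```

After this runs, `pings` holds a start and a failure and no success, which is the point: a rolled-back job never shows up as a completed one.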

Know what every async task actually did.

How it works

Four views into a webhook handler

Volume and success rate per 30-minute bucket

The duration knob is what makes this useful

Track multi-step jobs with `/log`

Every ping, every payload, exportable

Async tasks fail in ways your APM can’t see

Run started but never finished

Success ping that was actually empty

Run that fired three times instead of once

One ping. You're done.

What does it cost?

Drumbeats vs. APM-shaped async-task monitoring

Common questions

Did your webhook handler actually finish?
