Why Your Recurring Agent Task Dies Right Before Delivery
Recurring agent tasks burn 60-120s on bootstrap before the first tool call runs, clipping the last step. Fix: size timeouts as bootstrap + work + buffer.
If your recurring agent task keeps dying seconds before it posts results, the bootstrap is eating your timeout. A richly configured agent burns 60 to 120 seconds on memory loading, credentials, and skill discovery before it ever makes its first tool call. Everything after that competes for what's left of the budget, and the last step, usually the Slack post or email your team actually sees, is the one that gets clipped.
The fix
Size your timeout as bootstrap_p95 + work_p95 + buffer, not just the work. Bootstrap (memory loading, credential scans, skill discovery) commonly costs 60 to 120 seconds on a richly configured agent before the first tool call fires, so your effective budget is smaller than whatever you typed into the config. Then reorder the steps so the human-facing output runs before cleanup, and make delivery idempotent separately from the work so retries can fill in whatever the previous attempt missed.
Step-by-step
1. Instrument bootstrap as a milestone
Before you touch the timeout, measure. Log a timestamp when the process starts and another when the first tool call dispatches. The gap between them is your bootstrap. Track p95 across a full week; one-shot readings lie.
# Pseudocode. Swap `metrics` for your StatsD, Datadog, or OpenTelemetry client.
import time

t_start = time.time()  # capture at process start, before memory or skill loading

# Inside your agent runner, right before the first tool dispatch:
def on_first_tool_call(tool_name):
    delta = time.time() - t_start  # bootstrap = everything before the first tool call
    metrics.gauge(
        "agent.bootstrap_seconds",
        delta,
        tags=[f"agent:{agent_id}", f"first_tool:{tool_name}"],
    )

On an agent with heavy memory files, many skills, and multiple credentials, bootstrap can easily land between 60 and 120 seconds. If you see under 10, check where you placed the second timestamp.
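To turn a week of raw readings into the p95 you size against, your metrics backend can compute it for you; for a quick local check, the standard library works too. A minimal sketch, assuming a hypothetical fetch_samples query helper:

# Sketch: weekly p95 over raw bootstrap samples. `fetch_samples` is a
# hypothetical stand-in for a query against your metrics backend.
import statistics

samples = fetch_samples("agent.bootstrap_seconds", days=7)
bootstrap_p95 = statistics.quantiles(samples, n=20)[-1]  # 95th percentile, in seconds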
2. Size timeouts as bootstrap + work + buffer
A 300-second limit for 3 to 4 minutes of expected runtime feels safe, but it isn't once bootstrap eats the first minute. Subtract bootstrap first, then add a cushion.
timeout = bootstrap_p95 + work_p95 + buffer

For a rich agent running a multi-step pipeline, that math often lands at 900 to 1800 seconds, not 300. On OpenClaw:
openclaw cron edit <job-id> --timeout-seconds 1800

For other schedulers, update the equivalent field: activeDeadlineSeconds on the Kubernetes CronJob's jobTemplate, WorkflowExecutionTimeout (or a relevant StartToClose / ScheduleToClose timeout) in Temporal, or the task deadline in your agent framework. The configured number needs to go up.
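Plugging in numbers makes the gap concrete. The figures below are illustrative assumptions, not measurements; substitute your own p95s from step 1:

# Worked example with made-up numbers; swap in your own measurements.
bootstrap_p95 = 110                        # seconds, from the step 1 metric
work_p95 = 540                             # seconds of actual pipeline work
buffer = 0.5 * (bootstrap_p95 + work_p95)  # 50% cushion is a judgment call
timeout = int(bootstrap_p95 + work_p95 + buffer)  # 975 -> configure 1200-1800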
3. Post output before cleanup
Order the steps so the human-facing announcement runs before any expensive cleanup: post to Slack before updating the tracking spreadsheet, and send the summary email before archiving artifacts. If the deadline fires mid-cleanup, your user still saw the result, and the only loss is an internal log row you can backfill on the next trigger.
# Wrong order
do_work() -> update_tracking_sheet() -> cleanup_artifacts() -> post_to_slack()
# Right order
do_work() -> post_to_slack() -> update_tracking_sheet() -> cleanup_artifacts()

It's a one-line reshuffle that keeps the Slack post from getting skipped when the deadline lands mid-cleanup.
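In real code the reshuffle is just sequencing, but spelling it out keeps the rationale attached to each step. A sketch, where every function name is a stand-in for your pipeline's own steps rather than any OpenClaw API:

def run(job):
    result = do_work(job)
    post_to_slack(result.summary)       # human-facing output, first in line
    update_tracking_sheet(result)       # internal bookkeeping, backfillable later
    cleanup_artifacts(result.workdir)   # cheapest step to lose to the deadline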
4. Make delivery idempotent separately from work
Idempotent work is not enough. If last night's run opened a ticket and filed a card but never posted anywhere visible, today's run needs to check delivery on its own instead of inferring it from backend artifacts. One clean approach: tag each announcement with a stable key (run ID, date, source event ID) and look for that key before sending. If the work exists but the key does not, re-announce.
# Pseudocode. `announcement_posted` and `post_to_slack` are stand-ins for
# whatever delivery layer and idempotency store your agent uses.
from datetime import date

today = date.today()
run_key = f"bug-triage:{today.isoformat()}:{issue_id}"
if not announcement_posted(run_key):
    post_to_slack(message, idempotency_key=run_key)

Without this split, a retry that sees the ticket already open will call the pipeline done, and your user never hears about the run.
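If you need a concrete backing store for that check, one small record per announcement is enough. A sketch using Redis purely as an illustration; the client, key schema, and TTL are assumptions, not anything OpenClaw ships:

# Hypothetical idempotency store on Redis; swap for whatever KV store you run.
import redis

r = redis.Redis()

def announcement_posted(run_key: str) -> bool:
    return r.exists(f"delivered:{run_key}") == 1

def mark_posted(run_key: str) -> None:
    # 7-day TTL: long enough to cover retries, short enough to self-clean.
    r.set(f"delivered:{run_key}", 1, ex=7 * 24 * 3600)

Call mark_posted only after post_to_slack returns successfully, so a clipped delivery leaves the key absent and the next run re-announces.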
How to verify
Look at your bootstrap metric across the last 7 days. If p95 exceeds 60 seconds and your timeout sits under 600, you're at risk. Confirm by triggering the job manually and tailing logs: the gap between process start and the first tool invocation is your real overhead, and everything after competes for whatever remains of the budget.
For the ordering change, scan your agent's code or prompt and make sure human-facing output precedes any logging, tracking, or archival step. For idempotent delivery, invoke the job twice in a row with identical inputs. The second invocation should resend the message if the first one was clipped, not bail silently.
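If you want the double-invocation check automated, a test double for the delivery channel is enough. Everything named below (run_job, fake_slack, clip_before_delivery) is hypothetical scaffolding, not part of any framework:

# Hypothetical harness: clip the first delivery, then assert the retry fills it in.
inputs = {"date": "2026-02-02", "issue_id": 4242}

run_job(inputs, clip_before_delivery=True)  # simulate the deadline firing early
assert fake_slack.message_count() == 0      # work done, nothing visible yet
run_job(inputs)                             # retry with identical inputs
assert fake_slack.message_count() == 1      # delivery was retried, not skipped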
Why this happens
Bootstrap scales with workspace size, so more memory files, skills, and credentials mean more setup cost before any real progress. Teams usually pick the timeout once, around the time the agent ships, then keep adding memory files, skills, and credentials over the next three months. The number that felt generous is now clipping whatever runs last, and because that last step is usually the one your team sees, you notice only when somebody asks why Monday's report never arrived.
Build your first agent at pazi.ai →
This pattern came from a bug-triage agent at Pazi, powered by OpenClaw.