Agent Systems Need a Reliable Delivery Plane
Your agent platform manages context, permissions, and triggers. But when an agent fires a webhook to an external system, the agent runtime has no idea whether it actually worked. Delivery is a separate problem that requires separate infrastructure.
There is a specific failure that agent platforms have not solved, and it is not a model quality problem.
An agent decides to notify a Slack channel that a deployment completed. The agent runtime constructs the webhook payload, sends it to the Slack incoming webhook URL, receives a 200 response, and records the action as successful. The agent's execution log shows the notification was sent. The agent moves on to its next task.
The Slack message never appears in the channel.
The bot token attached to the incoming webhook was revoked three days ago when someone rotated credentials during a security audit. Slack's webhook endpoint accepts the request — it parses the JSON, validates the structure — and returns 200. But the message is silently dropped because the token no longer has permission to post to the channel. Slack does not return an error for this. It returns 200.
The agent runtime has no mechanism to detect this. It sent a webhook. It got 200. From its perspective, the job is done. The human who was supposed to see the deployment notification never sees it. Maybe they notice eventually. Maybe they don't.
This is not an edge case. This is the default behavior of agent systems that treat outbound delivery as a solved problem because the HTTP layer returned a success code.
Agent runtimes own context, not transport
Agent platforms like HeyStax solve a real and difficult set of problems. HeyStax is a platform where you install agent stax — modular agent capabilities — enable them per-context, govern what grants each agent has, trigger them safely within defined boundaries, and inspect what they did. The platform manages the lifecycle of agent operations: what can this agent do, when should it do it, who authorized it, and what happened when it ran.
That is a substantial amount of complexity. Context management, grant governance, safe triggering, and execution inspection are each individually hard problems. Solving them together in a coherent platform is serious infrastructure work.
But there is a boundary where the agent runtime's authority ends: the outbound HTTP request. When an agent fires a webhook to notify Slack, create a Jira ticket, trigger a GitHub Actions workflow, or push data to a third-party API, the agent runtime hands the payload to the network and waits for a response code. That is the last thing the runtime knows.
What happens after the 200 arrives is outside the agent's operational model. The agent runtime does not:
- Capture the full HTTP request and response for forensic inspection
- Verify that the downstream system actually committed the side effect
- Detect when delivery patterns change — latency spikes, new error codes, response body anomalies
- Provide controlled replay when a batch of deliveries fails
- Maintain a durable evidence trail that survives agent restarts and redeployments
These are delivery plane responsibilities. They require different infrastructure, different storage patterns, and different operational concerns than the agent runtime itself.
The gap between "sent" and "worked"
The Slack example above is one instance of a general pattern. Every outbound webhook from an agent system has the same gap: the HTTP response code tells you the request was received, not that the intended outcome was achieved.
Consider a more consequential scenario. An agent monitors a CI pipeline. When tests pass on the main branch, the agent triggers a deployment by sending a webhook to a deployment service. The deployment service receives the webhook, returns 200, and queues the deployment job.
The deployment job fails because the deployment service's database connection pool is exhausted from an unrelated batch job. The deployment is never executed. The agent's log says "deployment triggered successfully." The team assumes the deployment happened. The release notes go out. Customers are told the fix is live.
The fix is not live. The deployment never ran.
The agent platform did everything correctly. It evaluated the trigger condition. It checked permissions. It constructed the payload. It sent the request. It recorded the 200 response. The problem is downstream of everything the agent platform controls. For a deeper analysis of why 200 responses prove nothing about outcomes, see why delivered doesn't mean applied.
What an agent delivery plane provides
A delivery plane sits beneath the agent runtime and provides operational truth about outbound delivery. It is not a replacement for the agent platform — it is the transport layer the agent platform delegates to.
Delivery inspection. Every outbound webhook is captured with full HTTP detail: method, URL, headers, body, timestamp, response status, response body, round-trip latency. When the Slack notification fails silently, the delivery plane has the raw evidence. When the deployment webhook succeeds but the deployment job fails, the response body captured by the delivery plane may contain error details that the agent runtime discarded after checking the status code.
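A captured delivery might be modeled like this. This is a minimal sketch with illustrative field names, not HookTunnel's actual schema; the point is that the record retains the response body the runtime would normally discard after checking the status code:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class DeliveryRecord:
    """Full HTTP capture for one outbound webhook (illustrative fields)."""
    method: str
    url: str
    request_headers: dict
    request_body: str
    response_status: int
    response_body: str      # kept for forensics, not just the status code
    latency_ms: float
    sent_at: float = field(default_factory=time.time)
    delivery_id: str = field(default_factory=lambda: str(uuid.uuid4()))

# The silently dropped Slack notification from the opening scenario:
record = DeliveryRecord(
    method="POST",
    url="https://hooks.slack.com/services/T000/B000/XXXX",
    request_headers={"Content-Type": "application/json"},
    request_body='{"text": "Deploy complete"}',
    response_status=200,
    response_body='{"ok": false, "error": "token_revoked"}',
    latency_ms=182.4,
)
print(record.response_status, record.response_body)
```

With the full record stored, the 200 status and the error detail in the body sit side by side instead of the body being thrown away.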
Outcome verification. This is the critical gap. After the downstream system processes the webhook, it sends a signed receipt back to the delivery plane confirming the side effect committed. If the Slack message was posted to the channel, the integration sends a receipt. If the deployment job was queued and executed, the deployment service sends a receipt. If no receipt arrives within the SLA window, the delivery is flagged as Applied Unknown — not failed, just unverified. That distinction matters. Applied Unknown means "investigate." It does not mean "retry blindly." For the technical detail on how outcome receipts work, see webhook platforms cannot stop at HTTP request logging.
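The receipt mechanism can be sketched as a shared-secret HMAC signature plus a state function over the SLA window. This is an assumption-laden illustration — the signing scheme, field names, and state labels here are hypothetical, not HookTunnel's wire format:

```python
import hashlib
import hmac
import json
from typing import Optional

SLA_SECONDS = 60  # receipt window, matching the Slack scenario below

def verify_receipt(secret: bytes, receipt: dict, signature: str) -> bool:
    """Check a downstream receipt against an HMAC-SHA256 signature (hypothetical scheme)."""
    payload = json.dumps(receipt, sort_keys=True).encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def delivery_state(sent_at: float, receipt: Optional[dict], now: float) -> str:
    """Map receipt presence and content to an outcome state."""
    if receipt is not None:
        return "applied_confirmed" if receipt.get("committed") else "applied_failed"
    if now - sent_at > SLA_SECONDS:
        return "applied_unknown"  # window expired with no receipt: investigate
    return "pending"              # still inside the SLA window

# No receipt 90 seconds after sending: unverified, not failed.
print(delivery_state(sent_at=0.0, receipt=None, now=90.0))  # applied_unknown
```

Note that the absence of a receipt never produces a "failed" state — only an explicit negative receipt does. That keeps "investigate" and "retry" as distinct decisions.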
Replay with safety. When a batch of agent deliveries fails — the deployment service was down for 20 minutes and 15 deployment webhooks were lost — the delivery plane provides controlled re-delivery. Not blind retry. Controlled replay: the operator (or the agent, with appropriate grants) selects the failed deliveries, reviews the payloads, and replays them to the correct target with lineage tracking. Each replayed delivery is linked to the original, creating an audit trail.
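The lineage requirement is simple to state in code: a replay is a new delivery that records which delivery it replays and who approved it. A sketch, with hypothetical field names:

```python
import uuid

def replay_delivery(original: dict, approved_by: str) -> dict:
    """Create a replayed delivery linked to the original for lineage tracking."""
    return {
        "delivery_id": str(uuid.uuid4()),      # new delivery, new identity
        "replay_of": original["delivery_id"],  # lineage link back to the original
        "approved_by": approved_by,            # operator, or agent with a grant
        "method": original["method"],
        "url": original["url"],
        "body": original["body"],              # payload reviewed before replay
    }

failed = {"delivery_id": "d-123", "method": "POST",
          "url": "https://deploy.example.com/hooks/ci", "body": '{"ref": "main"}'}
replayed = replay_delivery(failed, approved_by="oncall@example.com")
print(replayed["replay_of"])  # d-123
```

Because every replay carries `replay_of` and `approved_by`, the audit trail answers both "is this a duplicate or a deliberate re-delivery?" and "who authorized it?".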
Anomaly detection. Delivery patterns change before outright failures happen. Response latency increases gradually. A downstream service starts returning 429 rate-limit responses intermittently. Error rates tick up from 0% to 0.5%. The delivery plane monitors these patterns and surfaces anomalies before they become incidents. The agent runtime is not instrumented to detect transport-layer drift — it processes individual requests and moves on. The delivery plane observes the aggregate.
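The aggregate view can be approximated with a rolling window over recent deliveries. The thresholds below are illustrative placeholders, not HookTunnel's tuning:

```python
from collections import deque

class DeliveryMonitor:
    """Rolling-window view of delivery patterns (illustrative thresholds)."""

    def __init__(self, window: int = 200):
        self.latencies = deque(maxlen=window)
        self.statuses = deque(maxlen=window)

    def observe(self, latency_ms: float, status: int) -> None:
        self.latencies.append(latency_ms)
        self.statuses.append(status)

    def anomalies(self) -> list:
        out = []
        avg = sum(self.latencies) / len(self.latencies)
        if avg > 500:                                  # latency drifting upward
            out.append(f"avg latency {avg:.0f}ms")
        errors = sum(1 for s in self.statuses if s >= 400)
        if errors / len(self.statuses) > 0.005:        # error rate above 0.5%
            out.append(f"error rate {errors / len(self.statuses):.2%}")
        if any(s == 429 for s in self.statuses):       # intermittent rate limiting
            out.append("intermittent 429 rate limiting")
        return out

mon = DeliveryMonitor()
for _ in range(100):
    mon.observe(120, 200)       # steady healthy traffic
mon.observe(130, 429)           # a single rate-limit response shows up
print(mon.anomalies())
```

Individual requests in this window still mostly succeed, which is exactly why a per-request runtime check misses the drift: only the aggregate view flags the rising error rate and the 429s.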
Why agent platforms cannot absorb this
The instinct is reasonable: why not build delivery infrastructure into the agent platform itself? The agent runtime already sends the webhooks. Why not add inspection, receipts, replay, and anomaly detection as platform features?
Three reasons.
Operational separation. The agent runtime and the delivery plane have different availability requirements. The agent runtime needs to make fast orchestration decisions. The delivery plane needs to durably store evidence and provide controlled re-delivery. Coupling them means a delivery-layer incident (a storage spike from high-volume inspection data, a replay job consuming resources) directly affects agent orchestration. Separating them means each system can fail independently and recover independently.
Evidence independence. The delivery plane's evidence is only trustworthy if it is captured independently of the system that sent the request. If the agent runtime is both the sender and the inspector, a bug in the runtime can corrupt both the delivery and the evidence about the delivery. An independent delivery plane captures what actually happened on the wire, regardless of what the agent runtime believes happened.
Scope discipline. Agent platforms already solve hard problems: context management, grant governance, safe triggering, execution inspection. Each of those is a substantial engineering surface. Adding a full delivery infrastructure — with its own storage, its own API, its own monitoring, its own replay engine — is a second product built inside the first. The complexity compounds. The team building grant governance should not also be building replay safety logic. These are different domains with different failure modes.
The concrete architecture
The architecture that works is explicit separation:
Agent Runtime (HeyStax)
- Manages agent context, grants, triggers
- Sends outbound webhooks through the delivery plane
- Reads delivery status from the delivery plane API
- Does not store or inspect delivery evidence
Delivery Plane (HookTunnel)
- Captures every outbound request with full HTTP detail
- Tracks outcome via signed receipts from downstream systems
- Provides replay with lineage and operator approval
- Monitors delivery patterns for anomalies
- Exposes delivery status and evidence via API
Downstream Systems (Slack, Jira, GitHub, deployment services)
- Receive webhooks from the delivery plane
- Send outcome receipts after committing side effects
The agent runtime decides what to send and when. The delivery plane handles how it gets there and whether it worked. Downstream systems report back through receipts.
This separation means the agent platform's view of delivery is richer than "200 OK." It sees Applied Confirmed (the downstream system committed the side effect), Applied Unknown (the SLA window passed without a receipt), or Applied Failed (the downstream system explicitly reported failure). The agent can make decisions based on these states — retry, escalate, alert, or mark complete — with actual evidence instead of assumptions.
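The per-state decisions can be expressed as a small policy table on the runtime side. This is a sketch of the idea, not a HeyStax API; the delivery kinds and actions are taken from the scenarios in this article:

```python
# Hypothetical policy table: how the runtime reacts to delivery-plane states.
POLICIES = {
    ("slack_notification", "applied_unknown"): "flag_for_human_review",
    ("deployment_trigger", "applied_unknown"): "page_oncall",
    ("deployment_trigger", "applied_failed"):  "page_oncall",
}

def decide(kind: str, state: str) -> str:
    """Choose an action from delivery evidence rather than from a status code."""
    if state == "applied_confirmed":
        return "mark_complete"
    return POLICIES.get((kind, state), "escalate")  # safe default: escalate

print(decide("slack_notification", "applied_unknown"))   # flag_for_human_review
print(decide("deployment_trigger", "applied_confirmed")) # mark_complete
```

The key property is that "mark complete" is only reachable from Applied Confirmed — a bare 200 never produces it.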
When the Slack notification fails silently
Return to the opening scenario. The agent sends a Slack notification. The bot token was revoked. Slack returns 200. The message is never posted.
Without a delivery plane, the agent records success and moves on. The failure is invisible.
With a delivery plane, the picture changes. HookTunnel captures the full request and response. The outcome receipt SLA window opens — 60 seconds for the Slack integration to confirm the message was posted. No receipt arrives. The delivery moves to Applied Unknown.
The agent runtime reads this state from the delivery plane API. Its policy for Applied Unknown on Slack notifications is to flag for human review rather than auto-retry (retrying a Slack notification is low-risk but noisy). The human sees the flagged delivery, inspects the captured response body, and notices that Slack returned 200 but the response payload includes `"ok": false, "error": "token_revoked"` — detail that was in the HTTP response but that the agent runtime's status-code-only check missed.
The human rotates the token, replays the notification through HookTunnel, and confirms Applied Confirmed within minutes. Total exposure: one missed notification, caught within the SLA window. Without the delivery plane, the exposure is unbounded — however long it takes someone to notice the missing message.
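The gap between a status-code-only check and a body-aware check is small in code but large in consequence. A sketch, using the response shape from this scenario (the exact body Slack returns is an assumption here, not a documented contract):

```python
import json

def slack_delivery_ok(status: int, body: str) -> bool:
    """A status-code check alone misses body-level errors; inspect both."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return True   # a plain-text body carries no structured error detail
    return payload.get("ok", True)

# The captured response from the scenario: 200 status, error in the body.
print(slack_delivery_ok(200, '{"ok": false, "error": "token_revoked"}'))  # False
```

The agent runtime's check was effectively `status == 200`; the delivery plane made the stricter check possible by keeping the body around.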
The deployment scenario
The agent triggers a deployment. The deployment service returns 200 but the job fails due to a database connection pool issue. The team thinks the release is live. It is not.
Without a delivery plane: the team discovers the failed deployment hours later when a customer reports the bug is still present. The investigation is manual — cross-referencing CI logs, deployment service logs, and agent execution logs to reconstruct what happened.
With a delivery plane: the deployment service sends a receipt on successful deployment. The receipt does not arrive within the SLA window. The delivery moves to Applied Unknown. The agent runtime's policy for deployment triggers in Applied Unknown state is to escalate immediately — deployment failures are high-severity. The on-call engineer is paged within 60 seconds of the SLA window expiring.
The difference is not that the delivery plane prevented the failure. The deployment service's database pool was still exhausted. The deployment still failed. The difference is detection time: 60 seconds versus hours. And the evidence: full HTTP capture of the deployment webhook, the response body, the missing receipt, and the ability to replay the deployment trigger after the database pool recovers.
This is not about model quality
Discussion of agent systems focuses heavily on model quality, reasoning capability, and orchestration sophistication. Those matter. But the agents that fail as products — the ones that users abandon — frequently fail on delivery, not reasoning.
The agent reasoned correctly. It decided to notify the right channel, trigger the right deployment, create the right ticket. The reasoning was sound. The delivery failed silently, and the agent had no mechanism to detect it.
Delivery reliability is not a model problem. It is an infrastructure problem. It requires infrastructure solutions: inspection, verification, replay, anomaly detection. These capabilities exist. They do not need to be invented. They need to be composed into the agent architecture as a first-class concern, not treated as an afterthought that HTTP status codes will handle.
Agent runtime plus delivery plane equals a complete system. The runtime owns intent. The delivery plane owns evidence. Without both, you have an agent that can reason about what to do but cannot prove it did it.
For how HookTunnel provides the specific mechanisms — canary probes, replay safety, proof-backed status — see how HookTunnel provides webhook reliability. For the deeper argument that HTTP logging alone is insufficient, see webhook platforms cannot stop at HTTP request logging. For the philosophical underpinning — why trust requires proof, not pings — see trust requires proof not pings.
Stop guessing. Start proving.
Generate a webhook URL in one click. No signup required.
Get started free →