Most integration failures are not caused by bugs. They are caused by assumptions. The integration assumes the receiving system is always available. It assumes each event is delivered exactly once. It assumes a 200 response means the payload was actually processed. It assumes that if the integration worked in staging, it will work in production under load.
None of these assumptions are safe, and operational integrations — the ones that move data between your CRM, your data warehouse, your notification systems, and your automation pipelines — fail in proportion to how many of them are embedded in the design.
The patterns in this post are not theoretical. They address the specific failure modes that surface when an integration that worked fine for six months starts dropping events under higher volume, losing data during third-party outages, or producing duplicate records after a retry storm.
Treat every webhook as a delivery attempt, not a guarantee
Webhooks are the dominant integration pattern for SaaS-to-SaaS and API-to-API event delivery. They are also systematically misunderstood. The sender fires an HTTP POST to your endpoint. You return a 200. The sender considers the event delivered.
What actually happened: your endpoint received the payload, returned 200, and then crashed before writing anything to the database. Or it wrote a partial record before a downstream service timed out. Or it successfully processed the event, and then the sender's network layer failed to receive your 200 and retried — delivering the event a second time.
Webhook delivery is at-least-once, not exactly-once. Every webhook provider worth using will retry deliveries on network errors and non-2xx responses. This is correct behavior. The consequence is that your receiving endpoint will sometimes process the same event more than once. If your endpoint is not idempotent, duplicate deliveries produce duplicate records, duplicate notifications, or duplicate actions.
Acknowledge immediately, process asynchronously. The correct pattern is to receive the webhook payload, write it to a durable queue or an incoming events table, return 200, and process it in a separate worker. This decouples delivery acknowledgment from processing. Your endpoint never times out, never fails due to a downstream service being slow, and never causes unnecessary retries from the sender.
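A minimal sketch of this pattern, using an in-memory SQLite table as the durable "incoming events" store. All names here (handle_webhook, process_pending, apply_event) are illustrative, not a real framework's API:

```python
import json
import sqlite3

# Durable landing zone for raw payloads; in production this would be a real
# queue or table, not an in-memory database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE incoming_events (id TEXT PRIMARY KEY, payload TEXT, processed INTEGER DEFAULT 0)")

applied = []  # stand-in for real downstream side effects

def apply_event(payload: dict) -> None:
    """The actual processing work; may be slow or fail without affecting the sender."""
    applied.append(payload)

def handle_webhook(event_id: str, payload: dict) -> int:
    """Persist the raw payload and acknowledge immediately.
    No downstream work happens here, so the sender never sees a timeout."""
    db.execute("INSERT OR IGNORE INTO incoming_events (id, payload) VALUES (?, ?)",
               (event_id, json.dumps(payload)))
    db.commit()
    return 200  # acknowledges receipt, not processing

def process_pending() -> int:
    """Separate worker stage: apply each stored event, then mark it processed."""
    rows = db.execute("SELECT id, payload FROM incoming_events WHERE processed = 0").fetchall()
    for event_id, payload in rows:
        apply_event(json.loads(payload))
        db.execute("UPDATE incoming_events SET processed = 1 WHERE id = ?", (event_id,))
    db.commit()
    return len(rows)
```

The key property is that handle_webhook touches only the landing table: a slow database behind apply_event can delay processing, but it can never delay the 200.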
The alternative — processing the event synchronously within the webhook handler — ties your response time to the slowest downstream system in your processing chain. If your database is slow, your webhook handler is slow. If your webhook handler is slow, the sender may time out waiting for the 200 and retry. If it retries while the first delivery is still processing, you have a race condition.
Event log over direct mutation. Rather than processing a webhook by immediately updating a database record, append the raw event to an append-only event log first. The event log is the source of truth. Processing is a separate stage that reads from the log and applies the events to the application state. This gives you replay capability — if a processing bug corrupts state, you can rebuild the correct state by replaying the event log — and a complete audit trail with no extra cost.
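The replay property can be shown with a deliberately tiny sketch — an append-only list as the log, and state rebuilt purely by replaying it (field names like order_id are hypothetical):

```python
# Append-only event log as the source of truth; never mutated, only appended.
event_log = []

def append_event(event: dict) -> None:
    event_log.append(event)

def rebuild_state() -> dict:
    """Reconstruct application state from scratch by replaying the full log.
    If a processing bug corrupts the derived state, this function is the fix."""
    state = {}
    for event in event_log:
        state[event["order_id"]] = event["status"]  # last event per order wins
    return state
```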
Design for idempotency before you design for functionality
Idempotency means that processing the same event twice produces the same result as processing it once. It is the property that makes at-least-once delivery safe. Without it, retries and duplicate deliveries cause real damage: doubled charges, duplicate notifications, overcounted metrics, corrupted aggregates.
Idempotency is not a retrofit. It is a design decision that shapes how you model events, how you write to the database, and how you implement your processing logic. Trying to add idempotency to an existing non-idempotent system is expensive. Designing for it from the start is cheap.
Idempotency keys are stable, unique identifiers for each event that allow the processing layer to detect and skip duplicate deliveries. Most webhook providers include an event ID in the payload — Stripe's id field, GitHub's X-GitHub-Delivery header, Shopify's event topic and resource ID. Your processing layer should check whether an event with that ID has already been processed before doing any work.
The check-and-process operation must be atomic. If you check for duplicates and then process as two separate operations, a concurrent delivery of the same event can pass both checks and produce a duplicate. Use a database transaction or an atomic compare-and-set operation to make duplicate detection and event insertion atomic.
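One way to make the check atomic, sketched with SQLite: let a primary-key constraint claim the event ID and process only if the insert actually happened. Table and function names are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

side_effects = []  # stand-in for the real processing work

def process_once(event_id: str, payload: dict) -> bool:
    """Claim the event ID and process in one transaction.
    The PRIMARY KEY constraint makes check-and-claim a single atomic
    operation, so two concurrent deliveries cannot both pass the check."""
    with db:  # transaction: the claim and the side effects commit together
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,))
        if cur.rowcount == 0:
            return False  # duplicate delivery: already claimed, skip
        side_effects.append(payload)
    return True
```

Because the insert and the side effects share a transaction, a crash mid-processing rolls back the claim and the retry starts clean.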
Natural idempotency keys are derived from the business meaning of the event rather than a vendor-assigned ID. For an order status update event, the natural key is the combination of order ID and status. Processing this event twice should produce the same order state. For a price change event, the natural key might be the product ID and the timestamp of the change. For inventory updates, it might be the SKU and the update timestamp.
Natural keys are more robust than vendor-assigned IDs because they survive cross-system scenarios: if the same business event reaches your system through multiple channels — a direct webhook and a polling sync both delivering the same price update — the vendor IDs will differ but the natural key will match, preventing the duplicate.
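A sketch of deriving natural keys from event content, with hypothetical event types and field names. The point is that two deliveries of the same business event — different vendor IDs, different channels — map to the same key:

```python
def natural_key(event: dict) -> tuple:
    """Derive an idempotency key from the business meaning of the event,
    ignoring vendor-assigned IDs and delivery metadata."""
    if event["type"] == "order.status_changed":
        return ("order", event["order_id"], event["status"])
    if event["type"] == "price.changed":
        return ("price", event["product_id"], event["changed_at"])
    raise ValueError(f"no natural key defined for {event['type']}")
```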
Conditional updates apply idempotency at the database level. Instead of unconditionally setting a field to the value in the event payload, update only if the event timestamp is newer than the record's current timestamp. This prevents out-of-order deliveries — which are common when retries and normal deliveries interleave — from overwriting newer data with older data.
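A conditional update can be a single guarded UPDATE statement, sketched here with SQLite and ISO-8601 timestamps (which compare correctly as strings); the schema is hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, price REAL, updated_at TEXT)")
db.execute("INSERT INTO products VALUES ('sku-1', 10.0, '2024-05-02T00:00:00Z')")
db.commit()

def apply_price_event(sku: str, price: float, event_ts: str) -> bool:
    """Apply the event only if it is newer than the stored record, so an
    out-of-order or retried delivery cannot overwrite fresher data.
    ISO-8601 UTC timestamps compare correctly as strings."""
    cur = db.execute(
        "UPDATE products SET price = ?, updated_at = ? WHERE sku = ? AND updated_at < ?",
        (price, event_ts, sku, event_ts))
    db.commit()
    return cur.rowcount == 1  # True if the event was applied, False if stale
```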
Build explicit retry logic with dead-letter handling
The default behavior of most integration code when a downstream service fails is either to swallow the error silently or to crash. Neither is acceptable for operational integrations.
Silent failure means events disappear. A webhook payload that failed to process due to a transient database timeout vanishes with no record that it was ever received. An operator reviewing the integration hours later sees no errors and no data — there is no way to know whether the missing data was never delivered or was delivered and silently dropped.
Crashing without retry means every transient downstream failure requires a manual restart of the integration service. During the downtime, events accumulate at the sender. When the service restarts, it may or may not replay the missed events depending on the sender's retry window.
Explicit retry queues make retries observable and bounded. When processing fails, the event is moved to a retry queue with a retry count, the last error, and a scheduled retry time. Workers pull from the retry queue according to an exponential backoff schedule: first retry after 30 seconds, then 2 minutes, then 15 minutes. After a configurable maximum retry count — typically three to five attempts — the event moves to a dead-letter queue.
Dead-letter queues hold events that have exhausted their retry budget and require human review. The dead-letter queue should be monitored: when an event arrives, an alert fires. The alert should include the event payload, the error history, and a direct link to the runbook for that integration. An operator can inspect the event, determine whether it can be safely reprocessed, and either requeue it or discard it with a reason logged.
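The retry-then-dead-letter flow above can be sketched in a few lines. This is an in-memory illustration using the backoff schedule from the text (30 seconds, 2 minutes, 15 minutes); all names and structures are hypothetical:

```python
BACKOFF = [30, 120, 900]        # seconds: 30 s, then 2 min, then 15 min
MAX_ATTEMPTS = len(BACKOFF)     # after this, the event goes to the DLQ

retry_queue = []
dead_letter_queue = []

def record_failure(event: dict, error: str, now: float) -> None:
    """Move a failed event to the retry queue with backoff metadata,
    or to the dead-letter queue once its retry budget is exhausted."""
    attempts = event.get("attempts", 0) + 1
    entry = {**event, "attempts": attempts, "last_error": error}
    if attempts > MAX_ATTEMPTS:
        dead_letter_queue.append(entry)  # exhausted budget: alert + human review
    else:
        entry["retry_at"] = now + BACKOFF[attempts - 1]
        retry_queue.append(entry)
```

Every failure is recorded with its count and last error, so the state of the queues is itself the diagnostic: nothing is swallowed, and nothing retries forever.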
Dead-letter queues are also the primary diagnostic tool for systematic integration failures. If a large number of events accumulate in the dead-letter queue with the same error, that is a signal that something has changed in the downstream system — a schema change, an authentication failure, a rate limit being hit — rather than isolated transient failures.
Retry budget calibration should be set based on your freshness requirements and the expected recovery time for downstream services. A downstream API that is typically unavailable for at most 15 minutes during deployments should have a retry window that covers at least 30 minutes. An integration where data older than one hour is operationally useless should have a short retry window and a rapid dead-letter escalation.
Instrument every integration path
An integration that runs without errors and produces incorrect output is harder to debug than one that fails loudly. The most common form of silent integration failure is data that flows through every stage successfully but ends up in the wrong place, with wrong values, or at the wrong time.
Per-integration metrics track the health of each integration path independently. For a webhook-based integration, the minimum useful metrics are: events received per minute, events successfully processed per minute, processing latency (time from receipt to completion), retry queue depth, and dead-letter queue depth.
These metrics should be tracked per event type, not just in aggregate. An aggregate success rate of 97% looks healthy. A breakdown that shows one event type has a 40% failure rate while the others are at 99.9% reveals a targeted problem.
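The aggregate-versus-breakdown point is easy to demonstrate with per-type counters. A minimal sketch, with hypothetical event type names:

```python
from collections import Counter

received = Counter()   # events received, per event type
succeeded = Counter()  # events successfully processed, per event type

def record(event_type: str, ok: bool) -> None:
    received[event_type] += 1
    if ok:
        succeeded[event_type] += 1

def success_rate(event_type: str) -> float:
    return succeeded[event_type] / received[event_type]

def aggregate_success_rate() -> float:
    return sum(succeeded.values()) / sum(received.values())
```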
Structured event logs attach a correlation ID to every event as it enters the system and carry that ID through every processing stage. When debugging why a specific order update did not trigger the expected downstream action, you can retrieve the full processing history for that event — when it was received, when each processing stage ran, what the outcome was at each stage — by filtering on the correlation ID.
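A stripped-down sketch of the mechanics, with a plain list standing in for a real structured-log backend and hypothetical stage names:

```python
processing_log = []  # stand-in for a structured-log store (e.g. JSON lines)

def log_stage(correlation_id: str, stage: str, outcome: str) -> None:
    """Record one processing stage, tagged with the event's correlation ID."""
    processing_log.append(
        {"correlation_id": correlation_id, "stage": stage, "outcome": outcome})

def history(correlation_id: str) -> list:
    """The full processing history for one event: filter on the correlation ID."""
    return [e for e in processing_log if e["correlation_id"] == correlation_id]
```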
End-to-end latency tracking measures the time from when an event occurs at the source to when it is visible in the destination system. This is distinct from processing latency within your integration layer. End-to-end latency includes delivery delay from the webhook provider, queue wait time, processing time, and write latency to the destination. High end-to-end latency that does not show up in processing latency metrics indicates a backlog somewhere between the source and your integration layer.
Delivery verification closes the loop between what was sent and what was received. For high-value integrations — order data, payment events, compliance-relevant records — periodic reconciliation compares the event log from the sender against the records in your destination system. Discrepancies identify gaps caused by events that were delivered but not processed, events that were processed but failed to write, or events that were never delivered.
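At its core, reconciliation is a set difference between the sender's event log and the destination's records. A minimal sketch over event IDs:

```python
def reconcile(sender_event_ids: set, destination_event_ids: set) -> dict:
    """Compare the sender's event log against destination records.
    'missing' = sent but never landed (dropped in delivery or processing);
    'unexpected' = present downstream with no corresponding source event."""
    return {
        "missing": sender_event_ids - destination_event_ids,
        "unexpected": destination_event_ids - sender_event_ids,
    }
```

In practice the sender's side comes from the provider's event-listing API or export, and the comparison runs on a schedule for the high-value integrations only.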
Handle schema changes without downtime
Upstream APIs and event schemas change. A field is renamed. A new required field is added. An enum gains a new value. A previously nested field is flattened. If your integration is tightly coupled to the current schema, any change to the upstream schema breaks the integration.
Schema versioning at ingestion stores the schema version of each incoming event alongside the payload. When the upstream API publishes a breaking change, the integration layer can route old-version events to a legacy processing handler and new-version events to the updated handler. The migration happens at the processing layer, not the ingestion layer, and old events can be reprocessed with the new handler once the migration is complete.
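Routing by stored schema version can be as simple as a handler registry. A sketch with a hypothetical price event whose v2 schema changed the price field's representation:

```python
handlers = {}  # schema version -> processing handler

def handler(version: int):
    """Decorator registering a processing handler for one schema version."""
    def register(fn):
        handlers[version] = fn
        return fn
    return register

@handler(1)
def process_v1(event: dict) -> dict:
    return {"sku": event["sku"], "price": event["price_cents"] / 100}

@handler(2)
def process_v2(event: dict) -> dict:
    # v2 replaced price_cents with a decimal string
    return {"sku": event["sku"], "price": float(event["price"])}

def process(event: dict) -> dict:
    """Route on the version recorded at ingestion; old events stay processable."""
    return handlers[event.get("schema_version", 1)](event)
```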
Tolerant readers accept unexpected fields and missing optional fields without failing. An integration handler that only reads the fields it needs — and ignores the rest — is naturally tolerant of additive changes: new fields being added to the payload do not cause failures. The integration only breaks when a field it depends on is removed or fundamentally changed.
Validation against expected schema happens after ingestion but before processing, and produces actionable errors rather than silent corruption. If an event arrives that does not conform to the expected schema — a required field is missing, a field has an unexpected type — the event goes to a validation error queue rather than the processing queue. The validation error queue is monitored separately from the dead-letter queue, because validation errors indicate a schema change that requires an integration update rather than a transient failure that can be retried.
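A sketch of the routing decision, with a hypothetical expected schema expressed as required fields and types:

```python
processing_queue = []
validation_errors = []  # monitored separately: signals schema drift, not transience

EXPECTED = {"order_id": str, "status": str}  # required fields and their types

def ingest(event: dict) -> None:
    """Route conforming events to processing; nonconforming events go to the
    validation error queue with an actionable list of what failed."""
    problems = [f for f, t in EXPECTED.items() if not isinstance(event.get(f), t)]
    if problems:
        validation_errors.append({"event": event, "missing_or_wrong": problems})
    else:
        processing_queue.append(event)
```

In production the expected schema would come from a real validation library (e.g. a JSON Schema validator) rather than a hand-rolled type map, but the routing decision is the same.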
Contract testing between the integration and the upstream API catches schema changes before they reach production. A contract test defines the fields and types the integration expects from the upstream API, and verifies that the API still conforms to those expectations. Running contract tests as part of the CI pipeline for the integration provides early warning when the upstream API changes in a way that would break the integration.
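In its simplest form, a contract test is a typed field list checked against a sampled upstream response. A sketch with a hypothetical order contract; in CI, the sample would come from the upstream API's sandbox or a recorded response:

```python
# Fields and types the integration actually reads from the upstream API.
CONTRACT = {"id": str, "status": str, "total_cents": int}

def contract_violations(sample_response: dict, contract: dict = CONTRACT) -> list:
    """Return the fields of a sampled response that break the contract
    (empty list = the upstream API still conforms). Extra fields are fine:
    the tolerant-reader principle applies to the contract too."""
    return [field for field, typ in contract.items()
            if not isinstance(sample_response.get(field), typ)]
```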
If your team's integrations are held together with fragile glue code and manual intervention, see how ValenTech approaches integration delivery or review our engagement process to understand how we scope and ship these systems.