Webhook Idempotency: Handle Retries Safely

Every transactional email provider that delivers webhooks does so at-least-once. When SESMetric POSTs a delivery or bounce event to your endpoint and the TCP connection drops before the ack arrives, the provider has no way to know whether you processed the event. The only safe choice is to retry. Webhook idempotency is the contract on your side that turns that ambiguity into safety: no matter how many times the same event lands on your endpoint, the observable side effect happens exactly once.

This guide walks through the verify-dedupe-ack pipeline that survives real retry storms, with a complete FastAPI receiver you can adapt, the SQL schema behind it, and a test plan that proves it works.

Why providers deliver at-least-once

A webhook delivery has three failure surfaces, and only one of them is your code.

The network can drop the response on the way back to the provider. Your handler succeeded, the provider sees a timeout, the event is re-queued.
Your load balancer can return a 502 while your worker is mid-write. The event partially landed and the provider retries.
Your handler can throw after a side effect has already been committed (a charged card, a sent email, a Slack ping). The retry repeats the side effect.

In every ambiguous case the provider chooses to resend. That is the only correct default, because losing a bounce event is worse than delivering it twice. Your receiver therefore needs to assume that any single event_id may arrive 2, 5, or 20 times, sometimes seconds apart, sometimes hours apart after a long outage.

The receive pipeline in order

A robust handler does the same four things in the same order on every request:

Read the raw request body as bytes. Not a parsed object yet.
Verify the HMAC signature over those raw bytes, comparing digests in constant time. Reject anything that does not match.
Atomically claim the event_id in a deduplication table. If the row already exists, you have seen this event before; ack and stop.
Do the work, commit, then return 2xx.

The order matters. Parsing before verifying lets a malformed payload crash you before you authenticate it. Deduping before verifying lets an attacker fill your dedupe table with garbage IDs. Acking before claiming leaves you exposed to a crash between ack and write.

Verify the signature first

The provider signs each payload with an HMAC, typically HMAC-SHA256, over the raw body plus a timestamp header. You recompute the same digest with your shared secret and compare in constant time. The timestamp header is there so you can reject replays older than a small window — five minutes is conventional.

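import hmac, hashlib, time

MAX_SKEW_SECONDS = 300  # reject timestamps more than five minutes from now

def verify_signature(raw_body: bytes, timestamp: str, signature: str, secret: bytes) -> bool:
    # A non-numeric timestamp can never be valid.
    try:
        ts = int(timestamp)
    except ValueError:
        return False
    # Check the replay window before spending any time on the HMAC.
    if abs(time.time() - ts) > MAX_SKEW_SECONDS:
        return False
    # The signed message is "<timestamp>." followed by the raw body bytes.
    mac = hmac.new(secret, f"{ts}.".encode() + raw_body, hashlib.sha256).hexdigest()
    # Compare as bytes: compare_digest raises TypeError on non-ASCII str input.
    return hmac.compare_digest(mac.encode(), signature.encode())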

Three details that catch teams out. First, always sign over raw_body, never over a re-serialized JSON object — a single re-ordered key changes the digest. Second, use hmac.compare_digest, not ==, so signature comparisons do not leak timing information. Third, reject anything outside the timestamp window before you do anything else; an attacker who captured a real signed body should not get to replay it a week later.
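
To make the first of those details concrete, here is a minimal sketch that signs a body as "timestamp, dot, raw bytes" and shows that re-serializing the parsed JSON produces a digest that no longer matches. The sign helper, the secret, and the sample payload are illustrative assumptions, not any provider's documented scheme.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, timestamp: str, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 over "<timestamp>." + raw body (assumed signing scheme)."""
    message = timestamp.encode() + b"." + raw_body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


secret = b"example-secret"  # hypothetical shared secret
timestamp = "1700000000"    # hypothetical timestamp header value
raw_body = b'{"type": "bounce", "id": "evt_123"}'  # bytes exactly as received

# What the sender would have put in the signature header.
signature = sign(secret, timestamp, raw_body)

# Parsing and re-serializing changes the bytes (here, the key order),
# so the recomputed digest differs from the sender's.
reserialized = json.dumps(json.loads(raw_body), sort_keys=True).encode()

matches_raw = hmac.compare_digest(sign(secret, timestamp, raw_body), signature)
matches_reserialized = hmac.compare_digest(sign(secret, timestamp, reserialized), signature)
```

Here matches_raw is True and matches_reserialized is False, even though both byte strings decode to the same JSON object.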

Why verification before dedupe

If you write the event_id to your dedupe table before verifying the signature, an unauthenticated client can post arbitrary IDs at you and watch them stick. That lets an attacker pre-claim real event IDs they have guessed, so the real provider's later delivery is dropped as a duplicate. Verify first, then index.

The canonical dedupe pattern

Once the signature checks out, claim the event. A single table with the event ID as primary key, combined with INSERT ... ON CONFLICT DO NOTHING, gives you an atomic claim in one round trip — no advisory locks, no SELECT then INSERT race.

CREATE TABLE webhook_events (
    event_id      TEXT        PRIMARY KEY,
    provider      TEXT        NOT NULL,
    event_type    TEXT        NOT NULL,
    received_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at  TIMESTAMPTZ,
    payload       JSONB       NOT NULL
);

CREATE INDEX webhook_events_received_at_idx
    ON webhook_events (received_at);

The insert pattern looks like this:

INSERT INTO webhook_events (event_id, provider, event_type, payload)
VALUES ($1, $2, $3, $4)
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;

If RETURNING gives you a row, you are the first to see this event; do the work and update processed_at on success. If it gives you nothing, you have seen it before; return 200 immediately. That single statement is the heart of the receiver: the database, not your application code, is the authority on "first delivery."
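
The claim semantics can be tried without a Postgres instance. The sketch below uses an in-memory SQLite database as a stand-in (an assumption made for portability; the article's target is Postgres) and checks the cursor's row count instead of RETURNING. The claim_event helper and the trimmed two-column table are hypothetical.

```python
import sqlite3


def claim_event(conn: sqlite3.Connection, event_id: str, event_type: str) -> bool:
    """Try to claim an event ID. True means this call inserted the row."""
    cur = conn.execute(
        "INSERT INTO webhook_events (event_id, event_type) VALUES (?, ?) "
        "ON CONFLICT (event_id) DO NOTHING",
        (event_id, event_type),
    )
    # rowcount is 1 when the row was inserted and 0 when the conflict
    # clause turned the statement into a no-op.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webhook_events (event_id TEXT PRIMARY KEY, event_type TEXT NOT NULL)"
)

first = claim_event(conn, "evt_123", "delivery")   # first delivery
second = claim_event(conn, "evt_123", "delivery")  # provider retry
conn.commit()
```

The first call wins the claim and the second is reported as a duplicate, with no SELECT beforehand and no application-level locking.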

Choosing a window

Storing every webhook forever is fine for a while, then it stops being fine. Most providers stop retrying a given event after 24 hours to a few days; SESMetric, for example, gives up after 72 hours of failed attempts. A dedupe window of 30 days is a comfortable margin — long enough that no retry will outlive it, short enough that the table stays small. Run a nightly job that deletes rows where received_at < now() - interval '30 days'. Anything that arrives after the window is treated as a fresh event, which is fine: the provider has long since stopped retrying.
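
A pruning job along those lines might look like the following sketch. It again uses in-memory SQLite as a stand-in for Postgres and stores received_at as epoch seconds rather than TIMESTAMPTZ; both are simplifying assumptions, and prune_expired is a hypothetical helper name.

```python
import sqlite3

DEDUPE_WINDOW_SECONDS = 30 * 24 * 60 * 60  # the 30-day window from the text


def prune_expired(conn: sqlite3.Connection, now: float) -> int:
    """Delete claim rows older than the dedupe window; return how many were removed."""
    cutoff = now - DEDUPE_WINDOW_SECONDS
    cur = conn.execute("DELETE FROM webhook_events WHERE received_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# Demo data: one row 31 days old, one row a day old.
now = 1_700_000_000.0
day = 24 * 60 * 60
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webhook_events (event_id TEXT PRIMARY KEY, received_at REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO webhook_events VALUES (?, ?)",
    [("evt_old", now - 31 * day), ("evt_recent", now - 1 * day)],
)

removed = prune_expired(conn, now)
remaining = [row[0] for row in conn.execute("SELECT event_id FROM webhook_events")]
```

Passing now in as a parameter, rather than reading the clock inside the function, keeps the job easy to test against fixed timestamps.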

2xx versus non-2xx response semantics

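Your status code tells the provider what to do next. Get it wrong and you either lose events or invite an infinite retry loop. The rules below are the common convention; some providers retry on any non-2xx response, so confirm the details against your provider's documentation.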

2xx — the provider stops retrying this event. Return 2xx after the claim row is committed, even if you have not finished the downstream work yet, as long as the work is durably queued.
4xx (except 408 and 429) — the provider treats this as a permanent failure and stops. Use 401 for bad signatures, 400 for malformed JSON. Do not return 4xx for transient problems.
408, 429, 5xx — the provider retries with backoff. Return these only when you genuinely cannot accept the event right now (database down, queue full).

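The trap is acking before the claim is durable. If you return 200 and then crash before the INSERT commits, the provider thinks the event landed and you have lost it. Commit the dedupe row first, then ack. If you process after the ack rather than inline, the work must be durably queued in the same transaction as the claim; otherwise a crash between the commit and the processing loses the event just the same, because the retry will be dropped as a duplicate. The dedupe row is the receipt.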
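
The status rules above can be written down as a small table and checked against each other. This is a sketch of the convention as described in this section, not of any particular provider's behavior; provider_action, the RESPONSES mapping, the choice of 503 for an unavailable database, and the treatment of unlisted codes as retryable are all assumptions.

```python
def provider_action(status_code: int) -> str:
    """Return "stop" or "retry" for a response status, per the convention above."""
    if 200 <= status_code < 300:
        return "stop"   # delivered: the provider is done with this event
    if status_code in (408, 429):
        return "retry"  # timeout or rate limit: transient
    if 400 <= status_code < 500:
        return "stop"   # permanent failure: the provider gives up
    return "retry"      # 5xx, plus anything unlisted (assumed retryable)


# What the receiver returns for each outcome (hypothetical outcome names).
RESPONSES = {
    "accepted": 200,
    "duplicate": 200,
    "bad_signature": 401,
    "malformed_payload": 400,
    "database_unavailable": 503,
}

# Only the transient outcome should cause another delivery attempt.
retried = sorted(name for name, code in RESPONSES.items() if provider_action(code) == "retry")
```

With this mapping, retried contains only database_unavailable, which is the property to preserve when you add new outcomes.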

A full FastAPI receive handler

Here is the whole pipeline in one place. It uses SQLAlchemy 2.0 async style, but the shape is the same with asyncpg, Django, or any ORM that exposes a transaction context.

import os, hmac, hashlib, time, json
from fastapi import FastAPI, Request, HTTPException, Header
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker

WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"].encode()
MAX_SKEW_SECONDS = 300

engine = create_async_engine(os.environ["DATABASE_URL"])
Session = async_sessionmaker(engine, expire_on_commit=False)
app = FastAPI()


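def verify_signature(raw: bytes, ts: str, sig: str) -> bool:
    # Reject non-numeric or out-of-window timestamps before computing the HMAC.
    try:
        if abs(time.time() - int(ts)) > MAX_SKEW_SECONDS:
            return False
    except ValueError:
        return False
    mac = hmac.new(WEBHOOK_SECRET, f"{ts}.".encode() + raw, hashlib.sha256).hexdigest()
    # Compare as bytes: compare_digest raises TypeError on non-ASCII str input.
    return hmac.compare_digest(mac.encode(), sig.encode())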


@app.post("/webhooks/sesmetric")
async def receive(
    request: Request,
    x_sesmetric_timestamp: str = Header(...),
    x_sesmetric_signature: str = Header(...),
):
    raw = await request.body()
    if not verify_signature(raw, x_sesmetric_timestamp, x_sesmetric_signature):
        raise HTTPException(status_code=401, detail="bad signature")

    try:
        event = json.loads(raw)
        event_id = event["id"]
        event_type = event["type"]
    except (ValueError, KeyError, TypeError):
        # TypeError covers valid JSON that is not an object, such as a list.
        raise HTTPException(status_code=400, detail="malformed payload")

    async with Session() as session:
        claim = await session.execute(
            text(
                "INSERT INTO webhook_events (event_id, provider, event_type, payload) "
                "VALUES (:id, 'sesmetric', :type, CAST(:payload AS JSONB)) "
                "ON CONFLICT (event_id) DO NOTHING RETURNING event_id"
            ),
            {"id": event_id, "type": event_type, "payload": raw.decode()},
        )
        if claim.first() is None:
            return {"status": "duplicate"}

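        try:
            # handle_event is your application logic. It shares the claim's
            # transaction, so a failure rolls the claim back and the retry starts clean.
            await handle_event(session, event)
            await session.execute(
                text("UPDATE webhook_events SET processed_at = now() WHERE event_id = :id"),
                {"id": event_id},
            )
            await session.commit()
        except Exception as exc:
            await session.rollback()
            # record_failure is an application-defined hook (logging, DLQ bookkeeping).
            # It needs its own transaction, since this one was just rolled back.
            await record_failure(event_id, exc)
            raise HTTPException(status_code=500, detail="processing failed") from exc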

    return {"status": "ok"}

The shape to keep in mind: parse-after-verify, claim-before-work, commit-before-ack. Everything else is detail.

Dead-letter queue for poison messages

Some events will fail no matter how many times the provider retries: a malformed payload that slipped past validation, a foreign key pointing at a deleted user, a downstream API that permanently returns 403. Without a dead-letter queue, those events burn through the provider's entire retry schedule, pollute your logs, and then vanish when the provider gives up.

A simple DLQ is another table:

CREATE TABLE webhook_dlq (
    event_id    TEXT        PRIMARY KEY,
    payload     JSONB       NOT NULL,
    last_error  TEXT        NOT NULL,
    attempts    INTEGER     NOT NULL,
    failed_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

You have two reasonable policies. The first returns 5xx for the first N attempts and lets the provider's backoff do the retry work, then on attempt N+1 inserts into the DLQ and returns 200 so the provider stops. The second writes to the DLQ on the first hard failure and returns 200 immediately. The first policy is safer for transient errors but slower, and it needs an attempt counter that survives the rolled-back transaction; the second is faster but assumes you have good observability on the DLQ table. Either way, alert on webhook_dlq row count so a human looks at the entries before they rot.
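
The first policy can be sketched as a failure hook that counts attempts and decides which status to return. In-memory SQLite stands in for Postgres here, and the webhook_attempts table, the on_failure helper, and the value of N are hypothetical; the counter lives in its own table because the claim row is rolled back on every failed attempt.

```python
import sqlite3

MAX_RETRYABLE_ATTEMPTS = 3  # hypothetical N: failures answered with 5xx


def on_failure(conn: sqlite3.Connection, event_id: str, payload: str, error: str) -> int:
    """Record one failed attempt and return the HTTP status to send back."""
    # Upsert the attempt counter: 1 on the first failure, +1 afterwards.
    conn.execute(
        "INSERT INTO webhook_attempts (event_id, attempts) VALUES (?, 1) "
        "ON CONFLICT (event_id) DO UPDATE SET attempts = attempts + 1",
        (event_id,),
    )
    (attempts,) = conn.execute(
        "SELECT attempts FROM webhook_attempts WHERE event_id = ?", (event_id,)
    ).fetchone()

    status = 500  # still within N: let the provider's backoff retry
    if attempts > MAX_RETRYABLE_ATTEMPTS:
        # Attempt N+1: park the event for a human and tell the provider to stop.
        conn.execute(
            "INSERT OR REPLACE INTO webhook_dlq (event_id, payload, last_error, attempts) "
            "VALUES (?, ?, ?, ?)",
            (event_id, payload, error, attempts),
        )
        status = 200
    conn.commit()
    return status


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webhook_attempts (event_id TEXT PRIMARY KEY, attempts INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE webhook_dlq (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL, "
    "last_error TEXT NOT NULL, attempts INTEGER NOT NULL)"
)

# Four consecutive failures of the same event.
statuses = [on_failure(conn, "evt_9", '{"id": "evt_9"}', "user not found") for _ in range(4)]
```

The first three failures return 500 and the fourth dead-letters the event and returns 200. Switching to the second policy amounts to setting the threshold to zero.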

Testing the retry path

Webhook idempotency is the kind of property that holds in development and breaks at 3 AM, so test the failure modes explicitly. Five tests cover the ground.

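import pytest, httpx

# Assumed fixtures and helpers: `client` is an httpx.AsyncClient bound to the app,
# `signed` returns (raw_body, headers) carrying a valid signature, and
# count_side_effects reports how many times handle_event acted on an event ID.
@pytest.mark.asyncio  # assumes the pytest-asyncio plugin; adapt to your async test runner
async def test_duplicate_event_processed_once(client, signed):
    payload, headers = signed({"id": "evt_123", "type": "delivery"})
    r1 = await client.post("/webhooks/sesmetric", content=payload, headers=headers)
    r2 = await client.post("/webhooks/sesmetric", content=payload, headers=headers)
    assert r1.status_code == 200 and r1.json()["status"] == "ok"
    assert r2.status_code == 200 and r2.json()["status"] == "duplicate"
    assert await count_side_effects("evt_123") == 1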

Replay the same event ID and assert the side effect runs once. This is the load-bearing test for the whole pipeline.
Tamper with a byte of the body after signing and assert 401. Proves the HMAC check actually runs.
Sign a request with a timestamp an hour old and assert 401. Proves the replay window works.
Inject a transient failure mid-handler with a fault flag, post twice, and assert the second post succeeds. Proves a failed first attempt does not poison the dedupe row.
Post an event whose ID was last seen before the dedupe window, so its row has been pruned, and assert it is processed as fresh. Documents the policy, not a bug.
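
The tests above need a way to produce signed requests. One possible shape for that helper is sketched below; the signed function, TEST_SECRET, and the optional timestamp parameter are assumptions, while the header names and the "timestamp, dot, body" message layout follow the receiver shown earlier.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

TEST_SECRET = b"test-secret"  # hypothetical; must equal the receiver's WEBHOOK_SECRET in tests


def signed(event: dict, timestamp: Optional[int] = None, secret: bytes = TEST_SECRET):
    """Return (raw_body, headers) signed the way the receiver expects."""
    ts = str(int(time.time()) if timestamp is None else timestamp)
    raw_body = json.dumps(event).encode()
    signature = hmac.new(secret, f"{ts}.".encode() + raw_body, hashlib.sha256).hexdigest()
    headers = {
        "X-SESMetric-Timestamp": ts,
        "X-SESMetric-Signature": signature,
        "Content-Type": "application/json",
    }
    return raw_body, headers


# Valid request, as used by the duplicate test.
body, headers = signed({"id": "evt_123", "type": "delivery"})

# Tamper test input: change one byte after signing and reuse the original headers.
tampered_body = body.replace(b"evt_123", b"evt_124")

# Replay-window test input: correctly signed, but with a timestamp an hour in the past.
stale_body, stale_headers = signed(
    {"id": "evt_456", "type": "bounce"}, timestamp=int(time.time()) - 3600
)
```

Posting tampered_body with headers should yield 401 from the signature check, and posting stale_body with stale_headers should yield 401 from the window check even though its signature is valid.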

Run these as part of your integration suite, not unit tests — the database round-trip and the ON CONFLICT behavior are the whole point. If they all pass, you have a receiver that survives whatever the provider throws at it, and webhook idempotency is no longer a thing you have to think about once a quarter.