Email Observability: Metrics, Logs, Traces

When a password reset never arrives, the first question is always the same: where did it go? Without instrumentation, the answer is a shrug and a grep through application logs that may or may not exist. Email observability is the practice of treating your sending pipeline like any other production system: you measure it, you log it, and you trace requests end to end. Done well, you can answer "why didn't this user get the email" in under a minute, with evidence.

This article covers what to instrument, how to wire it together, and what a healthy pipeline looks like in a dashboard.

The three pillars of email observability

The same three pillars that apply to web services apply to transactional email: metrics, logs, and traces. Each answers a different question.

Metrics for rates and ratios

Metrics are pre-aggregated counters and histograms. They answer "is the system healthy right now" at a glance. For email, the headline metrics are sends, deliveries, bounces (split hard and soft), complaints, and rejections, each broken out by recipient ISP and sending identity. You also want latency histograms for each stage of the pipeline: API accept, queue dwell, provider submit, and end-to-end (request to "delivered" webhook).

Store metrics in a time-series backend like Prometheus or a columnar store. Keep cardinality sane: ISP, identity, and template are useful labels; per-recipient labels are not.
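One way to keep the isp label bounded is to collapse recipient domains into a small fixed set before incrementing any counter. The sketch below is illustrative only: the ISP_BY_DOMAIN table and the isp_label helper are hypothetical names, and the domain list is a starting point rather than a complete mapping.

```python
# Hypothetical lookup table; extend it with the mailbox providers
# that matter for your own traffic.
ISP_BY_DOMAIN = {
    "gmail.com": "gmail",
    "googlemail.com": "gmail",
    "outlook.com": "outlook",
    "hotmail.com": "outlook",
    "live.com": "outlook",
    "yahoo.com": "yahoo",
    "icloud.com": "apple",
}


def isp_label(address: str) -> str:
    """Collapse a recipient address into a bounded ISP label.

    Anything not in the table becomes "other", so the label can never
    take more values than the table has entries plus one.
    """
    domain = address.rsplit("@", 1)[-1].strip().lower()
    return ISP_BY_DOMAIN.get(domain, "other")
```

Because unknown domains fall into "other", a campaign to ten thousand corporate domains adds one time series per metric, not ten thousand.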

Structured logs for individual sends

Logs answer "what happened to this specific email." Every send must emit a structured event, in JSON, with a stable schema. Free-text logs are unsearchable at scale and useless during an incident at 2 a.m.

Ship logs to a queryable store: ClickHouse if you want SQL and high cardinality, Loki if you prefer label-based search, OpenSearch if you already run it. The choice matters less than the discipline of emitting consistent fields.

Traces for the full chain

A single send touches your API, a queue, a worker, a provider, and several webhook callbacks. A trace stitches those spans together so you can see where time was spent and where errors bubbled up. OpenTelemetry is the de facto standard; instrument each hop with the same trace ID and you get a flame graph per send.
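To make "the same trace ID on each hop" concrete, here is a standard-library sketch of carrying a W3C traceparent value from the API into a queue message and on to the worker. In practice the OpenTelemetry SDK's propagators handle this for you; the function names below are hypothetical and exist only to show what is being passed along.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-spanid-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def _parse(traceparent: str):
    match = _TRACEPARENT.match(traceparent)
    if match is None:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    return match.groups()  # (trace_id, span_id, flags)


def new_traceparent(sampled: bool) -> str:
    """Start a trace at the API edge."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Open a new span on the next hop, keeping the trace ID and flags."""
    trace_id, _parent_span_id, flags = _parse(parent)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent: str) -> str:
    return _parse(traceparent)[0]
```

The API would attach new_traceparent(...) to the queue message, the worker would call child_traceparent(...) on what it dequeues, and trace_id_of(...) yields the value to write into the trace_id field of each log event.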

Tracing every send is expensive. Keep only a small sample in production (1 to 5 percent of sends) and force-sample on errors or when a debug header is set.
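A sampling decision along those lines can be sketched as a pure function. The name should_trace, its parameters, and the 2 percent default are assumptions for illustration; the decision is derived from the trace ID so that every hop reaches the same verdict without coordination.

```python
def should_trace(
    trace_id: str,
    *,
    is_error: bool = False,
    debug_header: bool = False,
    rate: float = 0.02,
) -> bool:
    """Decide whether to keep the trace for one send.

    trace_id is a 32-character hex string. Errors and debug requests are
    always kept; everything else is kept with probability `rate`.
    """
    if is_error or debug_header:
        return True
    # Map the low 64 bits of the trace ID onto [0, 1]. The same trace ID
    # always lands in the same bucket, so the decision is deterministic.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < rate
```

One caveat: an error is usually known only after the send has progressed, so forcing a sample on errors requires buffering spans until the outcome is known or using a tail-based sampler in the collector.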

What to count

The metrics below are the minimum useful set. Anything less and you are guessing.

email_send_total{identity, template, isp} - sends accepted by your API
email_delivered_total{identity, isp} - confirmed by the provider's delivered webhook
email_bounced_total{identity, isp, type} - type is hard or soft
email_complained_total{identity, isp} - feedback loop (FBL) or webhook complaint events
email_rejected_total{identity, reason} - rejected before send (suppression list, malformed payload)
email_send_latency_seconds{stage} - histogram with stage in accept, queue, submit, delivered

From these you derive the ratios that matter: delivery rate, hard bounce rate per ISP, and p95 end-to-end latency. A spike in email_bounced_total{isp="gmail"} without a corresponding spike at Outlook is a Gmail-specific reputation issue; the same spike across every ISP usually points at a bad recipient list or a broken suppression check on your side.

Here is the aggregation that flags any ISP, Gmail included, whose hard bounce rate over the last hour exceeds 2 percent, written in PromQL:

sum by (isp) (
  rate(email_bounced_total{type="hard"}[1h])
)
/
sum by (isp) (
  rate(email_send_total[1h])
)
> 0.02

Or the equivalent against a ClickHouse events table:

SELECT
  isp,
  countIf(event = 'bounced' AND bounce_type = 'hard') AS hard_bounces,
  countIf(event = 'send')                              AS sends,
  hard_bounces / nullIf(sends, 0)                      AS hard_bounce_rate
FROM email_events
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY isp
HAVING hard_bounce_rate > 0.02
ORDER BY hard_bounce_rate DESC;

Alert on the ratio, not the raw count. A hundred bounces an hour is fine at scale and catastrophic for a small sender.
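The arithmetic behind that claim, as a small sketch. The function name, the 2 percent threshold carried over from the queries above, and the min_sends floor are assumptions; the floor is an addition that keeps a handful of sends from producing a noisy ratio.

```python
def hard_bounce_alerts(counts, threshold=0.02, min_sends=100):
    """Return {isp: hard_bounce_rate} for every ISP over the threshold.

    counts maps an ISP label to a (sends, hard_bounces) tuple for the
    alerting window.
    """
    alerts = {}
    for isp, (sends, hard_bounces) in counts.items():
        if sends < min_sends:
            continue  # too little traffic for the ratio to mean anything
        rate = hard_bounces / sends
        if rate > threshold:
            alerts[isp] = rate
    return alerts


# The same 100 bounces, at two very different volumes.
window = {
    "gmail": (1_000_000, 100),  # 0.01 percent: healthy
    "outlook": (500, 100),      # 20 percent: on fire
}
```

Calling hard_bounce_alerts(window) flags only the outlook entry, even though both ISPs saw exactly one hundred bounces.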

Webhooks as the event source

Your application knows when a send was accepted. It does not know whether the message was delivered, bounced, opened, or clicked - the receiving ISP knows that, and your email provider relays it back to you over webhooks. Subscribing to delivery, bounce, complaint, open, and click events is what turns a one-way fire-and-forget pipeline into a closed loop.

Treat webhook handlers like any other ingestion path:

Verify the signature on every request. Drop unsigned or invalid payloads.
Make handlers idempotent. Providers retry, and duplicate events will arrive.
Persist the raw payload before processing. If your parser breaks tomorrow, you can replay.
Emit a metric and a structured log on every event.

The webhook stream is the source of truth for everything downstream of "accepted by provider." If you only count what your own code sends, you will never see the bounces.
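The four handler rules can be sketched in a few dozen lines. Everything provider-specific here is an assumption: the signature is taken to be a hex HMAC-SHA256 over the raw body with a shared secret, and the payload is taken to carry event_id and type fields. Real providers differ on all three, so check their documentation. The in-memory list, set, and dict stand in for durable storage, a dedupe table, and your metrics client.

```python
import hashlib
import hmac
import json
import logging

log = logging.getLogger("email.webhooks")


class WebhookIngest:
    def __init__(self, secret: bytes):
        self.secret = secret
        self.raw_store = []          # stand-in for durable raw-payload storage
        self.seen_event_ids = set()  # stand-in for a dedupe table
        self.counters = {}           # stand-in for a metrics client

    def handle(self, body: bytes, signature_hex: str) -> str:
        # 1. Verify the signature; drop anything that does not match.
        expected = hmac.new(self.secret, body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
            return "rejected"

        # 2. Persist the raw payload before parsing, so it can be replayed.
        self.raw_store.append(body)

        # 3. Idempotency: a retried event is acknowledged but not recounted.
        event = json.loads(body)
        event_id = event["event_id"]  # hypothetical field name
        if event_id in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event_id)

        # 4. Emit a metric and a structured log line.
        kind = event["type"]  # hypothetical field name
        self.counters[kind] = self.counters.get(kind, 0) + 1
        log.info(json.dumps({"event": "email.webhook." + kind, "event_id": event_id}))
        return "processed"
```

Note that a retried delivery is stored raw a second time before it is recognized as a duplicate. That is deliberate in this sketch: the raw store records what arrived, and deduplication happens only at the processing step.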

The message-id correlation key

Every event - the original send, the queue insert, the provider submit, the delivered webhook three seconds later, the complaint two hours after that - must carry the same correlation key. The conventional choice is the RFC 5322 Message-ID header. Some providers echo it back in webhook callbacks; others report only their own message ID, in which case you map that ID back to yours at ingest.
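Generating the key yourself, rather than waiting for the provider to assign one, means it exists before the first log line is written. A minimal sketch using Python's standard library follows; build_message and the mail.example.com domain are illustrative, and some providers rewrite the Message-ID header on submission, so treat the value as your own key first and a header second.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def build_message(sender, recipient, subject, body, domain="mail.example.com"):
    """Build a message whose Message-ID is chosen by us, up front."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # make_msgid returns a unique "<...@domain>" string suitable for
    # an RFC 5322 Message-ID header.
    msg["Message-ID"] = make_msgid(domain=domain)
    msg.set_content(body)
    return msg
```

The value of msg["Message-ID"] is what you would attach to the queue entry, every log event, and the trace attributes for that send.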

Your structured log event for a send looks like this:

{
  "ts": "2026-05-19T14:22:08.412Z",
  "level": "info",
  "event": "email.send.accepted",
  "message_id": "<01HXY9F3K2P7QW8R-acme@mail.example.com>",
  "request_id": "req_2bN7eK9pVxL",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "identity": "noreply@example.com",
  "to_hash": "sha256:7d3c...e2a1",
  "template": "password-reset",
  "provider": "ses-eu-west-1",
  "provider_message_id": "0102018f9c1a4b2e-...",
  "size_bytes": 4218,
  "latency_ms": {"accept": 12, "queue": 4}
}

A few notes on the schema:

message_id and provider_message_id are both stored. The first is yours; the second is what shows up in the provider's webhooks. Index both.
Hash the recipient address. You almost never need the plaintext for debugging, and storing it broadens your blast radius on a breach.
trace_id matches the OpenTelemetry trace so you can pivot from a log line to a flame graph.

When a user opens a support ticket, you paste their email address (hashed the same way) into your log store, get the message_id, and from there pull every event in the chain. No guessing.
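That lookup only works if the support tool and the send path hash addresses identically. A sketch of one shared helper and an event builder follows; the function names are hypothetical, and lowercasing the whole address is an assumption that suits most mailbox providers but is not guaranteed by the RFCs.

```python
import hashlib
import json
from datetime import datetime, timezone


def hash_recipient(address: str) -> str:
    """Normalize, then hash, so lookups and sends agree on the key."""
    normalized = address.strip().lower()
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def send_accepted_event(*, message_id, provider_message_id, trace_id,
                        identity, to, template, provider) -> str:
    """Serialize one email.send.accepted event as a single JSON line."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "level": "info",
        "event": "email.send.accepted",
        "message_id": message_id,
        "provider_message_id": provider_message_id,
        "trace_id": trace_id,
        "identity": identity,
        "to_hash": hash_recipient(to),  # the plaintext address is never logged
        "template": template,
        "provider": provider,
    }
    return json.dumps(event, separators=(",", ":"))
```

A plain SHA-256 of an email address can be reversed by anyone willing to hash a list of candidate addresses. If that matters for your threat model, a keyed hash (HMAC with a secret held outside the log store) is a drop-in alternative.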

A typical raw provider acknowledgement line looks like this once it lands in your log shipper:

2026-05-19T14:22:08Z provider=ses region=eu-west-1 msgid=0102018f9c1a4b2e-... status=accepted to_hash=sha256:7d3c...e2a1

That line, joined to your own send event on the provider message ID, gives you both halves of the handshake.
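As a sketch of that join, assuming the acknowledgement line is a timestamp followed by space-separated key=value pairs with no spaces inside values, as in the sample above (parse_ack_line and join_on_provider_id are hypothetical helpers):

```python
def parse_ack_line(line: str) -> dict:
    """Split 'TIMESTAMP key=value key=value ...' into a dict."""
    timestamp, *pairs = line.split()
    fields = {"ts": timestamp}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if sep:  # skip tokens that are not key=value
            fields[key] = value
    return fields


def join_on_provider_id(send_events, ack):
    """Attach the provider's status to the matching send event, if any."""
    by_provider_id = {e["provider_message_id"]: e for e in send_events}
    send = by_provider_id.get(ack.get("msgid"))
    if send is None:
        return None
    return {**send, "provider_status": ack.get("status"), "provider_ts": ack["ts"]}
```

In a real log store the same join is a query on the indexed provider_message_id column; the in-memory version only shows which fields line up.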

Replay tooling

Webhooks fail. Workers crash mid-send. A bad template renders broken HTML for an hour before anyone notices. You need to be able to replay.

Replay has two prerequisites. First, you persist the original send payload - the rendered subject and body, the variables, the identity, the timestamp - keyed by message_id. Object storage with a 30 to 90 day TTL is enough; keep it cheap. Second, you build a tool that takes a message_id or a query (all sends to a given recipient in the last hour, all sends with template X between two timestamps) and re-submits them through the same code path the original send used.
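The second prerequisite can be sketched as a selector plus a loop. StoredSend, select_sends, and replay are hypothetical names, the store is any iterable of persisted sends, and submit is whatever callable the original send path already uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredSend:
    message_id: str
    template: str
    to_hash: str
    sent_at: float  # Unix seconds
    payload: dict   # rendered subject and body, variables, identity


def select_sends(store, *, message_id=None, template=None,
                 to_hash=None, since=None, until=None):
    """Yield stored sends matching every criterion that was supplied."""
    for send in store:
        if message_id is not None and send.message_id != message_id:
            continue
        if template is not None and send.template != template:
            continue
        if to_hash is not None and send.to_hash != to_hash:
            continue
        if since is not None and send.sent_at < since:
            continue
        if until is not None and send.sent_at >= until:
            continue
        yield send


def replay(store, submit, **criteria):
    """Re-submit matching sends through the normal code path.

    Returns the message IDs that were replayed, for the audit log.
    """
    replayed = []
    for send in select_sends(store, **criteria):
        submit(send.payload)
        replayed.append(send.message_id)
    return replayed
```

Passing the production submit function replays for real; passing a function that renders into staging gives you the debugging mode without changing the tool.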

Replay is also your best debugging tool. When a customer says "the formatting is wrong on this receipt," you replay the exact payload into a staging environment, render it, and see the bug without trying to reconstruct what the variables were at the time. Redact or encrypt PII before storing payloads - it has no business sitting in your replay store in plaintext.

Tooling at a high level

You can assemble email observability from open-source pieces. A common stack is Prometheus for metrics, Loki or ClickHouse for logs, and OpenTelemetry Collector to fan traces out to Tempo, Jaeger, or a hosted backend. ClickHouse in particular is well suited to email events because it handles high-cardinality columns (message ID, recipient hash) without falling over.

The specific tools matter less than the contracts: a stable metric schema, a stable log schema, and a single correlation key that flows through every hop. Swap Loki for OpenSearch later if you want - your dashboards and runbooks should not care.

What good looks like

A pipeline with healthy email observability passes all five of these:

You can answer "did this email get delivered" for any send in the last 30 days in under a minute, by message ID or recipient hash.
Every metric, log, and trace for a given send shares one correlation key, surfaced in your log store and your tracing UI.
Bounce, complaint, and delivery rates are broken out per ISP and per sending identity, with alerts on ratios rather than raw counts.
Webhook handlers are signature-verified, idempotent, and persist raw payloads before parsing.
You can replay any send from the last 30 days against staging with the original rendered payload, with PII redacted at rest.

Hit those five and deliverability stops being a mystery. Miss any one of them and the next incident will be a long one.