Notification System Spec: Delivery Guarantees
Every time I see a spec that says "send the user a notification when their password is reset," I know the team is about to ship a bug. There is no such thing as "a notification." There is an email that may bounce, an SMS that costs money and must never duplicate, a push that may silently fail, and an in-app badge that the user may never see. The spec has to name each channel and each guarantee.
Review Note
Reviewed May 3, 2026. This article is maintained as a focused companion to the API Contracts Hub. It now includes channel-specific guarantees, provider failure evidence, and suppression rules for notification teams.
"Deliver This Notification" Is a Lie
The sentence "deliver this notification to the user" hides four different delivery models. Email goes through SMTP relays that retry for days. SMS goes through a carrier gateway that may charge you twice if you retry. Push goes through APNs or FCM, which accept your payload and never tell you if the device was actually awake. In-app lives in your own database and is as reliable as your read queries.
My rule: the spec must address each channel separately. Do not write "the notification is sent." Write "the email is enqueued for at-least-once delivery to the verified address, and the SMS is dispatched at-most-once to the E.164 number with consent recorded within the last 365 days."
Per-Channel Guarantees I Will Die On
Here is the matrix I use for every notification spec. It is opinionated on purpose.
- Email: at-least-once. The user can tolerate two copies of a receipt far better than zero copies. Retry on transient failures (SMTP 4xx replies, or HTTP 429 and 5xx from an API provider) up to three times, with exponential backoff.
- SMS: at-most-once. A duplicate 2FA code is a support ticket and a trust event. If the send fails, log and escalate; do not blindly retry. SMS also costs real money, which sharpens the decision.
- Push: best-effort. The push service acknowledges the token, not the device. Treat push as a hint, never as the canonical delivery. If the message matters, a second channel must carry it.
- In-app: exactly-once at the read layer. Dedup by notification ID on the client, persist read state server-side, and never show the same item twice even across device switches.
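The matrix above can be written down as data, so the dispatcher reads the guarantee instead of guessing it. This is a minimal sketch, not a definitive implementation: `ChannelPolicy`, `POLICIES`, `backoff_schedule`, and `dedupe_in_app` are hypothetical names, the 2-second base delay is an assumption, and only the email retry count (three) comes from the list.

```python
from dataclasses import dataclass
from enum import Enum


class Guarantee(Enum):
    AT_LEAST_ONCE = "at-least-once"
    AT_MOST_ONCE = "at-most-once"
    BEST_EFFORT = "best-effort"
    EXACTLY_ONCE_AT_READ = "exactly-once-at-read"


@dataclass(frozen=True)
class ChannelPolicy:
    guarantee: Guarantee
    max_retries: int        # retries after the first attempt
    base_backoff_s: float   # assumed base delay; the article does not give one


# One row per channel. Zero retries for SMS, push, and in-app is an assumption
# that follows from "do not blindly retry" and "best-effort".
POLICIES = {
    "email": ChannelPolicy(Guarantee.AT_LEAST_ONCE, max_retries=3, base_backoff_s=2.0),
    "sms": ChannelPolicy(Guarantee.AT_MOST_ONCE, max_retries=0, base_backoff_s=0.0),
    "push": ChannelPolicy(Guarantee.BEST_EFFORT, max_retries=0, base_backoff_s=0.0),
    "in_app": ChannelPolicy(Guarantee.EXACTLY_ONCE_AT_READ, max_retries=0, base_backoff_s=0.0),
}


def backoff_schedule(channel: str) -> list[float]:
    """Delay in seconds before each retry: base, 2*base, 4*base, ..."""
    policy = POLICIES[channel]
    return [policy.base_backoff_s * (2 ** i) for i in range(policy.max_retries)]


def dedupe_in_app(notifications: list[dict]) -> list[dict]:
    """Keep the first item per notification ID, preserving order."""
    seen: set[str] = set()
    unique = []
    for item in notifications:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    return unique
```

With these assumed values, `backoff_schedule("email")` yields three delays and `backoff_schedule("sms")` yields none, which is the difference between at-least-once and at-most-once made visible in one place.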
The Fallback Ladder Must Name Timing
Most specs wave at "fallback to email if push fails." That is not a spec, that is a wish. I want to see the actual ladder, with timing and condition.
For a password-reset notification, the ladder I would write into the spec looks like this: dispatch push immediately to all registered device tokens. If no client-side acknowledgement arrives within 30 seconds, enqueue email. If the notification is flagged as critical (password reset, fraud alert, account lockout) and the user has an SMS-consented number, also dispatch SMS in parallel with the email. No waiting, no further ladder.
Write the 30 seconds down. Write "critical" down as a flag on the template, not a case-by-case judgement. Ladders without numbers rot the first time someone asks "was it supposed to be 30 seconds or 5 minutes?"
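One way to keep the numbers from rotting is to encode the ladder as data that both the spec and the dispatcher read. The sketch below assumes hypothetical names (`Step`, `plan_ladder`, `due_channels`) and models only the ladder described above: push at zero seconds, email at 30 seconds if no acknowledgement arrived, and SMS alongside the email for critical templates with SMS consent.

```python
from dataclasses import dataclass

PUSH_ACK_TIMEOUT_S = 30  # the number the spec writes down


@dataclass(frozen=True)
class Step:
    channel: str
    delay_s: int
    requires_no_push_ack: bool


def plan_ladder(critical: bool, has_sms_consent: bool) -> list[Step]:
    """Build the ladder for one notification from template flags."""
    steps = [
        Step("push", 0, requires_no_push_ack=False),
        Step("email", PUSH_ACK_TIMEOUT_S, requires_no_push_ack=True),
    ]
    if critical and has_sms_consent:
        # SMS goes out in parallel with the email, not after it.
        steps.append(Step("sms", PUSH_ACK_TIMEOUT_S, requires_no_push_ack=True))
    return steps


def due_channels(steps: list[Step], elapsed_s: int, push_acked: bool) -> list[str]:
    """Channels that should have been dispatched by `elapsed_s`."""
    due = []
    for step in steps:
        if step.delay_s > elapsed_s:
            continue
        if step.requires_no_push_ack and push_acked:
            continue
        due.append(step.channel)
    return due
```

Because "critical" is an argument derived from the template flag, nobody decides it case by case at send time.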
Suppression That Survives Every Resend
Suppression is the part junior specs forget. It is also where regulators get interested. The spec has to describe a central suppression store that every send path consults before enqueuing anything.
- Unsubscribe state, per channel and per category. Unsubscribing from marketing email does not unsubscribe from password resets, and the spec must say that explicitly.
- Hard bounces and spam complaints. One hard bounce suppresses the address until the user re-verifies it.
- Carrier-level SMS blocks (a STOP reply, or carrier filtering such as Twilio error 30007) persist across accounts, not just the current session.
- Temporary rate suppression. If a batch job tries to resend the same template 10 times because of a bug, suppression should stop it cold after the second attempt.
Suppression must survive every retry and every "just send it again" button. If it does not, one intern with a SQL console will spam your entire user base.
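The central store described above might look like the following sketch. Everything here is illustrative: `SuppressionStore` and `allow_send` are hypothetical names, the store is in-memory, and the attempt cap of two has no time window, which a real system would need.

```python
from dataclasses import dataclass, field


@dataclass
class SuppressionStore:
    unsubscribed: set = field(default_factory=set)  # (user_id, channel, category)
    hard_bounced: set = field(default_factory=set)  # email addresses
    sms_blocked: set = field(default_factory=set)   # E.164 numbers
    attempts: dict = field(default_factory=dict)    # (user_id, channel, template) -> count
    max_attempts: int = 2                           # stop cold after the second attempt

    def allow_send(self, user_id: str, channel: str, category: str,
                   address: str, template: str) -> tuple[bool, str]:
        """Consulted by every send path, including retries and manual resends."""
        if channel == "email" and address in self.hard_bounced:
            return False, "hard_bounce"
        if channel == "sms" and address in self.sms_blocked:
            return False, "carrier_block"
        # Unsubscribe state is keyed by category, so a marketing opt-out
        # never blocks a security template.
        if (user_id, channel, category) in self.unsubscribed:
            return False, "unsubscribed"
        key = (user_id, channel, template)
        count = self.attempts.get(key, 0)
        if count >= self.max_attempts:
            return False, "rate_suppressed"
        self.attempts[key] = count + 1
        return True, "ok"
```

Returning a reason alongside the decision gives the audit log something to record when a regulator, or the intern with the SQL console, asks why a send was blocked.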
Rate Limiting Is Part of the Contract
I spec rate limits in two dimensions. Per-user: no more than N messages of the same category per hour, and no more than M total notifications per day. Per-account (for B2B products): no more than X per tenant per day, because one broken workflow should not annihilate your sender reputation. Both limits belong in the spec, not in a config file nobody reviews.
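To make the two dimensions concrete, here is a sliding-window sketch with N, M, and X as constructor arguments. The class names and the in-memory storage are assumptions for illustration; the caller passes `now` in seconds so the behaviour is easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self._events = defaultdict(deque)  # key -> timestamps inside the window

    def would_allow(self, key, now: float) -> bool:
        q = self._events[key]
        while q and now - q[0] >= self.window_s:
            q.popleft()  # drop events that have aged out of the window
        return len(q) < self.limit

    def record(self, key, now: float) -> None:
        self._events[key].append(now)


class NotificationLimits:
    def __init__(self, n_per_category_hour: int, m_per_user_day: int, x_per_tenant_day: int):
        self.category = SlidingWindowLimiter(n_per_category_hour, 3600)
        self.user = SlidingWindowLimiter(m_per_user_day, 86400)
        self.tenant = SlidingWindowLimiter(x_per_tenant_day, 86400)

    def allow(self, user_id: str, category: str, tenant_id: str, now: float) -> bool:
        checks = [
            (self.category, (user_id, category)),
            (self.user, user_id),
            (self.tenant, tenant_id),
        ]
        # Check every limit first, then record, so a rejected send
        # does not consume budget on the limits that passed.
        if not all(limiter.would_allow(key, now) for limiter, key in checks):
            return False
        for limiter, key in checks:
            limiter.record(key, now)
        return True
```

The limits are plain constructor arguments here; the point of the paragraph above is that their values are reviewed in the spec rather than tuned silently.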
Batching, Templates, and the Versioning Trap
Not every notification wants to fly out individually. For social feeds and low-urgency activity, I spec a digest: the user opts in at signup, the digest runs every N hours, and the spec names the cutoff (anything older than 24 hours drops). Real-time is reserved for transactional and security messages.
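The digest cutoff is a one-line rule, which is exactly why it is worth pinning down. A minimal sketch, assuming each pending item is a dict with a timezone-aware `created_at` (a hypothetical shape) and that "older than 24 hours" means strictly older:

```python
from datetime import datetime, timedelta, timezone

DIGEST_CUTOFF = timedelta(hours=24)  # the cutoff the spec names


def build_digest(items: list[dict], now: datetime) -> list[dict]:
    """Drop items older than the cutoff and return the rest, newest first."""
    fresh = [item for item in items if now - item["created_at"] <= DIGEST_CUTOFF]
    return sorted(fresh, key=lambda item: item["created_at"], reverse=True)


now = datetime(2026, 5, 3, 12, 0, tzinfo=timezone.utc)
pending = [
    {"id": "stale", "created_at": now - timedelta(hours=25)},
    {"id": "older", "created_at": now - timedelta(hours=2)},
    {"id": "newer", "created_at": now - timedelta(hours=1)},
]
digest_ids = [item["id"] for item in build_digest(pending, now)]
```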
Template versioning is the other trap. The spec must say: templates are versioned, every in-flight send records the template version it was composed with, and rollout happens by routing a percentage of sends to v2 while v1 drains. Never hot-swap a template on a live queue. I have watched that send two different password-reset emails to the same user ninety seconds apart.
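A percentage rollout that pins the version at compose time could be sketched like this. The function names are hypothetical, and bucketing by user ID is my assumption: it keeps one user on one version for the whole rollout, which is one way to avoid the two-different-emails incident.

```python
import hashlib


def pick_template_version(user_id: str, v2_percent: int) -> str:
    """Deterministically route a user to v1 or v2 based on a stable hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return "v2" if bucket < v2_percent else "v1"


def compose_send(user_id: str, template: str, v2_percent: int) -> dict:
    """Record the version on the send itself, so retries never re-pick it."""
    return {
        "user_id": user_id,
        "template": template,
        "template_version": pick_template_version(user_id, v2_percent),
    }
```

A retry then renders with the `template_version` stored on the record, even if the rollout percentage changed while the send sat in the queue.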
Delivery Receipts: Trust Nothing
Product managers love delivery dashboards. The spec should set honest expectations: email "opens" require a tracking pixel that many clients block, so treat the open rate as a floor. SMS delivery receipts (DLRs) are carrier-dependent and some carriers simply lie. Push receipts from APNs or FCM mean the message was accepted by the push service, not that the phone rendered it. The only reliable receipt is the user taking an action inside the app as a direct result, which is why critical flows should include a confirmation step.
Compliance and Observability
Every notification spec I write has a compliance section. CAN-SPAM requires a working unsubscribe link in every commercial email, honored within 10 business days. TCPA requires prior express written consent for marketing SMS, and the spec must name where that consent is captured. GDPR requires a lawful basis: transactional mail is contract necessity, marketing is consent, and the spec lists which applies per template.
Name the dashboards before the code ships. Per-channel success rate (email under 98% pages on-call). Per-template open and click rate so we catch the subject line that crashed engagement. Complaint rate tracked weekly, because crossing 0.3% can throttle your sending domain for weeks.
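The two thresholds above are simple enough to express as a check that runs against dashboard counters. This is a sketch under assumptions: the function name and alert labels are hypothetical, and the complaint rate is computed against messages sent, which the article does not specify.

```python
EMAIL_SUCCESS_FLOOR = 0.98     # below this, page on-call
COMPLAINT_RATE_CEILING = 0.003  # 0.3%, the throttling risk line


def evaluate_email_health(sent: int, delivered: int, complaints: int) -> list[str]:
    """Return the alerts that the given counters should trigger."""
    if sent == 0:
        return []  # nothing sent, nothing to judge
    alerts = []
    if delivered / sent < EMAIL_SUCCESS_FLOOR:
        alerts.append("page_oncall:success_rate")
    if complaints / sent >= COMPLAINT_RATE_CEILING:
        alerts.append("review:complaint_rate")
    return alerts
```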
Acceptance Criteria for a Password-Reset Notification
- Given a user requests a password reset
  And the user has a verified email and an SMS-consented phone number
  When the reset event is emitted
  Then a push is dispatched to all registered devices
  And if no push ACK arrives within 30 seconds
  Then an email is enqueued for at-least-once delivery
  And an SMS is dispatched at-most-once in parallel
- Given the user has previously unsubscribed from marketing email
  When a password reset is triggered
  Then the transactional email is still sent
  And the suppression check records "transactional override"
- Given the SMS provider returns a transient error
  When the dispatcher handles the failure
  Then no retry is attempted
  And the event is logged with severity "warn" and surfaced on the SMS success-rate dashboard
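The SMS criterion (transient error, no retry, logged at "warn") is the one most likely to be implemented wrong, so here is a minimal at-most-once sketch. `dispatch_sms_once`, `TransientProviderError`, and the in-memory `attempted` set are hypothetical stand-ins; a real dispatcher would persist the attempt marker durably.

```python
import logging

logger = logging.getLogger("sms_dispatcher")


class TransientProviderError(Exception):
    """Stand-in for whatever transient error the SMS provider client raises."""


def dispatch_sms_once(notification_id: str, send, attempted: set) -> str:
    """Call `send` at most once per notification ID and never retry."""
    if notification_id in attempted:
        return "skipped"
    # Record the attempt before calling the provider, so a crash or a
    # replayed event cannot produce a second send.
    attempted.add(notification_id)
    try:
        send()
    except TransientProviderError:
        logger.warning("sms send failed, not retrying: %s", notification_id)
        return "failed"
    return "sent"


# Usage: a provider that always fails is still called only once.
calls = []


def failing_send():
    calls.append(1)
    raise TransientProviderError()


attempted_ids = set()
first = dispatch_sms_once("n1", failing_send, attempted_ids)
second = dispatch_sms_once("n1", failing_send, attempted_ids)
```

Marking the attempt before the provider call is the design choice that makes this at-most-once; marking it afterwards would quietly turn it into at-least-once.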
Final Takeaway
A notification system is not one promise, it is four. Write the spec as four promises: name the channel, name the guarantee, name the fallback timing, name the suppression rule. Do that and your on-call rotation will stop being a graveyard shift for "user says they never got the email." Skip it and you will keep shipping the same bug in a different hoodie.
Provider Failure Drill
The launch review should include one controlled provider failure. Turn off the email provider in staging, trigger the notification, and capture what happens. The spec should predict the result before the drill runs.
Provider failure evidence
- Template:
- Primary channel:
- Fallback channel:
- Suppression category:
- Provider failure simulated:
- Expected retry count:
- Expected fallback delay:
- User-visible state:
- Dashboard metric:
- Alert owner:
- Manual resend allowed: yes / no
- Reason if manual resend is blocked:
This drill catches awkward policy gaps. For example, a password reset can safely fall back to email after push fails, but a marketing SMS should not fall back to another channel just because the provider returned a transient error. The spec needs to say that before the code guesses.
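"The spec should predict the result" becomes checkable if the prediction and the observation share a shape. The sketch below uses a hypothetical `DrillOutcome` record covering a few of the evidence fields; the example values are invented for illustration, not results from a real drill.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DrillOutcome:
    retry_count: int
    fallback_channel: Optional[str]
    fallback_delay_s: Optional[int]
    user_visible_state: str


def diff_outcomes(expected: DrillOutcome, observed: DrillOutcome) -> dict:
    """Return {field: (expected, observed)} for every field that differs."""
    exp, obs = asdict(expected), asdict(observed)
    return {name: (exp[name], obs[name]) for name in exp if exp[name] != obs[name]}


# Illustrative values only.
predicted = DrillOutcome(retry_count=3, fallback_channel=None,
                         fallback_delay_s=None, user_visible_state="reset pending")
observed = DrillOutcome(retry_count=5, fallback_channel=None,
                        fallback_delay_s=None, user_visible_state="reset pending")
gaps = diff_outcomes(predicted, observed)
```

An empty diff means the spec predicted the failure correctly; anything else is a policy gap to resolve before launch.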
Editorial Note
- Author details: Spec Coding Editorial Team