Payment Workflow Spec: Failure and Retry Matrix
Most payment specs I have reviewed describe the happy path in three pages and the failure behavior in one sentence: "retry on error." That sentence is where 80% of the production incidents come from. A payment workflow spec earns its keep by naming, category by category, exactly what the system does when the card network, the issuer, or the customer refuses to cooperate.
Start With a Failure Taxonomy, Not a Flowchart
Before I draw a single box, I force the spec to answer one question: what are the categories of failure this workflow can produce? I insist on five, because collapsing them into "error" is what creates the mess.
- Network timeout. The processor never answered. The charge may or may not exist. This is the only category where retrying with the same idempotency key is mandatory.
- Soft decline. The issuer said no for a recoverable reason. Insufficient funds, do-not-honor, expired card. Retry is allowed but only with customer action or a later attempt window.
- Hard decline. Stolen card, pickup card, fraudulent. Retrying is never correct and on some networks it increases your risk score. The spec must forbid it.
- Fraud review. The processor accepted the auth but punted on a decision. The response is async. The spec must describe the waiting state and the webhook that ends it.
- 3DS challenge. The issuer demanded Strong Customer Authentication. This is not a failure — it is a branch. The customer sees a redirect or in-flow iframe and the workflow pauses until they finish.
Every downstream decision in the spec — retry policy, user messaging, observability — hangs off this five-row matrix. If the taxonomy is wrong, nothing below it will save you.
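To make the matrix concrete, here is a minimal sketch of the five rows as data. The names (FailureCategory, MatrixRow, FAILURE_MATRIX) and the three columns are my own illustration, not a processor API; a real spec would add columns for user messaging and observability.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureCategory(Enum):
    NETWORK_TIMEOUT = "network_timeout"
    SOFT_DECLINE = "soft_decline"
    HARD_DECLINE = "hard_decline"
    FRAUD_REVIEW = "fraud_review"
    THREE_DS_CHALLENGE = "three_ds_challenge"


@dataclass(frozen=True)
class MatrixRow:
    retry_allowed: bool          # may the system ever re-attempt this charge?
    reuse_idempotency_key: bool  # must a retry carry the original key?
    waits_for: Optional[str]     # what unblocks the workflow, if anything


# Hypothetical encoding of the five-row matrix described above.
FAILURE_MATRIX = {
    FailureCategory.NETWORK_TIMEOUT: MatrixRow(True, True, None),
    FailureCategory.SOFT_DECLINE: MatrixRow(True, False, "customer_action_or_retry_window"),
    FailureCategory.HARD_DECLINE: MatrixRow(False, False, None),
    FailureCategory.FRAUD_REVIEW: MatrixRow(False, False, "review_webhook"),
    FailureCategory.THREE_DS_CHALLENGE: MatrixRow(False, False, "customer_challenge"),
}

# Guard against a category being added without a matrix row.
assert set(FAILURE_MATRIX) == set(FailureCategory)
```

Keeping the matrix as data rather than as scattered if-statements means a reviewer can read the whole retry policy in one screen.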
Retry Rules That Actually Match the Category
Here is the rule I write into every payment spec, word for word: retry policy is a function of failure category, not of HTTP status. A 402 from Stripe can be either a soft decline you should let the customer fix or a hard decline you should never touch again. The spec has to branch on the processor's decline code, not the transport code.
Concretely: for a Stripe response of card_declined with decline_code: insufficient_funds, I permit up to three retries spaced by the dunning schedule below, each one gated on either a customer-initiated action or a scheduled job. For card_declined with decline_code: stolen_card, the spec sets a permanent flag on the payment method and any subsequent attempt must fail closed before hitting the network. For a connection error or an HTTP 5xx with no response body, the spec requires an immediate retry with the same Idempotency-Key, because the processor may have already charged the card and a fresh key would double-charge.
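The branching above can be sketched as a classifier that ignores the 4xx status and reads the decline code instead. This is an illustration under assumptions: the function names, the has_body flag, the "unclassified" bucket, and the action strings are hypothetical, and the two code sets contain only the codes named in the text, so they are deliberately incomplete.

```python
from typing import Optional

# Illustrative, deliberately incomplete: only the two codes named in the text.
SOFT_DECLINE_CODES = {"insufficient_funds"}
HARD_DECLINE_CODES = {"stolen_card"}
MAX_SOFT_RETRIES = 3


def classify(http_status: Optional[int], has_body: bool,
             error_code: Optional[str], decline_code: Optional[str]) -> str:
    """Map a failed processor response to a failure category."""
    # Connection error (no status) or a 5xx with no body: outcome unknown.
    if http_status is None or (http_status >= 500 and not has_body):
        return "network_timeout"
    if error_code == "card_declined":
        if decline_code in HARD_DECLINE_CODES:
            return "hard_decline"
        if decline_code in SOFT_DECLINE_CODES:
            return "soft_decline"
    # Anything unrecognised is surfaced rather than guessed at.
    return "unclassified"


def next_action(category: str, retries_so_far: int) -> str:
    """Pick the action for a category; action names are hypothetical."""
    if category == "network_timeout":
        return "retry_now_same_key"
    if category == "soft_decline":
        return "schedule_retry" if retries_so_far < MAX_SOFT_RETRIES else "stop"
    if category == "hard_decline":
        return "flag_payment_method"
    return "escalate"


def may_attempt(payment_method_id: str, flagged: set) -> bool:
    """Fail closed: a flagged payment method never reaches the network."""
    return payment_method_id not in flagged
```

Note that the same 402 status yields two different categories, which is the whole point of branching on the decline code.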
Idempotency Keys Belong to the Attempt
The single most common mistake I see: idempotency keys scoped to the HTTP request instead of the logical attempt. If a timeout triggers a retry and the retry generates a new key, the processor treats it as a new charge. The spec must say, in one sentence, that the key is minted when the attempt begins and survives every retransmission inside that attempt. A new attempt — meaning a new customer action, a new dunning cycle, or a new order — gets a new key. Nothing in between does.
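I also spec the key's lifetime: processors typically honor a key only for a limited window, commonly around 24 hours (Stripe, for example, may prune keys once they are at least 24 hours old). If a retry crosses that boundary, the key can no longer be trusted to deduplicate, so the spec needs to reconcile against the processor's ledger (list charges by metadata) rather than assume the retry is safe.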
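A minimal sketch of attempt-scoped keys, assuming a 24-hour window. PaymentAttempt, KeyExpired, and headers_for_send are hypothetical names; the only real-world detail used is the Idempotency-Key header already named in the text.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

KEY_TTL = timedelta(hours=24)  # assumption: the typical window cited in the text


class KeyExpired(Exception):
    """The key may no longer be honored; reconcile before sending anything."""


@dataclass
class PaymentAttempt:
    order_id: str
    started_at: datetime
    # Minted once, when the attempt begins, and never regenerated.
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))
    transmissions: int = 0

    def headers_for_send(self, now: datetime) -> dict:
        """Headers for the first send and for every retransmission."""
        if now - self.started_at >= KEY_TTL:
            raise KeyExpired(self.idempotency_key)
        self.transmissions += 1
        return {"Idempotency-Key": self.idempotency_key}


# Usage: two sends inside one attempt share a key; a new attempt does not.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
attempt = PaymentAttempt("order_1", t0)
first = attempt.headers_for_send(t0)
retry = attempt.headers_for_send(t0 + timedelta(seconds=10))
assert first == retry
assert PaymentAttempt("order_1", t0).idempotency_key != attempt.idempotency_key
```

Raising instead of silently minting a new key forces the caller into the reconciliation path once the window has passed.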
The 3DS Branch Is a First-Class State
SCA is not an error. If the spec treats it as one, the frontend will do something stupid like show a red banner while the issuer is mid-challenge. The spec needs a state called requires_action (or whatever your processor calls it) with explicit transitions: entered when the auth returns a challenge URL, exited when the webhook confirms success or failure.
I spec two flavors separately. In-flow challenge: the client SDK mounts the iframe, blocks interaction, and resolves. Redirect challenge: the browser navigates to the issuer's ACS URL and comes back to a return URL we control. The spec nails down the return URL, what query params we expect, and what happens if the customer closes the tab mid-challenge. That last one always gets forgotten, and it is the case that produces stuck subscriptions in production.
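The transitions can be written as a small table. This is a sketch under assumptions: the state and event names other than requires_action are hypothetical, and resolving a closed tab through a timeout job that moves the payment to "abandoned" is one possible answer to the closed-tab question, not the only one.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("processing", "challenge_required"): "requires_action",
    ("requires_action", "webhook_succeeded"): "succeeded",
    ("requires_action", "webhook_failed"): "failed",
    # Assumption: a closed tab is resolved by a timeout job, not by the client.
    ("requires_action", "challenge_timed_out"): "abandoned",
}

# Client-side signals that are recorded but never move the state on their own.
IGNORED_EVENTS = {"client_returned", "tab_closed"}


def transition(state: str, event: str) -> str:
    """Return the next state, or raise if the spec does not allow the move."""
    if event in IGNORED_EVENTS:
        return state
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}") from None
```

Because the browser returning to the return URL is in the ignored set, only the webhook (or the timeout) can end the challenge, which keeps the frontend from declaring success or failure on its own.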
Auth, Capture, and the Seven-Day Cliff
If your workflow does auth-now / capture-later, the spec must call out the auth expiry. Most processors auto-void an uncaptured auth at roughly seven days (Stripe is 7, Adyen varies by scheme, some schemes are shorter for debit). The spec needs to answer: what happens if the fulfillment job runs on day 8? My answer is always the same — the spec requires a fresh auth before capture attempts beyond day 5, and it treats any capture attempt against an expired auth as a hard failure that opens a new authorization, not a retry.
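The day-5 and day-7 rules reduce to a small decision function. A sketch, assuming a flat seven-day expiry; as noted above, the real window varies by processor and scheme, and the function and return-value names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

REAUTH_AFTER = timedelta(days=5)  # the spec's safety margin
AUTH_EXPIRY = timedelta(days=7)   # assumption: varies by processor and scheme


def capture_plan(authorized_at: datetime, now: datetime) -> str:
    """Decide what a capture job should do with an existing authorization."""
    age = now - authorized_at
    if age >= AUTH_EXPIRY:
        # Expired: a hard failure that opens a new authorization, not a retry.
        return "hard_fail_then_new_authorization"
    if age > REAUTH_AFTER:
        # Still valid, but past the margin: get a fresh auth before capturing.
        return "fresh_auth_then_capture"
    return "capture"


authorized = datetime(2024, 3, 1, tzinfo=timezone.utc)
assert capture_plan(authorized, authorized + timedelta(days=2)) == "capture"
assert capture_plan(authorized, authorized + timedelta(days=8)) == "hard_fail_then_new_authorization"
```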
Multi-capture makes this worse. If you are capturing in pieces against a single auth, the spec must state the partial-capture order, whether over-capture is permitted (it usually is not), and how refund-before-final-capture interacts with the remaining authorized amount. I have watched teams discover at 2 a.m. that their "simple" refund reduced the capturable balance to zero and killed the next shipment.
Dunning Is a State Machine, Write It Down
For subscription failures the spec should contain the dunning schedule verbatim, not a vague "we will retry." The schedule I default to:
- Attempt 1: immediate, at renewal.
- Attempt 2: +3 days, silent retry.
- Attempt 3: +7 days, preceded by an email 24 hours earlier.
- Attempt 4: +14 days, final notice email.
- Cancel: +21 days, subscription moves to canceled, access revoked on the next billing cycle boundary.
Each transition is a row in a state table: previous state, trigger, new state, side effects (email, webhook, access flag). Without this table the team argues the schedule every quarter.
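One way to keep the schedule verbatim is to store it as data. This sketch assumes the offsets are measured from the renewal date (the schedule does not say whether they are cumulative), and the "retry_warning" template name is hypothetical. The timing of the final notice relative to attempt 4 is not stated in the schedule, so it is left unset here.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Optional


class DunningStep(NamedTuple):
    name: str
    offset_days: int                 # assumption: measured from the renewal date
    email: Optional[str]             # template name; None means silent
    email_lead_hours: Optional[int]  # hours before the step; None if unspecified


SCHEDULE = [
    DunningStep("attempt_1", 0, None, None),
    DunningStep("attempt_2", 3, None, None),
    DunningStep("attempt_3", 7, "retry_warning", 24),
    DunningStep("attempt_4", 14, "past_due_final", None),
    DunningStep("cancel", 21, None, None),
]


def due_at(renewal: datetime, step: DunningStep) -> datetime:
    return renewal + timedelta(days=step.offset_days)


def email_due_at(renewal: datetime, step: DunningStep) -> Optional[datetime]:
    """When the step's email goes out, if the schedule pins that down."""
    if step.email is None or step.email_lead_hours is None:
        return None
    return due_at(renewal, step) - timedelta(hours=step.email_lead_hours)


def step_after(name: str) -> Optional[DunningStep]:
    """The step that follows a failed one, or None after cancel."""
    names = [s.name for s in SCHEDULE]
    i = names.index(name)
    return SCHEDULE[i + 1] if i + 1 < len(SCHEDULE) else None
```

A state table in the spec would add the side-effect columns (webhook, access flag) to each row; the point of the sketch is only that the schedule is a list someone can diff.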
The Webhook Is the Source of Truth
I write this as a non-negotiable clause: the synchronous response from the processor is advisory. The webhook is the ledger. The spec must forbid any state transition that is derived only from the API response — everything meaningful (capture confirmed, refund settled, dispute opened, 3DS completed) has to wait for the corresponding event.
This has a concrete consequence: the spec needs an outbox or a reconciliation job. If the webhook is delayed, the UI may show "processing" longer than the customer expects. The spec owns that tradeoff and picks a timeout after which the job polls the processor directly. I pick 30 seconds for interactive flows and 15 minutes for background ones.
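The timeout rule can be sketched as a single predicate for the reconciliation job. The flow labels and the should_poll name are hypothetical; the two thresholds are the ones given above.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the text: how long to wait for the webhook before polling.
POLL_AFTER = {
    "interactive": timedelta(seconds=30),
    "background": timedelta(minutes=15),
}


def should_poll(flow: str, submitted_at: datetime,
                webhook_seen: bool, now: datetime) -> bool:
    """True when the reconciliation job should ask the processor directly."""
    if webhook_seen:
        return False  # the event already arrived; nothing to reconcile
    return now - submitted_at >= POLL_AFTER[flow]


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
assert should_poll("interactive", start, False, start + timedelta(seconds=31))
assert not should_poll("background", start, False, start + timedelta(minutes=5))
```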
Acceptance Criteria, With a Real Retry Scenario
- Given a customer with a Visa ending 4242 and a recurring $29 subscription
  When the renewal charge returns card_declined / insufficient_funds
  Then the payment is marked past_due
  And attempt 2 is scheduled for +3 days with the same payment method
  And no email is sent on this attempt
  And the customer retains access until the grace period expires
- Given attempt 3 has just failed with the same decline_code
  When the dunning job runs
  Then a past_due_final email is sent
  And attempt 4 is scheduled for +14 days
  And the subscription remains active until attempt 4 resolves
- Given the client receives a connection timeout on charge creation
  When the client retries within 24 hours
  Then it reuses the original Idempotency-Key
  And the processor returns the original charge, not a duplicate
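The timeout scenario can be exercised against an in-memory stand-in for the processor. FakeProcessor is a hypothetical test double, not any real SDK; it models only the one behavior the scenario depends on, namely that a repeated key returns the original charge.

```python
import itertools


class FakeProcessor:
    """Test double: deduplicates charge creation by idempotency key."""

    def __init__(self):
        self._by_key = {}
        self._ids = itertools.count(1)

    def create_charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A key seen before returns the stored charge instead of a new one.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        charge = {"id": f"ch_{next(self._ids)}", "amount": amount_cents}
        self._by_key[idempotency_key] = charge
        return charge

    def charge_count(self) -> int:
        return len(self._by_key)


def test_retry_after_timeout_returns_original_charge():
    processor = FakeProcessor()
    original = processor.create_charge("attempt-key-1", 2900)
    # The client never saw the response, so it retries with the same key.
    retried = processor.create_charge("attempt-key-1", 2900)
    assert retried["id"] == original["id"]
    assert processor.charge_count() == 1


test_retry_after_timeout_returns_original_charge()
```

A fuller double would also cover the mismatch case, since some processors reject a reused key whose request parameters differ from the original.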
Observability the Spec Has to Name
Three metrics I refuse to let a payment spec ship without: authorization rate broken down by BIN range and card scheme; decline-reason distribution with the processor's raw decline_code preserved (not bucketed into "declined"); and 3DS drop-off measured as challenges initiated versus challenges completed. If any of these are missing, the team is flying blind the first time a single issuer changes its risk model and tanks your approval rate overnight.
I also require a dashboard for webhook lag — the gap between the processor's event timestamp and our ingestion timestamp. A growing lag is usually the earliest signal that something in the payment pipeline is about to page someone.
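Webhook lag is a subtraction, but it is worth pinning down which two timestamps are involved. A minimal sketch with hypothetical function names; clamping negative values to zero is my assumption for handling clock skew between the processor and the ingestion host.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def webhook_lag_seconds(event_created_at: datetime, ingested_at: datetime) -> float:
    """Gap between the processor's event timestamp and our ingestion time."""
    # Clamp at zero so clock skew cannot produce a negative lag (assumption).
    return max(0.0, (ingested_at - event_created_at).total_seconds())


def lag_summary(lags: list) -> dict:
    """Two numbers a dashboard panel could plot per time bucket."""
    return {"median": median(lags), "max": max(lags)}


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
lags = [webhook_lag_seconds(created, created + timedelta(seconds=s)) for s in (1, 2, 9)]
summary = lag_summary(lags)
```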
The Takeaway I Give Every Team
A payment spec is not "describe the charge endpoint." It is a failure-handling document with a small happy path attached. Get the taxonomy right, attach a retry rule to each row, treat 3DS as a branch instead of an error, and let the webhook be the source of truth. Everything else — the dunning copy, the dashboards, the refund flows — falls out of those four decisions. Skip them and you will spend the next two quarters patching symptoms.
Contract Review Packet to Copy
Use this when a payment workflow can fail in more than one place. The packet names retries, reversals, duplicate events, and the evidence each path needs.
API contract review packet: Payment Workflow Spec: Failure and Retry Matrix
Decision to make:
- Write payment workflow specs with retryable errors, declined-card handling, timeout behavior, 3DS branches, and dunning states.
Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:
Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:
Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:
Contract boundary: no release without compatibility classification, consumer impact, retry behavior, and rollback notes.
Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?
Case study: preventing a double capture
A payment retry path was safe in unit tests but unsafe in the provider timeline. The review case made "retry" concrete enough to test against a duplicate webhook and a pending capture.
| Moment | Risk | Spec gate |
|---|---|---|
| Client retries after timeout | Second capture attempt. | Same idempotency key returns original capture_id. |
| Webhook arrives before API response | State overwritten by stale response. | Provider event version wins over client polling. |
| Provider reports pending | User sees success too early. | UI shows pending and blocks another submit. |
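The second and third gates can be sketched as two update paths with different authority. This assumes the provider's events carry a monotonically increasing version number, which not every provider offers; PaymentRecord and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PaymentRecord:
    status: str = "pending"
    event_version: int = 0  # 0 means no provider event has been applied yet


def apply_provider_event(record: PaymentRecord, status: str, version: int) -> bool:
    """Apply a webhook event; duplicates and out-of-order events are dropped."""
    if version <= record.event_version:
        return False
    record.status = status
    record.event_version = version
    return True


def apply_api_response(record: PaymentRecord, status: str) -> bool:
    """Apply a synchronous response only if no provider event has landed."""
    if record.event_version > 0:
        return False  # stale: the provider event wins
    record.status = status
    return True


def can_submit_again(record: PaymentRecord) -> bool:
    """The UI blocks another submit while the payment is still pending."""
    return record.status != "pending"
```

For example, if the webhook with version 1 arrives first and sets the status to captured, a late API response reporting pending is rejected, and a second delivery of the same webhook is a no-op.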