Payment Workflow Spec: Failure and Retry Matrix

Most payment specs I have reviewed describe the happy path in three pages and the failure behavior in one sentence: "retry on error." That sentence is where 80% of the production incidents come from. A payment workflow spec earns its keep by naming, category by category, exactly what the system does when the card network, the issuer, or the customer refuses to cooperate.


Published on 2026-03-03 · Updated 2026-06-02 · 8 min read · Author: Spec Coding Editorial Team · Review policy: Editorial Policy

Start With a Failure Taxonomy, Not a Flowchart

Before I draw a single box, I force the spec to answer one question: what are the categories of failure this workflow can produce? I insist on five, because collapsing them into "error" is what creates the mess.

Network timeout. The processor never answered. The charge may or may not exist. This is the only category where retrying with the same idempotency key is mandatory.
Soft decline. The issuer said no for a recoverable reason: insufficient funds, do-not-honor, expired card. Retry is allowed, but only after a customer action or in a later attempt window.
Hard decline. Stolen card, pickup card, fraudulent. Retrying is never correct and on some networks it increases your risk score. The spec must forbid it.
Fraud review. The processor accepted the auth but punted on a decision. The response is async. The spec must describe the waiting state and the webhook that ends it.
3DS challenge. The issuer demanded Strong Customer Authentication. This is not a failure — it is a branch. The customer sees a redirect or in-flow iframe and the workflow pauses until they finish.

Every downstream decision in the spec — retry policy, user messaging, observability — hangs off this five-row matrix. If the taxonomy is wrong, nothing below it will save you.
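One way to keep the five rows honest is to write them down as data that the rest of the system has to look up. The sketch below is only an illustration of that idea: the names FailureCategory, CategoryRule, and RETRY_MATRIX are hypothetical, and the rule strings simply restate the taxonomy above.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    NETWORK_TIMEOUT = "network_timeout"
    SOFT_DECLINE = "soft_decline"
    HARD_DECLINE = "hard_decline"
    FRAUD_REVIEW = "fraud_review"
    THREE_DS_CHALLENGE = "three_ds_challenge"


@dataclass(frozen=True)
class CategoryRule:
    retry_rule: str   # what the spec says about retrying
    is_branch: bool   # True when the category is a branch, not a failure


# Hypothetical matrix: one row per category, wording taken from the taxonomy.
RETRY_MATRIX = {
    FailureCategory.NETWORK_TIMEOUT: CategoryRule(
        "mandatory, with the same idempotency key", False),
    FailureCategory.SOFT_DECLINE: CategoryRule(
        "allowed only after customer action or in a later attempt window", False),
    FailureCategory.HARD_DECLINE: CategoryRule("forbidden", False),
    FailureCategory.FRAUD_REVIEW: CategoryRule(
        "none; wait for the webhook that ends the review", False),
    FailureCategory.THREE_DS_CHALLENGE: CategoryRule(
        "none; pause until the customer finishes the challenge", True),
}

# Guard against a category being added without a rule.
assert set(RETRY_MATRIX) == set(FailureCategory)


def rule_for(category: FailureCategory) -> CategoryRule:
    return RETRY_MATRIX[category]
```

The module-level assertion is the useful part: adding a sixth category without a rule fails at import time instead of in production.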

Retry Rules That Actually Match the Category

Here is the rule I write into every payment spec, word for word: retry policy is a function of failure category, not of HTTP status. A 402 from Stripe can be either a soft decline you should let the customer fix or a hard decline you should never touch again. The spec has to branch on the processor's decline code, not the transport code.

Concretely: for a Stripe response of card_declined with decline_code: insufficient_funds, I permit up to three retries spaced by the dunning schedule below, each one gated on either a customer-initiated action or a scheduled job. For card_declined with decline_code: stolen_card, the spec sets a permanent flag on the payment method and any subsequent attempt must fail closed before hitting the network. For a connection error or an HTTP 5xx with no response body, the spec requires an immediate retry with the same Idempotency-Key, because the processor may have already charged the card and a fresh key would double-charge.
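The branching described above can be sketched as a small pure function. This is a minimal illustration, not a processor integration: retry_decision and the returned action names are hypothetical, the two code sets are examples rather than a complete list, and the handling of unrecognized decline codes is my assumption (the rule above does not cover them).

```python
from typing import Optional

# Illustrative sets only; a real spec would enumerate every code it handles.
SOFT_DECLINE_CODES = {"insufficient_funds", "do_not_honor"}
HARD_DECLINE_CODES = {"stolen_card", "pickup_card", "fraudulent"}


def retry_decision(http_status: Optional[int],
                   code: Optional[str],
                   decline_code: Optional[str]) -> str:
    """Map a processor response to an action.

    http_status is None when the connection failed before any response.
    code is None when the response carried no parseable error body.
    """
    # Transport-level failure: the charge may already exist, so the only
    # safe move is to retransmit with the same Idempotency-Key.
    if http_status is None or (http_status >= 500 and code is None):
        return "retry_same_idempotency_key"

    if code == "card_declined":
        if decline_code in HARD_DECLINE_CODES:
            # Flag the payment method; later attempts fail before the network.
            return "block_payment_method"
        if decline_code in SOFT_DECLINE_CODES:
            # Up to three retries, gated on customer action or a scheduled job.
            return "retry_via_dunning"
        # Assumption: unknown decline codes are never retried automatically.
        return "needs_classification"

    return "no_retry"
```

Note that the HTTP status only matters for the transport case. Two responses with the same 402 land on different actions because the function reads the decline code.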

Idempotency Keys Belong to the Attempt

The single most common mistake I see: idempotency keys scoped to the HTTP request instead of the logical attempt. If a timeout triggers a retry and the retry generates a new key, the processor treats it as a new charge. The spec must say, in one sentence, that the key is minted when the attempt begins and survives every retransmission inside that attempt. A new attempt — meaning a new customer action, a new dunning cycle, or a new order — gets a new key. Nothing in between does.
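A minimal sketch of attempt-scoped keys, assuming a processor that deduplicates on the key. Attempt, FakeProcessor, and send_with_retransmits are hypothetical names; FakeProcessor is an in-memory stand-in, not a real client.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Attempt:
    """One logical attempt. The key is minted once, when the attempt begins."""
    order_id: str
    amount_cents: int
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))


class FakeProcessor:
    """In-memory stand-in that returns the original charge for a repeated key."""

    def __init__(self) -> None:
        self.charges: dict = {}

    def create_charge(self, key: str, amount_cents: int) -> dict:
        if key not in self.charges:
            charge_id = f"ch_{len(self.charges) + 1}"
            self.charges[key] = {"id": charge_id, "amount_cents": amount_cents}
        return self.charges[key]


def send_with_retransmits(processor: FakeProcessor, attempt: Attempt,
                          transmissions: int) -> dict:
    """Send the same attempt several times, as a timeout-driven retry would."""
    result = {}
    for _ in range(transmissions):
        # Every retransmission reuses the attempt's key; none mints a new one.
        result = processor.create_charge(attempt.idempotency_key,
                                         attempt.amount_cents)
    return result
```

Three transmissions of one Attempt leave a single charge on the fake processor. Constructing a second Attempt for the same order mints a new key and therefore a second charge, which is exactly the boundary the spec sentence is meant to draw.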

I also spec the key's lifetime: processors typically honor keys for 24 hours. If a retry crosses that boundary, the spec needs to reconcile against the processor's ledger (list charges by metadata) rather than assume the retry is safe.
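The lifetime rule reduces to one comparison. The sketch below assumes the 24-hour window mentioned above and hypothetical names (KEY_LIFETIME, next_step_for_retry); check the actual window your processor documents before relying on the constant.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the typical 24-hour window; processors differ.
KEY_LIFETIME = timedelta(hours=24)


def next_step_for_retry(key_minted_at: datetime, now: datetime,
                        key_lifetime: timedelta = KEY_LIFETIME) -> str:
    """Decide whether a retransmission can still rely on the idempotency key."""
    if now - key_minted_at < key_lifetime:
        return "retry_with_same_key"
    # Past the window the processor may no longer recognize the key, so look
    # the charge up in the processor's ledger before sending anything.
    return "reconcile_against_ledger"


minted = datetime(2026, 3, 3, 12, 0, tzinfo=timezone.utc)
print(next_step_for_retry(minted, minted + timedelta(hours=2)))
print(next_step_for_retry(minted, minted + timedelta(hours=25)))
```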

The 3DS Branch Is a First-Class State

SCA is not an error. If the spec treats it as one, the frontend will do something stupid like show a red banner while the issuer is mid-challenge. The spec needs a state called requires_action (or whatever your processor calls it) with explicit transitions: entered when the auth returns a challenge URL, exited when the webhook confirms success or failure.

I spec two flavors separately. In-flow challenge: the client SDK mounts the iframe, blocks interaction, and resolves. Redirect challenge: the browser navigates to the issuer's ACS URL and comes back to a return URL we control. The spec nails down the return URL, what query params we expect, and what happens if the customer closes the tab mid-challenge. That last one always gets forgotten, and it is the case that produces stuck subscriptions in production.
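Written as a transition table, the 3DS branch might look like the sketch below. The state and event names other than requires_action are hypothetical, and the challenge_timed_out row is my assumption about how a spec could resolve the closed-tab case; the text above only says the spec must decide it.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("processing", "challenge_required"): "requires_action",
    # The browser coming back to the return URL proves nothing by itself;
    # the state only exits when the webhook confirms an outcome.
    ("requires_action", "customer_returned"): "requires_action",
    ("requires_action", "webhook_succeeded"): "succeeded",
    ("requires_action", "webhook_failed"): "failed",
    # Assumption: an abandoned challenge is closed out by a timeout job.
    ("requires_action", "challenge_timed_out"): "failed",
}


def transition(state: str, event: str) -> str:
    """Return the next state, or raise if the spec has no such transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None
```

Raising on an unknown pair is deliberate: a transition the table does not name is a gap in the spec, and it is better to find it in a test than as a stuck subscription.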

Auth, Capture, and the Seven-Day Cliff

If your workflow does auth-now / capture-later, the spec must call out the auth expiry. Most processors auto-void an uncaptured auth at roughly seven days (Stripe is 7, Adyen varies by scheme, some schemes are shorter for debit). The spec needs to answer: what happens if the fulfillment job runs on day 8? My answer is always the same — the spec requires a fresh auth before capture attempts beyond day 5, and it treats any capture attempt against an expired auth as a hard failure that opens a new authorization, not a retry.
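The day-5 and day-7 rules can be sketched as a single function of the authorization's age. The names and action strings here are hypothetical, and the seven-day constant is the rough figure from above, not a guarantee for any particular processor or scheme.

```python
from datetime import datetime, timedelta, timezone

REAUTH_AFTER = timedelta(days=5)        # fresh auth required beyond this age
AUTH_EXPIRES_AFTER = timedelta(days=7)  # assumption: varies by processor/scheme


def capture_plan(authorized_at: datetime, now: datetime) -> str:
    """Decide how a capture job should proceed for an existing authorization."""
    age = now - authorized_at
    if age >= AUTH_EXPIRES_AFTER:
        # Expired auth: a hard failure that opens a new authorization.
        # It is not retried against the old one.
        return "fail_and_open_new_authorization"
    if age > REAUTH_AFTER:
        return "reauthorize_then_capture"
    return "capture"


auth_time = datetime(2026, 3, 1, 9, 0, tzinfo=timezone.utc)
for day in (3, 6, 8):
    print(day, capture_plan(auth_time, auth_time + timedelta(days=day)))
```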

Multi-capture makes this worse. If you are capturing in pieces against a single auth, the spec must state the partial-capture order, whether over-capture is permitted (it usually is not), and how refund-before-final-capture interacts with the remaining authorized amount. I have watched teams discover at 2am that their "simple" refund reduced the capturable balance to zero and killed the next shipment.
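To make the refund question concrete, here is a small ledger sketch. Everything in it is hypothetical, including the AuthLedger name and the refund_reduces_authorization flag; how a refund actually affects the remaining authorization depends on the processor, which is the reason the spec has to state it.

```python
class AuthLedger:
    """Tracks partial captures and refunds against one authorization."""

    def __init__(self, authorized_cents: int,
                 refund_reduces_authorization: bool) -> None:
        self.authorized_cents = authorized_cents
        self.captured_cents = 0
        self.refunded_cents = 0
        # Hypothetical flag standing in for processor-specific behavior.
        self.refund_reduces_authorization = refund_reduces_authorization

    @property
    def capturable_cents(self) -> int:
        return self.authorized_cents - self.captured_cents

    def capture(self, amount_cents: int) -> None:
        if amount_cents <= 0:
            raise ValueError("capture amount must be positive")
        if amount_cents > self.capturable_cents:
            raise ValueError("over-capture is not permitted")
        self.captured_cents += amount_cents

    def refund(self, amount_cents: int) -> None:
        if amount_cents <= 0:
            raise ValueError("refund amount must be positive")
        if amount_cents > self.captured_cents - self.refunded_cents:
            raise ValueError("cannot refund more than was captured")
        self.refunded_cents += amount_cents
        if self.refund_reduces_authorization:
            self.authorized_cents -= amount_cents
```

With a $100 auth, a $50 capture, and a $50 refund, the capturable balance is either $50 or zero depending on that one flag. That is the 2am surprise in miniature.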

Dunning Is a State Machine, Write It Down

For subscription failures the spec should contain the dunning schedule verbatim, not a vague "we will retry." The schedule I default to:

Attempt 1: immediate, at renewal.
Attempt 2: +3 days, silent retry.
Attempt 3: +7 days, preceded by an email 24 hours earlier.
Attempt 4: +14 days, final notice email.
Cancel: +21 days, subscription moves to canceled, access revoked on the next billing cycle boundary.

Each transition is a row in a state table: previous state, trigger, new state, side effects (email, webhook, access flag). Without this table the team argues the schedule every quarter.
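The schedule can live in code as the same table. The sketch below uses hypothetical names (Step, SCHEDULE, next_step, due_date) and assumes the day offsets count from the renewal date; if your schedule counts from the previous attempt, the due_date arithmetic changes.

```python
from datetime import date, timedelta
from typing import NamedTuple, Optional


class Step(NamedTuple):
    name: str
    offset_days: int   # assumption: days after the renewal date
    side_effect: str


SCHEDULE = [
    Step("attempt_1", 0, "charge at renewal"),
    Step("attempt_2", 3, "silent retry"),
    Step("attempt_3", 7, "retry, preceded by an email 24 hours earlier"),
    Step("attempt_4", 14, "retry, final notice email"),
    Step("cancel", 21, "subscription moves to canceled"),
]


def next_step(failed_step: str) -> Optional[Step]:
    """Return the step that follows a failed one, or None after cancel."""
    names = [step.name for step in SCHEDULE]
    index = names.index(failed_step)  # raises ValueError for unknown steps
    if index + 1 < len(SCHEDULE):
        return SCHEDULE[index + 1]
    return None


def due_date(renewal: date, step: Step) -> date:
    return renewal + timedelta(days=step.offset_days)
```

Because the schedule is a list, changing it is a one-line diff that shows up in review, which is where the quarterly argument belongs.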

The Webhook Is the Source of Truth

I write this as a non-negotiable clause: the synchronous response from the processor is advisory. The webhook is the ledger. The spec must forbid any state transition that is derived only from the API response — everything meaningful (capture confirmed, refund settled, dispute opened, 3DS completed) has to wait for the corresponding event.

This has a concrete consequence: the spec needs an outbox or a reconciliation job. If the webhook is delayed, the UI may show "processing" longer than the customer expects. The spec owns that tradeoff and picks a timeout after which the job polls the processor directly. I pick 30 seconds for interactive flows and 15 minutes for background ones.
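The polling fallback is a small decision, sketched here with hypothetical names (POLL_AFTER, should_poll). The two thresholds are the ones I pick above; yours may differ.

```python
from datetime import datetime, timedelta, timezone

POLL_AFTER = {
    "interactive": timedelta(seconds=30),
    "background": timedelta(minutes=15),
}


def should_poll(flow: str, submitted_at: datetime, now: datetime,
                webhook_received: bool) -> bool:
    """True when the reconciliation job should ask the processor directly."""
    if webhook_received:
        # The webhook already settled the state; polling adds nothing.
        return False
    return now - submitted_at >= POLL_AFTER[flow]


start = datetime(2026, 3, 3, 12, 0, tzinfo=timezone.utc)
print(should_poll("interactive", start, start + timedelta(seconds=45), False))
print(should_poll("background", start, start + timedelta(seconds=45), False))
```

Polling here is a fallback for a missing event, not a second source of truth. Once the webhook arrives, the function stops asking.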

Acceptance Criteria, With a Real Retry Scenario

- Given a customer with a Visa ending 4242 and a recurring $29 subscription
  When the renewal charge returns card_declined / insufficient_funds
  Then the payment is marked past_due
    And attempt 2 is scheduled for +3 days with the same payment method
    And no email is sent on this attempt
    And the customer retains access until the grace period expires

- Given attempt 3 has just failed with the same decline_code
  When the dunning job runs
  Then a past_due_final email is sent
    And attempt 4 is scheduled for +14 days
    And the subscription remains active until attempt 4 resolves

- Given the client receives a connection timeout on charge creation
  When the client retries within 24 hours
  Then it reuses the original Idempotency-Key
    And the processor returns the original charge, not a duplicate

Observability the Spec Has to Name

Three metrics I refuse to let a payment spec ship without: authorization rate broken down by BIN range and card scheme; decline-reason distribution with the processor's raw decline_code preserved (not bucketed into "declined"); and 3DS drop-off measured as challenges initiated versus challenges completed. If any of these are missing, the team is flying blind the first time a single issuer changes its risk model and tanks your approval rate overnight.
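Two of these metrics are simple enough to sketch directly. The function names are hypothetical, and a real pipeline would compute them in the metrics backend rather than in application code; the point is only that the raw decline_code is the grouping key.

```python
from collections import Counter
from typing import Dict, Iterable


def decline_distribution(decline_codes: Iterable[str]) -> Dict[str, float]:
    """Share of declines per raw decline_code, with no bucketing."""
    counts = Counter(decline_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


def three_ds_drop_off(initiated: int, completed: int) -> float:
    """Fraction of initiated challenges that were never completed."""
    if initiated == 0:
        return 0.0
    return (initiated - completed) / initiated


sample = ["insufficient_funds", "insufficient_funds",
          "stolen_card", "do_not_honor"]
print(decline_distribution(sample))
print(three_ds_drop_off(200, 150))
```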

I also require a dashboard for webhook lag — the gap between the processor's event timestamp and our ingestion timestamp. A growing lag is usually the earliest signal that something in the payment pipeline is about to page someone.
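Webhook lag is a subtraction, and "growing" needs a definition before anyone can alert on it. The sketch below uses hypothetical names and one possible definition, comparing the mean of the latest few samples with the mean of the few before them; it is an assumption, not the only reasonable choice.

```python
from datetime import datetime, timezone
from statistics import mean
from typing import Sequence


def webhook_lag_seconds(event_created_at: datetime,
                        ingested_at: datetime) -> float:
    """Gap between the processor's event timestamp and our ingestion time."""
    return (ingested_at - event_created_at).total_seconds()


def lag_is_growing(lags: Sequence[float], window: int = 3) -> bool:
    """True when the latest window's mean lag exceeds the previous window's."""
    if len(lags) < 2 * window:
        return False  # not enough samples to compare two windows
    return mean(lags[-window:]) > mean(lags[-2 * window:-window])
```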

The Takeaway I Give Every Team

A payment spec is not "describe the charge endpoint." It is a failure-handling document with a small happy path attached. Get the taxonomy right, attach a retry rule to each row, treat 3DS as a branch instead of an error, and let the webhook be the source of truth. Everything else — the dunning copy, the dashboards, the refund flows — falls out of those four decisions. Skip them and you will spend the next two quarters patching symptoms.

Contract Review Packet to Copy

Use this when a payment workflow can fail in more than one place. The packet names retries, reversals, duplicate events, and the evidence each path needs.

API contract review packet: Payment Workflow Spec: Failure and Retry Matrix

Decision to make:
- Write payment workflow specs with retryable errors, declined-card handling, timeout behavior, 3DS branches, and dunning states.

Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:

Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:

Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:

Contract boundary: no release without compatibility classification, consumer impact, retry behavior, and rollback notes.

Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?

Case study: preventing a double capture

A payment retry path was safe in unit tests but unsafe in the provider timeline. The review case made "retry" concrete enough to test against a duplicate webhook and a pending capture.

Moment	Risk	Spec gate
Client retries after timeout	Second capture attempt.	Same idempotency key returns original capture_id.
Webhook arrives before API response	State overwritten by stale response.	Provider event version wins over client polling.
Provider reports pending	User sees success too early.	UI shows pending and blocks another submit.
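The second row of the table, plus the duplicate-webhook case, can be sketched as a version check and an event-id check. This assumes the provider exposes a monotonically increasing version on both events and API responses, which not every provider does; PaymentRecord, apply_event, and apply_api_response are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentRecord:
    status: str = "pending"
    version: int = 0
    seen_event_ids: set = field(default_factory=set)


def apply_event(record: PaymentRecord, event_id: str,
                version: int, status: str) -> bool:
    """Apply a webhook event. Returns True only if the state changed."""
    if event_id in record.seen_event_ids:
        return False  # duplicate delivery
    record.seen_event_ids.add(event_id)
    if version <= record.version:
        return False  # out-of-order event, already superseded
    record.status = status
    record.version = version
    return True


def apply_api_response(record: PaymentRecord, version: int,
                       status: str) -> bool:
    """Apply a synchronous or polled response, never over a newer event."""
    if version <= record.version:
        return False  # stale response must not overwrite newer state
    record.status = status
    record.version = version
    return True
```

If a webhook at version 2 lands first and the slower API response then reports version 1, the response is dropped and the record keeps the webhook's state. A second delivery of the same event id is a no-op.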

Keywords: payment workflow spec · idempotency key · 3DS challenge · dunning state machine · decline code taxonomy · webhook source of truth
