CoreInnovate

Observability for Transaction-Critical Systems

February 20, 2025·10 min read

Monitoring dashboards and alerting thresholds are not the same as observability. Payment systems need distributed tracing, structured logging, and SLO-based alerting to be truly observable.

There is a meaningful distinction between monitoring and observability, and it matters most in systems where failures have direct financial consequences. Monitoring tells you that something is wrong. Observability tells you why.

Payment systems — gateways, wallet platforms, transaction processors — are where this distinction becomes a business problem. A monitoring system that fires an alert when authorization rate drops 2% tells you to act. An observable system tells you which PSP degraded, which BIN ranges are affected, which merchant categories are experiencing elevated decline rates, and which upstream dependency changed in the last deployment.

The three pillars in a payment context

Metrics tell you what is happening at the aggregate level. For payment systems, the minimum viable metric set is authorization rate (by PSP, by card scheme, by merchant), transaction latency at p50/p95/p99, error rate by error type, and queue depths for any asynchronous processing.
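
As a rough illustration of that metric set, the sketch below computes authorization rate by a chosen dimension and latency percentiles from a list of in-memory transaction records. The `Txn` fields, the PSP labels, and the sample values are all hypothetical; a production system would more likely emit counters and histograms to a metrics backend than aggregate raw records like this.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Txn:
    psp: str           # hypothetical PSP label
    scheme: str        # card scheme, e.g. "visa"
    approved: bool     # True if the authorization was approved
    latency_ms: float  # end-to-end latency of the request


def authorization_rate_by(txns, key):
    """Return {dimension value: approved / total} for the given attribute."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for t in txns:
        value = getattr(t, key)
        totals[value] += 1
        approved[value] += t.approved
    return {value: approved[value] / totals[value] for value in totals}


def latency_percentiles(txns):
    """Return p50/p95/p99 latency (needs at least two data points)."""
    cuts = quantiles([t.latency_ms for t in txns], n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Made-up sample data, for illustration only.
sample = [
    Txn("psp_a", "visa", True, 120.0),
    Txn("psp_a", "visa", True, 180.0),
    Txn("psp_a", "mastercard", False, 900.0),
    Txn("psp_b", "visa", True, 150.0),
]
rates_by_psp = authorization_rate_by(sample, "psp")
rates_by_scheme = authorization_rate_by(sample, "scheme")
latency = latency_percentiles(sample)
```

The same `authorization_rate_by` call works for any attribute on the record, which is the point: the dimensions you can slice by are the dimensions you recorded.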

Traces tell you what happened for a specific transaction. Every transaction should have a trace that spans from the initial API request through all downstream calls — PSP authorization, fraud check, ledger update, webhook dispatch — with timing and outcome at each step. When a transaction fails, the trace should tell you exactly where it failed and why.
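
To make the idea concrete, here is a minimal hand-rolled span recorder. It is a stand-in for a real tracing library, not a substitute for one, and the `Trace` class, the step names, and the `process_payment` function are assumptions made for this sketch.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects timing and outcome for each step of one transaction."""

    def __init__(self, transaction_id):
        self.transaction_id = transaction_id
        self.trace_id = uuid.uuid4().hex
        self.spans = []  # spans are appended as they finish

    @contextmanager
    def span(self, name):
        record = {"name": name, "outcome": "ok", "error": None}
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["outcome"] = "error"
            record["error"] = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)

    def first_failure(self):
        """Return the first span to finish with an error, or None."""
        return next((s for s in self.spans if s["outcome"] == "error"), None)


def process_payment(trace, fraud_check, authorize):
    # Hypothetical request flow: each downstream call gets its own span.
    with trace.span("api_request"):
        with trace.span("fraud_check"):
            fraud_check()
        with trace.span("psp_authorization"):
            authorize()


def psp_times_out():
    raise TimeoutError("PSP did not respond")


trace = Trace("txn_123")
try:
    process_payment(trace, fraud_check=lambda: None, authorize=psp_times_out)
except TimeoutError:
    pass

failure = trace.first_failure()
```

Because inner spans finish before the spans that enclose them, `first_failure()` returns the innermost failing step (`psp_authorization` here) rather than the outer `api_request` span that merely propagated the error.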

Logs tell you the details. Use structured logging (JSON) rather than unstructured text. Every log line should include the transaction ID, the service name, the operation, the outcome, and the relevant context (PSP name, amount, currency, merchant). Unstructured logs can be grepped, but they cannot be reliably filtered or aggregated by field at scale.
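
One way to get a JSON object per log line with Python's standard `logging` module is sketched below. The `JsonFormatter` class, the `context` attribute name, and the field names are choices made for this example; the in-memory stream is only there to keep the sketch self-contained, and a real service would write to stdout or a log shipper.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record;
        # this sketch keeps them under one hypothetical "context" key.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def log_payment_event(logger, **context):
    logger.info("payment_event", extra={"context": context})


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payments.gateway")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

log_payment_event(
    logger,
    transaction_id="txn_123",
    service="gateway",
    operation="authorize",
    outcome="declined",
    psp="psp_a",
    amount_minor=1999,
    currency="EUR",
    merchant_id="m_42",
)
line = stream.getvalue().strip()
```

Each field is now a key that a log store can index, so "all declined authorizations for `psp_a` in EUR" becomes a filter rather than a regular expression.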

SLO-based alerting

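Threshold-based alerting (for example, alerting whenever latency exceeds 500ms) fires on brief spikes that customers never notice and stays silent during gradual degradations that remain just under the threshold. SLO-based alerting fires when you're on track to burn through your error budget, the share of failures your SLO permits, which correlates much more closely with actual customer impact.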

Define your payment SLOs explicitly:

  • Authorization rate: 99.0% over 30 days (plus a daily SLO to catch fast burns sooner)
  • Transaction latency: p99 under 800ms for 99.5% of traffic
  • Webhook delivery: 99.5% delivered within 5 minutes

Set alert thresholds based on error budget burn rate, not raw values. An alert that fires when you're burning error budget at 5× the sustainable rate gives you time to respond before the SLO is breached.
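
The arithmetic behind burn rate is short enough to sketch. Burn rate is the observed error rate divided by the error rate the SLO allows, so a 99.0% target allows 1% failures and a 6% observed failure rate is a 6× burn. The function names, the two-window check, and the window lengths below are assumptions for illustration, not a prescribed policy.

```python
def burn_rate(failed, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def days_until_exhausted(slo_window_days, rate):
    """How long a full error budget lasts at a constant burn rate."""
    return float("inf") if rate <= 0 else slo_window_days / rate


def should_alert(short_window, long_window, slo_target, threshold=5.0):
    """Alert only when both windows burn at or above the threshold.

    Requiring a short and a long window to agree is one common way to
    avoid paging on a blip; each window is a (failed, total) pair.
    """
    return all(
        burn_rate(failed, total, slo_target) >= threshold
        for failed, total in (short_window, long_window)
    )


# Made-up counts: (failed, total) for the last 5 minutes and the last hour.
last_5_min = (12, 200)      # 6% failures, about 6x burn against a 99.0% SLO
last_hour = (600, 10_000)   # 6% failures, about 6x burn

alert = should_alert(last_5_min, last_hour, slo_target=0.99)
runway_days = days_until_exhausted(30, 5.0)
```

At a sustained 5× burn, a 30-day budget lasts 30 / 5 = 6 days, which is where the time to respond comes from.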

PSP-level observability

PSP degradations are the most common source of authorization rate drops, and most payment systems don't track them with enough granularity to respond quickly. You need to know within minutes which PSP is degrading, what error types it's returning, and whether a routing change would help.

Instrument every PSP call with its own span and its own metrics. Track request latency, error rate by error code, authorization rate by response code, and timeout rate. When a PSP's error rate exceeds a threshold, an automated or semi-automated process should decide whether to reroute traffic.
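
A minimal sketch of per-PSP tracking with a reroute decision might look like the following. The `PspWindow` class, the outcome labels, the window size, and both thresholds are hypothetical values chosen for the example; real thresholds depend on each PSP's normal behaviour.

```python
from collections import Counter, deque


class PspWindow:
    """Rolling window of recent call outcomes for one PSP."""

    def __init__(self, size=200):
        self.calls = deque(maxlen=size)  # oldest calls fall off automatically

    def record(self, outcome, code=None, latency_ms=0.0):
        # outcome is one of "approved", "declined", "error", "timeout"
        self.calls.append((outcome, code, latency_ms))

    def rate(self, outcome):
        if not self.calls:
            return 0.0
        hits = sum(1 for o, _, _ in self.calls if o == outcome)
        return hits / len(self.calls)

    def errors_by_code(self):
        return Counter(code for o, code, _ in self.calls if o == "error")


def routing_decision(window, min_calls=50, review_at=0.10, reroute_at=0.30):
    """Return "keep", "review" (ask a human) or "reroute" (automatic)."""
    if len(window.calls) < min_calls:
        return "keep"  # too little data to judge
    failing = window.rate("error") + window.rate("timeout")
    if failing >= reroute_at:
        return "reroute"
    if failing >= review_at:
        return "review"
    return "keep"


# Made-up traffic: one PSP returning some 503s, another timing out heavily.
degraded = PspWindow()
for _ in range(80):
    degraded.record("approved", code="00", latency_ms=140.0)
for _ in range(20):
    degraded.record("error", code="503", latency_ms=30.0)

failing_hard = PspWindow()
for _ in range(60):
    failing_hard.record("approved", code="00", latency_ms=150.0)
for _ in range(40):
    failing_hard.record("timeout", latency_ms=5000.0)

degraded_decision = routing_decision(degraded)
failing_decision = routing_decision(failing_hard)
```

The two thresholds model the "automated or semi-automated" split: a moderate failure rate asks for human review, while a severe one reroutes without waiting.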

Transaction audit logs as observability data

The immutable transaction log that your ledger requires for correctness is also observability data. Every state transition of a transaction — created, authorized, captured, settled, refunded, disputed — should be a structured event with a timestamp, the triggering action, and the resulting state.

This log serves three purposes: operational debugging (what happened to this transaction?), financial reconciliation (do our records match the PSP?), and regulatory reporting (what was the state of this account at this point in time?).
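
The sketch below shows one way such a log could be shaped: an append-only list of transition events that can answer both "what happened to this transaction?" and "what state was it in at time T?". The allowed-transition table is a simplification assumed for the example (real lifecycles include voids, partial captures, and more), and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed, simplified lifecycle; None means "no events recorded yet".
ALLOWED = {
    None: {"created"},
    "created": {"authorized"},
    "authorized": {"captured"},
    "captured": {"settled", "refunded"},
    "settled": {"refunded", "disputed"},
    "refunded": set(),
    "disputed": set(),
}


@dataclass(frozen=True)
class TransitionEvent:
    transaction_id: str
    timestamp: datetime
    action: str                # what triggered the transition
    from_state: Optional[str]
    to_state: str


class TransactionLog:
    """Append-only log; events are assumed to arrive in time order."""

    def __init__(self):
        self._events = []

    def append(self, transaction_id, action, to_state, at=None):
        current = self.state_at(transaction_id)
        if to_state not in ALLOWED[current]:
            raise ValueError(f"{current} -> {to_state} is not allowed")
        timestamp = at or datetime.now(timezone.utc)
        event = TransitionEvent(transaction_id, timestamp, action, current, to_state)
        self._events.append(event)
        return event

    def history(self, transaction_id):
        return [e for e in self._events if e.transaction_id == transaction_id]

    def state_at(self, transaction_id, when=None):
        """State at `when`, or the latest state if `when` is omitted."""
        state = None
        for event in self.history(transaction_id):
            if when is not None and event.timestamp > when:
                break
            state = event.to_state
        return state


log = TransactionLog()
t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
log.append("txn_123", "api_request", "created", at=t0)
log.append("txn_123", "psp_approved", "authorized", at=t0 + timedelta(seconds=2))
log.append("txn_123", "merchant_capture", "captured", at=t0 + timedelta(hours=1))

state_now = log.state_at("txn_123")
state_after_5_min = log.state_at("txn_123", t0 + timedelta(minutes=5))
```

Rejecting illegal transitions at append time is what lets the same log serve the ledger's correctness requirement and the debugging use case at once.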

Incident response and observability

The test of an observability system is how long it takes to identify the root cause of an incident. If your mean time to identify (MTTI) is longer than 15 minutes for a significant degradation, your observability system needs work.

Define your incident investigation workflow and trace it through your tooling. For a sudden drop in authorization rate: which dashboard do you look at first? What query do you run to identify which PSP or BIN range is affected? How do you determine whether it's an application issue or an upstream dependency? The answers to these questions should be defined before an incident, not discovered during one.
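
As an example of the kind of query worth preparing in advance, the sketch below ranks the values of one dimension (PSP, BIN range, and so on) by how far their authorization rate fell between a baseline window and the current window. The record shape, the `make_rows` helper, and the sample numbers are all made up for illustration; in practice this would usually be a saved query in your metrics or log store.

```python
from collections import defaultdict


def rate_by(txns, dimension):
    """Authorization rate per value of `dimension`."""
    counts = defaultdict(lambda: [0, 0])  # value -> [approved, total]
    for txn in txns:
        bucket = counts[txn[dimension]]
        bucket[0] += txn["approved"]
        bucket[1] += 1
    return {value: a / n for value, (a, n) in counts.items()}


def largest_drops(baseline, current, dimension):
    """Rank dimension values by drop in authorization rate, worst first."""
    before = rate_by(baseline, dimension)
    after = rate_by(current, dimension)
    shared = before.keys() & after.keys()
    drops = {value: before[value] - after[value] for value in shared}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


def make_rows(psp, bin_range, approved, declined):
    # Helper that fabricates sample rows for the demonstration below.
    row = {"psp": psp, "bin_range": bin_range}
    return [dict(row, approved=True)] * approved + [dict(row, approved=False)] * declined


baseline = make_rows("psp_a", "411111", 9, 1) + make_rows("psp_b", "550000", 9, 1)
current = make_rows("psp_a", "411111", 9, 1) + make_rows("psp_b", "550000", 5, 5)

worst_psp, psp_drop = largest_drops(baseline, current, "psp")[0]
worst_bin, bin_drop = largest_drops(baseline, current, "bin_range")[0]
```

Running the same function over each dimension in turn is a reasonable first pass at the "which PSP or BIN range is affected?" question.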


If incidents in your payment system regularly take more than 30 minutes to understand, a platform assessment will identify the observability gaps and give you a prioritised remediation plan.
