ROOT CAUSE ANALYSIS
Peach Payments – Paysafe downtime
April 28, 2022
Incident Number 2022042801
Customers Affected Customers accessing the Paysafe UI
Start Time April 28, 2022 08:16:00 AM
End Time April 28, 2022 11:34:00 AM
Impact Duration minutes
At a glance…
• What was affected?
Ability to access the Paysafe UI was denied during this incident. Main processing platforms remained fully operational and unaffected as did the Paysafe API. Ongoing RCA of symptoms indicate that this is a recurrence of an incident which took place last year.
• What was the root cause?
A major contributor to the issue was identified as being code used for notification of Payment links being generated on PaySafe combined with a lack of appropriate indices on the associated database instance. Once indices were applied to mitigate the poor performance of the code, normalisation of resources on the server began immediately.
• What preventative actions are being taken?
The application is under code freeze with the exception being for security updates as required. The applied indices have been included in the application source for future deployments (if they occur).
Monitoring and notifications have been updated to include early-warning identifiers.
If you have any questions concerning the content of this RCA, please open a case via email@example.com
Alternatively, please contact your relevant Account Managers for alternative methods of contacting emergency support.
This root cause analysis (RCA) document is a follow-up to the incident detailed above, which resulted in an interruption to Peach Payments Paysafe services. All primary processing platforms remained operational and were unaffected by this event. Additional details are outlined in the following sections.
At 10.01 AM SAST merchant escalations identified an incident whereby Paysafe Web UI was failing to render. Application and infrastructure teams joined in war-room to investigate the cause.
Once missing indices were identified, they were applied and latency on the API calls to generate Payment Links began to drop immediately.
Services were restarted and fully restored at 11:34 AM SAST.
Full Root Cause Details
There were a combination of things which together caused this incident; missing indices on the database instance, slow downstream response times for SMTP systems accepting Payment Links emails, a large enough volume of Payment Links being generated and poor design in the code responsible for generating Payment Links.
When a downstream SMTP provider is slow, this generates a long tail of backlogged messages in the database and causes DB queries to become slow to the point of rendering the application unusable. Indexes which were absent from the database were critical to preventing the latency of queries on the specific tables included in a large run of "payment links" being generated via the API. The volume of Payment Links being generated triggers a threshold of latency which results in a runaway exhaustion of DB resources, leading to the query latency and subsequently the long-tail.
Planned Preventative Action Items
Target Date Action
April 29, 2022 DevOps team has implemented early-warning monitoring on the DB instance to identify the latency threshold in queries which could potentially lead to a similar event.
May 02, 2022 “Code Freeze” declared on the system with the exception of critical security updates as required.
To view our Peach Payments Status Page and to subscribe, please click on the below link