[EU] Can't log in to the platform
Updates
Incident Summary:
Users were unable to access the application.
Time of Incident:
Friday, 11:38 AM
Issue Detected:
Reports were received indicating that users could not log in to the application.
Initial Findings:
- Numerous alerts from the authentication service indicated connectivity issues with the database.
- An event log showed that a database failover occurred near the time of the incident.
Immediate Action Taken:
- All authentication service pods were restarted.
- After the restart, the new pods functioned correctly, and users were able to log in successfully.
Root Cause:
The application’s authentication and authorization mechanism relies on auth-service. Following the database failover, auth-service instances did not reconnect to the database endpoint, causing login failures. Although tests indicate that auth-service is designed to handle database failovers, the instances failed to re-establish their connections during this incident. The exact cause remains unclear, and further investigation is ongoing.
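For illustration only, the recovery behavior expected after a failover can be sketched as follows. This is not auth-service's actual code; `ReconnectingClient`, its parameters, and the `connect` callable are hypothetical names chosen for the example. The sketch assumes that a broken connection surfaces as a `ConnectionError`, at which point the stale handle is discarded and a fresh connection is opened with exponential backoff.

```python
import time
from typing import Callable, Optional


class ReconnectingClient:
    """Hypothetical wrapper that reopens a database connection after a failure."""

    def __init__(
        self,
        connect: Callable[[], object],
        max_attempts: int = 5,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._connect = connect          # assumed factory that opens a new connection
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep              # injectable so tests do not have to wait
        self._conn: Optional[object] = None

    def execute(self, operation: Callable[[object], object]) -> object:
        """Run operation(conn), reconnecting and retrying on ConnectionError."""
        last_error: Optional[Exception] = None
        for attempt in range(self._max_attempts):
            try:
                if self._conn is None:
                    # Open a fresh connection instead of reusing a stale handle.
                    self._conn = self._connect()
                return operation(self._conn)
            except ConnectionError as exc:
                last_error = exc
                self._conn = None  # discard the handle that just failed
                if attempt < self._max_attempts - 1:
                    # Exponential backoff: base, 2x base, 4x base, ...
                    self._sleep(self._base_delay * (2 ** attempt))
        raise ConnectionError(
            f"gave up after {self._max_attempts} attempts"
        ) from last_error
```

If the service resolves the database endpoint by name, opening a new connection after a failover also gives it the chance to pick up the new primary, which a long-lived stale connection would not do.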
Actions Taken:
- Added error logs to a dedicated channel to monitor similar cases in the future.
- Increased database memory allocation to reduce the likelihood of future failovers.
- Continued investigation into why auth-service did not recover its connection after the DB failover.
- Implemented synthetic and domain monitoring to proactively identify login issues.
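As a rough sketch of what a synthetic login check might look like (the monitoring actually deployed is not described here), the probe below runs a login attempt and signals an alert once failures repeat. `LoginProbe`, `attempt_login`, and the threshold of three consecutive failures are assumptions made for the example.

```python
from typing import Callable


class LoginProbe:
    """Hypothetical synthetic check that alerts after repeated login failures."""

    def __init__(self, attempt_login: Callable[[], bool], failure_threshold: int = 3) -> None:
        self._attempt_login = attempt_login      # assumed callable: True on successful login
        self._failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check_once(self) -> bool:
        """Run one probe; return True if an alert should fire."""
        try:
            ok = bool(self._attempt_login())
        except Exception:
            # An exception during the probe counts as a failed login.
            ok = False
        if ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self._failure_threshold
```

Requiring several consecutive failures before alerting is one way to avoid paging on a single transient error; the right threshold depends on how often the probe runs.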
A fix has been implemented and we are monitoring the results.