Service Availability Issue
Summary
On November 10, 2025, customers in the US region experienced a service disruption that affected both application and API availability. The issue began around 14:00 IDT and was caused by CPU saturation in the feature-flag service, a core feature management component, which led to instability and degraded performance. The team responded by increasing the service's CPU and memory allocations and adjusting scaling parameters, which stabilized the service. The system was confirmed stable later that day, and ongoing monitoring was put in place to ensure continued reliability.
Key Timeline (IDT)
November 10, 2025, 14:00 IDT: Service disruption began, impacting application and API availability.
November 10, 2025, 14:33–15:00 IDT: Investigation identified high resource usage in a feature management component; resource limits were increased and scaling adjustments made.
November 10, 2025, 16:27 IDT: System stability improved after resource and scaling changes.
November 10, 2025, 18:33 IDT: Service confirmed stable; monitoring continued.
November 11, 2025, onward: Further adjustments and monitoring to maintain stability.
Root Cause
The incident was triggered by CPU saturation in the feature-flag (feature management) service, which led to excessive resource consumption and instability. This was likely due to a combination of configuration settings (including resource limits and health-check timeouts) and increased request rates from backend services polling for flag updates.
Actions Taken
- Increased CPU and memory requests and limits for the affected component.
- Adjusted scaling parameters and added replicas to better handle load.
- Relaxed health-check settings (increased the liveness-probe timeout) to reduce unnecessary restarts (a sketch of these changes follows this list).
- Monitored system performance and confirmed restoration of normal service.
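For illustration only, the sketch below shows how capacity and probe changes of this kind could be applied to a Kubernetes Deployment using the official Python client. The deployment name, namespace, replica count, resource values, and probe settings are placeholders, not the actual production configuration.

```python
# Hypothetical sketch: apply the capacity and liveness-probe changes described
# above to a Kubernetes Deployment via the official Python client.
# All names and values below are placeholders, not production settings.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "replicas": 6,  # scale out (placeholder count)
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "feature-flag-service",  # placeholder container name
                        "resources": {
                            "requests": {"cpu": "1", "memory": "1Gi"},
                            "limits": {"cpu": "2", "memory": "2Gi"},
                        },
                        # Relax the liveness probe so brief CPU saturation does
                        # not trigger unnecessary restarts.
                        "livenessProbe": {
                            "timeoutSeconds": 10,
                            "failureThreshold": 6,
                        },
                    }
                ]
            }
        },
    }
}

# Strategic merge patch of the Deployment (name and namespace are placeholders).
apps.patch_namespaced_deployment(
    name="feature-flag-service",
    namespace="feature-flags",
    body=patch,
)
```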
Action Items
- Improve monitoring and alerting for resource usage spikes.
- Enable additional tracing and metrics collection for the feature management process.
- Lengthen backend service polling intervals to reduce load on the feature-flag service (see the sketch after this list).
- Review and optimize resource allocation settings to prevent recurrence.
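As a purely illustrative sketch of the polling change, backend clients could fetch flags on a longer, jittered schedule. The endpoint URL, interval values, and helper functions below are hypothetical; they do not reflect the actual feature-flag SDK or configuration in use.

```python
# Hypothetical sketch of a backend flag-polling loop with a longer, jittered
# interval. The endpoint, intervals, and helpers are illustrative placeholders,
# not the actual feature-flag SDK or configuration used in production.
import random
import time

import requests

FLAG_ENDPOINT = "https://flags.internal.example.com/api/flags"  # placeholder URL
BASE_INTERVAL_SECONDS = 60   # lengthened from a more aggressive default
JITTER_SECONDS = 15          # spread requests so clients do not poll in lockstep


def apply_flags(flags: dict) -> None:
    """Placeholder hook: store the latest flags for use by request handlers."""
    ...


def poll_flags_forever(session: requests.Session) -> None:
    """Fetch the current flag set on a relaxed, jittered schedule."""
    while True:
        try:
            response = session.get(FLAG_ENDPOINT, timeout=5)
            response.raise_for_status()
            apply_flags(response.json())
        except requests.RequestException:
            # On failure, keep the previously cached flags and retry next cycle.
            pass
        time.sleep(BASE_INTERVAL_SECONDS + random.uniform(0, JITTER_SECONDS))
```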
Updates
US region API timeouts were traced to CPU saturation in our feature-flag service. We increased capacity by raising CPU/memory requests and limits, scaling replicas, and increasing the liveness-probe timeout. We also lengthened client polling intervals to reduce load.
Service is stable. We're adding metrics and tracing to prevent recurrence.
We've mitigated the issue and increased resources to prevent further impact. The service is stable, and we're continuing to monitor while investigating the root cause.