Delays in CLI, PushEvents and Repositories Scans (SAST, Container and SCA)

Minor incident · API (US Environment) · Application/UI (US Environment)
2025-06-30 15:00 IDT · 1 day, 22 hours, 11 minutes

Updates

Post-mortem

Summary

On June 30, 2025, some customers experienced intermittent failures in processing GitHub webhook events.
This led to dropped notifications and occasional service errors. The problem stemmed from an internal configuration issue that caused excessive resource consumption within our systems.
Following this, on July 1, 2025, a backlog of processing jobs developed, specifically impacting certain types of scans. Some jobs became stuck in the queue, which prevented other jobs from being processed efficiently (a simplified illustration follows below).
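
The backlog behaved like head-of-line blocking: a worker that pulls jobs in order stalls on a job that can never complete, and everything queued behind it waits. The Python sketch below is a simplified, hypothetical model of that behavior; the queue layout, job fields, and worker loop are illustrative and are not our actual implementation.

    from collections import deque

    # Hypothetical job model: a job without its internal identifier can never complete.
    jobs = deque([
        {"name": "scan-1", "internal_id": "abc"},
        {"name": "scan-2", "internal_id": None},   # stuck: missing identifier
        {"name": "scan-3", "internal_id": "def"},  # blocked behind scan-2
    ])

    def process(job):
        if job["internal_id"] is None:
            return False  # cannot complete, so it stays at the head of the queue
        print(f"processed {job['name']}")
        return True

    # Naive in-order worker: it keeps retrying the head of the queue, so scan-3 never runs.
    attempts = 0
    while jobs and attempts < 5:      # attempt cap only so the sketch terminates
        if process(jobs[0]):
            jobs.popleft()
        else:
            attempts += 1             # head-of-line blocking: nothing behind it proceeds

    print("still queued:", [j["name"] for j in jobs])

In practice the remedy is to validate jobs before they are enqueued, or to move uncompletable jobs aside so later jobs are not starved, which is the spirit of the cleanup and fixes described below.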

Timeline

30.06.25 07:10 AM (GMT+3) – Our monitoring systems detected initial issues with external event processing.
30.06.25 09:39 AM (GMT+3) – Our engineering team began a deep investigation, examining recent system changes as a potential cause. During the review, we attempted to configure a new container base image; when this did not resolve the issue, we reverted to the previous base image. We also increased the memory allocation for the service in an effort to improve stability.
30.06.25 09:52 AM (GMT+3) – Despite initial adjustments to system resources, the issue persisted, prompting deeper investigation.
30.06.25 10:35 AM (GMT+3) – We observed intermittent service errors during external event processing. Services briefly recovered after restarts but then experienced further interruptions.
30.06.25 10:45 AM (GMT+3) – Analysis revealed recurring errors related to processing incoming event data.
30.06.25 10:49 AM (GMT+3) – We confirmed that our processing instances were overwhelmed and significantly increased their resource allocations.
30.06.25 10:59 AM (GMT+3) – The core issue was identified as a missing configuration required for proper operation.
30.06.25 11:13 AM (GMT+3) – A fix, which involved restoring the missing configuration, was deployed.
30.06.25 11:17 AM (GMT+3) – The system began to stabilize, and the flow of external event processing recovered.
30.06.25 12:30 PM (GMT+3) – While primary systems were stabilized, some customers reported lingering effects or continued issues.

Root Cause

  • June 30th Issue: The core problem was a missing internal configuration within our system. This led to a critical component repeatedly attempting and failing to initialize, causing excessive resource consumption (CPU and memory spikes) and ultimately leading to system crashes under load. The issue was indirectly introduced by an update to a related system dependency (a simplified sketch of this failure mode follows this list).
  • July 1st Issue: Following a data re-synchronization effort, our processing queues experienced a massive surge in jobs. While the queue initially appeared to be processing, a subset of these jobs was missing a critical internal identifier. This prevented them from completing, effectively “sticking” them in the queue and blocking other, lower-priority, or later-scheduled jobs from being processed.
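
To illustrate the June 30th failure mode in Python: a component that requires a configuration value but keeps retrying initialization when that value is absent will consume CPU (and, if state leaks, memory) without ever becoming healthy. The setting name and retry loop below are hypothetical, not our actual code; the point is the contrast between silently retrying and failing fast with a clear error.

    import os
    import time

    REQUIRED_SETTING = "EVENT_PROCESSOR_ENDPOINT"  # illustrative name only

    def init_with_retry_loop():
        """Anti-pattern: retry indefinitely when required configuration is missing."""
        while True:
            value = os.environ.get(REQUIRED_SETTING)
            if value:
                return value
            # Each failed attempt burns CPU and gives no clear signal of why
            # the component is unhealthy.
            time.sleep(0.1)

    def init_fail_fast():
        """Preferred: surface the missing configuration immediately."""
        value = os.environ.get(REQUIRED_SETTING)
        if not value:
            raise RuntimeError(f"missing required configuration: {REQUIRED_SETTING}")
        return value

Failing fast turns a silent resource-consumption problem into an explicit, alertable startup error.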

Actions Taken:

June 30th Issue:

  • Restored critical system configurations.
  • Adjusted system resource allocations to prevent overload.
  • Reverted recent system base image changes as a precautionary measure.

July 1st Issue:

  • Temporarily bypassed a validation requirement to unblock immediate processing.
  • Identified and cleared problematic jobs from the processing queue via a database action (a hypothetical sketch of this cleanup follows this list).
  • Deployed updates to address the underlying cause of stuck jobs.
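
The cleanup amounted to finding jobs that could never complete and moving them out of the active queue so other work could flow. The Python snippet below is a hypothetical illustration against an in-memory SQLite table; the table and column names (scan_jobs, internal_id, status) are invented for the example and do not reflect our actual schema.

    import sqlite3

    # Hypothetical schema; names are illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scan_jobs (id INTEGER PRIMARY KEY, internal_id TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO scan_jobs (internal_id, status) VALUES (?, ?)",
        [("abc", "queued"), (None, "queued"), ("def", "queued")],
    )

    # Move jobs that are missing the critical identifier out of the active queue
    # so they stop blocking later-scheduled work.
    with conn:
        conn.execute(
            "UPDATE scan_jobs SET status = 'failed_missing_id' "
            "WHERE status = 'queued' AND internal_id IS NULL"
        )

    print(conn.execute("SELECT id, status FROM scan_jobs").fetchall())
    # -> [(1, 'queued'), (2, 'failed_missing_id'), (3, 'queued')]

A one-off cleanup like this needs to be paired with a change that prevents jobs from being enqueued without the identifier in the first place, which is what the deployed updates addressed.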

Action Items:

  • Implement enhanced configuration management to prevent similar issues.
  • Develop improved mechanisms for re-synchronizing data, ensuring a more even distribution of processing load (see the sketch after this list).
  • Review and optimize data processing flows to handle increased loads more resiliently.
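
One common way to spread a re-synchronization more evenly is to enqueue the work in small, rate-limited batches instead of all at once. The Python sketch below is only an assumption about how such throttling could look; the batch size, delay, and enqueue function are placeholders rather than a description of our implementation.

    import time
    from typing import Callable, Iterable

    def resync_in_batches(items: Iterable, enqueue: Callable,
                          batch_size: int = 100, delay_seconds: float = 1.0) -> None:
        """Enqueue re-sync work in small batches with a pause between them,
        so a large backfill does not flood the processing queue."""
        batch = []
        for item in items:
            batch.append(item)
            if len(batch) >= batch_size:
                for job in batch:
                    enqueue(job)
                batch.clear()
                time.sleep(delay_seconds)  # simple throttle between batches
        for job in batch:                  # flush the final partial batch
            enqueue(job)

    # Example usage with placeholder values:
    # resync_in_batches(range(1000), enqueue=print, batch_size=250, delay_seconds=0.5)
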
July 8, 2025 · 19:37 IDT
Resolved

The issue is now resolved.

July 2, 2025 · 02:30 IDT
Update

An additional fix was deployed to improve the situation. The queue started to clear faster.
Please note that some scans that were stuck in the queue for an extended period may still experience timeouts.

July 1, 2025 · 17:00 IDT
Update

We identified the root cause of the issue and began deploying a fix to address it.

July 1, 2025 · 12:30 IDT
Issue

We are experiencing slowness in the system when processing CLI, PushEvent, and Repository scans (SAST, Containers, SCA).
PullRequest scans are not impacted.

July 1, 2025 · 11:00 IDT
