PR Scans are Delayed
Updates
Incident Summary:
PR scans were delayed on Monday morning.
Time of Incident:
Monday, 10:15 AM
Issue Detected:
An alert indicated a failure in the scanning service. Initial investigation showed no PR scans were executing.
Initial Findings:
- Lag detected in the PR topic within the metadata service.
- Multiple KeyAlreadyExists errors in the metadata service related to a recent integration.
Immediate Action Taken:
- Faulted entities were manually removed from the database, allowing new entities to be processed correctly.
- Lag began decreasing immediately, and the issue was fully resolved shortly thereafter.
Root Cause:
GitLab offers an option to share repositories across multiple groups. In Cycode, for GitLab Enterprise integrations, specific groups can be designated as the ‘organization’ of the integration, and syncs are initiated from these groups. When a repository is shared with more than one of these groups, each group's sync attempts to process it. Because the repository carries the same identifier in every group, the repeated attempts caused unexpected errors, and the retry mechanisms added further delay. This led to Kafka lag and PR processing delays.
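The failure mode described above can be illustrated with a minimal Python sketch. Every name in it (`Repo`, `MetadataStore`, `KeyAlreadyExistsError`, `sync_group`) is hypothetical and is not taken from the actual metadata service. The sketch assumes only that repositories are stored under a unique identifier that is the same in every group.

```python
from __future__ import annotations

from dataclasses import dataclass


class KeyAlreadyExistsError(Exception):
    """Hypothetical error raised when an identifier is inserted twice."""


@dataclass(frozen=True)
class Repo:
    repo_id: int  # assumed: the same identifier in every group that sees the repo
    name: str


class MetadataStore:
    """Hypothetical in-memory store keyed by repository identifier."""

    def __init__(self) -> None:
        self._rows: dict[int, Repo] = {}

    def insert(self, repo: Repo) -> None:
        if repo.repo_id in self._rows:
            raise KeyAlreadyExistsError(repo.repo_id)
        self._rows[repo.repo_id] = repo


def sync_group(store: MetadataStore, repos: list[Repo]) -> list[int]:
    """Insert every repo visible from a group; return the ids that collided."""
    collisions = []
    for repo in repos:
        try:
            store.insert(repo)
        except KeyAlreadyExistsError:
            collisions.append(repo.repo_id)
    return collisions


shared = Repo(repo_id=42, name="shared-lib")
group_a = [Repo(1, "service-a"), shared]
group_b = [Repo(2, "service-b"), shared]  # the same repo, shared into group B

store = MetadataStore()
first = sync_group(store, group_a)   # no collisions on the first sync
second = sync_group(store, group_b)  # the shared repo collides here
```

In this sketch the collision is caught and recorded. In a consumer that retries on failure, the same collision would be retried repeatedly, which is one way lag can build up on a topic.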
Actions Taken:
- Stopped syncing shared repositories, so each repository syncs only from the original source group.
- Removed duplicated webhooks that resulted from duplicated syncs.
- Added protective measures to the metadata service to prevent reprocessing issues even if duplicate events occur.
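The first and third actions above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implemented fix: the `source_group` field, `repos_to_sync`, and `IdempotentStore` are hypothetical names, and the sketch assumes each repository records the group that owns it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    repo_id: int
    name: str
    source_group: str  # hypothetical: the group that owns the repository


def repos_to_sync(group: str, visible: list[Repo]) -> list[Repo]:
    """Keep only repositories owned by this group, skipping shared ones."""
    return [r for r in visible if r.source_group == group]


class IdempotentStore:
    """Hypothetical store where a repeated write is a no-op, not an error."""

    def __init__(self) -> None:
        self._rows: dict[int, Repo] = {}

    def put_if_absent(self, repo: Repo) -> bool:
        """Return True if the repo was newly stored, False if already present."""
        if repo.repo_id in self._rows:
            return False
        self._rows[repo.repo_id] = repo
        return True

    def __len__(self) -> int:
        return len(self._rows)


shared = Repo(42, "shared-lib", source_group="group-a")
visible_from_b = [Repo(2, "service-b", "group-b"), shared]

store = IdempotentStore()
synced = repos_to_sync("group-b", visible_from_b)  # shared repo is filtered out
results = [store.put_if_absent(r) for r in synced]
duplicate_result = store.put_if_absent(synced[0])  # duplicate event: no error
```

The two measures are independent. Filtering by owning group avoids producing duplicates, while the idempotent write keeps a duplicate harmless if one arrives anyway.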
The incident has been resolved. We will provide the RCA document early next week.
A fix has been implemented and we are monitoring the results.
We have identified an issue causing delays with our scan PR feature and are working to address it.