PR Scans are Delayed
Incident Report for Cycode
Postmortem

Incident Summary:
PR scans in the US environment were delayed on Monday morning, November 4, 2024.

Time of Incident:
Monday, 10:15 AM


Issue Detected:
An alert indicated a failure in the scanning service. Initial investigation showed no PR scans were executing.


Initial Findings:

  • Lag detected in the PR topic within the metadata service.
  • Multiple KeyAlreadyExists errors in the metadata service related to a recent integration.

Immediate Action Taken:

  • The faulty entities were manually removed from the database, allowing new entities to process correctly.
  • Lag began decreasing immediately, and the issue was fully resolved shortly thereafter.

Root Cause:
GitLab allows a repository to be shared across multiple groups. In Cycode, a GitLab Enterprise integration can designate specific groups as the integration's 'organization', and syncs start from those groups. When a shared repository is reachable from more than one designated group, the system attempts to process the same repository, with the same identifier, more than once. This caused unexpected errors, and the retry mechanisms that followed added delays. The result was Kafka lag and delayed PR processing.
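The failure mode described above can be illustrated with a minimal sketch. This is not Cycode's implementation: the exception class, function names, group names, and repository IDs are all hypothetical, and an in-memory dictionary stands in for the database.

```python
class KeyAlreadyExists(Exception):
    """Hypothetical stand-in for the error seen in the metadata service."""


def insert_repository(store: dict, repo_id: int, group: str) -> None:
    # Strict insert: fails if the repository identifier is already stored.
    if repo_id in store:
        raise KeyAlreadyExists(f"repository {repo_id} already exists")
    store[repo_id] = group


def sync_groups(groups: dict) -> tuple:
    """Sync every designated group; return (store, failed repository IDs)."""
    store, failures = {}, []
    for group, repo_ids in groups.items():
        for repo_id in repo_ids:
            try:
                insert_repository(store, repo_id, group)
            except KeyAlreadyExists:
                # A retry mechanism would pick this up and fail again,
                # which is how lag can build behind a single bad entity.
                failures.append(repo_id)
    return store, failures


# Assumed example: repository 42 is shared from "platform" into "security".
groups = {"platform": [41, 42], "security": [42, 43]}
store, failures = sync_groups(groups)
```

In this sketch the shared repository is inserted once through its first group and rejected when the second group's sync reaches it, so `failures` holds only the shared repository's ID.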


Actions Taken:

  1. Stopped syncing shared repositories, so each repository syncs only from the original source group.
  2. Removed duplicated webhooks that resulted from duplicated syncs.
  3. Added protective measures to the metadata service to prevent reprocessing issues even if duplicate events occur.
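
As a rough illustration of actions 1 and 3, the sketch below filters out repositories reached through a group they were shared with and makes the write idempotent, so a duplicate event is ignored rather than raising an error. The `RepoEvent` fields, function names, and sample values are assumptions made for this example, not the actual metadata service code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoEvent:
    repo_id: int
    source_group: str  # group that owns the repository (assumed field)
    synced_from: str   # group the sync reached it through (assumed field)


def should_sync(event: RepoEvent) -> bool:
    # Action 1: only sync a repository from its original source group.
    return event.synced_from == event.source_group


def upsert_repository(store: dict, event: RepoEvent) -> bool:
    """Action 3: idempotent write. Returns True only if a record was created."""
    if event.repo_id in store:
        return False  # duplicate event: no error, so nothing to retry
    store[event.repo_id] = event.source_group
    return True


events = [
    RepoEvent(42, "platform", "platform"),
    RepoEvent(42, "platform", "security"),  # shared copy, filtered out
    RepoEvent(42, "platform", "platform"),  # duplicate delivery, ignored
]
store = {}
created = [upsert_repository(store, e) for e in events if should_sync(e)]
```

The two checks are independent on purpose: the filter removes the known source of duplicates, while the idempotent write keeps the service safe if duplicates arrive some other way.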
Posted Nov 10, 2024 - 10:10 UTC

Resolved
The incident has been resolved. We will provide the RCA document early next week.
Posted Nov 04, 2024 - 12:07 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 04, 2024 - 11:02 UTC
Investigating
We have identified an issue causing delays with our scan PR feature and are working to address it.
Posted Nov 04, 2024 - 09:52 UTC
This incident affected: US Environment (API (US Environment)).