Resolved -
On November 11, 2023, at 1:00 UTC, GitHub background jobs encountered delays lasting up to 50 minutes. This delay affected various services utilizing background jobs, including Actions, Webhooks, Pull Requests, and Pages. The impact persisted for approximately one hour until 2:10 UTC.
During the incident, some customers experienced delays in starting GitHub Actions workflow runs and Pages builds. We estimate that about 10% of Actions workflow runs were delayed during the impact window and 99% of Pages builds failed from 1:00 UTC to 1:20 UTC. Users may have experienced a delay in seeing recent pushes reflected in pull request views. This delay averaged between 5 and 10 minutes and affected up to 30% of pull request page views during the incident. About 1% of pull request page views experienced delays of up to 60 minutes. Finally, 30% of webhook deliveries in this window missed our target of being delivered within 1 minute of the triggering event.
This incident was caused by excessive rebalancing in the Kafka consumer group that feeds our background job system. We have altered our Kafka configuration to reduce the likelihood of this issue, created diagnostic tools to identify future causes, and will be breaking up this relay into multiple groups to limit the blast radius if the problem recurs.
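For readers unfamiliar with consumer group rebalancing, the following is a minimal sketch of the kind of consumer settings commonly tuned to reduce it. The report does not say which settings were changed, so every value and name below (including the group and instance IDs) is a hypothetical illustration, not our actual configuration. The keys follow the librdkafka-style naming used by clients such as confluent-kafka.

```python
# Hypothetical consumer settings; the incident report does not list
# the actual configuration that was changed.
consumer_config = {
    "group.id": "background-jobs-relay",      # hypothetical group name
    # Cooperative rebalancing moves only the partitions that must move,
    # instead of revoking every partition from every consumer.
    "partition.assignment.strategy": "cooperative-sticky",
    # Static membership: a restarted consumer that rejoins with the same
    # ID before the session times out does not trigger a rebalance.
    "group.instance.id": "relay-worker-0",    # hypothetical instance ID
    "session.timeout.ms": 45000,
    "heartbeat.interval.ms": 15000,
    "max.poll.interval.ms": 300000,
}


def check_timeouts(config):
    """Return warnings for timeout settings that tend to cause rebalances."""
    warnings = []
    session = config["session.timeout.ms"]
    heartbeat = config["heartbeat.interval.ms"]
    # Kafka's documentation recommends a heartbeat interval of at most
    # one third of the session timeout.
    if heartbeat * 3 > session:
        warnings.append(
            "heartbeat.interval.ms should be at most 1/3 of session.timeout.ms"
        )
    return warnings


if __name__ == "__main__":
    for warning in check_timeouts(consumer_config):
        print(warning)
```

Splitting one large consumer group into several smaller ones, as planned above, limits how many consumers a single rebalance can pause.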
Nov 11, 02:14 UTC
Update -
Pages is operating normally.
Nov 11, 02:14 UTC
Update -
Actions is operating normally.
Nov 11, 02:13 UTC
Update -
Rebalancing completed and job queues are improving. We continue to monitor for full recovery of Webhooks, Actions, and Pages workflows.
Nov 11, 01:53 UTC
Update -
Actions is experiencing degraded performance. We are continuing to investigate.
Nov 11, 01:42 UTC
Update -
Webhooks is experiencing degraded performance. We are continuing to investigate.
Nov 11, 01:41 UTC
Update -
Pages builds, webhooks, and other workflows were delayed starting at 1:00 UTC. We have failed over the service that was contributing to the delays and see successful processing. We are continuing to monitor for full recovery.
Nov 11, 01:40 UTC
Investigating -
We are investigating reports of degraded performance for Pages.
Nov 11, 01:26 UTC