Resolved -
A performance and resilience optimization to the authorization microservice contained a memory leak that surfaced only under high traffic. As a result, a number of pages incorrectly returned 404s. Testing the build in our canary ring did not expose the service to enough traffic to reveal the leak, so the build graduated to production at 6:37 PM UTC. Under production load, the leak caused pods to crash repeatedly starting at 6:42 PM UTC, which made authorization checks fail. These failures triggered alerts at 6:44 PM UTC. Rolling back the authorization service change was delayed because parts of the deployment infrastructure relied on the authorization service, so the rollback required manual intervention to complete. The rollback completed at 7:08 PM UTC, and all impacted GitHub features recovered after pods came back online. We are evaluating changes to our rollout strategy to detect problems like this sooner and with less impact, changes to remove the dependency between authorization services and deployment rollback, and incident response improvements to reduce the overall time to recover.
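As an illustration only, and not a description of the actual rollout tooling, the kind of canary check discussed above could be sketched as a promotion gate that refuses to graduate a build unless the canary ring has served enough traffic and memory growth stayed bounded. The names `CanaryStats` and `may_promote`, and both thresholds, are hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    """Hypothetical metrics collected from a canary ring for one build."""
    requests_served: int     # total requests handled during the canary window
    memory_start_mb: float   # resident memory at the start of the window
    memory_end_mb: float     # resident memory at the end of the window


def may_promote(stats: CanaryStats,
                min_requests: int = 1_000_000,
                max_growth_mb_per_million: float = 50.0) -> bool:
    """Return True only if the canary saw enough traffic and memory stayed bounded.

    Both thresholds are illustrative assumptions, not values from the incident.
    """
    # Too little traffic proves nothing about behavior under load, so refuse.
    if stats.requests_served <= 0 or stats.requests_served < min_requests:
        return False
    # Normalize memory growth by traffic so long and short windows are comparable.
    growth_mb = stats.memory_end_mb - stats.memory_start_mb
    growth_per_million = growth_mb / (stats.requests_served / 1_000_000)
    return growth_per_million <= max_growth_mb_per_million
```

With a gate like this, a canary that served only 50,000 requests would be held back regardless of how healthy it looked, which is the failure mode described above: the build looked fine because it was never exercised under enough load.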
This incident is unrelated to the Slack integration incident. The combined status updates are a limitation of our status reporting tooling, and we recognize the confusion this creates. Work to address this was already in progress and will be complete this month.
Nov 3, 19:21 UTC
Update -
Slack notifications have recovered.
Nov 3, 19:17 UTC
Update -
Webhooks is operating normally.
Nov 3, 19:15 UTC
Update -
Pull Requests is operating normally.
Nov 3, 19:15 UTC
Update -
Issues is operating normally.
Nov 3, 19:15 UTC
Update -
Git Operations is operating normally.
Nov 3, 19:15 UTC
Update -
API Requests is operating normally.
Nov 3, 19:15 UTC
Update -
Actions is operating normally.
Nov 3, 19:15 UTC
Update -
Pages is operating normally.
Nov 3, 19:14 UTC
Update -
Codespaces is operating normally.
Nov 3, 19:14 UTC
Update -
Packages is operating normally.
Nov 3, 19:13 UTC
Update -
We have completed the rollback and are monitoring recovery.
Nov 3, 19:10 UTC
Update -
We’re in the process of rolling back an authorization-related change that is causing 404s and other errors.
Nov 3, 19:09 UTC
Update -
Packages is experiencing degraded availability. We are continuing to investigate.
Nov 3, 19:08 UTC
Update -
Pages is experiencing degraded availability. We are continuing to investigate.
Nov 3, 19:07 UTC
Update -
Actions is experiencing degraded availability. We are continuing to investigate.
Nov 3, 19:01 UTC
Update -
Packages is experiencing degraded performance. We are continuing to investigate.
Nov 3, 19:00 UTC
Update -
Codespaces is experiencing degraded performance. We are continuing to investigate.
Nov 3, 18:59 UTC
Update -
API Requests is experiencing degraded performance. We are continuing to investigate.
Nov 3, 18:58 UTC
Update -
Actions is experiencing degraded performance. We are continuing to investigate.
Nov 3, 18:56 UTC
Update -
Pull Requests is experiencing degraded performance. We are continuing to investigate.
Nov 3, 18:55 UTC
Update -
Issues is experiencing degraded performance. We are continuing to investigate.
Nov 3, 18:55 UTC
Update -
Git Operations is experiencing degraded performance. We are continuing to investigate.
Nov 3, 18:55 UTC
Update -
The delayed Slack notifications should be fully processed in about 30 minutes.
Nov 3, 18:25 UTC
Update -
Delayed Slack notifications are processing and the queue is expected to clear in just over an hour.
Nov 3, 17:53 UTC
Update -
Users may see delayed Slack notifications coming through as the queue is processed.
Nov 3, 17:25 UTC
Update -
A fix has been deployed, and Slack integrations are recovering.
Nov 3, 17:22 UTC
Update -
We are testing a fix for Slack integrations.
Nov 3, 16:56 UTC
Update -
We are aware of issues with Slack integration and are working on resolving the problem.
Nov 3, 16:11 UTC
Investigating -
We are currently investigating this issue.
Nov 3, 16:10 UTC