GitHub Status
All Systems Operational
Git Operations Operational
API Requests Operational
Webhooks Operational
Visit www.githubstatus.com for more information Operational
Issues Operational
Pull Requests Operational
Actions Operational
Packages Operational
Pages Operational
Codespaces Operational
Copilot Operational
Status key: Operational, Degraded Performance, Partial Outage, Major Outage, Maintenance
Past Incidents
Mar 14, 2024

No incidents reported today.

Mar 13, 2024
Resolved - From March 12, 2024 23:39 UTC to March 13, 2024 1:58 UTC, some Pull Request updates were delayed and did not reflect the latest code that had been pushed. On average, 20% of Pull Request page loads were out of sync, and up to 30% of Pull Requests were impacted at peak. An internal component of our job queueing system was handling invalid messages incorrectly, which stalled processing.

We mitigated the incident by shipping a fix to handle the edge case gracefully and allow processing to continue.

After the fix was deployed at 1:47 UTC, our systems fully caught up on pending background jobs by 1:58 UTC.

We’re working to improve our system's resiliency to invalid messages to prevent future delays for these pull request updates. We are also reviewing our monitoring and observability so we can identify and remediate these types of failures faster.
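
For illustration only, here is a minimal sketch of what handling such an edge case gracefully can look like in a queue consumer: invalid messages are logged and set aside in a dead-letter queue rather than stalling processing. All names (process_job, dead_letter, and so on) are hypothetical and not GitHub's actual code.

import json
import logging
from collections import deque

log = logging.getLogger("job-consumer")

def process_job(payload: dict) -> None:
    """Placeholder for real job-processing logic (illustrative only)."""
    log.info("processed job %s", payload.get("id"))

def consume(queue: deque, dead_letter: deque) -> None:
    """Drain the queue; invalid messages are dead-lettered, not retried forever.

    Before a fix like the one described above, a single malformed message could
    stall processing; here it is logged, set aside, and the loop keeps going.
    """
    while queue:
        raw = queue.popleft()
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            log.warning("invalid message moved to dead-letter queue: %r", raw)
            dead_letter.append(raw)  # keep it for later inspection
            continue                 # do not block the rest of the queue
        process_job(payload)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    queue = deque(['{"id": 1}', "not-json", '{"id": 2}'])
    dead_letter = deque()
    consume(queue, dead_letter)
    print(f"{len(dead_letter)} message(s) dead-lettered")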

Mar 13, 01:58 UTC
Update - Pull Requests is operating normally.
Mar 13, 01:58 UTC
Update - We believe we've found a mitigation and are currently monitoring systems for recovery.
Mar 13, 01:53 UTC
Update - We're continuing to investigate delays in PR updates. Next update in 30 minutes.
Mar 13, 01:18 UTC
Update - We're continuing to investigate an elevated number of pull requests that are out of sync on page load.
Mar 13, 00:47 UTC
Update - We're continuing to investigate an elevated number of pull requests that are out of sync on page load.
Mar 13, 00:12 UTC
Update - We're seeing an elevated number of pull requests that are out of sync on page load.
Mar 12, 23:39 UTC
Investigating - We are investigating reports of degraded performance for Pull Requests
Mar 12, 23:39 UTC
Mar 12, 2024
Resolved - On March 11, 2024, starting at 22:45 UTC and ending on March 12, 2024 at 00:48 UTC, various GitHub services were degraded and returned intermittent errors to users. During this incident, the following customer impacts occurred: API error rates as high as 1%, Copilot error rates as high as 17%, and Secret Scanning and 2FA using GitHub Mobile error rates as high as 100%, followed by a drop in error rates to 30% starting at 22:55 UTC. This elevated error rate was due to a degradation of our centralized authentication service, upon which many other services depend.

The issue was caused by a deployment of network-related configuration that was inadvertently applied to the incorrect environment. This error was detected within 4 minutes and a rollback was initiated. While error rates began dropping quickly at 22:55 UTC, the rollback failed in one of our data centers, leading to a longer recovery time; at this point, many failed requests succeeded upon retrying. The rollback failure was due to an unrelated issue that had occurred earlier in the day, in which the datastore for our configuration service was polluted in a way that required manual intervention. The bad data in the configuration service caused the rollback in that one data center to fail. Manual removal of the incorrect data allowed the full rollback to complete at 00:48 UTC, thereby restoring full access to services. We understand how the corrupt data was deployed and continue to investigate why the specific data caused the subsequent deployments to fail.

We are working on various measures to make this kind of configuration change safer, to detect problems faster via better monitoring of the related subsystems, and to improve the robustness of our underlying configuration system, including preventing and automatically cleaning up polluted records so that we can recover from this kind of data issue automatically in the future.
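
As a hedged illustration of the kind of safety measure described above (assumed names and checks, not GitHub's deployment tooling), a pre-apply guard might verify both the target environment and the integrity of the configuration records before a change proceeds:

from dataclasses import dataclass

class DeploymentAborted(Exception):
    """Raised when a configuration change fails pre-apply validation."""

@dataclass
class ConfigBundle:
    target_env: str  # environment the change was authored for
    records: dict    # key/value configuration records

def validate(bundle: ConfigBundle, current_env: str) -> None:
    # Guard 1: refuse to apply a change authored for a different environment.
    if bundle.target_env != current_env:
        raise DeploymentAborted(
            f"bundle targets {bundle.target_env!r}, but this is {current_env!r}"
        )
    # Guard 2: refuse to apply if any record is empty or blank, since polluted
    # data here could later break rollbacks as well as roll-forwards.
    bad = [key for key, value in bundle.records.items() if not str(value).strip()]
    if bad:
        raise DeploymentAborted(f"polluted configuration records: {bad}")

def apply(bundle: ConfigBundle, current_env: str) -> None:
    validate(bundle, current_env)
    print(f"applied {len(bundle.records)} records to {current_env}")

if __name__ == "__main__":
    bundle = ConfigBundle(target_env="staging", records={"lb.timeout": "30s"})
    try:
        apply(bundle, current_env="production")  # wrong environment: aborts
    except DeploymentAborted as err:
        print("deployment aborted:", err)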

Mar 12, 01:00 UTC
Update - We believe we've resolved the root cause and are waiting for services to recover
Mar 12, 01:00 UTC
Update - API Requests is operating normally.
Mar 12, 00:56 UTC
Update - Git Operations is operating normally.
Mar 12, 00:55 UTC
Update - Webhooks is operating normally.
Mar 12, 00:54 UTC
Update - Copilot is operating normally.
Mar 12, 00:54 UTC
Update - We're continuing to investigate issues with our authentication service, impacting multiple services
Mar 12, 00:14 UTC
Update - Webhooks is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:55 UTC
Update - Webhooks is operating normally.
Mar 11, 23:31 UTC
Update - Copilot is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:21 UTC
Update - Git Operations is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:20 UTC
Update - Webhooks is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:09 UTC
Investigating - We are investigating reports of degraded availability for API Requests, Git Operations and Webhooks
Mar 11, 23:01 UTC
Mar 11, 2024
Resolved - On March 11, 2024, between 18:44 UTC and 19:10 UTC, GitHub Actions performance was degraded and some users experienced errors when trying to queue workflows. Approximately 3.7% of runs queued during this time were unable to start.

The issue was partially caused by a deployment of an internal system that Actions relies on to process workflow run events. Queue processing paused for about 3 minutes during this deployment, which caused a spike in queued workflow runs. When this queue began to be processed, the high number of queued workflows overwhelmed a secret-initialization component of the workflow invocation system, and the resulting errors delayed workflow invocation. Through our alerting system, we received initial indications of an issue at approximately 18:44 UTC, but we did not see impact on our run start delays and run queuing availability metrics until approximately 18:52 UTC. As the large queue of workflow run events burned down, we saw recovery in our key customer impact measures by 19:11 UTC, but we waited until 19:22 UTC to declare the incident resolved while verifying there was no further customer impact.

We are working on various measures to reduce spikes in queue build-up during deployments of our queueing system, and we have scaled up the workers that handle secret generation and storage during the workflow invocation process.
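
For context, the mitigation described here amounts to capping how hard a burst of queued runs can hit the downstream secret-initialization step. A toy sketch under that assumption (names are illustrative, not the Actions codebase):

import time
from concurrent.futures import ThreadPoolExecutor

def initialize_secrets(run_id: int) -> int:
    """Stand-in for the secret-initialization step of workflow invocation."""
    time.sleep(0.01)  # simulate a call to a downstream service
    return run_id

def drain_backlog(run_ids: list, max_workers: int = 8) -> list:
    """Process a burst of queued runs with bounded concurrency.

    The pool size caps concurrent calls to the downstream service, so a spike
    in queued runs is worked off gradually instead of all at once.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(initialize_secrets, run_ids))

if __name__ == "__main__":
    backlog = list(range(200))  # e.g. runs that queued up during a brief pause
    started = drain_backlog(backlog)
    print(f"started {len(started)} runs")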

Mar 11, 19:22 UTC
Update - Actions experienced a period of decreased workflow run throughput, and we are seeing recovery now. We are in the process of investigating the cause.
Mar 11, 19:21 UTC
Investigating - We are investigating reports of degraded performance for Actions
Mar 11, 19:02 UTC
Resolved - This incident has been resolved.
Mar 11, 10:20 UTC
Update - We are deploying mitigations for the failures we have been observing in some chat requests for Copilot. We will continue to monitor and update.
Mar 11, 10:02 UTC
Update - We are seeing an elevated failure rate for chat requests for Copilot. We are investigating and will continue to keep users updated on progress towards mitigation.
Mar 11, 09:03 UTC
Investigating - We are investigating reports of degraded performance for Copilot
Mar 11, 08:14 UTC
Mar 10, 2024

No incidents reported.

Mar 9, 2024

No incidents reported.

Mar 8, 2024

No incidents reported.

Mar 7, 2024

No incidents reported.

Mar 6, 2024

No incidents reported.

Mar 5, 2024

No incidents reported.

Mar 4, 2024

No incidents reported.

Mar 3, 2024

No incidents reported.

Mar 2, 2024

No incidents reported.

Mar 1, 2024
Resolved - On March 1, 2024, between 17:00 UTC and 17:42 UTC, we saw elevated failure rates (from 1% to 10%) across various APIs for Copilot, Actions, Pages, and Git.

This incident was triggered by a newly discovered failure mode in the deployment pipeline for one of our compute clusters: when the pipeline could not write a specific configuration file, the amount of resources available in the cluster dropped. The issue was mitigated by a redeployment.

We have addressed the specific scenario to ensure resources are properly written and retrieved, and we have added safeguards so the deployment does not proceed if there is an issue of this type. We are also reviewing our systems to route traffic toward healthy clusters more effectively during an outage, and we are adding more safeguards on cluster resource adjustments.
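
A simplified sketch of the "do not proceed" safeguard described above, assuming a hypothetical cluster-configuration file and verification step rather than GitHub's actual pipeline:

import json
import sys
import tempfile
from pathlib import Path

def write_and_verify(path: Path, config: dict) -> None:
    """Write the cluster configuration and verify it can be read back.

    If the write or the read-back fails, raise instead of letting the
    deployment continue with missing resources.
    """
    path.write_text(json.dumps(config))
    if json.loads(path.read_text()) != config:
        raise RuntimeError(f"configuration verification failed for {path}")

if __name__ == "__main__":
    config = {"cluster": "example", "replicas": 12}
    target = Path(tempfile.gettempdir()) / "cluster-config.json"
    try:
        write_and_verify(target, config)
    except (OSError, RuntimeError) as err:
        print("aborting deployment:", err)
        sys.exit(1)
    print("configuration verified; deployment may proceed")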

Mar 1, 17:42 UTC
Update - Git Operations is operating normally.
Mar 1, 17:42 UTC
Update - Actions and Pages are operating normally.
Mar 1, 17:41 UTC
Update - Copilot is operating normally.
Mar 1, 17:36 UTC
Update - Pages is experiencing degraded performance. We are continuing to investigate.
Mar 1, 17:34 UTC
Update - One of our clusters is experiencing problems, and we are working on restoring the cluster at this time.
Mar 1, 17:34 UTC
Investigating - We are investigating reports of degraded performance for API Requests, Copilot, Git Operations and Actions
Mar 1, 17:30 UTC
Resolved - On March 1, 2024, between 14:17 UTC and 15:54 UTC, the service that sends messages from our event stream into our background job processing service was degraded, delaying the transmission of jobs for processing. No data or jobs were lost. From 14:17 to 14:41 UTC there was a partial degradation, in which customers experienced intermittent delays with PRs and Actions. From 14:41 to 15:24 UTC, 36% of PR users saw stale data, and 100% of in-progress Actions workflows did not see updates, even though the workflows were succeeding. At 15:24 UTC, we mitigated the incident by redeploying our service; jobs began to burn down, with a full catch-up by 15:54 UTC. The degradation was due to under-provisioned memory and a lack of memory-based back pressure in the service, which overwhelmed consumers and led to OutOfMemory crashes.

We have adjusted memory configurations to prevent this problem, and are analyzing and adjusting our alert sensitivity to reduce our time to detection of issues like this one in the future.
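
For readers unfamiliar with the term, memory-based back pressure here simply means refusing to buffer more events than the consumer can safely hold. A minimal in-process sketch, with assumed names and a deliberately tiny byte budget:

from collections import deque

class BoundedBuffer:
    """Buffer events up to a byte budget and signal back pressure when full.

    Instead of accepting messages until the process runs out of memory,
    push() tells the producer when it should pause and retry later.
    """

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self.items = deque()

    def push(self, event: bytes) -> bool:
        if self.used + len(event) > self.max_bytes:
            return False  # back pressure: caller should pause intake
        self.items.append(event)
        self.used += len(event)
        return True

    def pop(self) -> bytes:
        event = self.items.popleft()
        self.used -= len(event)
        return event

if __name__ == "__main__":
    buffer = BoundedBuffer(max_bytes=64)
    accepted = sum(buffer.push(b"x" * 16) for _ in range(10))
    print(f"accepted {accepted} events before applying back pressure")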

Mar 1, 16:12 UTC
Update - Issues, Pull Requests and Actions are operating normally.
Mar 1, 16:12 UTC
Update - We're seeing our background job queue sizes trend down, and expect full recovery in the next 15 minutes.
Mar 1, 15:48 UTC
Update - Issues is experiencing degraded performance. We are continuing to investigate.
Mar 1, 15:39 UTC
Update - We're continuing to investigate issues with background jobs that have impacted Actions and Pull Requests. We have a mitigation in place and are monitoring for recovery.
Mar 1, 15:27 UTC
Update - We're investigating issues with background jobs that are causing sporadic delays in pull request synchronization and reduced Actions throughput.
Mar 1, 14:51 UTC
Investigating - We are investigating reports of degraded performance for Pull Requests and Actions
Mar 1, 14:39 UTC
Feb 29, 2024
Resolved - On February 29, 2024, between 9:32 and 11:54 UTC, queuing in our background job service caused processing delays for Webhooks, Actions, and Issues. Nearly 95% of the delays occurred between 11:05 and 11:27 UTC, with the remaining 5% spread over the rest of the incident. During this incident, the following customer impacts occurred: 50% of webhooks experienced delays of up to 5 minutes, and 1% experienced delays of up to 17 minutes at peak; on average 7% of Actions customers experienced delays, with a peak of 44%; and many Issues were delayed in appearing in search. At 9:32 UTC, our automated failover successfully routed traffic to a secondary cluster, but an improper restoration to the primary at 10:32 UTC caused a significant increase in queued jobs until 11:21 UTC, when a correction was made and healthy services began burning down the backlog until full resolution.

We have made improvements to the automation and reliability of our fallback process to prevent recurrence. We also have larger work already in progress to improve the overall reliability of our job processing platform.
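
A simplified sketch of the kind of guarded failback such a fallback-process improvement implies, with the names and the probe logic assumed rather than taken from GitHub's systems:

import random

def is_healthy(cluster: str) -> bool:
    """Stand-in health probe; a real check would query the cluster directly."""
    return random.random() > 0.3

def restore_primary(current: str, primary: str, checks: int = 3) -> str:
    """Fail back to the primary cluster only after consecutive healthy probes.

    If any probe fails, keep traffic on the current (secondary) cluster rather
    than routing jobs to a primary that is not ready to process them.
    """
    if all(is_healthy(primary) for _ in range(checks)):
        return primary
    return current

if __name__ == "__main__":
    active = restore_primary(current="secondary", primary="primary")
    print("traffic routed to:", active)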

Feb 29, 12:27 UTC
Update - We're seeing recovery and are going to take time to verify that all systems are back in a working state.
Feb 29, 12:21 UTC
Update - Issues is operating normally.
Feb 29, 12:19 UTC
Update - Webhooks is operating normally.
Feb 29, 12:18 UTC
Update - We're continuing to investigate delayed background jobs. We've seen partial recovery for Issues, and there is ongoing impact to Actions, notifications, and webhooks.
Feb 29, 11:05 UTC
Update - Actions is experiencing degraded performance. We are continuing to investigate.
Feb 29, 10:58 UTC
Update - We're seeing issues related to background jobs, which are causing delays for webhook delivery, search indexing, and other updates.
Feb 29, 10:36 UTC
Investigating - We are investigating reports of degraded performance for Issues and Webhooks
Feb 29, 10:33 UTC