Azure DevOps Outage Proves That Typing Is Still Hard

Posted on Saturday, Jun 10, 2023 by Ned Bellavance

Featured in this episode of Chaos Lever

On May 24th, Azure DevOps had a 10-hour outage in the Brazil South region due to a typo that caused the system to inadvertently delete 17 production databases. Ouch.

No data was lost, since Microsoft apparently does keep backups, but it took them almost 10 hours to restore full service due to a series of compounding issues.

The initial problem was in a pull request switching their codebase from a deprecated Azure management package library to the new Resource Manager NuGet library. The deployment was supposed to delete old database snapshots, but a typo instead deleted the Azure SQL Servers backing the service, along with every database on them.
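To make the failure mode concrete, here is a minimal sketch of why the gap between "delete one snapshot" and "delete the server" is so dangerous, plus one possible guard. This is not Microsoft's code and it does not use the Azure SDK. The `Fleet` and `Server` classes, the `delete_snapshot` helper, the server and database names, and the `-snapshot` naming convention are all assumptions invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy stand-in for a database server; not an Azure SDK type."""
    name: str
    databases: set[str] = field(default_factory=set)


class Fleet:
    """Hypothetical in-memory inventory used only for this illustration."""

    def __init__(self) -> None:
        self.servers: dict[str, Server] = {}

    def add(self, server: str, database: str) -> None:
        self.servers.setdefault(server, Server(server)).databases.add(database)

    def delete_database(self, server: str, database: str) -> None:
        # Narrow operation: removes exactly one database.
        self.servers[server].databases.remove(database)

    def delete_server(self, server: str) -> None:
        # Broad operation: removes the server and every database on it.
        del self.servers[server]


def delete_snapshot(fleet: Fleet, server: str, database: str) -> None:
    """Delete one snapshot, refusing anything that is not named like one."""
    # Assumed naming convention for the sketch: snapshots end in "-snapshot".
    if not database.endswith("-snapshot"):
        raise ValueError(f"refusing to delete non-snapshot database {database!r}")
    fleet.delete_database(server, database)


fleet = Fleet()
fleet.add("sql-01", "orders")
fleet.add("sql-01", "orders-snapshot")

delete_snapshot(fleet, "sql-01", "orders-snapshot")
print(sorted(fleet.servers["sql-01"].databases))  # ['orders']
```

The point of the sketch is that cleanup code only ever goes through the narrow, guarded helper. Swapping in `fleet.delete_server("sql-01")` is a one-line slip that takes `orders` down along with the snapshot, which is roughly the shape of the mistake described above.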

That is, in parlance, bad. Very bad.

The issue was detected within 20 minutes of the change being merged, but restoration hit three complications: having to work with the Azure SQL team to recover the servers, dealing with unexpected replication of geo-redundant databases, and a slow warm-up process for the web front end.

Microsoft has put in mitigations to prevent this issue from happening in the future, which simply means things will break in new and exciting ways. Yay technology!