Crowdstrike Internet Outage: development techniques that could have prevented it.
Internet of Bugs 24:15
40,857 views · 2,407 likes
In my last video, early last week, we talked about how a study showed that 90% of all catastrophic software failures were caused by poor error handling code.
This week, let's see if we can think of some event to use as an example of poor error handling, talk about the kinds of endemic problems in our industry that allow such things to happen, and go through a number of strategies (seventeen of them, to be specific) that should have been used to prevent this kind of nightmare, but obviously weren't.
00:00 Clown (Crowd?) Strike fail
00:45 After Channel Intro
01:22 What we know happened
03:28 Business reasons this can happen
06:28 Technical reasons this can happen
07:12 Who this video is for
07:56 Error Handling techniques
08:23 "Works on MY machine!"
09:11 Unit Tests are USELESS
10:01 Web programmers can learn from this, too
11:09 How do you test for this kind of thing?
12:12 What if those tests don't catch it?
12:59 Phased/Slow Roll-out
13:33 Centralized software log collection
14:48 Sanity check files before you execute them
15:17 Pay attention to weird behavior, don't just "see if it happens again"
15:58 Separate high-risk code, like parsing new driver files, into a different process
16:18 Minimize the code that runs in kernel space, at boot, or with elevated permissions
18:15 Why didn't the update get rolled back when the boot failed?
19:54 Recap
22:26 These skills can't be replaced by A.I.
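As a rough illustration of the "Sanity check files before you execute them" chapter, here is a minimal Python sketch. It is not what any vendor actually does: the file layout (a `CFG1` magic value, a record count, fixed-size records) and the names `sanity_check` and `load_or_keep_previous` are all made up for this example. The idea is only that a new file gets validated first, and a file that fails the checks is rejected in favor of the last known-good state instead of being loaded.

```python
import struct

# Hypothetical file layout, invented for this sketch:
#   4-byte magic b"CFG1", 4-byte little-endian record count,
#   then exactly that many fixed-size 16-byte records.
MAGIC = b"CFG1"
HEADER = struct.Struct("<4sI")
RECORD_SIZE = 16


class BadContentFile(Exception):
    """Raised when a content file fails a sanity check."""


def sanity_check(data: bytes) -> int:
    """Validate a content blob; return its record count or raise BadContentFile."""
    if len(data) < HEADER.size:
        raise BadContentFile("file too short for header")
    if not any(data):
        raise BadContentFile("file is all zero bytes")
    magic, count = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise BadContentFile(f"bad magic {magic!r}")
    expected = HEADER.size + count * RECORD_SIZE
    if len(data) != expected:
        raise BadContentFile(f"expected {expected} bytes, got {len(data)}")
    return count


def load_or_keep_previous(new_data: bytes, previous_count: int) -> int:
    """Use the new file only if it passes; otherwise keep the last known-good state."""
    try:
        return sanity_check(new_data)
    except BadContentFile:
        # A real system would also log this and report it centrally.
        return previous_count
```

The checks are deliberately cheap and run before any parsing logic touches the records, so a truncated, zeroed, or mis-sized file never reaches the code that would act on it.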
Links:
Wikipedia Article on outage (in case you're watching this in the future and don't know which outage we're talking about):
https://en.wikipedia.org/wiki/2024_CrowdStrike_incident
Open Source group that knows how to roll back a bad update - unlike Microsoft - and how they do it:
https://slimbootloader.github.io/security/firmware-resiliency-and-recovery.html
My previous video on why businesses let this happen:
https://www.youtube.com/watch?v=hKqqU1J-WXk
My video from early last week about how "Software Design" should, but doesn't, include Error Handling:
https://www.youtube.com/watch?v=4xqkI953K6Y
Paper (and presentation) on how 90% of all catastrophic software failures are caused by poor error handling code:
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan
Thumbnail Images from:
By Smishra1 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=150535443
and/or
https://flickr.com/photos/84539227@N00/53867936421