
CrowdStrike crisis a teachable moment for IT departments

You can’t prevent every crisis, but you can influence how you react when things go pear-shaped, says Caragh O'Carroll

7 August 2024

Your scenario planning or business strategy didn’t predict this. Something has gone wrong with your IT. Your company is in the media for all the wrong reasons and the share price starts tanking. What can you do?

You can’t prevent every crisis unless you have infinite resources, foresight and a crystal ball, but you can influence the reaction should your day go pear-shaped.

Crisis management combines proactive processes to identify potential threats and plan how to mitigate them should they arise, with the reactive actions taken when an incident occurs. On 19 July, security firm CrowdStrike’s brand recognition took off thanks to a misfire that disabled some 8.5 million Microsoft Windows systems across multiple industries around the world. So, how did it perform in response?

Sarah Armstrong-Smith, Microsoft chief security advisor and author of Effective Crisis Management, gives an A-Z of the factors at play when directing successful crisis management: action, communication, empathy, facts, honesty, investigation, lessons, opportunity, and resilience.

Paul Reid, former HSE chief executive, advises these tenets when faced with a crisis: act fast, empower people, accept that mistakes will be made, and get out of the way.

What went wrong for CrowdStrike

On 19 July, CrowdStrike did what it does multiple times a day: it released a rapid response configuration update for its endpoint detection and response platform. Such Falcon sensor ‘channel file’ updates are the normal way CrowdStrike pushes out responses to newly discovered cyber threats, techniques and procedures. But this update contained problematic content that a bug in the Content Validator failed to catch; when the Falcon sensor loaded it, it attempted to read beyond the bounds of memory, triggering a Windows operating system crash on affected virtual machines and PCs.
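
CrowdStrike has not published the offending code, but the failure mode it describes, an out-of-bounds memory read, is well understood. Below is a minimal, purely illustrative C sketch (the file layout and every name in it are invented, not taken from CrowdStrike): a loader trusts a count declared in a content file, a naive validator fails to catch the mismatch, and the read runs past the end of the buffer. In user space that corrupts or crashes one process; in a kernel-mode sensor component it brings down the whole operating system.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical channel-file layout, for illustration only. */
    struct channel_file {
        size_t declared_count;   /* entry count claimed by the content update */
        size_t actual_count;     /* entries genuinely present in the buffer   */
        int    entries[8];
    };

    /* A validator that checks only the declared count lets bad content through. */
    static int naive_validate(const struct channel_file *cf)
    {
        return cf->declared_count <= 64;   /* never compared to the real buffer size */
    }

    static void load_content(const struct channel_file *cf)
    {
        for (size_t i = 0; i < cf->declared_count; i++) {
            /* When declared_count exceeds actual_count this reads past the end of
             * entries[]: undefined behaviour in user space, a system crash when it
             * happens inside a kernel-mode driver. */
            printf("entry %zu = %d\n", i, cf->entries[i]);
        }
    }

    int main(void)
    {
        struct channel_file bad = { .declared_count = 20, .actual_count = 8, .entries = {0} };
        if (naive_validate(&bad))
            load_content(&bad);   /* out-of-bounds read on entries[8..19] */
        return 0;
    }

The simplification is deliberate; the point is that a validator which checks only superficial properties of content can pass input the runtime cannot safely handle.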

How did CrowdStrike do? CEO George Kurtz quickly released a statement to apologise and set out what had happened, the scope of the issue, the remediation underway and the support available. Further updates followed, refining the scope and the recovery recommendations. There was little empathy for the thousands of businesses impacted, however, and it took several updates for CrowdStrike to demonstrate an understanding of how emotive the issue was: the cancelled flights, the disrupted holidays, the people stranded, and the businesses unable to serve customers.

Even though it wasn’t a cyberattack, the key focus for everyone was how to get people and systems back up and running again. The issue was so great that a meeting of the UK’s Cabinet Office Briefing Room A (COBRA) was called to discuss the fragility of, and dependency on, the digital supply chain. Microsoft had 5,000 engineers working on solutions, creating recovery tools and even preparing USB keys to help people boot up safely.

CrowdStrike did, however, make one terrible error by offering a $10 Uber Eats voucher as a token of apology to the IT staff who had to clean up the mess. It was seen as an insult, and it backfired further when Uber Eats flagged attempts to redeem the vouchers as fraud because of the high volumes.

Commercial impact

Vouchers do not compensate businesses for the revenue hit. Parametrix estimates that the outage may have cost Fortune 500 companies as much as $5.4 billion in revenue and gross profit through lost working hours and inability to service customers. Recovery took days in some cases and could still take weeks for some organisations. 

CrowdStrike has continued providing updates. The faulty content update (Channel File 291) was reverted within 78 minutes; the fix to the Content Validator is being tested ahead of deployment.

CrowdStrike’s preliminary post-incident review identified the need for extra levels of testing of rapid response content, additional validation checks in the Content Validator and improved error handling in the sensor. They are also introducing a staggered (phased) deployment strategy, starting with a canary deployment, and giving customers greater information about, and control over, updates. A full root cause analysis will be published when complete.
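
The staggered approach CrowdStrike describes follows the widely used canary, or ring-based, rollout pattern: ship new content to a small slice of the fleet first, watch the telemetry, and only then widen exposure. Here is a minimal illustrative sketch in C, with invented host names and phase percentages; nothing in it reflects CrowdStrike’s actual implementation.

    #include <stdint.h>
    #include <stdio.h>

    /* Each phase exposes a larger share of the fleet to the new content. */
    static const unsigned phase_percent[] = { 1, 5, 25, 100 };   /* canary first */

    /* Stable FNV-1a hash so a host always lands in the same rollout bucket. */
    static uint32_t bucket_of(const char *host_id)
    {
        uint32_t h = 2166136261u;
        for (const char *p = host_id; *p; p++) {
            h ^= (uint8_t)*p;
            h *= 16777619u;
        }
        return h % 100;
    }

    /* Should this host receive the update during the given phase? */
    static int host_in_rollout(const char *host_id, unsigned phase)
    {
        return bucket_of(host_id) < phase_percent[phase];
    }

    int main(void)
    {
        const char *hosts[] = { "laptop-001", "pos-terminal-17", "dc-frankfurt-3" };
        for (unsigned phase = 0; phase < 4; phase++) {
            printf("phase %u (%u%% of fleet):", phase, phase_percent[phase]);
            for (size_t i = 0; i < sizeof hosts / sizeof hosts[0]; i++)
                if (host_in_rollout(hosts[i], phase))
                    printf(" %s", hosts[i]);
            printf("\n");
            /* In a real system: monitor crash and error telemetry from this ring
             * and halt the rollout automatically before advancing. */
        }
        return 0;
    }

Hashing the host identifier keeps each machine in the same ring across phases, so a faulty update is contained to the hosts that have already received it rather than hitting the entire fleet at once.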

In my own experience, having established processes (e.g. major incident managers, crisis management teams, lines of communication) goes a long way towards managing updates internally, understanding impact and supporting customers. Business continuity planning helps with preparation, but mental agility is key – what you plan for is rarely what actually happens.

CrowdStrike did well on some of the best practices: acting fast, taking action, communicating updates, sharing the facts, being honest and investigating what happened. They saw the opportunity to improve their resilience and to change their approach to rolling out updates. They didn’t seem to understand why their lunch vouchers were a mistake, however. Whatever their crisis management plan was, parts of it worked, but a plan is only a foundation when a crisis hits.

Both Armstrong-Smith and Reid advise not relying on a single decision-maker but empowering more leaders to have the confidence to make decisions quickly and effectively. That empowerment comes from being diligent in the planning phases. CrowdStrike has learnt a very expensive lesson in planning. They may still need to do more to develop emotional intelligence: to empathise with customers, understand the impact and give real support, not tokenism.
