October 10, 2024
Best Practices to Mitigate Software Outages
The infamous CrowdStrike outage from July of this year may be in the rearview mirror, but the reflection of the billion-dollar company is still an unsavory one: several companies are following Delta Air Lines’ lead in suing the cybersecurity giant.
In case you missed it, CrowdStrike released a faulty software update to its customers, which in addition to Delta Air Lines include such behemoths as Intel and Amazon. It seems no corner of the world was left untouched by the error.
Software outages aren’t new, and as we explained in an earlier blog post, 100% bug-free software is unattainable. Few outages will garner the worldwide attention or widespread consequences of the CrowdStrike release, but many premature or buggy software releases still negatively impact everyday users around the globe.
Don’t Let Urgency Supersede Best Practices
According to Flint Hills Group CEO Dave Cunningham, the culprit behind a faulty software update is often something many businesspeople feel every day when trying to release a new or updated product to the public: a sense of urgency.
“Companies often blast out a release to everyone too quickly,” Cunningham said. “That sense of urgency is understandable, but when something goes wrong, like with the CrowdStrike issue, it’s devastating for users and in their case, huge businesses.”
Cunningham detailed several strategies software developers can adopt to mitigate the risk of releasing a faulty update that creates an outage.
- Staged Rollout. One effective measure against releasing a faulty update, particularly when there are large user groups such as with CrowdStrike, is to release the update in stages: first to beta users, then to a larger group of users, and only then to everyone. A problem that surfaces in an early stage affects far fewer people and can be fixed before the next stage begins.
- Regression Testing. This type of software testing verifies that recent code changes have not adversely affected existing functionality. It requires time and patience, as it involves re-running previously conducted tests on the modified software to check that the new code has not introduced any bugs or broken any existing features. It is time well spent, however, compared to the potential consequences and costs of a future outage. It also increases confidence in your updates among your users and helps establish a reputation of consistency for you and your software.
- Code Freezes. A code freeze is a set period before a release during which no one touches the code. No new code changes are implemented, as the developers’ focus shifts from development to quality assurance: testing, debugging, and finalizing the existing codebase. A freeze helps stabilize the software by preventing new changes that could introduce bugs right before a release, and it helps developers meet deadlines while still working with careful deliberation.
- Allocate Resources for Audits and Fixes. Plan to have people available to jump in and fix any discovered bugs, or to pull the update back after its release. Having this manpower in place prior to a release helps ensure the development team can make and deploy corrections quickly, with minimal delays and disruptions. Some developers also engage third-party or independent auditors to review code and updates prior to deployment, providing an additional layer of scrutiny with a fresh set of eyes.
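To make the staged rollout idea above more concrete, here is a minimal Python sketch of one common way to decide which users receive an update at each stage. It is an illustration only, not a description of how CrowdStrike or Flint Hills Group ships software. The stage names, the percentages, and the function names `bucket_for` and `should_receive_update` are all assumptions made for the example.

```python
import hashlib

# Hypothetical stages: the percentage of users eligible at each step.
# These names and numbers are assumptions chosen for illustration.
ROLLOUT_STAGES = {"beta": 1, "early": 25, "everyone": 100}


def bucket_for(user_id: str, release: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given release."""
    digest = hashlib.sha256(f"{release}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def should_receive_update(user_id: str, release: str, stage: str) -> bool:
    """Return True if the user falls inside the current stage's percentage."""
    return bucket_for(user_id, release) < ROLLOUT_STAGES[stage]


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    for stage in ROLLOUT_STAGES:
        eligible = sum(should_receive_update(u, "2024.10", stage) for u in users)
        print(f"{stage}: {eligible} of {len(users)} users would get the update")
```

Because each user's bucket is derived from a hash rather than a random draw, the same user lands in the same bucket every time the check runs, and anyone included in an earlier stage stays included as the percentage grows. Widening the rollout is then just a matter of moving to the next stage once the current one looks healthy.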
Communication and Documentation Should Be a Constant
A few months ago we wrote about problems software developers encounter when taking over a project from another developer, and key among those problems were communication and documentation issues. The CrowdStrike outage again showed why communication is necessary when an outage occurs, both from the software company and, in this case, from its customers as well. Such communication becomes instinctive if that culture is created before an unfortunate event, and it should be ongoing during an outage. Conveying that the problem is being worked on, even if no solution has been found, is still better than no communication at all.
Detailed documentation goes hand in hand with communication to help manage expectations and reduce panic in the event of an issue. At a minimum, this means effective developer comments in the code. Documentation is also a best practice in version control, so that updates can be tracked and managed effectively. Good version control in turn allows for a robust rollback mechanism, so that an immediate reversion to a previous, stable version can take place once an issue is identified. This simple but effective measure can prevent a more serious escalation of a flawed update.
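As a rough illustration of the rollback idea, the Python sketch below keeps an ordered record of deployed versions so that reverting to the previous one is a single step. The `ReleaseHistory` class and its methods are hypothetical names invented for this example. A real system would rely on its version control and deployment tooling rather than an in-memory list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReleaseHistory:
    """Hypothetical record of deployed versions, oldest first."""

    deployed: List[str] = field(default_factory=list)

    def deploy(self, version: str) -> None:
        """Record a newly deployed version as the current one."""
        self.deployed.append(version)

    @property
    def current(self) -> Optional[str]:
        """The version currently live, or None if nothing is deployed."""
        return self.deployed[-1] if self.deployed else None

    def rollback(self) -> str:
        """Drop the current version and return the previous one."""
        if len(self.deployed) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.deployed.pop()
        return self.deployed[-1]


if __name__ == "__main__":
    history = ReleaseHistory()
    history.deploy("1.4.0")
    history.deploy("1.5.0")  # suppose this update turns out to be faulty
    print(history.rollback())  # prints "1.4.0"
```

The point of the sketch is that a rollback is only quick when the previous stable version is already tracked and identifiable. That is exactly what disciplined version control and release documentation provide.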
Preparation is Key When Practice Doesn’t Always Make Perfect
At Flint Hills Group, we elevate development best practices not just during the initial software development phase, but with each update. It doesn’t matter if we’ve written the same or similar code dozens of times. We don’t rely solely on repetition; we rely on preparation, using proven strategies to significantly reduce the likelihood of releasing a faulty update and to mitigate the impact if such an issue occurs. If you want a partner with decades of experience that values rigorous testing and contingency planning in each phase of software development, give us a call or drop us a line.
Karen S. Johnson
Technology Enthusiast
Karen S. Johnson is a freelance writer, public relations consultant and technology enthusiast who traded farm life in North Dakota for a smaller-scale farm outside of Waco, Texas. When not writing articles and crafting messaging strategies for technology clients, Karen can usually be found jumping her horses around her 20-acre farm or watching the spectacular sunsets with her husband, dogs and cats.