Designed to Fail

“Success is not final, failure is not fatal: it is the courage to continue that counts.” — Winston Churchill

One of the most common phrases we hear and use within the software development community is “No program survives contact with the user.” Brief research shows that this is a rephrasing of a statement attributed to Field Marshal Helmuth Karl Bernhard Graf von Moltke: “No battle plan ever survives contact with the enemy.”

While we certainly prefer not to view our users as the enemy, it is often the case that releasing software, art, or other resources to broader use reveals flaws inherent to the project itself. In software we call these “bugs,” or, if we’re feeling kind, perhaps “emergent features.” In art and architecture, they’re more often called “flaws” or “imperfections.”

In any case, and by any terminology, failure states will occur. The key is to design both for success, and for failure.

Making Exceptions the Rule

In software development, an “exception” is an unanticipated error state in which the program hits a condition that cannot be fulfilled within its constraints. There are many causes for exceptions. In an ideal case, these will truly come from unanticipated conditions. Perhaps a file we were relying on has gone missing. Perhaps a remote server is no longer responding, or the database has shut off unexpectedly. It’s even possible that, despite our best efforts, a user has found a way to get the program into a state we didn’t plan for and that it cannot handle on its own.

When we’re designing a building, a car, a rocket, or a website, it’s very easy to focus on how things should look when everything is working properly. This is the state we’re most concerned about, because we want this to be the state that the most people will see and interact with. As an artist, it is very easy to get focused on all the ways you want something to work without delving further into all the ways it should behave when it doesn’t work.

However, it is naïve to operate on the assumption that exceptions will not occur, or even that they will not be reasonably common. Treat exceptions as the norm. Assume that something can go wrong in any case and condition where it is possible to do so. Write code that allows the software to degrade gracefully. Give the building redundant supports so that a structural failure of one does not bring it down.
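As a small illustration of treating a missing file as an expected condition, here is a minimal Python sketch. The function name `load_settings` and the `DEFAULT_SETTINGS` values are hypothetical, invented for this example; the point is only that the failure is logged and the program carries on with sensible defaults instead of stopping.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# Hypothetical defaults, used whenever the settings file cannot be used.
DEFAULT_SETTINGS = {"theme": "light", "items_per_page": 20}


def load_settings(path):
    """Return settings read from a JSON file, or defaults if that fails."""
    try:
        with Path(path).open(encoding="utf-8") as handle:
            loaded = json.load(handle)
    except FileNotFoundError:
        # The file we relied on has gone missing: note it and keep going.
        logger.warning("Settings file %s is missing; using defaults.", path)
        return dict(DEFAULT_SETTINGS)
    except (OSError, ValueError) as error:
        # Unreadable or corrupt file (json.JSONDecodeError is a ValueError).
        logger.error("Could not read settings from %s: %s", path, error)
        return dict(DEFAULT_SETTINGS)

    if not isinstance(loaded, dict):
        logger.error("Settings in %s are not a JSON object; using defaults.", path)
        return dict(DEFAULT_SETTINGS)

    # Values from the file override the defaults; missing keys fall back.
    return {**DEFAULT_SETTINGS, **loaded}
```

Calling `load_settings("settings.json")` when no such file exists returns a copy of the defaults and leaves a warning in the log, so the relevant parties can still find out that something went wrong.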

Failure By Design

If we operate on the assumption that our project will reach a failure state, we quickly identify that there must be some way to allow the project to continue as best it can while informing the relevant parties that a failure occurred. Total shutdown should be a final resort, not the first line of defense.

Consider Amazon as an example. It is a very complicated eCommerce site, and one of the strongest digital marketplaces in the world. It serves millions of users, some strictly as consumers and others as businesses, pushing transactions of many kinds in great volume throughout the day. The vast majority of transactions will have no issues. However, in a complex system there are multiple points of failure.

It is possible that the site will get enough traffic or orders to shut down its orders database. Even with that database down, the site does its best to let the user keep interacting with everything that still works. So, even when orders are offline, a user may be able to go through the store, read reviews, explore products, search, and add items to their cart or wishlist. Only when the user reaches the failure point are they stopped. At that point the user is notified that the desired interaction cannot be completed, but that they can continue to use other portions of the site while the system administrators address the issue.
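The same idea can be sketched in a few lines of Python. This is not how Amazon is actually built; every class and method name here (`Storefront`, `OrdersBackend`, `OrdersUnavailable`) is made up for illustration. The sketch only shows the shape of the design: searching and filling a cart never touch the orders backend, so they keep working while it is down, and the failure is caught at the one place where it matters.

```python
class OrdersUnavailable(Exception):
    """Raised when the (hypothetical) orders backend cannot be reached."""


class OrdersBackend:
    """Stand-in for an orders database that can be switched offline."""

    def __init__(self):
        self.online = True
        self.placed = []

    def place(self, items):
        if not self.online:
            raise OrdersUnavailable("orders database is offline")
        self.placed.append(list(items))
        return len(self.placed)  # a simple sequential order number


class Storefront:
    def __init__(self, catalog, orders):
        self.catalog = list(catalog)
        self.orders = orders
        self.cart = []

    def search(self, term):
        # Depends only on the catalog, so it works while orders are down.
        return [name for name in self.catalog if term.lower() in name.lower()]

    def add_to_cart(self, name):
        if name not in self.catalog:
            raise KeyError(name)
        self.cart.append(name)

    def checkout(self):
        try:
            number = self.orders.place(self.cart)
        except OrdersUnavailable:
            # The failure point: stop here, keep the cart, and tell the user.
            return ("Checkout is temporarily unavailable. Your cart has been "
                    "saved, and you can keep browsing.")
        self.cart = []
        return f"Order {number} placed."
```

With the backend's `online` flag set to `False`, `search` and `add_to_cart` behave normally and `checkout` returns the notice without emptying the cart. Once the backend is back online, the same `checkout` call goes through.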

In architecture, this can include stress fracture planning. When you look at a sidewalk or driveway, there are lines through the concrete called “contraction joints.” They aren’t structurally necessary and they aren’t particularly appealing visually. In an ideal world and ideal design, they probably wouldn’t be there, so why do those lines exist? They exist for failure, for that undesired state. Due to natural processes, the land under and around a driveway or sidewalk will shift over time, adding stress to the concrete. The contraction joints are intentional, structured weaknesses designed to be a controlled point of failure. Because of them, the majority of fractures will occur along those joints, in relatively straight lines that can be easily patched and filled.

Redefining Success

In many cases, success can be defined as “working exactly as intended.” An application is successful if the user can do everything they’re supposed to do, and if it looks and acts the way it is intended to.

Such a definition of success is insufficient, though, because it only tells half the story. Success does not exist in a vacuum, and is not an absolute state. If everything is working properly, we’re certainly doing well, but we haven’t reached success just yet.

Success needs to be a broader term. Rather than considering success to be everything working as intended, we need to redefine it. Success is not the absolute inability to fail, but the ability to continue when failure occurs. Only the simplest projects can even be presumed to exist in a state where no error is possible, and even then it is still naïve to assume none could occur. When there is no possibility of existence without failure, we must consider success to mean the ability to fail in a controlled fashion and get back into a functional state to continue operating.