31 March 2010

Error Handling

Error handling is a much debated topic.
A large portion of the discussion centers on the technical aspects of error handling: when to throw exceptions and when to return an error status, how to incorporate error reporting facilities, and so on. The subject, however, is much wider.
In this post I’d like to explore the topic at a higher level and focus on the validation code, the code that generates the errors: where to write validation code, and how much of it to write, so that the errors can actually be handled.

There is no fail safe program
The naive approach to validation is to regard each method (function, procedure, subroutine, etc.) as a separate entity. As such, each method needs, on the one hand, to validate all inputs (accept only valid inputs and reject all others with an error) and, on the other hand, to handle all possible errors generated by the methods it calls.
The naive approach tries to achieve fail safety by attempting to cover every eventuality. This is not only unrealistic, but also impossible:
  1. It is not possible to validate all input.
     Let’s look at a simple example: binary search. For binary search to work, the random access list that is one of its inputs must be sorted. Can we verify that the input to binary search is sorted? Not in O(log(n)) time: checking that a list is sorted takes O(n) time, which defeats the purpose of using binary search in the first place.
  2. It is not possible to handle all errors.
     There are whole sets of error types programmers choose to ignore: hardware failures, memory corruption, bugs in driver software, etc. Most programs won’t even try to cope with memory allocation errors (out of memory exceptions). Some of these failures are handled by the operating system; others require special hardware. Unless the program is critical, chances are it is perfectly acceptable for the program to crash under these circumstances.
  3. Validation code is not bug free.
     Bugs are statistical beings: the more validation code is written, the higher the chance that a bug in the validation itself will cause the program to fail.
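To make the binary search point concrete, here is a minimal Python sketch. The names `is_sorted` and `checked_search` are hypothetical, made up for this illustration; only `bisect.bisect_left` comes from the standard library. The check has to look at every adjacent pair, so the "fully validated" search costs O(n) even though the search itself is O(log(n)).

```python
import bisect


def is_sorted(items):
    """Return True if items are in non-decreasing order.

    Inspects up to len(items) - 1 adjacent pairs, so it is O(n).
    """
    return all(items[i] <= items[i + 1] for i in range(len(items) - 1))


def checked_search(items, value):
    """Hypothetical 'validate everything' search.

    The O(n) sortedness check dominates the O(log n) lookup it guards.
    """
    if not is_sorted(items):
        raise ValueError("items must be sorted")
    return bisect.bisect_left(items, value)
```

The validation is correct, but it turns a logarithmic operation into a linear one on every call, which is exactly the cost the naive approach ignores.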

Managing failures with contracts
A powerful approach for designing reliable programs is Programming by Contract (also known as Design by Contract).
Using this approach each software unit (method/class/library) in the program is designed using a Contract.

The contract contains three parts:
  1. A set of valid entry states usually stated as pre-conditions that must be met before calling the unit, and the set of inputs recognized by the unit.
  2. The action performed by the unit for each valid entry state.
  3. The state of the unit (or post-condition) after the call completes.
The important thing to remember about contracts is that they not only specify how the unit succeeds, but also how it fails.
Take file I/O operations, for instance. Attempting to open a file without sufficient permission fails, but it fails in an expected way (by throwing a security exception or returning an appropriate error code).
What happens if the function is not called from a valid entry state, i.e. the pre-conditions are not met or the input is not recognized?
In this case the result is undefined: anything can happen.
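As a rough sketch of what such a contract might look like in code, consider the Python functions below. The names `read_text` and `load_or_default` are hypothetical; the only library facts relied on are that the built-in `open` raises `OSError` subclasses such as `FileNotFoundError` and `PermissionError` for these failures.

```python
def read_text(path):
    """Hypothetical contract for a small file-reading unit.

    Pre-condition:    path is a string naming a filesystem location.
    Success:          returns the file's contents as a str.
    Expected failure: raises OSError (for example FileNotFoundError or
                      PermissionError) if the file cannot be opened.
    Anything else (for example a path that is not a string) is outside
    the contract, so the behaviour is unspecified.
    """
    with open(path, encoding="utf-8") as f:
        return f.read()


def load_or_default(path, default=""):
    """Caller that handles only the failures the contract promises."""
    try:
        return read_text(path)
    except OSError:
        return default
```

The caller handles the failures named in the contract and nothing more; it makes no attempt to guard against calls that break the pre-condition.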

To illustrate this, let’s look at the binary search function once more. The contract for binary search specifies that, given a value and a sorted random access list, the search finds in O(log(n)) time either the index of the value in the list, if it exists, or otherwise the insertion point at which the value can be added so that the list remains sorted.
If the list is not sorted, binary search still returns some index, but the contract says nothing about which one: the result depends on the exact values in the list.
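Python's standard `bisect.bisect_left` has essentially this contract, so it makes a convenient illustration. The lists below are made-up sample data.

```python
import bisect

sorted_items = [1, 3, 5, 7, 9]
found = bisect.bisect_left(sorted_items, 7)         # 3: the index of 7
insert_at = bisect.bisect_left(sorted_items, 4)     # 2: where 4 would keep the list sorted

# Pre-condition violated: the list is not sorted.
unsorted_items = [9, 1, 7, 3, 5]
surprise = bisect.bisect_left(unsorted_items, 9)    # 5, although 9 sits at index 0

print(found, insert_at, surprise)
```

No error is raised for the unsorted list. The function simply returns an index that means nothing, which is what "undefined" amounts to in practice.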

Using contracts allows us to write less code.
If we can ensure that an entry state that causes an error is never used, then the error will never occur and we don’t need to write error handling code for it.
Furthermore, if we know an input is validated by a method in the façade, we don’t need to write validation code for the same input in an internal method, an approach called “perimeter defense”.
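A minimal sketch of perimeter defense, assuming a made-up pricing example (the functions `quote` and `_apply_discount` are hypothetical): the façade is the only entry point for untrusted input, so all validation lives there, and the internal helper relies on its pre-conditions without re-checking them.

```python
def _apply_discount(price_cents, percent):
    """Internal helper.

    Pre-condition (not re-checked here): price_cents >= 0 and
    0 <= percent <= 100, both integers.
    """
    return price_cents * (100 - percent) // 100


def quote(price_text, percent_text):
    """Façade: validates untrusted text input once, at the perimeter."""
    try:
        price_cents = int(price_text)
        percent = int(percent_text)
    except ValueError:
        raise ValueError("price and percent must be whole numbers") from None
    if price_cents < 0 or not 0 <= percent <= 100:
        raise ValueError("price must be >= 0 and percent within 0..100")
    return _apply_discount(price_cents, percent)
```

For example, `quote("2000", "25")` returns `1500`, while `quote("100", "150")` is rejected at the façade and never reaches the helper.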
How does having less validation and error handling code make for more reliable programs? It may sound counter-intuitive but reliability is not obtained through more validation.
Contracts define how a software unit behaves and not having to write unnecessary code leaves time to concentrate on the two activities that do increase reliability:
  1. Designing the program so that preconditions are always met; then only expected errors can happen, and those are handled.
  2. Writing unit tests to verify the software units comply with the contract.
There are other ways in which having less unnecessary code increases reliability. The code is more readable and, as mentioned, validation and error handling code is not bug free. The worst case scenario is for a program to fail due to a bug in the error handling code for an error that never happens.
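As one possible shape for the second activity, the sketch below uses Python's `unittest` to check a unit against its post-condition rather than against a handful of hard-coded answers. The function `insertion_index` and the test class are hypothetical names for this illustration.

```python
import bisect
import unittest


def insertion_index(items, value):
    """Pre-condition: items is sorted.

    Post-condition: every element before the returned index is < value,
    and every element from that index on is >= value.
    """
    return bisect.bisect_left(items, value)


class InsertionIndexContract(unittest.TestCase):
    def check_postcondition(self, items, value):
        i = insertion_index(items, value)
        self.assertTrue(all(x < value for x in items[:i]))
        self.assertTrue(all(x >= value for x in items[i:]))
        return i

    def test_value_present(self):
        items = [1, 3, 5, 7]
        i = self.check_postcondition(items, 5)
        self.assertEqual(items[i], 5)

    def test_value_absent(self):
        self.check_postcondition([1, 3, 5, 7], 4)

    def test_empty_and_duplicates(self):
        self.check_postcondition([], 1)
        self.check_postcondition([2, 2, 2], 2)
```

The tests only ever call the unit from valid entry states. There is deliberately no test for an unsorted list, because the contract does not define a result for it. If the code is saved in a module, the tests can be run with `python -m unittest`.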

Validation code revisited
As mentioned, contracts allow the removal of unneeded validation and error handling code. While we definitely do not want to write error handling code for errors that should never happen, we may still want to leave more validation code in place than is strictly necessary.
The first and most important reason is to make the code more reusable. Errors that can never occur in one program may be inevitable in another. Having validation code in place means the contract is less onerous on the side of the caller and we can use the code in more situations.
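A small sketch of the trade-off, using two hypothetical functions. The strict version puts the burden on the caller; the checked version widens the contract so that an out-of-range index becomes a defined, expected error.

```python
def nth_strict(items, n):
    """Pre-condition: 0 <= n < len(items). Undefined otherwise."""
    return items[n]


def nth_checked(items, n):
    """No pre-condition on n: an out-of-range index is a documented IndexError."""
    if not 0 <= n < len(items):
        raise IndexError(f"index {n} out of range for {len(items)} items")
    return items[n]
```

With `items = [10, 20, 30]`, calling `nth_strict(items, -1)` quietly returns `30` because Python treats negative indexes as counting from the end, whereas `nth_checked(items, -1)` raises `IndexError`. A caller that cannot guarantee the pre-condition can still use the second function safely.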
The second reason is to prevent unacceptable harm from unforeseen errors, such as those caused by bugs. Database foreign key constraints are an example of this type of validation. Programs that use a database do not strictly need foreign keys: business logic in the middle tier is usually coded with the relations in mind. In fact, inserting and removing data is often less efficient than it could be because of them.
However, we still use them so that data integrity is preserved even if there is a bug in the business logic.
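Here is a minimal sketch of that safety net using Python's built-in `sqlite3` module. The schema and the helper names `make_db` and `try_add_purchase` are made up for the example. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys` is switched on for the connection.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE purchase ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(id))"
    )
    with conn:  # commits on success
        conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada')")
    return conn


def try_add_purchase(conn, customer_id):
    """Return True if the row was stored, False if the constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO purchase (customer_id) VALUES (?)", (customer_id,)
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A purchase for customer 1 is stored, while a purchase for a non-existent customer 99 (the kind of row a bug in the business logic might produce) is rejected by the database itself.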
The benefits of extra validation must always be weighed against its impact on the performance, reliability and readability of the overall program.