Error detection is not a halting problem,(1) which is to say that we can halt before crashing.(2) The objective of error detection is to determine operating regimes where continued operation presents
- a risk to the operator
- a risk to those in the environment
- a risk to the device
When those conditions are detected, the device should provide feedback to the user and/or migrate into a different operating mode.
How do you know when faults are possible?
There are three primary ways in which errors can be predicted:
- Known “dangerous” operating conditions: (fever condition)
In some cases the fault condition is known: when your temperature goes above X, you have a fever and should get treatment. In the same fashion, most devices have known conditions where continued operation is a risk.
- Interpolated error: (crystal ball)
In some cases, projecting the current conditions forward shows that the device will soon enter a known dangerous condition. In this case, both how far away the condition is and how long you have been trending towards it should inform when the alert is raised.
- Regression/ML/AI error: (history book)
A statistical approach to predictive maintenance and error detection can also be taken: when conditions that historically have led to a fault are detected, the alert can be raised. This differs from the crystal ball in that the root cause may not be understood.
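The three prediction methods above could be sketched roughly as follows. This is only an illustration, not a production monitor: the temperature signal, the 90 °C limit, the tolerances, and every function name are assumptions made up for the example.

```python
from typing import Optional, Sequence, Tuple

LIMIT_C = 90.0  # hypothetical "fever" threshold for a temperature sensor


def known_condition(temp_c: float, limit: float = LIMIT_C) -> bool:
    """Fever condition: the reading is already in the known dangerous regime."""
    return temp_c >= limit


def time_to_limit(samples: Sequence[float], dt: float,
                  limit: float = LIMIT_C) -> Optional[float]:
    """Crystal ball: fit a least-squares line to evenly spaced samples and
    return the seconds until that line reaches the limit. Returns None when
    there is too little data or the trend is not heading towards the limit."""
    n = len(samples)
    if n < 2:
        return None
    times = [i * dt for i in range(n)]
    t_mean = sum(times) / n
    y_mean = sum(samples) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(times, samples))
             / sum((t - t_mean) ** 2 for t in times))
    if slope <= 0:
        return None
    fitted_now = y_mean + slope * (times[-1] - t_mean)
    return max(0.0, (limit - fitted_now) / slope)


def historical_fault_rate(history: Sequence[Tuple[float, float, bool]],
                          temp_c: float, load: float,
                          tol_temp: float = 5.0,
                          tol_load: float = 0.1) -> Optional[float]:
    """History book: among past (temperature, load, faulted) records that look
    like the current conditions, return the fraction that ended in a fault.
    No root cause is modeled; this is purely statistical."""
    similar = [faulted for (t, l, faulted) in history
               if abs(t - temp_c) <= tol_temp and abs(l - load) <= tol_load]
    if not similar:
        return None
    return sum(similar) / len(similar)
```

In this sketch the length of the sample window passed to `time_to_limit` stands in for "how long you have been trending," and the returned time stands in for "how far off" the condition is. A real system would likely use a proper model in place of the similarity lookup in `historical_fault_rate`.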
How to respond
How you respond should depend on the severity of the fault and its timeline. The higher the severity of the fault, the stronger the response; the closer the fault is to “activating,” the faster the response.
The lowest level of alert is the diagnostic message. For a low-impact, delayed incident, this is acceptable. However, the end user may ignore it, so…
For a mid-level response, the device may reduce performance. For example, adaptive cruise control could apply braking if the vehicle in front is too close.
At the far end of things (high alert), the device should deploy / respond / act in such a way as to protect the users and the people in the environment.
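One way to picture the escalation described above is a small lookup from severity to response level, with the deadline scaled by how close the fault is. The names `Severity`, `Response`, `Plan`, and `plan_response`, and the "act within half the remaining time" margin, are all hypothetical choices for this sketch.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1     # low impact, e.g. a maintenance reminder
    MEDIUM = 2  # degraded operation, e.g. following too closely
    HIGH = 3    # danger to users or people in the environment


class Response(IntEnum):
    DIAGNOSTIC_MESSAGE = 1  # tell the user
    REDUCE_PERFORMANCE = 2  # derate the device, e.g. apply braking
    PROTECTIVE_ACTION = 3   # act to protect people


@dataclass(frozen=True)
class Plan:
    response: Response
    act_within_s: float  # deadline for starting the response


# Higher severity maps to a stronger response.
_RESPONSE_FOR = {
    Severity.LOW: Response.DIAGNOSTIC_MESSAGE,
    Severity.MEDIUM: Response.REDUCE_PERFORMANCE,
    Severity.HIGH: Response.PROTECTIVE_ACTION,
}

SAFETY_MARGIN = 0.5  # assumed: start acting within half the remaining time


def plan_response(severity: Severity, seconds_to_fault: float) -> Plan:
    """The closer the fault, the shorter the deadline for responding."""
    deadline = max(0.0, seconds_to_fault) * SAFETY_MARGIN
    return Plan(_RESPONSE_FOR[severity], deadline)
```

For example, `plan_response(Severity.MEDIUM, 4.0)` yields a reduce-performance response that should begin within 2 seconds under the assumed margin.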
Better late than never; better early than late
As the section above shows, the stronger the response, the more involved the recovery. By the time you hit the high alert, repairing the device is a far greater expense than responding to a simple “check engine” light.(4) Use the three prediction methods and you(5) can prevent faults.
- (1) I was surprised that there are no good engineering cartoons on the halting problem; perhaps they haven’t been finished yet.
- (2) When it comes to error detection and fault handling, many engineers treat the perfect as the enemy of the good. It is always better to error out gracefully than to continue operating at risk.
- (3) I know this is a fault screen for the small backhoe of a Bobcat, but I like to imagine this is a nature documentary following a wildcat around (some of you may remember the early ’90s commercials featuring a live bobcat).
- (4) The migration from adaptive cruise control to air bags is clear; if you brake, you don’t crash.
- (5) In this case the “you” is you and your whole engineering team.