Bubble wrap syndrome (BWS): A safety critical issue

We all know what it is:

Bubble wrap syndrome (BWS): The overwhelming desire to “pop” cells in a sheet of bubble wrap.  See also Packing Peanut Popping (PPP).

Even if it isn’t a real psychiatric diagnosis, it makes for a good, slightly humorous starting point for this post.  Bubble wrap is designed to protect items in shipping.  When we “pop” bubbles we decrease the effectiveness of the wrap.  In the software domain, this corresponds to the slow creep of “features” into the diagnostic code.

The road to heck…

By design, diagnostic code is isolated from the functional code.  Because of this, diagnostic code will sometimes duplicate calculations that exist in the functional code.  It can be tempting to interleave, or reuse, those calculations for both the functional and diagnostic code.

There are three problems with this:

  1. Unit test interference: by mixing the functional and diagnostic code you prevent independent unit testing of the components.
  2. Development independence / common failure mode: by mixing the two components together you run the risk of introducing a common failure mode in the analysis of the data.
  3. Link to a common processor: for some safety critical systems diagnostic code runs on a separate processor or core.  Interleaved code cannot be split across processors in this way.

 

Where are the boundaries?

In this example, the “Proposed Mines” are the possible infestation of our pristine “boundary waters” diagnostic code.  The question is “how close is too close?”  The following rules of thumb can be applied:

  1. The functional code should not depend on any output from the diagnostic beyond the diagnostic flags.
  2. The diagnostic code should not make calls to functional code.
  3. The diagnostic code should not perform filtering, table lookups or integration operations.
  4. The diagnostic code should be non-interruptible.
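The first two rules can be sketched in a few lines.  This is only an illustration in Python, not generated or production code; the coolant temperature signal, the range limits, the fan ramp and the fail-safe value are all hypothetical.

```python
from dataclasses import dataclass

TEMP_MIN_C = -40.0   # hypothetical valid sensor range
TEMP_MAX_C = 150.0


@dataclass(frozen=True)
class DiagnosticFlags:
    """The only diagnostic output the functional code is allowed to read."""
    temp_out_of_range: bool


def diagnose_coolant_temp(temp_c: float) -> DiagnosticFlags:
    """Diagnostic code: a plain range check with no calls into functional code."""
    in_range = TEMP_MIN_C <= temp_c <= TEMP_MAX_C
    return DiagnosticFlags(temp_out_of_range=not in_range)


def fan_duty(temp_c: float) -> float:
    """Functional code: ramp the fan from 0% at 80 C to 100% at 100 C."""
    return min(max((temp_c - 80.0) / 20.0 * 100.0, 0.0), 100.0)


def fan_command(temp_c: float, flags: DiagnosticFlags) -> float:
    """Functional code reads the flag only, never diagnostic internals."""
    if flags.temp_out_of_range:
        return 100.0  # hypothetical fail-safe: run the fan at full duty
    return fan_duty(temp_c)
```

Because the two sides only meet at `DiagnosticFlags`, each function can be unit tested on its own and the diagnostic could be moved to a separate core without touching `fan_duty`.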

 

Many tools, many formats…

Welcome to my latest blog.  I have tried for a “dynamic” whiteboard experience today; I apologize in advance for the occasional motion blur.

The focus of this video is how to exchange data between multiple tools, with a recommended approach of settling on a common data format that all tools can read and write into.  A list of common data formats can be found here:

Debugging in Simulink

First a definition:

A software bug is an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways.

This is in contrast to incomplete development where the program is not yet performing the intended function.

There are three types of bugs:

  1. Change induced: these are bugs that arise when part or all of the program is changed.  These can often be resolved by doing comparisons against the earlier version of the program.
  2. Corner case bugs: these bugs are due to missed behavioral cases; for instance, not accounting for overflow in a button counter.
  3. Incorrect library/function usage: these bugs arise from the incorrect use of library functions; for instance, passing a double to an integer function.
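To make the button counter corner case concrete, here is a small Python sketch.  Python integers do not overflow, so the 8-bit width is modeled explicitly; the function names and the choice of 8 bits are assumptions for illustration only.

```python
def increment_wrapping(count: int, bits: int = 8) -> int:
    """Increment like an unsigned integer that wraps around on overflow."""
    return (count + 1) % (1 << bits)


def increment_saturating(count: int, bits: int = 8) -> int:
    """Increment, but hold at the maximum value instead of wrapping."""
    max_value = (1 << bits) - 1
    return min(count + 1, max_value)
```

With the wrapping version the 256th button press resets the count to 0, which is exactly the kind of case that is easy to miss in testing.  The saturating version holds at 255 instead.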

DIF:  Detect, Isolate, Fix

In debugging, the first step is to trace the issue to its root.  In Simulink, this is normally a subsystem; in Stateflow, a set of state transitions.  In either case, the issue could be due to changes in parameterization, so…

  1. Review/compare parameter data: inspect the parameter data that specifies the behavior of the system.  Try reverting to earlier versions of the data.
  2. Introduce data logging: the simplest level of debugging is the introduction of intermediary data logging points.   If this is a change induced bug this is often enough to determine the problem.
  3. Simulate by unit: where possible decompose the full model into components and simulate them in isolation to determine the subsystem behavior.
  4. Introduce breakpoints: both Simulink and Stateflow allow for the introduction of breakpoints.  Conditional breakpoints, where the simulation halts for a given configuration, add additional debugging power.
  5. Use formal methods: use of formal method tools such as Simulink Design Verifier to detect dead logic and overflows/underflows can automatically determine the location of some bugs.
  6. Second eyes: Bring another person in to talk about your model, what you expect and what it is doing.
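Step 1, comparing parameter data between versions, is easy to automate once the parameters are available as name/value pairs.  The sketch below is a generic Python illustration and does not use any Simulink API; the parameter names in the usage note are made up.

```python
def diff_parameters(old: dict, new: dict) -> dict:
    """Return {name: (old_value, new_value)} for every parameter that differs."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name)  # None means "not present in this version"
        after = new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

For example, `diff_parameters({"Kp": 1.0, "Ki": 0.1}, {"Kp": 1.2, "Ki": 0.1, "Kd": 0.0})` reports that `Kp` changed and `Kd` was added, which narrows down where to start looking.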

Common “bugs” and simple fixes

The following are common development bugs:

  1. Misaligned ports:  verify that the inputs to a Model or Library correctly map with the calling subsystem.  This issue arises when the referenced model/library is changed.
  2. Never reached: dead code due to logic that is never activated.  This is found using the coverage report or through SLDV.
  3. NaN: NaN, or not a number, occurs for operations such as dividing zero by zero (a nonzero value divided by zero gives Inf).  To detect this, enable the Simulink diagnostic for this condition.
  4. Interpolation data and tables: by default, blocks in Simulink will extrapolate outside of the specified range.  This can cause problems if:
    1. The data is of integer type and the result is a float
    2. The data is not valid outside of the specified range
  5. Saturation/limiters: frequently people put limit blocks into systems during development.  These blocks can “prevent” issues but also introduce errors.  Inspect the data going into and out of limit blocks (and limits on integrators).
  6. Synchronization: in some instances, the behavior depends on synchronized data.  If the signals are out of alignment, due to either the introduction of unit delays or the sample rate of the system, the behavior will be wrong.  Look for cases where transitions are dependent on the status of two highly transient variables at the same time.
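The table issue in item 4 is easier to see with numbers.  The following is a plain Python sketch of a 1-D linear lookup, not the Simulink block itself; the `extrapolate` switch and the table values in the usage note are assumptions for illustration.

```python
from bisect import bisect_right


def lookup(x, xs, ys, extrapolate=True):
    """Linear 1-D table lookup; xs must be strictly increasing."""
    if not extrapolate:
        x = min(max(x, xs[0]), xs[-1])  # clip the input to the table range
    # Pick the segment; out-of-range inputs reuse the first or last segment.
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x - xs[i])
```

With breakpoints `[0, 10, 20]` and table data `[0, 100, 150]`, an input of 30 extrapolates to 200 along the last segment, while the clipped lookup holds at 150.  If the data is not valid past 20, only the second answer is defensible.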

Feedback:

I would love to hear about your common bugs and debugging techniques.

How deep is your data?

I recently had a conversation with a client about how they instantiated constants for their model.  Their approach was to group common parameters together into a structure.  What I pictured was something like this:

smallStruct

In this instance, we have a small structure, one layer deep with 4 elements.  This would be enough information to perform the calculations required to transform a throttle sensor voltage into a throttle position.
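As a rough picture of such a structure, here is a Python sketch with four elements.  The field names and the linear voltage-to-position mapping are my own assumptions for illustration; the client’s actual structure is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThrottleSensorParams:
    """A flat, one-layer parameter structure with four hypothetical elements."""
    volt_min: float  # sensor voltage at fully closed throttle
    volt_max: float  # sensor voltage at fully open throttle
    pos_min: float   # throttle position (percent) at volt_min
    pos_max: float   # throttle position (percent) at volt_max


def voltage_to_position(volts: float, p: ThrottleSensorParams) -> float:
    """Map a sensor voltage linearly onto a throttle position."""
    span = (volts - p.volt_min) / (p.volt_max - p.volt_min)
    return p.pos_min + span * (p.pos_max - p.pos_min)
```

Everything the calculation needs is visible in one place, for example `voltage_to_position(2.5, ThrottleSensorParams(0.5, 4.5, 0.0, 100.0))`.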

However, what they showed me was something quite different.  In their instance, they had a hierarchical structure that was, in some places, 7 layers deep.   The single structure contained not only all the parameters required for a single component but for multiple models.

The deep…

This isn’t the first time I have seen a structure like this; in general, they grow organically.  As multiple people work on a project they want an “easy” way to share data and, at first, when it is small, the method works well.  However, as the structure grows in size several problems start to emerge.

  1. Where is my data: the first problem with a large data structure is finding the data.  Even the most logically organized structure becomes difficult to search once it is large.
  2. Waste of space: deep structures inevitably end up with unused data.
  3. Repeated data: going along with the “where is my data” problem is the “repeated data” problem.  People will add the same data to multiple locations.
  4. Customization: with a large data structure you have to configure the whole data structure as one.
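The “where is my data” and “repeated data” problems can both be checked mechanically.  The sketch below models the deep structure as nested Python dictionaries, which is an assumption; the parameter names in the usage note are invented.

```python
from collections import defaultdict


def flatten(struct: dict, prefix: str = ""):
    """Yield (dotted_path, value) for every leaf of a nested dictionary."""
    for key, value in struct.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def repeated_leaves(struct: dict) -> dict:
    """Map each leaf name that appears more than once to all of its paths."""
    paths_by_name = defaultdict(list)
    for path, _ in flatten(struct):
        paths_by_name[path.rsplit(".", 1)[-1]].append(path)
    return {name: paths for name, paths in paths_by_name.items() if len(paths) > 1}
```

Running `repeated_leaves` on a structure where both `engine.throttle.volt_max` and `trans.sensor.volt_max` exist flags `volt_max` as a candidate duplicate.  A person still has to decide whether the two really are the same parameter.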

Finding balance

The argument against flattening structures was “If we break them up we will have thousands of parameters”.  While that was factually correct, it missed the fact that they already had 1000’s of parameters, just in a deep structure.  The advantages of the flat format are:

  1. Ability to easily find parameters
  2. Ability to customize on a parameter by parameter basis
  3. Only used parameters are in the generated code

There are some disadvantages, related to how the parameters are stored in files; a single structure can be stored easily in a single file.  With multiple parameters, a storage format needs to be determined.  Standard approaches include:

  1. Use of MATLAB files
  2. Use of .mat files
  3. Use of Simulink Data Dictionary
  4. Use of an external database

Any of these approaches can be used to organize the data.

 

 

Soccer or Football? Multiple names for the same thing…

For those of you reading in the distant future, e.g. more than a month from now, let me set the stage.  It is July 2018 and (as an American) World Cup Soccer is in full swing.  Now if I was anyplace else in the world it would just be “The World Cup”.  However, with either name you know what I am talking about; this is what is called a “one-to-one” mapping.

In natural (e.g. spoken) languages, these “one-to-one” mappings (or near mappings) are common and can often be understood through the context of the conversation.  However, in software these mappings are problematic.

A rose by any other name… may still have thorns

Multiple names come into existence for multiple reasons.

  1. Multiple developers: This is the most common reason, multiple developers working in separate models/files.  They each “need” to use a variable so they create a name for the variable.
  2. Units: For physical quantities, such as vehicle speed, it is common to see VehSpeedMPH and VehSpeedKPH.  While this may seem like a good idea, e.g. the units are known, this is perhaps the most problematic duplicate as you will often see the two instances out of sync with each other.
  3. Reusable functions: Here the same function is used in multiple places.  The key is to have meaningful generic names in the function.
  4. Common usage: For a subset of cases the reuse of names should be encouraged.  This is the scoped data with common usage.  For example “cnt” can be used for a counter in multiple models/functions.
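One way to avoid the out-of-sync problem from item 2 is to store the quantity once and derive the other unit on demand.  This is a Python sketch of the idea, not a Simulink pattern; the class and attribute names are hypothetical.

```python
KPH_PER_MPH = 1.609344  # exact by definition: 1 mile = 1.609344 km


def kph_to_mph(kph: float) -> float:
    """Convert kilometers per hour to miles per hour."""
    return kph / KPH_PER_MPH


class VehicleState:
    """Stores vehicle speed once (in km/h); mph is always derived."""

    def __init__(self, speed_kph: float):
        self.speed_kph = speed_kph

    @property
    def speed_mph(self) -> float:
        # Computed from the single stored value, so it cannot go stale.
        return kph_to_mph(self.speed_kph)
```

Because `speed_mph` is computed from `speed_kph` every time it is read, there is no second variable to forget to update.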

The reconciliation project

First, determine if it is worth doing.  While consistency is important, code that is fully developed and validated may not warrant the work required to update the code.  Once the decision has been made to update the code/model base, the following steps should be taken.

  1. Identification:  find where the same data concept is used with different names.
  2. Select a common name: Use of a descriptive name is important in these cases to smooth the transition process.
  3. Check for “unit” issues:  If the different names existed due to desired units, validate that the change to the common units is accounted for downstream.
  4. Update documentation/test cases: (You do have those, right?)  Documentation and test cases will often reference the “old” names.  Update them to reflect the “new” name.
  5. Validate the units: After the updates have been performed the full regression test suite should be run to validate the behavior of the models.

 

 

Is it a guideline if you can’t enforce it?

As the past coordinator for the MAAB Style Guidelines, I have spent a fruitful number of hours thinking about guidelines for the Model-Based Design and Safety Critical environments.  In a recent discussion, I was challenged by the question “Why have a guideline that you cannot check?”

Now any guideline can be validated through a manual review process.  In this instance, the query was specifically asking about automatic validation.  Further, they were working in an environment where the majority of users were new to the Model-Based Design environment.  So here is my, evolved, answer.

Why can’t it be enforced?

Some guidelines cannot be enforced because they depend on human judgment; things like “meaningful names” or “readable diagrams” are, by their very nature, subjective.  (Though I have an idea for a neural network solution for the meaningful names issue.)  Since such a guideline can’t be enforced, is it worth throwing out?  Generally no; it becomes a “best practice” and perhaps a subset of the rule could be enforced (e.g. limit the number of blocks per level of the model, minimum name lengths…).
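As an example of an enforceable subset, a minimum name length can be checked automatically even though “meaningful” cannot.  This Python sketch is generic and not tied to any particular checking tool; the threshold of 4 characters and the exemption list are arbitrary assumptions.

```python
def short_names(names, min_length=4, exempt=frozenset()):
    """Return the names that violate the minimum-length rule, in input order."""
    return [n for n in names if len(n) < min_length and n not in exempt]
```

For instance, `short_names(["x", "VehSpeed", "cnt", "tmp2"], exempt={"cnt"})` flags only `"x"`.  Whether `tmp2` is a meaningful name is still a question for a human reviewer.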


What do you do with the non-enforceable?

As a general best practice, when guidelines are rolled out there should be an education seminar to explain the guidelines and their purpose.  Special emphasis needs to be placed on those guidelines that cannot be automatically enforced.  Explain the:

  • Rationale: why the guideline benefits the end user
  • Effort: how hard will it be for the user to follow
  • Benefit: what does the end user and the team get out of following the guideline

Final thoughts

In the end, these guidelines should be thought of as a recommendation.  Some, but not all, will be caught during reviews with co-workers and by test engineers.  That they will not always be followed should be expected, but if you never provide the guidance they can never be followed.  If you keep them to a minimum, say no more than 6 to 10 highly impactful guidelines, eventually people will follow them without thinking.


When is the glue too thick?

The term “glue code” is a colloquial term for “a thin layer of software that connects software”.  In general, it is used to connect two software components that were not designed to interact with each other.  This is a common problem and, when sensibly done, is a fine approach.  However, it is possible to develop “super glue” solutions which are, in the end, fragile and difficult to maintain.

“Standard glue”

The following are examples of standard “glue code” functions.  Correctly implemented, they are a thin layer between modules:

  • Data format translations: repackaging data between two different formats, such as an XML to CSV translation function.
  • Communication port: adding a data transmission function between two pieces of code.
  • Registry/read functions: these are functions that map data from one source (such as requirements) onto an object (such as a model)
  • Error catching: these functions, generally, work in concert with other glue code ensuring that the data exchanged between the modules is correctly formatted.
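A data format translation with a little error catching can stay very thin.  The sketch below is a minimal Python version of an XML to CSV translation using only the standard library; the record layout it expects (one element per record, one child element per field) is an assumption, not a general XML converter.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_csv(xml_text: str, record_tag: str, fields: list) -> str:
    """Translate <record_tag> elements into CSV rows with the given columns."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)  # header row
    for record in root.iter(record_tag):
        row = []
        for field in fields:
            value = record.findtext(field)
            if value is None:
                # Error catching: refuse badly formatted input early.
                raise ValueError(f"<{record_tag}> is missing <{field}>")
            row.append(value.strip())
        writer.writerow(row)
    return out.getvalue()
```

The whole function is one translation plus one format check.  Once it starts accumulating special cases for particular records, it is drifting toward the “crazy glue” described next.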

Crazy glue

How do I tell if I have crazy glue?  There are 4 basic warning signs:

  1. Use of multiple languages:  connecting two software components written in different languages is a common task.  However, if you find your glue code uses more than one language, chances are you are doing something too convoluted.
  2. Use of files for the interface: ideally glue code is written by leveraging existing APIs in the software components.  If the function interface is through a file and a “polling function”, the data exchange will be difficult to maintain.
  3. The growth of “special cases”:  when the number of special cases the glue code has to handle gets above 10 to 15, chances are the data exchange format is not well defined.
  4. Size: there is no hard and fast rule for how large glue code can be, however at some point in time it stops being glue and becomes its own software component.

What to do with crazy glue?

In the ideal world, the glue code would be replaced with improved APIs in the source software components.  Often this is not possible due to ownership issues.  When this is the case, basic software best practices come into play:

  1. Break the glue into components
  2. Look for ways for one component to encapsulate the other
  3. Refactor the code to remove special cases
  4. Consider using different software components


 

 

 

Hierarchy of reuse

I have written about reuse in a number of posts before; today I want to talk about a conceptual framework of reusability.  In this video I talk about four types of reusable “objects”:

  • Concepts: These have high reusability with a correspondingly high initial development cost.  Examples include:
    • Model architectures, testing infrastructure, modeling guidelines
  • Fundamental objects: These also have high reusability, but have a relatively low initial cost.  Examples include:
    • Transfer function implementations, bang-bang control algorithms…
  • System implementations: These have a medium level of reuse and a moderate initial implementation cost.  The “smaller” the system is, the lower the cost.  Examples include:
    • An internal combustion engine model, a wing’s control surface
  • Targeted implementations: These have the lowest level of reusability and implementation cost.  Examples include:
    • Hardware device drivers, specific fault monitoring algorithms.

 

 

Why testing fails…

For complex systems, the goal of “complete” testing can be impossible to reach.  This often boils down to issues of time and complexity.  This is the most “common” type of failure.  A second, more insidious type of failure is improper testing.

What is improper testing?

Improper testing arises from not fully understanding what is being tested or how the testing algorithms work.

Let’s take a simple example: response time to a step input.  For this example, we will use the following test requirement:

TR_1028: Within 0.5 seconds of the unit step input the control system will settle to within 5% of the input value.

Failure type 1:  Single point pass:

There are two errors with this test:

  1. The hard comparison for the driver input value.  A correct method for evaluating this would be
    abs(driver-target) < eps(target)
  2. The pass condition is a single point in time.  What this means is that the signal could continue past the target without ever settling.

stepMessureError

Failure type 2: “Noise” intolerant

To fix this error the settling time needs to be defined, so a new state of “Settling” is added.  However, we have now introduced a new error:

  1. Hard failure condition: the response to the signal can include overshoot before the signal dampens down.  The current test fails if the signal falls out of the “less than 5%” range at any time.

failure_1_p2

Solution to this issue:

The solution to this issue is to define a max settling time and to trigger a failure condition if the signal has not settled within the time.

failure_1_p3
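The same logic can be written down as a check over logged data.  This is my own Python sketch of the idea, not the test model shown in the images; it assumes the step occurs at t = 0, that the log is long enough to show the settled behavior, and the function and argument names are hypothetical.

```python
def settles_in_time(times, values, target, band=0.05, max_settle=0.5):
    """True if the signal enters the +/- band around target no later than
    max_settle seconds after the step and stays inside until the log ends."""
    tolerance = band * abs(target)
    settled_at = None  # start time of the current in-band run
    for t, v in zip(times, values):
        if abs(v - target) <= tolerance:
            if settled_at is None:
                settled_at = t
        else:
            settled_at = None  # left the band: overshoot, not yet a failure
    return settled_at is not None and settled_at <= max_settle
```

Overshoot before the deadline is tolerated because leaving the band only resets the settle time.  A single in-band sample does not pass, because the signal also has to stay in the band through the end of the log.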

Failure type 3: Sampling

A common failure mode when dealing with large data sets relates to sampling.  Let’s take a look at an error flag example.

TR_2034: During the duration of operation the check engine flag will not be active.

It is common when logging data in a system to downsample the data; e.g. log every 10th or 100th data point.  If the flag is intermittent it is possible that the flag would not be logged.

There are two solutions to this problem; first, the code could be refactored to include an “ever high” flag.  The second option would be to reconfigure the sampling rate for this specific test.
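The first option, an “ever high” flag, is a simple latch.  The Python sketch below shows why it survives downsampling; the list-based signals and function names are assumptions for illustration.

```python
def downsample(samples, step):
    """Keep every step-th sample, starting with the first."""
    return samples[::step]


def ever_high(samples):
    """Latch: once the input has been high, the output stays high."""
    latched = False
    out = []
    for s in samples:
        latched = latched or bool(s)
        out.append(latched)
    return out
```

If the raw flag is high for a single sample at index 37 and every 10th point is logged, the raw log never shows it.  The latched signal is high from index 37 onward, so the logged sample at index 40 catches it.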

Failure type 4: Matching problems…

One common test type is baseline comparisons.  There are three types of problems that arise when comparing baseline data:

  1. Failure to align/sampling:  baseline data is often taken from physical systems or prior simulations.  In this case, alignment of the data between the simulation and the test data is critical.  Likewise, if the sample rate of the baseline and the simulation are different then the alignment becomes more difficult.
  2. Poorly defined tolerances:  comparisons can be made against absolute, relative and percentage error.  Further, the data type (integer, enum, double) has an effect on the type of comparison that should be performed.
  3. Scaling/Units: The final common baseline data issue is a failure to correctly account for differences in units and scaling between the two sets of data.
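The first two problems can be handled explicitly in the comparison itself.  The following Python sketch resamples the baseline onto the simulation time points and applies a combined absolute/relative tolerance; the linear interpolation and the “larger of the two tolerances” rule are assumptions, and other choices may suit your data better.

```python
def interpolate(t, times, values):
    """Linearly interpolate values at time t; times must be increasing
    and t must lie inside [times[0], times[-1]]."""
    for i in range(len(times) - 1):
        if times[i] <= t <= times[i + 1]:
            frac = (t - times[i]) / (times[i + 1] - times[i])
            return values[i] + frac * (values[i + 1] - values[i])
    raise ValueError(f"t={t} is outside the baseline time range")


def matches_baseline(sim_t, sim_v, base_t, base_v, abs_tol=0.0, rel_tol=0.0):
    """Compare simulation samples to a baseline logged at a different rate."""
    for t, v in zip(sim_t, sim_v):
        expected = interpolate(t, base_t, base_v)
        allowed = max(abs_tol, rel_tol * abs(expected))
        if abs(v - expected) > allowed:
            return False
    return True
```

Stating both tolerances as arguments forces the test author to decide what “close enough” means.  Units and scaling still have to be reconciled before the data reaches this function.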

Summary

The types of failures described here are a subset of improper testing issues.  They represent a failure to understand the full scope of the behavior of the system or a failure to understand how the collected data can be inspected.