Reporting: Linking data to decisions


Here at neurocat, one of our core solution offerings is ML perception model testing. These tests use image augmentations to push the model and see how it performs in conditions at the edge of its training set. The purpose of the tests is to provide results that help developers decide the next steps in their product development.

But how can we know if a given test fulfilled its purpose? This is where reporting comes in.  

Tests, ours included, often produce copious results. Yet in their raw form, results have limited application. To maximize their value, the results, which are a collection of data, need to be selectively structured and presented with a given purpose and audience in mind. When this is done, the results become a report.

So let's talk about reports: both what they should look like and how neurocat's ML safety reports for perception models fit this report ideal type. 

Purpose

Every report should have a purpose. Moreover, that purpose should be stated at the top of the report, because not every person who sees a report will have been involved in producing it.

Broadly, a report's purpose is to leverage knowledge, backed by data, to inform future decisions and/or actions. Thus, our reports' purpose statements outline what inputs (data and its augmentation, models) went into what activity (tests) in order to determine exactly what (safety levels) for what (given objects, here cars in the lower third of the image). Put in plain English: Is our perception model safely segmenting nearby vehicles in unseen images with diverse rainy conditions?
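The structure of such a purpose statement (inputs, activity, what is determined, and for what) can be sketched as a text template filled from a test configuration. This is a minimal illustration only: the configuration keys, the wording of the template, and the function name are all hypothetical and do not describe neurocat's actual reporting code.

```python
from string import Template

# Hypothetical test configuration; every key and value here is illustrative.
test_config = {
    "model": "off-the-shelf segmentation model",
    "dataset": "open data set",
    "augmentation": "rain",
    "target": "cars",
    "region": "lower third of the image",
}

# Hypothetical wording; a real report would use its own phrasing.
PURPOSE_TEMPLATE = Template(
    "This report tests whether the $model safely segments $target in the "
    "$region when images from the $dataset are augmented with $augmentation."
)


def purpose_statement(config: dict) -> str:
    """Fill the purpose template; raises KeyError if a field is missing."""
    return PURPOSE_TEMPLATE.substitute(config)


print(purpose_statement(test_config))
```

Because the statement is generated from the configuration, changing the model, data, or target object changes the text without any manual editing.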

To be scalable, the text should be dynamic, changing when the data or model changes or when one's goals evolve. Below you can see an example of the summary for a robustness test of an off-the-shelf segmentation model using augmentations of an open data set.

Audience

The report's purpose determines its audience. The purpose of our reports is to assess whether a model is robust in safety-critical situations no matter the conditions. This calls for the report to be targeted at:

  • Data specialists/analysts/wranglers 
  • Machine learning developers and engineers 
  • Safety engineers/managers 

These experts have a task: they need to translate the information in the report into knowledge. This knowledge is then reviewed by the decision makers, who make the ultimate call on whether the model is safe enough to proceed to the next development stage.

We do not know who this decision maker is. It could be the experts sitting together, or it could be a business operations manager. Because of this, our report includes two types of supporting information: detailed data for the experts and illustrative examples for the educated generalist.

Content

While any reader may wish to read the whole report, we section it by audience: data, ML, and safety. Let's look at these in turn. 

Overview

Did you do what you intended to do and look at the right things? Mistakes happen, especially when dealing with large quantities of data. Thus, a basic check of the data you used, both original and generated, and the tests you ran is the first information we present. If more than one model or different data subsets were used, this information will also be presented. 

In addition to general data, we also include a full specification of the selected augmentations' parameters. Such a detailed view is required because augmentations are neurocat's unique value add. The report reader will therefore be less familiar with this aspect and will need more information to ensure it meets their criteria. As a user can include many augmentations in any one test, one section like that below will appear for each augmentation chosen.

This section also provides a means to view the augmentations. For visual data and tasks, human assessment can be more valuable than information on, e.g., ranges of the precipitation rate. Viewing the images helps ensure you got what you wanted and helps you set more useful or realistic parameters for subsequent iterations.

Performance

If the data checks out, the next thing to inspect is the model. This section is targeted toward AI/ML engineers and developers. As such it is dominated by metrics.  

The information includes whichever evaluation metrics were selected during testing setup. Metric performance is shown for each augmentation type that was included in the test(s). If more than one model was tested, each model is presented in a separate section.
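As a rough sketch of what "metric performance per augmentation type" could look like in code, the example below averages intersection over union (IoU) for each augmentation. IoU is chosen here only because it is a common segmentation metric; the text above does not say which metrics the reports use, and the record layout and function names are assumptions made for illustration.

```python
from collections import defaultdict


def iou(pred: set, truth: set) -> float:
    """Intersection over union of two sets of pixel coordinates."""
    union = pred | truth
    if not union:
        return 1.0  # both masks empty: treat as perfect agreement
    return len(pred & truth) / len(union)


def mean_iou_by_augmentation(records):
    """records: iterable of (augmentation_name, predicted_pixels, true_pixels)."""
    scores = defaultdict(list)
    for augmentation, pred, truth in records:
        scores[augmentation].append(iou(pred, truth))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


# Toy masks as sets of (row, col) pixels; purely illustrative data.
records = [
    ("none", {(0, 0), (0, 1)}, {(0, 0), (0, 1)}),
    ("rain", {(0, 0)}, {(0, 0), (0, 1)}),
    ("rain", {(0, 0), (0, 1), (1, 1)}, {(0, 0), (0, 1)}),
]
result = mean_iou_by_augmentation(records)
for name, score in result.items():
    print(f"{name}: mean IoU {score:.2f}")
```

Grouping by augmentation name keeps one row per augmentation type, which mirrors how the section presents its metrics.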

Safety

The final and most detailed section of our reports focuses on safety, as the assessment of safety is the ultimate purpose of neurocat's reports.

We present our safety section with a focus on showing the error rates of various error patterns. For instance, in the sample images here, and on our sample report page, we are looking at misclassifications (the error mode) at several thresholds (the error pattern) and how often the pattern occurs (the error rate). Tables like this one can be shown for the whole or any part of the image and any given objects of interest (and of course any thresholds).
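The relationship between error mode, error pattern, and error rate can be sketched in a few lines. The sketch assumes one particular reading of the terms above: each image has a fraction of misclassified pixels (the error mode), the pattern occurs when that fraction exceeds a threshold, and the rate is the share of images in which the pattern occurs. The data and function name are hypothetical.

```python
def error_rates(misclassified_fractions, thresholds):
    """Share of images whose misclassified fraction exceeds each threshold."""
    n = len(misclassified_fractions)
    if n == 0:
        raise ValueError("need at least one image")
    return {
        t: sum(1 for f in misclassified_fractions if f > t) / n
        for t in thresholds
    }


# Illustrative per-image misclassification fractions for the object of interest.
fractions = [0.02, 0.08, 0.15, 0.30, 0.55]
rates = error_rates(fractions, thresholds=[0.05, 0.10, 0.25, 0.50])
for t, r in rates.items():
    print(f"more than {t:.0%} misclassified: {r:.0%} of images")
```

Restricting the input fractions to an image region or an object class gives the same table for that subset.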

Although it is not inherent in the error pattern, seeing the error patterns by ODD (operational design domain) tag is often important for safety assessment. This can greatly inform future testing.

For example: Your current test may be looking at rain as the trigger condition, but you have other criteria, e.g. that the rain can occur during the day or at night. Failure to disaggregate the results by ODD tag could mean you believe your model has failed your tests, when the reality may be that your model met your safety thresholds for the day images but failed (badly) on the night images. If you do not know this, you may collect, train on, and re-test many more day images, when you only needed more night images.

Note: this chart does not appear in the sample report linked at the bottom of this page.
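The day/night example above can be made concrete with a small sketch. The numbers, the tag names, and the 20% limit are invented for illustration; the point is only that an aggregate failure rate can hide very different rates per ODD tag.

```python
from collections import defaultdict


def failure_rate_by_tag(results):
    """results: iterable of (odd_tag, failed) pairs, one per test image."""
    counts = defaultdict(lambda: [0, 0])  # tag -> [failures, total]
    for tag, failed in results:
        counts[tag][0] += int(failed)
        counts[tag][1] += 1
    return {tag: fails / total for tag, (fails, total) in counts.items()}


# Invented results: 10 day images (1 failure), 10 night images (6 failures).
results = (
    [("day", False)] * 9 + [("day", True)] * 1
    + [("night", False)] * 4 + [("night", True)] * 6
)
max_allowed = 0.2  # hypothetical safety limit on the failure rate

overall = sum(failed for _, failed in results) / len(results)
by_tag = failure_rate_by_tag(results)

print(f"overall: {overall:.0%} ({'pass' if overall <= max_allowed else 'fail'})")
for tag, rate in by_tag.items():
    print(f"{tag}: {rate:.0%} ({'pass' if rate <= max_allowed else 'fail'})")
```

In this toy data the aggregate rate fails the limit, while the per-tag view shows that only the night images are responsible.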

While the report purpose should set a threshold at which performance can be classified as "safe", the safety manager should be able to view other thresholds. To enable such examination, we include a visualization that allows continuous adjustment of the threshold. Such views can help identify breakpoints in performance that can inform future training, data augmentation, etc.

To maximize the amount of information the visual presents, we also allow viewing of both the augmentation and the inference results in the same visual. Combined with the thresholds and the ability to easily see the worst-performing augmentations, this can allow visual identification of specific conditions or image features (e.g. objects) that are particularly prone to lead to model failure.

Report Conclusion

Only human insight into the machine-produced data can truly turn it into knowledge, and only knowledge can be turned into foresight, i.e. help inform decisions. Because of this, a report's conclusion should not provide decisions. Its purpose is only to help the experts who must make those decisions return from the detailed data back to the big picture: to see the forest for the trees.

Because of this, our conclusions are concise and relate exactly to what the aggregate data says about the model performance for the top-level safety thresholds set in the introduction (which in turn were set by the user's input when setting up the test(s)). This allows for a structured decision-making process on what really matters point-by-point.

As one can see, our model did not do particularly well at its defined task of segmenting nearby cars in rainy images (unsurprising for an off-the-shelf model). But armed with our report, our experts will be able to tell us why and we can retrain it to do better in the next round.

Conclusion

While above we have outlined our standard report, our reports are highly customizable.  

All data used to build the report are available in a report data folder and can be used by us to add information for you, or by you personally to run additional analyses. 

But why not take a deeper look at our sample report, where you can see the visualizers in action and follow the flow of the report? Or get in touch and see a report on your own data!

© neurocat GmbH
