
Why we test: A look at the oft-unspoken reasons we test ML perception


Testing is a natural step in the development of machine-learning models for use in the perception components of autonomous vehicles. It is so natural that it goes unquestioned. We include it automatically, without reflection.

But why do we test our perception models? It is often worth going back to the unasked questions, and their seemingly obvious answers, because we need to be clear about why we test before we can think about how we test.

Optimize, comply, control

While every company will have its own specific reasons for testing, these purposes can be aggregated into three broad rationales, each of which helps meet a corresponding business objective:

Testing Rationale                  | Business Rationale
Model learning and improvement     | Competitive advantage (via enhanced performance and/or coverage)
Meet compliance obligations        | Customer assurance leading to product adoption
Mitigate risks and control costs   | Control development processes to ensure deployment within realistic constraints

One could conduct testing for each purpose separately. However, to maximize the value of each, you should build a single testing strategy that links all three rationales together.

Testing to improve

The first purpose of testing is the most obvious: we test to improve the performance of a machine-learning model. A model cannot learn by training alone; training must be paired with its natural corollary, testing.

The purpose of this learning and improvement is to make the model robust to failure. Improvements to robustness can focus on two aspects: greater robustness within the contexts on which the model is already trained, or increased robustness in new contexts that lie at the edge of the model's operational environment (or Operational Design Domain, as we say in autonomous vehicle development).

Ideally, testing should do both, or at least the first while also increasing our understanding of the second: the limits of the model. That way we know when our model is uncertain, and we can ensure that associated components in our autonomous driving system compensate.
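
As a minimal sketch of the idea (the names and the threshold below are illustrative assumptions, not a description of any particular stack), a perception component might flag low-confidence predictions so that downstream components know to compensate:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float  # model score in [0, 1]

# Hypothetical threshold; in practice it would be calibrated on held-out data.
UNCERTAINTY_THRESHOLD = 0.6

def partition_by_confidence(detections):
    """Split detections into trusted ones and ones to escalate."""
    trusted = [d for d in detections if d.confidence >= UNCERTAINTY_THRESHOLD]
    uncertain = [d for d in detections if d.confidence < UNCERTAINTY_THRESHOLD]
    return trusted, uncertain

trusted, uncertain = partition_by_confidence(
    [Detection("pedestrian", 0.92), Detection("cyclist", 0.41)]
)
# `uncertain` would be handed to fusion or a fallback, e.g. down-weighted.
```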

Testing to improve is complicated by two factors. The first is the complex environment that a perception system must sense. Training for this environment requires a large amount of data that, moreover, reflects the actual distribution of conditions in that environment.

The second is the need to test at multiple levels. The output from an ML model will be fused with other data (other sensors, maps, etc.), creating new levels that need testing. Subsequently, their integration with each other into a component, and then with other components into a system, will have to be tested.
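
To make these levels concrete, here is a hedged sketch in pytest style; `detect_objects` and `fuse_with_map` are hypothetical stand-ins for a perception model and a fusion stage, not a real API:

```python
def detect_objects(image):
    # Stand-in for the ML model's inference call.
    return [{"label": "car", "confidence": 0.9}]

def fuse_with_map(detections, map_objects):
    # Stand-in for a fusion stage combining detections with map data.
    return detections + map_objects

def test_model_level():
    # Level 1: the ML model's raw output, tested in isolation.
    detections = detect_objects(image=None)
    assert all(0.0 <= d["confidence"] <= 1.0 for d in detections)

def test_fusion_level():
    # Level 2: the model's output integrated with another data source.
    fused = fuse_with_map(detect_objects(image=None),
                          [{"label": "stop_sign", "confidence": 1.0}])
    assert len(fused) == 2  # nothing is silently dropped during fusion
```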

Because of this external and internal complexity, there is a combinatorial explosion in testing requirements. It seems testing and re-training could go on forever, and indeed some form of continuous learning may be seen in deployed ML perception models.

But how does one know when testing has gone far enough to at least reach that deployment decision? To answer that, we need to look at testing in another way.

Testing to prove

The second reason we test is to assure regulators and/or customers that we have fulfilled our regulatory and compliance obligations. Testing forms the basis for safety claims and provides evidence for the safety arguments supporting those claims. It shows we did our homework, or, in legalese, our due diligence.

This testing rationale can devolve into cynicism: testing as a check-the-box exercise simply to meet regulations or prevent lawsuits (stereotypically in the EU and the US, respectively). But we should see informal standards, regulations, and the law as a reflection and codification of the ethics, norms, and values of society.

But how does this relate to testing? How does it actually help us test?

Regulations tell us the standards to which we need to hold ourselves and our AIs. They tell us what risk is reasonable for our stakeholders and what is unreasonable. This guidance is necessary because 100% safety is impossible to attain: we need an idea of where to stop so that we do not test, retrain, and test again ad infinitum.

Thus, it is testing against standards and norms that informs us when our model has improved enough to deploy. Room for interpretation will remain, but the scope of our decision making will be made manageable.
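
For example, a deploy gate could compare measured metrics against thresholds derived from a standard or internal safety requirement. This is a sketch with invented metric names and numbers; real thresholds would come from the applicable regulation or norm:

```python
# Hypothetical requirements; real values would come from standards/regulation.
REQUIREMENTS = {
    "pedestrian_recall_min": 0.99,
    "false_positives_per_km_max": 0.1,
}

def meets_requirements(measured: dict) -> bool:
    """True only if every metric satisfies its threshold."""
    return (
        measured["pedestrian_recall"] >= REQUIREMENTS["pedestrian_recall_min"]
        and measured["false_positives_per_km"]
        <= REQUIREMENTS["false_positives_per_km_max"]
    )

# Passing the gate ends the test-retrain loop; failing it scopes the next round.
print(meets_requirements({"pedestrian_recall": 0.995,
                          "false_positives_per_km": 0.05}))  # True
```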

And this is where the cynicism fades away: why do we demonstrate that we met standards? Because it assures society that our model has a stamp of approval from a recognized authority. Testing for standards reassures end customers and leads to adoption of your solution, be it a component or an entire vehicle with ADAS/ADS features. In new fields, where consumers may have some natural hesitance about product adoption, such evidence can be essential to give them the trust needed to become adopters.

Testing to control

This brings us to the final reason to test, one beyond those that might be obvious to engineers and lawyers, but which the savvy business manager has in mind: we test to mitigate risks and control costs over the life cycle of product development.

This testing rationale helps inform how we should do our testing to improve, and it builds on the truism that we learn by failing. To improve, our tests should push a model to the point where it fails. But testing should tell us not only that we failed, but how we failed.

This specificity is important because, just as we use standards as benchmarks to bound our efforts, neither we nor a model can learn everything. Learning has a cost in time and money. Thus, it is better to know how we failed and what needs to be re-learned in order to pass. A good test of an ML model will tell us what data it needs to do this.

This observation implies we need to learn efficiently. We should not retrain on the same data and model, nor should we just randomly add data or tweak our model. If our tests tell us exactly how we failed, we can design a data campaign to get just the data we need to learn better, or to retrain our model in just the right way.
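
As a sketch of what such failure analysis might look like (the metadata schema here is an assumption for illustration), counting failed test cases by the conditions under which they occur points directly at the data to collect next:

```python
from collections import Counter

# Each failed test case tagged with the conditions it occurred under
# (hypothetical schema for illustration).
failures = [
    {"weather": "rain", "time_of_day": "night"},
    {"weather": "rain", "time_of_day": "day"},
    {"weather": "rain", "time_of_day": "night"},
    {"weather": "clear", "time_of_day": "night"},
]

by_condition = Counter((f["weather"], f["time_of_day"]) for f in failures)

# Most common failure conditions first: these define the data campaign.
for condition, count in by_condition.most_common():
    print(condition, count)  # e.g. ('rain', 'night') 2
```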

Controlling costs through informative testing is cumulative because, like testing to improve, it is done at multiple levels, from components to the system. Costs can be controlled at each of these levels. However, the highest savings will be realized by testing at the component level. A solid foundation there gives you confidence that test failures at later levels are due to issues at those levels, not to problems at an earlier stage of development.

From the why to the how

Having reviewed why we test, how can we immediately use this knowledge? We can use it to devise a good testing strategy, as such a strategy should answer to the reasons we test. That is, the reasons we test form the requirements for a good test.

Succinctly, a good testing strategy will guide us on how we can realistically (i.e., within constraints) and efficiently improve, to the point where unreasonable risk to our stakeholders has been removed. It will thereby allow us to confidently deploy a perception model in the face of a complex, uncertain world, knowing that the legitimate expectations of society, and of potential customers therein, have been met.

In a future article we will use this knowledge of why we test, and the requirements for our testing strategy, to inform how we test, and explore how neurocat's solution fulfills these testing requirements and the purpose of our testing: to deliver safe AI for everyone, everywhere.


If you need a comprehensive testing solution for your ADAS/ADS perception systems, be sure to check out our aidkit software.

Tagged: compliance · data · testing

© neurocat GmbH
