Avoiding Sisyphean data campaigns: The role of augmented and synthetic data

In Greek mythology, Sisyphus was sentenced, as punishment for cheating death, to roll a rock up a hill for all eternity, only for it to roll back down each time.

The fate of Sisyphus is today the fate of many data campaigns. This is especially true for those training machine learning (ML) models for safety-critical applications that must make sense of complex environments. Avoiding hazards or harm, it seems, requires eternal data collection.

But there are ways to cheat the gods. Not all data to train and test ML models needs to be collected in traditional ways. Instead, data can be generated to increase the size and diversity of a data set, increasing the robustness of models developed therewith, and thereby enhancing the safety of components using those models.

Data collection: The never-ending story

Within the domain of autonomous vehicle perception systems, in which neurocat works, we are all familiar with the aforementioned cycle.

Perception models for ADAS/ADS must make sense of a highly complex environment with constantly changing conditions. The seemingly infinite combination of conditions must be reflected in the data to cover all possible eventualities a driver may encounter within a given Operational Design Domain (ODD). But as we move down the long tail of the data's distribution, getting data becomes increasingly difficult, even while these data points are also those which are most safety relevant.

Yet this lack of coverage of edge cases is not even known until testing commences. Thus, perception model development requires iteration: collection, training, testing, supplementary collection, re-training, and re-testing. Iteration is natural in machine learning, but if the level, sequencing, and timing of these steps are not well planned out, knowing when to stop iterating can be difficult and development costs can explode.

Why are we stuck in this cycle? What is wrong with our data? And how can we break out of the loop and lay down our Sisyphean burden?

The problem with data

Have you ever run into a deer while driving? Or missed a stop sign because it was obscured by vegetation?

Even if you answered no to these questions - and in fact most people would likely answer no - they are frequent events, happening every day. They are individually unlikely, but collectively common ... while also having a high impact on the individual involved.

Because of their character, collecting data reflecting such scenarios is difficult. How many millions of miles would your fleet of data collection vehicles have to drive - and across how many environments - to collect the data you need?
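To make the scale of that question concrete, here is a rough back-of-the-envelope sketch. The event rate used (one encounter per million miles) is purely hypothetical and is not a measured statistic; the function names are also illustrative. It assumes events occur independently at a constant rate (a Poisson approximation), which real driving data will only loosely follow.

```python
import math


def expected_miles_needed(target_examples: int, miles_per_event: float) -> float:
    """Expected miles to drive to observe `target_examples` events."""
    return target_examples * miles_per_event


def prob_at_least_one_event(miles_driven: float, miles_per_event: float) -> float:
    """Chance of seeing at least one event, under a Poisson approximation."""
    return 1.0 - math.exp(-miles_driven / miles_per_event)


# HYPOTHETICAL rate, chosen only for illustration.
MILES_PER_EVENT = 1_000_000.0

# One driver covering 10,000 miles a year will probably never see the event...
single_driver = prob_at_least_one_event(10_000, MILES_PER_EVENT)

# ...but across 100 million miles driven in total, it is all but certain to occur.
at_scale = prob_at_least_one_event(100_000_000, MILES_PER_EVENT)

# Collecting 1,000 examples of the event would take this many miles on average.
campaign_miles = expected_miles_needed(1_000, MILES_PER_EVENT)

print(f"single driver: {single_driver:.4f}, at scale: {at_scale:.4f}")
print(f"miles for 1,000 examples: {campaign_miles:,.0f}")
```

Under these assumed numbers, the same event is individually unlikely and collectively common, and gathering even a modest number of examples by driving alone runs to a very large mileage.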

Moreover, collecting such rare but critical data often entails the risk of creating hazards in itself (e.g. deploying your data collection vehicles on a stormy, foggy night). And even if you collect the data, labelling can be problematic. If the stop sign was missed by a driver, will the data annotator labelling the data see it?

So how long do you collect down the long tail of statistically low-probability, high-impact events that nonetheless will occur daily when your autonomous vehicle is deployed at scale? Or are there other options?

Beyond the real world: Augmented and synthetic data

The solution to eternal data campaigns is to find another type of data which can be acquired through means other than collection. These methods must provide data that fills safety-critical gaps in your current data quickly, inexpensively, and in sufficient quantity. Two data types that fit these criteria are augmented data and synthetic data.

Augmented data will be familiar to most ML experts and practitioners, especially in computer vision applications. Augmented data is new data generated by applying changes, often minor ones, to existing data; a common example is rotating an original image. The term ‘minor’ can be misleading though, as advanced augmentations can alter an image to the point where even a human may have difficulty recognizing its relation to the original. Such smart augmentations introduce corruptions to an image, for example filters simulating weather, pixelation, blur, and so on.
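As a minimal sketch of what such transformations look like in code, the example below applies three simple augmentations to a tiny grayscale image stored as nested lists. The function names and the linear "fog" blend are illustrative assumptions, not any particular library's API; production pipelines would typically work on image arrays and use far more realistic weather models.

```python
from typing import List

# A grayscale image: rows of pixel intensities in [0, 1].
Image = List[List[float]]


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def pixelate(img: Image, block: int) -> Image:
    """Replace every block x block tile with its mean intensity."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for top in range(0, height, block):
        for left in range(0, width, block):
            tile = [
                (r, c)
                for r in range(top, min(top + block, height))
                for c in range(left, min(left + block, width))
            ]
            mean = sum(img[r][c] for r, c in tile) / len(tile)
            for r, c in tile:
                out[r][c] = mean
    return out


def add_fog(img: Image, density: float) -> Image:
    """Crude fog stand-in: blend each pixel toward white by `density` in [0, 1]."""
    return [[(1.0 - density) * p + density for p in row] for row in img]


original = [[0.0, 1.0], [1.0, 0.0]]
augmented = [rotate90(original), pixelate(original, 2), add_fog(original, 0.5)]
```

Each call yields a new image derived from the original, so labels attached to the original can usually be carried over (with care for geometric transformations such as rotation, which may also move bounding boxes).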

Synthetic data, on the other hand, consists of new images generated from a sample of original data or from other programmatic guidelines; these images may or may not be based on the original data used to develop a model. As with augmented data, there are numerous approaches to generating synthetic data, including CGI-based methods, game engines, and proprietary systems. Most recently, Generative Adversarial Networks (GANs) have seen increasing use both to improve the images produced with the aforementioned methods and to make large transformations to original images.

Which type of data to use?

Two questions arise when considering whether to use augmented or synthetic data. The first is whether these data types are as good as real data.

The answer is neither yes nor no: augmented and synthetic data have a role to play in ML model development alongside real data. Comparing them to real data is not a fair test, because they fill in where real data is absent. Rather, their quality can only be assessed by whether they contribute to improvements in model performance.

This hints already at the answer to our next question. Which should you use to solve your scarcity of data problem: augmented data or synthetic data?

The answer is both. Just as augmented and synthetic data differ from real-world data, they also differ from each other, with different strengths and weaknesses.

Because there are so many approaches to generating augmented and synthetic data, generalizing these strengths and weaknesses is problematic. So let's start at the beginning, with how they are produced, to derive how best to use them.

Augmented data changes only one (or very few) environment parameter(s). This means the gap between it and real-world data is more easily bridged with statistical experiments than for synthetic data, even if synthetic data can sometimes look more 'real'. However, because synthetic data changes many parameters, it can be tweaked to have certain desired properties and particular distributions.

This means augmented data requires a thought-out strategy if it is to be used systematically and deliver maximal value. This weakness, though, hides an advantage: it allows for a deliberate design and inclusion strategy for targeted testing. Conversely, synthetic data (if it is built to do so) can be ideal for use in simulations, which is useful for comprehensive testing such as system tests.
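The idea that synthetic data can be tuned toward a particular distribution can be sketched as follows. This is an illustrative assumption about how one might draw scenario conditions for a generator, not a description of any specific simulation tool; the condition names, the weights, and the `sample_conditions` helper are all hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_conditions(weights: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Draw `n` scenario conditions according to a chosen target distribution."""
    rng = random.Random(seed)  # seeded so a test set can be regenerated
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


# HYPOTHETICAL target mix that deliberately over-represents a condition
# assumed to be rare in the collected real-world data.
target_mix = {"clear": 0.1, "rain": 0.2, "fog": 0.7}

scenarios = sample_conditions(target_mix, 1000)
counts = Counter(scenarios)
print(counts)
```

Each sampled condition would then be passed to whatever renderer or simulator produces the images, so that the resulting data set reflects the distribution you chose rather than the one the road happened to provide.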

The real question: When to use a given data type?

This difference – targeted versus comprehensive testing – hints at the real question: when do you use augmented data and when do you use synthetic data (or indeed, real data)?

This question has no single answer, but it is essential that you answer the question early in your ML application development. Much will depend on the availability and distribution of available real-world data for your use case and ODD, and thus when and where you need augmented or synthetic data for training versus testing.

We at neurocat believe augmented data is ideal for component testing early in your ML development.

Using augmented data for testing offers the unique advantage of incrementally adjusting data - perturbing images - to see the exact point at which a model fails or, ideally, the point at which the risk of model failure becomes unreasonable.
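A minimal sketch of such a sweep is shown below. It is not neurocat's actual method: the toy brightness classifier, the linear fog blend, the severity grid, and the accuracy threshold are all stand-in assumptions chosen so that the example is self-contained.

```python
from typing import Callable, List, Optional, Sequence

Image = List[List[float]]  # grayscale image, intensities in [0, 1]


def add_fog(img: Image, density: float) -> Image:
    """Crude fog stand-in: blend each pixel toward white by `density`."""
    return [[(1.0 - density) * p + density for p in row] for row in img]


def accuracy(model: Callable[[Image], str], images: Sequence[Image],
             labels: Sequence[str]) -> float:
    """Fraction of images the model labels correctly."""
    correct = sum(model(img) == label for img, label in zip(images, labels))
    return correct / len(images)


def first_failing_severity(model: Callable[[Image], str], images: Sequence[Image],
                           labels: Sequence[str],
                           perturb: Callable[[Image, float], Image],
                           severities: Sequence[float],
                           min_accuracy: float) -> Optional[float]:
    """Return the lowest severity at which accuracy drops below `min_accuracy`."""
    for severity in sorted(severities):
        perturbed = [perturb(img, severity) for img in images]
        if accuracy(model, perturbed, labels) < min_accuracy:
            return severity
    return None  # the model held up across the whole sweep


def toy_model(img: Image) -> str:
    """Hypothetical classifier: 'bright' if mean intensity exceeds 0.5."""
    pixels = [p for row in img for p in row]
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"


images = [[[0.1, 0.2]], [[0.8, 0.9]]]
labels = ["dark", "bright"]
severities = [0.0, 0.2, 0.4, 0.6, 0.8]

failure_point = first_failing_severity(
    toy_model, images, labels, add_fog, severities, min_accuracy=1.0
)
print(failure_point)  # 0.6 for this toy setup
```

In practice the threshold would come from your own definition of unreasonable risk, and the severity grid can be refined around the first failing value to locate the failure point more precisely.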

Using augmented data at the component stage frees up synthetic data for testing after the component has been integrated into a system, where its comprehensive nature is more valuable and where its higher cost is justified given it is spread across testing multiple components and/or systems.

This development cycle is the basis of our business solution: clients come to us with models built on real-world data, which we then validate and improve using augmented data. It is also the rationale for our partnership with dSPACE, which specializes in synthetic data and simulations valuable later in the development of perception systems.

This approach ensures that, by the time comprehensive system and component tests (e.g. SiL/HiL) occur, the ML algorithms have been tested (and maybe even re-trained) with all three types of data, reducing the probability of substandard system test results at such a late stage in ADAS/ADS development.

Data requires a plan

We often hear that data is the new oil, because it runs the new economy. And just as with oil, you need to manage and conserve your use of data, by using augmented and synthetic variants and by knowing when to stop your data campaigns. With a proper data campaign and testing, unreasonable risk can be removed without having to roll the rock up the hill endlessly. In a future blog post we will take a deep dive into how we at neurocat design the use of augmented data for testing perception models in a way that is systematic, not Sisyphean.

If what you read interests you, be sure to check out our solution for testing ADAS/ADS perception systems.