Artificial Intelligence (AI) continues to prove its worth by streamlining operations and optimizing workloads for organizations across industries. As more industries seek to harness the power of AI, we need to be especially careful about the data we use to train this technology. If we aren't, we risk undoing the progress society has made in recent years against ingrained prejudice toward Black, Indigenous, and People of Color (BIPOC).
The rise of synthetic data
Companies are using AI to venture into previously unexplored territory. Human-in-the-loop data training can go a long way, but what about the cases where we don’t have previous data? How can we teach an AI model to do something for which we do not yet have the tools or data?
Originally, developers had to collect training data that covered every possible scenario in order to train successful AI models. If a scenario had not occurred or had not been previously captured, there would be no data, leaving a huge gap in the machine's ability to understand that particular situation.
There are realistic scenarios that arise but have not been documented often enough to provide the wealth of data needed to train a machine to recognize them. For example, we don't have enough data to train an alarm system to recognize an intruder in the house, or to train an autonomous vehicle to recognize a child running in front of the car. While extreme, these are real-life scenarios that we can't train a machine to recognize and respond to using human-in-the-loop data alone.
Where there is a will, there is a way – and the way points to the path of synthetic data.
What is synthetic data?
Synthetic data is created by software, as opposed to data captured by humans from real-world scenarios. It enables computer programs to fill the gaps in use cases by orchestrating rare cases and specific real-world scenarios that typical human-collected data simply cannot manifest. These are called edge cases. This also allows for more freedom and flexibility when it comes to training more advanced AI applications.
Edge cases are the extreme, nightmare scenarios that the AI may not be prepared for. For example, catastrophes or crimes are both scenarios where it is difficult to collect data. While these can be simulated risk-free, synthetic data should be used in conjunction with as much real-world data as possible to close the gaps and ensure holistic, inclusive data sets for all possible scenarios.
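In practice, teams often reason about this by auditing how frequently each scenario appears in their real-world data and earmarking the rare ones for synthetic generation. A minimal sketch in Python (the scenario labels and coverage threshold are illustrative assumptions, not from the article):

```python
from collections import Counter

# Hypothetical scenario labels attached to real-world training samples.
real_samples = (
    ["pedestrian_crossing"] * 500
    + ["cyclist"] * 200
    + ["child_running_into_road"] * 3   # rare edge case
    + ["fallen_object"] * 1             # rare edge case
)

MIN_PER_SCENARIO = 100  # illustrative coverage target per scenario

counts = Counter(real_samples)
gaps = {label: MIN_PER_SCENARIO - n
        for label, n in counts.items() if n < MIN_PER_SCENARIO}

# Each gap is the number of synthetic samples to generate for that scenario.
for label, needed in sorted(gaps.items()):
    print(f"{label}: need {needed} synthetic samples")
```

The common scenarios pass the threshold untouched; only the edge cases show up as gaps to be filled synthetically, which mirrors the "supplement, don't replace" approach the article recommends.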
By 2024, 60% of all AI training data is expected to be synthetic. While the idea of synthetically generated data has been around for quite some time, its recent growth can be largely attributed to the autonomous vehicle sector. However, it can be applied in almost any program that uses computer vision, such as drones, security cameras, and various consumer electronics.
No human means no human bias
Synthetic data enables companies to break away from traditional AI data limitations. When used in conjunction with human-collected data, synthetic data can provide significant benefits to businesses, including lower data and labor costs, faster data collection speed, access to edge cases, and more inclusive, less biased data sets.
Just as bias is an ever-present force in society, it also finds its way into AI data sets. Because these data sets are compiled by humans, they often reflect the same biases as the people who create them. These aren't large, obvious biases, but they're enough to skew applications along lines of gender and race. Self-driving cars, for example, recognize white pedestrians earlier than Black pedestrians, which can lead to major safety issues.
What sets synthetic data apart is that it is not human-collected; it is data created by software for AI. While it can still inherit bias from any original data set it is derived from, it typically carries far less, and in some cases none at all.
In order for a data set to be truly inclusive, it must cover all possible scenarios and all the individuals who might use the application. For example, facial recognition on a cell phone should work for everyone, so it should be trained across skin tones, hair colors, hair types, different facial features, accessories like glasses or sunglasses, and more. All of these variables must be represented in the training data set to ensure inclusiveness. More specifically, if we know we don't have data on people who wear glasses, we can synthetically create that data so the models work for them.
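One way to operationalize that inclusiveness check is to enumerate the attribute combinations the model should handle and flag any that are underrepresented in the real-world data. A hedged sketch (the attribute values, counts, and threshold below are made-up placeholders):

```python
import itertools

# Illustrative attribute space for a face data set (assumed, not from the article).
skin_tones = ["I", "II", "III", "IV", "V", "VI"]   # Fitzpatrick scale
eyewear = ["none", "glasses", "sunglasses"]

# Pretend inventory of real-world samples per attribute combination.
real_coverage = {(tone, wear): 50 for tone in skin_tones for wear in eyewear}
real_coverage[("V", "sunglasses")] = 2   # underrepresented combination
real_coverage[("VI", "glasses")] = 0     # missing combination

# Any combination below the target becomes a candidate for synthetic generation.
MIN_SAMPLES = 25
missing = [combo for combo in itertools.product(skin_tones, eyewear)
           if real_coverage.get(combo, 0) < MIN_SAMPLES]

print(missing)
```

The output lists exactly the combinations a synthetic data pipeline would need to fill, rather than regenerating the whole data set.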
In addition, an autonomous vehicle must be trained on all road situations, including different road types, street signs in different languages, various extreme events, and whatever else comes its way. While real-world data is actively being collected to train these models, there are often unpredictable or rare scenarios that the model must be able to recognize to keep everyone involved safe. Say a ladder falls off a truck in front of the vehicle: the vehicle has to identify the object and maneuver around it. These scenarios are not common enough in the real world to yield enough data to properly train a model, but they can be created artificially through the use of synthetic data.
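Simulation pipelines typically generate such rare events by sampling scenario parameters and deliberately oversampling the hazards that barely appear in real-world driving logs. A minimal sketch under assumed parameters (the categories and weights are illustrative, not from the article):

```python
import random

random.seed(42)  # reproducible for this illustration

# Illustrative scenario parameters for a driving simulator (assumptions).
ROAD_TYPES = ["highway", "urban", "rural"]
WEATHER = ["clear", "rain", "fog", "snow"]
HAZARDS = ["none", "fallen_ladder", "child_running", "animal_crossing"]

def sample_scenario():
    """Draw one synthetic driving scenario, biased toward rare hazards."""
    return {
        "road": random.choice(ROAD_TYPES),
        "weather": random.choice(WEATHER),
        # Oversample hazards that are almost absent from real-world data.
        "hazard": random.choices(HAZARDS, weights=[1, 3, 3, 3])[0],
    }

scenarios = [sample_scenario() for _ in range(1000)]
hazard_rate = sum(s["hazard"] != "none" for s in scenarios) / len(scenarios)
print(f"{hazard_rate:.0%} of synthetic scenarios contain a hazard")
```

In real-world logs a hazard might appear in a tiny fraction of miles driven; here roughly nine in ten synthetic scenarios contain one, which is the coverage-boosting effect the article describes.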
With synthetic data becoming increasingly popular, the future looks bright for AI. As more and more companies adopt the concept of supplementing human-collected datasets, we can expect many more inclusive and representative datasets that will lead to safer and more equitable uses for all genders and races.
Wilson Pang is the CTO at Appen and the co-author of Real World AI: A Practical Guide to Responsible Machine Learning.