Sunday, November 27, 2022

Synthetic data is the secure, low-cost alternative to real data we need

Shreya Christina, londonbusinessblog.com

Content provided by IBM and TNW.

Babies learn to talk by hearing other people — usually their parents — making sounds repeatedly. Slowly, through repetition and discovering patterns, babies begin to connect those sounds with meaning. With a lot of practice, they eventually manage to produce similar sounds that people around them can understand.

Machine learning algorithms work much the same way, but instead of having a few parents to copy from, they use data painstakingly categorized by thousands of people who must manually review the data and tell the machine what it means.


However, this tedious and time-consuming process isn’t the only problem with real-world data used to train machine learning algorithms.

Take fraud detection in insurance claims. If an algorithm is to accurately distinguish fraudulent claims from legitimate ones, it needs to see both. Thousands and thousands of both. And because AI systems are often provided by third parties, not by the insurance company itself, those third parties must get access to all that sensitive data. You can see where this is going, because the same applies to health records and financial data.

More esoteric but equally worrisome are all the algorithms trained on text, images, and videos. Apart from questions about copyright, a lot of creators have expressed their disapproval of their work being sucked into a dataset to train a machine that may eventually take over (part of) their work. And that’s assuming their creations aren’t racist or otherwise problematic, which in turn could lead to problematic outcomes.

And what if there is simply not enough data available to train an AI on all eventualities? In a 2016 RAND Corporation report, the authors calculated how many miles a fleet of 100 autonomous vehicles, “traveling 24 hours a day, 365 days a year at an average speed of 25 miles per hour,” would need to drive to show that their failure rate (resulting in fatalities or injuries) was reliably lower than that of human drivers. Their answer? 500 years and 11 billion miles.
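The arithmetic behind that estimate is easy to check. A quick sketch, using only the fleet parameters quoted from the report:

```python
# Back-of-the-envelope check of the RAND fleet calculation.
FLEET_SIZE = 100            # vehicles
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
AVG_SPEED_MPH = 25
TARGET_MILES = 11e9         # miles needed for statistical confidence

miles_per_year = FLEET_SIZE * HOURS_PER_DAY * DAYS_PER_YEAR * AVG_SPEED_MPH
years_needed = TARGET_MILES / miles_per_year

print(f"{miles_per_year:,.0f} miles/year -> {years_needed:.0f} years")
# prints: 21,900,000 miles/year -> 502 years
```

Even running around the clock, the whole fleet covers under 22 million miles a year, which is why the required 11 billion miles works out to roughly five centuries.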

You don’t have to be a super-brain genius to figure out that the current process isn’t ideal. So, what can we do? How can we create sufficient, privacy-respecting, non-problematic, event-covering, accurately labeled data? You guessed it: more AI.

Fake data could help AIs deal with real data

Even before the RAND report, it was perfectly clear to companies working on autonomous driving that they were woefully under-equipped to collect enough real-world data to reliably train algorithms to drive safely in every possible condition.

Take Waymo, Alphabet’s autonomous driving company. Instead of relying solely on real vehicles, it created a fully simulated world, where simulated cars with simulated sensors could drive around endlessly, collecting data along the way. According to the company, by 2020 it had collected data on 15 billion miles of simulated driving, compared to a measly 20 million miles in the real world.
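Those two figures put the value of simulation in perspective; a trivial sketch of the ratio, using the numbers as reported:

```python
simulated_miles = 15e9   # simulated miles Waymo reported by 2020
real_miles = 20e6        # real-world miles driven in the same period

ratio = simulated_miles / real_miles
print(f"Simulation produced {ratio:,.0f}x more driving data")
# prints: Simulation produced 750x more driving data
```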


In AI parlance, this is called synthetic data, or “data applicable to a particular situation that is not obtained by direct measurement,” if you want to get technical. Or less technically, AIs produce fake data so that other AIs can learn about the real world at a faster pace.

One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes data collection less time-consuming and cheaper for companies that need a lot of data.

Rogerio Feris, an IBM researcher who co-authored the paper on Task2Sim, put it this way:

The beauty of synthetic images is that you can control their parameters: the background, lighting, and the way objects are posed.
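Task2Sim’s actual pipeline is far more sophisticated, but the appeal of controllable parameters can be illustrated with a toy renderer. Everything below (the shapes, the parameter ranges) is made up for this sketch and is not Task2Sim’s method; the point is that because we control generation, every image arrives with a perfect label:

```python
import numpy as np

rng = np.random.default_rng(42)

def render(shape: str, background: float, lighting: float, pos: tuple) -> np.ndarray:
    """Render a 32x32 grayscale image with fully controlled parameters."""
    img = np.full((32, 32), background)
    r, c = pos
    if shape == "square":
        img[r:r + 8, c:c + 8] = 1.0
    else:  # "bar"
        img[r:r + 2, c:c + 12] = 1.0
    return np.clip(img * lighting, 0.0, 1.0)

# We choose background, lighting, and pose, so the label is known by construction.
dataset = []
for _ in range(1000):
    shape = rng.choice(["square", "bar"])
    img = render(
        shape,
        background=rng.uniform(0.0, 0.4),                # varied backgrounds
        lighting=rng.uniform(0.7, 1.3),                  # varied lighting
        pos=(rng.integers(0, 20), rng.integers(0, 20)),  # varied object pose
    )
    dataset.append((img, shape))  # perfect label, no human annotator involved
```

A thousand labeled examples in milliseconds, with full control over exactly the parameters Feris mentions: background, lighting, and pose.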

Due to all the concerns mentioned above, the production of synthetic data of all kinds has exploded in recent years, with dozens of startups in the field thriving and raising hundreds of millions of dollars in investments.

The synthetic data generated ranges from “human data” such as health or financial records, to synthesized images of a wide variety of human faces, to more abstract datasets such as genomic data that mimics the structure of DNA.

How to make real fake data

There are a number of ways to generate synthetic data, the most common and well-established of which is the GAN, or generative adversarial network.

In a GAN, two AIs are pitted against each other. One AI produces a synthetic dataset, while the other tries to determine whether the data it sees is real or generated. The latter’s feedback is fed back to the former, “training” it to become more accurate at producing convincing fake data. You’ve probably seen one of the many this-X-does-not-exist websites (ranging from people to cats to buildings) that generate their images with GANs.
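A minimal sketch of that adversarial loop, assuming a deliberately tiny setup: a one-parameter “generator” that only learns the mean of a distribution, and a logistic-regression “discriminator,” rather than the deep networks used in practice. The structure (discriminator ascent, then generator ascent on the discriminator’s feedback) is the real GAN recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Real data: samples from N(4, 1). The generator must learn to fake this.
real_mean = 4.0
mu = 0.0          # generator parameter: mean of its output distribution
w, b = 0.0, 0.0   # discriminator: D(x) = sigmoid(w * x + b)
lr_d, lr_g, batch = 0.03, 0.3, 128

history = []
for step in range(3000):
    real = rng.normal(real_mean, 1.0, batch)
    fake = mu + rng.normal(0.0, 1.0, batch)

    # Discriminator: gradient ascent on log D(real) + log(1 - D(fake))
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    w += lr_d * ((1 - d_real) * real - d_fake * fake).mean()
    b += lr_d * ((1 - d_real) - d_fake).mean()

    # Generator: gradient ascent on log D(fake), shifting mu toward what fools D
    d_fake = sigmoid(w * fake + b)
    mu += lr_g * ((1 - d_fake) * w).mean()
    history.append(mu)

print(f"learned mean: {np.mean(history[-500:]):.2f} (target 4.0)")
```

The generator starts producing samples around 0, and the discriminator’s feedback pulls it toward the real distribution around 4, at which point the discriminator can no longer tell the two apart.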


Recently, other methods of producing synthetic data have been gaining ground. The first of these are known as diffusion models, in which an AI is trained by gradually adding more and more noise to real-world data until it is fully corrupted, and learning to reverse that corruption. Eventually, the model can take arbitrary noise and work it back into data resembling what it was originally trained on.
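The “adding more and more noise” half of that process can be sketched directly; the learned reverse (denoising) model is the hard part and is omitted here. The linear noise schedule below is a common illustrative choice, not any specific paper’s:

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)  # cumulative fraction of signal retained

def noise_to_step(x0: np.ndarray, t: int) -> np.ndarray:
    """Sample the forward (corruption) process q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal(10_000)     # stand-in for "clean" data, unit variance
for t in [0, 499, 999]:
    xt = noise_to_step(x0, t)
    print(f"t={t:4d}  signal kept: {np.sqrt(alpha_bar[t]):.3f}  var(x_t): {xt.var():.2f}")
```

By the last step almost no signal remains, which is exactly why a model that learns to undo each step can then start from pure noise and generate data from scratch.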

Fake data is like real data without, well, the authenticity

Synthetic data, however it is produced, offers some very concrete advantages over real-world data. First of all, it is easier to collect much more of it, because you don’t depend on people to create it. Second, synthetic data comes perfectly labeled, so there is no need to rely on labor-intensive (and sometimes incorrect) manual labeling. Third, it can protect privacy and copyright, because the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.
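The “perfectly labeled” advantage is easy to see in a sketch. The fields and distributions below are entirely made up for illustration (this is not how any real synthetic-data vendor models claims), echoing the insurance-fraud example from earlier:

```python
import numpy as np

rng = np.random.default_rng(7)

def synth_claims(n: int, fraud_rate: float = 0.2):
    """Generate toy insurance claims; the fraud label is known by construction."""
    is_fraud = rng.random(n) < fraud_rate
    # Invented pattern: fraudulent claims skew larger and are filed sooner.
    amount = np.where(is_fraud,
                      rng.lognormal(9.0, 0.6, n),   # fraudulent: larger amounts
                      rng.lognormal(7.5, 0.8, n))   # legitimate: smaller amounts
    days_since_policy = np.where(is_fraud,
                                 rng.integers(1, 90, n),
                                 rng.integers(1, 1000, n))
    return amount, days_since_policy, is_fraud

amount, days, label = synth_claims(50_000)
print(f"fraud share: {label.mean():.3f}, no manual labeling required")
```

Fifty thousand labeled claims, no annotators, and no real customer’s sensitive record ever leaves the building, because no real customer was involved.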

With AI playing an increasingly important role in technology and society, expectations around synthetic data are quite optimistic. Gartner has famously estimated that 60% of training data will be synthetic by 2024. Market analyst Cognilytica valued the synthetic data generation market at $110 million in 2021, growing to $1.15 billion by 2027.

Data has been called the most valuable asset in the digital age. Big tech has been sitting on mountains of user data that gave it an edge over smaller contenders in the AI space. Synthetic data can give smaller players the ability to turn the tables.

As you might suspect, the big question regarding synthetic data is its so-called fidelity, or how closely it matches real-world data. The jury is still out, but research seems to show that combining synthetic data with real data produces statistically sound results. This year, researchers from MIT and the MIT-IBM Watson AI Lab showed that an image classifier pre-trained on synthetic data, combined with real data, performed as well as an image classifier trained solely on real data.
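That mixing setup can be sketched with toy data. This does not reproduce the MIT result; it is just a minimal illustration, with a nearest-centroid “classifier” and 2-D Gaussian blobs standing in for real and synthetic images (the synthetic set is given a small, deliberate domain gap, since fake data rarely matches the real distribution exactly):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n, mean_shift, noise):
    """Two-class 2-D blobs; mean_shift simulates a synthetic-to-real domain gap."""
    X0 = rng.normal([0.0, 0.0], noise, (n, 2)) + mean_shift
    X1 = rng.normal([3.0, 3.0], noise, (n, 2)) + mean_shift
    X = np.vstack([X0, X1])
    y = np.array([0] * n + [1] * n)
    return X, y

# Scarce real data, plentiful (slightly imperfect) synthetic data.
X_real, y_real = make_data(25, mean_shift=0.0, noise=1.0)
X_synth, y_synth = make_data(5000, mean_shift=0.3, noise=1.0)
X_test, y_test = make_data(2000, mean_shift=0.0, noise=1.0)

def nearest_centroid(X_train, y_train, X):
    """Classify each point by its nearest class centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    return (d1 < d0).astype(int)

X_mix = np.vstack([X_real, X_synth])
y_mix = np.concatenate([y_real, y_synth])

acc_real = (nearest_centroid(X_real, y_real, X_test) == y_test).mean()
acc_mix = (nearest_centroid(X_mix, y_mix, X_test) == y_test).mean()
print(f"real only: {acc_real:.3f}   real + synthetic: {acc_mix:.3f}")
```

In this toy version, the abundant synthetic set compensates for how few real examples there are, and the mixed classifier holds its own on real test data despite the domain gap.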

Overall, the lights seem green for synthetic data to play a dominant role in training privacy-friendly, more secure AI models for the foreseeable future, and with that, a possible future of smarter AIs is just over the horizon.
