Apart from drawing photorealistic images and holding apparently conscious conversations, AI has failed to deliver on many of its promises. The resulting rise in AI skepticism leaves us with a choice: we can grow cynical and watch from the sidelines as winners emerge, or we can find a way to filter out the noise and identify commercial breakthroughs early, participating in a historic economic opportunity.
There is a simple framework for distinguishing short-term reality from science fiction. It relies on the single most important measure of maturity in any technology: the ability to manage contingencies, commonly known as edge cases. As a technology matures, it becomes more adept at handling increasingly rare edge cases and, as a result, gradually unlocks new applications.
Edge case reliability is measured differently for different technologies. For a cloud service, uptime is one way to assess reliability. For AI, a better measure is accuracy. When an AI fails to handle an edge case, it produces a false positive or a false negative. Precision is the statistic that accounts for false positives, and recall the one that accounts for false negatives.
Here’s an important insight: today’s AI can deliver very high performance if it focuses on either precision or recall. In other words, it optimizes one at the expense of the other (i.e., fewer false positives in exchange for more false negatives, or vice versa). But when it comes to achieving high performance on both at once, AI models struggle. Fixing this remains the holy grail of AI.
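The precision/recall trade-off can be made concrete with a small sketch. The fraud-detection counts below are illustrative assumptions, not real data:

```python
# Precision penalizes false positives; recall penalizes false negatives.

def precision(tp: int, fp: int) -> float:
    """Of everything the model flagged, how much was truly positive?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of everything truly positive, how much did the model catch?"""
    return tp / (tp + fn)

# A fraud model tuned to minimize false negatives flags aggressively:
# it catches 99 of 100 fraudulent charges (high recall) but also
# mislabels 300 legitimate charges along the way (low precision).
print(round(recall(tp=99, fn=1), 2))       # 0.99
print(round(precision(tp=99, fp=300), 2))  # 0.25
```

Tuning the model the other way would flip these numbers: fewer false alarms, more missed fraud.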
Low-fidelity vs. high-fidelity AI
Based on the above, we can divide AI into two classes: high-fidelity (hi-fi) versus low-fidelity (lo-fi). An AI with high precision or high recall, but not both, is lo-fi; one with both high precision and high recall is hi-fi. Today, the AI models used in image recognition, content personalization, and spam filtering are lo-fi. Applications such as robotaxis, however, require hi-fi models.
There are a few key insights about lo-fi and hi-fi AI that are worth mentioning:
- Lo-fi works: Most algorithms today are designed to optimize for precision at the expense of recall or vice versa. For example, to avoid missing fraudulent credit card charges (thus minimizing false negatives), a model could be designed to aggressively flag charges at the slightest indication of fraud, increasing false positives.
- Hi-Fi = Science Fiction: Today there are no commercial applications built on hi-fi AI. In fact, hi-fi AI may be decades away, as shown below.
- Hi-fi is rarely needed: In many domains, smart product and business decisions can lower the AI requirement from hi-fi to lo-fi with minimal or acceptable business impact. To do this, product leaders need to understand the limits of AI and account for them in their design process.
- Time-critical safety needs hi-fi: Time-critical safety decisions are one area where hi-fi AI is often required. This is where many autonomous car use cases tend to focus.
- Lo-fi + people = hi-fi: Outside of safety-critical applications, it is often possible to achieve hi-fi performance by combining artificial and human intelligence. Products can be designed to bring in human assistance at the right moments, whether from the user or from support personnel, to reach the desired level of both precision and recall.
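One way to read that last point: a product can route low-confidence predictions to a person. The sketch below is a hypothetical wrapper, not a description of any specific product; the model, threshold, and labels are all illustrative:

```python
def classify(item, model, human_review, threshold=0.9):
    """Use the model's answer when it is confident; otherwise defer to a human."""
    label, confidence = model(item)
    if confidence >= threshold:
        return label             # lo-fi AI handles the confident majority
    return human_review(item)    # humans absorb the edge cases

# Illustrative stand-ins: a model that is confident about "A" but not "B".
def toy_model(item):
    return ("spam", 0.95) if item == "A" else ("spam", 0.55)

print(classify("A", toy_model, human_review=lambda i: "not spam"))  # spam
print(classify("B", toy_model, human_review=lambda i: "not spam"))  # not spam
```

The combined system's precision and recall can both exceed the model's own, at the cost of human labor on the deferred cases.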
Quantifying the reliability of AI
A popular measure for evaluating AI reliability is the F1 score, the harmonic mean of precision and recall, which accounts for both false positives and false negatives in a single number. An F1 of 100% represents a flawless AI that handles every edge case. By our estimation, some of the best AI systems today perform at around 99%, although any score above 90% is generally considered high.
Let’s calculate the F1 score for two applications:
- If Spotify plays songs you like 95% of the time (precision), but surfaces only half of the songs you would like (50% recall), then its F1 is about 65%. This is a sufficient score: high precision delivers a great user experience and low churn, while the low recall goes largely unnoticed by users.
- When a robotaxi decides whether to proceed at a traffic light, it makes a time-critical safety decision. The chance of a collision is high both when running a red light (false negative) and when braking unexpectedly on green (false positive). We devised a method to estimate the level of AI accuracy required for autonomy to reach parity with human drivers, taking into account current collision rates at intersections and other factors. We estimate that a robotaxi must achieve more than 99.9999% precision and 99.9999% recall in detecting red lights to be on par with humans. That’s an F1 of 99.9999%, or six nines.
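Both estimates follow from the harmonic-mean formula for F1. A minimal check, using the precision and recall figures quoted above:

```python
def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.95, 0.50), 3))          # 0.655 -- the music example: ~65%
print(round(f1(0.999999, 0.999999), 6))  # 0.999999 -- the robotaxi bar: six nines
```

Note how the harmonic mean punishes imbalance: even 95% precision cannot compensate for 50% recall.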
It is clear from the above examples that a 65% F1 is easily achievable with today’s AI, but how far are we from an F1 of six nines?
A roadmap to hi-fi
As discussed earlier, the maturity and market readiness of any technology is linked to how well it handles edge cases. For AI, the F1 score can serve as a useful proxy for maturity. Likewise, for previous waves of digital innovation, such as the web and the cloud, we can use uptime as a maturity signal.
As a 30-year-old technology, the internet is one of the most reliable digital experiences. The most mature sites, such as Google and Gmail, strive for 99.999% uptime (five nines), meaning the service is down for at most about six minutes per year. Even this bar is occasionally missed, as with YouTube’s 62-minute outage in 2018 or Gmail’s six-hour outage in 2020.
At about half the age of the web, the cloud is less reliable. Most services offered by Amazon AWS have an uptime SLA of 99.99%, or four nines. That’s an order of magnitude less than Gmail, but still very high.
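The "nines" figures above translate directly into a yearly downtime budget; a quick sketch:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def max_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget at an uptime of `nines` nines (e.g., 5 -> 99.999%)."""
    return 10 ** -nines * MINUTES_PER_YEAR

print(round(max_downtime_minutes(5), 1))  # 5.3  -- five nines (Gmail-class)
print(round(max_downtime_minutes(4), 1))  # 52.6 -- four nines (typical cloud SLA)
```

Each extra nine shrinks the allowed downtime by a factor of ten, which is why climbing from four to five nines is so hard.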
A few observations:
- It takes decades: The examples above show that it often takes decades to climb the edge-case maturity ladder.
- Some use cases are particularly challenging: The extremely high level of edge-case performance required by robotaxis (six nines) surpasses even that of Gmail. Keep in mind that self-driving cars also run on computers similar to those behind cloud services, yet the operational reliability required of robotaxis must exceed what today’s web and cloud services achieve!
- Narrow applications beat general-purpose ones: Web applications are narrowly defined use cases built on top of cloud services. As such, web services can achieve higher uptimes than cloud services: the more general-purpose a technology, the harder it is to harden.
Case study: not all autonomy is created equal
Google engineers who left the company’s self-driving car team to start their own ventures shared a common premise: narrowly defined applications of autonomy will be easier to commercialize than generalized self-driving vehicles. In 2017, Aurora was founded to transport goods on highways via long-haul trucks. Around the same time, Nuro was founded to transport goods in small cars at slower speeds.
Our team shared this premise when we started within Postmates (also in 2017). Our focus was also on moving goods, but unlike the others we chose to leave cars behind and instead focus on smaller robots that operate off the road: autonomous mobile robots (AMRs), which are already widely used in controlled environments such as factory floors and warehouses.
Consider red light detection for delivery robots. While they should never run a red light, given the risk of collision with vehicles, conservatively stopping on green poses no safety risk. Therefore, a recall rate comparable to robotaxis (99.9999%) along with modest precision (80%) would be sufficient for this use case. This results in an F1 of roughly 90%, or one nine, which is easy to achieve. By going from street to sidewalk and from a full-size car to a small robot, the required AI accuracy drops from six nines to one.
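Plugging these illustrative numbers into the harmonic-mean formula for F1 confirms the estimate:

```python
def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Delivery-robot red-light detection: robotaxi-level recall, modest precision.
print(round(f1(precision=0.80, recall=0.999999), 3))  # 0.889 -- roughly one nine
```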
Robots are here
Delivery AMRs are the first urban autonomy application to be commercialized, while robotaxis are still waiting for hi-fi AI performance that remains out of reach. The pace of progress in this sector, as well as our experience over the past five years, has reinforced our view that the best way to commercialize AI is to focus on narrower applications enabled by lo-fi AI, and to use human intervention to achieve hi-fi performance when needed. In this model, lo-fi AI leads to early commercialization, and incremental improvements afterward help drive business KPIs.
By targeting more forgiving use cases, companies can leverage lo-fi AI to achieve early commercial success, while maintaining a realistic view of the multi-year timeline for achieving hi-fi capabilities. After all, sci-fi has no place in business planning.
Ali Kashani is the co-founder and CEO of Serve Robotics.