
Yahoo!, OpenAI, and Tumblr Walk into a Bar

4 min read · Oct 29, 2024


Once upon a time, a company called Yahoo! ruled the internet. In 1998, Yahoo! was the most popular home page on earth. Its wall of human-curated directory links amazed and delighted. Yahoo! made multi-billion-dollar acquisitions, and its stock price seemed to know no limit.

There’s a good chance that if you’re under 30, you don’t even really know what Yahoo! is. The story goes that Google killed it. But the story is wrong. Yahoo! died because after the initial euphoria of the internet wore off, users demanded a platform that was actually useful. A human-curated directory of the internet was simply too limited, and the need for basic utility killed Yahoo!. Google fulfilled that need.

OpenAI and its ChatGPT are today’s Yahoo!. Dropping a few billion to brute-force train an LLM wasn’t new (it’s just that no one thought anyone would be so stupid), but ChatGPT nevertheless amazed the GenPop. Then people started using it, and slowly but surely companies around the world realized it was nearly useless. And unlike Yahoo!, ChatGPT is dangerous.

Anyone working with LLMs knows this, but recently the Associated Press published an article titled “Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said,” so hopefully the word is getting out.

The article focuses on one particular use case: ChatGPT as a transcription tool (which OpenAI markets as “Whisper”). As the article states, “Whisper is the most popular open-source speech recognition model and is built into everything from call centers to voice assistants.” Indeed, ChatGPT/Whisper “is a built-in offering in Oracle and Microsoft’s cloud computing platforms.”

It’s every bit as ubiquitous as Yahoo! was in 1998. And it’s failing everywhere. “While most developers assume that transcription tools misspell words or make other errors, engineers and researchers said they had never seen another AI-powered transcription tool hallucinate as much as Whisper.”

Every engineer working with ChatGPT has witnessed its wild hallucinations. We benchmark our own work against the competition, and my favorite is always ChatGPT because it can be counted on to provide hilariously wrong answers.

Wrong Answers?

There are two broad categories of wrong answers. It’s important to know the difference.

First, there are wrong answers (which are not hallucinations). Wrong answers have a basis in the data/content. For most robust LLMs, these wrong answers are of two kinds:

Entity Conflation. The agent returns a result that is grounded in the data/content but, at least in part, is about the wrong entity. Agents do this with names (of people), but a few major agents also have a difficult time distinguishing referential content (when one entity is addressing a different entity). An organization, for example, may publish a press release that references another entity; there are LLMs that can’t draw a line between those two references.

Time. A calendar is a data structure, and a few major LLMs have either not been trained on that structure or (sometimes and) have not been prompted to reference it. So when there are multiple possible correct answers, an agent will return one, but not necessarily the most recent one. (By default, humans want the answer that’s currently accurate; if they want a previously correct answer, they reference the structure explicitly.) A guard against both failure modes is sketched after this list.
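Here is a minimal sketch of what that kind of guard can look like at the retrieval layer, before anything reaches the model. All of the names (`Passage`, `select_context`, `build_prompt`) are hypothetical, and the date-in-the-prompt trick is just one way to handle the “prompted to reference the structure” part.

```python
# Hypothetical sketch: filter retrieved passages by entity (conflation) and
# sort newest-first (time), then put today's date in the prompt so the model
# has the calendar structure to reference.
from dataclasses import dataclass
from datetime import date


@dataclass
class Passage:
    entity_id: str   # which entity this passage is actually about
    published: date  # when the source was published
    text: str


def select_context(passages: list[Passage], target_entity: str) -> list[Passage]:
    """Drop passages about other entities, then prefer the most recent ones."""
    relevant = [p for p in passages if p.entity_id == target_entity]
    return sorted(relevant, key=lambda p: p.published, reverse=True)


def build_prompt(question: str, context: list[Passage]) -> str:
    """Make the calendar explicit and tell the model to prefer recent sources."""
    header = f"Today is {date.today().isoformat()}. Prefer the most recent source below."
    body = "\n\n".join(f"[{p.published.isoformat()}] {p.text}" for p in context)
    return f"{header}\n\n{body}\n\nQuestion: {question}"
```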

Those two issues are correctable with better training and prompting. But there’s the second category of wrong answer: hallucinations. The hallmark of a hallucination is that the agent returns an answer not grounded in any of the data/content. (With humans, we’d probably call this ‘lying’, or else take it as indicative of some kind of mental illness.)

Generally speaking, hallucinations are solvable. But if you’re a business delivering a GenAI product to clients, you must be certain that the product does not hallucinate. You must have a battery of benchmark questions that will (in most cases) expose hallucinatory tendencies. The easiest way to baseline such a battery is to RAG through ChatGPT; OpenAI’s product suite will then produce hilarious lies for you. Those are your key benchmarks.
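What does “expose hallucinatory tendencies” look like in practice? A crude, purely illustrative groundedness check is enough to turn a battery of questions into pass/fail numbers. The function names and the threshold below are assumptions, and production evaluations use stronger entailment-style checks; this is only a sketch of the idea.

```python
# Illustrative only: flag answers whose vocabulary barely overlaps the source
# documents they were supposedly grounded in.
import re


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounded(answer: str, sources: list[str], threshold: float = 0.5) -> bool:
    """Return False when the answer shares too little vocabulary with the sources."""
    answer_toks = _tokens(answer)
    if not answer_toks or not sources:
        return False
    source_toks = set().union(*(_tokens(s) for s in sources))
    overlap = len(answer_toks & source_toks) / len(answer_toks)
    return overlap >= threshold


# Usage: run every benchmark question through the pipeline under test and
# count the failures; `battery` here is a hypothetical list of
# (question, sources) pairs.
# failures = [q for q, srcs in battery if not grounded(pipeline(q), srcs)]
```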

“But Whisper has a major flaw,” the AP article concludes. So did Yahoo!

Whisper’s failures span the AP’s sources: “A University of Michigan researcher conducting a study of public meetings, for example, said he found hallucinations in 8 out of every 10 audio transcriptions he inspected.” “A machine learning engineer said he initially discovered hallucinations in about half of the over 100 hours of Whisper transcriptions he analyzed. A third developer said he found hallucinations in nearly every one of the 26,000 transcripts he created with Whisper.”

Or, as the Founder and CEO of Salesforce put it: “It just doesn’t work.”

You’re going to put that in your product?

And we haven’t even started addressing data leaks. You’d have to be a maniac to test (not to mention deploy) anything where sensitive data and a public LLM are components of the same build. “But we’ve removed/redacted/quarantined…” I hate to get so technical, but … bullshit. There are LLMs and architectures that are secure (and better at certain tasks, and are faster, and can be run locally, etc.).
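For the “run locally” part, here is a minimal sketch of the alternative: an open-weights model served on your own hardware, so sensitive text never leaves the machine. The model id is a placeholder, not a recommendation, and whether a local model is actually better at your task is something you benchmark, not assume.

```python
# Sketch of local inference with an open-weights model
# (pip install transformers accelerate).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="your-local-model-id",  # hypothetical: any open-weights model you host yourself
    device_map="auto",
)


def summarize_locally(sensitive_text: str) -> str:
    """Sensitive content stays on this machine; nothing is sent to a public API."""
    prompt = f"Summarize the following internal document:\n\n{sensitive_text}\n\nSummary:"
    out = generator(prompt, max_new_tokens=200, do_sample=False, return_full_text=False)
    return out[0]["generated_text"]
```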

And finally, investors: do you really want to invest in Yahoo! in the late ’90s? There are a dozen start-ups polishing off their betas that will own the market in a few years, while you’ll own equity in a company that thinks acquiring Tumblr is a strategic imperative.
