I admit that the title of this piece is sensationalistic, provocative, and borderline clickbait. But the more I dug into GPT-3, the more the title began to feel weirdly accurate and somehow necessary. The image that kept coming back to me is a parrot version of the creature in “Jurassic Park” (also artificial) that seems kind of cute and friendly, then transforms into a terrifying lizard that blinds its victims with toxic spray.
As I said, it's on the dramatic side, but there you go. And I think the whole rollout of, and reaction to, GPT-3 is instructive for business leaders as they think about what is hype and what works in the real world, an issue that is particularly important in AI these days.
GPT-3: How Big is Big?
GPT-3 (Generative Pre-trained Transformer 3) is a massive general-purpose language prediction and generation model developed by OpenAI, an organization dedicated to “discovering and enacting the path to safe artificial general intelligence.”
The key to GPT-3 is its size. It is the latest entry in a competition among tech titans to build an ever-larger “brute force” language model. Microsoft’s most recent model was ten times the size of GPT-2, while Google announced a model six times larger than GPT-3. The size of the GPT-3 model is notable in that it:
- contains 175 billion parameters and is more than 100 times larger than its predecessor, GPT-2;
- is trained on a 500-billion-word data set sourced largely from the “Common Crawl” internet content repository; and
- cost an estimated US$5-10MM to train and, in the process, generated the carbon footprint of a trip to the moon and back in an average gas-powered car.
All of that size results in GPT-3 being pre-trained to generate responses to language prompts posed as questions (“What is the population of Michigan?”), simple requests (“Write an interesting poem about volleyball”) and the like without additional training. OpenAI (somewhat counter-intuitively) calls GPT-3 a “few-shot” learner for this very reason: a handful of examples placed directly in the prompt, rather than any additional training, is enough to steer it toward a new task.
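To make the “few-shot” idea concrete, here is a minimal sketch (in Python, with hypothetical example questions) of how such a prompt is typically assembled before being sent to the model's API. Note that no gradient updates or retraining are involved; the examples live entirely inside the text prompt:

```python
# Minimal sketch of few-shot prompting: worked examples are concatenated
# into the prompt itself, and the model is asked to continue the pattern.

def build_few_shot_prompt(examples, query):
    """Concatenate example Q/A pairs and a new question into one prompt string."""
    lines = []
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model completes the text from here
    return "\n".join(lines)

# Hypothetical examples; in practice these would come from your domain.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Canada?")
print(prompt)
```

The resulting string would be submitted as the `prompt` of an API completion request; the “learning” is nothing more than pattern continuation over that text.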
The Impetus for GPT-3 Hype
GPT-3 can carry on language interactions that might lead a human to fail the Turing test (i.e., to mistake GPT-3 responses for human ones) and can write long-form articles that are grammatically and syntactically coherent and, in some cases, convincing (although the language can wander).
It has also surprised with its ability to perform simple arithmetic, write snippets of code and execute other tasks that suggest progress on the path to artificial general intelligence — which is the mission of OpenAI. Today it is offered as a limited-access commercial API, and OpenAI and Microsoft seem intent on pushing its commercial adoption.
The reaction after its release was a wave of media coverage calling it “revolutionary” and a breakthrough toward artificial general intelligence. AI enthusiasts clamored for access and were quick to produce clever things through GPT-3 or slip GPT-3 content into blog posts.
Journalists did the same, with entire sections of articles about GPT-3 actually produced by GPT-3 (“and that last section was written by GPT-3” became a standard kicker). The publicity was so overwrought that OpenAI CEO Sam Altman even issued a statement that “the GPT-3 hype is way too much…it still has serious weaknesses and sometimes makes very silly mistakes.”
More Skeptical Voices and Concerns Emerge
This initial wave of enthusiasm was quickly followed by less favorable reviews from AI researchers, which tended to feature slang references to “bull manure” and were punctuated with GPT-3 examples ranging from “silly mistakes” to things that were outright disturbing.
Most notably, GPT-3 generates, with alarming frequency, language that reflects the stereotypes and bias found in the internet content on which it was trained (hence, the venom-spitting). This problem is inherent in all Large Language Models, as they require training sets so massive that, as a practical matter, only Internet data can supply them.
Many prominent AI researchers even questioned whether GPT-3 represented anything meaningful in the journey to artificial general intelligence. Timnit Gebru, a former researcher at Google, labeled the entire category of Large Language Models (which includes GPT-3) “stochastic parrots,” or very clever manipulations of language that sound right and seem to make sense but do not resemble anything near the language understanding possessed by humans.
She ended up leaving Google (or was forced out) over the publication of this work and its broader point that the Large Language Model approach was inherently fraught with concerns about bias, energy consumption and, most importantly, the opportunity cost of not pursuing other approaches.
As noted, Google has long pioneered Large Language Models and announced its own trillion parameter model after the release of GPT-3. Gary Marcus, a prominent critic of “brute force” approaches, called GPT-3 “a bloviator…[that] has no idea what it is talking about.”
Some Practical Implications
To its credit, the GPT-3 research team itself acknowledged many of these concerns in the paper announcing its release. But it’s important to summarize the issues in commonsense terms to offer business leaders some practical advice when considering AI as a solution to real-world problems.
Don’t let your AI make decisions that you can’t explain.
GPT-3 is so large and complex that it is the ultimate black-box AI approach. The US, UK, and EU have all released regulations and guidance about bias, accountability, and explainability in AI. The US Federal Trade Commission's message to business leaders is rather blunt: “Hold yourself accountable, or be ready for the FTC to do it for you.”
The GPT-3 team states clearly that “its decisions are not easily interpretable,” which is more than a bit of an understatement. But to be fair, that is a problem with many AI approaches, and one that merits serious attention before deploying AI in any real-world situation.
Garbage in, garbage out applies to AI.
A classic in data analysis and equally true for AI. A model based on the contents of the Internet has absorbed a lot of things that are not true or that are vile and malicious — big shock!
As noted, Large Language Models are particularly prone to what is alarmingly called “neural toxic degeneration,” which means there is a high probability they will go to very dark places very quickly, often with seemingly innocuous prompts. You can see that in action with GPT-3 courtesy of the Allen Institute for AI.
Noah Smith, a researcher with Allen, said bluntly, “You don’t have to try hard to get these models to say things that are mind-bendingly awful.” Even the GPT-3 team warns of “biases in the data that may lead the model to generate stereotyped or prejudiced content.”
Make sure your AI knows its limits…or that you do.
A frustrating problem with GPT-3 is that it answers everything…and I mean everything. For example:
Q: How many rainbows does it take to jump from Hawaii to seventeen?
GPT-3: It takes two rainbows to jump from Hawaii to seventeen.
The problem is more insidious when the question is real and the answer seems right. For example: the population of Michigan is 10.3 million; Alaska became a state in 1906; and nine hundred thousand and ninety-nine is the number that comes before one million. All of these answers are wrong, and scarily wrong, because they actually sound plausible (think about the nine hundred thousand and ninety-nine again).
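To see why that last answer is so deceptively wrong: “nine hundred thousand and ninety-nine” parses to 900,099, while the number before one million is 999,999 (“nine hundred ninety-nine thousand, nine hundred ninety-nine”). A quick arithmetic check makes the gap explicit:

```python
# The plausible-sounding answer versus the correct one.
model_answer = 900_099          # "nine hundred thousand and ninety-nine"
correct_answer = 1_000_000 - 1  # the number that actually precedes one million

print(correct_answer)  # prints 999999
print(correct_answer - model_answer)  # the answer is off by 99900
```

The answer sounds almost right because it reuses the right words in nearly the right order, which is exactly what makes this failure mode hard to catch by ear.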
A recent research paper noted that “Worryingly, we also find that GPT-3 does not have an accurate sense of what it does or does not know.” One GPT-3 user said more bluntly, “The problem with GPT-3 is that it doesn’t error out, it just produces garbage — and there’s no way to detect if it’s producing garbage.”
We are already swimming in misinformation and I see no reason why we need an AI technology that can produce it at scale (to the tune of 4.5 billion words a day of GPT-3 generated content and counting).
You are masters of your domain.
The lesson here is related: AI works best when you direct it purposefully and intentionally, based on knowledge of your domain. The research paper cited earlier noted that “unlike human professionals, GPT-3 does not excel at any single subject.” The GPT-3 team acknowledges candidly that it “lacks a notion of what is most important to predict, and what is less important.”
Natural language AI is most useful when it builds on, augments, and captures domain knowledge in a repeatable way. That requires engineering guardrails (like the combination of AI approaches we use in our Hybrid NL), embedded knowledge, and humans in the loop. We built our platform on all three of those pillars for that reason, and because they allow businesses to build accretive and durable competitive advantage with their AI tools.
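As one simple illustration of what an engineering guardrail can look like (a hypothetical sketch for this article, not our actual implementation): constrain the model's raw output to a closed, domain-approved answer set, and route anything it cannot match to a human reviewer rather than to the end user.

```python
# Hypothetical guardrail sketch: a raw model answer is released only if it
# matches a vetted, domain-specific answer set; everything else is escalated
# to a human reviewer instead of being shown to the end user.

APPROVED_ANSWERS = {"approve", "deny", "escalate"}  # domain vocabulary (illustrative)

def guarded_answer(raw_model_output: str) -> str:
    """Return a vetted answer, or flag the output for human review."""
    candidate = raw_model_output.strip().lower()
    if candidate in APPROVED_ANSWERS:
        return candidate
    # The model "doesn't error out, it just produces garbage" -- so the
    # guardrail, not the model, decides when to defer to a human.
    return "needs_human_review"

print(guarded_answer("Approve"))                   # matches the approved set
print(guarded_answer("Two rainbows, obviously."))  # deferred to a human
```

The point of the sketch is the division of labor: the model proposes, but embedded domain knowledge and a human in the loop decide what actually reaches a customer.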
Where I Stand on GPT-3
To be clear, GPT-3 is astonishing in its ability to mimic human language constructs. It will be fascinating to see what the next evolution of massive language models can produce, and the idea of an API-driven natural language service that can power applications is a good one (we have our own NL API). However, the broader gain is much less clear, and the potential to go awry seems real.
OpenAI would not initially release the source code for GPT-2 out of concern for generating misinformation, but it seems to believe it has sufficient controls in place to allow hundreds of applications to run safely on GPT-3. I hope that they have set the bar as high as their mission of “safe artificial general intelligence” suggests.
As far as GPT-3 representing a real advance toward artificial general intelligence, I’m in the camp of skeptics like Gary Marcus, Judea Pearl and others.
- Yejin Choi said about commonsense reasoning, “We cannot just get there by making the tallest building in the world taller. Therefore, GPT-4, -5, or -6 may not cut it.”
- Gary Marcus and Ernest Davis were even more blunt when they said, “[GPT-3 is] a fluent spouter of bulls***, but even with 175 billion parameters and 450 gigabytes of input data, it’s not a reliable interpreter of the world.”
- On the real-world application of GPT-3, the team that developed it said, “A limitation associated with models at the scale of GPT-3…is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form…”
Inference is what helps most AI make sense of language, offer insights, and power decisions (particularly when coupled with embedded knowledge). Understandably, it is where most businesses focus their efforts to create value with AI language software.
Finally, a parting thought from the GPT-3 team that applies more broadly to AI: “The outputs of language models are stochastic, however, and though developers can constrain these…they are not able to perform consistently without human feedback.” In layman’s terms this means there is no magical AI black box that delivers “hands-free” capabilities.
When you read about the next “breakthrough” AI capability making such claims, you might think back to the “bull manure” references. And if someone tries to sell you technology based on that capability, my advice is to grab your wallet and back up to the door.
- It starts to get very alarming when you realize that GPT-3 now produces 4.5 billion words a day, much of which is presumably getting recycled and regurgitated into the same Internet dataset that will train future models of its ilk.
- OpenAI was founded by folks like Peter Thiel and Elon Musk (who later resigned from the board), and in 2019 received a US$1BN investment from Microsoft. Well-funded, they have worked on language models as part of their exploration of artificial general intelligence and announced GPT-3 in May 2020. Microsoft subsequently obtained exclusive licensing rights to it (which Elon Musk criticized as counter to the original mission of “open AI”). The language model is currently available as an API through an application process managed by OpenAI.