Language Is Data

“Language is so tightly woven into human experience that it is scarcely possible to imagine life without it.”
— Steven Pinker

“Time flies like an arrow, and fruit flies like a banana.”
— Groucho Marx

A defining characteristic of humans is our remarkable language ability. We use language to convey information with extraordinary depth, nuance and precision. From the earliest age, we construct basic representations of our thoughts, emotions and observations (ask any parent of a two-year-old flexing newfound language skills).

We quickly master hierarchical relationships in language and can generalize them with skill, going from broad to very narrow with great accuracy (e.g., from living thing to animal to dog to our own beloved mutt “Rover”). By age six we possess a vocabulary of more than 10,000 words, and a well-read student will have increased that tenfold by high school. And we learn to navigate the ambiguity of multiple senses of words that give language its flexibility, drawing on context to derive meaning.
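To make the broad-to-narrow idea concrete, here is a minimal sketch of such a hierarchy as a data structure. It is only an illustration: the `PARENT` table and the helpers `chain_to_root` and `is_a` are hypothetical names invented for this example, and the entries simply mirror the "Rover" example above.

```python
# Toy "is-a" hierarchy: each term points to its broader parent term.
# The entries are illustrative assumptions based on the example in the text.
PARENT = {
    "Rover": "dog",
    "dog": "animal",
    "animal": "living thing",
}


def chain_to_root(term):
    """Return the path from a narrow term up to its broadest ancestor."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def is_a(term, ancestor):
    """True if `ancestor` appears anywhere above `term` in the hierarchy."""
    return ancestor in chain_to_root(term)[1:]


if __name__ == "__main__":
    print(" -> ".join(chain_to_root("Rover")))
```

A person does this walk effortlessly and in both directions. A program only knows the links someone has written down for it.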

Consider the following:

Banking a shot at the pool table before walking down the bank of the Thames to get some more money at the bank, because your shot missed, even though you had banked on winning the bet.

Now look back upon Groucho Marx’s wordplay at the start of this piece. It gives most of us only a moment’s pause before we fully understand it. That is pretty amazing.
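For contrast, here is a rough sketch of how a program might attempt the "bank" example by counting hand-picked cue words near the ambiguous term. Everything in it is an assumption made for illustration: the sense labels, the cue words in `SENSE_CUES` and the `guess_sense` helper are hypothetical, and real systems use far richer context than this.

```python
import re

# Hypothetical cue words for four senses of "bank", chosen by hand for this example.
SENSE_CUES = {
    "bounce a shot": {"shot", "pool", "table", "cushion"},
    "river edge": {"river", "thames", "walking", "shore"},
    "financial institution": {"money", "account", "deposit", "loan"},
    "rely on": {"winning", "bet", "counted", "relied"},
}


def guess_sense(context):
    """Pick the sense whose cue words overlap most with the context words."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue word at all, admit that we cannot tell.
    return best if scores[best] > 0 else "unknown"


if __name__ == "__main__":
    for phrase in (
        "Banking a shot at the pool table",
        "walking down the bank of the Thames",
        "get some more money at the bank",
        "you had banked on winning the bet",
    ):
        print(phrase, "=>", guess_sense(phrase))
```

The heuristic works only because the cue lists were written with these four phrases in mind. Change the wording slightly and it falls back to "unknown", which is exactly the brittleness that human readers never notice in themselves.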


A Perspective on Language

Language is the data of how humans interact. In formal terms, we call it natural language to distinguish it from artificial languages created for specific purposes, such as programming languages like Java. But like the fish that is unaware of water, we rarely think about how language surrounds us…until we go to work. That’s where the trouble starts.

An avalanche of email; more information than we can possibly read; customers that need attention; news that can have real impact; complex and critical documents that power processes where details matter. All in natural language.

The Challenge of Natural Language

The ease of creating language via technology has unleashed massive volumes of language data (e.g., 65 billion WhatsApp messages sent daily). More bad news: the very flexibility that makes language so useful to us confounds technology. The formal term is Moravec’s Paradox: what is easy for people is often hard for machines.

Advances in computing, storage and processing of structured data have revolutionized industries and powered the growth of information technology. But estimates suggest that 80-90% of all data generated is unstructured (often in the form of language) and that it is growing at twice the rate of structured data.

The result is that this huge, growing and largely untapped source of language data still depends almost entirely on people, often those with a specialized understanding that only time and training can provide. That dependence is incredibly difficult to scale or replicate, given the challenges of finding the signal in the noise and reducing both inconsistency and variability.

So how does the enterprise structure language data into knowledge and insight for faster, smarter and more consistent decisions? How do you find the signal in the noise to accelerate applications that involve language? How do you get machines to understand language in a way that is even remotely similar to how humans do?

The answer: artificial intelligence. Except…not really…or at least not with most of the approaches to date.

Bigger is Not Always Better

We don’t have to stretch our imaginations to find everyday examples of the limitations of technology dealing with language. Here’s a small sampling:

Trying to find an email that has some critical information.
Searching for what seems like a pretty straightforward and common piece of information only to learn that it is not one of the more frequently asked questions in the FAQ section.
The great chatbot revolution of 2016 that ended roughly in 2016.

These examples do not even begin to touch the harder cases of language data embedded in complex documents. Think insurance policies, where specialized terms are both bountiful and critical (“…except where said coverage is expressly excluded, limited, or otherwise specified in Annex D”), or cases where language information emerges in real time and changes rapidly (look no further than the changing language around the name of the global pandemic from COVID-19 to the coronavirus to SARS-CoV-2, etc.).

Artificial intelligence, driven by machine learning and its many variants, has made real gains in providing predictive insights from structured (generally numerical) data such as inventory patterns, sensor data for failure prediction, buying behavior and many more. But Moravec’s Paradox continues to hold in efforts to have machines replicate the intelligence that, to us, comes relatively easily (e.g., the ability to recognize our surroundings well enough to drive a car safely under a wide variety of conditions, or the ability, given the right time and training, to understand the critical information contained in very complex, domain-specific language).

This challenge remains, even as machine learning models reach comically large sizes. The OpenAI GPT-3 model contains 175 billion parameters (think of them as sensitivities) and was trained on hundreds of billions of tokens of text, much of it crawled from the web. Google, shortly afterwards, described a model with more than one trillion parameters. And while GPT-3 has attracted an enormous amount of attention with its ability to mimic the use of human language, it has a host of very real challenges that raise questions over its practical value…but that’s for another post.
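To give a sense of scale, here is a back-of-the-envelope estimate of how much memory the weights of a 175-billion-parameter model would occupy. It is a sketch under a stated assumption: each parameter is stored as a 16-bit number (2 bytes). The `weights_size_gb` helper is a hypothetical name for this example, and the figure ignores everything else a running model needs.

```python
GPT3_PARAMS = 175_000_000_000  # parameter count quoted in the text


def weights_size_gb(params, bytes_per_param=2):
    """Estimate storage for the weights alone, in gigabytes (10**9 bytes).

    bytes_per_param=2 is an assumption (16-bit numbers); use 4 for 32-bit.
    """
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"16-bit weights: {weights_size_gb(GPT3_PARAMS):.0f} GB")
    print(f"32-bit weights: {weights_size_gb(GPT3_PARAMS, 4):.0f} GB")
```

Under that assumption the weights alone come to roughly 350 GB, far more than a typical laptop or single graphics card can hold.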

Taking the Commonsense Route to Better Language Understanding

What, then, is to be done? There is growing recognition that there is no “grand unified model” for language (at least not anytime soon), and the most powerful new approaches in pure machine learning come with their own newly created problems that render them impractical at a minimum and dangerous at worst.

Now, many of the most passionate and celebrated researchers in machine learning are coming around to the need for a more commonsense approach that combines AI techniques, employing both machine learning and symbolic knowledge representation (often called “GOFAI” or good old-fashioned AI). In doing so, they create an approach centered on human knowledge.

That, I am happy to say, is good news for us. We made a bet on this approach more than ten years ago, as we started to build our AI-enabled language technology. We have since brought it to bear across Fortune 1000 companies in just about every vertical and language use case imaginable.

We now power language understanding for any application or process across any domain. Language is data. It’s time to put it to use.

Give us a call…or write. We’ll understand.