Never Stop Expecting More from Your Unstructured Data

Luca Scagliarini - 2 March 2021

Since text-based information began to explode, the need for software to manage it has grown along with it. Traditionally, this software has been based on so-called keyword technologies, as this was the simplest and, for a time, a reasonably effective method for accessing and analyzing information. As the volume of information continued to grow, however, that effectiveness proved limited.

Keyword technologies use algorithms that focus on matching each searched word rather than making sense of its exact meaning. This was useful when we didn’t have alternatives and while the volume of information to process was limited. However, more and more organizations face an ever-growing volume of unstructured data (e.g., text, email, business documents) within their databanks, file sharing systems and CRMs. Without a real understanding of this data, these companies are severely limited in their ability to gain insight and extract knowledge to effectively support decision making.

The question is, how can these organizations effectively use their unstructured data to derive something meaningful?

Making Sense of Unstructured Data with NLU (Natural Language Understanding) Technologies

NLU technology understands text in a way that emulates human comprehension of information. For example, it can identify that a text is about “education” and “sports,” even if there is no explicit mention of those two words. Instead, it uses the concepts correlated to them (education: school, tutoring, teacher, math, etc.; sports: game, team, score, football, quarterback, etc.) to make the connection.
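The idea of inferring a topic from correlated concepts can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how any production NLU engine works: the `CONCEPTS` lexicon simply reuses the example words listed above, and the function name `detect_topics` and its `min_hits` parameter are hypothetical.

```python
import re
from collections import Counter

# Hypothetical concept lexicon, built only from the example words in the text.
# A real NLU system relies on far richer linguistic knowledge than a word list.
CONCEPTS = {
    "education": {"school", "tutoring", "teacher", "math"},
    "sports": {"game", "team", "score", "football", "quarterback"},
}


def detect_topics(text, min_hits=1):
    """Return topics whose related concepts appear in the text, most hits first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for topic, terms in CONCEPTS.items():
        hits[topic] = sum(1 for token in tokens if token in terms)
    return [topic for topic, count in hits.most_common() if count >= min_hits]


# Neither "education" nor "sports" appears in this sentence.
sample = "The quarterback skipped tutoring with his math teacher before the big game."
print(detect_topics(sample))  # ['education', 'sports']
```

The sample sentence never mentions either topic by name, yet both are detected because the words it does contain are correlated to them.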

More importantly, it comprehends conversational language and all its ambiguities (slang, abbreviations, multi-language text) to understand not just the words, but the intent of the user. This proved particularly valuable in gauging voter intent for the 2020 United States presidential election. Working with social research firm Sociometrica, we analyzed the sentiment of more than 500,000 Tweets and provided a more accurate indication of intent than traditional polls.

Sociometrica performed a similar analysis a couple of years ago of Twitter users’ travel intent (i.e., popular destinations and other travel logistics) with regard to Rome, Italy. This particular analysis showcased the technology’s ability to analyze unstructured text and its strength in establishing connections between not just words, but concepts.

The majority of the 30,000+ comments that were analyzed focused on the topics our system categorizes as “transportation,” such as flights, air travel, taxis, buses and the subway. In this way, NLU technology categorized the comments based on their stated or implied subject matter and established a hierarchy of the top concepts mentioned by commenters.

By the same token, NLU technology resolved ambiguous information to determine the proper context of the value judgment “cheap.” This term could mean good value, frugality or poor quality, and without the ability to capture the overall context through a correct linguistic analysis (morphology, grammar, syntax, lexicon), it is difficult to distinguish between these meanings.

Here, it’s not about the guesswork of keywords, but rather the ability to recognize both one word with many meanings and a group of words that have, or are correlated to, the same meaning.
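To make the “cheap” example concrete, here is a deliberately simplified Python sketch. It is an assumption-laden stand-in: a handful of hypothetical cue words (`SENSE_CUES`) replace the full linguistic analysis described above, and it covers only two of the possible senses. The function name `sense_of_cheap` is likewise invented for illustration.

```python
import re

# Hypothetical cue words standing in for real linguistic analysis
# (morphology, grammar, syntax, lexicon). Illustrative only.
SENSE_CUES = {
    "good value": {"deal", "bargain", "great", "affordable", "saved"},
    "poor quality": {"broke", "flimsy", "dirty", "awful", "rude"},
}


def sense_of_cheap(sentence):
    """Guess which sense of "cheap" a sentence uses, based on nearby cue words."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    if "cheap" not in tokens:
        return None  # the word is not used at all
    good = len(tokens & SENSE_CUES["good value"])
    poor = len(tokens & SENSE_CUES["poor quality"])
    if good == poor:
        return "ambiguous"  # no cues, or the cues cancel each other out
    return "good value" if good > poor else "poor quality"


print(sense_of_cheap("The flight was cheap, a great deal."))
print(sense_of_cheap("The hotel felt cheap and dirty, and the shower broke."))
print(sense_of_cheap("It was cheap."))
```

The third sentence comes back as ambiguous, which is exactly the situation a keyword match cannot resolve on its own: the same word is present in all three sentences, and only the surrounding context separates the meanings.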

The Business Advantage of NLU Technology

In business, there is a constant need for analysts to understand more and more information. And while there are plenty of good systems to analyze structured data, the keyword-based systems used to process unstructured information require constant development (and anticipation) of new keyword lists and information to avoid performance deterioration.

The problem is that most organizations do not have the time or resources to do the regular document training necessary to fulfill their needs for deeper knowledge. It is next to impossible to manually think through all the terms that could have the same meaning or the many ways in which something can be said, in English or any other language. Not to mention, there are a number of insights or low-lying trends that have been underestimated or overlooked.

So is NLU the panacea for every analyst then? Of course not. Keywords can be useful, but we are aware of their limitations. Integrating NLU with keyword technology through faceted search is a hybrid solution that further refines search along different paths according to a certain order or category. It could even be a solution to start the migration to a full NLU-based search.
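A hybrid of this kind can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a description of any specific product: each document is assumed to carry a `category` already assigned by an upstream NLU step, the sample documents and the “lodging” category are invented, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    text: str
    category: str  # assumed to be assigned upstream by an NLU categorizer


# Invented sample data for illustration.
DOCS = [
    Doc("Cheap flights to Rome in May", "transportation"),
    Doc("Rome hotels near the Colosseum", "lodging"),
    Doc("Taxi or subway from the airport in Rome", "transportation"),
]


def keyword_search(docs, query):
    """Plain keyword step: keep documents containing every query term."""
    terms = query.lower().split()
    return [d for d in docs if all(term in d.text.lower() for term in terms)]


def facet_counts(docs):
    """Count results per NLU category so the user can pick a path to refine."""
    counts = {}
    for d in docs:
        counts[d.category] = counts.get(d.category, 0) + 1
    return counts


def refine(docs, category):
    """Facet step: narrow keyword results to one NLU-assigned category."""
    return [d for d in docs if d.category == category]


hits = keyword_search(DOCS, "rome")
print(facet_counts(hits))  # {'transportation': 2, 'lodging': 1}
print([d.text for d in refine(hits, "transportation")])
```

The keyword step stays exactly as users know it, while the categories supply the facets. That separation is what makes the approach a plausible starting point for a gradual migration.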

Ultimately, companies expect more from their unstructured data. And they should. But it’s not enough to simply access unstructured data; it needs to be accurately processed. Let NLU guide you forward.