NLP Stream: Successful Data Discovery with Taxonomies

Siloed, unstructured, language-based enterprise data makes it challenging to extract the insights needed for better decision making. Teams that combine personal and open-source approaches tend to spend more time managing their technology stack than using it to create value.

Gaining control over language assets such as documents, emails, reports, and webpages can help teams build more efficient semantic search, intelligent applications, and customized knowledge bases. This can be done through a combination of symbolic (rules-based) and machine learning approaches, which together provide the highest degree of accuracy, explainability, and flexibility.

Tune in to hear Jacob Berk discuss several real-world natural language examples where Hybrid AI is used for successful data discovery. In the livestream you will learn how to:

    • Identify relevant concepts and topics by applying automatic semantic analysis
    • Improve knowledge discovery and natural language applications by building your own knowledge graphs
    • Automate document analysis by semantically classifying large volumes of unstructured data with taxonomies

Transcript:

Brian Munz:

Hey, everybody. Welcome to the Expert AI NLP stream. My name is Brian Munz. I’m product manager at Expert AI, and every Thursday at 11:00 EST we meet here and have a live stream where we talk about all things NLP. Sometimes we have topics that are more pie-in-the-sky, but today we’re going to see some really interesting examples of real-world use cases and how that comes into play with taxonomies. And so today, without further ado, I wanted to introduce Jacob Berk, who works in our consulting services team at Expert AI, so take it away.

Jacob Berk:

Hey, everyone. My name’s Jacob Berk. I’m a solution consultant here at Expert AI, like Brian just said, and today I’m going to be discussing data discovery with taxonomies. And so in order to do this, I’m going to demonstrate Expert AI’s natural language platform, which is a web-based application interface to design, develop, test, and measure natural language understanding models, and then ultimately deploy them into production. So I’m just going to share my screen here.

Jacob Berk:

So like I said, this is an environment to design and develop natural language understanding models. There are two sides to this platform. There is an authoring side, which is where the development of the models happens. You can build taxonomies, add annotations, and measure the model’s efficiency and accuracy. The second side is the production side, and that is where you deploy your models into production. Then you can ultimately expose them as APIs or in whichever way you might want to access them from other systems.

Jacob Berk:

Today, for taxonomy management, we’re going to focus on the authoring side, and I’d like to show three things here. The first is just loading a corpus, or training library, and showing how the platform can help accelerate the building of a taxonomy from the ground up. The second is using our semantic approach to understand what is in a corpus of documents and help to automatically build a taxonomy rather than having to do it from scratch. So this is basically understanding what’s in documents, and then automatically classifying those into different categories. And then lastly, I want to look at importing, or borrowing, an already prebuilt taxonomy, and how you can ultimately use our tools to modify, extend, and leverage that prebuilt model to add different capabilities to it.

Jacob Berk:

So without further ado, let’s head into this first example, which is importing a corpus. We’ll just call this News Stories August 11. What I’m going to do here is import about 678 news stories that I have, and we’ll see how our Expert AI technology is able to analyze this and help us get started on building a taxonomy. You can see we have a bunch of different capabilities, like language detection, but for purposes of this demo, and just to make sure we go a little more quickly, I’m going to turn these off. And you’ll see, in near real time, we’re able to begin processing these news stories. We’ll see up here in a second how many news stories we’ve processed. So right now we have 91, 98. It’s going to take a couple minutes, but we’ll get to all 678 documents here.
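
To make the corpus-loading step concrete, here is a minimal sketch of batch-processing a folder of news stories with the open-source spaCy library. This is an illustrative analog of the platform’s ingestion step, not its actual pipeline, and the news_stories folder path is hypothetical.

```python
# Minimal sketch: stream a folder of news stories through an NLP
# pipeline in batches, the way the platform ingests a corpus.
# (Open-source analog using spaCy; the folder path is hypothetical.)
from pathlib import Path

import spacy

nlp = spacy.load("en_core_web_sm")  # small pretrained English pipeline

texts = [p.read_text(encoding="utf-8")
         for p in sorted(Path("news_stories").glob("*.txt"))]

# nlp.pipe processes documents in batches, which keeps hundreds of
# stories moving in near real time.
docs = list(nlp.pipe(texts, batch_size=50))
print(f"Processed {len(docs)} of {len(texts)} documents")
```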

Jacob Berk:

And so as we begin to load these, you can see that we display a number of different pieces of information from these hundreds of documents. We’ll start to see the clusters of terms, words, and entities, so this is known as named entity recognition, or text mining. Let me refresh again here to see how many we’ve been able to load, and we’re now at 278 documents. This is where we’ll start to understand exactly what is in some of these documents and what the main relevant terms and entities are. And then on the left side here, this is where we’ll start to get some of the main topics, which can actually help us in the creation of a taxonomy. So this is going to give us some ideas of what the classifications or categories might be if we were to build a taxonomy with these 678 documents.
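
Continuing the sketch above, the entity clusters the platform surfaces can be approximated by counting spaCy’s named entities across the processed documents — again, an open-source stand-in for the platform’s own extraction:

```python
from collections import Counter

# Tally named entities across all processed docs (the `docs` list
# from the corpus sketch above) to surface the most relevant
# people, organizations, and places.
entity_counts = Counter(
    (ent.text, ent.label_) for doc in docs for ent in doc.ents
)
for (text, label), n in entity_counts.most_common(10):
    print(f"{text} ({label}): {n}")
```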

Jacob Berk:

Just refreshing one more time, we’re at 474, so I’m going to give it another few seconds so we can get to all 678, and then we’ll run through some of the interesting capabilities that can help us understand a little more of what’s in these documents and better build a taxonomy or classification. While we’re waiting for these to load, here’s how this works: we have a core disambiguator in the background, which is analyzing the language data in these documents across a variety of layers, including semantic, syntactic, morphological, and part-of-speech analysis. That is informed by Expert AI’s knowledge graph, and together these work to infer the meaning of each word, phrase, token, sentence, and paragraph throughout all of these documents. And so we should be just about loaded here. Yep.
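
The layered analysis Jacob describes can be illustrated with spaCy’s token attributes, where each token simultaneously carries part-of-speech, morphological, syntactic, and lemma information — a rough open-source analog of the disambiguator’s layers, not its actual output:

```python
# Each token carries several layers of analysis at once: part of
# speech, morphological features, syntactic dependency, and lemma.
doc = nlp("President Obama visited the troops on Tuesday.")
for token in doc:
    print(token.text, token.pos_, token.morph, token.dep_, token.lemma_)
```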

Jacob Berk:

So as you can see, we have all these relevant terms, entities, and synonymous concepts, which are normalized concepts within Expert AI’s knowledge graph. And again, what’s probably going to be most important and most powerful in informing our taxonomy creation here is the topics on the left side. You can see here, if we click into this, the 355 documents that are related to politics. If we want to go into armed forces, you can see up top here that we’re actually combining those to find which documents belong to both the topic politics and the topic armed forces. There might be a taxonomy in which documents can belong to only one classification, or to a variety of classifications, so this can start to help us understand where our documents lie in these different categories or topics.
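
Combining topic filters the way the demo does amounts to intersecting the sets of documents assigned to each topic. A toy sketch, with hand-written topic assignments standing in for the platform’s automatic ones:

```python
# Hypothetical topic assignments per document id; in the platform
# these come from automatic semantic analysis.
doc_topics = {
    "doc_001": {"politics", "armed forces"},
    "doc_002": {"politics"},
    "doc_003": {"sports"},
}

politics = {d for d, topics in doc_topics.items() if "politics" in topics}
both = {d for d, topics in doc_topics.items()
        if {"politics", "armed forces"} <= topics}
print(len(politics), "politics docs,", len(both), "in both topics")
```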

Jacob Berk:

Getting into some entity recognition here. You can see we have a variety of different entities that come up throughout these documents. We also have different types associated with each of these entities, about 28 predefined types in all. These include people, organizations, geographic locations, things like that, and you can see the types underneath the different entities here. We can actually use these to start to filter our documents and understand a little more about what’s in them.

Jacob Berk:

Say we want to look at, for instance, the keyword politics. This will look for wherever the keyword, the exact string match politics, is located within all of these documents, and we’ll see that there are about 36 documents where it appears. But again, if we look at politics as a topic, we’ll see that there are actually 355 documents that are related in some way, shape, or form to politics.
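
The difference between the two searches is easy to express in code: an exact string match scans the raw text, while a topic match consults the semantic assignments. A sketch, reusing texts and the hypothetical doc_topics mapping from the earlier snippets:

```python
keyword = "politics"

# Exact string match: only stories that literally contain the word.
literal_hits = sum(1 for text in texts if keyword in text.lower())

# Topic match: stories assigned the topic, even when the word
# "politics" never appears in them.
topic_hits = sum(1 for topics in doc_topics.values() if keyword in topics)

print(f"{literal_hits} keyword matches vs {topic_hits} topic matches")
```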

Jacob Berk:

If you want to go through these different entities in a little more organized fashion, rather than just searching here, you can also take a look at this left side, where we have different people and how many times they come up throughout the documents, then organizations, and so on. We have about 2,500 different organizations, and when you click into them, they’re actually sub-layered into different subgroups: there are organizations, there are companies, there are mass media companies. Then we can go through this variety of entities on the left side to understand a little more about what is in some of our documents. And if we want to click into any of these documents, just to read a little more and see what the entities throughout that particular document are, we can do that as well.
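
The browsable left-hand panel boils down to grouping extracted entities by their type. A sketch with spaCy’s labels (PERSON, ORG, GPE, and so on) standing in for the platform’s 28 predefined types:

```python
from collections import Counter, defaultdict

# Group entities by type so they can be browsed like the demo's
# left-hand panel: people, organizations, geographic locations, ...
by_type = defaultdict(Counter)
for doc in docs:  # `docs` from the corpus sketch above
    for ent in doc.ents:
        by_type[ent.label_][ent.text] += 1

for label in ("PERSON", "ORG", "GPE"):
    print(label, by_type[label].most_common(5))
```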

Jacob Berk:

Another powerful example of the way we can look at keywords versus specific types of entities: if we look up Barack Obama, for instance, again as a keyword, we’ll find all the instances, which are eight documents, where Barack Obama, the exact string match, is located within these documents. But again, we have the powerful entity type that we can search on instead. NPH is the entity type for a person, and searching on that, we’ll see that he’s actually mentioned in 17 documents, not just eight.
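
A sketch of the difference: the string search only matches the literal phrase, while the entity search matches any known surface form of the person. The alias set is hand-written here; in the platform, this normalization comes from the knowledge graph.

```python
# Hypothetical alias table standing in for knowledge-graph
# normalization of surface forms to one person entity.
OBAMA_ALIASES = {"barack obama", "president obama", "obama"}

string_hits = [doc for doc in docs if "barack obama" in doc.text.lower()]

entity_hits = [
    doc for doc in docs
    if any(ent.label_ == "PERSON" and ent.text.lower() in OBAMA_ALIASES
           for ent in doc.ents)
]
print(len(string_hits), "string matches vs", len(entity_hits), "entity matches")
```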

Jacob Berk:

And you’ll see in several of these examples, President Obama. Obviously we don’t have the exact string match Barack Obama, but we are able to understand that President Obama and Barack Obama are, in fact, the same person. This is a very simplistic example, but when we stretch this out into the different ways this might be powerful: sometimes there are pronouns used to describe different people, for instance he or she instead of the person’s name, and we are actually able to analyze the entire document and understand the context of different sentences and tokens to map a pronoun to a particular person, understand who we’re talking about within each sentence, and understand the relations there. So this is, again, extremely powerful just for beginning to develop a taxonomy and understanding, most importantly, what the topics are across this variety of news stories. But also, if we want to do some more in-depth searching, we can do that using some of our different capabilities here, such as synonymous concepts or the different entity types.
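
Pronoun-to-person mapping is coreference resolution. As one open-source illustration (not the platform’s resolver), the fastcoref library groups mentions of the same entity into clusters; this assumes fastcoref has been installed via pip install fastcoref.

```python
# Sketch: coreference resolution with the open-source fastcoref
# library (an illustration, not the platform's own resolver).
from fastcoref import FCoref

model = FCoref()  # downloads a pretrained coreference model
preds = model.predict(
    texts=["Barack Obama spoke on Tuesday. He thanked the troops."]
)
# Each cluster groups mentions of one real-world entity, so "He"
# resolves to "Barack Obama".
print(preds[0].get_clusters())
```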

Jacob Berk:

And so I’m going to now show a couple of different ways in which we can use our tools to build and develop taxonomies. The first is to automatically create a taxonomy using Expert AI’s Magic taxonomy, which I’ll get into in a second, and the second is importing, or borrowing, a prebuilt, industry-standard taxonomy. So we’ll head over here and, like I said, we’re going to start with our Magic taxonomy, and we’re going to use the same news stories as before. We’ll create a categorization project here and name it News Stories Categorization August 11.

Jacob Berk:

We’ll have to choose a language here. As you can see, we support a variety of languages, but for right now we’re just going to stick with English. And as you can see, there are three different ways we can go about creating a taxonomy. We can create a taxonomy from the ground up, which requires a little time to annotate documents and then ultimately build the model we want to use for this classification. We can import a taxonomy, which I’ll get to in just a few moments. But right now, I want to focus on building a Magic taxonomy, which basically uses Expert AI’s powerful semantic understanding to analyze documents, and we’re going to use the same corpus of news stories we just went through to build it.

Jacob Berk:

What it will do is automatically run through the documents using machine learning techniques to understand what is in them and start to build a taxonomy based on that. So we’ll upload a training library here. Again, we’re just going to use the news articles. And while this is processing, this is working in the background using Expert AI technology to understand what is in all 678 of the documents we just took a look at and start to classify them into different clusters or categories, from which we can start to determine how we might want to analyze these. We can build off of what the Magic taxonomy has built for us, but also do some manual customization if we have an idea of what our taxonomy should look like.
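
A rough open-source analog of what a Magic taxonomy does is to cluster the corpus and read each cluster’s top terms as candidate category labels. The platform’s actual method isn’t public; this sketch uses TF-IDF and k-means from scikit-learn:

```python
# Sketch: cluster the corpus and surface each cluster's top terms
# as candidate taxonomy categories (a stand-in for the Magic
# taxonomy's automatic build, whose internals are not public).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(texts)  # `texts` from the corpus sketch

km = KMeans(n_clusters=8, random_state=0, n_init=10).fit(X)

terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = [terms[j] for j in center.argsort()[-5:][::-1]]
    print(f"cluster {i}: {', '.join(top)}")  # e.g. commander, combat, battle
```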

Jacob Berk:

Here, we can manually configure whether we want a certain number of classifications; we can toggle that on or off. For right now, we’ll just go with a completely automatic build of the Magic taxonomy. This is going to take a couple of minutes, since we’re running through a decent number of documents, so we’ll let it run in the background. While this is happening, I’ll move on to our third example, which is loading, or borrowing, a taxonomy that is open source.

Jacob Berk:

So we’re going to create a thesaurus project here, and we’re going to ultimately import into it. I found an astronomy taxonomy online, along with some documents to run through it, so we can see exactly how this astronomy taxonomy works. Again, we’re just going to use English. And so we import the source here, which is an industry-standard RDF file, in XML format. This taxonomy is about astronomy, so there are going to be hundreds, if not thousands, of different concepts layered in this already prebuilt taxonomy that we’re importing into Expert’s interface. And then we’re going to see how we can leverage some of our tools to extend the capabilities of this taxonomy.
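
Industry-standard taxonomies like this one are typically published as SKOS in RDF/XML, which the open-source rdflib library can read directly. A sketch, with a hypothetical file name:

```python
# Sketch: load a SKOS taxonomy published as RDF/XML and list its
# concepts with their preferred labels. The file name is hypothetical.
from rdflib import Graph
from rdflib.namespace import RDF, SKOS

g = Graph()
g.parse("astronomy.rdf")  # format inferred from the file extension

for concept in g.subjects(RDF.type, SKOS.Concept):
    for label in g.objects(concept, SKOS.prefLabel):
        print(concept, label)
```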

Jacob Berk:

And as you can see, with each of the different ways of analyzing and importing taxonomies and documents here, we have this wizard that brings us step by step through the different capabilities. As you can see, that was pretty quick: we were able to import this RDF taxonomy that we found online. You can see here there are hundreds of different classifications, and some of these even have subgroups we can look further into, so this is a pretty extensive taxonomy. We’ll start to get into labels and relations, but we first want to upload a library so we can start to take a look at that. I’m not going to use the news articles from before; I’m actually going to upload a corpus of astronomy-related documents. Again, just for purposes of going a little more quickly, we’ll disable some of these auto-detect language and auto-detect encoding capabilities. This is going to take a second to process, but it’s only 22 documents, so it should be pretty quick.

Jacob Berk:

And so we’ll open the project. As you can see here, we have the same taxonomy that we just saw in the wizard, and we now want to see all the different ways we can extend it. This is something we found that was open source, but we might want to customize it a little here and there. So we can check out some of the custom properties, and we can start to create some custom properties ourselves. We can start to create different relations between the different classifications we have over here. If we search through these to find Earth, for instance, we’ll see that Earth has a variety of different classifications. So there’s the planet Earth, then we have the atmosphere and things like that, and we might want to add different properties to Earth to note that it’s a planet, right?

Jacob Berk:

So we’ll go over here and create a custom property. We’ll make it a Boolean, though we could define it as a string or a number, and we might want to call it “is planet.” The description here would just be “is this categorization a planet?” Once we create that, we can go back, type in Earth again, click that here, and there we go.

Jacob Berk:

Then we add here: true, for Earth is a planet. We can then go through some of the other planets we might have within our taxonomy and add that custom property to them. We can also add relations; we might want to relate Earth to the Milky Way galaxy to capture that it’s within the Milky Way, or things like that. There are different ways we can add relations there, very similar to how we just added a custom property. So you can see how powerful these tools can be in extending an already prebuilt taxonomy, adding a little more detail when there’s something we want to extend and build on.
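
In RDF terms, a custom property and a relation are both just new triples on a concept. A sketch with rdflib, using a hypothetical namespace and concept URIs:

```python
# Sketch: extend an imported taxonomy with a custom Boolean property
# and a relation. Namespace and concept URIs are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/astro#")
g = Graph()

# Custom property: "is planet" = true for Earth.
g.add((EX.Earth, EX.isPlanet, Literal(True)))

# Relation: Earth sits within the Milky Way galaxy.
g.add((EX.Earth, SKOS.related, EX.MilkyWay))

print(g.serialize(format="turtle"))
```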

Jacob Berk:

So with that, I’m going to go back to our previous example, where we were using our Magic taxonomy to automatically create a taxonomy from the 678 news stories we looked at earlier. If we open this project here, it’s done loading; creating a taxonomy from the ground up for 678 news stories only took a few minutes, so it’s pretty powerful and pretty quick. If we look at some of the classifications this Magic taxonomy came up with for us, we’ll see these might not be the names we want, but they give us a good idea of the classification of the different documents within our corpus. We can see commander, combat, battle, soldier; we can click into this one, and we might actually want to rename it to something else, something like military or combat.

Jacob Berk:

And so we can see that this might not be perfect from the start, but the Magic taxonomy is extremely powerful in helping us begin to create a taxonomy without having to go through the documents manually and determine exactly what’s in them. We can start from a very good foundation of what the taxonomy might be, using some of the automatic semantic and syntactic capabilities that Expert AI applies when going through all of these different documents and news stories.

Jacob Berk:

So I know we went through a lot there. We spent about 20 minutes looking at some of the different ways we can start to create taxonomies: building one from the ground up, starting with a corpus and determining the different topics in it; automatically building one using a Magic taxonomy; and importing and extending an already prebuilt taxonomy. So I’ll pause there and see if there are any questions.

Brian Munz:

Yeah. I mean, this has all been very interesting, especially considering the possibilities of importing your own taxonomies. I mean, is there kind of a… Where do people go to find these? Is there a community of people that all share these open-source taxonomies?

Jacob Berk:

Yeah, there’s a variety of different sources online to get different data. Kaggle is a really good source if you want to find different data sets to train models, and then there are different communities online where you can go and find already prebuilt models. And then, again, you can use different pieces of technology, like Expert, for instance, to extend already prebuilt models in order to get a more customized version of what you might want, without having to create a taxonomy from the ground up.

Brian Munz:

Yeah. Yeah, no, it’s interesting because you can see some of the obvious ways this could be implemented. One use case would be in search, of course, because even just loading documents into the disambiguator, you come out with things that could be useful within the search context. Like the example of Obama as a president versus a person: knowing the difference is, I think, extremely helpful in search. What are some of the other very common use cases for someone building out something more custom? Because I get asked that a lot: “Well, if you come out with these entities, and it knows that it’s an organization or it knows that it’s a person, what would a company want to use it for?” So that’s just something I’m asked pretty often.

Jacob Berk:

Yeah, that’s a great question. So first of all, I think something that’s pretty powerful is normalization. Using a very recent example, COVID, for instance: it might be COVID, COVID-19, SARS-CoV-2, whatever it might be, but they all really mean the same thing. So the first step is determining the normalized meaning of each word rather than saying, “Hey, these are all different words and we want to categorize them differently.” Once we normalize them, we can better use that information to categorize the different documents, or perhaps there’s a variety of different…

Jacob Berk:

It doesn’t necessarily have to be documents. It could be, again, like we just looked at, news articles; it can be email management; it can be social media, combing through tweets or other pieces of social media, and basically determining what is within whatever text we’re looking at. Again, we normalize it so that we’re able to understand that even though strings might not match exactly, they actually mean semantically the same thing. Then we can go through a classification or taxonomy exercise, like we did in a variety of ways here, in order to organize different text data in a nice, structured fashion. And there are a variety of use cases, again: email management is something that’s extremely powerful, just to name one or two, social media management, and things like that.
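
The normalization step Jacob describes can be sketched as a lookup from surface variants to one canonical concept. The alias table is hand-written here; the platform derives it from its knowledge graph:

```python
# Toy normalizer: map surface variants to one canonical concept
# before classification. The alias table is hand-written here;
# in the platform this comes from the knowledge graph.
ALIASES = {
    "covid": "COVID-19",
    "covid-19": "COVID-19",
    "sars-cov-2": "COVID-19",
    "coronavirus": "COVID-19",
}

def normalize(term: str) -> str:
    """Return the canonical concept for a known variant, else the term."""
    return ALIASES.get(term.lower(), term)

print(normalize("SARS-CoV-2"))  # -> COVID-19
```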

Brian Munz:

Yep. No, that makes sense. I mean, we’ve talked about it a few times on our live streams, but it’s interesting to see how important NLP is, and how closely tied it is to analytics. It always was, but I came from the data world, and a lot of time there is spent on normalizing data and understanding unstructured data, so it’s a pretty important feature to highlight, I think. So this has been super interesting; thanks for presenting it. It’s really great to see it live and in the wild, and the live demo is always nerve-racking, but everything went well, so thanks for presenting today.

Jacob Berk:

Yeah, of course. Thanks, Brian.

Brian Munz:

Yep. Yep, so that’s it for this week. Next week we’re going to be talking to Expert AI’s CEO, Walt Mayo. We’re going to have a conversation and the title is From Black Box to Green Glass: The Responsible AI Imperative. And so what does that mean? You’ll have to tune in and see what we mean by black box to green glass. But yeah, so we’ll hopefully see you next week. Same time, same place. And thanks again, Jacob, for presenting. See you.

 
