NLP Stream: Making a Smaller World with NLP and Linked Data

Linked data and the Semantic Web have come a long way in helping better the world. However, it remains particularly challenging to extract concepts and metadata from unstructured data and transform them into standardized concepts.

In this presentation, Brian Munz discusses the idea behind linked data (JSON-LD in particular) and shares how natural language processing can be used to help take advantage of it. From easily enhancing the SEO of a website to making your application more interoperable, natural language processing can make your projects better understood by humans and machines alike.

Transcript:

Brian Munz:

Hi everybody and welcome to the NLP Stream, which is our weekly live stream about all things NLP. And as usual, I’m your host Brian Munz for Expert.ai, I’m a product manager there. This week it’s going to be me presenting so buckle up. No. I’m going to give a talk today about some things that are a little bit more technical and maybe somewhat confusing. Hopefully, I can explain it in a way that it’ll make sense. And I want to show how NLP applies to a certain area of study. Without further ado, I will jump right into it if I can share my screen. Sorry about this. Here we go. Okay.

Brian Munz:

Today, I’m going to talk about making a smaller world with NLP and Linked Data. What I’m really going to talk about is Linked Data, and, in particular, I’m going to focus on JSON-LD. They’re not new technologies, but they’re ones that have gained usage and attention slowly. They’re very powerful, and I want to talk about how NLP can apply to these topics. And so, obviously, everyone who’s watching this probably … If you know NLP at all you know what it is, but I just wanted to show some points about it in order to illustrate the similarities when I start talking about Linked Data. As you know, what artificial intelligence is to intelligence, NLP, in particular, is to a machine understanding language. It deals in semantics and, in particular, the disambiguation of language. Expert.ai takes a symbolic approach to NLP, and we benefit from knowledge graphs, which are an important concept in Linked Data. As you’ll see in a second it very much applies. They’re not one and the same but they’re very much related.

Brian Munz:

But an important point of this is that the disambiguation part and the bit of semantics is … What I mean by saying this technology is not for you is that as a person you can look at these two images and know that they are spelled the same way but know that they are different easily. And so it’s easier for a human, of course, to understand language. And so NLP, and this goes without saying, is for computers and it’s to help them to understand language. And in that way, it speeds up, and at the very base level, enables us to speed things along where you don’t need a human to do it again.

Brian Munz:

Linked Data, it was a project that began in 2006 by Tim Berners-Lee, who is a famous person within the history of the web. He is director of the W3C, which is this sort of … Basically, it’s this organization where they decide on pretty much all of the standards of the web, and then that is then built into web browsers. And so he had this idea or at least popularized it, which was linking data. What this means is not just having a database of information, which makes sense, but also linking those pieces of data and having the ability for a computer to not only … To get a better understanding of what these concepts are by pointing them to other concepts. You can see a network chart here.

Brian Munz:

In order for this to work though, you have to link these concepts and know what these concepts are. Or, a computer has to understand what these concepts are. If we see John Lennon, we know that’s a person, and we know what a person is, and we also know what attributes can belong to a person. They have a name, they have a birth date. All of these things are attributes and ways that we can link between different sets of data. As I said before with NLP, this relates to semantics and disambiguation because what you’re trying to do is disambiguate knowledge for a computer to understand. And I’m going to, of course, give some examples later of how this is useful, but it’s a very important concept of trying to basically create this vast network of everything and linking everything together in order for machines to understand what all these things are again. Again, it’s not for us, it’s for us to make things easier for a machine to understand.

Brian Munz:

And so, not to get too technical again, but there’s the delivery method of how this Linked Data is presented. One of the main use cases for Linked Data is interoperability, which is a big word. What it basically means is, if everyone knows … If all the machines know what the thing you’re dealing with is, you can more easily pass that information around. Again, with the bass versus the bass, if everyone has agreed that it is … A bass is a fish, then you can pass that along, and then the other machine process will understand it better. And so in order to do this, you have to have a format that computers can understand. So when you’re passing this information back, you can’t just send it over in whatever type of text format you want.

Brian Munz:

A format which is more popularly used in development is JSON-LD. And it’s not all that complicated, it’s just that JSON is the format of data in JavaScript. So JSON stands for JavaScript Object Notation, and it’s just the format, and you’ll see it in a few minutes, it’s not all that crazy. And then the Linked Data piece is where in that block of data it points to all the proper places. Something I should have mentioned before is, part of this Linked Data project is … All of this stuff is online and I can even show you if I can get to it. Yes, I’m getting to it now.

Brian Munz:

One of the more common examples of this is something like DBpedia. So basically DBpedia is a database sort of like Wikipedia, but it’s a database that is pretty much trying to have everything. And so all of this stuff, again, needs to be online because this is supposed to be for a machine. And so when a machine receives this block of JSON-LD, which has … And I’ll show an example of it in a second. Has all this information in it, it can reference this website, and this website will be a reference to what exactly we’re talking about when we say the words John Lennon. All the information is on here. And it’s here formatted in the proper way that basically, all of the machines can agree upon and understand what it is. In that way, as mentioned before, you are disambiguating for a machine because it, obviously, can’t understand.

Brian Munz:

And so here’s a quick example of JSON-LD. This is something that would again, be passed probably through an API. For example, if you have a web application that wants to return a person. So it’s asking, send this person to the web application, if the person logged in, for example, and it could come back in this format, which would be a format that would be … If it’s JSON-LD, it’s something that’s agreed upon, the format is known. All this really is saying is that it’s a person and that they live with this person … At this address, and that their name is Jane Doe. The important thing here is this part around context, and so I’ll show you it’s … There’s this website schema.org, and within Linked Data it always points to a place. Like I mentioned before, it points to a location where the two machines or the data is saying, “This is where you should learn what this thing is, what a person is.” Again, I hope this isn’t too confusing.
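[Editor’s note: The slide itself is not reproduced in the transcript. A minimal sketch of what a JSON-LD person like the one described might look like is shown here in Python. The address values and the to_jsonld helper are assumptions for illustration; only the name Jane Doe, the Person type, and the schema.org context come from the talk.]

```python
import json

# Hypothetical address values: the slide's actual data is not in the transcript.
person = {
    "@context": "https://schema.org",  # where a machine learns what "Person" means
    "@type": "Person",
    "name": "Jane Doe",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "123 Example Street",
        "addressLocality": "Springfield",
        "addressRegion": "IL",
        "postalCode": "62701",
    },
}


def to_jsonld(doc: dict) -> str:
    """Serialize a JSON-LD document the way an API response body might carry it."""
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    print(to_jsonld(person))
```

[The "@context" and "@type" keys are what turn ordinary JSON into Linked Data: they point the receiving machine at the shared definition of each term.]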

Brian Munz:

So there’s this site called schema.org, which again, schema is attempting to take basically all of these concepts and split them out into what they are to help it to be understandable for a computer. Maybe I’ll make this larger. You can see on here, you have a person and so what attributes are there of a person? You have names, you have tons of stuff. Job titles, everything. And then all of these attributes also need to be defined. So what is an address? The components of an address is postal address, which is this, and so it has all the things that could be within an address.

Brian Munz:

In this way, even though this seems confusing and it might be a lot of work, what it actually is doing is it’s taking all of the ambiguity out of this and it’s … Instead of just showing … Sending a block of text, as we know in the world of NLP, you just send an address through. You then have to do this work to understand that this is an address. You have to take that text, you have to try to understand it. This is removing that ambiguity. And so in this way, this is one of those ways that you could see the two pieces fitting together. Within NLP, you could identify an unstructured text, this is an address, then you could parse that out into these proper schema types, and make this into JSON-LD that is now Linked Data, it’s easier for a machine to understand.
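[Editor’s note: As a rough sketch of the step described above, the following Python takes free text, finds an address, and emits it as a schema.org PostalAddress. The regular expression is a deliberately naive stand-in for a real NLP extractor and is not how Expert.ai works; the pattern, function name, and sample sentence are all assumptions.]

```python
import json
import re
from typing import Optional

# Naive stand-in for an NLP address extractor. It only handles the
# "number street, city, ST 12345" shape and is an illustration only.
ADDRESS_PATTERN = re.compile(
    r"(?P<street>\d+[^,]+),\s*(?P<city>[^,]+),\s*(?P<region>[A-Z]{2})\s+(?P<zip>\d{5})"
)


def address_to_jsonld(text: str) -> Optional[dict]:
    """Find the first address-like span in text and map it to schema.org terms."""
    match = ADDRESS_PATTERN.search(text)
    if match is None:
        return None
    return {
        "@context": "https://schema.org",
        "@type": "PostalAddress",
        "streetAddress": match.group("street").strip(),
        "addressLocality": match.group("city").strip(),
        "addressRegion": match.group("region"),
        "postalCode": match.group("zip"),
    }


if __name__ == "__main__":
    sample = "Please ship it to 123 Example Street, Springfield, IL 62701 by Friday."
    print(json.dumps(address_to_jsonld(sample), indent=2))
```

[Once the text has been turned into typed fields like these, the receiving system no longer has to guess which part is the city or the postal code.]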

Brian Munz:

I’ll show an example here. Of course, this is Expert.ai, it’s our NLAPI. I probably need to refresh it. Hopefully, the site is not being weird while I’m trying to demo it. It may be a slow connection so I may come back to that. It’s fine. All that to say, JSON-LD is to help computers better understand and handle information through standardization of language, and that is what NLP also does. And so in that way, within the implementation of JSON-LD, NLP can be a big help for that. And I’ll try to show a few real-world examples of how NLP could be used to enable Linked Data, especially JSON-LD like I said because I’m focusing more on the developer side of things. The real world, hopefully, that wasn’t too confusing but maybe this will help show how this is actually used.

Brian Munz:

If you actually start to look up Linked Data or JSON-LD, the most common way that it’s used is within SEO. And so Google has been a large part of the adoption of JSON-LD, of course, and also the development of it. So if you think about Google, their entire value is in being able to understand what is on a webpage and then index that. And so, of course, there’s things in a webpage, which I’ll show in a second here. Let’s just go to it, that’s a bad example. So if you go to a webpage, your average webpage, there’s things that already exist where this is … Don’t bother reading this, I’m just trying to make a point. This metadata is telling basically Google what to expect on the page. But what would be even more useful is again, taking out all ambiguity and also relating these things to actual known concepts.

Brian Munz:

For example, if we take an article like this, so the Washington Post about Joe Manchin reaching a deal with Democrats, you want the Google engine to understand what exactly is on this page. And it could scan the whole page, it could use some of the metadata, and it does, but its preferred method and its fastest method, in order to remove all ambiguity and have the best possible results, is to use JSON-LD. And so if we looked at what a schema may look like, this is what JSON-LD would look like on a … I’m going to blow it up. On a webpage. Like I showed you earlier where it would show the reference, and so we’re knowing … The information we’re showing here is from schema.org so the reference of a news article is on schema.org, which I thought I had up here.

Brian Munz:

And so on a news article, what are the pieces that define a news article? All of this stuff is information we could give to Google to understand what is contained on this page. And then all of this stuff is stuff that can be used for Google to properly index so that when someone does a search they are retrieving stuff they feel the most confident about. For example, if taking out some of the more basic pieces, you have a headline, of course, you have the date published and modified. The description. Maybe we should look down here actually. Over here. All right. So over here you have author. And so it goes without saying that an author is a person, but this again is stuff that removes all ambiguity. It has three authors, and it tells you that the author is a person. That the publisher of it is Washington Post. And you can again see here later genre, keywords, URL, et cetera. This is all stuff that helps Google.
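[Editor’s note: The page markup shown on screen is not in the transcript. Below is a hedged sketch of how NewsArticle JSON-LD with the fields mentioned (headline, dates, description, author, publisher, genre, keywords, URL) might be assembled and embedded in a page. All values and both function names are placeholders, not the Washington Post’s actual markup.]

```python
import json


def news_article_jsonld(headline, description, date_published, date_modified,
                        authors, publisher, genre, keywords, url) -> dict:
    """Build a schema.org NewsArticle document from already-known page metadata."""
    return {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "description": description,
        "datePublished": date_published,
        "dateModified": date_modified,
        # Each author is stated explicitly to be a Person, removing ambiguity.
        "author": [{"@type": "Person", "name": name} for name in authors],
        "publisher": {"@type": "Organization", "name": publisher},
        "genre": genre,
        "keywords": keywords,
        "url": url,
    }


def as_script_tag(doc: dict) -> str:
    """Wrap JSON-LD in the script element that crawlers look for in a page."""
    # Escape "</" so text inside the JSON cannot close the script element early.
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    article = news_article_jsonld(
        headline="Placeholder headline",
        description="Placeholder one-line summary of the article.",
        date_published="2022-01-01T09:00:00Z",
        date_modified="2022-01-01T12:00:00Z",
        authors=["Author One", "Author Two", "Author Three"],
        publisher="Example News",
        genre="Politics",
        keywords=["placeholder keyword"],
        url="https://example.com/placeholder-article",
    )
    print(as_script_tag(article))
```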

Brian Munz:

A way that NLP could potentially come into play is … One thing NLP does very well is it takes unstructured data and extracts the important pieces out of it. I’m having a hard time getting to this site so I may have to just explain it. I’m not sure what’s going on. Within NLP, what I had done, which unfortunately is not working, is I took this article and ran it through to extract the key entities to understand what the main topics contained in the article are. A very common use case for NLP is taking articles and extracting those ideas. For the genre, it was able to identify that the topic within this article was politics, and you could have this supplied within the JSON-LD, which then again provides less ambiguity for the machines to understand what’s on the page. It extracts the keywords. These keywords are things that will be used by Google, of course, to … So if someone types in Joe Manchin, this should be something that’s within there. They have a complex algorithm, but those keywords can be extracted through NLP as entities within this article.

Brian Munz:

We could even, if we had wanted to, represent these keywords as entities like this where we would be able to show that Biden is a person, and again, keep linking to that data so that the references are easy for a machine to understand. Let me check one more time but I don’t think it’s going to work. Okay, that’s fine. All that to say, this is just another way to extract these concepts. Then there’s the article description, which is a very important thing within SEO as well, because it’s sort of a one-line way to describe the concept of the article. What could also be done is, within NLP you can identify the most important sentences in the article and so it’d be very easy to just identify that, put that within the description of your SEO, and it’s good to go.
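[Editor’s note: One possible way to represent extracted keywords as linked entities is sketched below. The extracted list is a hypothetical NLP result, not output from Expert.ai’s API, and the DBpedia URIs are illustrative and have not been checked against the live site. The schema.org properties "mentions" and "sameAs" are real, but using them here is one design choice among several.]

```python
import json

# Hypothetical NLP output: (surface text, schema.org type, knowledge-base URI or None).
extracted = [
    ("Joe Manchin", "Person", "http://dbpedia.org/resource/Joe_Manchin"),
    ("Joe Biden", "Person", "http://dbpedia.org/resource/Joe_Biden"),
    ("Democrats", "Organization", None),
]


def entities_to_mentions(entities) -> list:
    """Turn extracted entities into typed nodes that link out to a shared reference."""
    mentions = []
    for name, schema_type, uri in entities:
        node = {"@type": schema_type, "name": name}
        if uri is not None:
            # sameAs tells the machine which known concept this name refers to.
            node["sameAs"] = uri
        mentions.append(node)
    return mentions


article = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "keywords": [name for name, _, _ in extracted],
    "mentions": entities_to_mentions(extracted),
}

if __name__ == "__main__":
    print(json.dumps(article, indent=2))
```

[Plain keywords remain strings a search engine has to interpret; the typed "mentions" nodes state outright that Biden is a person and point to where that person is defined.]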

Brian Munz:

And so, of course, a lot of that stuff is handled as the person enters the article in. And so I’m just using this as an example of ways these can coincide because there aren’t as many articles, and people do need to enter stuff in themselves. However, there’s other use cases like this where … To touch on interoperability, you would potentially have these large volumes of documents, or large volumes of unstructured data, and so you could use NLP to draw these concepts out. And again, if you were trying to send that to another machine, you need to have it properly understand what these concepts are.

Brian Munz:

And so the second quick use case I wanted to talk about. I can’t totally show it because it’s not readily available but I’ll talk a little bit about it. There’s a company called Measure IO that I’ve been involved with in the past, and they are using JSON-LD and Linked Data … As well as blockchain to address the supply chain issue. Which is that, with imports and exports, traceability is very important because, through different problems within forms and systems and things, stuff can get lost, things can get delayed, it’s difficult for the different systems to communicate with each other, and it results in a huge percentage of food waste actually. Trying to improve this and trying to have these different systems work together and understand, and also give the confidence that all of the different stops in the supply chain are signed properly, understood, and traced is very important, and transforming them to JSON-LD can have an effect on that.

Brian Munz:

I’ll go back. What is done often within Linked Data is you need to have this vocabulary. So as I showed within schema, we have things like person or whatever, but what about supply chain and more specific type use cases? You can go within open source and there’s all of these W3C factions and communities where what they just start to do is define the vocabulary, and they basically have their own reference points to say, “Okay, if this is a form of a certain step of the supply chain, what does that form look like? What are the elements?” And again, within that form, you will have a person who approves it, who is then a person. Again, all of that data becomes linked so that when that system sends it along and says, “This form has been submitted,” it can go to the next step along the line, and it’ll be much easier for that step to A have confidence that it’s valid, and B understand what each of the pieces of the form are, and this would speed up the whole process immeasurably.
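[Editor’s note: To make the idea of a community-defined vocabulary concrete, here is a small sketch of a supply chain form that mixes custom terms with schema.org’s Person. The vocabulary URL, the term names (InspectionForm, shipmentId, approvedBy), and the form values are all hypothetical, and expand_term implements only a simplified subset of JSON-LD term resolution, not the full expansion algorithm from the specification.]

```python
import json

VOCAB = "https://example.org/supply-chain#"  # hypothetical community vocabulary

inspection_form = {
    "@context": {
        "schema": "https://schema.org/",
        "sc": VOCAB,
        # Short terms used in the data, each mapped to a globally unique definition.
        "InspectionForm": "sc:InspectionForm",
        "shipmentId": "sc:shipmentId",
        "approvedBy": "sc:approvedBy",
        "Person": "schema:Person",
        "name": "schema:name",
    },
    "@type": "InspectionForm",
    "shipmentId": "SHIP-0001",
    # The approver is a schema.org Person, so it links to the same concept
    # every other system already agrees on.
    "approvedBy": {"@type": "Person", "name": "Jane Doe"},
}


def expand_term(term: str, context: dict) -> str:
    """Resolve a short term to its full IRI (simplified; prefix:suffix terms only)."""
    value = context.get(term, term)
    prefix, sep, suffix = value.partition(":")
    if sep and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    return value


if __name__ == "__main__":
    ctx = inspection_form["@context"]
    print(expand_term("approvedBy", ctx))
    print(expand_term("Person", ctx))
    print(json.dumps(inspection_form, indent=2))
```

[The next system in the chain can resolve "approvedBy" and "Person" to their full definitions rather than guessing what a field name means.]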

Brian Munz:

Again, this way hopefully, it illustrates that a part of it, of course, is simply in defining things, making things easier for a machine, but it’s also just a way that if everyone … If all of these applications and business systems can sort of agree on a format and terms, and the data being linked gives more and more context, then it makes things much easier for a machine to understand, and it also can eliminate problems that can greatly affect the world. For example, in this case, if this project is successful, there’s a lot less food waste and it improves the efficiency of the supply chain which is something, of course, that’s greatly needed.

Brian Munz:

Hopefully, that was helpful. Maybe I will go back and try to show what I was talking about. Sorry about that. Hopefully, we’re working now. One thing I was going to show, of course, was … If this doesn’t work. Okay. Again, interoperability. For example, if it’s with identifying personally identifiable information, as it pulls things out you want to understand whether it is a name. I’m going to try one more time. Name and address. All right. Bad luck for me I guess. Hopefully, at least you understood the concepts.

Brian Munz:

The main point I wanted to make was that the two things are both around making things easier for computers to understand, and identifying and disambiguating the language in order to have this global network of understanding of what these concepts are. That it’s not just a sort of pie-in-the-sky idea for nerds to think about, it’s actually a very useful technology and something where I think NLP and Linked Data make a good combination together. All right. So I will stop sharing my screen. Next week, it will not be me it will be … Let me see really quick. We’ll be enhancing the metadata from your files with the enrichment API, that’s going to be next Thursday, same time same place. Thanks for joining, and I will talk to you next week.