
NLP Stream: Enhance the Metadata from your Files with the Enrichment API

Information retrieval processes, such as search engines and recommendation systems, are crucial to streamlining data exploration and discovery for both internal and external users alike. However, their efficacy hinges on the quality of the metadata being used. Unfortunately, enriching metadata is a monumental challenge to take on manually. Semantic enrichment — the process of generating new metadata from your unstructured data — can change all that.

Watch Jose Manuel discuss the semantic enrichment process and a demonstration of how the European Open Science Cloud (EOSC) is using expert.ai technology to provide users with the data and service infrastructure to support scientific initiatives in Europe.

Transcript:

Brian Munz:

Hey everyone. And welcome to the NLP live stream. As usual, I am Brian Munz, I’m a product manager at expert.ai and this live stream is something we do every Thursday at 11 Eastern time. And we like to have on experts, people in the field of NLP to talk about topics that are relevant to NLP, and hopefully interesting in general, about the world of NLP.

Brian Munz:

So, make sure to join us every week.

Brian Munz:

And this week is no different. We have a very interesting topic. We have a returning superstar in Jose Manuel Gomez-Perez, who has spoken before on a variety of topics. And so without further ado, please take it away.

Jose Manuel Gomez-Perez:

Okay. Thank you, Brian. Thank you for the kind introduction. Let me share my screen.

Jose Manuel Gomez-Perez:

Yeah. Okay.

Brian Munz:

Okay.

Jose Manuel Gomez-Perez:

We should be seeing my screen right now.

Brian Munz:

Yeah.

Jose Manuel Gomez-Perez:

Fine. Very good. So, today we’re going to talk about extracting value from documents, which is something that we usually do in this kind of NLP stream. But we’re going to focus on a specific type of document: scientific, science-related documents.

Jose Manuel Gomez-Perez:

I would like to start with a reflection on how the research life cycle works. Typically what scientists or researchers in empirical sciences or observational sciences do is to start from some kind of background information, like literature, bibliography, and so on. Then a hypothesis is formulated. We have also a number of assumptions on input data, and then methods, particularly computational methods these days, where you use this data in order to try to prove your hypothesis. So, what comes next is the experiment or the observation. Out of your experiment or observation, you produce a number of results, which is data. And then, this is interpreted in a scientific way and published, communicated to the rest of the community through scholarly communication.

Jose Manuel Gomez-Perez:

The idea is that this whole thing is incremental, so that other researchers or scientists can use these results in order to continue doing research. And it’s very important that all these research data and research findings are FAIR, which means that they have to be findable, accessible, interoperable, and reusable. And what we are going to talk about today is related to this.

Jose Manuel Gomez-Perez:

AI has become imperative in science in recent years. If you look at the scientific enterprise a century ago or even more, it was always about the lone scientist. If you think of Galileo, Einstein, all these people were working individually, trying to produce some breakthrough. Then we have coauthors, more and more people working together. And eventually you have different communities of scientists working in conjunction to produce results related to very complex challenges in AI.

Jose Manuel Gomez-Perez:

And the thing is that science has become so complex that scientists require means to simplify the kinds of things that are necessary in order to produce the different results. And this is something that, for example, Yolanda Gil, in her presidential address at the AAAI conference, emphasized a couple of years ago. She knows a lot about this because she’s worked in AI and science for a very long time.

Jose Manuel Gomez-Perez:

Our focus is on language. I’m going to talk about the metadata chasm in scientific literature.

Jose Manuel Gomez-Perez:

Most of the knowledge that is contained in a scientific document, imagine a paper, a technical report, even a PowerPoint presentation, remains hidden inside. So, how do we surface such knowledge, especially in a way that machines can actually read this document and have an idea of what is contained there?

Jose Manuel Gomez-Perez:

To achieve this, scientific metadata is very, very important. However, this metadata is typically provided by the authors and tends to be focused on the container rather than the content: things like the names of the authors, the date, these kinds of things.

Jose Manuel Gomez-Perez:

Of course, this is easier and cheaper for publishers. And although this trend is changing, it’s still there.

Jose Manuel Gomez-Perez:

So, it’s very important to extract content metadata from the publications themselves, from the scientific documents themselves, in order to, for example, maximize reuse by other scientists, make it possible to discover, through automatic means like scientific search engines and recommendation systems, the knowledge contained in these documents, and also to enable interoperability between heterogeneous systems and data sets.

Jose Manuel Gomez-Perez:

Our focus here is to produce AI that assists scientists by enriching and understanding scientific information. And this is actually one of AI’s grand challenges, or particularly important or difficult challenges to accomplish.

Jose Manuel Gomez-Perez:

What I’m going to present today is work that we’ve been doing in the context of Project RELIANCE, which is funded by the European Commission under Horizon 2020. And the idea is to make scientific information more FAIR in the context of our science communities, particularly in this kind of domain.

Jose Manuel Gomez-Perez:

So, about enriching scientific documents. This is a typical paper from the sea observation community. This one is about particle transport in the Bari Canyon. If we want to extract information from it, basically we have a process, or a service, that we call document enrichment. What we do is, we take the text of the whole paper, the document enrichment service ingests this document, and then information is extracted from it in the form of what we call content metadata.

Jose Manuel Gomez-Perez:

And what is content metadata?

Jose Manuel Gomez-Perez:

Well, it’s a collection of information that, for example, refers to the topics mentioned in the document, the key elements contained in the document, entities, and other things like scientific claims, the scientific challenges and solutions referred to in the document, how innovative the document is, and the key questions somebody could ask that can be answered using this document as a base.
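To make the idea of content metadata concrete, here is a minimal sketch of what such a record might look like. The field names are illustrative only, not the actual expert.ai API schema:

```python
from dataclasses import dataclass, field

# Hypothetical shape of a content-metadata record produced by a
# document-enrichment service; the fields mirror the categories
# described in the talk (topics, key elements, entities, claims).
@dataclass
class ContentMetadata:
    topics: list = field(default_factory=list)        # taxonomy labels
    key_elements: dict = field(default_factory=dict)  # main sentences, lemmas, phrases
    entities: list = field(default_factory=list)      # typed, linked entities
    claims: list = field(default_factory=list)        # scientific claims found in the text

meta = ContentMetadata(
    topics=["marine biology", "acoustics"],
    key_elements={"main_lemmas": ["noise", "pollution", "whale"]},
    entities=[{"text": "European Union", "type": "ORG"}],
)
print(meta.topics[0])  # marine biology
```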

Jose Manuel Gomez-Perez:

Of course, in order to have an NLP system that is able to extract this kind of information from scientific documents, you need to adapt to the domain. In our case, what we did was to extract a document corpus and structured resources from the earth and environmental sciences literature that was given to us, for example, by the European Space Agency through our collaboration with them, or literature to which we have access through resources like Springer Nature SciGraph, which is a knowledge graph of scientific publications provided by Springer Nature, the Elsevier Scopus scientific database, or the OpenAIRE knowledge graph, which connects projects with data sets, with authors, and with publications, and makes it all available publicly.

Jose Manuel Gomez-Perez:

The thing is that this document corpus that we extracted from all these resources contains about 50,000 articles with 13 million tokens and more than 270,000 unique terms. So, that’s pretty big, and it meant extending our knowledge graph, the expert.ai knowledge graph, with 50,000 new lemmas, 172,000 expressions, and nearly 60,000 entities.

Jose Manuel Gomez-Perez:

So, we started the process of customizing the expert.ai knowledge graph with all this information. And we also used this document corpus to train a number of models for different tasks, like the ones I mentioned before.

Jose Manuel Gomez-Perez:

The interesting thing here is that these two things, the models and the knowledge graph, do not work in isolation. For the different tasks, they work in a combined way.

Jose Manuel Gomez-Perez:

This is a snapshot of document enrichment. I’m going to give you a quick demo of it.

Jose Manuel Gomez-Perez:

Okay. So, here, what we can do is we can select one of these documents, which are typically scientific papers from different communities. For example, this one is about evolution of marine noise pollution management. And we’ll click on analyze. Our system provides us with the metadata that is extracted from the document.

Jose Manuel Gomez-Perez:

So, in this case, we have three main types of metadata here: topics, key elements, and entities. The topics are provided by different resources and different taxonomies. We are using the knowledge graph itself here. Also, we are using a general-purpose taxonomy, IPTC, and specialized standard scientific taxonomies, like Fields of Research, which is provided by Springer Nature, and the NASA subject and scope taxonomy.

Jose Manuel Gomez-Perez:

If we click here, we see that this particular document is about biology. It’s about marine biology, and it’s also about trade, because it studies the effect of noise pollution in the sea, basically, and not only from an environmental point of view, but also from a commercial point of view. According to the IPTC media topics, this document is about marine science. It’s related to animals and also to budget and budgeting, because of this commercial interest. And Fields of Research says that this is about environmental science and management, which is totally right, and acoustics, because it also relates to how noise impacts whales, in this case.

Jose Manuel Gomez-Perez:

If we go to key elements, we have several kinds of metadata extracted from the text, for example, main sentences. This provides you with a summary of the most relevant sentences contained in the document. So, it’s kind of a summary of the whole paper in this case.

Jose Manuel Gomez-Perez:

The main lemmas are canonical representations of words. This means that, for example, if the word technologies appears in the text, the lemma corresponding to it would be technology, and it would be counted under that one lemma. Same with European Union: we have European Union and EU in the text, and it’s only counted once, as European Union.
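The normalization described above can be sketched in a few lines. The lemma map here is a toy stand-in for the knowledge graph’s lemmatization, which in the real system comes from full linguistic analysis:

```python
from collections import Counter

# Toy surface-form-to-lemma map; illustrative only.
LEMMA_MAP = {
    "technologies": "technology",
    "eu": "European Union",
    "european union": "European Union",
}

def count_lemmas(tokens):
    """Collapse each surface form to its canonical lemma, so variant
    spellings of the same word accumulate under one lemma entry."""
    counts = Counter()
    for tok in tokens:
        counts[LEMMA_MAP.get(tok.lower(), tok.lower())] += 1
    return counts

counts = count_lemmas(["technologies", "technology", "EU", "European Union"])
print(counts["European Union"])  # EU and European Union collapse into one lemma
```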

Jose Manuel Gomez-Perez:

The main phrases, which are multi-word expressions contained in the text, like Marine Strategy Framework Directive, and the main concepts. Here, we have things like synonyms. For example, we have effects and outcome as synonyms because they are linked in our knowledge graph.

Jose Manuel Gomez-Perez:

If we go to entities, we will see, for example, that organizations like the European Union are mentioned in the text and also locations like United States of America and Europe. It’s interesting to see how Europe is a location and European Union is an organization. The system is that smart.

Jose Manuel Gomez-Perez:

If we also want to see a person here, for example, let’s say Obama, and click on analyze, the system understands that I’m referring to Barack Obama and types it as a person. It’s also interesting to see that the entities are linked to external resources, like Wikidata and GeoNames. So, you can see here, by clicking the link, we open the page in Wikidata related to the United States of America. And the same with GeoNames: it shows us the entry corresponding to the United States.
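Entity linking like this boils down to attaching stable identifiers that resolve to external knowledge bases. A small sketch, where the record layout is hypothetical (Q30 and 6252001 are the well-known Wikidata and GeoNames identifiers for the United States, but this is not the actual API response format):

```python
def entity_links(entity):
    """Build resolvable URLs from an entity's Wikidata QID and GeoNames ID."""
    links = {}
    if "wikidata" in entity:
        links["wikidata"] = f"https://www.wikidata.org/wiki/{entity['wikidata']}"
    if "geonames" in entity:
        links["geonames"] = f"https://www.geonames.org/{entity['geonames']}"
    return links

usa = {"text": "United States of America", "type": "LOC",
       "wikidata": "Q30", "geonames": "6252001"}
print(entity_links(usa)["wikidata"])  # https://www.wikidata.org/wiki/Q30
```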

Jose Manuel Gomez-Perez:

Okay. So, this is a good demo of our document enrichment system. Let’s continue.

Jose Manuel Gomez-Perez:

Very important, from the point of view of achieving the FAIR objectives, making research findable, accessible, interoperable, and reusable, is the discussion being held in the scientific community about how we can go beyond the PDF when it comes to publishing scientific results. The idea is that not only a document can be published, but also data, software, or intermediate results.

Jose Manuel Gomez-Perez:

The idea of research objects is very much related to that.

Jose Manuel Gomez-Perez:

In the end, a research object is a semantic aggregation of scientific information that contains all the materials, methods, and results of a scientific investigation. This includes publications, data, results, things like scientific workflows, slides, all the metadata, or execution logs, for example. These are very interesting because they can all be shared with other scientists. And you don’t have to wait for the whole research to be finished; you can version different states of your research and share those with the research community, because you can also assign a DOI as an identifier to those individual research objects. And they can be cited.

Jose Manuel Gomez-Perez:

And that’s great because it also enables reuse. This is very useful for ensuring credit and attribution of scientific research, the lack of which is one of the main blockers for sharing scientific results, and for supporting reproducibility and long-term preservation.

Jose Manuel Gomez-Perez:

So, how does the research object enrichment service work?

Jose Manuel Gomez-Perez:

It’s basically a layer on top of our document enrichment. We extract all the resources in the research object that have some text inside, we analyze them with the document enrichment, and then we run an algorithm to aggregate all the metadata extracted from each individual resource into a research-object-level set of metadata.
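The aggregation step described above can be sketched as a simple merge of per-resource topic lists. This is only a sketch of the idea, with each resource voting once per topic; the production algorithm is not described in the talk and is presumably more involved:

```python
from collections import Counter

def aggregate_metadata(resource_metadata):
    """Merge per-resource topic lists into one research-object-level
    ranking, weighting each topic by how many resources mention it."""
    counts = Counter()
    for topics in resource_metadata:
        counts.update(set(topics))  # each resource votes once per topic
    return [topic for topic, _ in counts.most_common()]

# One topic list per resource in the research object (document, website, map).
ro_topics = aggregate_metadata([
    ["volcanology", "geology"],
    ["volcanology", "monitoring"],
    ["geology"],
])
print(ro_topics[0])  # volcanology
```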

Jose Manuel Gomez-Perez:

All this information, all these research objects, are obtained from a platform called RoHub, which is basically the reference platform these days for making these kinds of resources available.

Jose Manuel Gomez-Perez:

And the research object enrichment, like the document enrichment, is actually in use. It is integrated in RoHub. For example, in this case, this is a research object related to the Virunga volcano in Congo.

Jose Manuel Gomez-Perez:

If I click here, I will go directly to the microsite in RoHub that contains this research object. And see, this is a description of the research object. This is a figure showing us the eruption sites and so on. These are the contents of the research object, which contains a number of resources, in this case, a website, a document, and also a map. And this is activity and version information related to the research object.

Jose Manuel Gomez-Perez:

More interesting for us in this case is the metadata that has been extracted from this research object. We have topics like volcanology or geology. We have frequent expressions here, like monitoring infrastructure or strength of the volcano, these kinds of things, organizations like the Virunga Supersite, places like Germany, [inaudible 00:17:36], and concepts like volcanism or eruption, these kinds of things.

Jose Manuel Gomez-Perez:

Okay. One of the good things about being able to extract metadata from these kinds of resources, from publications or, in general, any kind of scientific document, and from research objects, which encapsulate all the information related to the research, is that you can very easily build applications to make this information easily findable, for example, through search or recommendation.

Jose Manuel Gomez-Perez:

In this slide, we are seeing an example of a recommendation system built on top of this metadata, in combination with content hosted by RoHub, which we call the collaboration spheres.

Jose Manuel Gomez-Perez:

And here, the idea is that we want to reduce the cognitive load that scientists sometimes experience when they have to dive into gigantic databases looking for documents, looking for interesting research or potential collaborators. Instead of having to type a query, “I want this and this and that,” as you would do in Google or any other search engine, what the system does is allow or enable search by example. You do exploratory research by selecting a number of research objects or scientists that you think are relevant for your work or for what you want to do. The system interprets the content of these research objects, or the research objects provided by those other scientists, and it proposes content which is similar to the contents of interest that you have selected.
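Search by example reduces, at its core, to ranking candidates by how much their extracted metadata overlaps with the example’s. A minimal sketch using Jaccard similarity over topic sets; the real service surely combines more signals than topics alone, and the titles and topics here are illustrative:

```python
def jaccard(a, b):
    """Overlap between two topic sets: 0 = disjoint, 1 = identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(example, candidates, k=2):
    """Rank candidate research objects by topic overlap with the
    example placed at the center of the spheres."""
    ranked = sorted(candidates,
                    key=lambda c: jaccard(example["topics"], c["topics"]),
                    reverse=True)
    return [c["title"] for c in ranked[:k]]

example = {"title": "Maps of hard structures in the Lagoon of Venice",
           "topics": {"map", "lagoon", "Mediterranean Sea"}}
candidates = [
    {"title": "Deep sea habitat suitability", "topics": {"Mediterranean Sea", "habitat"}},
    {"title": "Volcano monitoring", "topics": {"volcanology"}},
]
print(recommend(example, candidates, k=1))  # ['Deep sea habitat suitability']
```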

Jose Manuel Gomez-Perez:

Let’s do a quick demo.

Jose Manuel Gomez-Perez:

Okay. I have to log in. Here I am. Okay.

Jose Manuel Gomez-Perez:

These are the collaboration spheres. Let’s see. Let me grab one of these research objects here, maps of hard structures in the Lagoon of Venice. The main topics here are map, lagoon, the Mediterranean Sea. The areas of knowledge or topics are geography, architecture, infrastructure industry. This is metadata that has been extracted automatically from the content of the research object.

Jose Manuel Gomez-Perez:

If I take this to the center of the spheres, what I automatically get is a recommendation of other research objects which are similar or related to this one, based on the metadata that was previously extracted from them. For example, I see that this one is talking about deep sea habitat suitability, which is related to the research object that I’m using as an example to query the system.

Jose Manuel Gomez-Perez:

So, what if I combine them?

Jose Manuel Gomez-Perez:

I can drag this to the center of the spheres and I can continue my exploration. And I say, “Ah, the citizen science and jellyfish distribution, this is interesting.” If I click here, I should be able to go to RoHub and show the research object that I just accessed. And this research object, which is the result of my exploration, also contains metadata, a lot of metadata, that has been extracted from it.

Jose Manuel Gomez-Perez:

You see? Very good.

Jose Manuel Gomez-Perez:

Okay. All this has been made available in the European Open Science Cloud as three main services, the enrichment, the search, and the recommendation services.

Jose Manuel Gomez-Perez:

If you go to this website, you can see a lot of information. You can have direct access to the demos. You can also find examples of how to use the API, featured research objects, which are particularly nice to see because they are representative, and also a Jupyter notebook, which shows how you can use this API from a Python notebook and execute it live. So, you can click here and play with it. For any questions, there’s also this help desk email address that we are offering to the users of these services.
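For a feel of what such a notebook call might look like, here is a sketch that only assembles the request rather than sending it. The endpoint URL and payload fields are hypothetical; the project’s own notebook and API documentation define the real ones:

```python
import json

def build_enrichment_request(text, base_url="https://example.org/enrichment"):
    """Return the (url, body) pair for a document-enrichment call,
    without actually sending it over the network."""
    body = json.dumps({"document": {"text": text}})
    return base_url + "/analyze", body

url, body = build_enrichment_request("Particle transport in the Bari Canyon.")
print(url)  # https://example.org/enrichment/analyze
```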

Jose Manuel Gomez-Perez:

There is also a marketplace, as I mentioned. If we go to the marketplace, this is our microsite in the European Open Science Cloud. You can see here a description of what we produce and the services that we offer, which in this case, in the context of EOSC, are the enrichment API, the recommendation service, and the search service.

Jose Manuel Gomez-Perez:

Okay. And from here, you can just directly click and access the services.

Jose Manuel Gomez-Perez:

And that was all that I wanted to share with you today. If there are any questions, I’ll be happy to take them. Thank you.

Brian Munz:

Yeah, no, thanks. That was great. And it’s interesting because it’s a perfect example of one of the use cases of NLP when it comes to data. Right? Because of course in the past, it’s always been possible to have tags and things on articles. And then, within search, you can search amongst those tags and all that metadata exists.

Brian Munz:

However, you’re only as good as the people who are doing the tagging, and that’s a very manual process. So, that’s part of it.

Brian Munz:

And then, also, I think something with NLP that is very important is how it normalizes all of the data. You don’t have a person doing it. It normalizes all the tags and makes the data more understandable and usable. And that’s [crosstalk 00:24:20], especially in science.

Jose Manuel Gomez-Perez:

Yeah. Yeah. And also, if you just rely on the tags, you depend on, how can you encourage people to actually produce those tags? I mean, it’s not so easy.

Brian Munz:

And also, there’s so much research that a person may have found something that was not the focus of the research that’s going to be useful to somebody else. Right? So, if that’s extracted by NLP, you have someone searching along and then they find that this concept was covered. And while the researcher is focused on one thing, this person’s focusing on something else. In that way, it sort of flattens a lot of things and allows people to find new insights.

Jose Manuel Gomez-Perez:

Actually, we have very nice stories related to that, which is what we call cross-fertilization. In the project, we have people who specialize in volcanoes, other scientists working with sea observation, and others who are climatologists. And by applying these kinds of techniques, what they are finding is resources produced by a different community that they can use in their own work, which is something that is not so frequent, not so easy to achieve. I mean, we all work on our own things. And then we need somebody else to tell us, “Hey, have you seen what those guys are doing?”

Jose Manuel Gomez-Perez:

So, it’s good fun.

Brian Munz:

Yeah, yeah, it’s hard to keep up. You can’t keep up with the volume anymore, so it’s a way to sort of try to stay ahead of things in your particular area. So, thanks for this presentation. It was very interesting. Always good to see what you’ve got to show.

Brian Munz:

I wanted to point out again the site that you mentioned, if you want to see a demo of all of the things that were shown here. And then next week, we will be back as usual with, “Successful Data Discovery with Taxonomies.” And so make sure to join us then.

Brian Munz:

But until then, again, thanks Jose. And we will see everyone next week.

Jose Manuel Gomez-Perez:

A pleasure.

Brian Munz:

Bye.

 
