expert.ai’s Response to CORD-19, the White House’s COVID-19 Open Research Dataset

Expert.ai Team - 17 June 2020


The White House launched a call to action to Artificial Intelligence experts to develop new techniques for accessing and mining data that would help the science community respond to the COVID-19 crisis. Immediately after the call, the White House and a coalition of leading research groups published the first COVID-19 Open Research Dataset, CORD-19, a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2 and related coronaviruses.

Data scientists and Natural Language Processing (NLP) enthusiasts have submitted the text and data mining tools they developed in response to this call to action via the Kaggle platform, which is making these tools openly available to researchers around the world.

As researchers in the field know, keeping up with the growing body of literature about coronaviruses is a real challenge. However, Artificial Intelligence and Natural Language Understanding technologies can help automatically infer new insights in support of the ongoing fight against COVID-19.

One of the main technical issues in mining this data is the nature of the content itself. A recent statement from the Allen Institute for AI, which has partnered with leading research groups to prepare and distribute the CORD-19 dataset, noted that the dataset contained more than 128,000 articles, including over 59,000 with full text, as of May 19, 2020.

Why AI-based NLU?

US CTO Michael Kratsios referred to the new CORD-19 dataset as the “most extensive collection of machine readable coronavirus literature to date” and characterized the project as a “call to action” for the AI community, which can employ machine learning techniques to surface unique insights in the body of data.

Shortly after the release of CORD-19, expert.ai joined the effort by providing an enriched version of the dataset with further metadata generated through its word-sense disambiguation technology.

Our Artificial Intelligence technology (a mix of NLU and machine learning algorithms) has been applied to enrich each full-text paper with the biomedical entities named in the text and their relations, linking all the extracted information to domain-specific controlled vocabularies such as MeSH, SNOMED CT, UMLS-TUI and UMLS-CUI.

The expert.ai dataset CORD-19_ExpertSystem-MeSH is intended to be paired with the latest release of the COVID-19 Open Research Dataset (CORD-19) and is continuously updated as CORD-19 grows. Two notebooks have been created to help the community get started and make the best use of our data, which is released under a Creative Commons CC BY-SA license.
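
As a rough illustration of what pairing the two datasets can look like, the Python sketch below joins paper records with entity annotations on a shared paper identifier and looks up the papers that mention a given concept. The field names (`paper_id`, `vocabulary`, `concept_id`, `label`), the sample records and the concept IDs are all placeholders chosen for this example. They are not the actual schema of CORD-19 or of CORD-19_ExpertSystem-MeSH, so the notebooks remain the reference for the real layout.

```python
from collections import defaultdict

# Hypothetical paper records standing in for CORD-19 metadata (assumed fields).
papers = [
    {"paper_id": "p1", "title": "Spike protein structure in coronaviruses"},
    {"paper_id": "p2", "title": "Pneumonia outcomes in hospital patients"},
    {"paper_id": "p3", "title": "Transmission dynamics of coronaviruses"},
]

# Hypothetical enrichment records: one extracted entity per row, linked to a
# controlled vocabulary. The concept IDs are placeholders, not real codes.
annotations = [
    {"paper_id": "p1", "vocabulary": "MeSH", "concept_id": "MESH-0001", "label": "Coronavirus"},
    {"paper_id": "p2", "vocabulary": "MeSH", "concept_id": "MESH-0002", "label": "Pneumonia"},
    {"paper_id": "p3", "vocabulary": "MeSH", "concept_id": "MESH-0001", "label": "Coronavirus"},
    {"paper_id": "p1", "vocabulary": "SNOMED CT", "concept_id": "SCT-0009", "label": "Spike protein"},
]


def index_by_concept(rows):
    """Map each (vocabulary, concept_id) pair to the set of paper IDs mentioning it."""
    index = defaultdict(set)
    for row in rows:
        index[(row["vocabulary"], row["concept_id"])].add(row["paper_id"])
    return index


def papers_mentioning(paper_rows, index, vocabulary, concept_id):
    """Return the titles of papers annotated with the given concept, in input order."""
    wanted = index.get((vocabulary, concept_id), set())
    return [p["title"] for p in paper_rows if p["paper_id"] in wanted]


concept_index = index_by_concept(annotations)
print(papers_mentioning(papers, concept_index, "MeSH", "MESH-0001"))
# ['Spike protein structure in coronaviruses', 'Transmission dynamics of coronaviruses']
```

Indexing by vocabulary and concept ID rather than by surface label is the point of linking entities to controlled vocabularies: papers that use different wording for the same concept still end up under the same key.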

We are currently committed to keeping our CORD-19_ExpertSystem-MeSH dataset up to date until the end of the challenge on Kaggle (June 16), so that the community has access to the latest data.

For more information about our work


Andrea Belli
Director R&D – Technology, expert.ai