Entity extraction, also known as entity name extraction or named entity recognition (NER), is an information extraction technique that identifies key elements in text and then classifies them into predefined categories. This makes unstructured data machine-readable (or structured) and available for standard natural language processing (NLP) tasks such as retrieving information, extracting facts and answering questions. So how exactly does it work?
Why Entity Extraction Matters
Text appears as unstructured data in many formats, such as document files, spreadsheets, web pages and social media. Your ability to identify the entities within these documents, including people, places, organizations, concepts, numerical expressions (e.g., currency amounts and phone numbers) and temporal expressions (e.g., dates, times, durations and frequencies), enables you to understand the information they contain and put it to good use.
Whether you are an analyst with hundreds of documents to review or an investigative journalist with several terabytes of data to sort through (e.g., on the scale of WikiLeaks or the Panama Papers), you may not initially know what the information contains or what you should be looking for.
Entity extraction can provide a useful view of unknown data sets by immediately revealing, at a minimum, who and what the information focuses on. This enables analysts to view all entity types (e.g., names of people, companies, brands, cities, countries, and even phone numbers) in a structured form that they can use as a point of departure for further analysis and investigation.
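As a rough illustration of this structured view, the sketch below pulls a few numerical and temporal expressions out of raw text with pattern matching and groups them by entity type. The patterns, labels and sample documents are assumptions made for the example; real-world formats vary far more than these regular expressions allow.

```python
import re
from collections import defaultdict

# Illustrative patterns only; they cover one simple format per entity type.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_by_type(documents):
    """Return {entity_type: sorted unique matches} across all documents."""
    found = defaultdict(set)
    for text in documents:
        for label, pattern in PATTERNS.items():
            found[label].update(pattern.findall(text))
    return {label: sorted(values) for label, values in found.items()}

# Made-up sample documents.
docs = [
    "Invoice dated 2021-03-15 for $1,200.50. Call 555-010-0199 with questions.",
    "Payment of $300 received on 2021-04-02.",
]
entities_by_type = extract_by_type(docs)
for label, values in entities_by_type.items():
    print(label, values)
```

Grouping matches by type, rather than by document, is what gives an analyst a first overview of an unfamiliar data set.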
Entity Extraction at Work
Entity extraction technologies must address a number of language issues to correctly identify and classify entities. While it is easy for a human to distinguish between different types of names (e.g., person, place, organization or product), the ambiguities of language make this an especially complex task for machines.
One of the primary challenges for machines is part-of-speech tagging. This is the process of labeling each word in a sentence with its part of speech (e.g., noun, verb, adjective or adverb) based on the word’s definition and context. With this information, machines can identify noun phrases which, in turn, help to identify the primary entities. The key to success, though, is context.
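To make the tagging step concrete, here is a minimal sketch that tags tokens from a tiny hand-written lexicon and then groups determiners, adjectives and nouns into noun phrases. The lexicon, the tag names and the capitalization heuristic are all assumptions for illustration; a production tagger relies on far richer linguistic knowledge and on context.

```python
# Hypothetical, hand-written lexicon; a real tagger covers far more words.
LEXICON = {
    "the": "DET", "a": "DET",
    "new": "ADJ", "large": "ADJ",
    "bank": "NOUN", "office": "NOUN", "city": "NOUN",
    "opened": "VERB", "in": "ADP",
}
NOMINAL_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}

def tag(tokens):
    """Tag each token; unknown capitalized words are treated as proper nouns."""
    tagged = []
    for token in tokens:
        pos = LEXICON.get(token.lower())
        if pos is None:
            pos = "PROPN" if token[0].isupper() else "X"
        tagged.append((token, pos))
    return tagged

def noun_phrases(tagged):
    """Collect maximal runs of nominal tags that contain at least one noun."""
    phrases, current = [], []
    # The sentinel at the end flushes a phrase that closes the sentence.
    for token, pos in tagged + [("", "END")]:
        if pos in NOMINAL_TAGS:
            current.append((token, pos))
        else:
            if any(p in {"NOUN", "PROPN"} for _, p in current):
                phrases.append(" ".join(t for t, _ in current))
            current = []
    return phrases

tokens = "The bank opened a new office in Lisbon".split()
tagged = tag(tokens)
phrases = noun_phrases(tagged)
print(tagged)
print(phrases)
```

The noun phrases that come out of this step are candidate entities; deciding what kind of entity each one is still requires context.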
An NER system based on keywords cannot properly differentiate between all the possible meanings of a word, nor how it is used. For example, “orange” could represent the color, fruit, county or school mascot, but a keyword search has no way to distinguish between them.
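The difference can be sketched in a few lines. Below, a keyword check reports only that "orange" is present, while a simple context-overlap heuristic picks a sense from the surrounding words. The sense inventory and cue words are hypothetical and chosen only to illustrate the idea; this is not a description of how any particular product disambiguates.

```python
# Hypothetical senses and cue words, for illustration only.
SENSES = {
    "color": {"paint", "painted", "bright", "shade", "wall"},
    "fruit": {"juice", "peel", "ate", "ripe", "squeezed"},
    "county": {"county", "california", "residents", "sheriff"},
}

def _words(text):
    """Lowercase the text and strip basic punctuation before splitting."""
    return text.lower().replace(".", "").replace(",", "").split()

def keyword_match(text, keyword="orange"):
    """A keyword search reports presence, not which meaning is intended."""
    return keyword in _words(text)

def disambiguate(text):
    """Pick the sense whose cue words overlap most with the context."""
    words = set(_words(text))
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

sentences = [
    "She squeezed a ripe orange for juice.",
    "Orange County residents voted.",
    "They painted the wall orange.",
]
for sentence in sentences:
    print(keyword_match(sentence), disambiguate(sentence), "|", sentence)
```

Every sentence matches the keyword, yet each one resolves to a different sense once the neighboring words are taken into account.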
Extraction rules are what fuel the extraction of entities in text. They can be based on pattern matching, linguistics, syntax, semantics or a combination of these approaches. Entity extraction based on semantic technologies uses logic to disambiguate meaning and understand context, enabling many useful downstream operations that are valuable to a variety of business functions across a number of industries. These include:
- Entity Relation Extraction: This function reveals direct relationships, connections or events shared between different entities as well as complex relationships through inferred, indirect connections. This helps to summarize information quickly and efficiently.
- Linking: This function establishes links between knowledge banks. For example, it could identify all of the places mentioned in a corpus and link them to the corresponding location on a map, or it could cross-reference entities with other information sources.
- Fact Extraction: This function extracts all of the data associated with an entity to deliver a specific response to a query from a corpus. This goes beyond responding to a query with a list of documents containing the “answers” that you must then search through yourself.
These three processing actions provide a strong baseline of capabilities that can transform business processes across an organization. However, their efficacy depends on knowledge, which only a semantic approach can provide.
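The three operations listed above can be sketched together in a toy example: relations are pulled out with two hand-written patterns, entities are linked to entries in a small knowledge base, and a query about one entity returns the facts that mention it. The knowledge base, its identifiers, the patterns and the sample text are all hypothetical placeholders, and a semantic engine would go well beyond surface patterns like these.

```python
import re

# Hypothetical knowledge base; identifiers are illustrative placeholders.
KNOWLEDGE_BASE = {
    "Acme Corp": {"id": "kb:org-001", "type": "ORG"},
    "Globex": {"id": "kb:org-002", "type": "ORG"},
    "Springfield": {"id": "kb:loc-001", "type": "LOC"},
}

# A name is approximated as one or more capitalized words.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
RELATION_PATTERNS = [
    (re.compile(rf"(?P<subj>{NAME}) acquired (?P<obj>{NAME})"), "acquired"),
    (re.compile(rf"(?P<subj>{NAME}) is based in (?P<obj>{NAME})"), "based_in"),
]

def extract_relations(text):
    """Relation extraction: return (subject, relation, object) triples."""
    triples = []
    for pattern, relation in RELATION_PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("subj"), relation, match.group("obj")))
    return triples

def link(entity):
    """Linking: map an entity mention to a knowledge base identifier."""
    entry = KNOWLEDGE_BASE.get(entity)
    return entry["id"] if entry else None

def facts_about(entity, triples):
    """Fact extraction: return the triples that mention the entity."""
    return [t for t in triples if entity in (t[0], t[2])]

text = "Acme Corp acquired Globex in 2019. Globex is based in Springfield."
triples = extract_relations(text)
print(triples)
print(link("Springfield"))
print(facts_about("Globex", triples))
```

The query for "Globex" returns the facts themselves rather than a list of documents to read, which is the distinction the fact extraction item draws.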
Knowledge-Based Entity Extraction
Though there are many pre-trained extraction models available, only those models that employ symbolic AI can contextualize content well enough to deliver superior extraction accuracy. At expert.ai, we deliver context via our custom knowledge graph which provides a domain-independent representation of the real world through concepts, their related meanings and the different relationships between them.
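For readers unfamiliar with the data structure, a knowledge graph can be pictured as a set of concepts connected by typed relationships. The sketch below is a generic, hypothetical miniature stored as triples; it is not the expert.ai knowledge graph or its API, and the concept and relation names are invented for the example.

```python
# Hypothetical mini knowledge graph as (subject, relation, object) triples.
TRIPLES = [
    ("orange_fruit", "is_a", "citrus"),
    ("citrus", "is_a", "fruit"),
    ("fruit", "is_a", "food"),
    ("orange_color", "is_a", "color"),
    ("orange_fruit", "has_label", "orange"),
    ("orange_color", "has_label", "orange"),
]

def concepts_for(label):
    """Return every concept that can be expressed by the given word."""
    return sorted(s for s, r, o in TRIPLES if r == "has_label" and o == label)

def ancestors(concept):
    """Follow is_a edges upward to collect every broader concept."""
    result = []
    frontier = [concept]
    while frontier:
        node = frontier.pop()
        for s, r, o in TRIPLES:
            if s == node and r == "is_a" and o not in result:
                result.append(o)
                frontier.append(o)
    return result

print(concepts_for("orange"))
print(ancestors("orange_fruit"))
print(ancestors("orange_color"))
```

Because one word maps to several concepts, each with its own place in the graph, a system that resolves the right concept can also report what that concept is related to.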
With knowledge-based entity extraction, you can learn much more about the entities within your documents than you can with a machine learning-based NER model. This enables you to drive deeper insights and make smarter business decisions. What more can you ask for?