Entity extraction, also known as entity name extraction or named entity recognition, is an information extraction technique that identifies key elements in text and classifies them into predefined categories. This makes unstructured data machine-readable (or structured) and available for standard processing actions such as retrieving information, extracting facts and answering questions. So how exactly does it work?
Why Entity Extraction Matters
Text arrives as unstructured data in many formats, such as document files, spreadsheets, web pages and social media. Your ability to identify the entities within these documents, including people, places, organizations, concepts, numerical expressions (e.g., currency amounts and phone numbers) and temporal expressions (e.g., dates, times, durations and frequencies), enables you to understand the information they contain and put it to good use.
An analyst with hundreds of documents to review, or an investigative journalist with several terabytes of data to sort through (e.g., on the scale of WikiLeaks or the Panama Papers), may not initially know what the information contains or what to look for.
Entity extraction can provide a useful view of unknown data sets by immediately revealing, at a minimum, who and what the information focuses on. Analysts can then view all the names of people, companies, brands, cities and countries, and even phone numbers, in a corpus that is now structured, and use them as a point of departure for further analysis and investigation.
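To make that "who and what" view concrete, here is a minimal Python sketch that builds an index of entities across a small corpus. Everything in it is an assumption made for illustration: the gazetteer entries, the sample memos, the simplified phone number pattern and the function names are all hypothetical, and a real extraction system would rely on far richer knowledge than a hand-made lookup table.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer: a hand-made lookup of known names and their types.
GAZETTEER = {
    "Alice Martin": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Panama City": "LOCATION",
    "Geneva": "LOCATION",
}

# Simplified phone number pattern (an assumption, not a general-purpose rule).
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_entities(text):
    """Return (entity_text, entity_type) pairs found in one document."""
    found = []
    for name, entity_type in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, entity_type))
    for match in PHONE_RE.finditer(text):
        found.append((match.group(), "PHONE"))
    return found


def build_entity_index(documents):
    """Map entity type -> entity text -> ids of the documents mentioning it."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in documents.items():
        for entity, entity_type in extract_entities(text):
            index[entity_type][entity].add(doc_id)
    return index


# Made-up sample documents.
documents = {
    "memo-1": "Alice Martin wired funds to Acme Corp from Geneva.",
    "memo-2": "Call Acme Corp at 555-867-5309 about the Panama City office.",
}

index = build_entity_index(documents)
for entity_type in sorted(index):
    for entity in sorted(index[entity_type]):
        print(entity_type, entity, sorted(index[entity_type][entity]))
```

Grouping by type first and by entity second mirrors how an analyst might browse an unfamiliar corpus: start with the list of organizations or people, then jump to the documents that mention them.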
Entity Extraction at Work
Entity extraction technologies must address a number of language issues to correctly identify and classify entities. While it’s easy for a human to distinguish between different types of names (e.g., person, place, organization or product), the ambiguities of language make this an especially complex task for machines.
A system based on keywords cannot differentiate between the possible meanings of a word or the ways it is used. For example, “orange” could represent the color, fruit, county or company, but a keyword search has no way to distinguish between them.
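The "orange" problem can be sketched in a few lines of Python. The cue word lists, sample sentences and function names below are hypothetical, and counting overlapping cue words is only a toy stand-in for the contextual analysis a semantic engine would perform; the point is simply that the keyword test returns the same answer for every sentence, while even a crude look at context separates the senses.

```python
import re

# Hypothetical context cues for each sense of "orange"; a real system would
# draw on much richer linguistic and semantic knowledge.
SENSE_CUES = {
    "COLOR": {"paint", "painted", "shade", "bright", "color"},
    "FRUIT": {"juice", "peel", "eat", "ate", "ripe", "fruit"},
    "LOCATION": {"county", "california", "residents"},
    "ORGANIZATION": {"telecom", "mobile", "subscribers", "shares", "company"},
}


def keyword_match(text, keyword="orange"):
    """A plain keyword search: it only reports whether the word occurs."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def disambiguate_orange(text):
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best_sense = max(scores, key=scores.get)
    # With no supporting cue at all, decline to guess.
    return best_sense if scores[best_sense] > 0 else "UNKNOWN"


sentences = [
    "She painted the wall a bright orange.",
    "He drank fresh orange juice.",
    "Orange County residents voted today.",
    "Orange added mobile subscribers this quarter.",
]

for sentence in sentences:
    print(keyword_match(sentence), disambiguate_orange(sentence), "|", sentence)
```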
Extraction rules drive the extraction of entities from text and can be based on pattern matching, linguistics, syntax, semantics or a combination of these approaches. Entity extraction based on semantic technologies disambiguates meaning and understands context, enabling many useful downstream operations that are valuable to a variety of business functions across a number of industries. These include:
- Entity Relation Extraction: Reveals direct relationships, connections or events shared between different entities as well as complex relationships through inferred, indirect connections.
- Linking: Establishes links between knowledge banks. For example, it could identify all of the places mentioned in a corpus and link them to the corresponding location on a map, or cross-reference entities with other information sources.
- Fact Extraction: Extracts all of the data associated with an entity to deliver a specific response to a query from a corpus. This goes beyond responding to a query with a list of documents containing the “answers”.
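As a rough illustration of how these three operations build on extracted entities, the Python sketch below applies a single hand-written pattern rule to a two-sentence corpus. The names, the approximate coordinates, the "works for ... in ..." rule and the function names are all hypothetical, and treating "in <place>" as where the person is based is a simplifying assumption; a semantic approach would derive such relationships from context instead of one fixed pattern.

```python
import re

# Hypothetical knowledge base used for linking; coordinates are approximate.
PLACES = {
    "Paris": (48.86, 2.35),
    "Geneva": (46.20, 6.14),
}
PEOPLE = {"Alice Martin", "Bob Chen"}
ORGANIZATIONS = {"Acme Corp"}

# One hand-written rule: "<person> works for <organization> in <place>".
WORKS_FOR_RE = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) works for "
    r"(?P<org>[A-Z][a-z]+ [A-Z][a-z]+) in (?P<place>[A-Z][a-z]+)"
)


def extract_relations(text):
    """Entity relation extraction: return (subject, relation, object) triples."""
    triples = []
    for m in WORKS_FOR_RE.finditer(text):
        person, org, place = m.group("person"), m.group("org"), m.group("place")
        if person in PEOPLE and org in ORGANIZATIONS:
            triples.append((person, "works_for", org))
            if place in PLACES:
                # Simplifying assumption: "in <place>" says where the person is based.
                triples.append((person, "based_in", place))
    return triples


def link_place(name):
    """Linking: resolve a place name to map coordinates, if known."""
    return PLACES.get(name)


def facts_about(entity, triples):
    """Fact extraction: gather what the corpus says about one entity."""
    return {relation: obj for subj, relation, obj in triples if subj == entity}


corpus = (
    "Alice Martin works for Acme Corp in Geneva. "
    "Bob Chen works for Acme Corp in Paris."
)
triples = extract_relations(corpus)
print(triples)
print(link_place("Geneva"))
print(facts_about("Bob Chen", triples))
```

Here a question such as "what do we know about Bob Chen?" is answered with the facts themselves, not with a list of documents that might contain them.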
These three processing actions provide a strong baseline of capabilities that can transform business processes across an organization. However, their efficacy depends on knowledge, which only a semantic approach can provide. Let entity extraction work for you. We can show you how.