Glossary of AI Terms
Artificial intelligence (AI) and natural language (NL) technologies are critical to enterprise business but, for many, are difficult to assess because of their complexity and nuance. No one, however, should be excluded from such an important conversation. For this reason, we have compiled a glossary of AI- and NL-specific terms to help simplify the conversation.
The following list of terms covers words and phrases that are essential to building and expanding your knowledge of natural language and artificial intelligence technologies. With them, you can confidently navigate your journey toward adopting and implementing natural language processing and natural language understanding solutions at your enterprise organization.
Accuracy is a scoring system in binary classification (i.e., determining if an answer or output is correct or not) and is calculated as (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives).
Want to learn more about accuracy? Read this article on our Community.
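The accuracy formula above can be sketched in a few lines of Python. The function name and the counts in the usage line are hypothetical and serve only to illustrate the arithmetic.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return (TP + TN) / (TP + TN + FP + FN) for a binary classifier."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts: 40 true positives, 50 true negatives,
# 6 false positives and 4 false negatives (100 predictions in total).
print(accuracy(tp=40, tn=50, fp=6, fn=4))  # 0.9
```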
Actionable intelligence is information you can leverage to support decision making.
In linguistics, an anaphora is a reference to a noun by way of a pronoun. For example, in the sentence, “While John didn’t like the appetizers, he enjoyed the entrée,” the word “he” is an anaphora.
Auto-complete is a search functionality used to suggest possible queries based on the text being used to compile a search query.
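As a rough illustration of the idea, auto-complete can be approximated by prefix matching against a sorted list of known queries. This is only a sketch under that assumption: production systems typically also rank suggestions by popularity or context, and the query list below is made up.

```python
from bisect import bisect_left


def suggest(prefix: str, sorted_queries: list[str], limit: int = 5) -> list[str]:
    """Return up to `limit` known queries that start with `prefix`.

    `sorted_queries` must be lowercase and sorted so that all queries
    sharing a prefix sit next to each other.
    """
    prefix = prefix.lower()
    start = bisect_left(sorted_queries, prefix)  # first candidate position
    suggestions = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix) or len(suggestions) == limit:
            break
        suggestions.append(query)
    return suggestions


# Hypothetical list of previously seen queries.
KNOWN_QUERIES = sorted([
    "natural language processing",
    "natural language understanding",
    "named entity",
    "neural network",
])

print(suggest("natu", KNOWN_QUERIES))
# ['natural language processing', 'natural language understanding']
```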
In linguistics, a cataphora is a reference placed before any instance of the noun it refers to. For example, in the sentence, “Though he enjoyed the entrée, John didn’t like the appetizers,” the word “he” is a cataphora.
Categorization is a natural language processing function that assigns a category to a document.
Want to learn more about categorization? Read our blog post “How to Remove Pigeonholing from Your Classification Process”.
A category is a label assigned to a document in order to describe the content within said document.
A co-occurrence commonly refers to the presence of different elements in the same document. It is often used in business intelligence to heuristically recognize patterns and infer associations between concepts that are not naturally connected (e.g., an investor whose name is often mentioned in articles about startups successfully closing funding rounds could be interpreted as being particularly good at picking investments).
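A minimal sketch of co-occurrence counting, assuming each document has already been reduced to a list of recognized terms. The term names below are hypothetical placeholders.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents: list[list[str]]) -> Counter:
    """Count in how many documents each pair of terms appears together."""
    counts = Counter()
    for terms in documents:
        # Deduplicate and sort so each pair is counted once per document
        # and always stored in the same order.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Hypothetical terms extracted from three articles.
docs = [
    ["investor_a", "startup_x", "funding"],
    ["investor_a", "startup_y", "funding"],
    ["investor_b", "startup_z"],
]
counts = cooccurrence_counts(docs)
print(counts[("funding", "investor_a")])  # 2
```

A high count only signals that two terms tend to appear together. Whether that reflects a real association is a judgment the application still has to make.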
Composite AI is the combined application of different AI techniques to improve the efficiency of learning in order to broaden the level of knowledge representations and, ultimately, to solve a wider range of business problems in a more efficient manner.
Computational linguistics is an interdisciplinary field concerned with the computational modeling of natural language.
Find out more about computational linguistics in our blog post “Why you need text analytics”.
Computational semantics is the study of how to automate the construction and reasoning of meaning representations of natural language expressions.
Learn more about computational semantics in our blog post “Word Meaning and Sentence Meaning in Semantics”.
A controlled vocabulary is a curated collection of words and phrases that are relevant to an application or a specific industry. These elements can come with additional properties that indicate both how they behave in common language and what meaning they carry, in terms of topic and more.
While the value of a controlled vocabulary is similar to that of a taxonomy, they differ in that the nodes in a taxonomy are only labels representing a category, while the nodes in a controlled vocabulary represent the words and phrases that must be found in a text.
A corpus is a balanced collection of documents that should be representative of the documents an NLP solution will face in production, both in terms of content as well as distribution of topics and concepts.
Data discovery is the process of uncovering data insights and getting those insights to the users who need them, when they need them.
Data scarcity is the lack of data that could satisfy a system's need to increase the accuracy of predictive analytics.
Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. In other words, deep learning models can learn to classify concepts from images, text or sound.
Learn more about deep learning in our blog post “Word Meaning and Sentence Meaning in Semantics”.
“Did You Mean” is an NLP function used in search applications to identify typos in a query or suggest similar queries that could produce results in the search database being used.
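One simple way to approximate a “Did You Mean” suggestion is fuzzy string matching against the terms already present in the search index. The sketch below uses `difflib.get_close_matches` from the Python standard library. The vocabulary and the 0.6 cutoff are assumptions chosen for illustration, and real search engines usually combine such matching with query logs and language models.

```python
from difflib import get_close_matches
from typing import Optional


def did_you_mean(query: str, vocabulary: list[str], cutoff: float = 0.6) -> Optional[str]:
    """Suggest the closest known term, or None if the query is already
    known or nothing is similar enough."""
    if query in vocabulary:
        return None  # nothing to correct
    matches = get_close_matches(query, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical terms known to produce results in the search database.
VOCABULARY = ["linguistics", "logistics", "semantics", "taxonomy"]

print(did_you_mean("lingustics", VOCABULARY))  # linguistics
```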
Disambiguation, or word-sense disambiguation, is the process of removing confusion around terms that express more than one meaning and can lead to different interpretations of the same string of text.
Want to learn more? Read our blog post “Disambiguation: The Cornerstone of NLU”.
An entity is any noun, word or phrase in a document that refers to a concept, person or object, abstract or otherwise (e.g., car, Microsoft, New York City). Measurable elements are also included in this group (e.g., 200 pounds, 14 fl. oz.).
Entity extraction is an NLP function that serves to identify relevant entities in a document.
An F-score is the harmonic mean of a system’s precision and recall values. It can be calculated by the following formula: 2 x [(Precision x Recall) / (Precision + Recall)]. Criticism around the use of F-score values to determine the quality of a predictive system is based on the fact that a moderately high F-score can be the result of an imbalance between precision and recall and, therefore, not tell the whole story. On the other hand, systems at a high level of accuracy struggle to improve precision or recall without negatively impacting the other.
Critical (risk) applications that value recall more than precision (i.e., producing a large number of false positives but virtually guaranteeing that all the true positives are found) can adopt a different scoring system called the F2 measure, where recall is weighted more heavily. The opposite (precision is weighted more heavily) is achieved by using the F0.5 measure.
Read this article on our Community to learn more about F-score.
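The F-score, F2 and F0.5 measures described above are all instances of the general F-beta formula, (1 + β²) × Precision × Recall / (β² × Precision + Recall), which reduces to the formula given earlier when β = 1. A minimal sketch follows, with hypothetical precision and recall values.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing was correct
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical system with perfect precision but only 50% recall.
p, r = 1.0, 0.5
print(round(f_beta(p, r), 3))            # 0.667 (F1, the standard F-score)
print(round(f_beta(p, r, beta=2), 3))    # 0.556 (F2 penalizes the low recall)
print(round(f_beta(p, r, beta=0.5), 3))  # 0.833 (F0.5 rewards the high precision)
```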
Hybrid AI is any artificial intelligence technology that combines multiple AI methodologies. In NLP, this often means that a workflow will leverage both symbolic and machine learning techniques.
Want to learn more about hybrid AI? Read this blog post “What Is Hybrid Natural Language Understanding?”.
A knowledge graph is a graph of concepts whose value resides in its ability to meaningfully represent a portion of reality, specialized or otherwise. Every concept is linked to at least one other concept, and the quality of this connection can belong to different classes (see: taxonomies).
The interpretation of every concept is represented by its links. Consequently, every node is the concept it represents only based on its position in the graph (e.g., the concept of an apple, the fruit, is a node whose parents are “apple tree”, “fruit”, etc.). Advanced knowledge graphs can have many properties attached to a node including the words used in language to represent a concept (e.g., “apple” for the concept of an apple), if it carries a particular sentiment in a culture (“bad”, “beautiful”) and how it behaves in a sentence.
Learn more about knowledge graphs by reading the post “Knowledge Graph: The Brains Behind Symbolic AI” on our blog.
A supervised learning algorithm that uses an ensemble learning method for regression. An ensemble learning method is a technique that combines predictions from multiple machine learning algorithms to make a more accurate prediction than a single model.
Linked data is an expression that indicates whether a recognizable store of knowledge is connected to another one. This is typically used as a standard reference. For instance, a knowledge graph may link every concept/node to its respective page on Wikipedia.
Machine learning is the study of computer algorithms that can improve automatically through experience and the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data,” in order to make predictions or decisions without being explicitly programmed to do so. In NLP, ML-based solutions can quickly cover the entire scope of a problem (or, at least of a corpus used as sample data), but are demanding in terms of the work required to achieve production-grade accuracy.
Want to learn more about machine learning? Read the post “What Is Machine Learning? A Definition” on our blog.
Metadata is data that describes or provides information about other data.
A machine learning model is the artifact produced after an ML algorithm has processed the sample data it was fed during the training phase. The model is then used by the algorithm in production to analyze text (in the case of NLP) and return information and/or predictions.
A subfield of artificial intelligence and linguistics, natural language processing is focused on the interactions between computers and human language. More specifically, it focuses on the ability of computers to read and analyze large volumes of unstructured language data (e.g., text).
Read our blog post “6 Real-World Examples of Natural Language Processing” to learn more about Natural Language Processing (NLP).
A subset of natural language processing, natural language understanding is focused on the actual computer comprehension of processed and analyzed unstructured language data. This is enabled via semantics.
Learn more about Natural Language Understanding (NLU) by reading our blog post “What Is Natural Language Understanding?”.
An ontology is similar to a taxonomy, but it enhances its simple tree-like classification structure by adding properties to each node/element and connections between nodes that can extend to other branches. These properties are not standard, nor are they limited to a predefined set. Therefore, they must be agreed upon by the classifier and the user.
Read our blog post “Understanding Ontology and How It Adds Value to NLU” to learn more about ontologies.
A Part-of-Speech (POS) tagger is an NLP function that identifies grammatical information about the elements of a sentence. Basic POS tagging can be limited to labeling every word by grammar type, while more complex implementations can group phrases and other elements in a clause, recognize different types of clauses, build a dependency tree of a sentence, and even assign a logical function to every word (e.g., subject, predicate, temporal adjunct, etc.).
Find out more about Part-of-Speech (POS) tagging in this article on our Community.
Given a set of results from a processed document, precision is the percentage value that indicates how many of those results are correct based on the expectations of a certain application. It can apply to any class of a predictive AI system such as search, categorization and entity recognition.
For example, say you have an application that is supposed to find all the dog breeds in a document. If the application analyzes a document that mentions 10 dog breeds but only returns five values (all of which are correct), the system will have performed at 100% precision. Even if half of the instances of dog breeds were missed, the ones that were returned were correct.
Want to learn more about precision? Read this article on our Community.
Given a set of results from a processed document, recall is the percentage value that indicates how many correct results have been retrieved based on the expectations of the application. It can apply to any class of a predictive AI system such as search, categorization and entity recognition.
For example, say you have an application that is supposed to find all the dog breeds in a document. If the application analyzes a document that mentions 10 dog breeds but only returns five values (all of which are correct), the system will have performed at 50% recall.
Find out more about recall in this article on our Community.
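The dog-breed example used for precision and recall can be reproduced with two set operations. This is a sketch only: the breed names are placeholders, and it assumes results are compared as exact strings.

```python
def precision_recall(returned: list[str], expected: list[str]) -> tuple[float, float]:
    """Compute precision and recall of `returned` against the `expected` results."""
    returned_set, expected_set = set(returned), set(expected)
    correct = returned_set & expected_set
    precision = len(correct) / len(returned_set) if returned_set else 0.0
    recall = len(correct) / len(expected_set) if expected_set else 0.0
    return precision, recall


# Hypothetical document mentioning 10 breeds; the application finds only 5,
# all of which are correct.
expected_breeds = [
    "beagle", "boxer", "collie", "corgi", "dalmatian",
    "greyhound", "poodle", "pug", "samoyed", "whippet",
]
returned_breeds = expected_breeds[:5]

print(precision_recall(returned_breeds, expected_breeds))  # (1.0, 0.5)
```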
The identification of relationships is an advanced NLP function that presents information on how elements of a statement are related to each other. For example, “John is Mary’s father” will report that John and Mary are connected, and this datapoint will carry a link property that labels the connection as “family” or “parent-child.”
Subject-Action-Object (SAO) is an NLP function that identifies the logical function of portions of sentences in terms of the elements that are acting as the subject of an action, the action itself, the object receiving the action (if one exists), and any adjuncts if present.
Read this article on our Community to learn more about Subject-Action-Object (SAO).
Semantics is the study of the meaning of words and sentences. It concerns the relation of linguistic forms to non-linguistic concepts and mental representations to explain how sentences are understood by the speakers of a language.
Learn more about semantics in our blog post “Introduction to Semantics”.
Sentiment is the general disposition expressed in a text.
Read our blog post “Natural Language Processing and Sentiment Analysis” to learn more about sentiment.
Sentiment analysis is an NLP function that identifies the sentiment in text. This can be applied to anything from a business document to a social media post. Sentiment is typically measured on a linear scale (negative, neutral or positive), but advanced implementations can categorize text in terms of emotions, moods, and feelings.
Similarity is an NLP function that retrieves documents similar to a given document. It usually offers a score to indicate the closeness of each document to that used in a query. However, there are no standard ways to measure similarity. Thus, this measurement is often specific to an application versus generic or industry-wide use cases.
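Because there is no standard similarity measure, the sketch below picks one simple option purely for illustration: Jaccard similarity over lowercase word sets. It is an assumption of this example, not a recommended or industry-standard method, and the sample texts are made up.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def most_similar(query: str, documents: list[str]) -> list[tuple[float, str]]:
    """Rank documents by their similarity score to the query document."""
    scored = [(jaccard_similarity(query, doc), doc) for doc in documents]
    return sorted(scored, reverse=True)


# Hypothetical query document and collection.
ranking = most_similar("the cat sat", ["the cat ran", "stock markets fell"])
print(ranking)  # [(0.5, 'the cat ran'), (0.0, 'stock markets fell')]
```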
Simple Knowledge Organization System (SKOS) is a common data model for knowledge organization systems such as thesauri, classification schemes, subject heading systems, and taxonomies.
Structured data is data that conforms to a specific data model, has a well-defined structure, follows a consistent order and can be easily accessed and used by a person or a computer program. Structured data is usually stored in rigid schemas such as databases.
A symbolic methodology is an approach to developing AI systems for NLP based on a deterministic, conditional approach. In other words, a symbolic approach designs a system using very specific, narrow instructions that guarantee the recognition of a linguistic pattern. Rule-based solutions tend to have a high degree of precision, though they may require more work than ML-based solutions to cover the entire scope of a problem, depending on the application.
Want to learn more about symbolic methodology? Read our blog post “The Case for Symbolic AI in NLP Models”.
A taxonomy is a predetermined group of classes of a subset of knowledge (e.g., animals, drugs, etc.). It includes dependencies between elements in a “part of” or “type of” relationship, giving itself a multi-level, tree-like structure made of branches (the final node or element of every branch is known as a leaf). This creates order and hierarchy among knowledge subsets.
Companies use taxonomies to more concisely organize their documents which, in turn, enables internal or external users to more easily search for and locate the documents they need. They can be specific to a single company or become de-facto languages shared by companies across specific industries.
Find out more about taxonomies by reading our blog post “What Are Taxonomies and How Should You Use Them?”.
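The tree-like structure of a taxonomy can be sketched as a mapping from each class to its parent. The animal classes below are hypothetical, and the sketch assumes a single “type of” parent per node.

```python
from typing import Optional

# Each class points to its parent; the root has no parent.
PARENT: dict[str, Optional[str]] = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
}


def ancestors(node: str) -> list[str]:
    """Walk up the branch from `node` to the root, nearest parent first."""
    path = []
    current = PARENT[node]
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def leaves() -> list[str]:
    """Leaves are the classes that are nobody's parent."""
    parents = set(PARENT.values())
    return sorted(node for node in PARENT if node not in parents)


print(ancestors("dog"))  # ['mammal', 'animal']
print(leaves())          # ['bird', 'cat', 'dog']
```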
A test set is a collection of sample documents representative of the challenges and types of content an ML solution will face once in production. A test set is used to measure the accuracy of an ML system after it has gone through a round of training.
A training set is the pre-tagged sample data fed to an ML algorithm for it to learn about a problem, find patterns and, ultimately, produce a model that can recognize those same patterns in future analyses.
Read this article on our Community to learn more about training sets.
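A minimal sketch of how a pre-tagged corpus might be divided into a training set and a test set. The 80/20 ratio, the fixed seed and the toy corpus are all assumptions for illustration. A simple random split like this does not guarantee that the test set is representative, which in practice needs to be checked separately.

```python
import random


def split_corpus(documents: list, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle the tagged documents and hold out a fraction as the test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(documents)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical pre-tagged corpus of (text, label) pairs.
corpus = [(f"document {i}", "label_a" if i % 2 else "label_b") for i in range(10)]
training_set, test_set = split_corpus(corpus)
print(len(training_set), len(test_set))  # 8 2
```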
Triples are a common way to represent SAO and relation information, often with the purpose of storing, indexing and retrieving the data via a database.
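As an illustration, the “John is Mary’s father” relationship from earlier in this glossary can be stored as a triple and retrieved with a simple pattern match. The `Triple` type, the predicate names and the in-memory list are assumptions of this sketch. A real system would typically use a database built for this purpose.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(triples: list[Triple], subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None) -> list[Triple]:
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical store holding one relation and one subject-action-object fact.
store = [
    Triple("John", "father_of", "Mary"),
    Triple("John", "enjoyed", "entrée"),
]

print(match(store, predicate="father_of"))
# [Triple(subject='John', predicate='father_of', obj='Mary')]
```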
Unstructured data does not conform to a data model and has no rigid structure. Lacking rigid constructs, unstructured data is often more representative of “real world” business information (e.g., web pages, images, videos, documents, audio).