
Text classification algorithms in data mining

Expert.ai Team - 25 November 2016

Text classification systems have been adopted by a growing number of organizations to manage the ever-growing inflow of unstructured information. Their goal is to increase the discoverability of information and to make the knowledge it contains available and actionable in support of strategic decision making.

A text classification system requires several elements:

  1. It acquires documents
  2. It contains an agreed hierarchy (tree) that describes the most relevant topics for the organization
  3. It includes sample documents that identify the type of content to be assigned to each category/node of the hierarchy
  4. It applies software that uses text classification algorithms to acquire content from the appropriate data sources, process it, and assign it to the correct category.
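To make these elements concrete, here is a minimal sketch in Python of how the topic hierarchy (element 2) and its sample documents (element 3) might be represented. The `Category` class and the example taxonomy are purely illustrative, not any specific product's data model.

```python
# Illustrative sketch: a node in the topic hierarchy, holding the sample
# documents that show what kind of content belongs under it.
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    sample_documents: list[str] = field(default_factory=list)
    children: list["Category"] = field(default_factory=list)

# A tiny two-level taxonomy with one sample document per leaf node.
taxonomy = Category("All topics", children=[
    Category("Finance", sample_documents=["Quarterly earnings beat analyst estimates..."]),
    Category("Sports", sample_documents=["The home side won the championship final..."]),
])
```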

Is there a perfect text classification algorithm?

Essentially, there are just three main text classification algorithms in data mining: the “bag of keywords” approach, statistical systems and rules-based systems. Getting past the marketing buzz to choose among them can be difficult, but your selection of the best solution should be based on facts, not claims. Here is a brief summary of each approach:

  1. Manual approach. The “bag of keywords” approach is the simplest. It requires compiling a list of “key terms” that qualify the type of content in question for a certain topic. If one or more qualifying keywords are present in a document, the document is assigned to the topic (see the first sketch after this list). One of the many problems with this approach is that identifying and organizing a list of terms (preferred, alternate, etc.) is labor intensive, and the natural ambiguity of language (one keyword can have different meanings) causes many false positives. This makes the bag of keywords approach not only unscalable, but also inaccurate.
  2. Statistical approach. A statistical approach is based on the manual identification/tagging of a “training set” of documents covering each topic. It uses a text classification algorithm (Bayesian, LSA, or many others) to look at the frequency of terms to infer the key elements of a document, and it uses those terms and frequencies to build implicit rules for classifying other content (see the second sketch after this list). This approach has no understanding of meaning. In addition, systems using these types of text classification algorithms are essentially a “black box”: no one can explain why specific terms are selected by the algorithm or how they are weighted. If a classification is incorrect, there is no accurate way to modify a rule for better results. This approach also has a few other drawbacks:
    1. If you are not happy with the results, you can only manually select new training documents and start again.
    2. Because content changes frequently (in some fields more than in others), even if you are happy with the original results, you will have to regularly retrain the system to keep up, including all of the manual and cumbersome work required to tag the new samples.
    3. Most organizations struggle to find enough documents for each node of the hierarchy to train the system, which inevitably causes accuracy and scalability issues.
      This text classification algorithm has received a lot of attention because it has been positioned as fully automatic (ignoring that the training phase requires a significant amount of time and manual work). In reality, the idea of a text classification algorithm that can magically categorize content has proven to be not only unrealistic and unreliable, but also quite expensive in most situations.
  3. Rules-based approach. This text classification algorithm is based on linguistic rules that capture all of the elements and attributes of a document in order to assign it to a category (see the third sketch after this list). Rules can be written manually, or generated through automatic analysis and then validated manually (a time savings of up to 90%). Rules can be understood and improved (unlike the black box of statistics-based algorithms). A rules-based approach is flexible, powerful (much more so than a bag of keywords) and easy to express. It performs at its best when the linguistic engine is based on a true semantic technology, because a deeper understanding of the text (meaning, relevancy, relationships between concepts, etc.) makes it possible to leverage many elements to work faster and obtain better results. This deeper understanding is assured only if the engine can count on a rich, domain-independent knowledge graph (semantic network). A knowledge graph:
    1. Provides a true representation of the language and how meaningful words are used in the language in their proper context.
    2. Supports writing simpler rules because you can work with a higher level of abstraction.
    3. Takes advantage of the basic features of semantic technology for understanding the meaning of words in context, which provides superior precision and recall.
    4. Makes it easier to improve accuracy over time. Once the system is deployed, documents that do not “fit” into a specific category are identified and automatically separated, and the system administrator can fully understand why they were not classified. The administrator can then make an informed decision on whether to modify an existing rule for the future, or create a new class for content that was not previously identified.
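For illustration, here is a minimal sketch of the “bag of keywords” approach from point 1 above. The topic names, keyword lists and the `classify_by_keywords` function are hypothetical; the last line shows how an ambiguous keyword (“stock”) produces a false positive.

```python
# Minimal "bag of keywords" classifier: a document is assigned to every
# topic for which at least one qualifying keyword appears in its text.
KEYWORDS = {
    "Finance": {"stock", "dividend", "earnings"},
    "Cooking": {"recipe", "oven", "simmer"},
}

def classify_by_keywords(document: str) -> list[str]:
    words = set(document.lower().split())
    return [topic for topic, terms in KEYWORDS.items() if words & terms]

# Ambiguity causes a false positive: "stock" here means broth, not shares,
# so the document is wrongly assigned to Finance as well as Cooking.
print(classify_by_keywords("Simmer the vegetable stock before adding the rice"))
```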
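Point 2 can be sketched with an off-the-shelf statistical classifier. The example below trains scikit-learn's multinomial Naive Bayes on a tiny, hand-tagged training set; the labels and sample texts are invented, and a real system would need far more training documents per category.

```python
# Minimal statistical classifier: term frequencies from a manually tagged
# training set drive a Naive Bayes model; there are no explicit rules, and
# the learned weights are implicit rather than easy to inspect or correct.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

training_texts = [
    "quarterly earnings rose and the stock price jumped",
    "the central bank raised interest rates again",
    "the striker scored twice in the final match",
    "the team won the league title on penalties",
]
training_labels = ["Finance", "Finance", "Sports", "Sports"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(training_texts, training_labels)

print(model.predict(["shares fell after the earnings report"]))  # likely ['Finance']
```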
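Finally, here is a simplified sketch of the rules-based approach from point 3. Real rules engines work on a semantic analysis (lemmas, concepts from the knowledge graph, syntactic context) rather than raw strings, so the regular-expression rules and category names below are only a hypothetical stand-in; they do show, though, why each rule can be read, explained and corrected on its own.

```python
# Simplified rules-based classifier: each rule is an explicit, inspectable
# condition, so a wrong assignment can be traced to the rule that fired
# and that rule can then be edited or refined.
import re

RULES = [
    # (category, pattern that must match, optional pattern that blocks the rule)
    ("Finance", re.compile(r"\bstock\b"), re.compile(r"\b(vegetable|chicken|beef) stock\b")),
    ("Cooking", re.compile(r"\b(recipe|simmer|oven)\b"), None),
    ("Sports",  re.compile(r"\b(match|league|championship)\b"), None),
]

def classify_by_rules(document: str) -> list[str]:
    text = document.lower()
    categories = []
    for category, require, block in RULES:
        if require.search(text) and not (block and block.search(text)):
            categories.append(category)
    return categories

# Unlike the keyword example, "vegetable stock" is no longer mistaken for Finance.
print(classify_by_rules("Simmer the vegetable stock before adding the rice"))  # ['Cooking']
```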