[Natural Language Processing Time]: Text Classification Issue

The task of classifying text documents (cataloguing, editorial classification), that is, assigning a document to one or several thematic categories, is a pressing one because of the rapid growth in the volume of available full-text information. It has many real-life applications. For example, new artistic works can be sorted by genre, and scientific articles by subject. Another important case is spam filtering, where email messages are divided into two categories: “spam” and “not spam”.

We can solve the cataloguing task with the help of experts, who assign documents to the appropriate topics and subjects based on their own experience and practical knowledge. However, this method is inefficient in terms of the time and human resources involved. Another way to solve the categorization task is to apply automatic classification algorithms from machine learning.

To classify documents, a naive Bayes classifier or one of its modifications is typically used. Methods based on the topic modelling algorithms LDA (Latent Dirichlet Allocation) and PLSA (Probabilistic Latent Semantic Analysis) have also shown good results. Among metric algorithms for categorization, the most common are the k-nearest neighbors method (k-NN), the Rocchio algorithm and the Support Vector Machine (SVM). To apply machine learning algorithms to collections of text documents, the documents are usually represented as vectors of real numbers. To increase the efficiency of the applied methods, feature extraction methods are used.
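To make the last two ideas more concrete, here is a minimal sketch in Python of how documents can be turned into real-valued vectors (simple word counts) and passed to a multinomial naive Bayes classifier with add-one smoothing. The function names, the toy training messages and the smoothing parameter alpha are our own assumptions for illustration only; a real system would normally rely on a tested library and a much larger collection.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split is enough for this toy example.
    return text.lower().split()


def build_vocabulary(docs):
    # Sorted list of all distinct words seen in the training documents.
    return sorted({word for doc in docs for word in tokenize(doc)})


def to_vector(text, vocabulary):
    # Represent a document as a vector of word counts (real numbers).
    # Words that are not in the vocabulary are ignored.
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]


def train_naive_bayes(vectors, labels, alpha=1.0):
    # Multinomial naive Bayes with additive (Laplace) smoothing `alpha`.
    log_prior = {}
    log_likelihood = {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(vectors))
        totals = [sum(column) + alpha for column in zip(*rows)]
        norm = sum(totals)
        log_likelihood[c] = [math.log(t / norm) for t in totals]
    return log_prior, log_likelihood


def predict(vector, model):
    # Pick the category with the highest log-posterior score.
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(x * lw for x, lw in zip(vector, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from any real collection.
train_docs = [
    "win money now",
    "cheap money offer",
    "meeting agenda today",
    "project meeting notes",
]
train_labels = ["spam", "spam", "not spam", "not spam"]

vocabulary = build_vocabulary(train_docs)
model = train_naive_bayes([to_vector(d, vocabulary) for d in train_docs], train_labels)

spam_guess = predict(to_vector("cheap money", vocabulary), model)
ham_guess = predict(to_vector("meeting notes", vocabulary), model)
```

Here the vector for each message has one coordinate per vocabulary word, so the classifier never sees the raw text, only numbers.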

Let’s introduce some notation. Let D be a set (a collection) of text documents, W the set (a dictionary) of all words used in those documents, and C a fixed set of document categories. Every document d ∈ D is a sequence of words (w_1, . . . , w_{n_d}) from the dictionary W, where n_d is the document’s length (counted in words). The same word can occur in a document many times.

The classification task is essentially the task of assigning a Boolean value to each pair (d, c) ∈ D × C. The value “1” means that document d belongs to category c, and the value “0” means that it does not. Formally, the categorization task is the task of approximating an unknown target function F:

F : D × C → {1, 0}.
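As an illustration of this notation, the sketch below builds a tiny D, W and C in Python and tabulates F over every pair from D × C. The document identifiers, the words and the set of known assignments are hypothetical; in practice F is unknown and a classifier only approximates it.

```python
from itertools import product

# D: a toy collection, each document is a sequence of words.
documents = {
    "d1": ["cheap", "money", "cheap"],
    "d2": ["meeting", "agenda"],
}

# C: the fixed set of categories.
categories = ["spam", "not spam"]

# W: the dictionary of all words used in the documents.
dictionary = sorted({word for words in documents.values() for word in words})

# n_d: document length counted in words (repeated words count every time).
lengths = {doc_id: len(words) for doc_id, words in documents.items()}

# Hypothetical expert assignments that stand in for the unknown function F.
known_pairs = {("d1", "spam"), ("d2", "not spam")}


def F(doc_id, category):
    # Returns 1 if the document belongs to the category, 0 otherwise.
    return 1 if (doc_id, category) in known_pairs else 0


# F evaluated on every pair (d, c) from D x C.
table = {(d, c): F(d, c) for d, c in product(documents, categories)}
```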

Let’s state two natural assumptions that are typically made when solving a classification task.

Categories are purely symbolic labels; their names carry no additional meaning.
When solving a classification task, no source of data is available other than the text of the document itself. In particular, there is no document metadata (publication date, document type, etc.).

Let’s look at different formulations of the categorization task. For example, every document may belong to exactly one category (single-label categorization), or a document may belong to several categories simultaneously (multi-label categorization).
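The difference between the two formulations can be sketched as follows, assuming that some classifier has already produced a relevance score for every category. The scores and the threshold of 0.5 are hypothetical values chosen only for illustration.

```python
def single_label(scores):
    # Single-label: the document gets exactly one category, the best-scoring one.
    return max(scores, key=scores.get)


def multi_label(scores, threshold=0.5):
    # Multi-label: the document gets every category whose score passes the threshold.
    return sorted(c for c, s in scores.items() if s >= threshold)


# Hypothetical per-category scores for one scientific article.
scores = {"physics": 0.7, "mathematics": 0.6, "biology": 0.1}

one_category = single_label(scores)
several_categories = multi_label(scores)
```

With these scores the single-label variant keeps only “physics”, while the multi-label variant keeps both “mathematics” and “physics”.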

The task of hierarchical text categorization implies that every document can belong to categories, sub-categories, sub-categories of those sub-categories, and so on.

Thus, we can speak of a category “tree”. For the task of hierarchical categorization, several formulations are possible:

Every document always belongs to exactly one of the child sub-categories. This means that, in the end, the document is assigned to exactly one “leaf”.
Every document can belong to several child sub-categories. As a result, such a document can reach the “leaves” by several paths.
A document may belong to none of the child sub-categories. As a result, such a document might not reach the “leaves” and will remain in the parent category.
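The three formulations above can be sketched as one routing function over a small category tree. Everything in this example is an assumption made for illustration: the tree, the keyword sets, the keyword-overlap score and the mode names “single”, “multi” and “optional” are hypothetical and stand in for a real per-node classifier.

```python
# Hypothetical category tree: each node maps to its list of child sub-categories.
CHILDREN = {
    "root": ["science", "art"],
    "science": ["physics", "biology"],
    "art": [],
    "physics": [],
    "biology": [],
}

# Hypothetical keywords that stand in for a trained classifier at each node.
KEYWORDS = {
    "science": {"experiment", "theory"},
    "art": {"painting"},
    "physics": {"quantum", "laser"},
    "biology": {"cell", "laser"},
}


def score(words, category):
    # Number of distinct document words that match the category keywords.
    return len(set(words) & KEYWORDS[category])


def route(words, node="root", mode="single"):
    children = CHILDREN[node]
    if not children:
        return [node]  # reached a leaf
    scored = {child: score(words, child) for child in children}
    best = max(children, key=scored.get)  # ties go to the first child listed
    if mode == "single":
        chosen = [best]  # always exactly one child, so exactly one leaf
    elif mode == "multi":
        # every matching child is followed; fall back to the best one if none match
        chosen = [c for c in children if scored[c] > 0] or [best]
    else:  # "optional"
        if scored[best] == 0:
            return [node]  # no child fits: the document stays at this node
        chosen = [best]
    result = []
    for child in chosen:
        result.extend(route(words, child, mode))
    return result


one_leaf = route(["experiment", "laser"], mode="single")
several_leaves = route(["experiment", "laser"], mode="multi")
stays_inside = route(["experiment"], mode="optional")
```

In this toy tree the same document ends up in the single leaf “physics”, in both “physics” and “biology”, or, when no sub-category matches, in the internal node “science”.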

Would you like to learn more? Stay tuned for our next articles on Natural Language Processing on Mellivora Software’s blog!

Chalk and talk session by Mellivora’s NLP expert Olga Kanishcheva, a PhD in Computer Science of the Intellectual Computer Systems and a lecturer at Kharkiv Polytechnic Institute.