
Document Classification with Python - Types, NLP & Naive Bayes algorithms

Document classification in a large enterprise setting calls for automation. Automated methods simplify the process, dramatically cut the time taken, improve accuracy, and convert documents into an analysis-ready format as early as possible. Automated document classification is possible with algorithms that use NLP and AutoML and are built on neural networks (deep learning), Naive Bayes classifiers, or even a simple logistic regression model if the dataset is not too large (not an exhaustive list).

In this article, we’ll go over what document classification means and discuss different aspects of it, including:

  • Who is document classification for
  • What are the different ways to classify documents
  • How to hard code the automation of document classifiers with NLP (Natural Language Processing) in Python with an example

By the end of the article, you will have a thorough understanding of Automation in Document Classification. For the scope of this article, we won’t be discussing the manual way of classifying documents.

So, let’s jump right into it:

What is Document Classification?


Document classification, or document categorization, is the process of assigning classes or categories to documents, which in turn helps with their storage, management, and analysis. It has become an important part of computer science and of the daily functioning of many companies today.

Document classification has been a long-overdue development in the world of automation and data, with documents of every kind (structured and unstructured) being produced across all industries. Every document passes through multiple hands and teams before it reaches analysis, and manually routing each one into the right stream of analysis is a tedious task.

Think of 10,000 large documents that you need to classify. It is next to impossible to grow rapidly while being slowed down by such a repetitive task.

Types of Document Classification


Document classification algorithms rely on different recognition methods, which broadly fall into two groups: visual classification and text classification. The types of recognition involved in each are described below.

Visual Recognition

At times, documents in question are so different from each other that there is no need to read their text to classify them - they can be classified by just looking at their structure and style. For example - an invoice and a tax form are so different from each other that you don’t have to read and analyse their entire text to classify them. They can be classified solely based on their structure.
With computer vision or OCR, a document is broken down into pixels to learn about its structure, style, and layout. Groups of pixels are analysed, identified together as objects, and the document is subsequently classified based on those objects.
A. Computer Vision feature recognition

Computer vision has grown into a branch of computer science in which computers are taught to make sense of images. From self-driving cars to AI-powered recognition in your smartphone, it all involves computer vision. Its possibilities keep growing year after year, and it has a wide range of applications such as facial recognition, character recognition, and pattern recognition.
The back-end CV algorithm is complex and depends on the use case. It requires a lot of data; even relatively simple CV models can be trained to recognize hand gestures and the like. More modern approaches, such as those behind self-driving cars, use deep learning models involving CNNs, LSTMs, Transformers, and so on.
In computer vision and image processing, a feature is simply a piece of information about the image being processed. This information is used to classify the different building blocks in documents. Based on the layout of a particular document type, different blocks of information are recognized by the CV algorithm, and this information is eventually used to classify the document.
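As a minimal sketch of what such low-level features can look like (assuming OpenCV is installed; "page.png" is a hypothetical scanned page, not a file from this article), the snippet below computes a page's ink density and counts its connected text/graphic blocks - crude layout signals a visual classifier could use:

import cv2

# Load a scanned page in grayscale ("page.png" is a placeholder path)
image = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Binarize with Otsu's threshold: ink pixels become white (255) on black
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Feature 1: overall ink density of the page
ink_density = cv2.countNonZero(binary) / binary.size

# Feature 2: number of connected text/graphic blocks on the page
num_labels, _, stats, _ = cv2.connectedComponentsWithStats(binary)
num_blocks = num_labels - 1  # label 0 is the background

print(f"ink density: {ink_density:.3f}, blocks: {num_blocks}")

In practice, production systems learn much richer layout features with deep models rather than hand-crafting them like this.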

Textual Recognition

Textual recognition works on the idea of recognizing text together with the context associated with it. Lexical processing is then used to understand the underlying genre, theme, and sentiment of the text, which lets the system pick the class the document most likely belongs to (to a certain level of accuracy).
Textual recognition can take many forms - rule-based, OCR, document classification with NLP, etc. - some of which are explained below.
A. Rule-based text recognition

Rule-based text recognition recognizes words in a document in different ways, such as in isolation, with explicit word segmentation, or through simultaneous recognition. It can also be based on searching for certain terms in a document to work out where it might belong.
‘Rules’ in a rule-based text recognition system guide the system to identify semantically relevant elements of a text and classify it into the relevant categories based on its content. Each rule consists of an antecedent (a pattern to match) and the category to assign when that pattern is found.
For example, suppose you want to classify topics into two groups: Food and Careers. First, define words that fall into each category (for example, dark chocolate, lettuce, and fries fall into Food, while engineers, doctors, and accountants fall into Careers).
Counting the instances of these words in an incoming text then shows which type of word occurs more often, and the text is classified accordingly.
For example, the sentence “Careers in the industry of engineers and doctors are seeing a massive trend of eating more dark chocolate.” contains three career-related words but only one food-related term, so the classifier places the document in the ‘careers’ category.
Rule-based systems are not black-box algorithms and are easy to get started with, but they have certain disadvantages: they require domain knowledge and are time-consuming to build.
Generating rules for a complex system can be quite challenging and still needs a lot of data. Rule-based systems are also hard to maintain, because keeping them up to date requires adding new rules that do not always scale well with the existing ones.
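Here is a minimal sketch of such a rule-based classifier in Python, using the illustrative keyword lists from the example above:

# Keyword lists per category (illustrative words from the example above)
RULES = {
    "food": {"dark chocolate", "lettuce", "fries"},
    "careers": {"engineer", "doctor", "accountant", "career"},
}

def classify(text):
    """Count keyword hits per category and return the best-scoring one."""
    text = text.lower()
    scores = {label: sum(text.count(word) for word in words)
              for label, words in RULES.items()}
    return max(scores, key=scores.get)

sentence = ("Careers in the industry of engineers and doctors are seeing "
            "a massive trend of eating more dark chocolate.")
print(classify(sentence))  # -> 'careers' (3 career hits vs 1 food hit)

Real rule-based systems use far larger keyword lists and more sophisticated patterns (regular expressions, word boundaries, per-rule weights), which is exactly where the maintenance burden described above comes from.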
B. Optical Character Recognition

Optical Character Recognition (OCR) converts images of printed, typed, or handwritten text - scans, photos, image-only PDFs - into machine-readable text. On its own, OCR does not classify anything; it is the step that turns a document image into text that rule-based or machine-learning classifiers can then work on. OCR quality therefore directly affects how well any downstream text-based classification performs.
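As a brief sketch of the OCR step (assuming the Tesseract engine plus the pytesseract and Pillow Python packages are installed; "scan.png" is a placeholder file name), the extracted text can then be passed to any of the text-based classifiers discussed in this article:

from PIL import Image
import pytesseract

# Run OCR on a scanned document image ("scan.png" is a placeholder path)
text = pytesseract.image_to_string(Image.open("scan.png"))

print(text[:200])  # first part of the recognized text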
C. Document Classification with NLP

Natural Language Processing algorithms differentiate between documents using lexical and semantic processing, combined with techniques such as bag-of-words representations, tokenization, word stemming, and stop-word removal. Together these produce a model that can distinguish between classes of documents based on the words they contain.
If you want to skip the hassle of coding a feature-recognition engine or an NLP-based text classifier in a language like Python, you can also use a ready-made document auto-classification platform. Both routes are covered in the next sections of this article.

Platforms for Document Classification


Ever since Frederick Wilfrid Lancaster argued for the distinction between subject indexing and document classification, the latter has grown significantly with advances in data science and computing. With today's processing speeds and computing capabilities, companies apply NLP to enormous volumes of documents every day to drive real-time, data-driven decisions.
A number of platforms are available today that companies use to perform document classification; Docsumo's auto-classification, covered later in this article, is one of them.

How to Make Your Own Document Auto-Classification Algorithm Using Python?


In this section, we'll go over some code and break it down to understand how you can build your own auto-classification algorithm with Python.
First, start by importing the following libraries:
    
import re                                 # regular expressions for text cleaning

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer
from nltk.corpus import wordnet
from nltk.stem import SnowballStemmer
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, precision_recall_curve

from wordcloud import WordCloud
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')         # keep the notebook output clean
NUM_OF_THREADS = 10
    
    
Our data, let's say, consists of comments from different users that we are going to label as toxic or not toxic. For this particular example, we'll work with the data provided in Kaggle's Toxic Comment Classification Challenge.
With the utility, modelling, visualization, evaluation, and NLP libraries imported above, we can now load the data into the notebook (this path format is how data is read inside a Kaggle notebook):
    
data = pd.read_csv("../input/traindata/train.csv")
data.dropna(inplace=True)                    # drop rows with missing values
data.reset_index(inplace=True, drop=True)
Text = "comment_text"                        # column holding the raw comments
    
    
Next, we strip out unwanted patterns with regular expressions (snippets like these are easy to find on Stack Overflow):
    
def apply_regex(text):
    text = text.apply(lambda x: re.sub(r"\S*\d\S*", " ", x))       # words containing digits
    text = text.apply(lambda x: re.sub(r"\S*@\S*\s?", " ", x))     # email addresses
    text = text.apply(lambda x: re.sub(r"\S*#\S*\s?", " ", x))     # hashtags
    text = text.apply(lambda x: re.sub(r"http\S+", " ", x))        # URLs
    text = text.apply(lambda x: re.sub(r"[^a-zA-Z0-9 ]", " ", x))  # anything not alphanumeric
    text = text.apply(lambda x: x.replace(u"\ufffd", "8"))         # stray replacement characters
    text = text.apply(lambda x: re.sub(" +", " ", x))              # repeated spaces
    return text

data[Text] = apply_regex(data[Text])
    
    
This removes numbers and words concatenated with numbers, emails, hashtags, URLs, multiple spaces, and so on - none of which our model needs in order to judge whether a comment is toxic.
Next, preprocess the data with the NLP library spaCy, using its predefined list of stop words:
    
preprocess = spacy.load("en_core_web_sm")      # small English spaCy pipeline
stop_words = preprocess.Defaults.stop_words    # spaCy's built-in stop-word list
    
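The stemming step below expects a tokenized column, so here is one way to create it - a sketch that tokenizes the cleaned comments with the spaCy pipeline loaded above and drops stop words and non-alphabetic tokens:

def tokenize(text):
    # Tokenize with spaCy and keep lowercase alphabetic tokens that are not stop words
    doc = preprocess(text)
    return [token.text.lower() for token in doc
            if token.is_alpha and token.text.lower() not in stop_words]

data['tokenized'] = data[Text].apply(tokenize)

For a large dataset, running the comments through preprocess.pipe() in batches would be considerably faster than calling the pipeline row by row like this.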
    
    Apply stemming using the following code:
    
stemmer = SnowballStemmer(language="english")

def applyStemming(listOfTokens):
    # Reduce each token to its stem (e.g. "eating" -> "eat")
    return [stemmer.stem(token) for token in listOfTokens]

data['stemmed'] = data['tokenized'].apply(applyStemming)
    
    
Check a sample of the data:
    
    data.sample(10)
    
    
You can also build visualizations to inspect the most frequent words. (Note: the word cloud for toxic comments contains a lot of profanity, so it's best not to run that one; the code below, adapted from another participant in the Kaggle competition, sticks to the non-toxic word cloud.)
    
# pos_text: all non-toxic comments joined into one string
# (assumes a binary "label" column, as used in the train/test split below)
pos_text = " ".join(data.loc[data["label"] == 0, Text])

wordcloud_pos = WordCloud(collocations=False,
                          width=1500,
                          height=600,
                          max_font_size=150
                          ).generate(pos_text)
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud_pos, interpolation="bilinear")
plt.axis("off")
plt.title("Most common words associated with non-toxic comments", size=20)
plt.show()
    
    
Split the data into train and test sets:
    
# Join the stemmed tokens back into single strings so the vectorizer can consume them
X_train, X_test, y_train, y_test = train_test_split(data["stemmed"].str.join(" "),
                                                     data["label"])
    
    
Now we convert the data into TF-IDF embeddings, a basic frequency-based approach.
Term Frequency-Inverse Document Frequency (TF-IDF) is a very common method for computing how important each word is across the documents in a dataset. The assumption is that the more often a word appears in a document, the more important that word is for that document relative to the rest of the corpus.
TF-IDF accordingly assigns each word a weight based on its frequency of occurrence; words that appear in almost every document end up with low weights. The result is a bag-of-words representation - an array of per-word scores in which word order, and therefore context, is lost - which keeps the computation fast on a local machine.
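For reference, the usual weighting is tf-idf(t, d) = tf(t, d) × idf(t) with idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term t. Scikit-learn's TfidfVectorizer uses a smoothed variant by default, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and L2-normalizes each document vector.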
    Use the following code to fit and transform the data into an array as required.
    
tfid = TfidfVectorizer(lowercase=False, max_features=500)     # keep the 500 most frequent terms

train_vectors_tfidf = tfid.fit_transform(X_train).toarray()   # learn the vocabulary on the train set
test_vectors_tfidf = tfid.transform(X_test).toarray()         # reuse it on the test set
    
    
    Use the following code to normalize the TF-IDF vectors:
    
    norm_TFIDF = Normalizer(copy=False)
    norm_train_tfidf = norm_TFIDF.fit_transform(train_vectors_tfidf)
    norm_test_tfidf = norm_TFIDF.transform(test_vectors_tfidf)
    
    
For the classifier itself, we'll use a Multinomial Naive Bayes model:
    
    model = MultinomialNB()
    
    
A custom function that returns a dataframe with all of our evaluation metrics:
    
def classifier(y_test, predictions, modelName):
    # Confusion-matrix counts
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    prec = precision_score(y_test, predictions)
    rec = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    acc = accuracy_score(y_test, predictions)
    # Specificity = true negative rate
    spec = tn / (tn + fp)

    score = {'Model': [modelName], 'acc': [acc], 'f1': [f1], 'rec': [rec], 'Prec': [prec],
             'Specificity': [spec], 'TP': [tp], 'TN': [tn], 'FP': [fp], 'FN': [fn],
             'y_test size': [len(y_test)]}
    df_score = pd.DataFrame(data=score)
    return df_score
    
    
Scale the TF-IDF features, train the model, and generate predictions:
    
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(norm_train_tfidf)   # scale the normalized TF-IDF vectors
test_scaled = scaler.transform(norm_test_tfidf)

model.fit(train_scaled, y_train)
preds = model.predict(test_scaled)
scores = classifier(y_test, preds, "MultinomialNB")
    
    
    To check the model scores:
    
    scores
    
    
Starting with cleaning the data and running it through a standard NLP pipeline, we used a basic frequency-based TF-IDF embedding. A different baseline model would give a different outcome, and so would other hyperparameters for the best-performing model; the point is simply to show how you can take a document classifier - a comment classifier in this case - and turn it into a working model.
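The imports at the top already include RandomizedSearchCV, so one natural next step is a small hyperparameter search. As a rough sketch (the alpha values below are only illustrative), tuning the Naive Bayes smoothing parameter on the scaled training vectors could look like this:

param_grid = {"alpha": [0.01, 0.1, 0.5, 1.0, 2.0]}   # illustrative smoothing values

search = RandomizedSearchCV(MultinomialNB(), param_grid,
                            n_iter=5, cv=3, scoring="f1", random_state=42)
search.fit(train_scaled, y_train)

print(search.best_params_, search.best_score_)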
Note: this is only a demonstration. To put a classifier into production, a far more advanced algorithm is required, which is not feasible for everyone to develop in-house. This is where alternative ways of using a document classifier come into the picture and help optimize the process as a whole.

    Document Auto-Classification with Docsumo


Docsumo's textual data processing reduces the whole job of taking documents and building a working model from their text with machine learning to a couple of clicks, with output returned within seconds.
The classifier works with document formats from many industries and can be customized as well. Documents are uploaded and screened by the algorithm; based on the confidence of the prediction, some are passed through as 'approved' and the rest are marked 'under review'. Documents under review are shown to the user for approval, and if all goes well the algorithm returns an outcome in under a minute.
The only thing to note is that the algorithm needs at least 20 documents of each kind, with as much variety as possible both between document types (forms, feedback, letters, etc.) and within each type, in order to predict the type of a new document accurately and efficiently.
    This is how it works:
1. Training

Docsumo uses text-based machine learning algorithms to define and classify different documents. The system can be trained with as few as 20 documents per document type; for example, to auto-classify 4 document types, each document type API needs to be trained with 20 documents. A 90/10 split is maintained between training and prediction: while training the ML model, 90% of the identified text is used for training and the remaining 10% is used for prediction, to improve the model's accuracy. The model works best on a varied set of documents - document types that have less in common produce better results.
2. Prediction

Once the model is trained with 20 documents for each doctype, you're all set to use the document classification feature. Documents can be uploaded in bulk to the auto-classification API, which predicts the type of each document based on its training.
3. Extraction

    After the documents are classified, data extraction from these documents happens as usual.

Why Choose Docsumo Over Hardcoding Your Own Classifier?

• Hardcoding an algorithm can cost your organization a large sum: you need to set up a server, hire developers, and prepare the data for the algorithm to work on, and it is time-consuming. None of these costs are incurred when using a service like Docsumo instead.
• Manually entering millions of rows of data from millions of documents is practically impossible, and a hardcoded algorithm built to do it will inevitably produce some errors. With Docsumo you can always go back and review the outcome the algorithm delivers, so accuracy is not compromised.
• In a hardcoded pipeline, it is hard to add validation rules - for example, checking that Gross Income minus tax equals Net Income. You can locate those fields at specific positions in the document, but double-checking the values requires complicated semantic analysis on the back-end; in Docsumo you can add multiple such checks through custom settings.

