Blog

A Comprehensive Guide to Natural Language Processing Algorithms

Learn about the simpler text processing cousins of LLMs like GPT-4

Andrew Tate

Tokenization

Tokenization is the process of breaking down phrases, sentences, paragraphs, or a corpus of text into smaller elements like words or symbols. These smaller elements of the text are called tokens. For example, a string “What is NLP ?”, once tokenized looks like [“what”, “is”, “NLP”, “?”].

Once tokenized, you can count the number of words in a string or calculate the frequency of different words as a vector representing the text. As this vector comprises numerical values, it can be used as a feature in algorithms to extract information.

We’ll use NLTK (Natural Language Toolkit) in Python for our tokenization. First, install the toolkit:

$ pip install nltk

Then you can import it into your code:

import nltk

This module contains two essential methods for tokenization:

word_tokenize() creates word tokens of the given text.
sen_tokenize() converts the text into sentence tokens.

Let’s check both of these methods of tokenization one by one. For this, we will use an example text string as follows:

text_to_tokenize = '''The most popular platform for creating Python programmes that use human language data is called NLTK. Along with a collection of text processing libraries for categorization, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum, it offers simple interfaces to more than 50 corpora and lexical resources, including WordNet.'''

Word Tokenization

NLTK provides the word_tokenize() method to convert the text into word tokens. You simply need to pass the text to this method as follows:

# download the punkt tokenizer
nltk.download('punkt')
# apply NLTK word tokenizer
word_tokens = nltk.word_tokenize(text_to_tokenize)
print('Tokenized Words:', word_tokens)

The output will be:

Sentence Tokenization

Similarly, you can use the sen_tokenize() method to receive sentence tokens from the text.

# apply NLTK sentence tokenizer
sentence_tokens = nltk.sent_tokenize(text_to_tokenize)
print(sentence_tokens)

Word Frequency Counter

You can use these tokens further for tasks like stemming and lemmatization and word frequency counting to make some sense of this text data. The nltk.probability.FreqDist() method allows you to calculate word frequencies just by passing the list of word tokens:

# Check the frequency distribution of words
from nltk.probability import FreqDist
fdist = FreqDist(word_tokens)
print('Word Count Dictionary:', dict(fdist))

Now you can gain insights about common and least common words in your dataset to help you understand the corpus.

Sentiment Analysis

Sentiment Analysis is a task of NLP that involves analyzing a piece of text to determine the overall sentiment or attitude conveyed by the text. In the context of movie reviews, sentiment analysis can be used to classify a review as either positive or negative based on the language used in the review.

Sentiment Analysis helps businesses to understand users’ behavior and emotions toward their products and services. Nowadays almost all kinds of organizations use sentiment analysis in one way or the other to make informed decisions about their products and services based on user’s responses.

NLTK provides a method for sentiment classification. This follows on from tokenization as the classifiers expect tokenized input.

Let’s use NLTK to perform sentiment analysis on the IMDB rating dataset. To begin with, let’s import all the necessary dependencies for the sentiment analysis:

# import dependencies
import re
import nltk
import pandas as pd
# NLTK dependencies
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Now, you need to load the dataset on which you want to perform the sentiment analysis (IMDB in this case).

# read the dataset
imdb_df = pd.read_csv('IMDB_Dataset.csv')
# use only the first 5000 rows of data
imdb_df = imdb_df[:5000]
# check the first few rows
imdb_df.head()

In the above code, we are first reading the dataset (CSV format) using the read_csv() method from Pandas. As this dataset contains more than 50k IMDB reviews, we will just want to test the sentiment analyzer on the first few rows, so we will only use the first 5k rows of data.

Finally, the head() method shows the first 5 rows of data by default.

Next, we will write a function to preprocess this data to be used by the NLTK sentiment analyzer:

Convert the input text string to lower and tokenize this input.
Remove the stopwords like ‘the’, ‘is’, ‘an’, etc. as these words do not contribute much to any NLP tasks.
Lemmatize the words to have a single form across all the inputs.
Finally, remove the unwanted characters from the input string.

# create preprocess_text function
def preprocess_text(text):        
	# Tokenize the text        
	tokens = word_tokenize(text.lower())        
	# Remove stop words        
	filtered_tokens = [token for token in tokens if token not in stopwords.words('english')]
	# Lemmatize the tokens        
	lemmatizer = WordNetLemmatizer()        
	lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
	# Join the tokens back        
	processed_text = ' '.join(lemmatized_tokens)     
	# filter only the text data and special characters        
	processed_text = re.sub('[^A-Za-z0-9]+', ' ', processed_text)  
	return processed_text

The preprocess_text() method accepts the input text and applies all the preprocessing steps and returns the preprocessed text. You can apply this preprocessing on your dataset text as follows:

# apply the function preprocess_text to the dataset
imdb_df['review'] = imdb_df['review'].apply(preprocess_text)imdb_df.head()
Once done, you need to create an object of the NLTK’s SentimentIntensityAnalyzer():
# initialize NLTK sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

The Sentiment Analyzer from NLTK returns the result in the form of probability for Negative, Neutral, Positive, and Compound classes. But this IMDB dataset only comprises Negative and Positive categories, so we need to focus on only these two classes.

# create get_sentiment function
def get_sentiment(text):
	scores = analyzer.polarity_scores(text)
	if scores['neg'] > scores['pos']:
		sentiment = 'negative'
	elsif scores['pos'] > scores['neg']:
		sentiment = 'positive'
	else:
		sentiment = 'neutral'
	return sentiment

In the above function, we use the polarity_scores() method from NLTK that gives the probability of different sentiments. As we need to focus on only two classes, we will just compare the probability of those two classes, and whichever will have a higher value will be the sentiment of a given text. Now we simply need to apply this method to our reviews and store the results as follows:

# apply get_sentiment function
imdb_df['sentimentPredicted'] = imdb_df['review'].apply(get_sentiment)imdb_df.head()

Finally, you can check the accuracy of your predictions with the help of the Scikit-Learn module of Python. You can install it as follows:

$ pip install scikit-learn

Once installed you can use classification_report() to check the detailed report of your predictions as follows:

## check performance
from sklearn.metrics
import classification_report
print(classification_report(imdb_df['sentiment'], imdb_df['sentimentPredicted']))

As you can see, the accuracy of our analyzer is 68%. This is good considering we haven’t trained any classifier on our own. To improve the accuracy of sentiment classification, you can train your own ML or DL classification algorithms or use already available solutions from HuggingFace.

Named Entity Recognition

Named Entity Recognition (NER) is a technique in NLP that involves identifying and categorizing different entities of an unstructured text into a set of predetermined groups such as people, organizations, locations, dates, and so on. The goal of NER is to identify and classify these named entities within a text, and to associate them with specific categories or types.

For example, in the sentence “Steve Jobs was the CEO of Apple”, the named entity “Steve Jobs” can be identified as a person, while “Apple” can be identified as an organization.

NER has many possible use cases including, text extraction from unstructured data, information extraction from customer data for Customer Relationship Management (CRM), and identifying suspicious activities for Fraud Detection.

Many NLP libraries, such as NLTK, Spacy, and Stanford CoreNLP, provide pre-trained models for NER that can be used for various applications. NER is usually a three-step process that includes:

Tokenize a string of sentences or a paragraph.
Identify parts of speech of different words in the input text.
Identify and classify the list of words into multiple named entities.

To begin with, let’s import some of the dependencies for performing NER.

import nltk
from nltk import word_tokenize
from nltk import pos_tag
from nltk import ne_chunk

Then, you can define a string or any existing dataset for which you will want to perform the NER.

ner_sentence = '''Tom Brady won his seventh Super Bowl title in 2021 with the Tampa Bay Buccaneers, becoming the oldest quarterback to win a Super Bowl at the age of 43.'''

Now, the first step is to tokenize the text into words using NLTK’s word_tokenize() method.

# tokenization
tokens = word_tokenize(ner_sentence)
print(tokens)

Then the second step is to identify the parts of speech associated with different words using the pos_tag() method of NLTK.

# POS tagging
pos_tags = pos_tag(tokens)
print(pos_tags)
Finally, identify and classify the named entities of the text using the ne_chunk() method.
named_entities = ne_chunk(pos_tags)
print(named_entities)

You can also visualize these named entities using the following code:

def process_content(pos_tags):
	try:
		namedEnt = nltk.ne_chunk(pos_tags, binary=False)
		namedEnt.draw()
	except Exception as e:
		print(str(e))
	# draw NER graph
	process_content(pos_tags)

Doing so would let you see that Tom Brady is classified as Person while Tampa Bay Buccaneers is classified as an Organization. This is how easy it's to identify the named entities in a text.

Topic Modeling

Topic modeling is the process of automatically identifying the underlying themes or topics in a set of documents, based on the frequency and co-occurrence of words within them. This way, it discovers the hidden patterns and topics in a collection of documents.

The output of a topic modeling algorithm is typically a set of topics, each represented by a list of keywords and their associated probabilities. The topics may or may not be labeled, depending on whether the algorithm was provided with any prior knowledge about the topics.

Some common applications of topic modeling include content recommendation, search engine optimization, and trend analysis. It's also widely used in academic research to identify the main themes and trends in a field of study.

In this section, we’ll use the Latent Dirichlet Allocation (LDA) algorithm on a Research Articles dataset for topic modeling.

For this we aren’t using NLTK for the modeling. Instead we’ll use the Gensim library that has an implementation of LDA:

$ pip install gensim

To begin with, let’s import the NLTK and Gensim dependencies to perform the topic modeling:

# load dependencies
import pandas as pd
import numpy as np
# NLTK dependencies
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
# Gensim dependencies
from gensim import corpora
from gensim.models.ldamodel import LdaModel

Once imported, you can read the dataset and check its first few rows as follows:

# load the dataset
df = pd.read_csv("Research_Articles.csv")
# select the first 1000 rows of data
df = df[:1000]
# check the first few rows of data using head() method
df.head()

In this dataset, our only focus is the ABSTRACT column for topic modeling. Next, process the text data to tokenize text, remove stopwords and lemmatize it using the NLTK library.

# tokenize and preprocess the text
stop_words = stopwords.words('english')
lemma = WordNetLemmatizer()
def preprocess_text(text):
	tokens = nltk.word_tokenize(text.lower())
	tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
	tokens = [lemma.lemmatize(tok) for tok in tokens]
	return tokens
# preprocess text of ABSTRACT column
documents = df["ABSTRACT"].apply(preprocess_text).tolist()

Once the text is preprocessed, you need to create a dictionary and corpus for the LDA algorithm.

# create a dictionary and corpus for the Latent Dirichlet Allocation
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

Then train the LDA model so that this trained model can be used for generating topics.

# train the LDA model
num_topics = 10
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10)
Finally, you can now check the top 10 topics using LDA as follows:
# print the topics and their top words
for topic_id in range(num_topics):
	top_words = [word for word, prob in lda_model.show_topic(topic_id, topn=10)]      
	print("Topic {}: {}".format(topic_id, ", ".join(top_words)))

You can also check 10 topics along with the probabilities associated with each word as follows;

# check 10 topics with probability
lda_model.print_topics(num_topics=10, num_words=10)

This is how you can use topic modeling to identify different themes from multiple documents.

Text Summarization

Summarizing text is a core task in linguistic processing.

Automatic text summarization can be carried out in two different ways: extractive and abstractive. Extractive summarization involves selecting and combining the most important sentences or phrases from the original text, while abstractive summarization involves generating new sentences that capture the essence of the original text.

You can train a text summarizer on your own using ML and DL algorithms, but it will require a huge amount of data. Instead, you can use an already trained model available through HuggingFace or OpenAI. In this section, you will see how you can perform text summarization using one of the available models from HuggingFace. To begin with, you need to install the Transformers Python package that allows you to use HuggingFace models.

$ pip install transformers

Once installed, you need to load this package in your Python code.

from transformers import pipeline

Then you need to define the text on which you want to perform the summarization operation.

ARTICLE = """A rapidly expanding discipline, artificial intelligence (AI) has the potential to revolutionise a number of sectors, including healthcare, banking, and transportation. AI systems are capable of processing vast amounts of data and making predictions and decisions based on that data, often with greater accuracy and speed than humans. However, there are also concerns about the ethical implications of AI, such as the potential for bias or misuse of the technology. As AI continues to develop and become more integrated into our daily lives, it's important to carefully consider both the benefits and risks."""

For now, we will use the pre-trained Bert model from Facebook available on HuggingFace. To use it, you need to create a pipeline with this model as follows:

# load model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

When you run the above code for the first time, it will download all the necessary weights and configuration files to perform text summarization.

Finally, you just need to pass the text to your summarizer object to summarize it.

# call summarizer
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

There you have it– that’s how easy it's to perform text summarization with the help of HuggingFace.

Semantic Analysis

Semantic analysis, also known as semantic parsing or natural language understanding, is a process of analyzing text to extract meaning from it. It involves identifying the relationships between words and phrases in a sentence and interpreting their meaning in a given context.

Here we’ll check the synonyms of different words in input strings. To begin with, you need to import some of the NLTK methods and an example string:

# load dependencies
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
# Sample text
text = "A Comprehensive Guide to the Top NLP Algorithms for 2023"

After this, you need to tokenize the given input text, as usual, using the word_tokenize() method.

# tokenize the text
tokens = word_tokenize(text)

Then you need to identify parts of speech of different words in the input using the pos_tag() method.

# Perform part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)print(pos_tags)
Finally, you can use the wordnet module to generate the synonyms of the input text words.
# Identify synonyms and antonyms for the words in the text
for word, pos in pos_tags:
	synsets = wordnet.synsets(word)
	if synsets:
		synset = synsets[0]
		synonyms = synset.lemma_names()
		print(word, ":", synonyms)

From here you can get antonyms of the text instead, perform sentiment analysis, and calculate the frequency of different words as part of semantic analysis.

Clustering

If you have literally billions of documents, you can’t go through them one by one to try and extract information. You need to have some way to understand what each document is about before you dive deeper. That’s where clustering algorithms help.

Clustering is a common unsupervised learning technique that involves grouping similar items in a cluster. In NLP, clustering is grouping similar documents or words into clusters based on their features. The goal of clustering is to identify patterns and relationships in the data without prior knowledge of the groups or categories. Once you obtain a cluster of similar documents, you can use NLP methods like text summarization and topic modeling to analyze this text properly.

The prominent clustering techniques k-means, hierarchical, and density-based can all be applied to natural language processing (NLP). The following steps are often included in these algorithms:

Preprocessing: Clean and transform the data to make it suitable for clustering. For text data, this may involve tokenization, normalization, and vectorization.
Feature extraction: Identify the relevant features or attributes of the data that will be used for clustering. For text data, this may involve using techniques like bag-of-words or tf-idf to represent the documents as vectors.
Clustering: Apply the clustering algorithm to group the data into clusters based on their features. The choice of algorithm and the number of clusters can be based on domain knowledge or data exploration.
Evaluation: Assess the quality of the clusters based on their coherence, separation, and interpretability.

Let’s look at the implementation of the K-Means clustering for some example sets of document strings. The scikit-learn module contains the implementation of different clustering algorithms. let’s first import the clustering dependencies.

# import dependencies
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

Then, let’s define some of the example text strings to test the clustering.

# Sample corpus of documents
corpus = [        "The sky is blue and beautiful",        "Love this blue and beautiful sky",        "The fox leaps in the direction of the very sluggish dog",        "A black fox is quite quick and the dog is very lazy",        "The dark-colored dog and fox get along well",        "The red apple is succulent and ripe",        "Delicious red apples are the best",        "Stay healthy by eating an apple a day",

After this, you need to vectorize the text using the TfidfVectorizer as follows:

# Vectorize the corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

In the above code, first an object of TfidfVectorizer is created, and then the fit_transform() method is called for the vectorization. After this, you can pass the vectorized text to the KMeans() method of scikit-learn to train the clustering algorithm.

# Cluster the documents using k-means
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

Once you run the above code, different clusters are formed by k-means which can be presented as follows:

# Print the clusters and their respective documents
for i in range(kmeans.n_clusters):
        print("Cluster", i+1)
        for j in range(len(corpus)):
            if kmeans.labels_[j] == i:
                print("-", corpus[j])
        print()

As you can see, all the clusters formed using k-means look quite relevant. This was just a simple example of applying clustering to the text, using sklearn you can perform different clustering algorithms on any size of the dataset.

Text Analysis

Now that you have seen multiple concepts of NLP, you can consider text analysis as the umbrella for all these concepts. It's the process of extracting useful and relevant information from textual data. It involves using various techniques and algorithms such as sentiment analysis, named entity recognition, topic modeling, text classification, and text summarization to analyze and understand the patterns and structure of text data.

Text analysis typically involves several steps, including text preprocessing (such as tokenization, stemming, and stop word removal), feature extraction (such as bag-of-words or word embeddings), and applying various algorithms and techniques (such as machine learning algorithms, statistical techniques, or rule-based approaches) to analyze the data and extract insights.

You can find the entire code used in this article here.

The ever-increasing power of NLP

You now know the different algorithms that are widely used by organizations to handle their huge amount of text data. Libraries like NLTK, SpaCy, Gensim, and TextBlob. make the work a lot easier.

One of the most significant recent advancements in NLP has been the development of pre-trained language models, such as BERT and GPT-4, which can understand and generate natural language with remarkable accuracy and fluency. These models have been applied to a wide range of tasks, from sentiment analysis and text classification to language translation and question answering, and have shown remarkable results.

Another recent advancement in NLP is the use of transfer learning, which allows models to be trained on one task and then applied to another, similar task, with only minimal additional training. This approach has been highly effective in reducing the amount of data and resources required to develop NLP models and has enabled rapid progress in the field.

Different organizations are now releasing their AI and ML-based solutions for NLP in the form of APIs. So it's been a lot easier to try out different services like text summarization, and text classification with simple API calls. In the years to come, we can anticipate even more ground-breaking NLP applications.

This is something we think a lot about at Hex, where we're creating a platform that makes it easy to build and share interactive data products which can help teams be more impactful.

If this is is interesting, click below to get started, or to check out opportunities to join our team.

✨ Get started for free

👩‍💻 Open roles

Made with

🍩

☕

🥟

🍺

🍰

🔮

🔒

🥖

🍷