Text preprocessing is a crucial step in natural language processing (NLP) and machine learning projects that deal with text data. It involves cleaning and transforming raw text into a form that can be readily used by machine learning models or other downstream applications.
Here are some commonly used text preprocessing techniques:
- Lowercasing: Convert all characters in the text to lowercase. This helps in maintaining consistency and reducing the vocabulary size.
- Tokenization: Splitting text into individual words or tokens.
- Removing Punctuation: Strip the text of any punctuation marks.
- Removing Stop Words: Stop words are common words like ‘and’, ‘the’, ‘is’, etc. that are often filtered out because they occur frequently and don’t carry significant meaning on their own.
- Stemming: Reduce words to their root form. For example, “running” becomes “run”.
- Lemmatization: Similar to stemming, but the result is always a valid word in the language. For example, “running” becomes “run”, and “better” becomes “good”.
- Removing HTML Tags: If dealing with web data, we often have to remove HTML tags.
- Removing Numbers or converting them to words.
- Handling Emojis and Special Characters.
- Removing URLs.
- Removing Frequent or Rare words.
Let’s demonstrate each of these techniques step by step using Python:
Setup
First, let’s define a sample text and import necessary libraries.
# Sample text
sample_text = "Hello World! I'm learning about NLP. It's fascinating. Visit https://example.com for more info. ?"
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from bs4 import BeautifulSoup
import re
import string
# Downloading necessary datasets
nltk.download('stopwords')
nltk.download('wordnet')
sample_text
It seems we can’t download the necessary datasets for stopwords and wordnet due to lack of internet access in this environment. However, I can still demonstrate the techniques using custom stop words and other functions.
Let’s proceed with the preprocessing steps:
1. Lowercasing
Convert the sample text to lowercase.
# Convert the sample text to lowercase
lowercased_text = sample_text.lower()
lowercased_text
The text has been converted to lowercase.
2. Tokenization
Let’s tokenize the text, i.e., split it into individual words or tokens.
# Tokenizing using Python's string methods
tokens = lowercased_text.split()
tokens
The text has been tokenized into individual words.
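Note that plain `str.split()` keeps punctuation attached to words (“world!” stays one token). As a sketch of an alternative that needs only the standard library, a regex-based tokenizer can separate punctuation into its own tokens:

```python
import re

sample = "hello world! i'm learning about nlp."

# \w+ matches runs of word characters; [^\w\s] matches a single
# punctuation mark, so each punctuation character becomes its own token.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(regex_tokens)
# -> ['hello', 'world', '!', 'i', "'", 'm', 'learning', 'about', 'nlp', '.']
```

In a typical setting you would instead reach for NLTK’s `word_tokenize`, but that requires downloading the punkt models.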
3. Removing Punctuation
Now, we’ll strip each token of punctuation marks.
# Removing punctuation from the tokens
tokens_without_punctuation = [token.translate(str.maketrans('', '', string.punctuation)) for token in tokens]
tokens_without_punctuation
The punctuation has been removed from the tokens.
4. Removing Stop Words
Stop words are common words that don’t carry significant meaning on their own. In a typical setting, we’d use NLTK’s list of stop words. However, due to constraints, I’ll define a small custom list of stop words to demonstrate.
# Custom list of stop words
custom_stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
"yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers",
"herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves",
"what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are",
"was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does",
"did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until",
"while", "of", "at", "by", "for", "with", "about", "against", "between", "into",
"through", "during", "before", "after", "above", "below", "to", "from", "up", "down",
"in", "out", "on", "off", "over", "under", "again", "further", "then", "once"]
tokens_without_stopwords = [token for token in tokens_without_punctuation if token not in custom_stopwords]
tokens_without_stopwords
Stop words have been removed from the tokens.
5. Stemming
Stemming is a technique to reduce words to their root or base form. For instance, “running” becomes “run”. We’ll use the Porter Stemmer for this demonstration.
# Initialize Porter Stemmer
stemmer = PorterStemmer()
# Apply stemming to the tokens
stemmed_tokens = [stemmer.stem(token) for token in tokens_without_stopwords]
stemmed_tokens
The tokens have been stemmed, reducing them to their root forms. For instance, “learning” has been stemmed to “learn”.
6. Lemmatization
Lemmatization is similar to stemming but ensures the root word belongs to the language. For instance, “running” would become “run” and “better” might become “good”. However, due to our current constraints, I’ll skip the demonstration of this technique, as it requires the WordNetLemmatizer from NLTK.
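To still illustrate the idea without WordNet, here is a minimal dictionary-based sketch; `lemma_map` is a hypothetical hand-built lookup table, not NLTK’s API:

```python
# Minimal stand-in for a lemmatizer: look each token up in a small
# hand-built dictionary and fall back to the token itself.
lemma_map = {"running": "run", "better": "good", "was": "be", "mice": "mouse"}

def lemmatize(token):
    return lemma_map.get(token, token)

words = ["running", "was", "better", "fascinating"]
lemmas = [lemmatize(w) for w in words]
print(lemmas)  # -> ['run', 'be', 'good', 'fascinating']
```

With NLTK available, you would instead call `WordNetLemmatizer().lemmatize(word, pos="v")`, where the part-of-speech tag matters: “running” only lemmatizes to “run” when treated as a verb.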
7. Removing HTML Tags
If the text comes from web sources, it might contain HTML tags. We’ll use the Beautiful Soup library to demonstrate removing HTML tags. Let’s assume our sample text contains some HTML content.
# Sample text with HTML tags
html_text = "<html><head></head><body><p>Hello World!</p><br><p>I'm learning about NLP.</p></body></html>"
soup = BeautifulSoup(html_text, "html.parser")
text_without_html = soup.get_text()
text_without_html
The HTML tags have been successfully removed from the text using Beautiful Soup.
8. Removing Numbers or Converting Them to Words
Depending on the use case, you might want to remove numbers from the text or convert them into words. Let’s demonstrate both approaches.
# Sample text with numbers
text_with_numbers = "I have 5 apples and 3 oranges."
# Removing numbers
text_without_numbers = re.sub(r'\d+', '', text_with_numbers)
text_without_numbers
Numbers have been removed from the text.
If you prefer to convert numbers to words, you’d typically use a library like word2number or inflect. However, due to our environment constraints, I won’t be able to demonstrate this at the moment.
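As a rough stand-in, the conversion can be sketched with a hand-built digit map and a regex substitution; `digit_words` is an illustrative table, and a real library such as inflect would handle multi-digit numbers properly:

```python
import re

# Hand-rolled digit-to-word table (single digits only, for illustration).
digit_words = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

text_with_numbers = "I have 5 apples and 3 oranges."

# Replace each digit with its spelled-out form.
converted = re.sub(r"\d", lambda m: digit_words[m.group()], text_with_numbers)
print(converted)  # -> "I have five apples and three oranges."
```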
9. Handling Emojis and Special Characters
In modern texts, especially from social media sources, you might encounter emojis and other special characters. Depending on your goals, you might want to remove or translate them. For simplicity, I’ll demonstrate removing them.
# Sample text with emojis
text_with_emoji = "I love Python! ??"
# Removing emojis and other special characters
text_without_emoji = text_with_emoji.encode('ascii', 'ignore').decode('ascii')
text_without_emoji
The emojis have been removed from the text.
10. Removing URLs
Let’s demonstrate how to remove URLs from the text.
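A common approach uses a regex substitution with Python’s `re` module; note that `\S+` will also swallow any punctuation glued directly to the URL:

```python
import re

text_with_url = "Visit https://example.com for more info."

# Match http(s):// URLs or bare www. addresses up to the next whitespace.
url_pattern = re.compile(r"(https?://\S+|www\.\S+)")

# Remove the URLs, then collapse the leftover double spaces.
text_without_urls = re.sub(r"\s+", " ", url_pattern.sub("", text_with_url)).strip()
print(text_without_urls)  # -> "Visit for more info."
```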