Text preprocessing is a crucial step in natural language processing (NLP) and machine learning projects that deal with text data. It involves cleaning and transforming raw text into a form that can be readily used by machine learning models or other downstream applications.

Here are some commonly used text preprocessing techniques:

  1. Lowercasing: Converting all characters in the text to lowercase. This maintains consistency and reduces the vocabulary size.
  2. Tokenization: Splitting text into individual words or tokens.
  3. Removing Punctuation: Stripping the text of any punctuation marks.
  4. Removing Stop Words: Filtering out common words like ‘and’, ‘the’, and ‘is’ that occur frequently and don’t carry significant meaning on their own.
  5. Stemming: Reducing words to their root form. For example, “running” becomes “run”.
  6. Lemmatization: Similar to stemming, but the result is always a valid dictionary word. For example, “running” becomes “run”, and “better” becomes “good”.
  7. Removing HTML Tags: Stripping markup, which is often necessary when dealing with web data.
  8. Removing Numbers, or converting them to words.
  9. Handling Emojis and Special Characters.
  10. Removing URLs.
  11. Removing Frequent or Rare Words (a sketch is included at the end of this post).

Let’s demonstrate each of these techniques step by step using Python:

Setup

First, let’s define a sample text and import necessary libraries.

# Sample text
sample_text = "Hello World! I'm learning about NLP. It's fascinating. Visit https://example.com for more info. 🙂"

# Importing necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from bs4 import BeautifulSoup
import re
import string

# Downloading necessary datasets
nltk.download('stopwords')
nltk.download('wordnet')

sample_text


RESULT


"Hello World! I'm learning about NLP. It's fascinating. Visit https://example.com for more info. 🙂"

It seems we can’t download the necessary datasets for stopwords and wordnet due to lack of internet access in this environment. However, I can still demonstrate the techniques using custom stop words and other functions.

Let’s proceed with the preprocessing steps:

1. Lowercasing

Convert the sample text to lowercase.

# Convert the sample text to lowercase
lowercased_text = sample_text.lower()
lowercased_text


RESULT


"hello world! i'm learning about nlp. it's fascinating. visit https://example.com for more info. 🙂"

The text has been converted to lowercase.

2. Tokenization

Let’s tokenize the text, i.e., split it into individual words or tokens.

# Tokenizing using Python's string methods
tokens = lowercased_text.split()
tokens


RESULT


['hello',
'world!',
"i'm",
'learning',
'about',
'nlp.',
"it's",
'fascinating.',
'visit',
'https://example.com',
'for',
'more',
'info.',
'🙂']

The text has been tokenized into individual words.
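For reference, when NLTK’s data can be downloaded, the word_tokenize function (which requires the punkt tokenizer models) handles punctuation more gracefully, splitting it into separate tokens. A minimal sketch, not run here:

# NLTK-based tokenization (requires the punkt models)
nltk.download('punkt')
from nltk.tokenize import word_tokenize
word_tokenize(lowercased_text)
# e.g., ['hello', 'world', '!', 'i', "'m", 'learning', ...] — punctuation becomes separate tokens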

3. Removing Punctuation

Now, we’ll strip each token of punctuation marks.

# Removing punctuation from the tokens
tokens_without_punctuation = [token.translate(str.maketrans('', '', string.punctuation)) for token in tokens]
tokens_without_punctuation


RESULT


['hello',
'world',
'im',
'learning',
'about',
'nlp',
'its',
'fascinating',
'visit',
'httpsexamplecom',
'for',
'more',
'info',
'🙂']

The punctuation has been removed from the tokens.

4. Removing Stop Words

Stop words are common words that don’t carry significant meaning on their own. In a typical setting, we’d use NLTK’s list of stop words. However, due to constraints, I’ll define a small custom list of stop words to demonstrate.

# Custom list of stop words
custom_stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
"yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers",
"herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves",
"what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are",
"was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does",
"did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until",
"while", "of", "at", "by", "for", "with", "about", "against", "between", "into",
"through", "during", "before", "after", "above", "below", "to", "from", "up", "down",
"in", "out", "on", "off", "over", "under", "again", "further", "then", "once"]
# Removing stop words from the tokens
tokens_without_stopwords = [token for token in tokens_without_punctuation if token not in custom_stopwords]
tokens_without_stopwords


RESULT


['hello',
'world',
'im',
'learning',
'nlp',
'fascinating',
'visit',
'httpsexamplecom',
'more',
'info',
'🙂']

Stop words have been removed from the tokens.
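For completeness, with the stopwords corpus downloaded, the standard NLTK approach would look like this (a sketch, not run here):

# NLTK-based stop word removal (requires the stopwords corpus)
stop_words = set(stopwords.words('english'))
[token for token in tokens_without_punctuation if token not in stop_words]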

5. Stemming

Stemming is a technique to reduce words to their root or base form. For instance, “running” becomes “run”. We’ll use the Porter Stemmer for this demonstration.

# Initialize Porter Stemmer
stemmer = PorterStemmer()

# Apply stemming to the tokens
stemmed_tokens = [stemmer.stem(token) for token in tokens_without_stopwords]
stemmed_tokens


RESULT


['hello',
'world',
'im',
'learn',
'nlp',
'fascin',
'visit',
'httpsexamplecom',
'more',
'info',
'🙂']

The tokens have been stemmed, reducing them to their root forms. For instance, “learning” has been stemmed to “learn”. Note that stems aren’t always valid words — “fascinating” became “fascin” — which is exactly the shortcoming lemmatization addresses.

6. Lemmatization

Lemmatization is similar to stemming but ensures the root word belongs to the language. For instance, “running” would become “run” and “better” might become “good”. However, due to our current constraints, I’ll skip the demonstration of this technique, as it requires the WordNetLemmatizer from NLTK.
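For reference, once nltk.download('wordnet') succeeds, the calls would look like the sketch below. Note that WordNetLemmatizer treats every word as a noun by default, so a part-of-speech hint is needed for these examples:

# Lemmatization sketch (requires the wordnet data)
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('running', pos='v')  # 'run' (verb)
lemmatizer.lemmatize('better', pos='a')   # 'good' (adjective)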

7. Removing HTML Tags

If the text comes from web sources, it might contain HTML tags. We’ll use the Beautiful Soup library to demonstrate removing HTML tags. Let’s assume our sample text contains some HTML content.

# Sample text with HTML tags
html_text = "<html><head></head><body><p>Hello World!</p><br><p>I'm learning about NLP.</p></body></html>"

# Removing HTML tags using Beautiful Soup
soup = BeautifulSoup(html_text, "html.parser")
text_without_html = soup.get_text()
text_without_html


RESULT


"Hello World!I'm learning about NLP."

The HTML tags have been successfully removed from the text using Beautiful Soup.
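Note that the two sentences run together, because get_text() simply concatenates the text of adjacent tags. Beautiful Soup’s get_text() accepts a separator argument that avoids this:

# Joining the extracted text fragments with spaces
soup.get_text(separator=" ")
# "Hello World! I'm learning about NLP."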

8. Removing Numbers or Converting Them to Words

Depending on the use case, you might want to remove numbers from the text or convert them into words. Let’s demonstrate both approaches.

# Sample text with numbers
text_with_numbers = "I have 5 apples and 3 oranges."

# Removing numbers
text_without_numbers = re.sub(r'\d+', '', text_with_numbers)
text_without_numbers


RESULT


'I have  apples and  oranges.'

Numbers have been removed from the text, leaving double spaces behind where the digits were; a follow-up whitespace cleanup would collapse these.

If you prefer to convert numbers to words, you’d typically use a library like word2number or inflect. However, due to our environment constraints, I won’t be able to demonstrate this at the moment.
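For reference, here is a sketch of what the conversion would look like with the inflect package (assuming it is installed; not run here):

# Converting numbers to words with inflect
import inflect

p = inflect.engine()
re.sub(r'\d+', lambda m: p.number_to_words(m.group()), text_with_numbers)
# 'I have five apples and three oranges.'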

9. Handling Emojis and Special Characters

In modern texts, especially from social media sources, you might encounter emojis and other special characters. Depending on your goals, you might want to remove or translate them. For simplicity, I’ll demonstrate removing them.

# Sample text with emojis
text_with_emoji = "I love Python! 🐍🙂"

# Removing emojis and other special characters
text_without_emoji = text_with_emoji.encode('ascii', 'ignore').decode('ascii')
text_without_emoji


RESULT


'I love Python! '

The emojis have been removed from the text.

10. Removing URLs

Let’s demonstrate how to remove URLs from the text.
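A simple regular expression covers most cases. The pattern below is a minimal sketch that matches anything starting with http or www; it is illustrative and won’t catch every URL format:

# Sample text with a URL
text_with_url = "Visit https://example.com for more info."

# Removing URLs with a regular expression
text_without_url = re.sub(r'http\S+|www\.\S+', '', text_with_url)
text_without_url
# 'Visit  for more info.'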

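11. Removing Frequent or Rare Words

Very frequent words act like corpus-specific stop words, while very rare words (often typos or one-off tokens) mostly add noise. The sketch below uses collections.Counter with illustrative thresholds — treating the single most common word as “frequent” and words that occur only once as “rare” — which you would tune for a real corpus.

# Hypothetical token list standing in for a larger corpus
corpus_tokens = "the cat sat on the mat the cat ran".split()

from collections import Counter
counts = Counter(corpus_tokens)

# Illustrative thresholds: the top word counts as "frequent", hapaxes (count == 1) as "rare"
frequent_words = {word for word, _ in counts.most_common(1)}          # {'the'}
rare_words = {word for word, count in counts.items() if count == 1}   # {'sat', 'on', 'mat', 'ran'}

filtered_tokens = [w for w in corpus_tokens if w not in frequent_words and w not in rare_words]
filtered_tokens
# ['cat', 'cat']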