Text preprocessing is a fundamental step in most natural language processing (NLP) tasks. It transforms raw text into a form better suited to the task at hand, whether that’s information retrieval, text classification, or sentiment analysis. Here are some common text preprocessing techniques:

  1. Lowercasing
  2. Tokenization
  3. Stopword Removal
  4. Stemming
  5. Lemmatization
  6. Removing Punctuation
  7. Removing HTML tags
  8. Removing Accented Characters
  9. Expanding Contractions
  10. Spell Checking

Let’s explore these techniques step by step with data and Python code:

1. Lowercasing

This involves converting all characters in the text to lowercase. It keeps the vocabulary consistent and avoids treating “Text” and “text” as two different words.

text = "Text Preprocessing is FUN!"
text_lower = text.lower()

2. Tokenization

Tokenization splits the text into sentences or words. This helps in understanding the structure of the text and is a precursor to many other preprocessing steps.

from nltk.tokenize import word_tokenize

# Requires the NLTK "punkt" tokenizer models: nltk.download('punkt')
tokens = word_tokenize(text)  # ['Text', 'Preprocessing', 'is', 'FUN', '!']
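Since tokenization also covers splitting text into sentences, here is a quick sketch using NLTK’s sent_tokenize, which relies on the same punkt resource:

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize("Text Preprocessing is FUN! It has many steps.")
# ['Text Preprocessing is FUN!', 'It has many steps.']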

3. Stopword Removal

Stopwords are very common words that carry little meaning on their own and are usually removed from texts. They include words like “and”, “the”, “is”, etc.

from nltk.corpus import stopwords

# Requires the NLTK stopword lists: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
# ['Text', 'Preprocessing', 'FUN', '!']

4. Stemming

Stemming reduces a word to its base or root form by stripping common suffixes. For example, “running” becomes “run”.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
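Porter stemming is purely rule-based, so its output is not always a real word; a quick sketch of both cases:

stemmer.stem("running")  # 'run'
stemmer.stem("studies")  # 'studi' (stems need not be dictionary words)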

5. Lemmatization

Lemmatization, like stemming, reduces a word to its base form, but it uses a vocabulary and the word’s part of speech to return a valid dictionary word (the lemma). For example, lemmatizing “better” as an adjective yields “good”, a connection no suffix-stripping stemmer can make.

from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
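One caveat: WordNetLemmatizer treats every token as a noun unless you pass a part-of-speech tag, so verbs are often left unchanged. A small sketch of the difference:

lemmatizer.lemmatize("running")           # 'running' (treated as a noun)
lemmatizer.lemmatize("running", pos="v")  # 'run'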

6. Removing Punctuation
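
Punctuation marks usually add little signal for tasks like classification, so they are often stripped out.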

import string

no_punctuation = text.translate(str.maketrans('', '', string.punctuation))  # "Text Preprocessing is FUN"

7. Removing HTML tags

This is especially useful when dealing with web data.

import re

html_text = "<html><body><p>Text Preprocessing is FUN!</p></body></html>"
clean_text = re.sub(r'<[^<]+?>', '', html_text)  # "Text Preprocessing is FUN!"
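A regex like this works for simple, well-formed markup; for messier real-world HTML, a dedicated parser is more reliable. A minimal sketch with BeautifulSoup (assumes the beautifulsoup4 package is installed):

from bs4 import BeautifulSoup

clean_text = BeautifulSoup(html_text, "html.parser").get_text()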

8. Removing Accented Characters

This maps characters like “é” to their closest ASCII equivalents, which helps standardize the text.

import unidecode  # requires the Unidecode package (pip install unidecode)

accented_text = "résumé"
unaccented_text = unidecode.unidecode(accented_text)  # "resume"

9. Expanding Contractions

For example, “isn’t” becomes “is not”.

def expand_contractions(text):
    contractions_dict = {
        "isn't": "is not",
        "aren't": "are not",
        # add more contractions as required
    }
    # Simple literal replacement; note that this lookup is case-sensitive
    for contraction, expansion in contractions_dict.items():
        text = text.replace(contraction, expansion)
    return text

expanded_text = expand_contractions("This isn't a drill!")  # "This is not a drill!"
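For broader coverage than a hand-built dictionary, the contractions package on PyPI maintains a large mapping of common contractions and their expansions.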

10. Spell Checking

There are various libraries available for spell checking. One such library is pyspellchecker.

from spellchecker import SpellChecker

spell = SpellChecker()
misspelled = spell.unknown(['some', 'misspeled', 'words'])  # {'misspeled'}
corrected_words = {word: spell.candidates(word) for word in misspelled}
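Here spell.candidates returns a set of plausible corrections for each word; spell.correction picks the single most likely one:

best_corrections = {word: spell.correction(word) for word in misspelled}
# e.g. {'misspeled': 'misspelled'}

Finally, here is a minimal sketch chaining several of the steps above into one pipeline. The preprocess function and its particular ordering are illustrative, not a standard recipe:

import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess(raw_html):
    # Strip HTML tags, lowercase, and drop punctuation
    text = re.sub(r'<[^<]+?>', '', raw_html).lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize, remove stopwords, and lemmatize what remains
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text)
            if token not in stop_words]

tokens = preprocess("<p>The cats were running!</p>")
# ['cat', 'running'] (without POS tags the lemmatizer leaves the verb alone)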
