Text preprocessing is a fundamental step in most natural language processing (NLP) tasks. It transforms raw text into a format better suited to the task at hand, such as information retrieval, text classification, or sentiment analysis. Here are some common text preprocessing techniques:
- Lowercasing
- Tokenization
- Stopword Removal
- Stemming
- Lemmatization
- Removing Punctuation
- Removing HTML tags
- Removing Accented Characters
- Expanding Contractions
- Spell Checking
Let’s explore these techniques step by step with data and Python code:
1. Lowercasing
Lowercasing converts every character in the text to lowercase. This keeps the text consistent, so that variants such as “Text”, “TEXT”, and “text” are treated as the same word rather than as separate vocabulary entries.
text = "Text Preprocessing is FUN!"
text_lower = text.lower()
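As a small illustration of why lowercasing merges word variants, the sketch below compares `str.lower()` with `str.casefold()`, a more aggressive built-in intended for caseless matching. The word list is made up for this example.

```python
# Hypothetical sample: three spellings of the same German word.
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps the "ß", so two distinct forms remain.
lowered = {w.lower() for w in words}

# casefold() maps "ß" to "ss", so all three collapse into one form.
folded = {w.casefold() for w in words}

print(lowered)
print(folded)
```

For plain English text the two methods usually give the same result; `casefold()` mainly matters for multilingual data.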
2. Tokenization
Tokenization splits the text into sentences or words. This helps in understanding the structure of the text and is a precursor to many other preprocessing steps.
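import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # one-time download of the tokenizer models (newer NLTK releases may ask for 'punkt_tab')
tokens = word_tokenize(text)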
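To make the idea concrete without any downloads, here is a minimal regex-based sketch of word and sentence splitting. It is not how NLTK tokenizes, and it is much cruder (for instance, it splits “isn't” into three pieces). The function names are hypothetical.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single
    # non-space, non-word character (such as punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows ".", "!" or "?".
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Text Preprocessing is FUN! It is also useful."
print(simple_sent_tokenize(sample))
# ['Text Preprocessing is FUN!', 'It is also useful.']
print(simple_word_tokenize("Text Preprocessing is FUN!"))
# ['Text', 'Preprocessing', 'is', 'FUN', '!']
```

A rule this simple mishandles abbreviations such as “Dr.”, which is one reason to prefer a trained tokenizer in practice.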
3. Stopword Removal
Stopwords are very common words, such as “and”, “the”, and “is”, that carry little meaning on their own and are often removed before analysis.
from nltk.corpus import stopwords  # requires the corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
4. Stemming
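Stemming reduces a word to its stem by stripping suffixes with heuristic rules. For example, “running” becomes “run”. The result is not always a dictionary word: the Porter stemmer turns “studies” into “studi”.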
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
5. Lemmatization
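Lemmatization, like stemming, reduces a word to a base form, but it relies on a vocabulary and the word’s part of speech so that the result is a valid dictionary word (the lemma). Note that NLTK’s WordNetLemmatizer treats every word as a noun unless a part of speech is passed, for example lemmatizer.lemmatize("running", pos="v").
from nltk.stem import WordNetLemmatizer  # requires the corpus: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]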
6. Removing Punctuation
import string
# Map every ASCII punctuation character to None, i.e. delete it
no_punctuation = text.translate(str.maketrans('', '', string.punctuation))
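`string.punctuation` covers only ASCII punctuation, so curly quotes or an ellipsis character would survive the snippet above. One possible way to handle those, sketched here with the standard `unicodedata` module, is to drop every character whose Unicode category starts with “P”. The function name is hypothetical.

```python
import unicodedata

def remove_unicode_punctuation(text):
    # Unicode general categories beginning with "P" are punctuation
    # (Po, Pi, Pf, Pd, ...). Symbols such as "$" are category "S"
    # and are therefore kept by this sketch.
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

sample = "\u201cHello,\u201d she said\u2026 fine!"
print(remove_unicode_punctuation(sample))
# Hello she said fine
```

Unlike the `string.punctuation` approach, this keeps characters such as “$” and “+”, which Unicode classifies as symbols, not punctuation.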
7. Removing HTML tags
This is especially useful when dealing with web data.
import re
html_text = "<html><body><p>Text Preprocessing is FUN!</p></body></html>"
clean_text = re.sub('<[^<]+?>', '', html_text)
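A regular expression works for simple markup but leaves entities such as `&amp;` untouched. As an alternative sketch, the standard library's `html.parser` can collect only the text nodes and decode entities along the way. The class and function names are hypothetical, and content inside `<script>` or `<style>` would still need separate handling.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(strip_tags("<p>Text &amp; <b>Preprocessing</b> is FUN!</p>"))
# Text & Preprocessing is FUN!
```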
8. Removing Accented Characters
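Converting accented characters to their closest ASCII equivalents (for example, “é” to “e”) helps standardize the text.
import unidecode  # third-party package: pip install Unidecode
accented_text = "résumé"
unaccented_text = unidecode.unidecode(accented_text)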
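If installing a third-party package is not an option, a similar effect can be sketched with the standard `unicodedata` module: decompose each character, then drop the combining marks. This is a different technique from unidecode and it does not transliterate letters that have no decomposition (for example “ø”). The function name is hypothetical.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() is non-zero for combining marks, which we drop.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("résumé"))      # resume
print(strip_accents("naïve café"))  # naive cafe
```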
9. Expanding Contractions
For example, “isn’t” becomes “is not”.
def expand_contractions(text):
    contractions_dict = {
        "isn't": "is not",
        "aren't": "are not",
        # add more contractions as required
    }
    for contraction, expansion in contractions_dict.items():
        text = text.replace(contraction, expansion)
    return text
expanded_text = expand_contractions("This isn't a drill!")
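Plain `str.replace` is case-sensitive and misses contractions typed with a curly apostrophe (’). The variant below is one possible refinement, not a complete solution: it normalizes the apostrophe, matches whole words case-insensitively, and preserves a leading capital. The dictionary is a small illustrative sample and the names are hypothetical.

```python
import re

# Illustrative sample only; a real list would be much longer.
CONTRACTIONS = {
    "isn't": "is not",
    "aren't": "are not",
    "can't": "cannot",
}

# One alternation pattern, anchored at word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions_regex(text):
    # Treat the curly apostrophe (U+2019) like a straight one.
    text = text.replace("\u2019", "'")

    def _swap(match):
        found = match.group(0)
        expansion = CONTRACTIONS[found.lower()]
        # Keep the capital letter at the start of a sentence.
        if found[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return _PATTERN.sub(_swap, text)

print(expand_contractions_regex("This isn\u2019t a drill!"))
# This is not a drill!
```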
10. Spell Checking
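There are various libraries available for spell checking. One such library is pyspellchecker (installed with pip install pyspellchecker).
from spellchecker import SpellChecker
spell = SpellChecker()
# unknown() returns the words that are not found in the dictionary
misspelled = spell.unknown(['some', 'misspeled', 'words'])
# candidates() returns the possible corrections for a word
corrected_words = {word: spell.candidates(word) for word in misspelled}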
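To show the general idea of suggesting corrections without an extra dependency, the sketch below uses `difflib.get_close_matches` from the standard library against a tiny, made-up vocabulary. This is string-similarity matching, not the algorithm pyspellchecker uses, and a realistic vocabulary would be far larger.

```python
from difflib import get_close_matches

# Hypothetical vocabulary used only for this illustration.
VOCABULARY = ["some", "misspelled", "words", "spelling", "text"]

def suggest(word, vocabulary=VOCABULARY, n=3):
    # Known words need no correction.
    if word in vocabulary:
        return [word]
    # Return up to n vocabulary entries that are at least 80% similar.
    return get_close_matches(word, vocabulary, n=n, cutoff=0.8)

print(suggest("misspeled"))
# ['misspelled']
```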