Exploring Text Processing in Python with 're': Basic Examples and Data-driven Insights

Example 1: Removing URLs

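import re

text_with_url = "Check out our website: www.example.com for more information"

clean_text = re.sub(r'(?:https?://|www\.)\S+', '', text_with_url)

print(clean_text)

text_with_url: Contains the input text with a URL.
re.sub(r'(?:https?://|www\.)\S+', '', text_with_url): This uses the re.sub() function to replace any run of non-whitespace characters (\S+) that starts with ‘http://’, ‘https://’, or ‘www.’ with an empty string (''), removing the URL from the text. A pattern of just http\S+ would leave this input unchanged, because the URL here starts with ‘www.’ rather than ‘http’.
print(clean_text): Prints the text after removing the URL. The spaces on either side of the URL remain, so the output contains a double space.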
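
The same idea can be wrapped in a small helper that also tidies the whitespace left behind. This is a minimal sketch: the names URL_PATTERN and strip_urls are hypothetical, and the choice to accept http://, https://, and bare www. prefixes is an illustrative assumption rather than a complete URL grammar.

```python
import re

# Hypothetical pattern: http://, https://, or a bare "www." prefix,
# followed by everything up to the next whitespace character.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def strip_urls(text):
    """Remove URL-like tokens, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_urls).strip()

print(strip_urls("Check out our website: www.example.com for more information"))
# Check out our website: for more information
print(strip_urls("Docs live at https://example.com/docs today"))
# Docs live at today
```

Because \S+ runs to the next whitespace, punctuation attached to the end of a URL (such as a closing parenthesis) is removed along with it.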

Example 2: Removing hashtags and mentions

text_with_hashtag = "Excited for the #weekend! @friend"

clean_text = re.sub(r'#\w+', '', text_with_hashtag)

clean_text = re.sub(r'@\w+', '', clean_text)

print(clean_text)

text_with_hashtag: Contains the input text with a hashtag and a mention.
re.sub(r'#\w+', '', text_with_hashtag): This removes hashtags by replacing ‘#’ followed by one or more word characters (\w+) with an empty string. \w already matches letters, digits, and the underscore, so [\w_]+ and \w+ are equivalent.
re.sub(r'@\w+', '', clean_text): This removes mentions the same way, replacing ‘@’ followed by one or more word characters with an empty string.
print(clean_text): Prints the text after removing hashtags and mentions. Punctuation and spaces around the removed tokens stay in place, so the output is "Excited for the ! " with a trailing space.
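
Both removals can also be done in one pass with a character class for the prefix. A minimal sketch, with TAG_OR_MENTION as a hypothetical name chosen for illustration:

```python
import re

# One pattern for both prefixes: '#' or '@' followed by word characters.
TAG_OR_MENTION = re.compile(r'[#@]\w+')

text = "Excited for the #weekend! @friend"
cleaned = TAG_OR_MENTION.sub('', text)
print(repr(cleaned))
# 'Excited for the ! '

# findall() with the same pattern lists what would be removed.
removed = TAG_OR_MENTION.findall(text)
print(removed)
# ['#weekend', '@friend']
```

Note that this pattern would also cut into an email address: in support@example.com it matches ‘@example’, so apply it only to text where ‘@’ marks a mention.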

Example 3: Removing special characters and punctuation

text_with_special_chars = "Hello, world! How's everything?"

clean_text = re.sub(r'[^A-Za-z\s]', '', text_with_special_chars)

print(clean_text)

text_with_special_chars: Contains the input text with special characters and punctuation.
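re.sub(r'[^A-Za-z\s]', '', text_with_special_chars): This removes every character that is neither an ASCII letter nor whitespace. The leading ^ negates the character class, so [^A-Za-z\s] matches anything outside A-Z, a-z, and \s, which includes digits and accented letters as well as punctuation.
print(clean_text): Prints the text after removing special characters and punctuation: Hello world Hows everything. The apostrophe in "How's" is removed along with the other punctuation.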
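
Two variations on the character class, sketched under the assumption that you want to keep contractions or non-ASCII letters. The variable names and the second sample string are illustrative, not from the original example.

```python
import re

text = "Hello, world! How's everything?"

# Variant 1: also allow apostrophes so contractions survive.
keep_apostrophes = re.sub(r"[^A-Za-z\s']", '', text)
print(keep_apostrophes)
# Hello world How's everything

# Variant 2: [^\w\s] removes punctuation but keeps digits, underscores,
# and non-ASCII letters, because \w is Unicode-aware for str patterns.
unicode_text = "Caf\u00e9 costs 5 euros!"
keep_unicode = re.sub(r'[^\w\s]', '', unicode_text)
print(keep_unicode)
# Café costs 5 euros
```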

Example 4: Extracting email addresses

text_with_email = "Contact us at support@example.com for assistance"

email = re.search(r'[\w\.-]+@[\w\.-]+', text_with_email)

if email:
    print(email.group())

text_with_email: Contains the input text with an email address.
re.search(r'[\w\.-]+@[\w\.-]+', text_with_email): This searches for a pattern of word characters, dots, and hyphens ([\w\.-]+) before the ‘@’ symbol, followed by another pattern of word characters, dots, and hyphens after the ‘@’ symbol.
if email:: Checks if a match was found.
print(email.group()): Prints the matched email address.
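
re.search() stops at the first match. To collect every address, re.findall() can be used with the same pattern. The sketch below also shows one possible tighter pattern; both the sample text and the stricter pattern are assumptions for illustration, and neither pattern is a full email validator.

```python
import re

text = "Contact support@example.com or sales@example.org. Thanks!"

# The permissive pattern also swallows a sentence-ending dot.
loose = re.findall(r'[\w.-]+@[\w.-]+', text)
print(loose)
# ['support@example.com', 'sales@example.org.']

# Requiring each dot in the domain to be followed by more characters
# leaves the trailing dot out of the match.
strict = re.findall(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+', text)
print(strict)
# ['support@example.com', 'sales@example.org']
```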

Example 5: Splitting text into sentences

text_with_sentences = "This is the first sentence. And here's the second one."

sentences = re.split(r'[.!?]', text_with_sentences)

print(sentences)

text_with_sentences: Contains the input text with multiple sentences.
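re.split(r'[.!?]', text_with_sentences): This splits the text at each ‘.’, ‘!’, or ‘?’. The delimiters themselves are dropped from the result.
print(sentences): Prints the list of pieces: ['This is the first sentence', " And here's the second one", '']. The second piece keeps its leading space, and the final period produces a trailing empty string.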
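
If the punctuation should stay attached to each sentence and no empty strings are wanted, one option is to split on the whitespace that follows a sentence-ending character. This is a minimal sketch of that approach, not a general sentence tokenizer:

```python
import re

text = "This is the first sentence. And here's the second one."

# The lookbehind (?<=[.!?]) requires sentence-ending punctuation just before
# the split point without consuming it; \s+ consumes the gap between sentences.
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]
print(sentences)
# ['This is the first sentence.', "And here's the second one."]
```

Abbreviations such as "Dr. Smith" would still be split after the period, so real-world text usually needs more than a single pattern.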

Example 6: Finding phone numbers

text_with_phone = "Call us at 123-456-7890 for inquiries"

phone_number = re.search(r'\d{3}-\d{3}-\d{4}', text_with_phone)

if phone_number:
    print(phone_number.group())

text_with_phone: Contains the input text with a phone number.
re.search(r'\d{3}-\d{3}-\d{4}', text_with_phone): This searches for a pattern of three digits (\d{3}), followed by a hyphen, another pattern of three digits, another hyphen, and finally, a pattern of four digits.
if phone_number:: Checks if a match was found.
print(phone_number.group()): Prints the matched phone number.
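
The pattern above only accepts hyphens. A sketch of a slightly more tolerant version, assuming the numbers may also be separated by dots or spaces (the sample text is made up for illustration):

```python
import re

text = "Call 123-456-7890, 123.456.7890 or 123 456 7890 for inquiries"

# [-.\s] accepts a hyphen, dot, or whitespace between the digit groups;
# \b keeps the match from starting or ending inside a longer run of digits.
phones = re.findall(r'\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b', text)
print(phones)
# ['123-456-7890', '123.456.7890', '123 456 7890']
```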

Each example uses a regular expression to remove, extract, or split text based on a pattern. The re module provides the tools for working with regular expressions in Python.

You can run these examples in a Python environment to see how the re module is used for various text processing tasks. Remember that regular expressions are a powerful tool for text manipulation, and you can adapt these examples to your specific needs and dataset.

Here are a few more basic examples using Python’s re module with sample data:

import re

# Example 1: Matching words starting with a specific letter
text = "Apples are red, bananas are yellow."
matches = re.findall(r'\b[aA]\w+', text)
print(matches)

# Example 2: Extracting numbers from a string
text_with_numbers = "The price is $19.99 and the quantity is 25."
numbers = re.findall(r'\d+\.\d+|\d+', text_with_numbers)
print(numbers)

# Example 3: Removing consecutive duplicate words
text_with_duplicates = "This is is a test test sentence."
clean_text = re.sub(r'\b(\w+)\s+\1\b', r'\1', text_with_duplicates)
print(clean_text)

# Example 4: Replacing multiple spaces with a single space
text_with_spaces = "   Too     many    spaces."
clean_text = re.sub(r'\s+', ' ', text_with_spaces)
print(clean_text)

# Example 5: Extracting domain names from URLs
urls = ["https://www.example.com", "http://blog.example.net"]
domains = [re.search(r'https?://(www\.)?([\w.-]+)', url).group(2) for url in urls]
print(domains)

# Example 6: Splitting text using a delimiter
text_with_delimiter = "apple,banana,cherry,orange"
items = re.split(r',', text_with_delimiter)
print(items)

Feel free to modify and experiment with these examples to gain a better understanding of how the re module can be used for various text manipulation tasks. Regular expressions offer a versatile way to handle text processing challenges, and you can adjust the patterns to suit your specific requirements.

Let’s go through each of the examples step by step:

Example 1: Matching words starting with a specific letter

text = "Apples are red, bananas are yellow."

matches = re.findall(r'\b[aA]\w+', text)

print(matches)

In this example, we want to find words that start with either lowercase ‘a’ or uppercase ‘A’. Here’s the breakdown:

\b: This is a word boundary anchor. It matches the position between a word character and a non-word character, and also the start or end of the string when the adjacent character is a word character.
[aA]: This character class matches either ‘a’ or ‘A’.
\w+: This matches one or more word characters (letters, digits, or underscores).

So, the pattern \b[aA]\w+ finds words that start with ‘a’ or ‘A’ and contain at least one more word character, and re.findall() returns them as a list. For this text the result is ['Apples', 'are', 'are'].
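
A variation on this pattern, sketched with a slightly longer sample sentence (an assumption for illustration): the re.IGNORECASE flag can stand in for the [aA] class, and \w* instead of \w+ also admits the one-letter word "a".

```python
import re

text = "Apples are red, bananas are yellow, and a pear is green."

# re.IGNORECASE makes 'a' match both cases;
# \w* (zero or more) lets the single-letter word "a" match as well.
matches = re.findall(r'\ba\w*', text, flags=re.IGNORECASE)
print(matches)
# ['Apples', 'are', 'are', 'and', 'a']
```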

Example 2: Extracting numbers from a string

text_with_numbers = "The price is $19.99 and the quantity is 25."

numbers = re.findall(r'\d+\.\d+|\d+', text_with_numbers)

print(numbers)

Here, we’re looking for numbers (both integers and decimals) within the given string:

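\d+\.\d+: This matches one or more digits, a literal decimal point, and one or more digits (a decimal number).
\d+: This matches one or more digits (an integer).

The alternation \d+\.\d+|\d+ tries the decimal form first, so 19.99 is matched as a single number rather than as 19 and 99. re.findall() returns the matches as strings: ['19.99', '25'].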
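
Since re.findall() returns strings, a conversion step is needed before doing arithmetic. The sketch below uses a single pattern with an optional fractional part, which behaves the same as the alternation on this input; the names tokens and values are illustrative.

```python
import re

text_with_numbers = "The price is $19.99 and the quantity is 25."

# \d+(?:\.\d+)? means: digits, then optionally a dot followed by more digits.
# The group is non-capturing so findall() returns the whole match.
tokens = re.findall(r'\d+(?:\.\d+)?', text_with_numbers)
values = [float(t) if '.' in t else int(t) for t in tokens]

print(tokens)  # ['19.99', '25']
print(values)  # [19.99, 25]
```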

Example 3: Removing consecutive duplicate words

text_with_duplicates = "This is is a test test sentence."

clean_text = re.sub(r'\b(\w+)\s+\1\b', r'\1', text_with_duplicates)

print(clean_text)

In this case, we want to remove consecutive duplicate words in the text:

\b(\w+)\s+\1\b: This pattern captures a word, followed by one or more whitespace characters, and then the same word again (matched by the backreference \1).
re.sub() replaces each match with just the captured word, removing the duplicate. The output is "This is a test sentence." The match is case-sensitive, and each match covers only one pair of words.
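
When a word is repeated three or more times, the pair-based pattern leaves one duplicate behind. A minimal sketch of a variant that consumes the whole run, using a sample sentence made up for illustration:

```python
import re

text = "This is is is a test test sentence."

# The original pattern removes one duplicate per match.
pair_only = re.sub(r'\b(\w+)\s+\1\b', r'\1', text)
print(pair_only)
# This is is a test sentence.

# Repeating the "whitespace + same word" part with (?:...)+ consumes the whole run.
whole_run = re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)
print(whole_run)
# This is a test sentence.
```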

Example 4: Replacing multiple spaces with a single space

text_with_spaces = "   Too     many    spaces."

clean_text = re.sub(r'\s+', ' ', text_with_spaces)

print(clean_text)

Here, we’re replacing each run of whitespace with a single space:

\s+: This matches one or more whitespace characters (spaces, tabs, newlines).
re.sub() replaces every such run with a single space. The leading run is reduced to one space rather than removed, so the output is " Too many spaces."; call .strip() on the result to drop it.
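
The sketch below shows the effect on leading whitespace and, for comparison, a way to get the trimmed result without the re module:

```python
import re

text_with_spaces = "   Too     many    spaces."

collapsed = re.sub(r'\s+', ' ', text_with_spaces)
print(repr(collapsed))          # ' Too many spaces.'
print(repr(collapsed.strip()))  # 'Too many spaces.'

# Without re: split() with no arguments splits on any run of whitespace
# and ignores leading and trailing whitespace.
joined = ' '.join(text_with_spaces.split())
print(joined)                   # Too many spaces.
```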

Example 5: Extracting domain names from URLs

urls = ["https://www.example.com", "http://blog.example.net"]

domains = [re.search(r'https?://(www\.)?([\w.-]+)', url).group(2) for url in urls]

print(domains)

This example aims to extract domain names from URLs:

https?: This matches “http” or “https”.
(www\.)?: This optionally matches “www.”.
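([\w.-]+): This captures one or more word characters, dots, or hyphens (the host name that follows the optional “www.”).

re.search() finds the first match in each URL, and .group(2) returns the text captured by the second group: ['example.com', 'blog.example.net']. If a URL does not match, re.search() returns None, and calling .group(2) on it raises an AttributeError.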
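
A defensive version of the same extraction, sketched with a hypothetical helper named extract_domain and an extra non-matching URL added for illustration. The "www." group is made non-capturing here, so the host is group 1.

```python
import re

# (?:www\.)? is non-capturing, so the host name is the only capture group.
DOMAIN_PATTERN = re.compile(r'https?://(?:www\.)?([\w.-]+)')

def extract_domain(url):
    """Return the host part of an http(s) URL, or None if the pattern does not match."""
    match = DOMAIN_PATTERN.search(url)
    return match.group(1) if match else None

urls = ["https://www.example.com", "http://blog.example.net", "ftp://files.example.org"]
domains = [extract_domain(u) for u in urls]
print(domains)
# ['example.com', 'blog.example.net', None]
```

For general URL handling, the standard library's urllib.parse.urlparse(url).hostname is an alternative; it returns the full host name, including any "www." prefix.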

Example 6: Splitting text using a delimiter

text_with_delimiter = "apple,banana,cherry,orange"

items = re.split(r',', text_with_delimiter)

print(items)

In this case, we’re splitting the text using a comma as the delimiter:

,: This matches the comma character.

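re.split() divides the text into separate items based on the comma delimiter and returns a list of items. For a single fixed delimiter like this, the string method text_with_delimiter.split(',') gives the same result.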
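
re.split() becomes more useful when the delimiter varies. A minimal sketch, assuming the items may be separated by commas, semicolons, or pipes with optional surrounding spaces (the sample string is made up for illustration):

```python
import re

text = "apple, banana;cherry | orange"

# [,;|] matches any one of the three delimiters (| is literal inside a class);
# \s* on either side absorbs the surrounding whitespace.
items = re.split(r'\s*[,;|]\s*', text)
print(items)
# ['apple', 'banana', 'cherry', 'orange']
```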

These examples showcase different ways you can use the re module to manipulate and extract information from text data. Regular expressions provide a powerful toolset for text processing tasks.

Best IT Training Institutes in Chennai with Placement | DeepNeuron