Example 1: Removing URLs
import re

text_with_url = "Check out our website: www.example.com for more information"
clean_text = re.sub(r'(?:https?://|www\.)\S+', '', text_with_url)
print(clean_text)
text_with_url: Contains the input text with a URL.
re.sub(r'(?:https?://|www\.)\S+', '', text_with_url): Uses the re.sub() function to replace any run of non-whitespace characters (\S+) that starts with 'http://', 'https://', or 'www.' with an empty string (''), removing the URL from the text. A pattern of just http\S+ would leave this sample unchanged, because its URL starts with 'www.' rather than 'http'.
print(clean_text): Prints the text after removing the URL.
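Removing a URL usually leaves a double space behind. The sketch below wraps the idea in a small helper that also tidies the whitespace. The name strip_urls and the exact pattern are illustrative assumptions, not part of the re module, and the pattern will also swallow punctuation that directly follows a URL.

```python
import re

# Hypothetical helper: treats anything starting with http://, https://,
# or www. and running up to the next whitespace as a URL.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def strip_urls(text):
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_urls).strip()

print(strip_urls("Docs: https://example.com/guide and www.example.com today"))
```

Compiling the pattern once with re.compile() is a convenience when the same pattern is applied to many strings.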
Example 2: Removing hashtags and mentions
text_with_hashtag = "Excited for the #weekend! @friend"
clean_text = re.sub(r'#[\w_]+', '', text_with_hashtag)
clean_text = re.sub(r'@[\w_]+', '', clean_text)
print(clean_text)
text_with_hashtag: Contains the input text with hashtags and mentions.
re.sub(r'#[\w_]+', '', text_with_hashtag): Removes hashtags by replacing '#' followed by one or more word characters ([\w_]+) with an empty string. Because \w already matches the underscore, the explicit _ in the class is redundant but harmless.
re.sub(r'@[\w_]+', '', clean_text): Removes mentions the same way, replacing '@' followed by one or more word characters with an empty string.
print(clean_text): Prints the text after removing hashtags and mentions.
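The two substitutions can also be done in one pass. This is a minimal sketch, assuming that a hashtag or mention never starts in the middle of a word. The negative lookbehind (?<!\w) is an added safeguard so that the '@example' part of an email address is left alone, and remove_tags is a hypothetical helper name.

```python
import re

# '#' or '@' followed by word characters, but only when the symbol
# is not glued to a preceding word character (e.g. inside an email).
TAG_PATTERN = re.compile(r'(?<!\w)[#@]\w+')

def remove_tags(text):
    """Drop hashtags and mentions, then tidy leftover whitespace."""
    without_tags = TAG_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_tags).strip()

print(remove_tags("Excited for the #weekend! @friend"))
print(remove_tags("Mail support@example.com about #billing"))
```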
Example 3: Removing special characters and punctuation
text_with_special_chars = "Hello, world! How's everything?"
clean_text = re.sub(r'[^A-Za-z\s]', '', text_with_special_chars)
print(clean_text)
text_with_special_chars: Contains the input text with special characters and punctuation.
re.sub(r'[^A-Za-z\s]', '', text_with_special_chars): Removes every character that is neither an uppercase or lowercase ASCII letter (A-Za-z) nor whitespace (\s). Digits and the apostrophe in "How's" are removed as well, so the output is "Hello world Hows everything".
print(clean_text): Prints the text after removing special characters and punctuation.
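If contractions matter for your data, one option is to add the apostrophe to the set of characters that are kept. The variant below is a sketch under that assumption; whether to keep apostrophes or digits depends on the task.

```python
import re

text_with_special_chars = "Hello, world! How's everything?"

# Assumption: apostrophes should survive so "How's" stays intact.
# The pattern is written with double quotes so it can contain a
# literal single quote.
keep_apostrophes = re.sub(r"[^A-Za-z\s']", '', text_with_special_chars)
print(keep_apostrophes)
```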
Example 4: Extracting email addresses
text_with_email = "Contact us at support@example.com for assistance"
email = re.search(r'[\w\.-]+@[\w\.-]+', text_with_email)
if email:
    print(email.group())
text_with_email: Contains the input text with an email address.
re.search(r'[\w\.-]+@[\w\.-]+', text_with_email): Searches for one or more word characters, dots, or hyphens ([\w\.-]+), then the '@' symbol, then another run of word characters, dots, or hyphens. re.search() returns the first match, or None if there is none.
if email: Checks whether a match was found.
print(email.group()): Prints the matched email address.
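re.search() stops at the first match. When a text may contain several addresses, re.findall() returns all of them. The sketch below also requires a dot followed by word characters at the end of the address, which is an added assumption that keeps a sentence-ending period out of the match. It is a rough filter, not a full email validator.

```python
import re

text_with_emails = "Write to support@example.com or sales@example.org."

# Local part, '@', domain, then a final '.something' suffix.
emails = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text_with_emails)
print(emails)
```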
Example 5: Splitting text into sentences
text_with_sentences = "This is the first sentence. And here's the second one."
sentences = re.split(r'[.!?]', text_with_sentences)
print(sentences)
text_with_sentences: Contains the input text with multiple sentences.
re.split(r'[.!?]', text_with_sentences): Splits the text wherever '.', '!', or '?' occurs. The delimiters themselves are dropped, the second piece keeps its leading space, and the final period produces a trailing empty string.
print(sentences): Prints the list of sentences.
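To keep the punctuation and avoid the empty trailing item, one approach is to split on the whitespace that follows a sentence-ending character, using a lookbehind. This is a simple sketch; it will still split incorrectly after abbreviations such as "Dr." and is no substitute for a real sentence tokenizer.

```python
import re

text_with_sentences = "This is the first sentence. And here's the second one."

# (?<=[.!?]) asserts that the split point comes right after '.', '!' or '?'
# without consuming it; \s+ consumes the whitespace between sentences.
sentences = re.split(r'(?<=[.!?])\s+', text_with_sentences)
print(sentences)
```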
Example 6: Finding phone numbers
text_with_phone = "Call us at 123-456-7890 for inquiries"
phone_number = re.search(r'\d{3}-\d{3}-\d{4}', text_with_phone)
if phone_number:
    print(phone_number.group())
text_with_phone: Contains the input text with a phone number.
re.search(r'\d{3}-\d{3}-\d{4}', text_with_phone): Searches for three digits (\d{3}), a hyphen, three more digits, another hyphen, and finally four digits.
if phone_number: Checks whether a match was found.
print(phone_number.group()): Prints the matched phone number.
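Real text often mixes separators. The sketch below assumes ten-digit numbers in a 3-3-4 layout and accepts a hyphen, dot, or whitespace between the groups. Other formats (country codes, parentheses) would need a broader pattern.

```python
import re

text_with_phones = "Call 123-456-7890, 123.456.7890, or 123 456 7890."

# \b keeps the match from starting or ending inside a longer digit run.
phone_numbers = re.findall(r'\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b', text_with_phones)
print(phone_numbers)
```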
Each example uses regular expressions to perform a specific text processing task, such as removing, extracting, or splitting text based on a pattern. The re module provides the tools to work with regular expressions in Python.
You can run these examples in a Python environment to see how the re module is used for each task. Regular expressions are a powerful tool for text manipulation, and you can adapt these examples to your specific needs and dataset.
More examples of the re module with sample data:
import re
# Example 1: Matching words starting with a specific letter
text = "Apples are red, bananas are yellow."
matches = re.findall(r'\b[aA]\w+', text)
print(matches)
# Example 2: Extracting numbers from a string
text_with_numbers = "The price is $19.99 and the quantity is 25."
numbers = re.findall(r'\d+\.\d+|\d+', text_with_numbers)
print(numbers)
# Example 3: Removing consecutive duplicate words
text_with_duplicates = "This is is a test test sentence."
clean_text = re.sub(r'\b(\w+)\s+\1\b', r'\1', text_with_duplicates)
print(clean_text)
# Example 4: Replacing multiple spaces with a single space
text_with_spaces = "   Too    many   spaces."
clean_text = re.sub(r'\s+', ' ', text_with_spaces)
print(clean_text)
# Example 5: Extracting domain names from URLs
urls = ["https://www.example.com", "http://blog.example.net"]
domains = [re.search(r'https?://(www\.)?([\w.-]+)', url).group(2) for url in urls]
print(domains)
# Example 6: Splitting text using a delimiter
text_with_delimiter = "apple,banana,cherry,orange"
items = re.split(r',', text_with_delimiter)
print(items)
Feel free to modify and experiment with these examples to see how the re module can be used for various text manipulation tasks. You can adjust the patterns to suit your specific requirements.
Example 1: Matching words starting with a specific letter
text = "Apples are red, bananas are yellow."
matches = re.findall(r'\b[aA]\w+', text)
print(matches)
In this example, we want to find words that start with either lowercase 'a' or uppercase 'A'. Here's the breakdown:
\b: A word boundary anchor that matches the position between a word character and a non-word character (or the start or end of the string).
[aA]: A character class that matches either 'a' or 'A'.
\w+: Matches one or more word characters (letters, digits, or underscores).
So, the pattern \b[aA]\w+ searches for words starting with 'a' or 'A', and re.findall() returns a list of the matched words: ['Apples', 'are', 'are']. Because \w+ needs at least one more character, a one-letter word such as "a" would not match.
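Instead of listing both cases in a character class, the same search can be written with the re.IGNORECASE flag. The sketch below also swaps \w+ for \w*, an assumed tweak so that the one-letter word "a" counts as a match. The function name words_starting_with_a is hypothetical.

```python
import re

def words_starting_with_a(text):
    """Return words that begin with 'a' or 'A', including the word 'a' itself."""
    # re.IGNORECASE makes the literal 'a' match both cases.
    return re.findall(r'\ba\w*', text, flags=re.IGNORECASE)

print(words_starting_with_a("Apples are red, bananas are yellow."))
print(words_starting_with_a("A cat and an apple"))
```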
Example 2: Extracting numbers from a string
text_with_numbers = "The price is $19.99 and the quantity is 25."
numbers = re.findall(r'\d+\.\d+|\d+', text_with_numbers)
print(numbers)
Here, we're looking for numbers (both integers and decimals) within the given string:
\d+: Matches one or more digits.
\.\d+: Matches a decimal point followed by one or more digits.
The pattern \d+\.\d+|\d+ tries the decimal form first and falls back to a plain integer, so it captures both. re.findall() returns the matches as a list of strings: ['19.99', '25'].
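The order of the alternatives matters, because the regex engine takes the first alternative that matches at a given position. The short sketch below contrasts the two orders and then converts the matched strings to floats, since re.findall() always returns text.

```python
import re

text_with_numbers = "The price is $19.99 and the quantity is 25."

# Decimal alternative first: "19.99" is matched as a whole.
decimal_first = re.findall(r'\d+\.\d+|\d+', text_with_numbers)

# Integer alternative first: "19" matches immediately, so the decimal is split.
integer_first = re.findall(r'\d+|\d+\.\d+', text_with_numbers)

# Matches are strings; convert them when numeric values are needed.
values = [float(n) for n in decimal_first]

print(decimal_first)
print(integer_first)
print(values)
```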
Example 3: Removing consecutive duplicate words
text_with_duplicates = "This is is a test test sentence."
clean_text = re.sub(r'\b(\w+)\s+\1\b', r'\1', text_with_duplicates)
print(clean_text)
In this case, we want to remove consecutive duplicate words in the text:
\b(\w+)\s+\1\b: Captures a word, then matches one or more whitespace characters followed by the same word again (the backreference \1 refers to the captured group).
re.sub() replaces each match with just the first occurrence of the word, removing the duplicate. The result is "This is a test sentence."
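A single pass with this pattern collapses pairs, so a word repeated three times in a row still leaves one duplicate. One possible variation, sketched below, repeats the "whitespace plus same word" part with a non-capturing group so that any run length is reduced in one substitution. The name collapse_repeats is hypothetical, and the comparison is case-sensitive.

```python
import re

def collapse_repeats(text):
    """Reduce any run of the same word (2 or more times) to a single copy."""
    # (?:\s+\1\b)+ matches one or more further copies of the captured word.
    return re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)

sample = "This is is is a test test sentence."
print(re.sub(r'\b(\w+)\s+\1\b', r'\1', sample))  # pairwise pattern
print(collapse_repeats(sample))                   # run-based pattern
```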
Example 4: Replacing multiple spaces with a single space
text_with_spaces = "   Too    many   spaces."
clean_text = re.sub(r'\s+', ' ', text_with_spaces)
print(clean_text)
Here, we're replacing each run of consecutive whitespace with a single space:
\s+: Matches one or more whitespace characters (spaces, tabs, newlines).
re.sub() replaces every such run with a single space. A run at the very start or end of the string is also reduced to one space rather than removed.
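To drop leading and trailing whitespace as well, the substitution can be followed by str.strip(). The sketch below uses an assumed sample string and also shows the plain-Python equivalent, ' '.join(text.split()), which gives the same result without a regular expression.

```python
import re

messy = "  Too    many \t spaces.\n"

# Collapse whitespace runs, then trim both ends.
with_regex = re.sub(r'\s+', ' ', messy).strip()

# str.split() with no arguments splits on whitespace runs and
# ignores leading/trailing whitespace.
with_split = ' '.join(messy.split())

print(with_regex)
print(with_split)
```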
Example 5: Extracting domain names from URLs
urls = ["https://www.example.com", "http://blog.example.net"]
domains = [re.search(r'https?://(www\.)?([\w.-]+)', url).group(2) for url in urls]
print(domains)
This example aims to extract domain names from URLs:
https?: Matches "http" or "https".
(www\.)?: Optionally matches "www." (this is group 1).
([\w.-]+): Captures one or more word characters, dots, or hyphens (the domain name, group 2).
re.search() finds the first match in each URL, and .group(2) extracts the captured domain name, giving ['example.com', 'blog.example.net']. If a string does not match, re.search() returns None and calling .group(2) raises AttributeError.
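Since re.search() returns None when nothing matches, calling .group() directly on its result fails for malformed input. A defensive sketch follows. The helper extract_domain is a hypothetical name, and the optional "www." is made non-capturing with (?:...) so the domain becomes group 1.

```python
import re

DOMAIN_PATTERN = re.compile(r'https?://(?:www\.)?([\w.-]+)')

def extract_domain(url):
    """Return the domain part of an http(s) URL, or None if it does not match."""
    match = DOMAIN_PATTERN.search(url)
    return match.group(1) if match else None

urls = ["https://www.example.com", "http://blog.example.net", "not a url"]
domains = [extract_domain(url) for url in urls]
print(domains)
```

For production code, the standard library's urllib.parse.urlparse() is usually a more robust way to pull apart URLs.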
Example 6: Splitting text using a delimiter
text_with_delimiter = "apple,banana,cherry,orange"
items = re.split(r',', text_with_delimiter)
print(items)
In this case, we're splitting the text using a comma as the delimiter:
,: Matches the comma character.
re.split() divides the text at each comma and returns a list of items: ['apple', 'banana', 'cherry', 'orange'].
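For a single fixed delimiter, str.split(',') does the same job. re.split() becomes useful when there are several possible delimiters or optional whitespace around them, as in this sketch with an assumed sample string.

```python
import re

mixed = "apple, banana;cherry | orange"

# Split on a comma, semicolon, or pipe, absorbing any surrounding whitespace.
items = re.split(r'\s*[,;|]\s*', mixed)
print(items)
```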
These examples showcase different ways you can use the re module to manipulate and extract information from text data. Regular expressions provide a powerful toolset for text processing tasks.