Topic modeling is a type of statistical model used in natural language processing (NLP) to discover the abstract “topics” that occur in a collection of documents. One of the most common techniques for topic modeling is Latent Dirichlet Allocation (LDA).
Here’s a step-by-step example of topic modeling using the gensim library in Python:
1. Import necessary libraries:
import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
You’ll need gensim and nltk installed, along with the NLTK data this example uses (the English stopwords list and the punkt tokenizer behind word_tokenize). I’ll assume they’re already set up.
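If they aren’t set up yet, a typical install looks like this (shown with pip; adjust for your environment):

```shell
# Install the libraries (assumes pip is on your PATH)
pip install gensim nltk

# Download the NLTK data this example uses: the English stopwords list
# and the punkt tokenizer that backs word_tokenize
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
```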
2. Prepare the data:
For this example, let’s consider a small dataset of texts:
docs = [
"Natural language processing is a sub-field of artificial intelligence.",
"Artificial intelligence aims to build machines that can mimic human intelligence.",
"Machine learning is a technique used in artificial intelligence.",
"Deep learning, a subset of machine learning, uses neural networks to analyze various types of data."
]
3. Preprocess the data:
We need to tokenize the documents, remove stopwords, and remove punctuation:
stop_words = set(stopwords.words('english'))
def preprocess(doc):
    tokens = word_tokenize(doc.lower())
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in stop_words]
    return tokens
tokenized_docs = [preprocess(doc) for doc in docs]
4. Create a dictionary and corpus:
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
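Under the hood, Dictionary assigns each unique token an integer id, and doc2bow turns a token list into sparse (token_id, count) pairs. Here’s a minimal pure-Python sketch of the same idea (an illustration, not gensim’s actual implementation):

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Map each unique token to an integer id, in order of first appearance,
    mimicking what corpora.Dictionary does."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

tokenized_docs = [["language", "processing", "intelligence"],
                  ["intelligence", "machines", "intelligence"]]
token2id = build_dictionary(tokenized_docs)
corpus = [doc2bow(doc, token2id) for doc in tokenized_docs]
print(corpus)  # e.g. [[(0, 1), (1, 1), (2, 1)], [(2, 2), (3, 1)]]
```

Each document becomes a list of (id, count) pairs, which is exactly the sparse bag-of-words representation the LDA model consumes.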
5. Build the LDA model:
NUM_TOPICS = 2
lda_model = gensim.models.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
6. Display the topics:
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)
This will print the most significant words for each topic. You can adjust the NUM_TOPICS variable to extract a different number of topics.
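Each topic comes back from print_topics as a (topic_id, string) pair, where the string looks like '0.081*"intelligence" + 0.054*"learning" + …' (the exact words and weights vary run to run). If you want the words and weights as structured data instead, a small parser like this works (a sketch assuming that string format):

```python
import re

def parse_topic(topic_string):
    """Parse a gensim-style topic string such as
    '0.081*"intelligence" + 0.054*"learning"' into (word, weight) pairs."""
    pairs = []
    for match in re.finditer(r'([\d.]+)\*"([^"]+)"', topic_string):
        weight, word = match.groups()
        pairs.append((word, float(weight)))
    return pairs

example = '0.081*"intelligence" + 0.054*"learning" + 0.041*"machine"'
print(parse_topic(example))
# [('intelligence', 0.081), ('learning', 0.054), ('machine', 0.041)]
```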
Keep in mind that topic modeling results can vary with different runs, especially with small datasets. It’s best to use a larger corpus for more stable and meaningful topics.
If you’d like to see the results for any specific step or if you have further questions, let me know!
Let’s start by extracting the content from the provided DOCX file. After extracting the content, we’ll proceed with the topic modeling steps as outlined previously.
First, we’ll read the content from the DOCX file.
from docx import Document
# Load the DOCX file
doc = Document("/mnt/data/Naukri_PatwariSrikara[4y_0m]Sample.docx")

# Extract the text from the DOCX file
text = []
for paragraph in doc.paragraphs:
    text.append(paragraph.text)

text_content = "\n".join(text)
text_content[:1000]  # Displaying the first 1000 characters for a glimpse
The extracted content seems to be a professional resume. Let’s proceed with the topic modeling to identify the main topics in the document.
- Tokenize and preprocess the content.
- Create a dictionary and corpus.
- Build the LDA model.
- Display the topics.

I will use a more straightforward tokenization approach that doesn’t rely on NLTK’s tokenizers. Let’s proceed with that.
# Simple tokenization function
def simple_tokenize(text):
    return [word.lower() for word in text.split() if word.isalpha()]
def preprocess_v2(doc):
    tokens = simple_tokenize(doc)
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

# Split the extracted text into lines (it was joined with "\n" above),
# then tokenize and preprocess each non-empty line
sentences = text_content.split("\n")
tokenized_docs = [preprocess_v2(sentence) for sentence in sentences if sentence]
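To see what this simplified pipeline keeps and drops, here is a self-contained check (the stop-word set below is a small hand-written stand-in for NLTK’s English list, which the real run uses):

```python
# Small stand-in stop-word list; the actual example pulls NLTK's English list
stop_words = {"is", "a", "of", "the", "to", "that", "can", "in"}

def simple_tokenize(text):
    # Lowercase, split on whitespace, keep only purely alphabetic tokens
    return [word.lower() for word in text.split() if word.isalpha()]

def preprocess_v2(doc):
    tokens = simple_tokenize(doc)
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

print(preprocess_v2("Machine learning is a technique used in artificial intelligence."))
# ['machine', 'learning', 'technique', 'used', 'artificial']
```

Note the trade-off: because isalpha() rejects tokens with attached punctuation, "intelligence." is silently dropped. That’s usually acceptable for a quick pass but is one reason a real tokenizer gives cleaner results.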
# Create a dictionary and corpus
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
# Build the LDA model
NUM_TOPICS = 5
lda_model = gensim.models.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
# Display the topics
topics = lda_model.print_topics(num_words=5)
topics