Topic modeling is a statistical technique used in natural language processing (NLP) to discover the abstract “topics” that occur in a collection of documents. One of the most common techniques for topic modeling is Latent Dirichlet Allocation (LDA).

Here’s a step-by-step example of topic modeling using the gensim library in Python:

1. Import necessary libraries:

import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

If you haven’t installed gensim and nltk, you’ll need to do so first (for example with pip install gensim nltk). Since we can’t install packages here, I’ll assume you have them.
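You’ll also need NLTK’s data files for this walkthrough; a small one-time setup sketch is shown below ('stopwords' and 'punkt' are the standard NLTK data packages that stopwords.words and word_tokenize rely on):

import nltk
nltk.download('stopwords')  # needed for stopwords.words('english')
nltk.download('punkt')      # needed for word_tokenize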

2. Prepare the data:

For this example, let’s consider a small dataset of texts:

docs = [
"Natural language processing is a sub-field of artificial intelligence.",
"Artificial intelligence aims to build machines that can mimic human intelligence.",
"Machine learning is a technique used in artificial intelligence.",
"Deep learning, a subset of machine learning, uses neural networks to analyze various types of data."
]

3. Preprocess the data:

We need to tokenize the documents, remove stopwords, and remove punctuation:

stop_words = set(stopwords.words('english'))

def preprocess(doc):
    # Lowercase the text and split it into word tokens
    tokens = word_tokenize(doc.lower())
    # Drop punctuation tokens
    tokens = [t for t in tokens if t not in string.punctuation]
    # Drop common English stopwords
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

tokenized_docs = [preprocess(doc) for doc in docs]
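As a quick sanity check (not part of the original steps), you can print the preprocessed tokens for the first document:

# Illustrative check of the preprocessing output
print(tokenized_docs[0])
# roughly: ['natural', 'language', 'processing', 'sub-field', 'artificial', 'intelligence']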

4. Create a dictionary and corpus:

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
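If you want to see what the dictionary and corpus actually contain, a small sketch like the following helps (the ids you see will depend on your data):

# Mapping from token to the integer id assigned by the Dictionary
print(dictionary.token2id)

# Bag-of-words representation of the first document: (token_id, count) pairs
print(corpus[0])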

5. Build the LDA model:

NUM_TOPICS = 2
lda_model = gensim.models.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
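Note that LdaModel also accepts a random_state seed; fixing it makes the topics reproducible across runs (the seed value below is arbitrary):

# Same model as above, but with a fixed seed for reproducibility
lda_model = gensim.models.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15, random_state=42
)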

6. Display the topics:

topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

 

This will print the most significant words for each of the topics. You can adjust the NUM_TOPICS variable to extract a different number of topics.
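Beyond the top words per topic, you can also ask the model for the topic distribution of an individual document; a minimal sketch:

# Topic mixture of the first document as (topic_id, probability) pairs
print(lda_model.get_document_topics(corpus[0]))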

Keep in mind that topic modeling results can vary with different runs, especially with small datasets. It’s best to use a larger corpus for more stable and meaningful topics.
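One common way to choose NUM_TOPICS, sketched below, is to compare topic coherence across candidate values using gensim’s CoherenceModel with the 'c_v' measure (higher is generally better; the range 2–5 here is arbitrary):

from gensim.models import CoherenceModel

for k in range(2, 6):
    # Train a candidate model with k topics
    model_k = gensim.models.LdaModel(corpus, num_topics=k, id2word=dictionary, passes=15)
    # Score it with the 'c_v' coherence measure
    coherence = CoherenceModel(
        model=model_k, texts=tokenized_docs, dictionary=dictionary, coherence='c_v'
    ).get_coherence()
    print(k, coherence)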

If you’d like to see the results for any specific step or if you have further questions, let me know!

Let’s start by extracting the content from the provided DOCX file. After extracting the content, we’ll proceed with the topic modeling steps outlined previously.

First, we’ll read the text from the DOCX file using the python-docx package:

 

from docx import Document

# Load the DOCX file
doc = Document("/mnt/data/Naukri_PatwariSrikara[4y_0m]Sample.docx")

# Extract the text from the DOCX file
text = []
for paragraph in doc.paragraphs:
    text.append(paragraph.text)

text_content = "\n".join(text)
text_content[:1000]  # Displaying the first 1000 characters for a glimpse

RESULT
'PATWARI SRIKARA\n\n Objective:\t\n
Successful in maintaining business relationships with key decision makers of target organizations for achieving the service norms and resolving critical issues, to be a part of an organization that provides me with an open environment, where I can integrate my skills and knowledge to work sincerely towards the achievement of organizational goals \n\n Professional Summary:\t\n\nHaving Overall 5 Years of Professional experience out of 3.6 years Involved in Power BI Development & admin projects with extensive usage of Power reporting, Administrating, Power BI Services, Power BI Report Server, Power Bi desktop, DAX, SSRS, Visual Studio, SSIS, SQL Server, and Reporting analyst. \nMaintains the roles and creates new user add-in power bi services and power bi report server \nAdept in power bi desktop, power bi services, and gateway & performances \nCreating workspace and apps in power bi allot and creating access to end users\nHandling Microsoft 365 to create Access, impl'

The extracted content seems to be a professional resume. Let’s proceed with the topic modeling to identify the main topics in the document.

  1. Tokenize and preprocess the content.
  2. Create a dictionary and corpus.
  3. Build the LDA model.
  4. Display the topics.

I will use a more straightforward approach for tokenization that doesn’t rely on NLTK’s tokenizers. Let’s proceed with that.
# Simple tokenization function
def simple_tokenize(text):
    return [word.lower() for word in text.split() if word.isalpha()]

def preprocess_v2(doc):
    tokens = simple_tokenize(doc)
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

# Split the extracted text into paragraph-level chunks
# (assumption: the original splitting step wasn't shown, so we split on newlines)
sentences = text_content.split("\n")

# Tokenize and preprocess the content using the simplified tokenization approach
tokenized_docs = [preprocess_v2(sentence) for sentence in sentences if sentence]

# Create a dictionary and corpus
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Build the LDA model
NUM_TOPICS = 5
lda_model = gensim.models.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)

# Display the topics
topics = lda_model.print_topics(num_words=5)
topics

RESULT
[(0, '0.056*"power" + 0.034*"business" + 0.034*"verizon" + 0.031*"reports" + 0.031*"bi"'),
 (1, '0.109*"power" + 0.087*"bi" + 0.037*"data" + 0.020*"creating" + 0.016*"security"'),
 (2, '0.025*"excel" + 0.019*"information" + 0.014*"data" + 0.013*"reports" + 0.013*"sum"'),
 (3, '0.046*"reports" + 0.024*"requirements" + 0.019*"scheduling" + 0.016*"services" + 0.015*"refresh"'),
 (4, '0.049*"solar" + 0.019*"project" + 0.019*"systems" + 0.019*"capacity" + 0.013*"products"')]
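As a follow-up sketch (not part of the original run), you can map each preprocessed chunk of the resume to its dominant topic to see which parts of the document drive which topic:

# Dominant topic per chunk: pick the (topic_id, probability) pair with the highest probability
for i, bow in enumerate(corpus):
    topic_probs = lda_model.get_document_topics(bow)
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    print(f"Chunk {i}: topic {topic_id} ({prob:.2f})")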

 
