Spaces:
Running
NLP Concepts Documentation
TF-IDF Vectorization
What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical method to convert text into numbers that machine learning algorithms can understand. It measures how important a word is to a document within a collection of documents.
Formula Breakdown:
- TF (Term Frequency): How often a word appears in a document
- IDF (Inverse Document Frequency): How rare/common a word is across all documents
- TF-IDF = TF × IDF
Why Use TF-IDF?
- Converts text to numbers: ML algorithms need numbers, not words
- Highlights important words: Common words like "the", "and" get low scores
- Context-aware: Same word gets different importance in different contexts
- Simple but effective: Works well for text classification tasks
How TF-IDF Works (Step-by-Step):
Step 1: Collect All Text We gather all the sentences we want to analyze:
- "I want to cancel my order"
- "Where is my order status"
- "I want to buy a product"
Step 2: Find All Unique Words The system looks through all sentences and makes a list of every unique word:
- Words found: ['buy', 'cancel', 'is', 'my', 'order', 'product', 'status', 'to', 'want', 'where']
Step 3: Count Word Frequency in Each Sentence For each sentence, count how many times each word appears:
- Sentence 1: "cancel" appears 1 time, "order" appears 1 time, "buy" appears 0 times
- Sentence 2: "where" appears 1 time, "order" appears 1 time, "cancel" appears 0 times
- Sentence 3: "buy" appears 1 time, "product" appears 1 time, "cancel" appears 0 times
Step 4: Calculate How Rare Each Word Is Words that appear in many sentences are common (less important):
- "order" appears in 2 out of 3 sentences → somewhat common
- "cancel" appears in 1 out of 3 sentences → more unique/important
Step 5: Create Final Scores Multiply frequency by rarity to get TF-IDF scores:
- Common words get lower scores
- Rare words that appear frequently in a sentence get higher scores
Step 6: Convert to Number Matrix Each sentence becomes a list of numbers representing word importance:
# Example result
Sentence 1: [0.0, 0.85, 0.0, 0.3, 0.3, 0.0, 0.0, 0.3, 0.3, 0.0 ]
# [buy, cancel, is, my, order, product, status, to, want, where]
Code Example:
from sklearn.feature_extraction.text import TfidfVectorizer
# Step 1: Our sentences
documents = [
"I want to cancel my order",
"Where is my order status",
"I want to buy a product"
]
# Steps 2-6: TF-IDF does all the math for us
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
# See the results
print("Words found:", vectorizer.get_feature_names_out())
print("Number matrix:")
print(tfidf_matrix.toarray())
Real Example from Our Chatbot:
# In our NLP engine
texts = ["I want to cancel", "Where is my order", "I want to buy"]
X = self.vectorizer.fit_transform(texts)
# "cancel" gets high score in first document
# "order" gets high score in second document
# "buy" gets high score in third document
Naive Bayes Classification
What is Naive Bayes?
Naive Bayes is a probabilistic machine learning algorithm that predicts the category of text based on the probability of words appearing in each category. It's called "naive" because it assumes all words are independent (which isn't true, but works well anyway).
Why Use Naive Bayes?
- Fast training: Learns quickly from small datasets
- Good with text: Excellent for text classification tasks
- Handles multiple classes: Can classify into many categories
- Probabilistic: Gives confidence scores, not just predictions
- Works with sparse data: Perfect for TF-IDF vectors
How Naive Bayes Works (Step-by-Step):
Step 1: Prepare Training Examples We show the computer examples of sentences and their correct categories:
- "I want to cancel my order" → Label: "cancel_order"
- "Cancel my purchase" → Label: "cancel_order"
- "Where is my order" → Label: "order_status"
- "Order status please" → Label: "order_status"
Step 2: Convert Text to Numbers Use TF-IDF to turn each sentence into a list of numbers (as explained above).
Step 3: Learn Patterns The computer analyzes the numbers and notices patterns:
- Sentences with high "cancel" scores usually belong to "cancel_order"
- Sentences with high "status" scores usually belong to "order_status"
- It calculates probabilities for each word in each category
Step 4: Store the Learning The computer remembers these patterns:
- "If I see the word 'cancel', there's an 85% chance it's 'cancel_order'"
- "If I see the word 'status', there's a 90% chance it's 'order_status'"
Step 5: Make Predictions on New Text When a new sentence comes in: "I need to cancel"
Step 6: Convert New Text to Numbers Use the same TF-IDF process to turn the new sentence into numbers.
Step 7: Calculate Probabilities For each possible category, calculate the probability:
- Probability of "cancel_order": 87%
- Probability of "order_status": 10%
- Probability of "product_inquiry": 3%
Step 8: Pick the Winner Choose the category with the highest probability: "cancel_order" (87% confidence)
Code Example:
from sklearn.naive_bayes import MultinomialNB
# Step 1: Training examples (already converted to numbers by TF-IDF)
X = tfidf_matrix # The number lists from TF-IDF
y = ["cancel_order", "order_status", "product_inquiry"] # The correct labels
# Steps 3-4: Train the classifier to learn patterns
classifier = MultinomialNB()
classifier.fit(X, y) # "Learn from these examples"
# Steps 5-8: Predict new text
new_text = ["I want to cancel my purchase"]
new_numbers = vectorizer.transform(new_text) # Convert to numbers
prediction = classifier.predict(new_numbers) # "What category is this?"
confidence = classifier.predict_proba(new_numbers) # "How sure are you?"
print("Prediction:", prediction[0]) # "cancel_order"
print("Confidence:", max(confidence[0])) # 0.87 (87% sure)
Real Example from Our Chatbot:
# In our classify_intent method
def classify_intent(self, text):
processed_text = self.preprocess_text(text)
# Convert to TF-IDF
X = self.vectorizer.transform([processed_text])
# Predict intent
intent = self.intent_classifier.predict(X)[0]
# Get confidence
confidence = max(self.intent_classifier.predict_proba(X)[0])
return intent, confidence
How They Work Together in Our Chatbot
Step 1: Training Phase
# Training data
training_texts = [
"I want to cancel my order",
"Cancel my purchase",
"Where is my order",
"Order status please"
]
training_labels = ["cancel_order", "cancel_order", "order_status", "order_status"]
# Convert text to TF-IDF vectors
X = vectorizer.fit_transform(training_texts)
# Train Naive Bayes
classifier.fit(X, training_labels)
Step 2: Prediction Phase
# New user message
user_message = "I need to cancel"
# Convert to TF-IDF using same vectorizer
user_tfidf = vectorizer.transform([user_message])
# Classify intent
intent = classifier.predict(user_tfidf)[0] # "cancel_order"
confidence = max(classifier.predict_proba(user_tfidf)[0]) # 0.89
Why This Combination Works:
- TF-IDF converts "I need to cancel" into numbers like [0.0, 0.85, 0.0, 0.67, ...]
- Naive Bayes looks at these numbers and says "These numbers look most like 'cancel_order' examples I've seen"
- Result: Correctly identifies the user wants to cancel something
Strengths:
- Fast and efficient
- Works well with small datasets
- Interpretable results
Limitations:
- Simple approach - doesn't understand context deeply
- Assumes words are independent
- Can be confused by similar intents
This is why we added keyword-based rules as a hybrid approach to improve accuracy!