smart-chatbot-api / docs /nlp_concepts_docs.md
GitHub Actions
Deploy from GitHub Actions (2026-03-03 04:20 UTC)
c4d486b

NLP Concepts Documentation

TF-IDF Vectorization

What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical method to convert text into numbers that machine learning algorithms can understand. It measures how important a word is to a document within a collection of documents.

Formula Breakdown:

  • TF (Term Frequency): How often a word appears in a document
  • IDF (Inverse Document Frequency): How rare/common a word is across all documents
  • TF-IDF = TF × IDF

Why Use TF-IDF?

  1. Converts text to numbers: ML algorithms need numbers, not words
  2. Highlights important words: Common words like "the", "and" get low scores
  3. Context-aware: Same word gets different importance in different contexts
  4. Simple but effective: Works well for text classification tasks

How TF-IDF Works (Step-by-Step):

Step 1: Collect All Text We gather all the sentences we want to analyze:

  • "I want to cancel my order"
  • "Where is my order status"
  • "I want to buy a product"

Step 2: Find All Unique Words The system looks through all sentences and makes a list of every unique word:

  • Words found: ['buy', 'cancel', 'is', 'my', 'order', 'product', 'status', 'to', 'want', 'where']

Step 3: Count Word Frequency in Each Sentence For each sentence, count how many times each word appears:

  • Sentence 1: "cancel" appears 1 time, "order" appears 1 time, "buy" appears 0 times
  • Sentence 2: "where" appears 1 time, "order" appears 1 time, "cancel" appears 0 times
  • Sentence 3: "buy" appears 1 time, "product" appears 1 time, "cancel" appears 0 times

Step 4: Calculate How Rare Each Word Is Words that appear in many sentences are common (less important):

  • "order" appears in 2 out of 3 sentences → somewhat common
  • "cancel" appears in 1 out of 3 sentences → more unique/important

Step 5: Create Final Scores Multiply frequency by rarity to get TF-IDF scores:

  • Common words get lower scores
  • Rare words that appear frequently in a sentence get higher scores

Step 6: Convert to Number Matrix Each sentence becomes a list of numbers representing word importance:

# Example result
Sentence 1: [0.0,  0.85, 0.0,  0.3,  0.3,  0.0,  0.0,  0.3,  0.3,  0.0 ]
#           [buy, cancel, is,  my,  order, product, status, to, want, where]

Code Example:

from sklearn.feature_extraction.text import TfidfVectorizer

# Step 1: Our sentences
documents = [
    "I want to cancel my order",
    "Where is my order status", 
    "I want to buy a product"
]

# Steps 2-6: TF-IDF does all the math for us
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# See the results
print("Words found:", vectorizer.get_feature_names_out())
print("Number matrix:")
print(tfidf_matrix.toarray())

Real Example from Our Chatbot:

# In our NLP engine
texts = ["I want to cancel", "Where is my order", "I want to buy"]
X = self.vectorizer.fit_transform(texts)

# "cancel" gets high score in first document
# "order" gets high score in second document  
# "buy" gets high score in third document

Naive Bayes Classification

What is Naive Bayes?

Naive Bayes is a probabilistic machine learning algorithm that predicts the category of text based on the probability of words appearing in each category. It's called "naive" because it assumes all words are independent (which isn't true, but works well anyway).

Why Use Naive Bayes?

  1. Fast training: Learns quickly from small datasets
  2. Good with text: Excellent for text classification tasks
  3. Handles multiple classes: Can classify into many categories
  4. Probabilistic: Gives confidence scores, not just predictions
  5. Works with sparse data: Perfect for TF-IDF vectors

How Naive Bayes Works (Step-by-Step):

Step 1: Prepare Training Examples We show the computer examples of sentences and their correct categories:

  • "I want to cancel my order" → Label: "cancel_order"
  • "Cancel my purchase" → Label: "cancel_order"
  • "Where is my order" → Label: "order_status"
  • "Order status please" → Label: "order_status"

Step 2: Convert Text to Numbers Use TF-IDF to turn each sentence into a list of numbers (as explained above).

Step 3: Learn Patterns The computer analyzes the numbers and notices patterns:

  • Sentences with high "cancel" scores usually belong to "cancel_order"
  • Sentences with high "status" scores usually belong to "order_status"
  • It calculates probabilities for each word in each category

Step 4: Store the Learning The computer remembers these patterns:

  • "If I see the word 'cancel', there's an 85% chance it's 'cancel_order'"
  • "If I see the word 'status', there's a 90% chance it's 'order_status'"

Step 5: Make Predictions on New Text When a new sentence comes in: "I need to cancel"

Step 6: Convert New Text to Numbers Use the same TF-IDF process to turn the new sentence into numbers.

Step 7: Calculate Probabilities For each possible category, calculate the probability:

  • Probability of "cancel_order": 87%
  • Probability of "order_status": 10%
  • Probability of "product_inquiry": 3%

Step 8: Pick the Winner Choose the category with the highest probability: "cancel_order" (87% confidence)

Code Example:

from sklearn.naive_bayes import MultinomialNB

# Step 1: Training examples (already converted to numbers by TF-IDF)
X = tfidf_matrix  # The number lists from TF-IDF
y = ["cancel_order", "order_status", "product_inquiry"]  # The correct labels

# Steps 3-4: Train the classifier to learn patterns
classifier = MultinomialNB()
classifier.fit(X, y)  # "Learn from these examples"

# Steps 5-8: Predict new text
new_text = ["I want to cancel my purchase"]
new_numbers = vectorizer.transform(new_text)  # Convert to numbers

prediction = classifier.predict(new_numbers)  # "What category is this?"
confidence = classifier.predict_proba(new_numbers)  # "How sure are you?"

print("Prediction:", prediction[0])  # "cancel_order"
print("Confidence:", max(confidence[0]))  # 0.87 (87% sure)

Real Example from Our Chatbot:

# In our classify_intent method
def classify_intent(self, text):
    processed_text = self.preprocess_text(text)
    
    # Convert to TF-IDF
    X = self.vectorizer.transform([processed_text])
    
    # Predict intent
    intent = self.intent_classifier.predict(X)[0]
    
    # Get confidence
    confidence = max(self.intent_classifier.predict_proba(X)[0])
    
    return intent, confidence

How They Work Together in Our Chatbot

Step 1: Training Phase

# Training data
training_texts = [
    "I want to cancel my order",
    "Cancel my purchase", 
    "Where is my order",
    "Order status please"
]
training_labels = ["cancel_order", "cancel_order", "order_status", "order_status"]

# Convert text to TF-IDF vectors
X = vectorizer.fit_transform(training_texts)

# Train Naive Bayes
classifier.fit(X, training_labels)

Step 2: Prediction Phase

# New user message
user_message = "I need to cancel"

# Convert to TF-IDF using same vectorizer
user_tfidf = vectorizer.transform([user_message])

# Classify intent
intent = classifier.predict(user_tfidf)[0]  # "cancel_order"
confidence = max(classifier.predict_proba(user_tfidf)[0])  # 0.89

Why This Combination Works:

  1. TF-IDF converts "I need to cancel" into numbers like [0.0, 0.85, 0.0, 0.67, ...]
  2. Naive Bayes looks at these numbers and says "These numbers look most like 'cancel_order' examples I've seen"
  3. Result: Correctly identifies the user wants to cancel something

Strengths:

  • Fast and efficient
  • Works well with small datasets
  • Interpretable results

Limitations:

  • Simple approach - doesn't understand context deeply
  • Assumes words are independent
  • Can be confused by similar intents

This is why we added keyword-based rules as a hybrid approach to improve accuracy!