Spaces:
Running
Running
File size: 7,982 Bytes
c4d486b | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 | # NLP Concepts Documentation
## TF-IDF Vectorization
### What is TF-IDF?
**TF-IDF (Term Frequency-Inverse Document Frequency)** is a numerical method to convert text into numbers that machine learning algorithms can understand. It measures how important a word is to a document within a collection of documents.
### Formula Breakdown:
- **TF (Term Frequency)**: How often a word appears in a document
- **IDF (Inverse Document Frequency)**: How rare/common a word is across all documents
- **TF-IDF = TF × IDF**
### Why Use TF-IDF?
1. **Converts text to numbers**: ML algorithms need numbers, not words
2. **Highlights important words**: Common words like "the", "and" get low scores
3. **Context-aware**: Same word gets different importance in different contexts
4. **Simple but effective**: Works well for text classification tasks
### How TF-IDF Works (Step-by-Step):
**Step 1: Collect All Text**
We gather all the sentences we want to analyze:
- "I want to cancel my order"
- "Where is my order status"
- "I want to buy a product"
**Step 2: Find All Unique Words**
The system looks through all sentences and makes a list of every unique word:
- Words found: ['buy', 'cancel', 'is', 'my', 'order', 'product', 'status', 'to', 'want', 'where']
**Step 3: Count Word Frequency in Each Sentence**
For each sentence, count how many times each word appears:
- Sentence 1: "cancel" appears 1 time, "order" appears 1 time, "buy" appears 0 times
- Sentence 2: "where" appears 1 time, "order" appears 1 time, "cancel" appears 0 times
- Sentence 3: "buy" appears 1 time, "product" appears 1 time, "cancel" appears 0 times
**Step 4: Calculate How Rare Each Word Is**
Words that appear in many sentences are common (less important):
- "order" appears in 2 out of 3 sentences → somewhat common
- "cancel" appears in 1 out of 3 sentences → more unique/important
**Step 5: Create Final Scores**
Multiply frequency by rarity to get TF-IDF scores:
- Common words get lower scores
- Rare words that appear frequently in a sentence get higher scores
**Step 6: Convert to Number Matrix**
Each sentence becomes a list of numbers representing word importance:
```python
# Example result
Sentence 1: [0.0, 0.85, 0.0, 0.3, 0.3, 0.0, 0.0, 0.3, 0.3, 0.0 ]
# [buy, cancel, is, my, order, product, status, to, want, where]
```
### Code Example:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
# Step 1: Our sentences
documents = [
"I want to cancel my order",
"Where is my order status",
"I want to buy a product"
]
# Steps 2-6: TF-IDF does all the math for us
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
# See the results
print("Words found:", vectorizer.get_feature_names_out())
print("Number matrix:")
print(tfidf_matrix.toarray())
```
### Real Example from Our Chatbot:
```python
# In our NLP engine
texts = ["I want to cancel", "Where is my order", "I want to buy"]
X = self.vectorizer.fit_transform(texts)
# "cancel" gets high score in first document
# "order" gets high score in second document
# "buy" gets high score in third document
```
---
## Naive Bayes Classification
### What is Naive Bayes?
**Naive Bayes** is a probabilistic machine learning algorithm that predicts the category of text based on the probability of words appearing in each category. It's called "naive" because it assumes all words are independent (which isn't true, but works well anyway).
### Why Use Naive Bayes?
1. **Fast training**: Learns quickly from small datasets
2. **Good with text**: Excellent for text classification tasks
3. **Handles multiple classes**: Can classify into many categories
4. **Probabilistic**: Gives confidence scores, not just predictions
5. **Works with sparse data**: Perfect for TF-IDF vectors
### How Naive Bayes Works (Step-by-Step):
**Step 1: Prepare Training Examples**
We show the computer examples of sentences and their correct categories:
- "I want to cancel my order" → Label: "cancel_order"
- "Cancel my purchase" → Label: "cancel_order"
- "Where is my order" → Label: "order_status"
- "Order status please" → Label: "order_status"
**Step 2: Convert Text to Numbers**
Use TF-IDF to turn each sentence into a list of numbers (as explained above).
**Step 3: Learn Patterns**
The computer analyzes the numbers and notices patterns:
- Sentences with high "cancel" scores usually belong to "cancel_order"
- Sentences with high "status" scores usually belong to "order_status"
- It calculates probabilities for each word in each category
**Step 4: Store the Learning**
The computer remembers these patterns:
- "If I see the word 'cancel', there's an 85% chance it's 'cancel_order'"
- "If I see the word 'status', there's a 90% chance it's 'order_status'"
**Step 5: Make Predictions on New Text**
When a new sentence comes in: "I need to cancel"
**Step 6: Convert New Text to Numbers**
Use the same TF-IDF process to turn the new sentence into numbers.
**Step 7: Calculate Probabilities**
For each possible category, calculate the probability:
- Probability of "cancel_order": 87%
- Probability of "order_status": 10%
- Probability of "product_inquiry": 3%
**Step 8: Pick the Winner**
Choose the category with the highest probability: "cancel_order" (87% confidence)
### Code Example:
```python
from sklearn.naive_bayes import MultinomialNB
# Step 1: Training examples (already converted to numbers by TF-IDF)
X = tfidf_matrix # The number lists from TF-IDF
y = ["cancel_order", "order_status", "product_inquiry"] # The correct labels
# Steps 3-4: Train the classifier to learn patterns
classifier = MultinomialNB()
classifier.fit(X, y) # "Learn from these examples"
# Steps 5-8: Predict new text
new_text = ["I want to cancel my purchase"]
new_numbers = vectorizer.transform(new_text) # Convert to numbers
prediction = classifier.predict(new_numbers) # "What category is this?"
confidence = classifier.predict_proba(new_numbers) # "How sure are you?"
print("Prediction:", prediction[0]) # "cancel_order"
print("Confidence:", max(confidence[0])) # 0.87 (87% sure)
```
### Real Example from Our Chatbot:
```python
# In our classify_intent method
def classify_intent(self, text):
processed_text = self.preprocess_text(text)
# Convert to TF-IDF
X = self.vectorizer.transform([processed_text])
# Predict intent
intent = self.intent_classifier.predict(X)[0]
# Get confidence
confidence = max(self.intent_classifier.predict_proba(X)[0])
return intent, confidence
```
---
## How They Work Together in Our Chatbot
### Step 1: Training Phase
```python
# Training data
training_texts = [
"I want to cancel my order",
"Cancel my purchase",
"Where is my order",
"Order status please"
]
training_labels = ["cancel_order", "cancel_order", "order_status", "order_status"]
# Convert text to TF-IDF vectors
X = vectorizer.fit_transform(training_texts)
# Train Naive Bayes
classifier.fit(X, training_labels)
```
### Step 2: Prediction Phase
```python
# New user message
user_message = "I need to cancel"
# Convert to TF-IDF using same vectorizer
user_tfidf = vectorizer.transform([user_message])
# Classify intent
intent = classifier.predict(user_tfidf)[0] # "cancel_order"
confidence = max(classifier.predict_proba(user_tfidf)[0]) # 0.89
```
### Why This Combination Works:
1. **TF-IDF** converts "I need to cancel" into numbers like [0.0, 0.85, 0.0, 0.67, ...]
2. **Naive Bayes** looks at these numbers and says "These numbers look most like 'cancel_order' examples I've seen"
3. **Result**: Correctly identifies the user wants to cancel something
### Strengths:
- Fast and efficient
- Works well with small datasets
- Interpretable results
### Limitations:
- Simple approach - doesn't understand context deeply
- Assumes words are independent
- Can be confused by similar intents
This is why we added keyword-based rules as a hybrid approach to improve accuracy! |