# LDA Integration Fix - get_risk_labels() Method
## ❌ Problem Identified
```
AttributeError: 'TopicModelingRiskDiscovery' object has no attribute 'get_topic_labels'
```
**Root Cause:** The `LDARiskDiscovery.get_risk_labels()` method was calling `self.lda_backend.get_topic_labels()`, but this method doesn't exist in the `TopicModelingRiskDiscovery` class.
---
## ✅ Solution Applied
### **File: `risk_discovery.py`**
**Before (Broken):**
```python
def get_risk_labels(self, clause_texts: List[str]) -> List[int]:
    if self.cluster_labels is None:
        raise ValueError("Must discover patterns first")
    # ❌ This method doesn't exist!
    labels = self.lda_backend.get_topic_labels(clause_texts)
    return labels
```
**After (Fixed):**
```python
def get_risk_labels(self, clause_texts: List[str]) -> List[int]:
    if self.cluster_labels is None:
        raise ValueError("Must discover patterns first")
    # ✅ Implement directly using LDA model
    cleaned_texts = [self.lda_backend._clean_text(text) for text in clause_texts]
    feature_matrix = self.lda_backend.vectorizer.transform(cleaned_texts)
    doc_topic_dist = self.lda_backend.lda_model.transform(feature_matrix)
    # Return the topic with highest probability
    labels = doc_topic_dist.argmax(axis=1).tolist()
    return labels
```
---
## Additional Fix: train.py Return Values
### **File: `train.py`**
**Problem:** When errors occurred, `main()` returned `None`, causing:
```
TypeError: cannot unpack non-iterable NoneType object
```
**Fixed:** Changed all `return` statements in error handlers to `return None, None`
**Before:**
```python
except Exception as e:
    print(f"❌ Error: {e}")
    return  # ❌ Returns None
```
**After:**
```python
except Exception as e:
    print(f"❌ Error: {e}")
    return None, None  # ✅ Returns tuple
```
Also updated the main call:
```python
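import sys

if __name__ == "__main__":
    result = main()
    # Error paths return (None, None), which is itself not None,
    # so check the unpacked trainer rather than the tuple.
    trainer, history = result if result is not None else (None, None)
    if trainer is None:
        print("\n❌ Training failed.")
        sys.exit(1)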
```
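The return-value pattern can be sketched in isolation. This is a minimal illustration, not the project's `train.py`: the `main()` body, its `fail` parameter, and the `run()` helper are hypothetical stand-ins used only to show that a failing call still yields a 2-tuple that unpacks cleanly.

```python
from typing import Any, Optional, Tuple


def main(fail: bool = False) -> Tuple[Optional[Any], Optional[dict]]:
    """Hypothetical stand-in for train.py's main(); `fail` simulates an error."""
    try:
        if fail:
            raise RuntimeError("simulated training error")
        trainer, history = object(), {"loss": [0.9, 0.5]}
        return trainer, history
    except Exception as e:
        print(f"Error: {e}")
        # Always return a 2-tuple so the caller can unpack it safely.
        return None, None


def run(fail: bool = False) -> int:
    """Return a process exit code: 0 on success, 1 on failure."""
    trainer, history = main(fail=fail)
    if trainer is None:
        print("Training failed.")
        return 1
    return 0
```

Note that `(None, None)` is a non-empty tuple, so a check such as `result is not None` succeeds even on failure; testing the unpacked `trainer` is what distinguishes the two cases.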
---
## 🎯 How It Works Now
### **1. Pattern Discovery:**
```python
lda = LDARiskDiscovery(n_clusters=7)
results = lda.discover_risk_patterns(train_clauses)
# Discovers 7 topics using LDA
```
### **2. Getting Labels for New Clauses:**
```python
labels = lda.get_risk_labels(new_clauses)
# Returns: [2, 0, 5, 1, ...] (dominant topic per clause)
```
**Process:**
1. Clean text using `_clean_text()`
2. Transform to document-term matrix using `vectorizer.transform()`
3. Get topic probabilities using `lda_model.transform()`
4. Return `argmax()` - the topic with highest probability
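
The four steps above can be reproduced outside the project with scikit-learn alone. The sketch below is an assumption-laden illustration: the corpus, `N_TOPICS`, `clean_text()`, and `get_labels()` are hypothetical names standing in for the project's `_clean_text()`, `vectorizer`, and `lda_model`, whose actual configuration is not shown in this document.

```python
from typing import List

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus; real clause data would be far larger.
train_clauses = [
    "The party shall indemnify the company against all claims.",
    "This agreement is governed by the laws of the state.",
    "The licensee shall not disclose confidential information.",
    "Either party may terminate this agreement with written notice.",
]

N_TOPICS = 3  # illustrative value, not the project's setting

# Fit the vectorizer and the LDA model once, on the training clauses.
vectorizer = CountVectorizer(stop_words="english")
lda_model = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
lda_model.fit(vectorizer.fit_transform(train_clauses))


def clean_text(text: str) -> str:
    """Hypothetical stand-in for _clean_text(): lowercase, collapse whitespace."""
    return " ".join(text.lower().split())


def get_labels(clause_texts: List[str]) -> List[int]:
    cleaned = [clean_text(t) for t in clause_texts]  # 1. clean
    features = vectorizer.transform(cleaned)         # 2. document-term matrix
    doc_topic = lda_model.transform(features)        # 3. topic probabilities
    return doc_topic.argmax(axis=1).tolist()         # 4. dominant topic


labels = get_labels([
    "The company shall indemnify the licensee.",
    "Governed by state law.",
])
print(labels)  # one integer topic ID per input clause
```

The key point is that `transform()` is called on the already fitted vectorizer and model; nothing is re-fitted when labeling new clauses.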
### **3. Getting Full Probability Distribution (Optional):**
```python
dist = lda.get_topic_distribution(new_clauses)
# Returns: [[0.1, 0.05, 0.7, ...], ...] (probabilities for each topic)
```
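One reason to keep the full distribution is that it shows how confident each assignment is. The helper below is a hypothetical sketch (neither `dominant_topics()` nor `min_confidence` exists in the project); it assumes the distribution is a list of per-clause probability rows, as in the comment above.

```python
from typing import List, Sequence, Tuple


def dominant_topics(
    dist: Sequence[Sequence[float]], min_confidence: float = 0.5
) -> List[Tuple[int, float, bool]]:
    """For each row, return (topic_id, probability, is_confident)."""
    results = []
    for row in dist:
        # Index of the largest probability; ties resolve to the lowest index.
        topic = max(range(len(row)), key=row.__getitem__)
        results.append((topic, row[topic], row[topic] >= min_confidence))
    return results


example = [
    [0.1, 0.05, 0.7, 0.1, 0.05],  # clearly dominated by topic 2
    [0.3, 0.3, 0.2, 0.1, 0.1],    # ambiguous between topics 0 and 1
]
for topic, prob, confident in dominant_topics(example):
    print(topic, prob, confident)
```

The second row still gets a label from `argmax()`, but its low probability suggests the clause does not clearly belong to any single topic.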
---
## ✅ Verification
### **Test Script: `test_lda_get_labels.py`**
The test script verifies:
1. ✅ LDARiskDiscovery instance creation
2. ✅ Pattern discovery works
3. ✅ `get_risk_labels()` returns correct format
4. ✅ Labels are integers in valid range
5. ✅ `get_topic_distribution()` returns probability matrix
**Run test:**
```bash
python3 test_lda_get_labels.py
```
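The contents of `test_lda_get_labels.py` are not reproduced here. As a rough idea of what checks 3 to 5 might look like, the following sketch uses two hypothetical helpers, `check_labels()` and `check_distribution()`, that operate on plain Python lists and could be applied to the output of `get_risk_labels()` and `get_topic_distribution()`.

```python
import math
from typing import List, Sequence


def check_labels(labels: List[int], n_inputs: int, n_topics: int) -> None:
    """Assert one integer label per input, each in the range 0..n_topics-1."""
    assert isinstance(labels, list), "labels should be a list"
    assert len(labels) == n_inputs, "one label per input clause"
    for label in labels:
        # bool is a subclass of int, so exclude it explicitly.
        assert isinstance(label, int) and not isinstance(label, bool)
        assert 0 <= label < n_topics, f"label {label} out of range"


def check_distribution(dist: Sequence[Sequence[float]], n_topics: int) -> None:
    """Assert each row has n_topics non-negative entries summing to about 1."""
    for row in dist:
        assert len(row) == n_topics
        assert all(p >= 0 for p in row)
        assert math.isclose(sum(row), 1.0, abs_tol=1e-6)


# Sample values in the shapes described in this document.
check_labels([2, 0, 5, 1], n_inputs=4, n_topics=7)
check_distribution([[0.1, 0.05, 0.7, 0.1, 0.05]], n_topics=5)
```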
---
## Ready to Use
The fix is complete! You can now run:
```bash
python3 train.py
```
**Expected output:**
```
🎯 Using LDA (Topic Modeling) for risk discovery
Discovering risk patterns using LDA (n_topics=7)...
Creating document-term matrix...
Fitting LDA model...
✅ LDA discovery complete: 7 risk topics found
Discovered Risk Patterns:
• Topic_PARTY_AGREEMENT
  Keywords: party, agreement, shall, company, consent
• Topic_INTELLECTUAL_PROPERTY
  Keywords: shall, product, products, agreement, section
...
```
---
## Technical Details
### **Method Signature:**
```python
def get_risk_labels(self, clause_texts: List[str]) -> List[int]
```
### **Returns:**
- List of integers representing dominant topic IDs
- Range: 0 to (n_clusters-1)
- Length: Same as input `clause_texts`
### **Algorithm:**
1. **Text Cleaning:** Remove extra whitespace, normalize
2. **Vectorization:** Convert to bag-of-words using CountVectorizer
3. **LDA Transform:** Get document-topic probability distribution
4. **Argmax:** Select topic with highest probability per document
### **Example:**
```python
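clause_texts = ["The party shall indemnify...", "Governed by state law..."]
# Illustrative document-topic probabilities (5 topics shown for brevity)
probs = [[0.1, 0.05, 0.7, 0.1, 0.05], [0.2, 0.6, 0.1, 0.05, 0.05]]
labels = [row.index(max(row)) for row in probs]  # [2, 1]: topics 2 and 1 have the highest probabilities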
```
---
## 🎯 Key Differences from K-Means
| Aspect | K-Means | LDA |
|--------|---------|-----|
| **Assignment** | `kmeans.predict()` | `lda_model.transform()` + `argmax()` |
| **Hard/Soft** | Hard clusters | Soft topics (with probabilities) |
| **Model** | Centroid-based | Probabilistic topic model |
| **Output** | Cluster ID | Dominant topic ID + full distribution |
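
The contrast in the table can be seen side by side with scikit-learn. This sketch is only an illustration: the corpus and the choice of two clusters/topics are made up, and both models are fed the same count matrix for simplicity, which may differ from the features the project's K-Means backend uses.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

train = [
    "The party shall indemnify the company against all claims.",
    "The company shall indemnify the party for losses.",
    "This agreement is governed by the laws of the state.",
    "State law governs this agreement and its interpretation.",
]
new = ["The party shall indemnify.", "Governed by state law."]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train)
X_new = vectorizer.transform(new)

# K-Means: hard assignment, one cluster ID per document.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
kmeans_labels = kmeans.predict(X_new)

# LDA: soft assignment, one probability per topic per document.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)
doc_topic = lda.transform(X_new)
lda_labels = doc_topic.argmax(axis=1)  # reduce to a dominant topic ID

print(kmeans_labels.shape, doc_topic.shape, lda_labels.shape)
```

`kmeans.predict()` returns only the IDs, while `lda.transform()` returns a matrix whose rows sum to 1; the `argmax()` step discards that extra information to produce a comparable label.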
---
## Troubleshooting
### **Error: "No module named 'sklearn'"**
```bash
pip install scikit-learn
```
### **Error: "Must discover patterns first"**
**Solution:** Call `discover_risk_patterns()` before `get_risk_labels()`:
```python
lda.discover_risk_patterns(train_clauses) # First
labels = lda.get_risk_labels(test_clauses) # Then
```
### **Error: "Feature names are different"**
**Solution:** Discover patterns on the same clauses you will train on. The LDA model learns its vocabulary from that set, so any clause you label later must be transformed by the same fitted vectorizer rather than a newly fitted one.
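
A small sketch of why the vocabulary matters, using only `CountVectorizer`. The clause lists and variable names are hypothetical; the point is that a vectorizer fitted on a different set produces a matrix with a different number of columns, which a model fitted on the original features is not expected to accept.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_clauses = [
    "The party shall indemnify the company.",
    "This agreement is governed by state law.",
]
other_clauses = ["Payment is due within thirty days of invoice."]
new_clauses = ["The party shall pay the invoice."]

fitted = CountVectorizer().fit(train_clauses)  # vocabulary the model was trained on
refit = CountVectorizer().fit(other_clauses)   # a different vocabulary

X_ok = fitted.transform(new_clauses)        # columns match the training features
X_mismatch = refit.transform(new_clauses)   # columns follow the other vocabulary

print("expected features:", len(fitted.vocabulary_))
print("from fitted vectorizer:", X_ok.shape[1])
print("from re-fitted vectorizer:", X_mismatch.shape[1])
```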
---
## ✅ Status
- [x] Fixed `get_risk_labels()` method implementation
- [x] Fixed `train.py` return values for error handling
- [x] Created test script for verification
- [x] Documented the fix
- [x] Ready for production use
**You can now train with LDA!**
---
**Files Modified:**
1. `risk_discovery.py` - Fixed `get_risk_labels()` (lines 350-373)
2. `train.py` - Fixed return statements (lines 66, 89, 152)
3. `test_lda_get_labels.py` - New test script (created)
4. `doc/LDA_FIX_GET_LABELS.md` - This documentation (created)
**Status:** ✅ **FIXED AND READY**