# LDA Integration Fix - get_risk_labels() Method

## Problem Identified

```
AttributeError: 'TopicModelingRiskDiscovery' object has no attribute 'get_topic_labels'
```

**Root Cause:** The `LDARiskDiscovery.get_risk_labels()` method was calling `self.lda_backend.get_topic_labels()`, but this method does not exist in the `TopicModelingRiskDiscovery` class.

---

## Solution Applied

### **File: `risk_discovery.py`**

**Before (Broken):**

```python
def get_risk_labels(self, clause_texts: List[str]) -> List[int]:
    if self.cluster_labels is None:
        raise ValueError("Must discover patterns first")
    # This method doesn't exist!
    labels = self.lda_backend.get_topic_labels(clause_texts)
    return labels
```
**After (Fixed):**

```python
def get_risk_labels(self, clause_texts: List[str]) -> List[int]:
    if self.cluster_labels is None:
        raise ValueError("Must discover patterns first")
    # Implement directly using the LDA model
    cleaned_texts = [self.lda_backend._clean_text(text) for text in clause_texts]
    feature_matrix = self.lda_backend.vectorizer.transform(cleaned_texts)
    doc_topic_dist = self.lda_backend.lda_model.transform(feature_matrix)
    # Return the topic with the highest probability
    labels = doc_topic_dist.argmax(axis=1).tolist()
    return labels
```
---

## Additional Fix: train.py Return Values

### **File: `train.py`**

**Problem:** When errors occurred, `main()` returned `None`, causing:

```
TypeError: cannot unpack non-iterable NoneType object
```

**Fixed:** Changed all `return` statements in error handlers to `return None, None`.

**Before:**

```python
except Exception as e:
    print(f"Error: {e}")
    return  # Returns None
```

**After:**

```python
except Exception as e:
    print(f"Error: {e}")
    return None, None  # Returns a tuple
```
Also updated the main call. Because `main()` now always returns a tuple, the failure check inspects the unpacked values rather than testing the result for `None`:

```python
if __name__ == "__main__":
    trainer, history = main()
    if trainer is None:
        print("\nTraining failed.")
        exit(1)
```
---

## How It Works Now

### **1. Pattern Discovery:**

```python
lda = LDARiskDiscovery(n_clusters=7)
results = lda.discover_risk_patterns(train_clauses)
# Discovers 7 topics using LDA
```

### **2. Getting Labels for New Clauses:**

```python
labels = lda.get_risk_labels(new_clauses)
# Returns: [2, 0, 5, 1, ...] (dominant topic per clause)
```

**Process:**
1. Clean text using `_clean_text()`
2. Transform to a document-term matrix using `vectorizer.transform()`
3. Get topic probabilities using `lda_model.transform()`
4. Return `argmax()` - the topic with the highest probability
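The steps above can be sketched with plain scikit-learn, independent of the wrapper classes; the clause texts and topic count here are illustrative stand-ins:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Fit on training clauses (stands in for discover_risk_patterns)
train_texts = ["the party shall indemnify the company",
               "this agreement is governed by state law",
               "a license to the intellectual property is granted"]
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)
lda_model = LatentDirichletAllocation(n_components=3, random_state=0).fit(train_matrix)

# Label new clauses (stands in for get_risk_labels)
new_texts = ["the company shall indemnify the other party"]
feature_matrix = vectorizer.transform(new_texts)      # step 2: document-term matrix
doc_topic_dist = lda_model.transform(feature_matrix)  # step 3: topic probabilities
labels = doc_topic_dist.argmax(axis=1).tolist()       # step 4: dominant topic per clause
```

Note that the fitted `vectorizer` and `lda_model` are reused at labeling time; only `transform()` is called on the new clauses, never `fit()`.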
### **3. Getting Full Probability Distribution (Optional):**

```python
dist = lda.get_topic_distribution(new_clauses)
# Returns: [[0.1, 0.05, 0.7, ...], ...] (probabilities for each topic)
```
---

## Verification

### **Test Script: `test_lda_get_labels.py`**

The test script verifies:
1. LDARiskDiscovery instance creation
2. Pattern discovery works
3. `get_risk_labels()` returns the correct format
4. Labels are integers in the valid range
5. `get_topic_distribution()` returns a probability matrix

**Run test:**

```bash
python3 test_lda_get_labels.py
```
---

## Ready to Use

The fix is complete! You can now run:

```bash
python3 train.py
```

**Expected output:**

```
Using LDA (Topic Modeling) for risk discovery
Discovering risk patterns using LDA (n_topics=7)...
Creating document-term matrix...
Fitting LDA model...
LDA discovery complete: 7 risk topics found
Discovered Risk Patterns:
  • Topic_PARTY_AGREEMENT
    Keywords: party, agreement, shall, company, consent
  • Topic_INTELLECTUAL_PROPERTY
    Keywords: shall, product, products, agreement, section
...
```
---

## Technical Details

### **Method Signature:**

```python
def get_risk_labels(self, clause_texts: List[str]) -> List[int]
```

### **Returns:**
- A list of integers representing dominant topic IDs
- Range: 0 to (n_clusters - 1)
- Length: same as the input `clause_texts`

### **Algorithm:**
1. **Text Cleaning:** Remove extra whitespace, normalize
2. **Vectorization:** Convert to bag-of-words using CountVectorizer
3. **LDA Transform:** Get the document-topic probability distribution
4. **Argmax:** Select the topic with the highest probability per document
### **Example:**

```python
import numpy as np

# Topic probabilities for two clauses:
#   ["The party shall indemnify...", "Governed by state law..."]
probs = np.array([[0.1, 0.05, 0.7, 0.1, 0.05],
                  [0.2, 0.6, 0.1, 0.05, 0.05]])
labels = probs.argmax(axis=1).tolist()
# labels == [2, 1]: topics 2 and 1 have the highest probabilities
```
---

## Key Differences from K-Means

| Aspect | K-Means | LDA |
|--------|---------|-----|
| **Assignment** | `kmeans.predict()` | `lda_model.transform()` + `argmax()` |
| **Hard/Soft** | Hard clusters | Soft topics (with probabilities) |
| **Model** | Centroid-based | Probabilistic topic model |
| **Output** | Cluster ID | Dominant topic ID + full distribution |
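The hard-versus-soft distinction in the table can be seen side by side in a small sketch; the synthetic count matrix below is purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

# A synthetic document-term count matrix (20 documents, 10 terms)
X = np.random.RandomState(0).randint(0, 5, size=(20, 10)).astype(float)

# K-Means: one hard cluster ID per document, nothing else
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# LDA: a full probability distribution per document, hardened via argmax
dist = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(X)
topic_ids = dist.argmax(axis=1)
```

Each row of `dist` sums to 1, so the full distribution is available for downstream use even after taking the dominant topic.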
---

## Troubleshooting

### **Error: "No module named 'sklearn'"**

```bash
pip install scikit-learn
```

### **Error: "Must discover patterns first"**

**Solution:** Call `discover_risk_patterns()` before `get_risk_labels()`:

```python
lda.discover_risk_patterns(train_clauses)  # First
labels = lda.get_risk_labels(test_clauses)  # Then
```

### **Error: "Feature names are different"**

**Solution:** Label clauses with the same fitted model that discovered the patterns. The vectorizer's vocabulary is learned from the training set during discovery, so transforming through a differently fitted model produces a mismatched feature space.
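The vocabulary behavior behind this error can be illustrated with scikit-learn's `CountVectorizer` directly (the texts here are made up): a fitted vectorizer reuses its learned vocabulary in `transform()` and silently ignores unseen words, which is why labeling must go through the vectorizer fitted at discovery time.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["party agreement consent", "license property grant"])

# transform() reuses the fitted 6-word vocabulary; words outside it are ignored
X = vectorizer.transform(["agreement about some entirely new wording"])
print(int(X.sum()))  # 1 -- only "agreement" is in the vocabulary
```

Refitting a second vectorizer on different texts would instead produce a different vocabulary, and feeding its output to the original LDA model triggers the feature-name mismatch.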
---

## Status

- [x] Fixed `get_risk_labels()` method implementation
- [x] Fixed `train.py` return values for error handling
- [x] Created test script for verification
- [x] Documented the fix
- [x] Ready for production use

**You can now train with LDA!**

---

**Files Modified:**
1. `risk_discovery.py` - Fixed `get_risk_labels()` (lines 350-373)
2. `train.py` - Fixed return statements (lines 66, 89, 152)
3. `test_lda_get_labels.py` - New test script (created)
4. `doc/LDA_FIX_GET_LABELS.md` - This documentation (created)

**Status:** **FIXED AND READY**