code2-repo / doc /LDA_FIX_EXTRACT_FEATURES.md

Deepu1965

Upload folder using huggingface_hub

9b1c753 verified about 2 months ago

preview code

raw

history blame contribute delete

9.01 kB

🔧 LDA Integration Fix #2 - extract_risk_features() Method

❌ Problem Identified

AttributeError: 'LDARiskDiscovery' object has no attribute 'extract_risk_features'

Root Cause: The trainer calls self.risk_discovery.extract_risk_features() to generate synthetic severity and importance scores, but LDARiskDiscovery was missing this method and its required attributes.

✅ Solution Applied

Added to `LDARiskDiscovery` class:

1. Legal Pattern Attributes

# Legal language patterns
self.legal_indicators = {
    'obligation_strength': r'\b(?:shall|must|required|...)\b',
    'prohibition_terms': r'\b(?:shall not|must not|...)\b',
    'conditional_risk': r'\b(?:if|unless|provided|...)\b',
    'liability_terms': r'\b(?:liable|responsibility|...)\b',
    'temporal_urgency': r'\b(?:immediately|within|...)\b',
    'monetary_terms': r'\$|USD|dollar|payment|...',
    'parties': r'\b(?:Party|Parties|Company|...)\b',
    'dates': r'\b(?:January|February|...)\b'
}

# Legal complexity indicators
self.complexity_indicators = {
    'modal_verbs': r'\b(?:shall|must|may|...)\b',
    'conditional_terms': r'\b(?:if|unless|...)\b',
    'legal_conjunctions': r'\b(?:whereas|therefore|...)\b',
    'obligation_terms': r'\b(?:agrees?|undertakes?|...)\b'
}

2. clean_clause_text() Method

def clean_clause_text(self, text: str) -> str:
    """Clean and normalize clause text"""
    if not isinstance(text, str):
        return ""
    
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Remove special characters but keep legal punctuation
    text = re.sub(r'[^\w\s.,;:()"-]', ' ', text)
    
    return text.strip()

3. extract_risk_features() Method ⭐

def extract_risk_features(self, clause_text: str) -> Dict[str, float]:
    """Extract numerical features that indicate risk levels"""
    text_lower = clause_text.lower()
    words = text_lower.split()
    features = {}
    
    # Basic statistics
    features['clause_length'] = len(words)
    features['sentence_count'] = len(re.split(r'[.!?]+', clause_text))
    features['avg_word_length'] = np.mean([len(w) for w in words])
    
    # Legal language intensity (8 indicators)
    for pattern_name, pattern in self.legal_indicators.items():
        matches = len(re.findall(pattern, text_lower))
        features[f'{pattern_name}_count'] = matches
        features[f'{pattern_name}_density'] = matches / len(words)
    
    # Legal complexity (4 indicators)
    for pattern_name, pattern in self.complexity_indicators.items():
        matches = len(re.findall(pattern, text_lower))
        features[f'{pattern_name}_complexity'] = matches / len(words)
    
    # Composite risk scores
    features['obligation_strength'] = (
        features['obligation_strength_density'] * 2 +
        features['modal_verbs_complexity']
    )
    
    features['legal_complexity'] = (
        features['conditional_terms_complexity'] +
        features['legal_conjunctions_complexity'] +
        features['obligation_terms_complexity']
    )
    
    features['risk_intensity'] = (
        features['liability_terms_density'] * 2 +
        features['prohibition_terms_density'] +
        features['conditional_risk_density']
    )
    
    return features

🎯 How Trainer Uses These Features

In `trainer.py`:

def _generate_synthetic_scores(self, clauses, score_type):
    """Generate synthetic severity/importance scores"""
    scores = []
    
    for clause in clauses:
        # Extract risk features
        features = self.risk_discovery.extract_risk_features(clause)
        
        if score_type == 'severity':
            # Calculate severity from features
            score = (
                features['risk_intensity'] * 30 +
                features['obligation_strength'] * 20 +
                features['prohibition_terms_density'] * 100 +
                features['liability_terms_density'] * 100 +
                min(features['monetary_terms_count'] * 0.5, 2)
            )
        else:  # importance
            score = (
                features['legal_complexity'] * 30 +
                features['clause_length'] / 100 * 10 +
                features['obligation_strength'] * 20
            )
        
        scores.append(min(score, 10.0))
    
    return scores

📊 Feature Categories

1. Basic Statistics (3 features)

clause_length - Number of words
sentence_count - Number of sentences
avg_word_length - Average word length

2. Legal Indicators (16 features - 8 pairs)

obligation_strength_count / obligation_strength_density
prohibition_terms_count / prohibition_terms_density
conditional_risk_count / conditional_risk_density
liability_terms_count / liability_terms_density
temporal_urgency_count / temporal_urgency_density
monetary_terms_count / monetary_terms_density
parties_count / parties_density
dates_count / dates_density

3. Complexity Indicators (4 features)

modal_verbs_complexity
conditional_terms_complexity
legal_conjunctions_complexity
obligation_terms_complexity

4. Composite Scores (3 features)

obligation_strength - Weighted obligation indicator
legal_complexity - Combined complexity score
risk_intensity - Overall risk level

Total: ~26 features per clause

✅ Complete Interface

LDARiskDiscovery now has full compatibility with UnsupervisedRiskDiscovery:

Attributes:

✅ n_clusters
✅ discovered_patterns
✅ cluster_labels
✅ feature_matrix
✅ legal_indicators
✅ complexity_indicators

Methods:

✅ discover_risk_patterns(clauses) - Main discovery method
✅ get_risk_labels(clauses) - Get dominant topics
✅ get_discovered_risk_names() - Get topic names
✅ get_topic_distribution(clauses) - Get probabilities
✅ clean_clause_text(text) - Text cleaning
✅ extract_risk_features(text) - Feature extraction ⭐

🧪 Verification

Test Script: `test_lda_complete.py`

Comprehensive test covering:

✅ All required attributes
✅ All required methods
✅ Feature extraction functionality
✅ Feature types and values
✅ Text cleaning
✅ Label assignment
✅ Probability distributions

Run test:

python3 test_lda_complete.py

Expected output:

✅ Step 1: Import successful
✅ Step 2: Creating LDARiskDiscovery instance...
✅ Step 3: Checking required attributes...
✅ Step 4: Checking required methods...
✅ Step 5: Testing discover_risk_patterns()...
✅ Step 6: Testing extract_risk_features()...
   📊 Sample features:
      - risk_intensity: 0.143
      - obligation_strength: 0.167
      - legal_complexity: 0.000
✅ Step 7: Testing clean_clause_text()...
✅ Step 8: Testing get_risk_labels()...
✅ Step 9: Testing get_topic_distribution()...
✅ Step 10: Testing get_discovered_risk_names()...

🎉 ALL TESTS PASSED!

🚀 Ready to Train

Now you can run training without errors:

python3 train.py

Expected flow:

1. Load data ✅
2. Discover risk patterns using LDA ✅
3. Extract risk features for each clause ✅
4. Generate synthetic severity scores ✅
5. Generate synthetic importance scores ✅
6. Create training datasets ✅
7. Train model ✅

📝 Summary of Fixes

Fix #1: get_risk_labels() (Previous fix)

Implemented topic label extraction
Used argmax() on probability distributions

Fix #2: extract_risk_features() (This fix) ⭐

Added legal_indicators dictionary (8 patterns)
Added complexity_indicators dictionary (4 patterns)
Implemented clean_clause_text() method
Implemented extract_risk_features() method (26+ features)

🎯 Key Points

Full Compatibility - LDARiskDiscovery now matches UnsupervisedRiskDiscovery interface
Feature-Rich - Extracts 26+ numerical features per clause
Domain-Agnostic - Uses general legal language patterns
Trainer-Ready - Works seamlessly with synthetic score generation
Tested - Comprehensive test suite validates all functionality

📚 Files Modified

risk_discovery.py - Added methods and attributes
- Lines 319-344: Added legal indicators
- Lines 420-446: Added clean_clause_text()
- Lines 448-496: Added extract_risk_features()
test_lda_complete.py - New comprehensive test (150 lines)
doc/LDA_FIX_EXTRACT_FEATURES.md - This documentation

✅ Status

Added legal_indicators dictionary
Added complexity_indicators dictionary
Implemented clean_clause_text() method
Implemented extract_risk_features() method
Created comprehensive test script
Documented all changes
READY FOR TRAINING 🎉

Next: Run python3 train.py to train with LDA! 🚀