---
title: Enhanced Readability Assessment Random Forest Model
emoji: 🌲
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
tags:
- readability
- nlp
- education
- random-forest
- sklearn
- enhanced-features
- linguistic-analysis
license: mit
---
# Enhanced Readability Assessment Random Forest Model
This model predicts the reading grade level of English text using a Random Forest regressor trained on a broad set of linguistic features, with feature selection and robust scaling to improve generalization.
## Model Performance
- **Cross-Validation MAE**: 0.418
- **Training R²**: N/A
- **Test Set R²**: 0.846
- **Training Date**: 2025-07-02T23:04:35.722721
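
For readers unfamiliar with the two metrics above, the following is a minimal sketch of how MAE and R² are defined. The grade values are made up for illustration and are not taken from this model's evaluation; the helper names are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted grade levels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Made-up grade levels, for illustration only.
true_grades = [2.0, 4.0, 6.0, 8.0]
predicted_grades = [2.5, 3.5, 6.5, 7.5]

mae = mean_absolute_error(true_grades, predicted_grades)
r2 = r_squared(true_grades, predicted_grades)
print(f"MAE: {mae:.3f}, R²: {r2:.3f}")
```

An MAE near 0.4 therefore means predictions are off by well under half a grade level on average, assuming the target is expressed in grade levels.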
## Enhanced Features
This enhanced model computes **36** features in total and keeps **25** of them after feature selection (SelectKBest with f_regression, as listed under Model Architecture):
### Feature Categories:
- **Traditional Readability Metrics**: Flesch-Kincaid, Coleman-Liau, ARI, etc.
- **Age of Acquisition (AoA) Metrics**: Word difficulty based on acquisition age
- **Syntactic Complexity**: Sentence structure and parsing depth
- **Lexical Diversity**: Vocabulary richness and variation
- **Morphological Features**: Word formation patterns
- **Semantic Features**: Word meaning and context complexity
- **Corpus Source Indicators**: Training data source information
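
To make two of the categories above concrete, here is a rough sketch of one traditional readability metric (Flesch-Kincaid grade level) and one lexical diversity measure (type-token ratio). It is not the model's own feature extractor: the function names, the regex tokenization, and the vowel-group syllable heuristic are all assumptions chosen for brevity.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def tokenize_words(text: str) -> list[str]:
    """Very simple word tokenizer (assumption: ASCII letters and apostrophes)."""
    return re.findall(r"[A-Za-z']+", text)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize_words(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def type_token_ratio(text: str) -> float:
    """Lexical diversity: distinct lowercase words divided by total words."""
    words = [w.lower() for w in tokenize_words(text)]
    if not words:
        raise ValueError("text must contain at least one word")
    return len(set(words)) / len(words)


sample = "The cat sat. The cat ran."
fk_grade = flesch_kincaid_grade(sample)
ttr = type_token_ratio(sample)
print(f"Flesch-Kincaid grade: {fk_grade:.2f}, type-token ratio: {ttr:.2f}")
```

Very simple text can yield a negative Flesch-Kincaid value, which is expected from the formula. A production extractor would likely use a pronunciation dictionary or a dedicated library for syllable counts.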
### Key Improvements:
- **Feature Selection**: Automated selection of most predictive features
- **Robust Scaling**: Better handling of outliers and extreme values
- **Enhanced Generalization**: Optimized hyperparameters for cross-domain performance
- **Comprehensive Evaluation**: Multi-dataset validation
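
The robust scaling point can be illustrated with a small sketch comparing scikit-learn's `RobustScaler` (median and interquartile range) with `StandardScaler` (mean and standard deviation). The feature column below is made up and contains one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One made-up feature column with a single extreme value.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(values)      # centers on median, scales by IQR
standard = StandardScaler().fit_transform(values)  # centers on mean, scales by std

# Spread of the four typical values (everything except the outlier).
robust_spread = float(robust[:4].max() - robust[:4].min())
standard_spread = float(standard[:4].max() - standard[:4].min())

print("RobustScaler:  ", robust.ravel().round(3))
print("StandardScaler:", standard.ravel().round(3))
print(f"Spread of typical values: robust={robust_spread:.3f}, standard={standard_spread:.3f}")
```

With the outlier present, the mean and standard deviation are pulled toward it, so the typical values end up squeezed together after standard scaling, while the median and IQR keep them well separated.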
## Model Architecture
- **Algorithm**: Random Forest Regressor
- **Trees**: 200 estimators for stability
- **Max Depth**: Controlled to prevent overfitting
- **Feature Selection**: SelectKBest with f_regression
- **Scaling**: RobustScaler for outlier resistance
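
The components listed above could be assembled roughly as follows. This is a sketch on synthetic data, not the actual training code: the step order (scaling before selection), `max_depth=10`, the random seeds, and the 5-fold cross-validation are assumptions, while the 36 input features, 25 selected features, and 200 trees come from this card.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the 36 linguistic features and a grade-level target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 36))
y = 6.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

pipeline = Pipeline([
    ("scale", RobustScaler()),                               # outlier-resistant scaling
    ("select", SelectKBest(score_func=f_regression, k=25)),  # keep 25 of 36 features
    ("forest", RandomForestRegressor(
        n_estimators=200,   # tree count stated in this card
        max_depth=10,       # assumed value; the card only says depth is controlled
        random_state=0,
    )),
])

# Cross-validated MAE; scikit-learn reports it negated so that higher is better.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_absolute_error")
cv_mae = float(-scores.mean())

pipeline.fit(X, y)
n_selected = int(pipeline.named_steps["select"].get_support().sum())
print(f"CV MAE on synthetic data: {cv_mae:.3f}, selected features: {n_selected}")
```

Wrapping the scaler and selector in a `Pipeline` means they are refit inside each cross-validation fold, which avoids leaking information from the held-out fold into feature selection.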
## Usage
```python
import joblib
import pandas as pd
import numpy as np
# Load the pickled model. Unpickling requires the definition of
# EnhancedReadabilityRandomForestModel to be importable in this environment.
model = joblib.load('enhanced_readability_random_forest.pkl')

# The loaded object is a complete EnhancedReadabilityRandomForestModel instance
# with built-in feature computation and prediction methods.
# Example usage (simplified):
# predicted_grade = model.predict_text("Your text here")
```
## Training Data
- **Primary**: WeeBit corpus (age-graded web content)
- **Secondary**: CLEAR corpus (simplified text pairs)
- **Validation**: Multiple independent datasets
- **Total Samples**: 2500
## Performance Comparison
This enhanced model shows improved performance over the baseline:
- Better cross-validation stability
- Enhanced feature representation
- Improved generalization to unseen text types
- More robust predictions across grade levels
## Citation
If you use this enhanced model, please cite:
```
Enhanced Readability Assessment Random Forest Model
With Comprehensive Linguistic Features and Improved Generalization
Trained on WeeBit and CLEAR corpus data
2025-07-02
```
## License
MIT License - See LICENSE file for details.