---
title: Enhanced Readability Assessment Random Forest Model
emoji: 🌲
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
tags:
- readability
- nlp
- education
- random-forest
- sklearn
- enhanced-features
- linguistic-analysis
license: mit
---

# Enhanced Readability Assessment Random Forest Model

This model predicts the reading grade level of English text. It is a Random Forest regressor trained on linguistic features (traditional readability formulas, age of acquisition, syntax, and vocabulary measures) and tuned to generalize across text sources.

## Model Performance

- **Cross-Validation MAE**: 0.418
- **Training R²**: N/A
- **Test Set R²**: 0.846
- **Training Date**: 2025-07-02T23:04:35
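
As a rough illustration of how metrics like these are typically obtained with scikit-learn, the sketch below computes a 5-fold cross-validated MAE and a held-out R² on synthetic data. The data, the split, the fold count, and the forest size are all assumptions made for the example; they are not the settings behind the numbers reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (hypothetical): 300 texts, 5 features, grade-like target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 5.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn reports MAE negated so that higher is always better; flip the sign.
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -cv_scores.mean()

# R² is measured on data the model never saw during fitting.
model.fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"CV MAE: {cv_mae:.3f}  Test R²: {test_r2:.3f}")
```

MAE is expressed in the units of the target, so for this model it reads as an average error in grade levels.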

## Enhanced Features

The model computes **36** features in total and keeps the **25** most predictive, selected with SelectKBest (f_regression).

### Feature Categories:

- **Traditional Readability Metrics**: Flesch-Kincaid, Coleman-Liau, ARI, etc.
- **Age of Acquisition (AoA) Metrics**: Word difficulty based on acquisition age
- **Syntactic Complexity**: Sentence structure and parsing depth
- **Lexical Diversity**: Vocabulary richness and variation
- **Morphological Features**: Word formation patterns
- **Semantic Features**: Word meaning and context complexity
- **Corpus Source Indicators**: Training data source information
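
To make the first and fourth categories concrete, here is a minimal sketch of three such features: the Flesch-Kincaid grade level, the Automated Readability Index (ARI), and a type-token ratio. The function names, the regex tokenizer, and the vowel-run syllable heuristic are assumptions for illustration. The model's own feature code is not shown in this card and may differ.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def basic_features(text):
    """Compute a few illustrative readability features for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        raise ValueError("text must contain at least one word")

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    letters_per_word = sum(len(w) for w in words) / len(words)

    return {
        # Standard Flesch-Kincaid grade level formula.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
        # Standard Automated Readability Index formula.
        "ari": 4.71 * letters_per_word + 0.5 * words_per_sentence - 21.43,
        # Lexical diversity: distinct words divided by total words.
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }


example = basic_features("The cat sat. The dog ran.")
print(example)
```

Very short or very simple texts can yield negative grade values from these formulas, which is expected behavior of the formulas rather than a bug.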

### Key Improvements:

- **Feature Selection**: Automated selection of most predictive features
- **Robust Scaling**: Better handling of outliers and extreme values
- **Enhanced Generalization**: Optimized hyperparameters for cross-domain performance
- **Comprehensive Evaluation**: Multi-dataset validation
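
The effect of robust scaling on outliers can be seen in a small comparison. The sketch below scales one made-up feature column containing a single extreme value with both `RobustScaler` (median and interquartile range) and `StandardScaler` (mean and standard deviation). The numbers are hypothetical and chosen only to make the difference visible.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature column with a single extreme value (hypothetical numbers).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit(values)
standard = StandardScaler().fit(values)

robust_scaled = robust.transform(values).ravel()
standard_scaled = standard.transform(values).ravel()

# RobustScaler centers on the median and divides by the interquartile range,
# so the outlier does not change how the ordinary values are spaced.
print("median:", robust.center_[0], "IQR:", robust.scale_[0])
print("robust:  ", robust_scaled)
print("standard:", standard_scaled)
```

With `StandardScaler`, the outlier inflates the standard deviation, so the four ordinary values are squeezed into a narrow band. With `RobustScaler` they keep an even spacing of 0.5.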

## Model Architecture

- **Algorithm**: Random Forest Regressor
- **Trees**: 200 estimators for stability
- **Max Depth**: Controlled to prevent overfitting
- **Feature Selection**: SelectKBest with f_regression
- **Scaling**: RobustScaler for outlier resistance
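
The components listed above can be chained in a scikit-learn `Pipeline`. The following is a minimal sketch, not the packaged model's actual code: the step order, `max_depth=12`, the random seed, and the synthetic data are assumptions, while 200 estimators, `SelectKBest` with `f_regression`, `RobustScaler`, and the 36-to-25 feature reduction come from this card.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler


def build_pipeline(k=25, max_depth=12, seed=0):
    """Scale, select the k best features, then fit a random forest.

    max_depth=12 is a placeholder: the card only says depth is controlled.
    """
    return Pipeline([
        ("scale", RobustScaler()),
        ("select", SelectKBest(score_func=f_regression, k=k)),
        ("forest", RandomForestRegressor(
            n_estimators=200, max_depth=max_depth, random_state=seed)),
    ])


# Synthetic stand-in data: 200 texts x 36 features, hypothetical grade targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 36))
y = 6.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

pipeline = build_pipeline()
pipeline.fit(X, y)

# Count how many of the 36 input columns survived feature selection.
n_selected = int(pipeline.named_steps["select"].get_support().sum())
predictions = pipeline.predict(X[:5])
print(n_selected, predictions.shape)
```

Keeping the scaler and selector inside the pipeline means they are refit on each training fold during cross-validation, which avoids leaking information from held-out data into feature selection.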

## Usage

```python
import joblib

# Load the enhanced model. The pickle stores a complete
# EnhancedReadabilityRandomForestModel instance, so the module that defines
# that class must be importable in the environment where you load it.
model = joblib.load('enhanced_readability_random_forest.pkl')

# The instance has built-in feature computation and prediction methods.

# Example usage (simplified):
# predicted_grade = model.predict_text("Your text here")
```

## Training Data

- **Primary**: WeeBit corpus (age-graded web content)
- **Secondary**: CLEAR corpus (simplified text pairs)
- **Validation**: Multiple independent datasets
- **Total Samples**: 2500

## Performance Comparison

This enhanced model shows improved performance over the baseline:

- Better cross-validation stability
- Enhanced feature representation
- Improved generalization to unseen text types
- More robust predictions across grade levels

## Citation

If you use this enhanced model, please cite:

```
Enhanced Readability Assessment Random Forest Model
With Comprehensive Linguistic Features and Improved Generalization
Trained on WeeBit and CLEAR corpus data
2025-07-02
```

## License

MIT License. See the LICENSE file for details.