---
title: Enhanced Readability Assessment Random Forest Model
emoji: 🌲
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
tags:
- readability
- nlp
- education
- random-forest
- sklearn
- enhanced-features
- linguistic-analysis
license: mit
---

# Enhanced Readability Assessment Random Forest Model

This model predicts the reading grade level of English text using a Random Forest regressor trained on a broad set of linguistic features and tuned for generalization across text types.

## Model Performance

- **Cross-Validation MAE**: 0.4179
- **Training R²**: N/A
- **Test Set R²**: 0.8461
- **Training Date**: 2025-07-02T23:04:35.722721

## Enhanced Features

This enhanced model computes **36** features and keeps **25** of them, selected with SelectKBest (`f_regression`).

### Feature Categories:

- **Traditional Readability Metrics**: Flesch-Kincaid, Coleman-Liau, ARI, etc.
- **Age of Acquisition (AoA) Metrics**: Word difficulty based on acquisition age
- **Syntactic Complexity**: Sentence structure and parsing depth
- **Lexical Diversity**: Vocabulary richness and variation
- **Morphological Features**: Word formation patterns
- **Semantic Features**: Word meaning and context complexity
- **Corpus Source Indicators**: Training data source information

### Key Improvements:

- **Feature Selection**: Automated selection of the most predictive features
- **Robust Scaling**: Better handling of outliers and extreme values
- **Enhanced Generalization**: Hyperparameters optimized for cross-domain performance
- **Comprehensive Evaluation**: Multi-dataset validation

## Model Architecture

- **Algorithm**: Random Forest Regressor
- **Trees**: 200 estimators for stability
- **Max Depth**: Controlled to prevent overfitting
- **Feature Selection**: SelectKBest with f_regression
- **Scaling**: RobustScaler for outlier resistance

## Usage

```python
import joblib

# Load the enhanced model.
# Note: unpickling requires the EnhancedReadabilityRandomForestModel class
# to be importable in the current environment.
model = joblib.load('enhanced_readability_random_forest.pkl')

# The model is a complete EnhancedReadabilityRandomForestModel instance
# with built-in feature computation and prediction methods.

# Example usage (simplified):
# predicted_grade = model.predict_text("Your text here")
```

## Training Data

- **Primary**: WeeBit corpus (age-graded web content)
- **Secondary**: CLEAR corpus (simplified text pairs)
- **Validation**: Multiple independent datasets
- **Total Samples**: 2500

## Performance Comparison

This enhanced model shows improved performance over the baseline:

- Better cross-validation stability
- Enhanced feature representation
- Improved generalization to unseen text types
- More robust predictions across grade levels

## Citation

If you use this enhanced model, please cite:

```
Enhanced Readability Assessment Random Forest Model
With Comprehensive Linguistic Features and Improved Generalization
Trained on WeeBit and CLEAR corpus data
2025-07-02
```

## License

MIT License - See LICENSE file for details.
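
## Appendix: Pipeline Sketch

The components listed under Model Architecture (RobustScaler, SelectKBest with `f_regression`, and a 200-tree Random Forest regressor) can be chained roughly as follows. This is a minimal sketch, not the code used to train the released model. The data is synthetic, the step names are illustrative, and `max_depth=12` is a hypothetical value, since the model card only says depth is "controlled".

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the 36 linguistic features (assumption: the real
# features come from the model's own feature-extraction code).
X, y = make_regression(
    n_samples=200, n_features=36, n_informative=10, noise=5.0, random_state=0
)

pipeline = Pipeline([
    # Scale with median and IQR so outliers have less influence.
    ("scale", RobustScaler()),
    # Keep the 25 features with the highest univariate F-scores.
    ("select", SelectKBest(score_func=f_regression, k=25)),
    # 200 trees as stated in the model card; max_depth is a hypothetical value.
    ("forest", RandomForestRegressor(
        n_estimators=200, max_depth=12, random_state=0
    )),
])

# Cross-validated mean absolute error (scikit-learn reports it negated).
cv_mae = -cross_val_score(
    pipeline, X, y, cv=3, scoring="neg_mean_absolute_error"
)
print("Cross-validation MAE per fold:", np.round(cv_mae, 2))

# Fit on all data and inspect the selection step.
pipeline.fit(X, y)
n_selected = int(pipeline.named_steps["select"].get_support().sum())
predictions = pipeline.predict(X[:5])
print("Selected features:", n_selected)
```

Keeping the scaler and selector inside the `Pipeline` means both are refit on the training portion of each cross-validation fold, so the held-out fold does not leak into scaling or feature selection.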