---
title: Enhanced Readability Assessment Random Forest Model
emoji: 🌲
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
tags:
- readability
- nlp
- education
- random-forest
- sklearn
- enhanced-features
- linguistic-analysis
license: mit
---

# Enhanced Readability Assessment Random Forest Model

This model predicts the reading grade level of English text. It is a Random Forest regressor trained on a broad set of linguistic features, with feature selection and robust scaling applied to improve generalization across text types.

## Model Performance

- **Cross-Validation MAE**: 0.418 (in grade levels)
- **Training R²**: not reported
- **Test Set R²**: 0.846
- **Training Date**: 2025-07-02T23:04:35
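
As a rough illustration of how figures like these are typically produced with scikit-learn, the sketch below computes a cross-validated MAE and a held-out R² on synthetic data. The dataset, split ratio, fold count, and forest size here are assumptions for demonstration only; they are not the settings used to train this model, and the printed numbers will not match the ones above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data with 36 features (the feature count on this card).
# The targets are arbitrary numbers, not real grade levels.
X, y = make_regression(n_samples=300, n_features=36, n_informative=10,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A small forest keeps the example fast; the real model uses 200 trees.
model = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn reports MAE as a negated score, so flip the sign.
cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                            scoring="neg_mean_absolute_error")
cv_mae = -cv_scores.mean()

# Fit on the training split and score on the held-out split.
model.fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"Cross-validation MAE: {cv_mae:.3f}")
print(f"Test set R^2: {test_r2:.3f}")
```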

## Enhanced Features

The model computes **36** features in total, of which **25** are kept after feature selection with **SelectKBest (f_regression)**.

### Feature Categories:
- **Traditional Readability Metrics**: Flesch-Kincaid, Coleman-Liau, ARI, etc.
- **Age of Acquisition (AoA) Metrics**: Word difficulty based on acquisition age
- **Syntactic Complexity**: Sentence structure and parsing depth
- **Lexical Diversity**: Vocabulary richness and variation
- **Morphological Features**: Word formation patterns
- **Semantic Features**: Word meaning and context complexity
- **Corpus Source Indicators**: Training data source information
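
To make a few of these categories concrete, here is a minimal sketch that computes the Automated Readability Index (ARI), average sentence length, and a type-token ratio from raw text. The function name `basic_features` and the regex-based tokenization are assumptions for illustration; the model's own feature extractor is not shown in this card and may differ.

```python
import re


def basic_features(text: str) -> dict:
    """Compute three simple readability features from raw English text."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: runs of letters, digits, and apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    # ARI counts letters and digits only, so apostrophes are excluded.
    chars = sum(ch.isalnum() for word in words for ch in word)
    words_per_sentence = len(words) / len(sentences)

    # Standard ARI formula.
    ari = 4.71 * (chars / len(words)) + 0.5 * words_per_sentence - 21.43

    # Type-token ratio: distinct lowercase word forms over total words.
    ttr = len({w.lower() for w in words}) / len(words)

    return {
        "ari": ari,
        "avg_sentence_length": words_per_sentence,
        "type_token_ratio": ttr,
    }


features = basic_features("The cat sat. The dog ran away.")
print(features)
```

Very short texts can produce negative ARI values, which is expected for the formula and one reason such metrics are combined with other features.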

### Key Improvements:
- **Feature Selection**: Automated selection of most predictive features
- **Robust Scaling**: Better handling of outliers and extreme values
- **Enhanced Generalization**: Optimized hyperparameters for cross-domain performance
- **Comprehensive Evaluation**: Multi-dataset validation
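
The robust scaling point can be illustrated with a small sketch. `RobustScaler` centers on the median and scales by the interquartile range, so a single extreme value barely shifts its statistics, whereas `StandardScaler` uses the mean and standard deviation. The feature values below are made up for demonstration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One hypothetical feature column with a single extreme value.
values = np.array([[8.0], [9.0], [10.0], [11.0], [12.0], [200.0]])

robust = RobustScaler().fit(values)      # centers on median, scales by IQR
standard = StandardScaler().fit(values)  # centers on mean, scales by std

print("median:", robust.center_[0], "IQR:", robust.scale_[0])
print("mean:", standard.mean_[0], "std:", standard.scale_[0])

# The typical values stay close to zero after robust scaling,
# while the outlier remains clearly separated.
print(robust.transform(values).ravel())
```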

## Model Architecture

- **Algorithm**: Random Forest Regressor
- **Trees**: 200 estimators for stability
- **Max Depth**: Controlled to prevent overfitting
- **Feature Selection**: SelectKBest with f_regression
- **Scaling**: RobustScaler for outlier resistance
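
The components listed above could be assembled roughly as follows. This is a sketch under stated assumptions, not the actual training code: the step order, `max_depth=12`, the random seed, and the synthetic data are all illustrative choices. Only the 200 trees, `SelectKBest` with `f_regression`, `RobustScaler`, and the 36-to-25 feature reduction come from this card.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

pipeline = Pipeline([
    # Scale each feature by median and IQR.
    ("scale", RobustScaler()),
    # Keep the 25 features with the highest univariate F-scores.
    ("select", SelectKBest(score_func=f_regression, k=25)),
    # max_depth=12 is an assumed value; the card only says depth is limited.
    ("forest", RandomForestRegressor(n_estimators=200, max_depth=12,
                                     random_state=42)),
])

# Synthetic stand-in for the 36 computed features.
X, y = make_regression(n_samples=200, n_features=36, n_informative=12,
                       noise=1.0, random_state=42)
pipeline.fit(X, y)

# Boolean mask over the 36 input features marking the 25 that were kept.
selected = pipeline.named_steps["select"].get_support()
print("features kept:", int(selected.sum()))
print("sample predictions:", pipeline.predict(X[:3]))
```

Wrapping the steps in a single `Pipeline` ensures that scaling and selection are fitted only on training folds during cross-validation, which avoids leaking test data into the preprocessing.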

## Usage

```python
import joblib

# Unpickling requires the EnhancedReadabilityRandomForestModel class to be
# importable in the current environment, because joblib stores a reference
# to the class rather than its source code.
model = joblib.load('enhanced_readability_random_forest.pkl')

# The loaded object is a complete EnhancedReadabilityRandomForestModel instance
# with built-in feature computation and prediction methods.

# Example usage (simplified):
# predicted_grade = model.predict_text("Your text here")
```

## Training Data

- **Primary**: WeeBit corpus (age-graded web content)
- **Secondary**: CLEAR corpus (simplified text pairs)
- **Validation**: Multiple independent datasets
- **Total Samples**: 2,500

## Performance Comparison

This enhanced model shows improved performance over the baseline:
- Better cross-validation stability
- Enhanced feature representation
- Improved generalization to unseen text types
- More robust predictions across grade levels

## Citation

If you use this enhanced model, please cite:
```
Enhanced Readability Assessment Random Forest Model
With Comprehensive Linguistic Features and Improved Generalization
Trained on WeeBit and CLEAR corpus data
2025-07-02
```

## License

MIT License - See LICENSE file for details.