yimingwang123/enhanced-readability-random-forest

enhanced-readability-random-forest / README.md

	---
	title: Enhanced Readability Assessment Random Forest Model
	emoji: 🌲
	colorFrom: green
	colorTo: blue
	sdk: gradio
	sdk_version: 4.0.0
	app_file: app.py
	pinned: false
	tags:
	- readability
	- nlp
	- education
	- random-forest
	- sklearn
	- enhanced-features
	- linguistic-analysis
	license: mit
	---

	# Enhanced Readability Assessment Random Forest Model

	This model predicts the reading grade level of English text. It is a Random Forest regressor trained on a broad set of linguistic features, with feature selection and robust scaling intended to improve generalization across text sources.

	## Model Performance

	- Cross-Validation MAE: 0.418
	- Training R²: not reported
	- Test Set R²: 0.846
	- Training Date: 2025-07-02
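For context on how metrics of this kind are usually computed, here is a minimal scikit-learn sketch. The synthetic data, the 5-fold setting, the 80/20 split, and the 50-tree forest are all assumptions made for illustration; this card does not describe the actual evaluation protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real feature matrix and grade labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 6.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn exposes MAE as a negated score, so flip the sign back.
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -cv_scores.mean()

# R² on the held-out split is the share of label variance the model explains.
model.fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"Cross-validation MAE: {cv_mae:.3f}")
print(f"Test R²: {test_r2:.3f}")
```

An MAE is expressed in the units of the target, so for a grade-level target it reads as an average error in grade levels.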

	## Enhanced Features

	The model computes 36 features in total, of which 25 are kept after feature selection with SelectKBest (f_regression):

	### Feature Categories:
	- Traditional Readability Metrics: Flesch-Kincaid, Coleman-Liau, ARI, etc.
	- Age of Acquisition (AoA) Metrics: Word difficulty based on acquisition age
	- Syntactic Complexity: Sentence structure and parsing depth
	- Lexical Diversity: Vocabulary richness and variation
	- Morphological Features: Word formation patterns
	- Semantic Features: Word meaning and context complexity
	- Corpus Source Indicators: Training data source information
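To make two of these categories concrete, the sketch below computes one traditional readability metric (the Automated Readability Index) and one lexical diversity measure (type-token ratio). The helper names, the regex tokenizer, and the naive sentence splitter are assumptions for illustration and are not taken from the model's own feature code.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split on ., ! and ? (a simplification, not a real tokenizer)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize_words(text: str) -> list[str]:
    """Extract alphabetic word tokens, keeping internal apostrophes."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words = tokenize_words(text)
    sentences = split_sentences(text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Count letters only; digits are ignored by this simplified tokenizer.
    chars = sum(ch.isalpha() for word in words for ch in word)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


def type_token_ratio(text: str) -> float:
    """Unique lowercase word types divided by total word tokens."""
    words = [w.lower() for w in tokenize_words(text)]
    if not words:
        raise ValueError("text must contain at least one word")
    return len(set(words)) / len(words)


sample = "The cat sat. The dog ran."
print(automated_readability_index(sample))
print(type_token_ratio(sample))
```

Very simple text can yield a negative ARI, which is expected: the formula is a linear fit, not a bounded scale.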

	### Key Improvements:
	- Feature Selection: Automated selection of most predictive features
	- Robust Scaling: Better handling of outliers and extreme values
	- Enhanced Generalization: Optimized hyperparameters for cross-domain performance
	- Comprehensive Evaluation: Multi-dataset validation
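As an illustration of why robust scaling helps with outliers, the following sketch compares scikit-learn's `RobustScaler` (centers on the median, scales by the interquartile range) with `StandardScaler` (mean and standard deviation) on a toy column containing one extreme value. The numbers are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature column with a single extreme value (toy data).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median = 3 and IQR = 4 - 2 = 2, so each value maps to (x - 3) / 2.
robust = RobustScaler().fit_transform(values)

# The mean and standard deviation are both pulled toward the outlier.
standard = StandardScaler().fit_transform(values)

print("robust:  ", robust.ravel())
print("standard:", standard.ravel())
```

With robust scaling the four typical values stay evenly spread around zero, while standard scaling compresses them into a narrow band because the outlier inflates the standard deviation.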

	## Model Architecture

	- Algorithm: Random Forest Regressor
	- Trees: 200 estimators for stability
	- Max Depth: Capped to limit overfitting (value not stated)
	- Feature Selection: SelectKBest with f_regression
	- Scaling: RobustScaler for outlier resistance
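The components listed above could be assembled roughly as follows. This is a sketch under stated assumptions, not the author's training code: the step order (scaling before selection), `max_depth=10`, the random seed, and the synthetic 36-column dataset are all illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data with 36 columns to mirror the stated feature count (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 36))
y = 5.0 + X[:, :3].sum(axis=1) + rng.normal(scale=0.2, size=200)

pipeline = Pipeline([
    ("scale", RobustScaler()),                               # median/IQR scaling
    ("select", SelectKBest(score_func=f_regression, k=25)),  # keep 25 of 36
    ("forest", RandomForestRegressor(
        n_estimators=200,   # tree count stated in this card
        max_depth=10,       # assumed value; the card gives no number
        random_state=42,
    )),
])
pipeline.fit(X, y)

n_selected = int(pipeline.named_steps["select"].get_support().sum())
predictions = pipeline.predict(X[:5])
print(n_selected, predictions.shape)
```

Wrapping the steps in a `Pipeline` keeps scaling and selection fitted only on training folds during cross-validation, which avoids leaking information from held-out data.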

	## Usage

	```python
	import joblib

	# Load the enhanced model. Unpickling requires the
	# EnhancedReadabilityRandomForestModel class to be importable
	# in the current environment.
	model = joblib.load('enhanced_readability_random_forest.pkl')

	# The model is a complete EnhancedReadabilityRandomForestModel instance
	# with built-in feature computation and prediction methods.

	# Example usage (simplified):
	predicted_grade = model.predict_text("Your text here")
	print(predicted_grade)
	```

	## Training Data

	- Primary: WeeBit corpus (age-graded web content)
	- Secondary: CLEAR corpus (simplified text pairs)
	- Validation: Multiple independent datasets
	- Total Samples: 2500

	## Performance Comparison

	This enhanced model shows improved performance over the baseline:
	- Better cross-validation stability
	- Enhanced feature representation
	- Improved generalization to unseen text types
	- More robust predictions across grade levels

	## Citation

	If you use this enhanced model, please cite:
	```
	Enhanced Readability Assessment Random Forest Model
	With Comprehensive Linguistic Features and Improved Generalization
	Trained on WeeBit and CLEAR corpus data
	2025-07-02
	```

	## License

	MIT License - See LICENSE file for details.