---
language:
- ja
- en
tags:
- xgboost
- anime
- difficulty-estimation
- education
- japanese
- language-learning
- nlp
- difficulty-prediction
metrics:
- rmse
- mae
- r2
model_type: xgboost
---

# Anime Japanese Difficulty Predictor

This project implements an XGBoost regression model to predict the Japanese language difficulty of anime series. The model assigns each title a score on a 0-50 scale based on linguistic statistics extracted from its subtitles, vocabulary coverage, and semantic content.

## Dataset and Ground Truth

The model was trained on a dataset of approximately **1,100 anime series and movies**.

* **Source:** Difficulty ratings were sourced from **Natively (learnnatively.com)** using the platform's "Data Download" feature.
* **Scale:** 0 to 50 (user-generated ratings).
* **Distribution:** The ratings are roughly normally distributed, with most titles concentrated in the 15-35 range, reflecting the typical difficulty of broadcast anime.

## Data Collection

Subtitle data was aggregated using `jimaku-downloader`, a custom tool that interfaces with the **Jimaku.cc** API.

* **Extraction:** The tool uses regex-based parsing to identify episodes and map them to metadata.
* **Selection Logic:** Priority was given to official Web-DL sources and to plain-text SRT files over OCR-derived or styled ASS subtitles.
* **Potential Noise:** Because Jimaku relies on user and group uploads, and episode mapping is automated via regex, some entries may carry inaccurate subtitle timing or mismatched versions.
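
As a rough illustration of the regex-based mapping step, episode-number extraction from a filename might look like the sketch below. The patterns here are purely illustrative; the actual patterns used by `jimaku-downloader` are not documented here.

```python
import re
from typing import Optional

# Illustrative patterns only; the real tool's patterns are not documented.
EPISODE_PATTERNS = [
    re.compile(r"[Ee][Pp]\.?\s*(\d{1,3})"),  # "Ep.12", "EP 12"
    re.compile(r"\s-\s(\d{1,3})\b"),         # "Title - 12"
    re.compile(r"第(\d{1,3})話"),             # "第12話"
]

def extract_episode(filename: str) -> Optional[int]:
    """Return the first episode number found in a subtitle filename."""
    for pattern in EPISODE_PATTERNS:
        match = pattern.search(filename)
        if match:
            return int(match.group(1))
    return None
```

Pattern-based extraction like this is where the version-matching noise described above comes from: a filename that fits none of the patterns, or fits the wrong one, yields a mismapped episode.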

## Feature Engineering

The model combines hard statistical features with semantic embeddings.

### 1. Statistical Features

* **Density Metrics:** Characters per minute (CPM), kanji density, type-token ratio (TTR).
* **Vocabulary Coverage:** Percentage of words appearing in common frequency lists (top 1k, 2k, 5k, 10k).
* **Comprehension Thresholds:** Number of unique words required to reach 90%, 95%, and 98% text coverage.
* **JLPT Distribution:** Proportion of vocabulary corresponding to JLPT levels N5 through N1.
* **Part-of-Speech:** Distribution of word types (nouns, verbs, particles, etc.).
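
A minimal sketch of how a few of these metrics could be computed, assuming pre-tokenized input from a morphological analyzer (e.g. MeCab); the exact tokenizer and definitions used in training are not specified here.

```python
from collections import Counter

def statistical_features(tokens: list, runtime_minutes: float) -> dict:
    """Illustrative versions of CPM, kanji density, TTR, and a coverage threshold."""
    text = "".join(tokens)
    counts = Counter(tokens)

    # Characters per minute: raw subtitle characters over runtime.
    cpm = len(text) / runtime_minutes

    # Kanji density: share of characters in the CJK Unified Ideographs block.
    kanji = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    kanji_density = kanji / len(text)

    # Type-token ratio: unique tokens over total tokens.
    ttr = len(counts) / len(tokens)

    # Comprehension threshold: unique words needed for 95% token coverage,
    # taking words in descending frequency order.
    target = 0.95 * len(tokens)
    covered, words_needed = 0, 0
    for _, freq in counts.most_common():
        covered += freq
        words_needed += 1
        if covered >= target:
            break

    return {"cpm": cpm, "kanji_density": kanji_density,
            "ttr": ttr, "words_for_95": words_needed}
```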

### 2. Semantic Features

* **Text Inputs:**
  * Series description.
  * "Lexical Signature": a concatenation of the top 200 most frequent content words (excluding stopwords) extracted from the subtitles.
* **Encoding:** Text is encoded with `paraphrase-multilingual-MiniLM-L12-v2`.
* **Dimensionality Reduction:** The high-dimensional embeddings are reduced to 30 components via PCA.
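
The lexical-signature step can be sketched as follows. The stopword set below is illustrative; the actual list used in training is not documented.

```python
from collections import Counter

# Illustrative Japanese stopword set; the real list is not specified here.
STOPWORDS = {"の", "は", "が", "を", "に", "て", "だ", "です", "ます", "た"}

def lexical_signature(tokens: list, top_n: int = 200) -> str:
    """Concatenate the top-N most frequent content words from the subtitles.

    The resulting string is the text that would be encoded by
    `paraphrase-multilingual-MiniLM-L12-v2` before PCA reduction.
    """
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return " ".join(word for word, _ in counts.most_common(top_n))
```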

## Model Architecture

The inference pipeline follows this structure:

1. **Preprocessing:**
   * Numeric features are normalized with `StandardScaler`.
   * Text inputs are vectorized via SentenceTransformer and reduced via PCA.
2. **Estimator:**
   * Algorithm: XGBoost regressor.
   * Optimization: hyperparameters tuned via Optuna (50 trials), minimizing RMSE.
   * Validation: 5-fold cross-validation.

## Performance

Evaluated on a held-out test set (20% split):

| Metric   | Value      |
| :------- | :--------- |
| **RMSE** | **2.3633** |
| **MAE**  | **1.8670** |
| **R²**   | **0.5813** |

## Limitations

* **Subtitle Quality:** Reliance on user-uploaded subtitles introduces variance in transcription accuracy and timing.
* **Ground Truth Subjectivity:** Natively ratings reflect users' perceived difficulty rather than a standardized linguistic index.
* **Parsing Errors:** Automated episode detection during data collection may have mismatched subtitles for a small fraction of the training data.

## Artifacts

The trained model is serialized as `anime_difficulty_model.pkl`. This file contains a dictionary with the following keys:

* `model`: The trained XGBoost regressor.
* `scaler`: Fitted StandardScaler for numeric features.
* `pca`: Fitted PCA object for text embeddings.
* `feature_cols`: List of numeric column names expected by the pipeline.

**Note:** The SentenceTransformer model is not pickled due to its size; it must be re-initialized at inference time.

## Acknowledgements

* **Natively:** For providing the difficulty rating dataset.
* **Jimaku.cc:** For providing access to the subtitle repository.