Upload 3 files
- README.md +100 -0
- anime_difficulty_model.pkl +3 -0
- requirements.txt +6 -0
README.md
ADDED
@@ -0,0 +1,100 @@
---
language:
- ja
- en
tags:
- xgboost
- anime
- difficulty-estimation
- education
- japanese
- language-learning
- nlp
- difficulty-prediction
metrics:
- rmse
- mae
- r2
model_type: xgboost
---

# Anime Japanese Difficulty Predictor

This project implements an XGBoost regression model that predicts the Japanese-language difficulty of anime series. The model assigns a score on a 0-50 scale based on subtitle linguistics, vocabulary statistics, and semantic content.

## Dataset and Ground Truth

The model was trained on a dataset of approximately **1,100 anime series and movies**.

* **Source:** Difficulty ratings were sourced from **Natively (learnnatively.com)** using the platform's "Data Download" feature.
* **Scale:** 0 to 50 (user-generated ratings).
* **Distribution:** Ratings are approximately normally distributed, with most concentrated in the 15-35 range, which represents the standard difficulty of most broadcast anime.

## Data Collection

Subtitle data was aggregated using `jimaku-downloader`, a custom tool that interfaces with the **Jimaku.cc** API.

* **Extraction:** The tool uses regex-based parsing to identify episodes and map them to metadata.
* **Selection Logic:** Priority was given to official Web-DL sources and text-based formats (SRT) over OCR or ASS files.
* **Potential Noise:** Because Jimaku relies on user and group uploads and episode mapping is automated via regex, the dataset likely contains some errors in subtitle timing and version matching.
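
The regex-based episode mapping described above can be illustrated with a minimal sketch. The patterns and the `parse_episode` helper are hypothetical and are not taken from `jimaku-downloader`; they only show how filename matching of this kind works and why it can mislabel files.

```python
import re
from typing import Optional

# Hypothetical patterns, tried in order. The real tool's expressions are not
# documented in this README, so these are illustrative assumptions only.
EPISODE_PATTERNS = [
    re.compile(r"[Ss](\d{1,2})[Ee](\d{1,3})"),                 # S01E05
    re.compile(r"(?:第|#)(\d{1,3})話?"),                        # 第05話, #05
    re.compile(r"ep(?:isode)?[ ._-]?(\d{1,3})", re.IGNORECASE),  # EP05, Episode 5
    re.compile(r"[ ._-](\d{2,3})(?=[ ._-])"),                  # " - 05.", ".05."
]


def parse_episode(filename: str) -> Optional[int]:
    """Return the first episode number found in a subtitle filename, or None."""
    for pattern in EPISODE_PATTERNS:
        match = pattern.search(filename)
        if match:
            # The episode number is always the last capture group of a pattern.
            return int(match.group(match.lastindex))
    return None
```

With these patterns, a title that itself contains a number is misread: `Mob Psycho 100 - 05.srt` is parsed as episode 100 rather than 5, which is the kind of mismatch the noise caveat refers to.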

## Feature Engineering

The model utilizes a combination of hard statistical features and semantic embeddings.

### 1. Statistical Features
* **Density Metrics:** Characters per minute (CPM), Kanji density, Type-Token Ratio (TTR).
* **Vocabulary Coverage:** Percentage of words appearing in common frequency lists (Top 1k, 2k, 5k, 10k).
* **Comprehension Thresholds:** Number of unique words required to reach 90%, 95%, and 98% text coverage.
* **JLPT Distribution:** Proportion of vocabulary corresponding to JLPT levels N5 through N1.
* **Part-of-Speech:** Distribution of word types (nouns, verbs, particles, etc.).
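
A minimal sketch of how several of these statistics could be computed is shown below. It assumes the subtitles have already been tokenized into words (the tokenizer is not specified in this README), that CPM is measured against a duration supplied by the caller, and that "Kanji" means the basic CJK Unified Ideographs block. All function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Set


def is_kanji(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block counts as Kanji.
    return "\u4e00" <= ch <= "\u9fff"


def kanji_density(text: str) -> float:
    """Fraction of non-whitespace characters that are Kanji."""
    chars = [c for c in text if not c.isspace()]
    return sum(is_kanji(c) for c in chars) / len(chars) if chars else 0.0


def characters_per_minute(text: str, duration_seconds: float) -> float:
    """Non-whitespace characters divided by duration in minutes."""
    n_chars = sum(1 for c in text if not c.isspace())
    return n_chars / (duration_seconds / 60.0)


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def list_coverage(tokens: Sequence[str], frequency_list: Set[str]) -> float:
    """Share of running tokens that appear in a frequency list (e.g. Top 1k)."""
    if not tokens:
        return 0.0
    return sum(t in frequency_list for t in tokens) / len(tokens)


def words_for_coverage(tokens: Sequence[str], target: float) -> int:
    """Number of most-frequent unique words needed to cover `target` of the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for n_words, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return n_words
    return len(counts)
```

Calling `words_for_coverage(tokens, 0.95)` would give the 95% comprehension threshold for one series; the 90% and 98% thresholds follow by changing `target`.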

### 2. Semantic Features
* **Text Inputs:**
    * Series description.
    * "Lexical Signature": A concatenation of the top 200 most frequent content words (excluding stopwords) extracted from the subtitles.
* **Encoding:** Text is encoded using `paraphrase-multilingual-MiniLM-L12-v2`.
* **Dimensionality Reduction:** High-dimensional embeddings are reduced to 30 components using PCA.
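
The "Lexical Signature" described above could be built roughly as follows. This is a sketch under assumptions: tokens are taken to be content words already, the stopword set is supplied by the caller, and words are joined with a single space in descending frequency order. The `lexical_signature` name and the separator are not confirmed by this README.

```python
from collections import Counter
from typing import Iterable, Set


def lexical_signature(tokens: Iterable[str], stopwords: Set[str], top_n: int = 200) -> str:
    """Join the `top_n` most frequent non-stopword tokens into one string."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return " ".join(word for word, _ in counts.most_common(top_n))
```

The resulting string is what would then be passed to the sentence encoder alongside the series description.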

## Model Architecture

The inference pipeline follows this structure:

1. **Preprocessing:**
    * Numeric features are normalized using `StandardScaler`.
    * Text inputs are vectorized via SentenceTransformer and reduced via PCA.
2. **Estimator:**
    * Algorithm: XGBoost Regressor.
    * Optimization: Hyperparameters tuned via Optuna (50 trials) minimizing RMSE.
    * Validation: 5-Fold Cross-Validation.
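
The pipeline structure above can be sketched as a single function. Several details are assumptions rather than documented behavior: the scaled numeric features are placed before the PCA components, a single text string is encoded, and the `predict_difficulty` name and the `numeric_row` mapping are hypothetical. The `artifacts` dictionary is expected to hold the `model`, `scaler`, `pca`, and `feature_cols` entries stored in the pickle file.

```python
from typing import Any, Mapping


def predict_difficulty(
    numeric_row: Mapping[str, float],
    text: str,
    artifacts: Mapping[str, Any],
    encoder: Any,
) -> float:
    """Run scaling, text embedding, PCA, and regression for one series."""
    # 1. Order the numeric features as the pipeline expects, then scale them.
    cols = artifacts["feature_cols"]
    numeric = [[float(numeric_row[c]) for c in cols]]
    scaled = artifacts["scaler"].transform(numeric)

    # 2. Embed the text and reduce it with the fitted PCA.
    embedding = encoder.encode([text])
    reduced = artifacts["pca"].transform(embedding)

    # 3. Concatenate (assumed order: numeric first, then PCA components).
    features = [list(scaled[0]) + list(reduced[0])]

    # 4. Predict; the estimator returns one value per input row.
    prediction = artifacts["model"].predict(features)
    return float(prediction[0])
```

If the feature order used at training time differs from the one assumed here, the concatenation step must be changed to match, otherwise predictions will be meaningless.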

## Performance

Evaluated on a held-out test set (20% split):

| Metric   | Value      |
| :------- | :--------- |
| **RMSE** | **2.3633** |
| **MAE**  | **1.8670** |
| **R²**   | **0.5813** |
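
For reference, the three metrics in the table follow their standard definitions, sketched below in plain Python. The `regression_metrics` helper is illustrative and is not the evaluation code used for this model.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute RMSE, MAE, and R² for paired true and predicted ratings."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }
```

If R² was computed this way on the same test set, the reported values imply a standard deviation of roughly 3.65 points in the test-set ratings, since 1 - R² equals the MSE divided by the variance of the true ratings. Always predicting the mean rating would therefore give an RMSE of about 3.65, compared with 2.36 for the model.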

## Limitations

* **Subtitle Quality:** Reliance on user-uploaded subtitles introduces potential variance in transcription accuracy and timing.
* **Ground Truth Subjectivity:** Natively ratings are based on user perception of difficulty rather than a standardized linguistic index.
* **Parsing Errors:** The automated episode detection in the data collection phase may have resulted in mismatched subtitles for a small fraction of the training data.

## Artifacts

The trained model is serialized as `anime_difficulty_model.pkl`. This file contains a dictionary with the following keys:
* `model`: The trained XGBoost regressor.
* `scaler`: Fitted StandardScaler for numeric features.
* `pca`: Fitted PCA object for text embeddings.
* `feature_cols`: List of numeric column names expected by the pipeline.

**Note:** The SentenceTransformer model is not pickled due to size; it must be re-initialized during inference.
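
Loading the artifacts and re-initializing the encoder could look like the sketch below. The helper names `load_artifacts` and `load_encoder` are hypothetical; the dictionary keys and the encoder name come from this README. Unpickling the real file requires `xgboost` and `scikit-learn` to be installed, ideally at the versions pinned in `requirements.txt`, and pickle files should only be loaded from trusted sources.

```python
import pickle
from typing import Any, Dict

EXPECTED_KEYS = {"model", "scaler", "pca", "feature_cols"}


def load_artifacts(path: str = "anime_difficulty_model.pkl") -> Dict[str, Any]:
    """Load the pickled dictionary and check that the documented keys exist."""
    with open(path, "rb") as fh:
        artifacts = pickle.load(fh)
    missing = EXPECTED_KEYS - set(artifacts)
    if missing:
        raise KeyError(f"artifact file is missing keys: {sorted(missing)}")
    return artifacts


def load_encoder(name: str = "paraphrase-multilingual-MiniLM-L12-v2"):
    """Re-create the sentence encoder, which is not stored in the pickle."""
    # Imported lazily so that loading the artifacts alone does not need torch.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)
```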

## Acknowledgements

* **Natively:** For providing the difficulty rating dataset.
* **Jimaku.cc:** For providing access to the subtitle repository.
anime_difficulty_model.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0c240394266835a6dbbe3e03a9082a03d9f771054bad3e5b4232f095ca492fe
size 864925
requirements.txt
ADDED
@@ -0,0 +1,6 @@
xgboost==3.1.2
sentence-transformers==5.2.0
numpy==2.3.5
pandas==2.3.3
scikit-learn==1.8.0
torch==2.9.1