kjar/anime-difficulty

+---
+language:
+  - ja
+  - en
+tags:
+  - xgboost
+  - anime
+  - difficulty-estimation
+  - education
+  - japanese
+  - language-learning
+  - nlp
+  - difficulty-prediction
+metrics:
+  - rmse
+  - mae
+  - r2
+model_type: xgboost
+---
+# Anime Japanese Difficulty Predictor
+This project implements an XGBoost regression model that predicts how difficult the Japanese in an anime series is. The model outputs a score on a 0-50 scale, computed from linguistic features of the subtitles, vocabulary statistics, and semantic content.
+## Dataset and Ground Truth
+The model was trained on a dataset of approximately **1,100 anime series and movies**.
+*   **Source:** Difficulty ratings were sourced from **Natively (learnnatively.com)** using the platform's "Data Download" feature.
+*   **Scale:** 0 to 50 (User-generated ratings).
+*   **Distribution:** Ratings are approximately normally distributed, with most titles falling in the 15-35 range, which corresponds to the typical difficulty of broadcast anime.
+## Data Collection
+Subtitle data was aggregated using `jimaku-downloader`, a custom tool that interfaces with the **Jimaku.cc** API.
+*   **Extraction:** The tool uses regex-based parsing to identify episodes and map them to series metadata.
+*   **Selection Logic:** Priority was given to official Web-DL sources and to plain SRT files over OCR-derived subtitles or ASS files.
+*   **Potential Noise:** Because Jimaku relies on user and group uploads, and episode mapping is automated via regex, some subtitles in the dataset may be mistimed or matched to the wrong episode or release version.
+## Feature Engineering
+The model combines statistical features computed from the subtitles with semantic embeddings.
+### 1. Statistical Features
+*   **Density Metrics:** Characters per minute (CPM), Kanji density, Type-Token Ratio (TTR).
+*   **Vocabulary Coverage:** Percentage of words appearing in common frequency lists (Top 1k, 2k, 5k, 10k).
+*   **Comprehension Thresholds:** Number of unique words required to reach 90%, 95%, and 98% text coverage.
+*   **JLPT Distribution:** Proportion of vocabulary corresponding to JLPT levels N5 through N1.
+*   **Part-of-Speech:** Distribution of word types (nouns, verbs, particles, etc.).
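As a rough illustration of the density and coverage metrics listed above, the sketch below computes a few of them from already-tokenized subtitle text. The function names are hypothetical, the tokens are assumed to come from a Japanese morphological analyzer that is not shown, and the exact definitions used to train the model (for example, whether punctuation counts toward CPM) are not documented here, so treat these as plausible definitions and not the project's own.

```python
from collections import Counter


def is_kanji(ch):
    # Assumption: only the main CJK Unified Ideographs block counts as kanji.
    return "\u4e00" <= ch <= "\u9fff"


def kanji_density(text):
    """Fraction of non-whitespace characters that are kanji."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_kanji(c) for c in chars) / len(chars)


def chars_per_minute(text, duration_minutes):
    """Non-whitespace characters divided by runtime in minutes."""
    n_chars = sum(1 for c in text if not c.isspace())
    return n_chars / duration_minutes


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def frequency_list_coverage(tokens, common_words):
    """Share of running tokens found in a frequency list (e.g. a top-1k set)."""
    if not tokens:
        return 0.0
    return sum(t in common_words for t in tokens) / len(tokens)


def words_needed_for_coverage(tokens, target):
    """Number of most-frequent unique words needed to cover `target` of the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for n_words, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return n_words
    return len(counts)
```

Calling `words_needed_for_coverage(tokens, 0.95)` on the token list of a whole series would give the value for the 95% comprehension threshold; the 90% and 98% thresholds follow the same pattern.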
+### 2. Semantic Features
+*   **Text Inputs:**
+    *   Series description.
+    *   "Lexical Signature": A concatenation of the top 200 most frequent content words (excluding stopwords) extracted from the subtitles.
+*   **Encoding:** Text is encoded using `paraphrase-multilingual-MiniLM-L12-v2`.
+*   **Dimensionality Reduction:** High-dimensional embeddings are reduced to 30 components using PCA.
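The semantic features described above could be built along the following lines. This is a sketch under assumptions: `lexical_signature` and `embed_and_reduce` are hypothetical names, joining words with a space is a guess at the concatenation format, and how the description and signature embeddings are combined before PCA is not stated in this card. The heavy libraries are imported inside the function so the signature helper can be used on its own.

```python
from collections import Counter


def lexical_signature(tokens, stopwords, top_n=200):
    """Join the `top_n` most frequent non-stopword tokens into one string.

    Assumption: `tokens` already contains only content-word candidates.
    """
    counts = Counter(t for t in tokens if t not in stopwords)
    return " ".join(word for word, _ in counts.most_common(top_n))


def embed_and_reduce(texts, n_components=30):
    """Encode texts with the MiniLM model, then reduce them with a new PCA.

    Fitting PCA requires at least `n_components` texts. At inference time
    the already-fitted PCA from the model artifact should be used instead.
    """
    from sentence_transformers import SentenceTransformer
    from sklearn.decomposition import PCA

    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = encoder.encode(list(texts))
    pca = PCA(n_components=n_components)
    return pca.fit_transform(embeddings), pca
```

Most-frequent-first ordering keeps the signature stable across runs for a given subtitle set, which matters because the sentence encoder is sensitive to word order.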
+## Model Architecture
+The inference pipeline follows this structure:
+1.  **Preprocessing:**
+    *   Numeric features are normalized using `StandardScaler`.
+    *   Text inputs are vectorized via SentenceTransformer and reduced via PCA.
+2.  **Estimator:**
+    *   Algorithm: XGBoost Regressor.
+    *   Optimization: Hyperparameters tuned via Optuna (50 trials) minimizing RMSE.
+    *   Validation: 5-Fold Cross-Validation.
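The tuning step above might look roughly like the sketch below: an Optuna study that minimizes mean RMSE over 5-fold cross-validation. The search ranges and the name `tune_xgboost` are illustrative assumptions, since the card does not list the actual search space. Optuna is also not in `requirements.txt`, so it would need to be installed separately for training.

```python
def tune_xgboost(X_train, y_train, n_trials=50, seed=0):
    """Return the best hyperparameters found by Optuna (lower RMSE is better)."""
    # Imports are local so that this sketch can be read without the
    # training dependencies installed.
    import optuna
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor

    def objective(trial):
        # Hypothetical search space; the real ranges are not documented.
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 800),
            "max_depth": trial.suggest_int("max_depth", 3, 8),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        }
        model = XGBRegressor(random_state=seed, **params)
        # scikit-learn reports negated RMSE, so flip the sign back.
        scores = cross_val_score(
            model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
        )
        return -scores.mean()

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Running the search on the training portion only keeps the held-out test set untouched until the final evaluation.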
+## Performance
+Evaluated on a held-out test set (20% split):
+| Metric   | Value      |
+| :------- | :--------- |
+| **RMSE** | **2.3633** |
+| **MAE**  | **1.8670** |
+| **R²**   | **0.5813** |
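For reference, the three metrics in the table follow their standard definitions, which the dependency-free sketch below spells out. The function name `regression_metrics` is illustrative, and the numbers in the usage note are toy values, not results from this model.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² for paired lists of true and predicted scores."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }
```

For example, `regression_metrics([20, 25, 30], [22, 25, 28])` gives an MAE of about 1.33 and an R² of 0.84. An RMSE near 2.4 on this model's 0-50 scale means a typical prediction lands within a few points of the Natively rating.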
+## Limitations
+*   **Subtitle Quality:** Reliance on user-uploaded subtitles introduces potential variance in transcription accuracy and timing.
+*   **Ground Truth Subjectivity:** Natively ratings are based on user perception of difficulty rather than a standardized linguistic index.
+*   **Parsing Errors:** The automated episode detection in the data collection phase may have resulted in mismatched subtitles for a small fraction of the training data.
+## Artifacts
+The trained model is serialized as `anime_difficulty_model.pkl`. This file contains a dictionary with the following keys:
+*   `model`: The trained XGBoost regressor.
+*   `scaler`: Fitted StandardScaler for numeric features.
+*   `pca`: Fitted PCA object for text embeddings.
+*   `feature_cols`: List of numeric column names expected by the pipeline.
+**Note:** The SentenceTransformer model is not included in the pickle because of its size; it must be re-initialized at inference time.
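A minimal inference sketch using the dictionary keys above might look like this. Several details are assumptions and are not confirmed by this card: that the file was written with the standard `pickle` module (if `joblib` was used, `joblib.load` would be the counterpart), that the model expects scaled numeric features followed by the PCA components in that order, and that `text_embedding` is a single vector in the same form the PCA was fitted on. The names `load_bundle` and `predict_difficulty` are hypothetical.

```python
import pickle

import numpy as np


def load_bundle(path="anime_difficulty_model.pkl"):
    """Load the artifact dictionary. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_difficulty(bundle, numeric_row, text_embedding):
    """Predict a difficulty score for one series.

    numeric_row: mapping from each name in bundle["feature_cols"] to a value.
    text_embedding: 1-D SentenceTransformer embedding (assumed input format).
    """
    # Order numeric features exactly as the pipeline expects, then scale.
    x_num = np.array([[numeric_row[c] for c in bundle["feature_cols"]]], dtype=float)
    x_num = bundle["scaler"].transform(x_num)

    # Project the raw embedding with the fitted PCA.
    x_txt = np.asarray(text_embedding, dtype=float).reshape(1, -1)
    x_txt = bundle["pca"].transform(x_txt)

    # Assumed column order: numeric features first, then PCA components.
    features = np.hstack([x_num, x_txt])
    score = float(bundle["model"].predict(features)[0])

    # Optional: clamp to the 0-50 rating scale (an addition in this sketch).
    return min(50.0, max(0.0, score))
```

The embedding itself would come from a freshly initialized `SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")`, since that model is not stored in the pickle.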
+## Acknowledgements
+*   **Natively:** For providing the difficulty rating dataset.
+*   **Jimaku.cc:** For providing access to the subtitle repository.

anime_difficulty_model.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e0c240394266835a6dbbe3e03a9082a03d9f771054bad3e5b4232f095ca492fe
+size 864925

requirements.txt ADDED

	@@ -0,0 +1,6 @@

+xgboost==3.1.2
+sentence-transformers==5.2.0
+numpy==2.3.5
+pandas==2.3.3
+scikit-learn==1.8.0
+torch==2.9.1