---
language:
  - ja
  - en
tags:
  - xgboost
  - anime
  - difficulty-estimation
  - education
  - japanese
  - language-learning
  - nlp
  - difficulty-prediction
metrics:
  - rmse
  - mae
  - r2
model_type: xgboost
---

# Anime Japanese Difficulty Predictor

This project implements an XGBoost regression model to predict the Japanese language difficulty of anime series. The model assigns a score on a 0-50 scale based on linguistic statistics extracted from subtitles, vocabulary coverage, and semantic content.

## Dataset and Ground Truth

The model was trained on a dataset of approximately **1,100 anime series and movies**.

*   **Source:** Difficulty ratings were sourced from **Natively (learnnatively.com)** using the platform's "Data Download" feature.
*   **Scale:** 0 to 50 (User-generated ratings).
*   **Distribution:** Ratings are approximately normally distributed, heavily concentrated in the 15-35 range that represents the standard difficulty of most broadcast anime.

## Data Collection

Subtitle data was aggregated using `jimaku-downloader`, a custom tool that interfaces with the **Jimaku.cc** API.

*   **Extraction:** The tool utilizes regex-based parsing to identify and map episodes to metadata.
*   **Selection Logic:** Priority was given to official Web-DL sources and text-based formats (SRT) over OCR or ASS files.
*   **Potential Noise:** Because Jimaku relies on user/group uploads and episode mapping is automated via regex, some subtitles may have inaccurate timing or be matched to the wrong release version.
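The regex-based mapping described above can be sketched as follows. The actual pattern used by `jimaku-downloader` is not documented here; `EPISODE_RE` below is a hypothetical illustration of how episode numbers might be pulled out of heterogeneous subtitle filenames.

```python
import re

# Hypothetical pattern (not the tool's actual regex): matches "E05"/"Ep 5",
# the Japanese counter form "第5話", or a fansub-style " - 05" separator.
EPISODE_RE = re.compile(r"(?:[Ee][Pp]?\.?\s*|第|\s-\s)(\d{1,3})(?:話)?")

def extract_episode(filename):
    """Return the first plausible episode number found in a filename, or None."""
    match = EPISODE_RE.search(filename)
    return int(match.group(1)) if match else None

print(extract_episode("Show.Name.S01E05.WEB-DL.srt"))  # 5
print(extract_episode("第3話.srt"))                      # 3
print(extract_episode("movie.srt"))                     # None
```

As the "Potential Noise" bullet notes, any such pattern will occasionally mis-map files whose names deviate from these conventions.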

## Feature Engineering

The model utilizes a combination of handcrafted statistical features and semantic embeddings.

### 1. Statistical Features
*   **Density Metrics:** Characters per minute (CPM), Kanji density, Type-Token Ratio (TTR).
*   **Vocabulary Coverage:** Percentage of words appearing in common frequency lists (Top 1k, 2k, 5k, 10k).
*   **Comprehension Thresholds:** Number of unique words required to reach 90%, 95%, and 98% text coverage.
*   **JLPT Distribution:** Proportion of vocabulary corresponding to JLPT levels N5 through N1.
*   **Part-of-Speech:** Distribution of word types (nouns, verbs, particles, etc.).
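A minimal sketch of how a few of these metrics could be computed is shown below. Note this uses naive tokenization for illustration; the actual pipeline presumably relies on a proper Japanese morphological analyzer (e.g., MeCab), and the function names here are illustrative, not the project's.

```python
from collections import Counter

def kanji_density(text):
    """Fraction of non-space characters in the CJK Unified Ideographs range."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    kanji = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return kanji / len(chars)

def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (TTR)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def chars_per_minute(text, runtime_minutes):
    """Characters of dialogue per minute of runtime (CPM)."""
    return len(text.replace(" ", "")) / runtime_minutes

def coverage_threshold(tokens, target):
    """Number of most-frequent unique words needed to cover `target`
    proportion of all tokens (e.g. 0.95 for 95% text coverage)."""
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    for i, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return i
    return len(counts)
```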

### 2. Semantic Features
*   **Text Inputs:**
    *   Series description.
    *   "Lexical Signature": A concatenation of the top 200 most frequent content words (excluding stopwords) extracted from the subtitles.
*   **Encoding:** Text is encoded using `paraphrase-multilingual-MiniLM-L12-v2`.
*   **Dimensionality Reduction:** High-dimensional embeddings are reduced to 30 components using PCA.
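The reduction step can be sketched as below. To keep the example runnable without downloading the model, random vectors stand in for the 384-dimensional embeddings that `paraphrase-multilingual-MiniLM-L12-v2` produces.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# output: that model emits 384-dimensional sentence embeddings. Random
# vectors are used here so the sketch runs without the model weights.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))  # 200 series, one embedding each

# Reduce to 30 components, as described above.
pca = PCA(n_components=30)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)  # (200, 30)
```

In the real pipeline the fitted PCA object is persisted alongside the model (see Artifacts below), so inference applies the same projection learned at training time.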

## Model Architecture

The inference pipeline follows this structure:

1.  **Preprocessing:**
    *   Numeric features are normalized using `StandardScaler`.
    *   Text inputs are vectorized via SentenceTransformer and reduced via PCA.
2.  **Estimator:**
    *   Algorithm: XGBoost Regressor.
    *   Optimization: Hyperparameters tuned via Optuna (50 trials) minimizing RMSE.
    *   Validation: 5-Fold Cross-Validation.
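The preprocessing-plus-estimator structure can be sketched with scikit-learn. `GradientBoostingRegressor` is used below purely as a lightweight stand-in for `xgboost.XGBRegressor` (so the example runs without XGBoost installed), and the data is synthetic; the Optuna search is omitted, but the scaler + regressor + 5-fold CV layout mirrors the description above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for the numeric + PCA-reduced text features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))
y = X[:, 0] * 3 + rng.normal(scale=0.5, size=300) + 25  # scores near the 0-50 scale

# StandardScaler feeds the regressor inside one pipeline, so scaling is
# refit on each training fold during 5-fold cross-validation.
pipe = make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=0))
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(f"CV RMSE: {-scores.mean():.2f}")
```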

## Performance

Evaluated on a held-out test set (20% split):

| Metric   | Value      |
| :------- | :--------- |
| **RMSE** | **2.3633** |
| **MAE**  | **1.8670** |
| **R²**   | **0.5813** |

## Limitations

*   **Subtitle Quality:** Reliance on user-uploaded subtitles introduces potential variance in transcription accuracy and timing.
*   **Ground Truth Subjectivity:** Natively ratings are based on user perception of difficulty rather than a standardized linguistic index.
*   **Parsing Errors:** The automated episode detection in the data collection phase may have resulted in mismatched subtitles for a small fraction of the training data.

## Artifacts

The trained model is serialized as `anime_difficulty_model.pkl`. This file contains a dictionary with the following keys:
*   `model`: The trained XGBoost regressor.
*   `scaler`: Fitted StandardScaler for numeric features.
*   `pca`: Fitted PCA object for text embeddings.
*   `feature_cols`: List of numeric column names expected by the pipeline.

**Note:** The SentenceTransformer model is not pickled due to size; it must be re-initialized during inference.
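Loading and using the artifact might look like the sketch below. A toy bundle with the same four keys is built and pickled first so the example is self-contained; `LinearRegression` stands in for the XGBoost model, and the feature column names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression  # toy stand-in for the XGBoost model
from sklearn.preprocessing import StandardScaler

# Build a toy artifact with the same keys anime_difficulty_model.pkl
# is documented to contain.
rng = np.random.default_rng(0)
X_num = rng.normal(size=(50, 4))    # numeric features
X_txt = rng.normal(size=(50, 40))   # stand-in text embeddings
scaler = StandardScaler().fit(X_num)
pca = PCA(n_components=5).fit(X_txt)
X_full = np.hstack([scaler.transform(X_num), pca.transform(X_txt)])
model = LinearRegression().fit(X_full, rng.uniform(0, 50, size=50))

artifact = {
    "model": model,
    "scaler": scaler,
    "pca": pca,
    "feature_cols": ["cpm", "kanji_density", "ttr", "jlpt_n5_ratio"],  # hypothetical names
}
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump(artifact, f)
    path = f.name

# Inference: scale numerics, PCA-reduce embeddings, concatenate, predict.
with open(path, "rb") as f:
    bundle = pickle.load(f)
features = np.hstack([
    bundle["scaler"].transform(X_num[:1]),
    bundle["pca"].transform(X_txt[:1]),
])
print(bundle["model"].predict(features))
os.remove(path)
```

In the real pipeline, the embedding step (re-initializing the SentenceTransformer, per the note above) runs before `bundle["pca"].transform`.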

## Acknowledgements

*   **Natively:** For providing the difficulty rating dataset.
*   **Jimaku.cc:** For providing access to the subtitle repository.