kjar committed on
Commit 0e3cac3 · verified · 1 Parent(s): 5e32de5

Upload 3 files

Files changed (3)
  1. README.md +100 -0
  2. anime_difficulty_model.pkl +3 -0
  3. requirements.txt +6 -0
README.md ADDED
@@ -0,0 +1,100 @@
+ ---
+ language:
+ - ja
+ - en
+ tags:
+ - xgboost
+ - anime
+ - difficulty-estimation
+ - education
+ - japanese
+ - language-learning
+ - nlp
+ - difficulty-prediction
+ metrics:
+ - rmse
+ - mae
+ - r2
+ model_type: xgboost
+ ---
+
+ # Anime Japanese Difficulty Predictor
+
+ This project implements an XGBoost regression model that predicts the Japanese language difficulty of anime series. The model assigns a score on a 0-50 scale based on linguistic features of the subtitles, vocabulary statistics, and semantic content.
+
+ ## Dataset and Ground Truth
+
+ The model was trained on a dataset of approximately **1,100 anime series and movies**.
+
+ * **Source:** Difficulty ratings were sourced from **Natively (learnnatively.com)** using the platform's "Data Download" feature.
+ * **Scale:** 0 to 50 (user-generated ratings).
+ * **Distribution:** Ratings are approximately normally distributed, with most titles falling in the 15-35 range, which corresponds to the typical difficulty of broadcast anime.
+
+ ## Data Collection
+
+ Subtitle data was aggregated using `jimaku-downloader`, a custom tool that interfaces with the **Jimaku.cc** API.
+
+ * **Extraction:** The tool uses regex-based parsing to identify episodes and map them to metadata.
+ * **Selection Logic:** Priority was given to official Web-DL sources and text-based formats (SRT) over OCR or ASS files.
+ * **Potential Noise:** Because Jimaku relies on user and group uploads and episode mapping is automated via regex, some subtitles in the dataset may have inaccurate timing or may be matched to the wrong episode or version.
+
+ ## Feature Engineering
+
+ The model utilizes a combination of hard statistical features and semantic embeddings.
+
+ ### 1. Statistical Features
+ * **Density Metrics:** Characters per minute (CPM), Kanji density, Type-Token Ratio (TTR).
+ * **Vocabulary Coverage:** Percentage of words appearing in common frequency lists (Top 1k, 2k, 5k, 10k).
+ * **Comprehension Thresholds:** Number of unique words required to reach 90%, 95%, and 98% text coverage.
+ * **JLPT Distribution:** Proportion of vocabulary corresponding to JLPT levels N5 through N1.
+ * **Part-of-Speech:** Distribution of word types (nouns, verbs, particles, etc.).
+
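Several of the statistical features above reduce to simple counting over a token list. The following is a minimal sketch, not the project's actual feature code: it assumes the subtitles have already been segmented into tokens (the README does not name a tokenizer), it treats "kanji" as characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF), and all function names are hypothetical.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def kanji_density(text):
    """Share of non-whitespace characters that are kanji.

    Assumption: kanji means the CJK Unified Ideographs block only.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if "\u4e00" <= c <= "\u9fff") / len(chars)


def list_coverage(tokens, frequency_list):
    """Share of running tokens found in a frequency list (e.g. a top-1k list)."""
    if not tokens:
        return 0.0
    known = set(frequency_list)
    return sum(1 for t in tokens if t in known) / len(tokens)


def words_needed_for_coverage(tokens, target):
    """Number of unique words, taken most-frequent-first, needed to cover
    at least `target` (e.g. 0.95) of all running tokens."""
    total = len(tokens)
    covered = 0
    ranked = Counter(tokens).most_common()
    for needed, (_, count) in enumerate(ranked, start=1):
        covered += count
        if covered >= target * total:
            return needed
    return 0  # only reached for empty input
```

Calling `words_needed_for_coverage(tokens, 0.90)`, `0.95`, and `0.98` on the same token list would yield the three comprehension-threshold features in one pass each.
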
+ ### 2. Semantic Features
+ * **Text Inputs:**
+     * Series description.
+     * "Lexical Signature": A concatenation of the top 200 most frequent content words (excluding stopwords) extracted from the subtitles.
+ * **Encoding:** Text is encoded using `paraphrase-multilingual-MiniLM-L12-v2`.
+ * **Dimensionality Reduction:** High-dimensional embeddings are reduced to 30 components using PCA.
+
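The "Lexical Signature" described above can be approximated with a frequency count. This is a sketch under assumptions: the input is an already tokenized subtitle stream, content-word filtering is reduced to a caller-supplied stopword set, words are joined with a single space, and the function name is hypothetical. The README does not specify any of these details.

```python
from collections import Counter


def lexical_signature(tokens, stopwords, top_n=200):
    """Join the `top_n` most frequent non-stopword tokens into one string.

    Ties keep first-seen order, because Counter.most_common is stable.
    """
    counts = Counter(t for t in tokens if t not in stopwords)
    return " ".join(word for word, _ in counts.most_common(top_n))


# Tiny illustrative input; real subtitles would yield far more tokens.
example = lexical_signature(
    ["魔法", "は", "魔法", "学校", "の", "学校", "魔法", "剣"],
    stopwords={"は", "の"},
)
```

The resulting string, together with the series description, is the kind of text that would then be passed to the sentence encoder before PCA.
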
+ ## Model Architecture
+
+ The inference pipeline follows this structure:
+
+ 1. **Preprocessing:**
+     * Numeric features are normalized using `StandardScaler`.
+     * Text inputs are vectorized via SentenceTransformer and reduced via PCA.
+ 2. **Estimator:**
+     * Algorithm: XGBoost Regressor.
+     * Optimization: Hyperparameters tuned via Optuna (50 trials) minimizing RMSE.
+     * Validation: 5-Fold Cross-Validation.
+
+ ## Performance
+
+ Evaluated on a held-out test set (20% split):
+
+ | Metric | Value |
+ | :------- | :--------- |
+ | **RMSE** | **2.3633** |
+ | **MAE** | **1.8670** |
+ | **R²** | **0.5813** |
+
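For readers who want to reproduce the three metrics on their own predictions, the standard definitions can be written in a few lines. This is an illustrative sketch in plain Python (the function name is hypothetical, and the example numbers are made up, not taken from the model's test set); it assumes the true ratings are not all identical, since R² is undefined in that case.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² for paired true and predicted ratings."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": mae,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up ratings on the 0-50 scale, for illustration only.
metrics = regression_metrics([20, 25, 30], [22, 25, 27])
```

Read this way, an R² of 0.5813 means the model accounts for roughly 58% of the variance in the held-out ratings, and the RMSE and MAE are expressed in points on the 0-50 scale.
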
+ ## Limitations
+
+ * **Subtitle Quality:** Reliance on user-uploaded subtitles introduces potential variance in transcription accuracy and timing.
+ * **Ground Truth Subjectivity:** Natively ratings are based on user perception of difficulty rather than a standardized linguistic index.
+ * **Parsing Errors:** The automated episode detection in the data collection phase may have resulted in mismatched subtitles for a small fraction of the training data.
+
+ ## Artifacts
+
+ The trained model is serialized as `anime_difficulty_model.pkl`. This file contains a dictionary with the following keys:
+ * `model`: The trained XGBoost regressor.
+ * `scaler`: Fitted StandardScaler for numeric features.
+ * `pca`: Fitted PCA object for text embeddings.
+ * `feature_cols`: List of numeric column names expected by the pipeline.
+
+ **Note:** The SentenceTransformer model is not pickled due to size; it must be re-initialized during inference.
+
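A minimal sketch of the first inference steps, based only on the dictionary keys listed above. The helper names and the example column names (`cpm`, `ttr`) are hypothetical. Unpickling the real file requires `xgboost` and `scikit-learn` to be installed, and `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.

```python
import pickle

EXPECTED_KEYS = {"model", "scaler", "pca", "feature_cols"}


def load_bundle(path):
    """Unpickle the artifact and check that it has the documented keys."""
    with open(path, "rb") as fh:
        bundle = pickle.load(fh)  # only unpickle files you trust
    missing = EXPECTED_KEYS - set(bundle)
    if missing:
        raise KeyError(f"bundle is missing keys: {sorted(missing)}")
    return bundle


def ordered_numeric_row(features, feature_cols):
    """Arrange a {name: value} mapping in the order given by `feature_cols`.

    Returns a single-row 2D list, the shape scikit-learn transformers expect.
    """
    absent = [c for c in feature_cols if c not in features]
    if absent:
        raise KeyError(f"missing numeric features: {absent}")
    return [[float(features[c]) for c in feature_cols]]
```

With the real artifact, the row built from `bundle["feature_cols"]` would go through `bundle["scaler"].transform(...)`, and text embeddings from a re-initialized `SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")` would go through `bundle["pca"].transform(...)`. The README does not document how the two text inputs are combined or in what order the reduced embeddings are joined with the numeric features before `bundle["model"].predict(...)`, so that step is left out here rather than guessed.
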
+ ## Acknowledgements
+
+ * **Natively:** For providing the difficulty rating dataset.
+ * **Jimaku.cc:** For providing access to the subtitle repository.
anime_difficulty_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e0c240394266835a6dbbe3e03a9082a03d9f771054bad3e5b4232f095ca492fe
+ size 864925
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ xgboost==3.1.2
+ sentence-transformers==5.2.0
+ numpy==2.3.5
+ pandas==2.3.3
+ scikit-learn==1.8.0
+ torch==2.9.1