---
language:
- ja
- en
tags:
- xgboost
- anime
- difficulty-estimation
- education
- japanese
- language-learning
- nlp
- difficulty-prediction
metrics:
- rmse
- mae
- r2
model_type: xgboost
---

# Anime Japanese Difficulty Predictor

This project implements an XGBoost regression model to predict the Japanese language difficulty of anime series. The model assigns each title a score on a 0-50 scale based on linguistic statistics extracted from its subtitles, vocabulary coverage, and semantic content.

## Dataset and Ground Truth

The model was trained on a dataset of approximately **1,100 anime series and movies**.

* **Source:** Difficulty ratings were sourced from **Natively (learnnatively.com)** using the platform's "Data Download" feature.
* **Scale:** 0 to 50 (user-generated ratings).
* **Distribution:** The ratings are roughly normally distributed, with most titles concentrated in the 15-35 range, reflecting the typical difficulty of broadcast anime.

## Data Collection

Subtitle data was aggregated using `jimaku-downloader`, a custom tool that interfaces with the **Jimaku.cc** API.

* **Extraction:** The tool uses regex-based parsing to identify episodes and map them to metadata.
* **Selection Logic:** Priority was given to official Web-DL sources and to plain-text SRT files over OCR-derived or styled ASS subtitles.
* **Potential Noise:** Because Jimaku relies on user and group uploads, and episode mapping is automated via regex, some entries may carry inaccurate subtitle timing or mismatched versions.
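
As a rough illustration of the regex-based mapping step, episode-number extraction from a filename might look like the sketch below. The patterns here are purely illustrative; the actual patterns used by `jimaku-downloader` are not documented here.

```python
import re
from typing import Optional

# Illustrative patterns only; the real tool's patterns are not documented.
EPISODE_PATTERNS = [
    re.compile(r"[Ee][Pp]\.?\s*(\d{1,3})"),  # "Ep.12", "EP 12"
    re.compile(r"\s-\s(\d{1,3})\b"),         # "Title - 12"
    re.compile(r"第(\d{1,3})話"),             # "第12話"
]

def extract_episode(filename: str) -> Optional[int]:
    """Return the first episode number found in a subtitle filename."""
    for pattern in EPISODE_PATTERNS:
        match = pattern.search(filename)
        if match:
            return int(match.group(1))
    return None
```

Pattern-based extraction like this is where the version-matching noise described above comes from: a filename that fits none of the patterns, or fits the wrong one, yields a mismapped episode.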

## Feature Engineering

The model combines hard statistical features with semantic embeddings.

### 1. Statistical Features

* **Density Metrics:** Characters per minute (CPM), kanji density, type-token ratio (TTR).
* **Vocabulary Coverage:** Percentage of words appearing in common frequency lists (top 1k, 2k, 5k, 10k).
* **Comprehension Thresholds:** Number of unique words required to reach 90%, 95%, and 98% text coverage.
* **JLPT Distribution:** Proportion of vocabulary corresponding to JLPT levels N5 through N1.
* **Part-of-Speech:** Distribution of word types (nouns, verbs, particles, etc.).
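
A minimal sketch of how a few of these metrics could be computed, assuming pre-tokenized input from a morphological analyzer (e.g. MeCab); the exact tokenizer and definitions used in training are not specified here.

```python
from collections import Counter

def statistical_features(tokens: list, runtime_minutes: float) -> dict:
    """Illustrative versions of CPM, kanji density, TTR, and a coverage threshold."""
    text = "".join(tokens)
    counts = Counter(tokens)

    # Characters per minute: raw subtitle characters over runtime.
    cpm = len(text) / runtime_minutes

    # Kanji density: share of characters in the CJK Unified Ideographs block.
    kanji = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    kanji_density = kanji / len(text)

    # Type-token ratio: unique tokens over total tokens.
    ttr = len(counts) / len(tokens)

    # Comprehension threshold: unique words needed for 95% token coverage,
    # taking words in descending frequency order.
    target = 0.95 * len(tokens)
    covered, words_needed = 0, 0
    for _, freq in counts.most_common():
        covered += freq
        words_needed += 1
        if covered >= target:
            break

    return {"cpm": cpm, "kanji_density": kanji_density,
            "ttr": ttr, "words_for_95": words_needed}
```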

### 2. Semantic Features

* **Text Inputs:**
  * Series description.
  * "Lexical Signature": a concatenation of the top 200 most frequent content words (excluding stopwords) extracted from the subtitles.
* **Encoding:** Text is encoded with `paraphrase-multilingual-MiniLM-L12-v2`.
* **Dimensionality Reduction:** The high-dimensional embeddings are reduced to 30 components via PCA.
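
The lexical-signature step can be sketched as follows. The stopword set below is illustrative; the actual list used in training is not documented.

```python
from collections import Counter

# Illustrative Japanese stopword set; the real list is not specified here.
STOPWORDS = {"の", "は", "が", "を", "に", "て", "だ", "です", "ます", "た"}

def lexical_signature(tokens: list, top_n: int = 200) -> str:
    """Concatenate the top-N most frequent content words from the subtitles.

    The resulting string is the text that would be encoded by
    `paraphrase-multilingual-MiniLM-L12-v2` before PCA reduction.
    """
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return " ".join(word for word, _ in counts.most_common(top_n))
```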

## Model Architecture

The inference pipeline follows this structure:

1. **Preprocessing:**
   * Numeric features are normalized with `StandardScaler`.
   * Text inputs are vectorized via SentenceTransformer and reduced via PCA.
2. **Estimator:**
   * Algorithm: XGBoost regressor.
   * Optimization: hyperparameters tuned via Optuna (50 trials), minimizing RMSE.
   * Validation: 5-fold cross-validation.

## Performance

Evaluated on a held-out test set (20% split):

| Metric   | Value      |
| :------- | :--------- |
| **RMSE** | **2.3633** |
| **MAE**  | **1.8670** |
| **R²**   | **0.5813** |

## Limitations

* **Subtitle Quality:** Reliance on user-uploaded subtitles introduces variance in transcription accuracy and timing.
* **Ground Truth Subjectivity:** Natively ratings reflect users' perceived difficulty rather than a standardized linguistic index.
* **Parsing Errors:** Automated episode detection during data collection may have mismatched subtitles for a small fraction of the training data.

## Artifacts

The trained model is serialized as `anime_difficulty_model.pkl`. This file contains a dictionary with the following keys:

* `model`: The trained XGBoost regressor.
* `scaler`: Fitted StandardScaler for numeric features.
* `pca`: Fitted PCA object for text embeddings.
* `feature_cols`: List of numeric column names expected by the pipeline.

**Note:** The SentenceTransformer model is not pickled due to its size; it must be re-initialized at inference time.

## Acknowledgements

* **Natively:** For providing the difficulty rating dataset.
* **Jimaku.cc:** For providing access to the subtitle repository.