Update README.md
README.md
CHANGED
|
@@ -1,33 +1,125 @@
|
| 1 |
# hikka-forge-anime2vec
|
| 2 |
|
| 3 |
This repository contains `hikka-forge-anime2vec`, a sophisticated semantic vector space model for anime, created by [Lorg0n](https://huggingface.co/Lorg0n).
|
| 4 |
|
| 5 |
-
|
| 6 |
|
| 7 |
## Installation
|
| 8 |
|
| 9 |
```bash
|
| 10 |
pip install git+https://huggingface.co/Lorg0n/hikka-forge-anime2vec
|
| 11 |
```
|
| 12 |
|
| 13 |
## Usage
|
| 14 |
|
| 15 |
```python
|
| 16 |
from hikka_forge import Anime2Vec
|
| 17 |
|
| 18 |
-
# Initialize the model. All artifacts will be downloaded
|
| 19 |
anime2vec = Anime2Vec()
|
| 20 |
|
| 21 |
-
# Prepare data for a target anime
|
| 22 |
frieren_data = {
|
| 23 |
"en_title": "Frieren: Beyond Journey's End",
|
| 24 |
"genres": ["Adventure", "Drama", "Fantasy"],
|
| 25 |
"studio": "Madhouse",
|
| 26 |
"type": "TV",
|
| 27 |
-
|
| 28 |
}
|
| 29 |
|
| 30 |
-
# Generate the 512-dimensional vector representation
|
| 31 |
frieren_vector = anime2vec.encode(frieren_data)
|
| 32 |
-
|
| 33 |
```
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
- uk
|
| 6 |
+
- ja
|
| 7 |
+
tags:
|
| 8 |
+
- pytorch
|
| 9 |
+
- sentence-transformers
|
| 10 |
+
- feature-extraction
|
| 11 |
+
- semantic-search
|
| 12 |
+
- vector-arithmetic
|
| 13 |
+
- anime
|
| 14 |
+
- hikka
|
| 15 |
+
- hikka-forge
|
| 16 |
+
datasets:
|
| 17 |
+
- private
|
| 18 |
+
- synthetic
|
| 19 |
+
model-index:
|
| 20 |
+
- name: hikka-forge-anime2vec
|
| 21 |
+
results: []
|
| 22 |
+
---
|
| 23 |
+
|
| 24 |
# hikka-forge-anime2vec
|
| 25 |
|
| 26 |
This repository contains `hikka-forge-anime2vec`, a sophisticated semantic vector space model for anime, created by [Lorg0n](https://huggingface.co/Lorg0n).
|
| 27 |
+
This repository is also a directly installable Python package, so you can generate anime embeddings from your own projects.
|
| 28 |
+
|
| 29 |
+
## Model Details
|
| 30 |
+
|
| 31 |
+
- **Model Version**: v12
|
| 32 |
+
- **Architecture**: A multi-input neural network with separate processing streams for text, genres, studios, and other categorical/numerical features. It uses attention mechanisms to weigh the importance of different text fields and genres, creating a rich, context-aware representation.
|
| 33 |
+
- **Embedding Size**: The model outputs a **512-dimensional** vector for each anime.
|
| 34 |
|
| 35 |
+
## Data & Training
|
| 36 |
+
|
| 37 |
+
The model's understanding of conceptual relationships comes from a unique blend of real-world and synthetic data.
|
| 38 |
+
- **Base Data**: The foundational metadata (titles, genres, studios, etc.) was sourced from the [hikka.io](https://hikka.io/) database. This provided a solid, real-world grounding for the model.
|
| 39 |
+
- **Synthetic Data Augmentation**: To teach the model complex conceptual and arithmetic relationships (e.g., `"Anime A" - "Anime X" + "Anime Y" = "Anime B"`), the training set was heavily augmented with data generated by a Large Language Model (LLM). This synthetic data was crucial for enabling the model's advanced vector arithmetic capabilities.
|
| 40 |
+
- **Training Procedure**: The model was trained using a combination of three loss functions:
|
| 41 |
+
1. **Triplet Loss**: Based on user recommendations to learn similarity.
|
| 42 |
+
2. **Cosine Similarity Loss**: For the LLM-generated vector arithmetic examples.
|
| 43 |
+
3. **Diversity Loss**: To keep the embeddings well distributed across the vector space and prevent them from collapsing into a single region.
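
The three loss terms listed above could be combined roughly as in the sketch below. This is a plain-Python illustration, not the repository's training code: the function names, the margin, the loss weights, and the exact form of the diversity term (mean pairwise cosine similarity) are all assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Zero once the anchor is closer (in cosine distance) to the positive
    # than to the negative by at least `margin` (assumed value).
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def arithmetic_loss(a, x, y, b):
    # Cosine loss for "A - X + Y = B": zero when the composed vector
    # points in the same direction as B.
    composed = [ai - xi + yi for ai, xi, yi in zip(a, x, y)]
    return 1.0 - cosine(composed, b)

def diversity_loss(batch):
    # Assumed form: mean pairwise cosine similarity within a batch.
    # Lower values mean the embeddings are more spread out.
    pairs = [(i, j) for i in range(len(batch)) for j in range(i + 1, len(batch))]
    return sum(cosine(batch[i], batch[j]) for i, j in pairs) / len(pairs)

def total_loss(triplet, arithmetic, batch, w_triplet=1.0, w_arith=1.0, w_div=0.1):
    # Weighted sum of the three terms; the weights are hypothetical.
    anchor, positive, negative = triplet
    return (w_triplet * triplet_loss(anchor, positive, negative)
            + w_arith * arithmetic_loss(*arithmetic)
            + w_div * diversity_loss(batch))
```

In a real training loop these terms would be computed on batched tensors in a framework such as PyTorch so that gradients can flow back into the network.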
|
| 44 |
|
| 45 |
## Installation
|
| 46 |
|
| 47 |
+
Install the library directly from this Hugging Face repository:
|
| 48 |
+
|
| 49 |
```bash
|
| 50 |
pip install git+https://huggingface.co/Lorg0n/hikka-forge-anime2vec
|
| 51 |
```
|
| 52 |
|
| 53 |
## Usage
|
| 54 |
|
| 55 |
+
The library provides a high-level `Anime2Vec` class that handles downloading the model artifacts, preprocessing the input data, and generating embeddings.
|
| 56 |
+
|
| 57 |
```python
|
| 58 |
from hikka_forge import Anime2Vec
|
| 59 |
|
| 60 |
+
# Initialize the model. All required artifacts will be downloaded
|
| 61 |
+
# and cached automatically on the first run.
|
| 62 |
anime2vec = Anime2Vec()
|
| 63 |
|
| 64 |
+
# 1. Prepare data for a target anime.
|
| 65 |
+
# The `encode` method expects a dictionary with specific keys.
|
| 66 |
frieren_data = {
|
| 67 |
+
"ua_title": "Фрірен, що проводжає в останню путь",
|
| 68 |
"en_title": "Frieren: Beyond Journey's End",
|
| 69 |
+
"original_title": "Sousou no Frieren",
|
| 70 |
+
"ua_description": "Ельфійка-чарівниця Фрірен перемогла Короля Демонів...",
|
| 71 |
+
"en_description": "The elf mage Frieren and her courageous fellow adventurers...",
|
| 72 |
+
"alternate_names": ["Sousou no Frieren"],
|
| 73 |
"genres": ["Adventure", "Drama", "Fantasy"],
|
| 74 |
"studio": "Madhouse",
|
| 75 |
"type": "TV",
|
| 76 |
+
"numerical_features": [8.9, 500000, 2023, 28, 24, 100] # Example data
|
| 77 |
}
|
| 78 |
|
| 79 |
+
# 2. Generate the 512-dimensional vector representation
|
| 80 |
frieren_vector = anime2vec.encode(frieren_data)
|
| 81 |
+
|
| 82 |
+
print(f"Resulting vector for '{frieren_data['en_title']}' has shape: {frieren_vector.shape}")
|
| 83 |
+
# Now you can use this vector for similarity search, clustering, or vector arithmetic.
|
| 84 |
+
```
|
| 85 |
+
|
| 86 |
+
### Vector Arithmetic Example
|
| 87 |
+
|
| 88 |
+
The model is designed to support conceptual arithmetic on its embeddings, a capability it learned from the LLM-generated training data.
|
| 89 |
+
|
| 90 |
+
```python
|
| 91 |
+
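# This is a conceptual example. `attack_on_titan_data` and `code_geass_data` are assumed to be
# dictionaries in the same format as `frieren_data` above. You would need to pre-compute vectors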
|
| 92 |
+
# for all anime in your database to perform the final similarity search.
|
| 93 |
+
|
| 94 |
+
# Get vectors for two anime
|
| 95 |
+
aot_vector = anime2vec.encode(attack_on_titan_data)
|
| 96 |
+
code_geass_vector = anime2vec.encode(code_geass_data)
|
| 97 |
+
|
| 98 |
+
# Find the semantic average between them
|
| 99 |
+
# This should represent concepts like "military drama with sci-fi/mecha elements"
|
| 100 |
+
average_vector = (aot_vector + code_geass_vector) / 2.0
|
| 101 |
+
|
| 102 |
+
# You can now use `average_vector` to find anime that fit this
|
| 103 |
+
# hybrid description in your own vector database.
|
| 104 |
+
# Expected results: Aldnoah.Zero, Mobile Suit Gundam: Iron-Blooded Orphans, etc.
|
| 105 |
+
```
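
The final similarity search mentioned in the comments above is left to the reader. One way to do it is a brute-force cosine-similarity ranking over pre-computed vectors, sketched below in plain Python. The `top_k` helper, the placeholder titles, and the toy 3-dimensional vectors are hypothetical stand-ins for real 512-dimensional embeddings; none of this is part of the `hikka_forge` API.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query, catalog, k=2):
    """Return the k (title, score) pairs most similar to `query`."""
    scored = [(title, cosine_similarity(query, vec)) for title, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy catalog: in practice, each value would come from `anime2vec.encode(...)`.
catalog = {
    "title_a": [1.0, 0.0, 0.0],
    "title_b": [0.7, 0.7, 0.0],
    "title_c": [0.0, 0.0, 1.0],
}

# Average two vectors, as in the example above, then rank the catalog.
vec_1 = [1.0, 0.0, 0.0]
vec_2 = [0.0, 1.0, 0.0]
average_vector = [(p + q) / 2.0 for p, q in zip(vec_1, vec_2)]
results = top_k(average_vector, catalog, k=2)
```

For a large catalog, a vector database or an approximate nearest-neighbor index would likely be a better fit than this linear scan.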
|
| 106 |
+
|
| 107 |
+
## Limitations and Bias
|
| 108 |
+
|
| 109 |
+
- **Data Bias**: The model's understanding is shaped by its training data. The base data from `hikka.io` and the synthetic data from the LLM both have inherent biases which will be reflected in the model's embeddings.
|
| 110 |
+
- **Textual Keyword Collision**: The model relies heavily on text descriptions. Sometimes, an anime from a completely different genre might use specific keywords (e.g., "political intrigue," "strategy") in its synopsis, causing it to appear anomalously in search results for serious thrillers.
|
| 111 |
+
- **Encoder Vocabulary**: While the categorical encoders (`genre`, `studio`) were trained on a comprehensive list, they are not exhaustive. A brand-new studio or a very niche genre tag not present in the original database will be treated as 'UNKNOWN'.
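
The `'UNKNOWN'` fallback described in the last item can be pictured with a small vocabulary lookup. This is only an illustration of the general technique; the helper names, the reserved index, and the example labels are assumptions, and the library's actual encoders may work differently.

```python
def build_vocab(labels):
    # Index 0 is reserved for any label not seen when the vocabulary was built.
    vocab = {"UNKNOWN": 0}
    for label in labels:
        vocab.setdefault(label, len(vocab))
    return vocab

def encode_label(vocab, label):
    # Labels outside the vocabulary all map to the shared UNKNOWN index.
    return vocab.get(label, vocab["UNKNOWN"])

# Hypothetical studio vocabulary with two known entries.
studio_vocab = build_vocab(["Madhouse", "Example Studio"])

known_id = encode_label(studio_vocab, "Madhouse")
unseen_id = encode_label(studio_vocab, "Brand New Studio")
```

Because every unseen label shares one index, the model cannot tell two new studios apart, which is why the embeddings for such titles lean more on their text and genre inputs.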
|
| 112 |
+
|
| 113 |
+
## Citation
|
| 114 |
+
|
| 115 |
+
If you use this model in your work, please consider citing it:
|
| 116 |
+
```bibtex
|
| 117 |
+
@misc{lorg0n2025anime2vec,
|
| 118 |
+
author = {Lorg0n},
|
| 119 |
+
title = {{hikka-forge-anime2vec: A Semantic Vector Space Model for Anime}},
|
| 120 |
+
year = {2025},
|
| 121 |
+
publisher = {Hugging Face},
|
| 122 |
+
howpublished = {\url{https://huggingface.co/Lorg0n/hikka-forge-anime2vec}},
|
| 123 |
+
note = {Hugging Face repository}
|
| 124 |
+
}
|
| 125 |
```
|