---
license: apache-2.0
language:
- en
- uk
- ja
tags:
- pytorch
- sentence-transformers
- feature-extraction
- semantic-search
- vector-arithmetic
- anime
- hikka
- hikka-forge
- 2vec
datasets:
- private
- synthetic
model-index:
- name: hikka-forge-anime2vec
  results: []
---

# hikka-forge-anime2vec

This repository contains `hikka-forge-anime2vec`, a semantic vector space model for anime, created by [Lorg0n](https://huggingface.co/Lorg0n).
The repository is also a directly installable Python package, so you can add anime vectorization to your own projects.

## Model Details

- **Model Version**: v13
- **Architecture**: A multi-input neural network with separate processing streams for text, genres, studios, and other categorical/numerical features. It uses attention mechanisms to weigh the importance of different text fields and genres, creating a rich, context-aware representation.
- **Embedding Size**: The model outputs a **512-dimensional** vector for each anime.

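As a rough illustration of the attention idea described above, the sketch below pools several field embeddings into one vector using softmax weights. It is not the model's actual architecture: the helper name `attention_pool`, the fixed query vector, and the toy 3-dimensional embeddings are all assumptions made for the example, and plain Python lists stand in for the PyTorch tensors a real model would use.

```python
import math


def attention_pool(field_vectors, query):
    """Softmax-weighted average of field vectors (hypothetical helper).

    Each field gets a score equal to its dot product with `query`;
    the scores are normalized with a softmax and used as weights.
    """
    scores = [sum(f * q for f, q in zip(vec, query)) for vec in field_vectors]
    # Subtract the max score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(field_vectors[0])
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, field_vectors))
        for i in range(dim)
    ]
    return pooled, weights


# Toy embeddings for three text fields (made-up values).
title_vec = [1.0, 0.0, 0.0]
description_vec = [0.0, 1.0, 0.0]
alt_names_vec = [0.0, 0.0, 1.0]
query_vec = [0.0, 2.0, 0.0]  # In a trained model this would be learned.

pooled, weights = attention_pool(
    [title_vec, description_vec, alt_names_vec], query_vec
)
print(weights)  # The description field receives the largest weight.
```

The weights always sum to 1, so the pooled vector stays on the same scale as its inputs regardless of how many fields are present.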
## Data & Training

The model's understanding of conceptual relationships comes from a blend of real-world and synthetic data.

- **Base Data**: The foundational metadata (titles, genres, studios, etc.) was sourced from the [hikka.io](https://hikka.io/) database, which grounds the model in real-world data.
- **Synthetic Data Augmentation**: To teach the model conceptual and arithmetic relationships (e.g., `"Anime A" - "Anime X" + "Anime Y" = "Anime B"`), the training set was heavily augmented with data generated by a Large Language Model (LLM). This synthetic data is what enables the model's vector arithmetic capabilities.
- **Training Procedure**: The model was trained with a combination of three loss functions:
    1. **Triplet Loss**: Based on user recommendations, to learn similarity.
    2. **Cosine Similarity Loss**: For the LLM-generated vector arithmetic examples.
    3. **Diversity Loss**: To keep the embedding space well distributed and prevent the embeddings from collapsing.

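The card does not give the exact formulas or loss weights, so the following is only a sketch of how three such terms could be combined. The margin, the weights `w_triplet`, `w_arith`, and `w_div`, the use of cosine distance in the triplet term, and the mean-pairwise-similarity form of the diversity term are all assumptions. Plain Python lists stand in for PyTorch tensors so the example runs without dependencies.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the anchor closer to the positive than to the negative."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def arithmetic_loss(a, x, y, b):
    """Cosine loss for an analogy example: a - x + y should point toward b."""
    predicted = [ai - xi + yi for ai, xi, yi in zip(a, x, y)]
    return 1.0 - cosine(predicted, b)


def diversity_loss(batch):
    """Mean pairwise cosine similarity; lower means a more spread-out batch."""
    pairs = [
        cosine(batch[i], batch[j])
        for i in range(len(batch))
        for j in range(i + 1, len(batch))
    ]
    return sum(pairs) / len(pairs)


def total_loss(triplet, analogy, batch, w_triplet=1.0, w_arith=1.0, w_div=0.1):
    """Weighted sum of the three terms (the weights are placeholders)."""
    return (
        w_triplet * triplet_loss(*triplet)
        + w_arith * arithmetic_loss(*analogy)
        + w_div * diversity_loss(batch)
    )


# Toy 2-d embeddings (made-up values).
a, b = [1.0, 0.0], [0.0, 1.0]
loss = total_loss(triplet=(a, a, b), analogy=(a, a, b, b), batch=[a, b])
print(loss)
```

In this toy case every term is already satisfied: the positive matches the anchor, the analogy lands exactly on its target, and the two batch vectors are orthogonal.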
## Installation

Install the library directly from this Hugging Face repository:

```bash
pip install git+https://huggingface.co/Lorg0n/hikka-forge-anime2vec
```

## Usage

The library provides a high-level `Anime2Vec` class that handles downloading the model artifacts, preprocessing the data, and generating embeddings.

```python
from hikka_forge import Anime2Vec

# Initialize the model. All required artifacts will be downloaded
# and cached automatically on the first run.
anime2vec = Anime2Vec()

# 1. Prepare data for a target anime.
# The `encode` method expects a dictionary with specific keys.
frieren_data = {
    "ua_title": "Фрірен, що проводжає в останню путь",
    "en_title": "Frieren: Beyond Journey's End",
    "original_title": "Sousou no Frieren",
    "ua_description": "Ельфійка-чарівниця Фрірен перемогла Короля Демонів...",
    "en_description": "The elf mage Frieren and her courageous fellow adventurers...",
    "alternate_names": ["Sousou no Frieren"],
    "genres": ["Adventure", "Drama", "Fantasy"],
    "studio": "Madhouse",
    "type": "TV",
    "numerical_features": [8.9, 500000, 2023, 28, 24, 100]  # Example data
}

# 2. Generate the 512-dimensional vector representation
frieren_vector = anime2vec.encode(frieren_data)

print(f"Resulting vector for '{frieren_data['en_title']}' has shape: {frieren_vector.shape}")
# Now you can use this vector for similarity search, clustering, or vector arithmetic.
```

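Similarity search over a set of pre-computed vectors can be sketched as follows. The `catalog` dictionary, its placeholder titles and toy 3-dimensional vectors, and the `top_k` helper are hypothetical. In practice the values would be the 512-dimensional outputs of `anime2vec.encode`, and a vector database or an array library would likely replace the pure-Python loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, catalog, k=2):
    """Return the k catalog entries most similar to `query`, best first."""
    scored = [
        (cosine_similarity(query, vec), title) for title, vec in catalog.items()
    ]
    scored.sort(reverse=True)
    return [(title, score) for score, title in scored[:k]]


# Placeholder catalog with made-up vectors.
catalog = {
    "anime_a": [0.9, 0.1, 0.0],
    "anime_b": [0.0, 1.0, 0.0],
    "anime_c": [0.7, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

for title, score in top_k(query_vector, catalog):
    print(f"{title}: {score:.3f}")
```

Cosine similarity is used here because it ignores vector length and compares direction only, which is the usual choice for embedding search.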
### Vector Arithmetic Example

The model also supports conceptual vector arithmetic, a capability learned from the LLM-generated training data.

```python
# This is a conceptual example. It reuses the `anime2vec` instance created above.
# `attack_on_titan_data` and `code_geass_data` are not defined here: they are
# dictionaries in the same format as `frieren_data`. You would also need to
# pre-compute vectors for all anime in your database to perform the final
# similarity search.

# Get vectors for two anime
aot_vector = anime2vec.encode(attack_on_titan_data)
code_geass_vector = anime2vec.encode(code_geass_data)

# Find the semantic average between them
# This should represent concepts like "military drama with sci-fi/mecha elements"
average_vector = (aot_vector + code_geass_vector) / 2.0

# You can now use `average_vector` to find anime that fit this
# hybrid description in your own vector database.
# Expected results: Aldnoah.Zero, Mobile Suit Gundam: Iron-Blooded Orphans, etc.
```

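The same pattern extends to the analogy form `"Anime A" - "Anime X" + "Anime Y"`. The sketch below uses toy 3-dimensional vectors and placeholder titles, and the helpers `normalize`, `analogy`, and `nearest` are hypothetical names introduced for illustration. It normalizes each input first so that no single vector dominates the sum, and it excludes the input titles from the results, since they tend to rank near the top of their own query.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def analogy(a, x, y):
    """Compute normalize(a) - normalize(x) + normalize(y)."""
    a, x, y = normalize(a), normalize(x), normalize(y)
    return [ai - xi + yi for ai, xi, yi in zip(a, x, y)]


def nearest(query, catalog, exclude=()):
    """Title of the catalog vector with the highest cosine similarity."""
    q = normalize(query)
    best_title, best_score = None, -2.0  # Cosine similarity is never below -1.
    for title, vec in catalog.items():
        if title in exclude:
            continue
        score = sum(qi * vi for qi, vi in zip(q, normalize(vec)))
        if score > best_score:
            best_title, best_score = title, score
    return best_title


# Placeholder catalog with made-up vectors.
catalog = {
    "anime_a": [1.0, 1.0, 0.0],
    "anime_x": [1.0, 0.0, 0.0],
    "anime_y": [0.0, 0.0, 1.0],
    "anime_b": [0.0, 1.0, 1.0],
    "anime_c": [1.0, 0.0, 1.0],
}

query = analogy(catalog["anime_a"], catalog["anime_x"], catalog["anime_y"])
match = nearest(query, catalog, exclude=("anime_a", "anime_x", "anime_y"))
print(match)
```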
## Limitations and Bias

- **Data Bias**: The model's understanding is shaped by its training data. The base data from `hikka.io` and the synthetic data from the LLM both have inherent biases which will be reflected in the model's embeddings.
- **Textual Keyword Collision**: The model relies heavily on text descriptions. Sometimes, an anime from a completely different genre might use specific keywords (e.g., "political intrigue," "strategy") in its synopsis, causing it to appear anomalously in search results for serious thrillers.
- **Encoder Vocabulary**: While the categorical encoders (`genre`, `studio`) were trained on a comprehensive list, they are not exhaustive. A brand-new studio or a very niche genre tag not present in the original database will be treated as `UNKNOWN`.

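The `UNKNOWN` fallback described above can be pictured with a small label encoder. This is a generic sketch of the pattern, not the package's implementation: the class name `LabelEncoderWithUnknown`, the id layout, and the studio names other than "Madhouse" are assumptions.

```python
class LabelEncoderWithUnknown:
    """Maps known labels to integer ids and everything else to one UNKNOWN id."""

    UNKNOWN = "UNKNOWN"

    def __init__(self, known_labels):
        # Id 0 is reserved for UNKNOWN; known labels start at 1.
        self.label_to_id = {self.UNKNOWN: 0}
        for label in known_labels:
            self.label_to_id.setdefault(label, len(self.label_to_id))

    def encode(self, label):
        """Return the id for `label`, or the UNKNOWN id if it was never seen."""
        return self.label_to_id.get(label, self.label_to_id[self.UNKNOWN])


studio_encoder = LabelEncoderWithUnknown(["Madhouse", "Example Studio"])
print(studio_encoder.encode("Madhouse"))          # 1
print(studio_encoder.encode("Brand New Studio"))  # 0, the UNKNOWN id
```

If the package follows this common pattern, all unseen labels share a single representation, so two different new studios would look identical to the model on that feature.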
## Citation

If you use this model in your work, please consider citing it:

```bibtex
@misc{lorg0n2025anime2vec,
  author = {Lorg0n},
  title = {{hikka-forge-anime2vec: A Semantic Vector Space Model for Anime}},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Lorg0n/hikka-forge-anime2vec}},
  note = {Hugging Face repository}
}
```