@@ -1,33 +1,125 @@
 # hikka-forge-anime2vec
 This repository contains `hikka-forge-anime2vec`, a sophisticated semantic vector space model for anime, created by [Lorg0n](https://huggingface.co/Lorg0n).
-This repository is also a directly installable Python package.
 ## Installation
 ```bash
 pip install git+https://huggingface.co/Lorg0n/hikka-forge-anime2vec
 ```
 ## Usage
 ```python
 from hikka_forge import Anime2Vec
-# Initialize the model. All artifacts will be downloaded on first run.
 anime2vec = Anime2Vec()
-# Prepare data for a target anime
 frieren_data = {
     "en_title": "Frieren: Beyond Journey's End",
     "genres": ["Adventure", "Drama", "Fantasy"],
     "studio": "Madhouse",
     "type": "TV",
-    # ... other relevant fields
 }
-# Generate the 512-dimensional vector representation
 frieren_vector = anime2vec.encode(frieren_data)
-print(f"Resulting vector shape: {frieren_vector.shape}")
 ```

+---
+license: apache-2.0
+language:
+- en
+- uk
+- ja
+tags:
+- pytorch
+- sentence-transformers
+- feature-extraction
+- semantic-search
+- vector-arithmetic
+- anime
+- hikka
+- hikka-forge
+datasets:
+- private
+- synthetic
+model-index:
+- name: hikka-forge-anime2vec
+  results: []
+---
 # hikka-forge-anime2vec
 This repository contains `hikka-forge-anime2vec`, a sophisticated semantic vector space model for anime, created by [Lorg0n](https://huggingface.co/Lorg0n).
+This repository is also a directly installable Python package, so you can generate anime embeddings from within your own projects.
+## Model Details
+- **Model Version**: v12
+- **Architecture**: A multi-input neural network with separate processing streams for text, genres, studios, and other categorical/numerical features. It uses attention mechanisms to weigh the importance of different text fields and genres, creating a rich, context-aware representation.
+- **Embedding Size**: The model outputs a **512-dimensional** vector for each anime.
+## Data & Training
+The model's understanding of conceptual relationships comes from a unique blend of real-world and synthetic data.
+- **Base Data**: The foundational metadata (titles, genres, studios, etc.) was sourced from the [hikka.io](https://hikka.io/) database. This provided a solid, real-world grounding for the model.
+- **Synthetic Data Augmentation**: To teach the model conceptual and arithmetic relationships (e.g., `"Anime A" - "Anime X" + "Anime Y" ≈ "Anime B"`), the training set was heavily augmented with data generated by a Large Language Model (LLM). This synthetic data is what enables the model's vector arithmetic capabilities.
+- **Training Procedure**: The model was trained using a combination of three loss functions:
+    1.  **Triplet Loss**: Based on user recommendations to learn similarity.
+    2.  **Cosine Similarity Loss**: For the LLM-generated vector arithmetic examples.
+    3.  **Diversity Loss**: To ensure a well-distributed embedding space and prevent embedding collapse, where many anime map to nearly the same vector.
 ## Installation
+Install the library directly from this Hugging Face repository:
 ```bash
 pip install git+https://huggingface.co/Lorg0n/hikka-forge-anime2vec
 ```
 ## Usage
+The library provides a simple, high-level `Anime2Vec` class that handles all the complexity of downloading models, preprocessing data, and generating embeddings.
 ```python
 from hikka_forge import Anime2Vec
+# Initialize the model. All required artifacts will be downloaded
+# and cached automatically on the first run.
 anime2vec = Anime2Vec()
+# 1. Prepare data for a target anime.
+# The `encode` method expects a dictionary with specific keys.
 frieren_data = {
+    "ua_title": "Фрірен, що проводжає в останню путь",
     "en_title": "Frieren: Beyond Journey's End",
+    "original_title": "Sousou no Frieren",
+    "ua_description": "Ельфійка-чарівниця Фрірен перемогла Короля Демонів...",
+    "en_description": "The elf mage Frieren and her courageous fellow adventurers...",
+    "alternate_names": ["Sousou no Frieren"],
     "genres": ["Adventure", "Drama", "Fantasy"],
     "studio": "Madhouse",
     "type": "TV",
+    "numerical_features": [8.9, 500000, 2023, 28, 24, 100],  # Example values; their order and meaning are not documented here
 }
+# 2. Generate the 512-dimensional vector representation
 frieren_vector = anime2vec.encode(frieren_data)
+print(f"Resulting vector for '{frieren_data['en_title']}' has shape: {frieren_vector.shape}")
+# Now you can use this vector for similarity search, clustering, or vector arithmetic.
+```
+### Vector Arithmetic Example
+The model is designed to support conceptual arithmetic on its vectors, a capability it learns from the LLM-generated training data.
+```python
+# This is a conceptual example. You would need to pre-compute vectors
+# for all anime in your database to perform the final similarity search.
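+# Get vectors for two anime. `attack_on_titan_data` and `code_geass_data` are
+# not defined here; they are assumed to be dictionaries in the same format
+# as `frieren_data` in the Usage section.
+aot_vector = anime2vec.encode(attack_on_titan_data)
+code_geass_vector = anime2vec.encode(code_geass_data)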
+# Find the semantic average between them
+# This should represent concepts like "military drama with sci-fi/mecha elements"
+average_vector = (aot_vector + code_geass_vector) / 2.0
+# You can now use `average_vector` to find anime that fit this
+# hybrid description in your own vector database.
+# Expected results: Aldnoah.Zero, Mobile Suit Gundam: Iron-Blooded Orphans, etc.
+```
+## Limitations and Bias
+- **Data Bias**: The model's understanding is shaped by its training data. The base data from `hikka.io` and the synthetic data from the LLM both have inherent biases which will be reflected in the model's embeddings.
+- **Textual Keyword Collision**: The model relies heavily on text descriptions. Sometimes, an anime from a completely different genre might use specific keywords (e.g., "political intrigue," "strategy") in its synopsis, causing it to appear anomalously in search results for serious thrillers.
+- **Encoder Vocabulary**: While the categorical encoders (`genre`, `studio`) were trained on a comprehensive list, they are not exhaustive. A brand-new studio or a very niche genre tag not present in the original database will be treated as 'UNKNOWN'.
+## Citation
+If you use this model in your work, please consider citing it:
+```bibtex
+@misc{lorg0n2025anime2vec,
+  author       = {Lorg0n},
+  title        = {{hikka-forge-anime2vec: A Semantic Vector Space Model for Anime}},
+  year         = {2025},
+  publisher    = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/Lorg0n/hikka-forge-anime2vec}},
+  note         = {Hugging Face repository}
+}
 ```
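
The vector arithmetic example in the README stops before the final similarity search. The block below is a minimal sketch of that step, not the package's own method. It assumes the output of `encode` can be converted to a NumPy array, and it uses random vectors as stand-ins for a pre-computed catalog of 512-dimensional embeddings. The names `normalize`, `top_k_similar`, and `catalog_vectors` are hypothetical and are not part of `hikka_forge`.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_similar(query: np.ndarray, catalog: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k catalog rows closest to `query`."""
    scores = normalize(catalog) @ normalize(query)
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]


# Stand-in catalog: in practice each row would come from `anime2vec.encode(...)`
# for one anime in your own database (assumption: rows are 512-dimensional).
rng = np.random.default_rng(0)
catalog_vectors = rng.normal(size=(1000, 512))

# Pretend rows 10 and 20 are the two source anime from the README example.
aot_vector = catalog_vectors[10]
code_geass_vector = catalog_vectors[20]
average_vector = (aot_vector + code_geass_vector) / 2.0

results = top_k_similar(average_vector, catalog_vectors, k=5)
for index, score in results:
    print(index, round(score, 3))
```

With real embeddings, the two source anime would normally rank first for their own average, so a practical search would filter them out before presenting results. For catalogs much larger than a few hundred thousand rows, an approximate nearest-neighbor index is likely a better fit than this brute-force scan.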