---
license: apache-2.0
language:
- en
- uk
- ja
tags:
- pytorch
- sentence-transformers
- feature-extraction
- semantic-search
- vector-arithmetic
- anime
- hikka
- hikka-forge
- 2vec
datasets:
- private
- synthetic
model-index:
- name: hikka-forge-anime2vec
results: []
---
# hikka-forge-anime2vec
This repository contains `hikka-forge-anime2vec`, a sophisticated semantic vector space model for anime, created by [Lorg0n](https://huggingface.co/Lorg0n).
This repository is also a directly installable Python package, so you can integrate anime vectorization into your own projects.
## Model Details
- **Model Version**: v13
- **Architecture**: A multi-input neural network with separate processing streams for text, genres, studios, and other categorical/numerical features. It uses attention mechanisms to weigh the importance of different text fields and genres, creating a rich, context-aware representation.
- **Embedding Size**: The model outputs a **512-dimensional** vector for each anime.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62827ced3e7734836f331c72/ySTuRJKgYMQQjb94KwkNb.png)
## Data & Training
The model's understanding of conceptual relationships comes from a unique blend of real-world and synthetic data.
- **Base Data**: The foundational metadata (titles, genres, studios, etc.) was sourced from the [hikka.io](https://hikka.io/) database. This provided a solid, real-world grounding for the model.
- **Synthetic Data Augmentation**: To teach the model complex conceptual and arithmetic relationships (e.g., `"Anime A" - "Anime X" + "Anime Y" = "Anime B"`), the training set was heavily augmented with data generated by a Large Language Model (LLM). This synthetic data was crucial for enabling the model's advanced vector arithmetic capabilities.
- **Training Procedure**: The model was trained using a combination of three loss functions:
1. **Triplet Loss**: Based on user recommendations to learn similarity.
2. **Cosine Similarity Loss**: For the LLM-generated vector arithmetic examples.
3. **Diversity Loss**: To ensure a well-distributed embedding space and prevent model collapse.
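The three objectives above can be sketched as follows. This is a minimal illustration of the general loss formulations, not the actual training code; the margin value and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull the anchor toward the positive and push it from the negative,
    # using cosine distance (margin value is an assumption).
    pos_dist = 1 - F.cosine_similarity(anchor, positive)
    neg_dist = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(pos_dist - neg_dist + margin).mean()

def cosine_arithmetic_loss(predicted, target):
    # Match the result of "A - X + Y" against the embedding of "B".
    return (1 - F.cosine_similarity(predicted, target)).mean()

def diversity_loss(batch_embeddings):
    # Penalize high pairwise similarity within a batch to prevent collapse.
    normed = F.normalize(batch_embeddings, dim=1)
    sim = normed @ normed.T
    off_diag = sim - torch.eye(sim.size(0))
    return off_diag.abs().mean()
```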
## Installation
Install the library directly from this Hugging Face repository:
```bash
pip install git+https://huggingface.co/Lorg0n/hikka-forge-anime2vec
```
## Usage
The library provides a simple, high-level `Anime2Vec` class that handles all the complexity of downloading models, preprocessing data, and generating embeddings.
```python
from hikka_forge import Anime2Vec

# Initialize the model. All required artifacts will be downloaded
# and cached automatically on the first run.
anime2vec = Anime2Vec()

# 1. Prepare data for a target anime.
# The `encode` method expects a dictionary with specific keys.
frieren_data = {
    "ua_title": "Фрірен, що проводжає в останню путь",
    "en_title": "Frieren: Beyond Journey's End",
    "original_title": "Sousou no Frieren",
    "ua_description": "Ельфійка-чарівниця Фрірен перемогла Короля Демонів...",
    "en_description": "The elf mage Frieren and her courageous fellow adventurers...",
    "alternate_names": ["Sousou no Frieren"],
    "genres": ["Adventure", "Drama", "Fantasy"],
    "studio": "Madhouse",
    "type": "TV",
    "numerical_features": [8.9, 500000, 2023, 28, 24, 100]  # Example data
}

# 2. Generate the 512-dimensional vector representation
frieren_vector = anime2vec.encode(frieren_data)

print(f"Resulting vector for '{frieren_data['en_title']}' has shape: {frieren_vector.shape}")

# Now you can use this vector for similarity search, clustering, or vector arithmetic.
```
### Vector Arithmetic Example
The true power of this model lies in its ability to perform conceptual arithmetic, a skill honed by the LLM-generated training data.
```python
# This is a conceptual example. You would need to pre-compute vectors
# for all anime in your database to perform the final similarity search.

# Get vectors for two anime
aot_vector = anime2vec.encode(attack_on_titan_data)
code_geass_vector = anime2vec.encode(code_geass_data)

# Find the semantic average between them.
# This should represent concepts like "military drama with sci-fi/mecha elements".
average_vector = (aot_vector + code_geass_vector) / 2.0

# You can now use `average_vector` to find anime that fit this
# hybrid description in your own vector database.
# Expected results: Aldnoah.Zero, Mobile Suit Gundam: Iron-Blooded Orphans, etc.
```
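The final similarity search over your own database can be sketched with plain NumPy. This is a hedged example: `catalog_vectors` and `catalog_titles` are hypothetical names for a pre-computed matrix of embeddings and their matching titles, not part of the library's API.

```python
import numpy as np

def top_k_similar(query_vector, catalog_vectors, catalog_titles, k=5):
    # Normalize query and catalog so a dot product equals cosine similarity.
    q = query_vector / np.linalg.norm(query_vector)
    m = catalog_vectors / np.linalg.norm(catalog_vectors, axis=1, keepdims=True)
    scores = m @ q
    # Indices of the k highest-scoring catalog entries, best first.
    top = np.argsort(scores)[::-1][:k]
    return [(catalog_titles[i], float(scores[i])) for i in top]
```

For large catalogs you would typically swap this brute-force scan for an approximate nearest-neighbor index, but the cosine-similarity ranking is the same.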
## Limitations and Bias
- **Data Bias**: The model's understanding is shaped by its training data. The base data from `hikka.io` and the synthetic data from the LLM both have inherent biases which will be reflected in the model's embeddings.
- **Textual Keyword Collision**: The model relies heavily on text descriptions. Sometimes, an anime from a completely different genre might use specific keywords (e.g., "political intrigue," "strategy") in its synopsis, causing it to appear anomalously in search results for serious thrillers.
- **Encoder Vocabulary**: While the categorical encoders (`genre`, `studio`) were trained on a comprehensive list, they are not exhaustive. A brand-new studio or a very niche genre tag not present in the original database will be treated as 'UNKNOWN'.
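The 'UNKNOWN' fallback mentioned above typically works like the following sketch. This illustrates the general pattern only; the class name and id layout are assumptions, not the package's actual encoder.

```python
class CategoryEncoder:
    """Maps category labels (e.g. studios) to integer ids, with an UNKNOWN fallback."""

    def __init__(self, known_labels):
        # Reserve id 0 for labels unseen at training time.
        self.unknown_id = 0
        self.label_to_id = {label: i + 1 for i, label in enumerate(known_labels)}

    def encode(self, label):
        # Any label outside the training vocabulary collapses to UNKNOWN.
        return self.label_to_id.get(label, self.unknown_id)
```

A brand-new studio therefore contributes no distinguishing signal to the embedding; only the text and numerical features can differentiate it.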
## Citation
If you use this model in your work, please consider citing it:
```bibtex
@misc{lorg0n2025anime2vec,
author = {Lorg0n},
title = {{hikka-forge-anime2vec: A Semantic Vector Space Model for Anime}},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Lorg0n/hikka-forge-anime2vec}},
note = {Hugging Face repository}
}
```