Lorg0n committed on
Commit eeeacda · verified · 1 Parent(s): 24aa8bd

Update README.md

Files changed (1)
  1. README.md +98 -6
README.md CHANGED
@@ -1,33 +1,125 @@
  # hikka-forge-anime2vec

  This repository contains `hikka-forge-anime2vec`, a sophisticated semantic vector space model for anime, created by [Lorg0n](https://huggingface.co/Lorg0n).

- This repository is also a directly installable Python package.

  ## Installation

  ```bash
  pip install git+https://huggingface.co/Lorg0n/hikka-forge-anime2vec
  ```

  ## Usage

  ```python
  from hikka_forge import Anime2Vec

- # Initialize the model. All artifacts will be downloaded on first run.
  anime2vec = Anime2Vec()

- # Prepare data for a target anime
  frieren_data = {
      "en_title": "Frieren: Beyond Journey's End",
      "genres": ["Adventure", "Drama", "Fantasy"],
      "studio": "Madhouse",
      "type": "TV",
-     # ... other relevant fields
  }

- # Generate the 512-dimensional vector representation
  frieren_vector = anime2vec.encode(frieren_data)
- print(f"Resulting vector shape: {frieren_vector.shape}")
  ```
 
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - uk
+ - ja
+ tags:
+ - pytorch
+ - sentence-transformers
+ - feature-extraction
+ - semantic-search
+ - vector-arithmetic
+ - anime
+ - hikka
+ - hikka-forge
+ datasets:
+ - private
+ - synthetic
+ model-index:
+ - name: hikka-forge-anime2vec
+   results: []
+ ---
+
  # hikka-forge-anime2vec

  This repository contains `hikka-forge-anime2vec`, a sophisticated semantic vector space model for anime, created by [Lorg0n](https://huggingface.co/Lorg0n).
+ This repository is also a directly installable Python package, allowing you to integrate powerful anime vectorization capabilities into your own projects with ease.
+
+ ## Model Details
+
+ - **Model Version**: v12
+ - **Architecture**: A multi-input neural network with separate processing streams for text, genres, studios, and other categorical/numerical features. It uses attention mechanisms to weigh the importance of different text fields and genres, creating a rich, context-aware representation.
+ - **Embedding Size**: The model outputs a **512-dimensional** vector for each anime.
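The attention idea in the Architecture bullet can be illustrated with a minimal sketch of softmax attention pooling over per-field embeddings. This is not the model's actual code: the `attention_pool` helper, the toy 3-dimensional vectors, and the `query` vector (standing in for a learned parameter) are all hypothetical and chosen only to show how per-field weights could be derived and applied.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(field_vectors, query):
    """Weight each field vector by softmax(dot(field, query)) and sum them."""
    names = list(field_vectors)
    scores = [
        sum(f * q for f, q in zip(field_vectors[name], query)) for name in names
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * field_vectors[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return pooled, dict(zip(names, weights))


# Hypothetical toy embeddings; the real model works with much larger vectors.
fields = {
    "en_title": [1.0, 0.0, 0.0],
    "en_description": [0.0, 2.0, 0.0],
    "alternate_names": [0.0, 0.0, 0.5],
}
query = [0.5, 1.0, 0.1]  # assumption: stands in for a learned attention query

pooled, weights = attention_pool(fields, query)
print(weights)  # the field most aligned with the query gets the largest weight
```

In this toy setup the description field dominates the pooled vector because its score against the query is highest; a trained model would learn the query (or an equivalent scoring layer) from data.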
 
+ ## Data & Training
+
+ The model's understanding of conceptual relationships comes from a unique blend of real-world and synthetic data.
+ - **Base Data**: The foundational metadata (titles, genres, studios, etc.) was sourced from the [hikka.io](https://hikka.io/) database. This provided a solid, real-world grounding for the model.
+ - **Synthetic Data Augmentation**: To teach the model complex conceptual and arithmetic relationships (e.g., `"Anime A" - "Anime X" + "Anime Y" = "Anime B"`), the training set was heavily augmented with data generated by a Large Language Model (LLM). This synthetic data was crucial for enabling the model's advanced vector arithmetic capabilities.
+ - **Training Procedure**: The model was trained using a combination of three loss functions:
+   1. **Triplet Loss**: Based on user recommendations to learn similarity.
+   2. **Cosine Similarity Loss**: For the LLM-generated vector arithmetic examples.
+   3. **Diversity Loss**: To ensure a well-distributed embedding space and prevent model collapse.
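The three losses listed above could be combined along the following lines. This is a sketch under stated assumptions, not the training code from this repository: the README does not give the exact formulas, so the cosine-distance triplet hinge, the `1 - cos(A - X + Y, B)` arithmetic term, the squared-pairwise-similarity diversity term, the margin, and the loss weights are all illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on cosine distance: the positive should sit closer than the negative."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def arithmetic_loss(a, x, y, b):
    """Penalise misalignment between (A - X + Y) and the target B."""
    predicted = [ai - xi + yi for ai, xi, yi in zip(a, x, y)]
    return 1.0 - cosine(predicted, b)


def diversity_loss(batch):
    """Mean squared cosine similarity over distinct pairs (needs >= 2 vectors)."""
    pairs = [(i, j) for i in range(len(batch)) for j in range(i + 1, len(batch))]
    return sum(cosine(batch[i], batch[j]) ** 2 for i, j in pairs) / len(pairs)


def total_loss(triplet, analogy, batch, weights=(1.0, 1.0, 0.1)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    w_triplet, w_arith, w_div = weights
    return (
        w_triplet * triplet_loss(*triplet)
        + w_arith * arithmetic_loss(*analogy)
        + w_div * diversity_loss(batch)
    )


# Toy 2-dimensional vectors, for illustration only.
triplet = ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
analogy = ([1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0])
batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(total_loss(triplet, analogy, batch))
```

The diversity term is what discourages collapse: if every embedding in a batch pointed in the same direction, all pairwise similarities would be 1 and the term would reach its maximum.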
 
  ## Installation

+ Install the library directly from this Hugging Face repository:
+
  ```bash
  pip install git+https://huggingface.co/Lorg0n/hikka-forge-anime2vec
  ```

  ## Usage

+ The library provides a simple, high-level `Anime2Vec` class that handles all the complexity of downloading models, preprocessing data, and generating embeddings.
+
  ```python
  from hikka_forge import Anime2Vec

+ # Initialize the model. All required artifacts will be downloaded
+ # and cached automatically on the first run.
  anime2vec = Anime2Vec()

+ # 1. Prepare data for a target anime.
+ # The `encode` method expects a dictionary with specific keys.
  frieren_data = {
+     "ua_title": "Фрірен, що проводжає в останню путь",
      "en_title": "Frieren: Beyond Journey's End",
+     "original_title": "Sousou no Frieren",
+     "ua_description": "Ельфійка-чарівниця Фрірен перемогла Короля Демонів...",
+     "en_description": "The elf mage Frieren and her courageous fellow adventurers...",
+     "alternate_names": ["Sousou no Frieren"],
      "genres": ["Adventure", "Drama", "Fantasy"],
      "studio": "Madhouse",
      "type": "TV",
+     "numerical_features": [8.9, 500000, 2023, 28, 24, 100] # Example data
  }

+ # 2. Generate the 512-dimensional vector representation
  frieren_vector = anime2vec.encode(frieren_data)
+
+ print(f"Resulting vector for '{frieren_data['en_title']}' has shape: {frieren_vector.shape}")
+ # Now you can use this vector for similarity search, clustering, or vector arithmetic.
+ ```
+
+ ### Vector Arithmetic Example
+
+ The true power of this model lies in its ability to perform conceptual arithmetic, a skill honed by the LLM-generated training data.
+
+ ```python
+ # This is a conceptual example. You would need to pre-compute vectors
+ # for all anime in your database to perform the final similarity search.
+
+ # Get vectors for two anime
+ aot_vector = anime2vec.encode(attack_on_titan_data)
+ code_geass_vector = anime2vec.encode(code_geass_data)
+
+ # Find the semantic average between them
+ # This should represent concepts like "military drama with sci-fi/mecha elements"
+ average_vector = (aot_vector + code_geass_vector) / 2.0
+
+ # You can now use `average_vector` to find anime that fit this
+ # hybrid description in your own vector database.
+ # Expected results: Aldnoah.Zero, Mobile Suit Gundam: Iron-Blooded Orphans, etc.
+ ```
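The final similarity search that the conceptual example leaves open could look like the sketch below. It assumes vectors have already been computed and stored in a plain dictionary; the `top_k` helper, the catalog layout, and the toy 3-dimensional values are hypothetical stand-ins for real 512-dimensional outputs of `anime2vec.encode`, so the ranking shown says nothing about what the actual model returns.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k(query, catalog, k=2):
    """Return the k catalog titles most similar to `query`, best first."""
    ranked = sorted(catalog, key=lambda title: cosine(query, catalog[title]), reverse=True)
    return ranked[:k]


# Hypothetical pre-computed vectors (toy values, not model output).
catalog = {
    "Aldnoah.Zero": [0.7, 0.7, 0.1],
    "Mobile Suit Gundam: Iron-Blooded Orphans": [0.6, 0.8, 0.0],
    "Placeholder slice-of-life title": [0.0, 0.1, 1.0],
}

# Toy stand-ins for the two encoded anime from the example above.
aot_vector = [1.0, 0.2, 0.0]
code_geass_vector = [0.3, 1.0, 0.1]
average_vector = [(a + b) / 2.0 for a, b in zip(aot_vector, code_geass_vector)]

print(top_k(average_vector, catalog))
```

A linear scan like this is adequate for a few thousand titles; for larger catalogs a dedicated vector index would be the usual choice.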
+
+ ## Limitations and Bias
+
+ - **Data Bias**: The model's understanding is shaped by its training data. The base data from `hikka.io` and the synthetic data from the LLM both have inherent biases which will be reflected in the model's embeddings.
+ - **Textual Keyword Collision**: The model relies heavily on text descriptions. Sometimes, an anime from a completely different genre might use specific keywords (e.g., "political intrigue," "strategy") in its synopsis, causing it to appear anomalously in search results for serious thrillers.
+ - **Encoder Vocabulary**: While the categorical encoders (`genre`, `studio`) were trained on a comprehensive list, they are not exhaustive. A brand-new studio or a very niche genre tag not present in the original database will be treated as 'UNKNOWN'.
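The 'UNKNOWN' fallback described in the Encoder Vocabulary bullet can be pictured with a small sketch. The `CategoricalEncoder` class below is hypothetical and is not the package's encoder; it only shows the general pattern of reserving one id for values that were absent from the training vocabulary.

```python
class CategoricalEncoder:
    """Maps category strings to integer ids, reserving id 0 for unseen values."""

    UNKNOWN = "UNKNOWN"

    def __init__(self, known_values):
        self.index = {self.UNKNOWN: 0}
        for value in known_values:
            # setdefault keeps the first id if a value is listed twice
            self.index.setdefault(value, len(self.index))

    def encode(self, value):
        """Return the id for `value`, falling back to the UNKNOWN id."""
        return self.index.get(value, self.index[self.UNKNOWN])


# Hypothetical two-entry vocabulary, for illustration only.
studio_encoder = CategoricalEncoder(["Madhouse", "Sunrise"])
print(studio_encoder.encode("Madhouse"))          # id of a known studio
print(studio_encoder.encode("Brand New Studio"))  # falls back to the UNKNOWN id
```

One practical consequence is that every unseen studio or genre shares the same id, so the model cannot tell two new studios apart from these fields alone.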
+
+ ## Citation
+
+ If you use this model in your work, please consider citing it:
+ ```bibtex
+ @misc{lorg0n2025anime2vec,
+   author = {Lorg0n},
+   title = {{hikka-forge-anime2vec: A Semantic Vector Space Model for Anime}},
+   year = {2025},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/Lorg0n/hikka-forge-anime2vec}},
+   note = {Hugging Face repository}
+ }
  ```