Lorg0n committed on
Commit 24aa8bd · verified · 1 Parent(s): bc2410c

feat: Upload Python package structure and model artifacts

Files changed (5)
  1. LICENSE +201 -0
  2. README.md +22 -37
  3. pyproject.toml +31 -0
  4. src/hikka_forge/__init__.py +1 -0
  5. src/hikka_forge/vectorizer.py +106 -0
LICENSE ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright 2025 Lorg0n
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md CHANGED
@@ -1,48 +1,33 @@
- ---
- license: mit
- language:
- - en
- - uk
- - ja
- tags:
- - anime
- - embeddings
- - semantic-search
- - vector-arithmetic
- - pytorch
- datasets:
- - private
- author: Lorg0n
- ---
-
  # hikka-forge-anime2vec

  This repository contains `hikka-forge-anime2vec`, a sophisticated semantic vector space model for anime, created by [Lorg0n](https://huggingface.co/Lorg0n).

- The model is trained to understand deep connections between titles based on multilingual textual descriptions, genres, studios, and other metadata. It supports vector arithmetic, allowing for creative queries like `"Show me something like 'Spirited Away' - 'Ghibli Style' + 'Cyberpunk'"`.
-
- ## Model Details
-
- - **Model Version**: v12
- - **Architecture**: A multi-input neural network with separate processing streams for text, genres, and other categorical/numerical features. It uses attention mechanisms to weigh the importance of different text fields and genres.
- - **Training**: Trained using a combination of Triplet Loss (from explicit user recommendations), Cosine Similarity Loss for vector arithmetic examples, and a Diversity Loss to ensure a well-distributed embedding space.
- - **Data**: Trained on a private, non-public database of anime titles.
-
- ## How to Use
-
- *This model requires custom code for loading and inference due to its unique architecture and preprocessing steps.*
-
- A full usage example will be provided soon. The general workflow involves:
- 1. Loading the model, config, and pickled `LabelEncoder` objects.
- 2. Preprocessing new anime data (fetching from a data source, encoding text with a SentenceTransformer, etc.).
- 3. Using the model to generate a 512-dimensional embedding.
- 4. Performing similarity search or vector arithmetic in the embedding space.
-
- ## Files in this Repository
-
- This repository contains all files necessary for model inference:
-
- - `pytorch_model.bin`: The trained model weights.
- - `config.json`: Configuration file specifying model architecture and vocabulary sizes.
- - `model.py`: The Python code defining the `AnimeEmbeddingModel` class.
- - `le_genre.pkl`, `le_studio.pkl`, `le_type.pkl`: Pickled Scikit-learn `LabelEncoder` objects required for preprocessing new data.
+ This repository is also a directly installable Python package.
+
+ ## Installation
+
+ ```bash
+ pip install git+https://huggingface.co/Lorg0n/hikka-forge-anime2vec
+ ```
+
+ ## Usage
+
+ ```python
+ from hikka_forge import Anime2Vec
+
+ # Initialize the model. All artifacts will be downloaded on first run.
+ anime2vec = Anime2Vec()
+
+ # Prepare data for a target anime
+ frieren_data = {
+     "en_title": "Frieren: Beyond Journey's End",
+     "genres": ["Adventure", "Drama", "Fantasy"],
+     "studio": "Madhouse",
+     "type": "TV",
+     # ... other relevant fields
+ }
+
+ # Generate the 512-dimensional vector representation
+ frieren_vector = anime2vec.encode(frieren_data)
+ print(f"Resulting vector shape: {frieren_vector.shape}")
+ ```
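Since `encode` returns a plain NumPy array, similarity checks and the vector arithmetic this model supports need nothing beyond NumPy. The sketch below is illustrative, not part of the package: the `cosine_similarity` helper is a hypothetical name, and the random vectors stand in for real `encode` outputs so the snippet runs without downloading the model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for vectors that would normally come from anime2vec.encode(...)
rng = np.random.default_rng(0)
frieren_vector = rng.normal(size=512)
other_vector = rng.normal(size=512)
style_vector = rng.normal(size=512)

# Direct similarity between two titles
score = cosine_similarity(frieren_vector, other_vector)

# Vector arithmetic: move one title's embedding away from one concept
# and toward another, then compare candidates against the result
query_vector = frieren_vector - style_vector + other_vector
print(cosine_similarity(query_vector, frieren_vector))
```

In practice `query_vector` would be compared against every title in a catalogue and the highest-scoring matches returned.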
pyproject.toml ADDED
@@ -0,0 +1,31 @@
+ [build-system]
+ requires = ["setuptools>=61.0"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "hikka-forge-anime2vec"
+ version = "0.1.0"
+ authors = [ { name="Lorg0n" } ]
+ description = "A semantic vector space model for anime with vector arithmetic capabilities."
+ readme = "README.md"
+ requires-python = ">=3.8"
+ license = { file="LICENSE" }
+ classifiers = [
+     "Programming Language :: Python :: 3",
+     "License :: OSI Approved :: Apache Software License",
+     "Operating System :: OS Independent",
+     "Topic :: Scientific/Engineering :: Artificial Intelligence",
+ ]
+ dependencies = [
+     "torch",
+     "huggingface_hub",
+     "sentence-transformers",
+     "scikit-learn",
+     "numpy",
+ ]
+
+ [project.urls]
+ "Homepage" = "https://huggingface.co/Lorg0n/hikka-forge-anime2vec"
+
+ [tool.setuptools.packages.find]
+ where = ["src"]
src/hikka_forge/__init__.py ADDED
@@ -0,0 +1 @@
+ from .vectorizer import Anime2Vec
src/hikka_forge/vectorizer.py ADDED
@@ -0,0 +1,106 @@
+ import torch
+ import pickle
+ import json
+ from huggingface_hub import hf_hub_download
+ from sentence_transformers import SentenceTransformer
+ import numpy as np
+ import importlib.util
+ from pathlib import Path
+
+ class Anime2Vec:
+     """
+     A high-level wrapper to easily use the hikka-forge-anime2vec model.
+     It automatically downloads all required artifacts from the Hugging Face Hub.
+     """
+     def __init__(self, repo_id: str = "Lorg0n/hikka-forge-anime2vec", device: str = None):
+         print(f"🚀 Initializing Anime2Vec from repository: {repo_id}")
+
+         self.device = device if device else ("cuda" if torch.cuda.is_available() else "cpu")
+         print(f"   - Using device: {self.device}")
+
+         cache_dir = Path.home() / ".cache" / "hikka-forge"
+
+         # Download all necessary files from the repo root
+         config_path = hf_hub_download(repo_id=repo_id, filename="config.json", cache_dir=cache_dir)
+         model_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin", cache_dir=cache_dir)
+         model_code_path = hf_hub_download(repo_id=repo_id, filename="model.py", cache_dir=cache_dir)
+         le_genre_path = hf_hub_download(repo_id=repo_id, filename="le_genre.pkl", cache_dir=cache_dir)
+         le_studio_path = hf_hub_download(repo_id=repo_id, filename="le_studio.pkl", cache_dir=cache_dir)
+         le_type_path = hf_hub_download(repo_id=repo_id, filename="le_type.pkl", cache_dir=cache_dir)
+
+         # Load configuration and encoders
+         with open(config_path, 'r') as f:
+             self.config = json.load(f)
+         with open(le_genre_path, 'rb') as f:
+             self.le_genre = pickle.load(f)
+         with open(le_studio_path, 'rb') as f:
+             self.le_studio = pickle.load(f)
+         with open(le_type_path, 'rb') as f:
+             self.le_type = pickle.load(f)
+
+         # Dynamically import the model class from the downloaded model.py
+         spec = importlib.util.spec_from_file_location("AnimeEmbeddingModel", model_code_path)
+         model_module = importlib.util.module_from_spec(spec)
+         spec.loader.exec_module(model_module)
+         AnimeEmbeddingModel = model_module.AnimeEmbeddingModel
+
+         # Initialize the model and load its weights
+         self.model = AnimeEmbeddingModel(
+             vocab_sizes=self.config['vocab_sizes'],
+             embedding_dims=self.config['embedding_dims'],
+             text_embedding_size=self.config['text_embedding_size']
+         )
+         self.model.load_state_dict(torch.load(model_path, map_location=self.device))
+         self.model.to(self.device)
+         self.model.eval()
+
+         # Initialize the text encoder
+         self.text_encoder = SentenceTransformer(
+             'Lorg0n/hikka-forge-paraphrase-multilingual-MiniLM-L12-v2',
+             device=self.device
+         )
+         print("✅ Initialization complete. Model is ready to use.")
+
+     @torch.no_grad()
+     def encode(self, anime_data: dict) -> np.ndarray:
+         """
+         Encodes a dictionary of anime data into a 512-dimensional vector.
+         """
+         text_fields = [
+             anime_data.get('ua_description', ''), anime_data.get('en_description', ''),
+             anime_data.get('ua_title', ''), anime_data.get('en_title', ''),
+             anime_data.get('original_title', ''), "; ".join(anime_data.get('alternate_names', []))
+         ]
+         text_embeddings = self.text_encoder.encode(text_fields, convert_to_tensor=True)
+
+         known_genres = [g for g in anime_data.get('genres', []) if g in self.le_genre.classes_]
+         genre_ids = self.le_genre.transform(known_genres) if known_genres else [0]
+
+         try:
+             studio_id = self.le_studio.transform([anime_data.get('studio', 'UNKNOWN')])[0]
+         except ValueError:
+             studio_id = self.le_studio.transform(['UNKNOWN'])[0]
+
+         try:
+             type_id = self.le_type.transform([anime_data.get('type', 'UNKNOWN')])[0]
+         except ValueError:
+             type_id = self.le_type.transform(['UNKNOWN'])[0]
+
+         numerical = torch.tensor(anime_data.get('numerical_features', [0.0]*6), dtype=torch.float32)
+
+         batch = {
+             'precomputed_ua_desc': text_embeddings[0], 'precomputed_en_desc': text_embeddings[1],
+             'precomputed_ua_title': text_embeddings[2], 'precomputed_en_title': text_embeddings[3],
+             'precomputed_original_title': text_embeddings[4], 'precomputed_alternate_names': text_embeddings[5],
+             'genres': torch.tensor(genre_ids, dtype=torch.long),
+             'studio': torch.tensor(studio_id, dtype=torch.long),
+             'type': torch.tensor(type_id, dtype=torch.long),
+             'numerical': numerical
+         }
+
+         for key, tensor in batch.items():
+             batch[key] = tensor.unsqueeze(0).to(self.device)
+         batch['genres_mask'] = (batch['genres'] != 0).long()
+
+         embedding = self.model(batch)
+         return embedding.squeeze().cpu().numpy()
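`encode` produces one vector per title, so ranking a catalogue against a query reduces to a normalized matrix-vector product. A minimal sketch of that downstream step, assuming catalogue embeddings have already been computed and stacked row-wise; the `top_k_similar` helper name and the synthetic data are illustrative, not part of the package:

```python
import numpy as np

def top_k_similar(query: np.ndarray, catalogue: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k catalogue rows most cosine-similar to the query."""
    # Normalize rows and the query so the dot product equals cosine similarity
    catalogue_norm = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = catalogue_norm @ query_norm
    # argsort is ascending; reverse to rank by descending similarity
    return np.argsort(scores)[::-1][:k]

# Synthetic stand-ins for 512-dimensional Anime2Vec.encode(...) outputs
rng = np.random.default_rng(42)
catalogue = rng.normal(size=(100, 512))
query = catalogue[7] + 0.01 * rng.normal(size=512)  # near-duplicate of row 7

print(top_k_similar(query, catalogue, k=3))
```

For a real catalogue of any size, the same idea would typically be delegated to a vector index rather than a dense matrix product.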