jonny9f committed
Commit 69c350b · verified · 1 Parent(s): 852e57f

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +306 -128

README.md CHANGED
@@ -1,179 +1,357 @@
  ---
- language: en
  tags:
  - sentence-transformers
- - food
- - embeddings
- - semantic-search
- - all-mpnet-base-v2
- library_name: sentence-transformers
- pipeline_tag: sentence-similarity
- license: mit
  widget:
- - source_sentence: Grilled chicken breast
    sentences:
-   - Roasted chicken breast
-   - Chicken thigh
-   - Salmon fillet
- - source_sentence: Fresh apple sliced
    sentences:
-   - Apple diced
-   - Apple juice
-   - Orange sliced
- - source_sentence: Greek yogurt plain
    sentences:
-   - Plain yogurt
-   - Vanilla yogurt
-   - Sour cream
- metrics:
- - pearson_cosine
- - spearman_cosine
- model-index:
- - name: Food Embeddings Model v2
-   results:
-   - task:
-       type: semantic-similarity
-       name: Food Description Semantic Similarity
-     dataset:
-       name: validation
-       type: validation
-     metrics:
-     - type: pearson_cosine
-       value: 0.9913
-       name: Pearson Cosine
-     - type: spearman_cosine
-       value: 0.9868
-       name: Spearman Cosine
  ---

- # Food Embeddings Model v2

- This is a [sentence-transformers](https://www.SBERT.net) model specialized for food description embeddings, fine-tuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). It maps food descriptions to a 768-dimensional dense vector space optimized for semantic food search, matching, and similarity comparison.

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
- - **Base Model:** sentence-transformers/all-mpnet-base-v2
  - **Maximum Sequence Length:** 384 tokens
  - **Output Dimensionality:** 768 dimensions
  - **Similarity Function:** Cosine Similarity

  ### Full Model Architecture

  ```
  SentenceTransformer(
-   (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
    (2): Normalize()
  )
  ```

- ## Use Cases
-
- This model is optimized for:
- - Semantic food search
- - Food item matching and deduplication
- - Similar food recommendations
- - Vector database indexing for food catalogs
- - Nutritional analysis support

  ### Direct Usage (Sentence Transformers)

  ```python
  from sentence_transformers import SentenceTransformer

- # Load model
- model = SentenceTransformer('jonny9f/food_embeddings2')
-
- # Generate embeddings
- food_description = "grilled chicken breast"
- embedding = model.encode(food_description)

- # Use for similarity search, vector database indexing, etc.
  ```

- ### Integration Example

- ```python
- import numpy as np
- from sentence_transformers import SentenceTransformer
- from scipy.spatial.distance import cosine
-
- model = SentenceTransformer('jonny9f/food_embeddings2')
-
- # Generate embeddings for multiple foods
- foods = [
-     "grilled chicken breast",
-     "roasted chicken breast",
-     "chicken thigh",
-     "salmon fillet"
- ]
- embeddings = model.encode(foods)
-
- # Compare similarities
- def get_similarity(emb1, emb2):
-     return 1 - cosine(emb1, emb2)
-
- # Print similarity matrix
- for i, food1 in enumerate(foods):
-     for j, food2 in enumerate(foods):
-         sim = get_similarity(embeddings[i], embeddings[j])
-         print(f"{food1} vs {food2}: {sim:.3f}")
- ```

  ## Training Details

- ### Model Training
- - **Training Framework:** sentence-transformers
- - **Training Mode:** Triplet Loss with self-supervision
- - **Dataset Size:** 1.2 million training samples
- - **Loss Function:** CosineSimilarityLoss with MSE
- - **Training Steps:** 37,500 per epoch
- - **Batch Size:** 32
- - **Training Time:** 3 epochs
-
- ### Training Data
- The model was trained on carefully curated food description triplets:
- - Anchor: Base food description
- - Positive: Alternative description of the same food
- - Negative: Description of a different food
- - Self-supervision: For negative-only examples, anchor serves as its own positive
-
- ### Performance
-
- Best validation metrics achieved:
- - **Pearson Cosine Correlation:** 0.9913
- - **Spearman Cosine Correlation:** 0.9868
-
- Training showed consistent improvement in loss values:
- - Initial Loss: ~0.0031
- - Final Loss: ~0.0004
-
- ## Limitations and Biases
-
- - Model performance may vary for:
-   - Very long food descriptions
-   - Rare or unusual food items
-   - Highly technical nutritional terminology
- - Primary focus on English language descriptions
- - Best suited for standard food items and common variations

  ## Citation

- If you use this model, please cite:

  ```bibtex
- @software{food_embeddings2,
-   author = {Jon},
-   title = {Food Embeddings Model v2},
-   year = {2025},
-   publisher = {HuggingFace},
-   url = {https://huggingface.co/jonny9f/food_embeddings2}
  }
  ```

- ## Framework Versions
- - Python: 3.11
- - Sentence Transformers: 2.2.2
- - PyTorch: 2.0.1
- - Transformers: 4.33.2

  ---
  tags:
  - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:6010
+ - loss:TripletLoss
+ base_model: sentence-transformers/all-mpnet-base-v2
  widget:
+ - source_sentence: Egg white, scrambled
+   sentences:
+   - Peaches, Frozen, Sliced, Sweetened
+   - Egg White Replacer, Loprofin
+   - Lemon, wedges (usually a garnish)
+ - source_sentence: Crackers, saltines
    sentences:
+   - Crackers, Saltines
+   - Crackers, Saltines, Whole Wheat
+   - Loops, Loprofin
+ - source_sentence: Cereal, corn flakes
    sentences:
+   - Wheat Flour, Whole-Grain, Soft Wheat
+   - Cereals, Ralston Corn Flakes
+   - Sorghum Grain
+ - source_sentence: Buffalo wings, hot sauce, celery
    sentences:
+   - Caramel corn (some protein from corn, watch portion)
+   - Sauce, Hot Chile Sriracha
+   - Sauce, Hot Chile, Tuong Ot Sriracha Sriracha
+ - source_sentence: Margarine, tub, light
+   sentences:
+   - Olive Oil
+   - Margarine, Smart Balance Light Buttery Spread
+   - Onions, Yellow Sauteed
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
  ---

+ # SentenceTransformer based on sentence-transformers/all-mpnet-base-v2

+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
+ - **Base model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) <!-- at revision 12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0 -->
  - **Maximum Sequence Length:** 384 tokens
  - **Output Dimensionality:** 768 dimensions
  - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

  ### Full Model Architecture

  ```
  SentenceTransformer(
+   (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
    (2): Normalize()
  )
  ```
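
Since the module stack ends with `Normalize()`, every embedding is L2-normalized, so cosine similarity reduces to a plain dot product. A minimal sanity check (assuming the model is published at `jonny9f/food_embeddings2`, the repo this commit appears to target):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Repo id taken from the previous version of this card; adjust if it differs.
model = SentenceTransformer("jonny9f/food_embeddings2")
emb = model.encode(["Crackers, saltines"])  # ndarray of shape (1, 768)

# The trailing Normalize() module makes each vector unit length.
print(np.linalg.norm(emb[0]))  # ~1.0
```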

+ ## Usage

  ### Direct Usage (Sentence Transformers)

+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
  ```python
  from sentence_transformers import SentenceTransformer

+ # Download from the 🤗 Hub
+ model = SentenceTransformer("jonny9f/food_embeddings2")
+ # Run inference
+ sentences = [
+     'Margarine, tub, light',
+     'Margarine, Smart Balance Light Buttery Spread',
+     'Olive Oil',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (3, 768)

+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # torch.Size([3, 3])
  ```
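
Beyond pairwise scores, the same `similarity` call can rank candidates against a query. A short sketch reusing the widget examples above (continues from the snippet's `model`):

```python
# Rank candidate food descriptions against a query.
query_emb = model.encode(["Cereal, corn flakes"])  # shape (1, 768)
candidates = [
    "Cereals, Ralston Corn Flakes",
    "Sorghum Grain",
    "Wheat Flour, Whole-Grain, Soft Wheat",
]
cand_embs = model.encode(candidates)               # shape (3, 768)
scores = model.similarity(query_emb, cand_embs)    # tensor of shape [1, 3]
print(candidates[scores.argmax().item()])          # best match for the query
```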

+ <!--
+ ### Direct Usage (Transformers)

+ <details><summary>Click to see the direct usage in Transformers</summary>

+ </details>
+ -->

+ <!--
+ ### Downstream Usage (Sentence Transformers)

+ You can finetune this model on your own dataset.

+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->

  ## Training Details

+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 6,010 training samples
+ * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>sentence_2</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence_0 | sentence_1 | sentence_2 |
+   |:--------|:-----------|:-----------|:-----------|
+   | type    | string     | string     | string     |
+   | details | <ul><li>min: 3 tokens</li><li>mean: 9.03 tokens</li><li>max: 30 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 9.62 tokens</li><li>max: 30 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 10.11 tokens</li><li>max: 23 tokens</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 | sentence_2 |
+   |:-----------|:-----------|:-----------|
+   | <code>Green banana flour “meat” (experimental)</code> | <code>Green banana flour “meat” (experimental)</code> | <code>Chicken Drumstick, Meat and Skin, Cooked, Fried, Flour</code> |
+   | <code>Carrot-ginger soup</code> | <code>Cauliflower Carrot Soup, the Secret Garden</code> | <code>Cream of Potato Soup, Canned, With Milk</code> |
+   | <code>Milk, 2%</code> | <code>Milk, 2%</code> | <code>Milk, Whole 3.25% Milkfat, No Added Vitamins</code> |
+ * Loss: [<code>TripletLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss) with these parameters:
+   ```json
+   {
+       "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
+       "triplet_margin": 5
+   }
+   ```
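
For reference, these parameters correspond to the following loss construction in Sentence Transformers (a minimal sketch, assuming the base model as the starting point):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Euclidean triplet loss with margin 5, matching the JSON above: anchors are
# pulled toward positives until they are at least `triplet_margin` closer
# than the corresponding negatives.
loss = losses.TripletLoss(
    model=model,
    distance_metric=losses.TripletDistanceMetric.EUCLIDEAN,
    triplet_margin=5,
)
```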
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 3
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
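
Put together, a run with these non-default hyperparameters would look roughly like the sketch below. This is not the author's actual training script; the column names and sample rows are taken from the dataset table above, and the output directory is hypothetical.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Tiny stand-in for the 6,010-sample triplet dataset described above.
train_dataset = Dataset.from_dict({
    "sentence_0": ["Milk, 2%", "Carrot-ginger soup"],
    "sentence_1": ["Milk, 2%", "Cauliflower Carrot Soup, the Secret Garden"],
    "sentence_2": [
        "Milk, Whole 3.25% Milkfat, No Added Vitamins",
        "Cream of Potato Soup, Canned, With Milk",
    ],
})

loss = losses.TripletLoss(
    model=model,
    distance_metric=losses.TripletDistanceMetric.EUCLIDEAN,
    triplet_margin=5,
)

args = SentenceTransformerTrainingArguments(
    output_dir="food-embeddings-v2",  # hypothetical
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```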
+
+ ### Training Logs
+ | Epoch  | Step | Training Loss |
+ |:------:|:----:|:-------------:|
+ | 1.3298 | 500  | 4.4921        |
+ | 2.6596 | 1000 | 4.3269        |
+
+
+ ### Framework Versions
+ - Python: 3.11.3
+ - Sentence Transformers: 3.3.1
+ - Transformers: 4.48.0
+ - PyTorch: 2.5.1+cu124
+ - Accelerate: 1.2.1
+ - Datasets: 3.2.0
+ - Tokenizers: 0.21.0

  ## Citation

+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### TripletLoss
  ```bibtex
+ @misc{hermans2017defense,
+     title={In Defense of the Triplet Loss for Person Re-Identification},
+     author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
+     year={2017},
+     eprint={1703.07737},
+     archivePrefix={arXiv},
+     primaryClass={cs.CV}
  }
  ```

+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->