Matryoshka Representation Learning
Paper
• 2205.13147 • Published
• 25
This is a sentence-transformers model finetuned from Alibaba-NLP/gte-Qwen2-1.5B-instruct. It maps sentences & paragraphs to a 1536-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: Qwen2Model
(1): Pooling({'word_embedding_dimension': 1536, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
(2): Normalize()
)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("kenrogers/gte-ft-yt-2")
# Run inference
sentences = [
'1. What is meant by the terms "hidden state," "latent space," and "embedding space" in the context of reasoning models?\n2. How do the last hidden states function as input embeddings in a typical Chain of Thought reasoning model?',
"of the reasoning State when we say hidden state or latent space or embedding space or this sort of space of math and computation we're talking about the same space of course the the exact state of the space changes depending on where we are in the Transformer but let's take a look at the image from the paper in a typical Chain of Thought reasoning model we're going to ask a question we're going to generate some tokens and we're going to think kind of out loud we're we're going to let the chains of thought flow when you click into the Chain of Thought on 01 as we've seen before you can see sort of in the side panel the the steps it's thinking through now conversely to actually thinking out loud we have here that the last hidden states are used as input embeddings okay well what does this mean well it let's go back to our gpt2 style diagram and think about this the input embeddings here are where we're essentially looping back to so what we do is we kind of loop back before we generate",
"impact of this kind of approach on test time compute scaling some of the working hypotheses and some of the things people are interested in in looking out there on the llm edge for as we continue to see the field progress I want to demonstrate both approaches and check out the new coconut Library as well so how we're going to go through this is we're going to essentially introduce this idea of reasoning and latent space then we're going to talk about the scaling part of this before we dig into the specific approaches and we get the demo on both approaches by the end so it should be a lot of fun today let's go ahead and dig in reasoning in latent space let's root ourselves first in some definitions when we talk about reasoning we're talking about the action of thinking about something and it's kind of funny in a logical way if you look up logic it uses the word reason and there we are caught in a loop but reasoning is about thinking latent space is about using a representation of our",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1536]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
InformationRetrievalEvaluator| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.875 |
| cosine_accuracy@3 | 1.0 |
| cosine_accuracy@5 | 1.0 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.875 |
| cosine_precision@3 | 0.3333 |
| cosine_precision@5 | 0.2 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.875 |
| cosine_recall@3 | 1.0 |
| cosine_recall@5 | 1.0 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.9539 |
| cosine_mrr@10 | 0.9375 |
| cosine_map@100 | 0.9375 |
sentence_0 and sentence_1| sentence_0 | sentence_1 | |
|---|---|---|
| type | string | string |
| details |
|
|
| sentence_0 | sentence_1 |
|---|---|
1. What are the two big ideas aimed at scaling the power of LLMs during inference mentioned in the context? |
okay whiz we're talking about reasoning in latent space today is that the same as test time compute yeah that's right nice nice okay and we've got two big ideas to cover that are aimed at scaling the power of llms during inference is that right that yeah that's right so we have we have two you know latent space methods uh we have our continuous Chain of Thought or coconut right and then we have our more more directly more uh you know uh budget forcing recurrent depth uh model yes man that's a lot so when we look across both of those there appears to be a pretty simple explanation it's almost like uh you know when we when we're in that sort of thinking space of computation we don't have to do the thinky thinky in words and that's better maybe even it will allow us to find a new scaling axis is that right yeah that's exactly right I mean the idea is that we have this uh you know we we have this way of taking advantage of of uh the most powerful thinking space in the Transformer and not |
1. What are the two big ideas aimed at scaling the power of LLMs during inference mentioned in the context? |
okay whiz we're talking about reasoning in latent space today is that the same as test time compute yeah that's right nice nice okay and we've got two big ideas to cover that are aimed at scaling the power of llms during inference is that right that yeah that's right so we have we have two you know latent space methods uh we have our continuous Chain of Thought or coconut right and then we have our more more directly more uh you know uh budget forcing recurrent depth uh model yes man that's a lot so when we look across both of those there appears to be a pretty simple explanation it's almost like uh you know when we when we're in that sort of thinking space of computation we don't have to do the thinky thinky in words and that's better maybe even it will allow us to find a new scaling axis is that right yeah that's exactly right I mean the idea is that we have this uh you know we we have this way of taking advantage of of uh the most powerful thinking space in the Transformer and not |
1. What is the significance of staying in the "mind Palace" of the Transformer instead of resolving back to token space? |
is that right yeah that's exactly right I mean the idea is that we have this uh you know we we have this way of taking advantage of of uh the most powerful thinking space in the Transformer and not just like for a second right not automatically resolving back to token space but kind of staying in this very like uh you know in in the mind Palace of the of the Transformer without having to write down the words yes okay okay okay so basically scaling is dead Long Live scaling something like that yeah scaling has died uh we should scale yeah all right all right all right well I'm pumped for the demos today we're going to see some thinking in latent space let's cover all the Concepts we need to get there we'll get you back in for some discussions along the way because this one's pretty meta thanks whiz all right guys we are gonna rock out on large reasoning models today while we were originally going to just cover chain of continuous thought or coconut we saw a paper come out a couple |
MatryoshkaLoss with these parameters:{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
eval_strategy: stepsper_device_train_batch_size: 10per_device_eval_batch_size: 10num_train_epochs: 10multi_dataset_batch_sampler: round_robinoverwrite_output_dir: Falsedo_predict: Falseeval_strategy: stepsprediction_loss_only: Trueper_device_train_batch_size: 10per_device_eval_batch_size: 10per_gpu_train_batch_size: Noneper_gpu_eval_batch_size: Nonegradient_accumulation_steps: 1eval_accumulation_steps: Nonetorch_empty_cache_steps: Nonelearning_rate: 5e-05weight_decay: 0.0adam_beta1: 0.9adam_beta2: 0.999adam_epsilon: 1e-08max_grad_norm: 1num_train_epochs: 10max_steps: -1lr_scheduler_type: linearlr_scheduler_kwargs: {}warmup_ratio: 0.0warmup_steps: 0log_level: passivelog_level_replica: warninglog_on_each_node: Truelogging_nan_inf_filter: Truesave_safetensors: Truesave_on_each_node: Falsesave_only_model: Falserestore_callback_states_from_checkpoint: Falseno_cuda: Falseuse_cpu: Falseuse_mps_device: Falseseed: 42data_seed: Nonejit_mode_eval: Falseuse_ipex: Falsebf16: Falsefp16: Falsefp16_opt_level: O1half_precision_backend: autobf16_full_eval: Falsefp16_full_eval: Falsetf32: Nonelocal_rank: 0ddp_backend: Nonetpu_num_cores: Nonetpu_metrics_debug: Falsedebug: []dataloader_drop_last: Falsedataloader_num_workers: 0dataloader_prefetch_factor: Nonepast_index: -1disable_tqdm: Falseremove_unused_columns: Truelabel_names: Noneload_best_model_at_end: Falseignore_data_skip: Falsefsdp: []fsdp_min_num_params: 0fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}fsdp_transformer_layer_cls_to_wrap: Noneaccelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}deepspeed: Nonelabel_smoothing_factor: 0.0optim: adamw_torchoptim_args: Noneadafactor: Falsegroup_by_length: Falselength_column_name: lengthddp_find_unused_parameters: Noneddp_bucket_cap_mb: Noneddp_broadcast_buffers: Falsedataloader_pin_memory: Truedataloader_persistent_workers: Falseskip_memory_metrics: Trueuse_legacy_prediction_loop: Falsepush_to_hub: Falseresume_from_checkpoint: Nonehub_model_id: Nonehub_strategy: every_savehub_private_repo: Nonehub_always_push: Falsegradient_checkpointing: Falsegradient_checkpointing_kwargs: Noneinclude_inputs_for_metrics: Falseinclude_for_metrics: []eval_do_concat_batches: Truefp16_backend: autopush_to_hub_model_id: Nonepush_to_hub_organization: Nonemp_parameters: auto_find_batch_size: Falsefull_determinism: Falsetorchdynamo: Noneray_scope: lastddp_timeout: 1800torch_compile: Falsetorch_compile_backend: Nonetorch_compile_mode: Nonedispatch_batches: Nonesplit_batches: Noneinclude_tokens_per_second: Falseinclude_num_input_tokens_seen: Falseneftune_noise_alpha: Noneoptim_target_modules: Nonebatch_eval_metrics: Falseeval_on_start: Falseuse_liger_kernel: Falseeval_use_gather_object: Falseaverage_tokens_across_devices: Falseprompts: Nonebatch_sampler: batch_samplermulti_dataset_batch_sampler: round_robin| Epoch | Step | cosine_ndcg@10 |
|---|---|---|
| 1.0 | 10 | 0.9539 |
| 2.0 | 20 | 0.9077 |
| 3.0 | 30 | 0.9539 |
| 4.0 | 40 | 0.9539 |
| 5.0 | 50 | 0.9539 |
| 6.0 | 60 | 0.9539 |
| 7.0 | 70 | 0.9539 |
| 8.0 | 80 | 0.9539 |
| 9.0 | 90 | 0.9539 |
| 10.0 | 100 | 0.9539 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
Alibaba-NLP/gte-Qwen2-1.5B-instruct