SentenceTransformer based on BAAI/bge-m3

This is a sentence-transformers model finetuned from BAAI/bge-m3. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Finetuned bge-m3 (dense retrieval) on a QA dataset created from a WRI corpus file (chunk length 250, overlap 40, no titles). Questions were generated with the QGEN method (https://github.com/UKPLab/gpl/tree/main), two questions per chunk. The loss function was MultipleNegativesRankingLoss (MNRL) without hard negatives; see the training details below.

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-m3
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
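The Pooling and Normalize modules above are simple operations: CLS-token pooling keeps only the first token's embedding, and Normalize scales it to unit length. A minimal numpy sketch with toy values (the real hidden dimension is 1024; 4 is used here only to keep the example small):

```python
import numpy as np

# Toy stand-in for the transformer output: (seq_len, hidden_dim) token embeddings.
token_embeddings = np.array([
    [1.0, 2.0, 2.0, 0.0],   # [CLS] token
    [0.5, 0.5, 0.5, 0.5],
    [3.0, 0.0, 0.0, 0.0],
])

# (1) Pooling with pooling_mode_cls_token=True keeps only the first token.
cls_embedding = token_embeddings[0]

# (2) Normalize() divides by the L2 norm, so cosine similarity between
#     sentence embeddings reduces to a plain dot product.
sentence_embedding = cls_embedding / np.linalg.norm(cls_embedding)

print(sentence_embedding)                  # unit-length vector
print(np.linalg.norm(sentence_embedding))  # 1.0
```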

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("collaborativeearth/bge-m3_wri_notitles")
# Run inference
sentences = [
    'what to do about climate change in the meat industry',
    '1.  Calculate the scope 3 GHG emissions baseline of food purchases, including meat. Establishing a scope 3 GHG emissions baseline for food purchases will allow companies to understand how much of an impact meat has on their food-related carbon footprint and enable them to pinpoint emissions hot spots.\n\n2.  Shift from high-emissions products like beef and lamb toward lower-emissions products like plant-based foods and alternative proteins. This type of shift is a triple win for climate, nature, and animal welfare.\n\n3.  Define priorities around improved meat sourcing by product type. For example, around beef, the goal might be to reduce climate and land impacts—both through sourcing less of it, and through encouraging lower-emissions production methods. For chicken and eggs, the goal might be to improve animal welfare, promote responsible antibiotic use, and minimize water pollution.',
    'We also conducted t-tests to determine the statistical significance of the above findings. For these t-tests, our null hypothesis was that there would be no difference between the conventional and alternative production systems, while the alternative hypothesis was that the alternative production systems would have mostly higher environmental impacts than the conventional systems. We conducted these tests using the paired data points for beef, lamb, dairy, pork, poultry, and eggs, for both GHG emissions and land use. (There were not enough data for water pollution and water use to conduct t-tests.) The GHG emissions results were statistically significant for beef, poultry, and eggs, with a p value <0.05. The land use results were statistically significant for beef, dairy, pork, poultry, and eggs, with a p value <0.05. Overall, the fact that the majority of these results, for GHG emissions and land use, were statistically significant reinforces the findings that alternative production systems generally have higher environmental impacts than conventional systems. There were not enough data for water pollution and water use, so the statistical significance of the water-related results could not be determined.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
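For retrieval, the typical pattern is to score one query embedding against many document embeddings and rank by similarity. Because the model L2-normalizes its outputs, cosine similarity is just a dot product. A sketch with hypothetical toy vectors standing in for `model.encode()` output (real vectors are 1024-dimensional):

```python
import numpy as np

# Hypothetical, already-normalized embeddings: one query, three documents.
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.8, 0.6, 0.0],   # doc 0
    [0.0, 1.0, 0.0],   # doc 1
    [1.0, 0.0, 0.0],   # doc 2
])

# With unit-length embeddings, cosine similarity is the dot product;
# model.similarity() computes the same scores.
scores = docs @ query

# Rank documents by score, best first.
ranking = np.argsort(-scores)
print(ranking)  # [2 0 1]
```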

Evaluation

Metrics

Information Retrieval

Metric                 Value
cosine_accuracy@1      0.3449
cosine_accuracy@3      0.5416
cosine_accuracy@5      0.6198
cosine_accuracy@10     0.7198
cosine_precision@1     0.3449
cosine_precision@3     0.1805
cosine_precision@5     0.124
cosine_precision@10    0.072
cosine_recall@1        0.3449
cosine_recall@3        0.5416
cosine_recall@5        0.6198
cosine_recall@10       0.7198
cosine_ndcg@10         0.5246
cosine_mrr@10          0.463
cosine_map@100         0.4721
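In this evaluation each query has a single relevant document, which is why recall@k equals accuracy@k in the table. These metrics can be computed directly from the rank at which each query's correct answer was retrieved; a sketch with toy ranks (not the actual evaluation data):

```python
# 0-based rank of each query's single relevant document (toy values).
ranks = [0, 2, 1, 15]  # the fourth answer falls outside the top 10

def accuracy_at_k(ranks, k):
    # Fraction of queries whose relevant document appears in the top k.
    # With one relevant document per query this equals recall@k.
    return sum(r < k for r in ranks) / len(ranks)

def mrr_at_k(ranks, k):
    # Mean reciprocal rank, counting 0 when the answer is outside the top k.
    return sum(1.0 / (r + 1) if r < k else 0.0 for r in ranks) / len(ranks)

print(accuracy_at_k(ranks, 1))   # 0.25
print(accuracy_at_k(ranks, 10))  # 0.75
print(mrr_at_k(ranks, 10))       # (1 + 1/3 + 1/2 + 0) / 4 ≈ 0.458
```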

Training Details

Training Dataset

Unnamed Dataset

  • Size: 82,191 training samples
  • Columns: question and answer
  • Approximate statistics based on the first 1000 samples:
    • question: string; min: 5 tokens, mean: 10.69 tokens, max: 36 tokens
    • answer: string; min: 40 tokens, mean: 217.15 tokens, max: 334 tokens
  • Samples:

    Question: what countries are affected by landscape restoration?
    Answer: The Economic Case for Landscape Restoration in Latin America

    THE ECONOMIC CASE FOR LANDSCAPE RESTORATION IN LATIN AMERICA

    WALTER VERGARA, LUCIANA GALLARDO LOMELI, ANA R. RIOS, PAUL ISBELL, STEVEN PRAGER, RONNIE DE CAMINO

    Land use and land-use change are central to the economic and social fabric of Latin America and the Caribbean, and essential to the region’s prospects for sustainable development. Countries are realizing that now, more than ever, is the time for action. Eleven countries, three Brazilian states and several regional programs have already committed to restoring more than 27 million hectares of degraded land in Latin America—but can these ambitions become a reality while supporting good living standards and economic development?

    Question: how many countries in latin america are trying to restore landscapes
    Answer: The Economic Case for Landscape Restoration in Latin America

    THE ECONOMIC CASE FOR LANDSCAPE RESTORATION IN LATIN AMERICA

    WALTER VERGARA, LUCIANA GALLARDO LOMELI, ANA R. RIOS, PAUL ISBELL, STEVEN PRAGER, RONNIE DE CAMINO

    Land use and land-use change are central to the economic and social fabric of Latin America and the Caribbean, and essential to the region’s prospects for sustainable development. Countries are realizing that now, more than ever, is the time for action. Eleven countries, three Brazilian states and several regional programs have already committed to restoring more than 27 million hectares of degraded land in Latin America—but can these ambitions become a reality while supporting good living standards and economic development?

    Question: what percent of land is deforested
    Answer: Agriculture and forestry exports from Latin America represent about 13 percent of the global trade of food, feed, and fiber and account for a majority of employment outside large urban areas—numbers only expected to grow as Latin America is called upon to meet an increasing global demand for food. Yet, since the turn of the century, about 37 million hectares of natural forests, savannas and wetlands have been transformed to expand agriculture. Cumulative, unsustainable land-use practices have led to the degradation of about 300 million hectares, resulting in a reduction in yields and quality of production, and in losses in biomass content, soil quality, surface water hydrology, and biodiversity. Deforestation, land-use change, and unsustainable agricultural activities are also currently the largest drivers of climate change in the region, accounting for 56 percent of all greenhouse gas emissions. Today, while some progress has been achieved, the rate of deforestation remains high at an...
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
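MNRL treats each question's paired answer as the positive and every other answer in the batch as an in-batch negative, applying softmax cross-entropy over scaled cosine similarities. A minimal numpy sketch of the objective (not the library implementation), assuming unit-normalized embeddings and toy identity-matrix inputs:

```python
import numpy as np

def mnrl_loss(q_emb, a_emb, scale=20.0):
    """MultipleNegativesRankingLoss on a batch of unit-normalized
    question/answer embeddings: row i's positive is answer i; all
    other answers in the batch serve as in-batch negatives."""
    # Cosine similarity matrix (embeddings are unit length), scaled by 20.0
    # as in the parameters above.
    logits = scale * (q_emb @ a_emb.T)
    # Softmax cross-entropy with the diagonal (matched pairs) as labels.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch of 3 orthogonal unit vectors: each question exactly matches
# its own answer, so the loss is close to zero.
print(mnrl_loss(np.eye(3), np.eye(3)))
```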
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • learning_rate: 1e-05
  • num_train_epochs: 2
  • warmup_ratio: 0.1
  • fp16: True
  • gradient_checkpointing: True
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: True
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss ir-eval_cosine_ndcg@10
-1 -1 - 0.4718
0.0389 100 0.5021 -
0.0779 200 0.2574 -
0.1168 300 0.2008 -
0.1557 400 0.182 -
0.1946 500 0.1673 0.5134
0.2336 600 0.1488 -
0.2725 700 0.1582 -
0.3114 800 0.1662 -
0.3503 900 0.1642 -
0.3893 1000 0.1522 0.5107
0.4282 1100 0.1448 -
0.4671 1200 0.1525 -
0.5060 1300 0.1354 -
0.5450 1400 0.1437 -
0.5839 1500 0.1403 0.5172
0.6228 1600 0.1355 -
0.6617 1700 0.1459 -
0.7007 1800 0.1498 -
0.7396 1900 0.1221 -
0.7785 2000 0.1311 0.5201
0.8174 2100 0.1263 -
0.8564 2200 0.126 -
0.8953 2300 0.1111 -
0.9342 2400 0.1394 -
0.9731 2500 0.1188 0.5228
1.0121 2600 0.1267 -
1.0510 2700 0.0999 -
1.0899 2800 0.0911 -
1.1288 2900 0.0803 -
1.1678 3000 0.095 0.5255
1.2067 3100 0.0933 -
1.2456 3200 0.0909 -
1.2845 3300 0.093 -
1.3235 3400 0.0895 -
1.3624 3500 0.0872 0.5191
1.4013 3600 0.0914 -
1.4402 3700 0.0901 -
1.4792 3800 0.0832 -
1.5181 3900 0.0867 -
1.5570 4000 0.078 0.5250
1.5960 4100 0.0799 -
1.6349 4200 0.0871 -
1.6738 4300 0.0837 -
1.7127 4400 0.0911 -
1.7517 4500 0.0783 0.5248
1.7906 4600 0.0749 -
1.8295 4700 0.097 -
1.8684 4800 0.0865 -
1.9074 4900 0.0849 -
1.9463 5000 0.0937 0.5246
1.9852 5100 0.0839 -

Framework Versions

  • Python: 3.11.12
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.6.0
  • Datasets: 2.14.4
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}