Model Card for Indus SDE Sentence Transformer (Binary / Quantized)

Model Name: nasa-impact/indus-sde-st-equat-v0.1

This is a quantization-aware trained version of the Indus SDE Sentence Transformer v0.2. It was fine-tuned specifically to generate binary embeddings (1 bit per dimension) while maintaining high semantic retrieval performance.

This model enables significant reductions in storage (up to 32x compression when the embeddings are packed) and faster retrieval via Hamming distance, making it well suited to large-scale scientific information retrieval.
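To see where the 32x figure comes from, here is a back-of-envelope sketch; the 768-dimensional size below is an assumption for illustration, not a confirmed property of this model:

# Rough arithmetic for the compression ratio, assuming a
# hypothetical 768-dimensional embedding
dim = 768
float32_bytes = dim * 4              # 3072 bytes/vector at 32 bits per dim
packed_bytes = dim // 8              # 96 bytes/vector at 1 bit per dim
print(float32_bytes / packed_bytes)  # 32.0, regardless of dim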

Usage

You can use this model with the sentence-transformers library.

Note: Although this model is trained for binarization, it outputs continuous (float) embeddings by default. To obtain the binary representation (values of -1.0 and 1.0), apply the sign function to the output.

1. Install Library

pip install -U sentence-transformers

2. Load and Generate Binary Embeddings

from sentence_transformers import SentenceTransformer
import torch
import os

# 1. Load the model
model = SentenceTransformer(
    "nasa-impact/indus-sde-st-equat-v0.1", 
    token=os.getenv("HUGGINGFACE_TOKEN")
)

sentences = [
    "The Navier-Stokes equations describe fluid motion.",
    "Photosynthesis converts light energy into chemical energy."
]

# 2. Get standard float embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# 3. Apply sign function to get unpacked binary embeddings (-1.0 and 1.0)
binary_embeddings = torch.sign(embeddings)
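
Alternatively, recent releases of sentence-transformers (2.6+, an assumption worth verifying against your installed version) ship a quantize_embeddings helper that packs binary embeddings directly:

from sentence_transformers.quantization import quantize_embeddings

# "ubinary" packs 8 dimensions into each uint8 byte,
# ready for Hamming-distance search
packed = quantize_embeddings(embeddings.cpu().numpy(), precision="ubinary")
print(packed.shape)  # (2, embedding_dim // 8)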

3. Semantic Search with Hamming Distance

For binary embeddings, Hamming Distance (counting the number of differing bits) is the standard metric for similarity.

  • Lower Hamming Distance = Higher Similarity.
  • Higher Hamming Distance = Lower Similarity.
# Example: Retrieval
query = "Fluid dynamics and momentum"
docs = [
    "The Navier-Stokes equations describe fluid motion.",
    "Photosynthesis converts light energy into chemical energy.",
    "Momentum is conserved in closed systems."
]

# 1. Encode Query and Docs to Binary
query_emb = torch.sign(model.encode(query, convert_to_tensor=True))
doc_embs = torch.sign(model.encode(docs, convert_to_tensor=True))

# 2. Calculate Hamming Distance
# (Count positions where the bits differ)
# Since we use -1 and 1, we can just check inequality.
hamming_distances = (query_emb != doc_embs).sum(dim=1).float()

# 3. Rank results (lower distance is better)
# We can convert distance to a "similarity score" for ranking:
# Score = (Dimensions - Distance) / Dimensions
dim = query_emb.shape[0]
similarity_scores = (dim - hamming_distances) / dim

# Sort document indices by ascending Hamming distance
ranking = torch.argsort(hamming_distances)

print("Query:", query)
for i in ranking.tolist():
    print(f"Doc: {docs[i]}")
    print(f"  Hamming Distance: {int(hamming_distances[i])}")
    print(f"  Similarity Score: {similarity_scores[i]:.4f}\n")

Training Details

This model was trained using the same comprehensive scientific dataset as its base model (indus-sde-st-v0.2). The training process involved adapting the base model to a binarization objective, ensuring that the sign of the embedding dimensions captures the semantic meaning effectively.

The primary objective was to retain the broad linguistic foundation and scientific specialization of the base model while enabling extreme compression.
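
Training through the non-differentiable sign() typically relies on a relaxation such as a straight-through estimator. The sketch below illustrates that general technique; it is an assumption about the approach, not a description of this model's exact training recipe:

import torch

# Straight-through estimator (STE): forward applies sign(), backward
# passes the gradient through unchanged. Illustrative only.
class SignSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

x = torch.randn(4, requires_grad=True)
y = SignSTE.apply(x)
y.sum().backward()
print(x.grad)  # all ones: gradients flowed through the sign()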

Dataset Table

The model was trained on the following scientific corpora:

| Dataset Name | Data Points | Type |
| --- | --- | --- |
| S2ORC_title_abstract | ~41.8M | Title-Body |
| S2ORC_abstract_citation | ~39.6M | Body-Body |
| S2ORC_title_citation | ~51M | Title-Title |
| arxiv_title_abstract | ~2.7M | Title-Body |
| PubMed | ~24M | Title-Body |
| specter | ~684K | Title-Body |
| nasa_ads | ~2.66M | Title-Abstract |
| SDE-synthesized | 177,486 | question-answer |
| SDE-synthesized | 194,382 | search_terms-document |
| CMR-natural | 53,974 | Title-Description |
| PDS-natural | 9,832 | Title-Description |
| CMR-synthesized | 796,097 | search_terms-document |
| PDS-synthesized | 52,777 | search_terms-document |
| **Total** | **~162.4M** | |

Evaluation

We evaluate the model on a variety of benchmark datasets to ensure the binarization process preserves accuracy. Benchmarks include:


  • NASA SDE IR Benchmark v5 (results figure not reproduced here)
  • Nano BEIR (results figure not reproduced here)
  • NASA SMD IR Benchmark (results figure not reproduced here)
