Model Card for Indus SDE Sentence Transformer (Binary / Quantized)
Model Name: nasa-impact/indus-sde-st-equat-v0.1
This is a quantization-aware trained version of the Indus SDE Sentence Transformer v0.2. It was fine-tuned specifically to generate binary embeddings (1-bit per dimension) while maintaining high semantic retrieval performance.
This model allows for significant reductions in storage (up to 32x compression when packed) and faster retrieval speeds using Hamming distance, making it ideal for large-scale scientific information retrieval.
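The storage arithmetic behind the 32x figure is straightforward; a quick sketch (the 768-dimensional size is an assumption for illustration — check `model.get_sentence_embedding_dimension()` for the actual value):

```python
# Storage per vector: float32 embeddings vs. bit-packed binary embeddings.
# The 768-dim figure is an assumption, not confirmed by this model card.
dim = 768
float_bytes = dim * 4    # float32: 4 bytes per dimension -> 3072 bytes
packed_bytes = dim // 8  # binary, packed: 1 bit per dimension -> 96 bytes
print(float_bytes // packed_bytes)  # -> 32
```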
Usage
You can use this model with the sentence-transformers library.
Note: Although this model was trained to generate binary embeddings, it outputs continuous (float) embeddings by default. To obtain the binary representation (values of -1.0 and 1.0), apply the sign function to the output.
1. Install Library
```bash
pip install -U sentence-transformers
```
2. Load and Generate Binary Embeddings
```python
from sentence_transformers import SentenceTransformer
import torch
import os

# 1. Load the model
model = SentenceTransformer(
    "nasa-impact/indus-sde-st-equat-v0.1",
    token=os.getenv("HUGGINGFACE_TOKEN"),
)

sentences = [
    "The Navier-Stokes equations describe fluid motion.",
    "Photosynthesis converts light energy into chemical energy.",
]

# 2. Get standard float embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# 3. Apply the sign function to get unpacked binary embeddings (-1.0 and 1.0)
binary_embeddings = torch.sign(embeddings)
```
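To actually realize the 32x storage savings, the ±1 values can be packed 8-per-byte before indexing. A sketch using NumPy's `packbits` (the random tensor and 768-dimensional size below are placeholders standing in for real model output):

```python
import numpy as np
import torch

# Placeholder for torch.sign(model.encode(...)) output: values in {-1.0, 1.0}
binary_embeddings = torch.sign(torch.randn(2, 768))

# Map {-1, 1} -> {0, 1}, then pack 8 dimensions into each uint8 byte
bits = (binary_embeddings > 0).cpu().numpy().astype(np.uint8)
packed = np.packbits(bits, axis=-1)
print(packed.shape)  # (2, 96): 96 bytes per vector vs. 3072 for float32
```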
3. Semantic Search with Hamming Distance
For binary embeddings, Hamming Distance (counting the number of differing bits) is the standard metric for similarity.
- Lower Hamming Distance = Higher Similarity.
- Higher Hamming Distance = Lower Similarity.
```python
# Example: Retrieval
query = "Fluid dynamics and momentum"
docs = [
    "The Navier-Stokes equations describe fluid motion.",
    "Photosynthesis converts light energy into chemical energy.",
    "Momentum is conserved in closed systems.",
]

# 1. Encode query and docs to binary
query_emb = torch.sign(model.encode(query, convert_to_tensor=True))
doc_embs = torch.sign(model.encode(docs, convert_to_tensor=True))

# 2. Calculate the Hamming distance (count positions where the bits differ).
# Since the values are -1 and 1, an inequality check suffices.
hamming_distances = (query_emb != doc_embs).sum(dim=1).float()

# 3. Sort results (lower distance is better). Convert distance to a
# similarity score for ranking: score = (dimensions - distance) / dimensions
dim = query_emb.shape[0]
similarity_scores = (dim - hamming_distances) / dim

print("Query:", query)
for i, score in enumerate(similarity_scores):
    print(f"Doc: {docs[i]}")
    print(f"  Hamming Distance: {int(hamming_distances[i])}")
    print(f"  Similarity Score: {score:.4f}\n")
```
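At scale, the inequality check above is usually replaced by XOR plus a population count on the packed byte representation, which is where the speed advantage of Hamming distance comes from. A self-contained sketch with random bits standing in for real binarized embeddings (names like `query_bits` are illustrative, not part of the library API):

```python
import numpy as np

# Hypothetical packed index: 1 bit per dimension, 8 dimensions per byte.
# Random bits stand in for real binarized embeddings.
rng = np.random.default_rng(0)
dim = 768
query_bits = rng.integers(0, 2, dim, dtype=np.uint8)
doc_bits = rng.integers(0, 2, (3, dim), dtype=np.uint8)

q_packed = np.packbits(query_bits)         # shape (96,)
d_packed = np.packbits(doc_bits, axis=-1)  # shape (3, 96)

# XOR marks differing bits; counting set bits yields the Hamming distance.
xor = np.bitwise_xor(q_packed, d_packed)
hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)
print(hamming)  # one distance per document
```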
Training Details
This model was trained using the same comprehensive scientific dataset as its base model (indus-sde-st-v0.2). The training process involved adapting the base model to a binarization objective, ensuring that the sign of the embedding dimensions captures the semantic meaning effectively.
The primary objective was to retain the broad linguistic foundation and scientific specialization of the base model while enabling extreme compression.
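One common way to train through a non-differentiable sign operation is a straight-through estimator; the sketch below illustrates that generic trick only and is not the documented objective used for this model:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign in the forward pass; straight-through (identity) gradient backward.

    Generic illustration of a binarization-aware training trick, NOT the
    actual objective used to train this model.
    """

    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass gradients through the non-differentiable sign unchanged
        return grad_output

x = torch.randn(4, 8, requires_grad=True)
binary = BinarizeSTE.apply(x)
binary.sum().backward()  # gradients flow as if sign were the identity
```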
Dataset Table
The model was trained on the following scientific corpora:
| Dataset Name | Data Points | Type | Link |
|---|---|---|---|
| S2ORC_title_abstract | ~41.8M | Title-Body | Link |
| S2ORC_abstract_citation | ~39.6M | Body-Body | Link |
| S2ORC_title_citation | ~51M | Title-Title | Link |
| arxiv_title_abstract | ~2.7M | Title-Body | Link |
| PubMed | ~24M | Title-Body | Link |
| specter | ~684K | Title-Body | Link |
| nasa_ads | ~2.66M | Title-Abstract | Link |
| SDE-synthesized | 177,486 | question-answer | Link |
| SDE-synthesized | 194,382 | search_terms-document | |
| CMR-natural | 53,974 | Title-Description | |
| PDS-natural | 9,832 | Title-Description | |
| CMR-synthesized | 796,097 | search_terms-document | |
| PDS-synthesized | 52,777 | search_terms-document | |
| Total | ~162.4M | | |
Evaluation
We evaluate the model on a variety of benchmark datasets to ensure the binarization process preserves accuracy. Benchmarks include:
- NASA SDE IR Benchmark v5
- NASA SMD IR Benchmark
Base Model
This model was fine-tuned from nasa-impact/indus-sde-st-v0.2.

