Model Card for Indus SDE Sentence Transformer (Binary / Quantized)
Model Name: nasa-impact/indus-sde-st-equat-v0.1
This is a quantization-aware trained version of the Indus SDE Sentence Transformer v0.2. It was fine-tuned specifically to generate binary embeddings (1-bit per dimension) while maintaining high semantic retrieval performance.
This model allows for significant reductions in storage (up to 32x compression when packed) and faster retrieval speeds using Hamming distance, making it ideal for large-scale scientific information retrieval.
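The storage arithmetic behind the 32x figure is straightforward; a quick sketch (the 768-dimensional size is an assumption for illustration — check `model.get_sentence_embedding_dimension()` for the actual value):

```python
# Storage per vector: float32 embeddings vs. bit-packed binary embeddings.
# The 768-dim figure is an assumption, not confirmed by this model card.
dim = 768
float_bytes = dim * 4    # float32: 4 bytes per dimension -> 3072 bytes
packed_bytes = dim // 8  # binary, packed: 1 bit per dimension -> 96 bytes
print(float_bytes // packed_bytes)  # -> 32
```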
Usage
You can use this model with the sentence-transformers library.
Note: Although this model was trained to generate binary embeddings, it outputs continuous (float) embeddings by default. To obtain the binary representation (values of -1.0 and 1.0), apply the sign function to the output.
1. Install Library
```bash
pip install -U sentence-transformers
```
2. Load and Generate Binary Embeddings
```python
from sentence_transformers import SentenceTransformer
import torch
import os

# 1. Load the model
model = SentenceTransformer(
    "nasa-impact/indus-sde-st-equat-v0.1",
    token=os.getenv("HUGGINGFACE_TOKEN"),
)

sentences = [
    "The Navier-Stokes equations describe fluid motion.",
    "Photosynthesis converts light energy into chemical energy.",
]

# 2. Get standard float embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# 3. Apply the sign function to get unpacked binary embeddings (-1.0 and 1.0)
binary_embeddings = torch.sign(embeddings)
```
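To actually realize the 32x storage savings, the ±1 values can be packed 8-per-byte before indexing. A sketch using NumPy's `packbits` (the random tensor and 768-dimensional size below are placeholders standing in for real model output):

```python
import numpy as np
import torch

# Placeholder for torch.sign(model.encode(...)) output: values in {-1.0, 1.0}
binary_embeddings = torch.sign(torch.randn(2, 768))

# Map {-1, 1} -> {0, 1}, then pack 8 dimensions into each uint8 byte
bits = (binary_embeddings > 0).cpu().numpy().astype(np.uint8)
packed = np.packbits(bits, axis=-1)
print(packed.shape)  # (2, 96): 96 bytes per vector vs. 3072 for float32
```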
3. Semantic Search with Hamming Distance
For binary embeddings, Hamming Distance (counting the number of differing bits) is the standard metric for similarity.
- Lower Hamming Distance = Higher Similarity.
- Higher Hamming Distance = Lower Similarity.
```python
# Example: Retrieval
query = "Fluid dynamics and momentum"
docs = [
    "The Navier-Stokes equations describe fluid motion.",
    "Photosynthesis converts light energy into chemical energy.",
    "Momentum is conserved in closed systems.",
]

# 1. Encode query and docs to binary
query_emb = torch.sign(model.encode(query, convert_to_tensor=True))
doc_embs = torch.sign(model.encode(docs, convert_to_tensor=True))

# 2. Calculate the Hamming distance (count positions where the bits differ).
# Since the values are -1 and 1, an inequality check suffices.
hamming_distances = (query_emb != doc_embs).sum(dim=1).float()

# 3. Sort results (lower distance is better). Convert distance to a
# similarity score for ranking: score = (dimensions - distance) / dimensions
dim = query_emb.shape[0]
similarity_scores = (dim - hamming_distances) / dim

print("Query:", query)
for i, score in enumerate(similarity_scores):
    print(f"Doc: {docs[i]}")
    print(f"  Hamming Distance: {int(hamming_distances[i])}")
    print(f"  Similarity Score: {score:.4f}\n")
```
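At scale, the inequality check above is usually replaced by XOR plus a population count on the packed byte representation, which is where the speed advantage of Hamming distance comes from. A self-contained sketch with random bits standing in for real binarized embeddings (names like `query_bits` are illustrative, not part of the library API):

```python
import numpy as np

# Hypothetical packed index: 1 bit per dimension, 8 dimensions per byte.
# Random bits stand in for real binarized embeddings.
rng = np.random.default_rng(0)
dim = 768
query_bits = rng.integers(0, 2, dim, dtype=np.uint8)
doc_bits = rng.integers(0, 2, (3, dim), dtype=np.uint8)

q_packed = np.packbits(query_bits)         # shape (96,)
d_packed = np.packbits(doc_bits, axis=-1)  # shape (3, 96)

# XOR marks differing bits; counting set bits yields the Hamming distance.
xor = np.bitwise_xor(q_packed, d_packed)
hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)
print(hamming)  # one distance per document
```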
Training Details
This model was trained using the same comprehensive scientific dataset as its base model (indus-sde-st-v0.2). The training process involved adapting the base model to a binarization objective, ensuring that the sign of the embedding dimensions captures the semantic meaning effectively.
The primary objective was to retain the broad linguistic foundation and scientific specialization of the base model while enabling extreme compression.
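One common way to train through a non-differentiable sign operation is a straight-through estimator; the sketch below illustrates that generic trick only and is not the documented objective used for this model:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign in the forward pass; straight-through (identity) gradient backward.

    Generic illustration of a binarization-aware training trick, NOT the
    actual objective used to train this model.
    """

    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass gradients through the non-differentiable sign unchanged
        return grad_output

x = torch.randn(4, 8, requires_grad=True)
binary = BinarizeSTE.apply(x)
binary.sum().backward()  # gradients flow as if sign were the identity
```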
Dataset Table
The model was trained on the following scientific corpora:
| Dataset Name | Data Points | Type | Link |
|---|---|---|---|
| S2ORC_title_abstract | ~41.8M | Title-Body | Link |
| S2ORC_abstract_citation | ~39.6M | Body-Body | Link |
| S2ORC_title_citation | ~51M | Title-Title | Link |
| arxiv_title_abstract | ~2.7M | Title-Body | Link |
| PubMed | ~24M | Title-Body | Link |
| specter | ~684K | Title-Body | Link |
| nasa_ads | ~2.66M | Title-Abstract | Link |
| SDE-synthesized | 177,486 | question-answer | Link |
| SDE-synthesized | 194,382 | search_terms-document | |
| CMR-natural | 53,974 | Title-Description | |
| PDS-natural | 9,832 | Title-Description | |
| CMR-synthesized | 796,097 | search_terms-document | |
| PDS-synthesized | 52,777 | search_terms-document | |
| Total | ~162.4M | | |
Evaluation
We evaluate the model on a variety of benchmark datasets to ensure the binarization process preserves accuracy. Benchmarks include:
- NASA SDE IR Benchmark v5
- NASA SMD IR Benchmark
Base Model
This model was fine-tuned from nasa-impact/indus-sde-st-v0.2.

