---
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- onnx
- sentence-similarity
- code-search
- on-device
pipeline_tag: feature-extraction
library_name: sentence-transformers
---

# nomic-codesearch-onnx (INT8 Quantized)

This model is a fine-tuned version of `nomic-ai/nomic-embed-text-v1.5` trained specifically for **semantic code search** on Python code snippets, then exported to ONNX and dynamically quantized to INT8 for efficient on-device execution (CPU/Mobile).

Quantization shrinks the model from **530 MB to 100 MB** (roughly a 5x reduction) while NDCG@10 drops only from ~0.71 to ~0.68 (see Metrics below), which makes it well suited to on-device deployment on Android, iOS, and other resource-constrained environments.

---

## Model Details

- **Base Model:** `nomic-ai/nomic-embed-text-v1.5` (137M parameters, 768-dimensional embeddings)
- **Fine-Tuning Dataset:** `code-search-net/code_search_net` (Python split). Trained on 50,000 positive `(docstring, function)` pairs using Multiple Negatives Ranking Loss (MNR).
- **Training Hardware:** Apple Silicon (M4, MPS backend)
- **Export Format:** ONNX (Opset 17)
- **Quantization:** Dynamic INT8 quantization (weights stored as `QInt8`; activations quantized on the fly at inference time)
- **Dimensions:** 768 (trained with Matryoshka Representation Learning, so embeddings can be truncated to as few as 256 dimensions)
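
For readers unfamiliar with Multiple Negatives Ranking Loss, the idea can be sketched in NumPy: within a batch, each docstring's paired function is the positive and every other function in the batch serves as a negative. This is an illustrative sketch, not the training code used for this model; the function name `mnr_loss` and the `scale=20.0` temperature are assumptions, since the card does not state the value used.

```python
import numpy as np

def mnr_loss(query_embs: np.ndarray, code_embs: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives loss: row i's positive is column i, all other columns are negatives.

    `scale` is an assumed temperature; the model card does not specify it.
    """
    # Cosine similarity matrix between every query and every code snippet, shape (batch, batch)
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    logits = scale * (q @ c.T)

    # Numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy where the target class for row i is column i (the diagonal)
    return float(-np.mean(np.diag(log_probs)))
```

The loss is near zero when each query is most similar to its own pair and grows when a query scores higher against another snippet in the batch.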

---

## Metrics

| Config | Size | Mean Cosine Drift | NDCG@10 (Code Search) |
|---|---|---|---|
| Baseline Model | 530 MB | 0.0 | ~0.48 |
| Fine-Tuned FP32 ONNX | 530 MB | 0.0 | **~0.71** |
| Fine-Tuned INT8 ONNX | 100 MB | ~0.07 | ~0.68 |
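
The card does not spell out how these two metrics were computed. A minimal sketch, assuming "cosine drift" means one minus the cosine similarity between the reference and quantized embeddings of the same input, and assuming each query has exactly one relevant snippet (as with `(docstring, function)` pairs), might look like this. The function names are hypothetical.

```python
import numpy as np

def mean_cosine_drift(ref: np.ndarray, test: np.ndarray) -> float:
    """Mean of (1 - cosine similarity) between matching rows of two embedding matrices."""
    ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    test_n = test / np.linalg.norm(test, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(ref_n * test_n, axis=1)))

def ndcg_at_k(scores: np.ndarray, relevant_idx: int, k: int = 10) -> float:
    """NDCG@k for a single query with exactly one relevant document.

    `scores` holds the query's similarity to every candidate; higher is better.
    """
    top_k = np.argsort(-scores)[:k]            # indices of the k best-scoring candidates
    hits = np.where(top_k == relevant_idx)[0]  # position of the relevant item, if present
    if hits.size == 0:
        return 0.0
    rank = int(hits[0])                        # 0-based rank within the top k
    # With one relevant item the ideal DCG is 1, so NDCG equals the discounted gain
    return float(1.0 / np.log2(rank + 2))
```

Averaging `ndcg_at_k` over all evaluation queries would give a corpus-level figure comparable in spirit to the NDCG@10 column.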

---

## Python Quickstart

To run semantic code search or generate embeddings locally using this ONNX model:

### 1. Install Dependencies
```bash
pip install onnxruntime transformers numpy
```

### 2. Run Inference
```python
import os
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and ONNX session
# Ensure config.json, tokenizer.json, vocab.txt, etc., are in the same directory
model_dir = "./"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
session = ort.InferenceSession(os.path.join(model_dir, "model_int8.onnx"))

def embed(texts: list[str], max_length: int = 512) -> np.ndarray:
    """Return L2-normalised sentence embeddings, shape (len(texts), 768)."""
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="np",
    )
    outputs = session.run(
        ["sentence_embedding"],
        {
            "input_ids": encoded["input_ids"].astype(np.int64),
            "attention_mask": encoded["attention_mask"].astype(np.int64),
        },
    )
    embeddings = outputs[0]  # (batch, 768)
    # L2 normalise so dot-product == cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, 1e-12)

# Embed query and snippets
snippets = [
    "def add(a, b): return a + b",
    "def binary_search(arr, target): ...",
    "SELECT * FROM users WHERE age > 18"
]
query = "function that sums two numbers"

query_emb = embed([query])
code_embs = embed(snippets)

# Calculate similarity (dot product of L2-normalized embeddings)
scores = (query_emb @ code_embs.T)[0]
for idx, score in enumerate(scores):
    print(f"[{score:.4f}] {snippets[idx]}")
```
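
Because the model details list Matryoshka support down to 256 dimensions, embeddings like those returned by `embed` above can in principle be shortened to save storage. The sketch below keeps the leading dimensions and re-normalises; the helper name `truncate_embeddings` is hypothetical, and whether any extra normalisation step is needed before truncation for this export is not stated in this card.

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` dimensions of each row and L2-normalise the result."""
    if not 0 < dim <= embeddings.shape[1]:
        raise ValueError(f"dim must be in 1..{embeddings.shape[1]}, got {dim}")
    truncated = embeddings[:, :dim]
    # Re-normalise so dot-product still equals cosine similarity after truncation
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-12)
```

Queries and snippets must be truncated to the same `dim` before comparing them, and retrieval quality at smaller sizes should be checked on your own data.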

---

## On-Device Deployment (Android)

This model has been successfully deployed inside a native Android application using:
1. **ONNX Runtime Android AAR** (`com.microsoft.onnxruntime:onnxruntime-android`) for CPU inference.
2. **Custom WordPiece Tokenizer in Kotlin** (`BertTokenizer.kt`) to tokenize strings directly on-device, with no Python dependency.
3. **Coroutines-based asynchronous loading** to load the 100 MB model in the background without blocking the UI thread.

For complete Android source files (MainActivity, OnnxEmbedder, and BertTokenizer), please refer to the GitHub repository: [CoderOMaster/nomic-codesearch-android](https://github.com/CoderOMaster/nomic-codesearch-android).
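
To illustrate what a WordPiece tokenizer has to do on-device, here is a minimal Python sketch of the standard greedy longest-match-first algorithm for a single word. It is not the code in `BertTokenizer.kt`; the function name, the `max_chars` cutoff, and the toy vocabulary in the usage note are assumptions for illustration. A full tokenizer would also handle lowercasing, whitespace and punctuation splitting, and special tokens.

```python
def wordpiece_tokenize(
    word: str,
    vocab: set[str],
    unk_token: str = "[UNK]",
    max_chars: int = 100,
) -> list[str]:
    """Split one word into WordPiece sub-tokens using greedy longest-match-first."""
    if len(word) > max_chars:
        return [unk_token]

    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is in the vocabulary: the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces
```

With a toy vocabulary `{"un", "##aff", "##able"}`, the word `"unaffable"` splits into `["un", "##aff", "##able"]`; on a device, the vocabulary would be loaded from the model's `vocab.txt`.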

---

## License

This project is licensed under the Apache 2.0 License.