---
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- onnx
- sentence-similarity
- code-search
- on-device
pipeline_tag: feature-extraction
library_name: sentence-transformers
---
# nomic-codesearch-onnx (INT8 Quantized)
This model is a fine-tuned version of `nomic-ai/nomic-embed-text-v1.5`, trained for **semantic code search** over Python code snippets, then exported to ONNX and dynamically quantized to INT8 for efficient CPU inference on desktop and mobile devices.
Quantization shrinks the model from **530 MB to 100 MB** (a ~5x reduction) at a small cost in retrieval quality (NDCG@10 of ~0.68 versus ~0.71 for the FP32 export), which makes it well suited to Android, iOS, and other resource-constrained environments.
---
## Model Details
- **Base Model:** `nomic-ai/nomic-embed-text-v1.5` (137M parameters, 768-dimensional embeddings)
- **Fine-Tuning Dataset:** `code-search-net/code_search_net` (Python split). Trained on 50,000 positive `(docstring, function)` pairs using Multiple Negatives Ranking Loss (MNR).
- **Training Hardware:** Apple Silicon M4 (MPS backend)
- **Export Format:** ONNX (Opset 17)
- **Quantization:** Dynamic INT8 quantization (weights stored as `QInt8`; activations quantized on the fly at inference time)
- **Dimensions:** 768 (Matryoshka-style truncation supported down to 256 dimensions)
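
The Matryoshka note above means an embedding can be shortened by keeping only its leading components. The following is a minimal sketch, assuming plain truncation followed by L2 re-normalization is adequate for this fine-tuned export (the base model's card also applies a layer-normalization step before truncating, which is omitted here). `truncate_embedding` is a hypothetical helper name, and it accepts any sequence of floats, including one row of a NumPy array.

```python
import math
from typing import Sequence


def truncate_embedding(vec: Sequence[float], dim: int = 256) -> list[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    head = [float(x) for x in vec[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    # Floor the norm so an all-zero prefix does not divide by zero.
    norm = max(norm, 1e-12)
    return [x / norm for x in head]


# Toy 4-dimensional vector truncated to 2 dimensions (illustrative values only).
short = truncate_embedding([3.0, 4.0, 1.0, 2.0], dim=2)
print(short)
```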
---
## Metrics
| Config | Size | Mean Cosine Drift | NDCG@10 (Code Search) |
|---|---|---|---|
| Baseline Model | 530 MB | 0.0 | ~0.48 |
| Fine-Tuned FP32 ONNX | 530 MB | 0.0 | **~0.71** |
| Fine-Tuned INT8 ONNX | 100 MB | ~0.07 | ~0.68 |
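
The table does not define its columns, so the sketch below rests on stated assumptions: "mean cosine drift" is taken to be the average of 1 - cosine similarity between the FP32 and INT8 embeddings of the same inputs, and NDCG@10 is computed per query with the standard log2 discount over relevance labels for all candidates in ranked order. The function names are hypothetical, and this is not necessarily the evaluation script behind the reported numbers.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, 1e-12)


def mean_cosine_drift(ref: Sequence[Sequence[float]],
                      quant: Sequence[Sequence[float]]) -> float:
    """Average of (1 - cosine) over paired reference/quantized embeddings."""
    if len(ref) != len(quant) or len(ref) == 0:
        raise ValueError("need two non-empty embedding lists of equal length")
    return sum(1.0 - cosine(r, q) for r, q in zip(ref, quant)) / len(ref)


def ndcg_at_k(ranked_relevance: Sequence[int], k: int = 10) -> float:
    """NDCG@k for one query, given relevance labels in ranked order."""
    def dcg(rels: Sequence[int]) -> float:
        # Rank i (0-based) is discounted by log2(i + 2).
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0


# Toy data: the first pair is identical, the second pair is orthogonal.
fp32 = [[1.0, 0.0], [0.0, 1.0]]
int8 = [[1.0, 0.0], [1.0, 0.0]]
drift = mean_cosine_drift(fp32, int8)

# One relevant result, returned at rank 2 out of 3 candidates.
score = ndcg_at_k([0, 1, 0])
print(drift, score)
```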
---
## Python Quickstart
To run semantic code search or generate embeddings locally using this ONNX model:
### 1. Install Dependencies
```bash
pip install onnxruntime transformers numpy
```
### 2. Run Inference
```python
import os

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and ONNX session
# Ensure config.json, tokenizer.json, vocab.txt, etc., are in the same directory
model_dir = "./"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
session = ort.InferenceSession(os.path.join(model_dir, "model_int8.onnx"))


def embed(texts: list[str], max_length: int = 512) -> np.ndarray:
    """Return L2-normalised sentence embeddings, shape (len(texts), 768)."""
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="np",
    )
    outputs = session.run(
        ["sentence_embedding"],
        {
            "input_ids": encoded["input_ids"].astype(np.int64),
            "attention_mask": encoded["attention_mask"].astype(np.int64),
        },
    )
    embeddings = outputs[0]  # (batch, 768)
    # L2 normalise so dot-product == cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, 1e-12)


# Embed query and snippets
snippets = [
    "def add(a, b): return a + b",
    "def binary_search(arr, target): ...",
    "SELECT * FROM users WHERE age > 18",
]
query = "function that sums two numbers"
query_emb = embed([query])
code_embs = embed(snippets)

# Calculate similarity (dot product of L2-normalized embeddings)
scores = (query_emb @ code_embs.T)[0]
for idx, score in enumerate(scores):
    print(f"[{score:.4f}] {snippets[idx]}")
```
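
In a search setting, usually only the best few matches matter. The sketch below ranks candidates by score and keeps the top `k`. It uses plain Python sequences so it runs without the model, but it would also accept the quickstart's NumPy `scores` array together with `snippets`. `top_k` is a hypothetical helper, and the scores shown are made-up placeholders, not model output.

```python
from typing import Sequence


def top_k(scores: Sequence[float], items: Sequence[str], k: int = 3) -> list[tuple[float, str]]:
    """Return up to k (score, item) pairs, highest score first."""
    if len(scores) != len(items):
        raise ValueError("scores and items must have the same length")
    paired = sorted(zip(scores, items), key=lambda pair: pair[0], reverse=True)
    return [(float(s), item) for s, item in paired[:k]]


# Placeholder scores for illustration only.
example_scores = [0.2, 0.9, 0.5]
example_items = ["snippet A", "snippet B", "snippet C"]
best = top_k(example_scores, example_items, k=2)
for score, item in best:
    print(f"[{score:.4f}] {item}")
```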
---
## On-Device Deployment (Android)
This model has been successfully deployed inside a native Android application using:
1. **ONNX Runtime Android AAR** (`com.microsoft.onnxruntime:onnxruntime-android`) for CPU inference.
2. **Custom WordPiece tokenizer in Kotlin** (`BertTokenizer.kt`) that tokenizes strings directly on-device, with no Python dependency.
3. **Coroutine-based asynchronous loading**, so the 100 MB model loads in the background without blocking the UI thread.
For complete Android source files (MainActivity, OnnxEmbedder, and BertTokenizer), please refer to the GitHub repository: [CoderOMaster/nomic-codesearch-android](https://github.com/CoderOMaster/nomic-codesearch-android).
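
For readers unfamiliar with WordPiece, the core step an on-device tokenizer has to reproduce is greedy longest-match-first subword splitting. The sketch below shows that algorithm in Python, to stay consistent with the quickstart, on a tiny made-up vocabulary. It is not a port of the repository's Kotlin code, and it omits lowercasing, punctuation splitting, and special tokens such as `[CLS]` and `[SEP]`. The `wordpiece` function name and `toy_vocab` are illustrative assumptions.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]", max_chars: int = 100) -> list[str]:
    """Split one whitespace-free word into WordPiece tokens (greedy longest match)."""
    if len(word) > max_chars:
        return [unk]
    tokens: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched, so the whole word maps to the unknown token
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary; a real one is read from vocab.txt.
toy_vocab = {"bin", "##ary", "search", "##es", "sum"}
pieces = wordpiece("binary", toy_vocab)
print(pieces)
```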
---
## License
This project is licensed under the Apache 2.0 License.