---
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- onnx
- sentence-similarity
- code-search
- on-device
pipeline_tag: feature-extraction
library_name: sentence-transformers
---

# nomic-codesearch-onnx (INT8 Quantized)

This model is a fine-tuned version of `nomic-ai/nomic-embed-text-v1.5`, trained specifically for **semantic code search** on Python code snippets, then exported to ONNX and dynamically quantized to INT8 for efficient on-device execution (CPU/mobile).

Quantization shrinks the model from **530 MB to 100 MB** (a ~5x reduction) while NDCG@10 on code search drops only from ~0.71 to ~0.68, which makes it well suited to on-device deployment on Android, iOS, or other resource-constrained environments.

---

## Model Details

- **Base Model:** `nomic-ai/nomic-embed-text-v1.5` (137M parameters, 768-dimensional embeddings)
- **Fine-Tuning Dataset:** `code-search-net/code_search_net` (Python split). Trained on 50,000 positive `(docstring, function)` pairs using Multiple Negatives Ranking Loss (MNR).
- **Training Acceleration:** Apple Silicon (M4 MPS)
- **Export Format:** ONNX (Opset 17)
- **Quantization:** Dynamic INT8 quantization (weights stored as `QInt8`; activations quantized at runtime)
- **Dimensions:** 768 (Matryoshka Representation Learning allows truncation down to 256 dimensions)
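
Multiple Negatives Ranking Loss treats, for each docstring in a batch, its paired function as the positive and every other function in the same batch as a negative. The block below is a NumPy illustration of that idea, not the training code used for this model; the function name `mnr_loss` and the `scale` value of 20 are assumptions chosen for illustration.

```python
import numpy as np


def mnr_loss(query_embs: np.ndarray, code_embs: np.ndarray, scale: float = 20.0) -> float:
    """In-batch softmax cross-entropy over scaled cosine similarities.

    Row i of query_embs is paired with row i of code_embs; every other
    row of code_embs acts as a negative for query i.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    logits = scale * (q @ c.T)                        # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for query i is column i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Synthetic stand-ins for docstring and function embeddings.
rng = np.random.default_rng(0)
codes = rng.normal(size=(4, 8))
aligned = mnr_loss(codes + 0.01 * rng.normal(size=(4, 8)), codes)
shuffled = mnr_loss(codes[::-1], codes)  # every query paired with the wrong function
print(f"aligned={aligned:.4f} shuffled={shuffled:.4f}")
```

The loss is low when each query sits closest to its own function and rises when the pairing is scrambled, which is what pushes matching docstrings and functions together during fine-tuning.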

---

## Metrics

| Config | Size | Mean Cosine Drift | NDCG@10 (Code Search) |
|---|---|---|---|
| Baseline Model | 530 MB | 0.0 | ~0.48 |
| Fine-Tuned FP32 ONNX | 530 MB | 0.0 | **~0.71** |
| Fine-Tuned INT8 ONNX | 100 MB | ~0.07 | ~0.68 |
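
The table does not spell out how the two metrics are computed. One plausible reading, sketched below, takes Mean Cosine Drift as the average of `1 - cosine(fp32, int8)` over paired embeddings, and NDCG@10 in its usual form for a single query. The drift definition and the helper names (`mean_cosine_drift`, `ndcg_at_k`) are assumptions, not the author's evaluation script.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / max(na * nb, 1e-12)


def mean_cosine_drift(fp32_embs, int8_embs) -> float:
    """Average of 1 - cos(fp32_i, int8_i) over paired embeddings (assumed definition)."""
    drifts = [1.0 - cosine(a, b) for a, b in zip(fp32_embs, int8_embs)]
    return sum(drifts) / len(drifts)


def ndcg_at_k(ranked_relevance: list[int], k: int = 10) -> float:
    """NDCG@k for one query; ranked_relevance[i] is the relevance of the result at rank i."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0


# Identical embeddings drift by 0; orthogonal ones drift by 1.
print(mean_cosine_drift([[1.0, 0.0]], [[1.0, 0.0]]))
print(mean_cosine_drift([[1.0, 0.0]], [[0.0, 1.0]]))

# One relevant snippet, returned at rank 1 versus rank 3.
print(ndcg_at_k([1, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 1]))  # 0.5
```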

---

## Python Quickstart

To run semantic code search or generate embeddings locally with this ONNX model:

### 1. Install Dependencies

```bash
pip install onnxruntime transformers numpy
```

### 2. Run Inference

```python
import os

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and ONNX session.
# config.json, tokenizer.json, vocab.txt, etc. must sit in the same directory as the model.
model_dir = "./"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
session = ort.InferenceSession(os.path.join(model_dir, "model_int8.onnx"))


def embed(texts: list[str], max_length: int = 512) -> np.ndarray:
    """Return L2-normalised sentence embeddings, shape (len(texts), 768)."""
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="np",
    )
    outputs = session.run(
        ["sentence_embedding"],
        {
            "input_ids": encoded["input_ids"].astype(np.int64),
            "attention_mask": encoded["attention_mask"].astype(np.int64),
        },
    )
    embeddings = outputs[0]  # (batch, 768)
    # L2 normalise so dot-product == cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, 1e-12)


# Embed query and snippets
snippets = [
    "def add(a, b): return a + b",
    "def binary_search(arr, target): ...",
    "SELECT * FROM users WHERE age > 18",
]
query = "function that sums two numbers"
query_emb = embed([query])
code_embs = embed(snippets)

# Calculate similarity (dot product of L2-normalized embeddings)
scores = (query_emb @ code_embs.T)[0]
for snippet, score in zip(snippets, scores):
    print(f"[{score:.4f}] {snippet}")
```
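
Model Details lists Matryoshka support down to 256 dimensions. A common way to use that is to keep the leading components of each embedding and re-normalise, so dot products remain cosine similarities. The sketch below uses random vectors in place of real `embed()` output; `truncate_embeddings` is a hypothetical helper, and the base model's documentation may prescribe additional preprocessing before truncation that is not shown here.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components of each row, then L2-normalise again."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-12)


# Stand-in for embed(...) output: 3 vectors of 768 dimensions.
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768)).astype(np.float32)
small = truncate_embeddings(full, dim=256)
print(small.shape)  # (3, 256)
```

Truncating to 256 dimensions cuts index storage to a third of the 768-dimensional size, usually at some cost in retrieval quality that is worth measuring on your own data.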

---

## On-Device Deployment (Android)

This model has been deployed inside a native Android application using:

1. **ONNX Runtime Android AAR** (`com.microsoft.onnxruntime:onnxruntime-android`) for CPU inference.
2. **Custom WordPiece tokenizer in Kotlin** (`BertTokenizer.kt`) to tokenize strings directly on-device, with no Python dependency.
3. **Coroutine-based asynchronous loading** to load the 100 MB model in the background without blocking the UI thread.
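
For readers unfamiliar with WordPiece, a tokenizer like the one in item 2 typically performs greedy longest-match-first lookup against `vocab.txt`, marking word-internal pieces with `##`. The sketch below shows that algorithm in Python with a toy vocabulary. It is not a port of `BertTokenizer.kt`; the vocabulary and the function name `wordpiece` are made up for illustration.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical); a real one is read from vocab.txt.
toy_vocab = {"embed", "##ding", "##s"}
print(wordpiece("embeddings", toy_vocab))  # ['embed', '##ding', '##s']
print(wordpiece("xyz", toy_vocab))         # ['[UNK]']
```

A full BERT-style tokenizer also lowercases, splits on whitespace and punctuation, and maps pieces to integer IDs before this step's output reaches the model.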

For complete Android source files (MainActivity, OnnxEmbedder, and BertTokenizer), please refer to the GitHub repository: [CoderOMaster/nomic-codesearch-android](https://github.com/CoderOMaster/nomic-codesearch-android).

---

## License

This project is licensed under the Apache 2.0 License.