---
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- onnx
- sentence-similarity
- code-search
- on-device
pipeline_tag: feature-extraction
library_name: sentence-transformers
---

# nomic-codesearch-onnx (INT8 Quantized)

This model is a fine-tuned version of `nomic-ai/nomic-embed-text-v1.5`, trained specifically for **semantic code search** on Python code snippets, then exported to ONNX and dynamically quantized to INT8 for efficient on-device execution (CPU/mobile).

Quantization shrinks the model from **530 MB to 100 MB** (a ~5x reduction) while keeping retrieval quality close to the FP32 model (NDCG@10 of ~0.68 vs. ~0.71), which makes it well suited to Android, iOS, and other resource-constrained environments.

---

## Model Details

- **Base Model:** `nomic-ai/nomic-embed-text-v1.5` (137M parameters, 768-dimensional embeddings)
- **Fine-Tuning Dataset:** `code-search-net/code_search_net` (Python split). Trained on 50,000 positive `(docstring, function)` pairs using Multiple Negatives Ranking Loss (MNR).
- **Training Acceleration:** Apple Silicon (M4 MPS)
- **Export Format:** ONNX (Opset 17)
- **Quantization:** Dynamic INT8 quantization (weights quantized to `QInt8`; activations quantized dynamically at inference time)
- **Dimensions:** 768 (supports Matryoshka Representation Learning down to 256 dimensions)

---

## Metrics

| Config | Size | Mean Cosine Drift | NDCG@10 (Code Search) |
|---|---|---|---|
| Baseline Model | 530 MB | 0.0 | ~0.48 |
| Fine-Tuned FP32 ONNX | 530 MB | 0.0 | **~0.71** |
| Fine-Tuned INT8 ONNX | 100 MB | ~0.07 | ~0.68 |

---

## Python Quickstart

To run semantic code search or generate embeddings locally using this ONNX model:

### 1. Install Dependencies

```bash
pip install onnxruntime transformers numpy
```

### 2. Run Inference

```python
import os

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and ONNX session
# Ensure config.json, tokenizer.json, vocab.txt, etc., are in the same directory
model_dir = "./"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
session = ort.InferenceSession(os.path.join(model_dir, "model_int8.onnx"))


def embed(texts: list[str], max_length: int = 512) -> np.ndarray:
    """Return L2-normalised sentence embeddings, shape (len(texts), 768)."""
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="np",
    )
    outputs = session.run(
        ["sentence_embedding"],
        {
            "input_ids": encoded["input_ids"].astype(np.int64),
            "attention_mask": encoded["attention_mask"].astype(np.int64),
        },
    )
    embeddings = outputs[0]  # (batch, 768)
    # L2 normalise so dot-product == cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, 1e-12)


# Embed query and snippets
snippets = [
    "def add(a, b): return a + b",
    "def binary_search(arr, target): ...",
    "SELECT * FROM users WHERE age > 18",
]
query = "function that sums two numbers"

query_emb = embed([query])
code_embs = embed(snippets)

# Calculate similarity (dot product of L2-normalized embeddings)
scores = (query_emb @ code_embs.T)[0]
for idx, score in enumerate(scores):
    print(f"[{score:.4f}] {snippets[idx]}")
```

---

## On-Device Deployment (Android)

This model has been deployed inside a native Android application using:

1. **ONNX Runtime Android AAR** (`com.microsoft.onnxruntime:onnxruntime-android`) for CPU inference.
2. **Custom WordPiece Tokenizer in Kotlin** (`BertTokenizer.kt`) to tokenize strings directly on-device, with no Python dependencies.
3. **Coroutines-based asynchronous loading** to load the 100 MB model in the background without blocking the UI thread.

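
When porting or checking an on-device tokenizer, it can help to have a small reference for the greedy longest-match-first rule that WordPiece uses. The following is a minimal Python sketch of that rule only. It is not the Kotlin code from the repository. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical, and the sketch assumes the input is a single word that has already been lowercased and split on whitespace and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first WordPiece split of one pre-split word."""
    if len(word) > max_chars:
        return [unk_token]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only (NOT the model's vocab.txt).
toy_vocab = {"[UNK]", "bin", "##ary", "search", "##ing", "sum"}

print(wordpiece_tokenize("binary", toy_vocab))
print(wordpiece_tokenize("searching", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

One way to use a reference like this is to compare its output, loaded with the real `vocab.txt`, against the token IDs produced on-device for the same strings, since a tokenizer mismatch changes embeddings without raising any error.
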
For complete Android source files (MainActivity, OnnxEmbedder, and BertTokenizer), please refer to the GitHub repository: [CoderOMaster/nomic-codesearch-android](https://github.com/CoderOMaster/nomic-codesearch-android).

---

## License

This project is licensed under the Apache 2.0 License.