	---
	license: apache-2.0
	tags:
	- sentence-transformers
	- feature-extraction
	- onnx
	- sentence-similarity
	- code-search
	- on-device
	pipeline_tag: feature-extraction
	library_name: sentence-transformers
	---

	# nomic-codesearch-onnx (INT8 Quantized)

	This model is a fine-tuned version of `nomic-ai/nomic-embed-text-v1.5`, trained for semantic code search over Python snippets, then exported to ONNX and dynamically quantized to INT8 for efficient on-device CPU inference, including on mobile.

	Quantization shrinks the model from 530 MB to 100 MB (a ~5x reduction) with only a small drop in retrieval quality (NDCG@10 of ~0.68 versus ~0.71 for FP32, see Metrics), which makes it well suited to deployment on Android, iOS, and other resource-constrained environments.
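To give a feel for where the size reduction comes from, here is a minimal NumPy sketch of symmetric per-tensor INT8 weight quantization. It only illustrates the general idea and is not the ONNX Runtime procedure used to produce this model; `quantize_int8` and `dequantize` are hypothetical helper names, and the weight matrix is random stand-in data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: float32 values -> int8 plus one scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale

# Random stand-in for one 768x768 weight matrix (not real model weights)
rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("float32 bytes:", w.nbytes)
print("int8 bytes:   ", q.nbytes)
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

Each quantized tensor takes a quarter of its float32 storage, at the cost of a rounding error of at most half a quantization step per weight. The overall file-size ratio also depends on which tensors are quantized and on details of the export.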

	---

	## Model Details

	- Base Model: `nomic-ai/nomic-embed-text-v1.5` (137M parameters, 768-dimensional embeddings)
	- Fine-Tuning Dataset: `code-search-net/code_search_net` (Python split). Trained on 50,000 positive `(docstring, function)` pairs using Multiple Negatives Ranking Loss (MNR).
	- Training Hardware: Apple Silicon M4 (MPS backend)
	- Export Format: ONNX (Opset 17)
	- Quantization: Dynamic INT8 quantization (weights stored as `QInt8`; activations quantized at runtime)
	- Dimensions: 768 (Matryoshka Representation Learning allows truncation down to 256 dimensions)
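
The Matryoshka property in the last bullet means an embedding can be shortened by keeping its leading components. A minimal sketch, assuming plain truncation followed by L2 re-normalisation; `truncate_embeddings` is a hypothetical helper, the input is random stand-in data, and the base model's documentation should be checked for any recommended normalisation step before truncation.

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalise again."""
    if not 0 < dim <= embeddings.shape[1]:
        raise ValueError(f"dim must be between 1 and {embeddings.shape[1]}")
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-12)

# Random stand-ins for real 768-dimensional model outputs
rng = np.random.default_rng(0)
full = rng.standard_normal((3, 768)).astype(np.float32)

small = truncate_embeddings(full, dim=256)
print(small.shape)
```

Shorter vectors reduce index size and dot-product cost; the retrieval quality at each truncated size is not reported in this card and would need to be measured.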

	---

	## Metrics

	| Config | Size | Mean Cosine Drift | NDCG@10 (Code Search) |
	|---|---|---|---|
	| Baseline Model | 530 MB | 0.0 | ~0.48 |
	| Fine-Tuned FP32 ONNX | 530 MB | 0.0 | ~0.71 |
	| Fine-Tuned INT8 ONNX | 100 MB | ~0.07 | ~0.68 |
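
The card does not spell out how the two metrics were computed. The sketch below shows one plausible reading, offered as an assumption: drift as the mean of `1 - cosine similarity` between paired FP32 and INT8 embeddings, and NDCG@10 with the standard log2 discount. `mean_cosine_drift` and `ndcg_at_k` are hypothetical helper names.

```python
import math
import numpy as np

def mean_cosine_drift(reference: np.ndarray, quantized: np.ndarray) -> float:
    """Mean of (1 - cosine similarity) over paired rows (assumed definition)."""
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    qnt = quantized / np.linalg.norm(quantized, axis=1, keepdims=True)
    cosine = np.sum(ref * qnt, axis=1)
    return float(np.mean(1.0 - cosine))

def ndcg_at_k(ranked_relevance: list[float], k: int = 10) -> float:
    """NDCG@k for one query, given relevance grades in the order they were ranked."""
    def dcg(grades: list[float]) -> float:
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades[:k]))
    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0

# Toy data: identical embeddings have zero drift
emb = np.array([[1.0, 2.0], [3.0, 4.0]])
print(mean_cosine_drift(emb, emb))

# One relevant item, ranked second out of three
print(ndcg_at_k([0.0, 1.0, 0.0]))
```

`ndcg_at_k` builds the ideal ranking from the list it is given, so that list should contain every relevant candidate for the query; the reported score would then be the mean over all evaluation queries.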

	---

	## Python Quickstart

	To run semantic code search or generate embeddings locally using this ONNX model:

	### 1. Install Dependencies
	```bash
	pip install onnxruntime transformers numpy
	```

	### 2. Run Inference
	```python
	import os
	import numpy as np
	import onnxruntime as ort
	from transformers import AutoTokenizer

	# Load tokenizer and ONNX session
	# Ensure config.json, tokenizer.json, vocab.txt, etc., are in the same directory
	model_dir = "./"
	tokenizer = AutoTokenizer.from_pretrained(model_dir)
	session = ort.InferenceSession(os.path.join(model_dir, "model_int8.onnx"))

	def embed(texts: list[str], max_length: int = 512) -> np.ndarray:
	    """Return L2-normalised sentence embeddings, shape (len(texts), 768)."""
	    encoded = tokenizer(
	        texts,
	        padding=True,
	        truncation=True,
	        max_length=max_length,
	        return_tensors="np",
	    )
	    outputs = session.run(
	        ["sentence_embedding"],
	        {
	            "input_ids": encoded["input_ids"].astype(np.int64),
	            "attention_mask": encoded["attention_mask"].astype(np.int64),
	        },
	    )
	    embeddings = outputs[0]  # (batch, 768)
	    # L2 normalise so dot-product == cosine similarity
	    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
	    return embeddings / np.maximum(norms, 1e-12)

	# Embed query and snippets
	snippets = [
	    "def add(a, b): return a + b",
	    "def binary_search(arr, target): ...",
	    "SELECT * FROM users WHERE age > 18",
	]
	query = "function that sums two numbers"

	query_emb = embed([query])
	code_embs = embed(snippets)

	# Calculate similarity (dot product of L2-normalized embeddings)
	scores = (query_emb @ code_embs.T)[0]
	for idx, score in enumerate(scores):
	    print(f"[{score:.4f}] {snippets[idx]}")
	```
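
For larger snippet collections you would usually rank results rather than print every score. A minimal sketch, assuming the embeddings are already L2-normalised as `embed` returns them; `top_k` is a hypothetical helper, and the vectors are two-dimensional toy values so the block runs without the model files.

```python
import numpy as np

def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k highest cosine scores, best first."""
    scores = doc_embs @ query_emb        # shape (num_docs,)
    order = np.argsort(-scores)[:k]      # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy unit vectors standing in for real embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.8, 0.6])

for idx, score in top_k(query, docs, k=2):
    print(idx, round(score, 2))
```

With the quickstart outputs this would be called as `top_k(query_emb[0], code_embs)`, since `embed` returns a 2-D array even for a single query.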

	---

	## On-Device Deployment (Android)

	This model has been successfully deployed inside a native Android application using:
	1. ONNX Runtime Android AAR (`com.microsoft.onnxruntime:onnxruntime-android`) for CPU inference.
	2. Custom WordPiece Tokenizer in Kotlin (`BertTokenizer.kt`) to tokenize strings directly on-device, with no Python dependency.
	3. Coroutine-based asynchronous loading, so the 100 MB model loads in the background without blocking the UI thread.
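
For readers unfamiliar with WordPiece, the core of such a tokenizer is a greedy longest-match-first split of each word against the vocabulary. The sketch below is written in Python for consistency with the rest of this card and is not the contents of `BertTokenizer.kt`; the `wordpiece` function and the toy vocabulary are hypothetical, and a full BERT-style tokenizer also handles lowercasing, punctuation splitting, and special tokens.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]", max_chars: int = 100) -> list[str]:
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    if len(word) > max_chars:
        return [unk]
    tokens: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for illustration only (the real vocab.txt is much larger)
toy_vocab = {"bin", "##ary", "search", "[UNK]"}
print(wordpiece("binary", toy_vocab))
print(wordpiece("zzz", toy_vocab))
```

The resulting pieces are then mapped to their vocabulary indices to form `input_ids`, matching what the Python tokenizer produces in the quickstart.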

	For complete Android source files (MainActivity, OnnxEmbedder, and BertTokenizer), please refer to the GitHub repository: [CoderOMaster/nomic-codesearch-android](https://github.com/CoderOMaster/nomic-codesearch-android).

	---

	## License

	This project is licensed under the Apache 2.0 License.