---
license: apache-2.0
tags:
- cross-encoder
- reranker
- code
- retrieval
- typescript
- python
pipeline_tag: text-classification
---

# Code Reranker MiniLM v1

A fine-tuned cross-encoder reranker for code relevance ranking, trained on TypeScript/Python codebases.

## Model Details

- **Base model**: [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2)
- **Training data**: 19,810 (query, code snippet) pairs from a real-world TypeScript/Python codebase
- **Training**: 2 epochs, batch size 8, lr 2e-5
- **Hardware**: GTX 1660 SUPER (6 GB VRAM), CPU training

## Benchmarks

| Metric | Original | Fine-tuned |
|--------|----------|------------|
| Accuracy | 88.4% | **98.0%** |
| Best Accuracy | 96.2% | **99.0%** |
| AUC | 98.64% | **99.95%** |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("trd92/code-reranker-minilm-v1")
model = AutoModelForSequenceClassification.from_pretrained("trd92/code-reranker-minilm-v1")

query = "How to define configuration?"
code = "def define_config(): ..."

# Encode the query and the code snippet together as one sequence pair;
# truncation guards against snippets longer than the model's input limit.
inputs = tokenizer(query, code, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# The raw logit is the relevance score (higher means more relevant).
score = outputs.logits.item()
print(f"Relevance score: {score:.4f}")
```

## Training

Trained using contrastive learning on code triplets (query, positive_code, negative_code) extracted via AST parsing.
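
The card does not say how the triplets were built beyond "AST parsing", so the following is only a minimal sketch of one way such extraction could work for Python sources. It assumes that a function's docstring serves as the query, the function's own source as the positive, and a randomly chosen other function as the negative. The names `extract_pairs`, `build_triplets`, and `SAMPLE` are hypothetical and are not taken from the actual training pipeline.

```python
import ast
import random
import textwrap


def extract_pairs(source):
    """Return (docstring, function source) pairs for every documented function."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            code = ast.get_source_segment(source, node)
            if doc and code:
                pairs.append((doc.strip(), code))
    return pairs


def build_triplets(pairs, seed=0):
    """Pair each query with its own code (positive) and another function (negative)."""
    rng = random.Random(seed)
    triplets = []
    for i, (query, positive) in enumerate(pairs):
        others = [code for j, (_, code) in enumerate(pairs) if j != i]
        if not others:
            continue  # at least two functions are needed to draw a negative
        triplets.append((query, positive, rng.choice(others)))
    return triplets


# Hypothetical input: a tiny module with two documented functions.
SAMPLE = textwrap.dedent('''
    def define_config():
        """Define the application configuration."""
        return {"debug": False}

    def load_users(path):
        """Load user records from a JSON file."""
        import json
        with open(path) as f:
            return json.load(f)
''')

pairs = extract_pairs(SAMPLE)
triplets = build_triplets(pairs)
for query, positive, negative in triplets:
    print(query, "| +", positive.splitlines()[0], "| -", negative.splitlines()[0])
```

Random negatives are the simplest choice. A real pipeline might instead pick harder negatives, for example functions from the same file or with similar names, and would need a separate parser for TypeScript sources.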