---
license: apache-2.0
tags:
- cross-encoder
- reranker
- code
- retrieval
- typescript
- python
pipeline_tag: text-classification
---

# Code Reranker MiniLM v1

A cross-encoder reranker fine-tuned to score how relevant a code snippet is to a query, trained on TypeScript and Python code.

## Model Details

- **Base model**: [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2)
- **Training data**: 19,810 (query, code snippet) pairs from a real-world TypeScript/Python codebase
- **Training**: 2 epochs, batch size 8, learning rate 2e-5
- **Hardware**: GTX 1660 SUPER (6 GB VRAM), CPU training

## Benchmarks

| Metric | Base model | Fine-tuned |
|--------|------------|------------|
| Accuracy | 88.4% | **98.0%** |
| Best Accuracy | 96.2% | **99.0%** |
| AUC | 98.64% | **99.95%** |

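The card does not say how these metrics were computed. As a rough sketch, assuming each evaluation pair carries a binary relevant/irrelevant label, that "Accuracy" uses a fixed score threshold, and that "Best Accuracy" is the accuracy at the best threshold found on the evaluation set, the three numbers could be derived as below. The function names, the threshold of `0.0`, and the toy data are all hypothetical.

```python
def accuracy_at(scores, labels, threshold):
    """Fraction of pairs classified correctly when score > threshold means 'relevant'."""
    correct = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def best_accuracy(scores, labels):
    """Return (accuracy, threshold) for the threshold that maximizes accuracy."""
    ordered = sorted(set(scores))
    # One threshold below every score, then one at each distinct score value:
    # together these cover every possible split of the observed scores.
    candidates = [ordered[0] - 1.0] + ordered
    return max((accuracy_at(scores, labels, t), t) for t in candidates)


def roc_auc(scores, labels):
    """Probability that a relevant pair outscores an irrelevant one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores and labels, invented for illustration only.
scores = [2.1, -0.2, -0.3, -1.5, 0.9, -0.8]
labels = [1, 1, 0, 0, 1, 0]

print(accuracy_at(scores, labels, threshold=0.0))
print(best_accuracy(scores, labels))
print(roc_auc(scores, labels))
```

On the toy data, the fixed threshold misclassifies the relevant pair scored `-0.2`, while a tuned threshold separates the classes cleanly, which is the kind of gap the "Accuracy" and "Best Accuracy" rows would show under this reading.
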
## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "trd92/code-reranker-minilm-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode (disables dropout)

query = "How to define configuration?"
code = "def define_config(): ..."

# Encode the (query, code) pair as one cross-encoder input,
# truncating inputs that exceed the model's maximum length.
inputs = tokenizer(query, code, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# A single logit is returned for the pair.
score = outputs.logits.item()
print(f"Relevance score: {score:.4f}")
```

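In practice a reranker scores many candidate snippets for one query and sorts them. The sketch below shows that loop, assuming higher scores mean more relevant. It is deliberately model-agnostic: `rerank`, `score_pair`, and `toy_score` are hypothetical names, and `toy_score` is a keyword-overlap stand-in so the example runs without downloading anything.

```python
from typing import Callable, List, Optional, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and return them best-first."""
    scored = [(code, score_pair(query, code)) for code in candidates]
    # Sort by score, highest first; ties keep their original order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


def toy_score(query: str, code: str) -> float:
    """Stand-in scorer (assumption): counts query words found in the snippet."""
    words = query.lower().replace("?", "").split()
    return float(sum(word in code.lower() for word in words))


candidates = [
    "def parse_args(): ...",
    "def define_config(): ...",
    "class Logger: ...",
]
ranked = rerank("How to define configuration?", candidates, toy_score)
for code, score in ranked:
    print(f"{score:.1f}  {code}")
```

To use the actual model, replace `toy_score` with a function that wraps the tokenizer and model call from the snippet above and returns `outputs.logits.item()`. For long candidate lists, tokenizing all pairs in one batch with `padding=True` is usually faster than scoring one pair at a time.
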
## Training

Trained using contrastive learning on code triplets (query, positive_code, negative_code) extracted via AST parsing.

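
The extraction pipeline itself is not published here. As a minimal sketch of the idea for Python sources, the standard-library `ast` module can pull out documented functions and pair them into triplets. Using the docstring summary as the query and a randomly chosen other function as the negative are assumptions made for illustration, as are the helper names and the sample source. TypeScript files would need a different parser.

```python
import ast
import random

# Sample source, invented for illustration.
SOURCE = '''
def load_config(path):
    """Load configuration from a JSON file."""
    import json
    with open(path) as fh:
        return json.load(fh)

def save_report(rows, path):
    """Write report rows to a CSV file."""
    import csv
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)

def helper(x):
    return x + 1
'''


def extract_functions(source):
    """Return (docstring_summary, function_source) for each documented function."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # undocumented functions give no query, so skip them
                summary = doc.strip().splitlines()[0]
                found.append((summary, ast.get_source_segment(source, node)))
    return found


def build_triplets(functions, seed=0):
    """Pair each (query, positive) with a random other function as the negative."""
    rng = random.Random(seed)
    triplets = []
    for i, (query, positive) in enumerate(functions):
        others = [code for j, (_, code) in enumerate(functions) if j != i]
        if others:
            triplets.append((query, positive, rng.choice(others)))
    return triplets


triplets = build_triplets(extract_functions(SOURCE))
for query, positive, negative in triplets:
    print(query, "|", positive.splitlines()[0], "|", negative.splitlines()[0])
```

Random negatives are the simplest choice. Harder negatives, such as functions from the same module, would likely make the training signal stronger, but that goes beyond what the card states.
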
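
The card says "contrastive learning" without naming the loss. One common objective for triplets, shown here purely as an assumption, is a margin ranking loss: the positive snippet should outscore the negative by at least a margin. The function name and the margin of `1.0` are hypothetical.

```python
def triplet_margin_loss(pos_scores, neg_scores, margin=1.0):
    """Mean of max(0, margin - (s_pos - s_neg)) over all triplets."""
    losses = [max(0.0, margin - (p - n)) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# Reranker scores for two triplets (values invented for illustration).
pos_scores = [2.0, 0.5]    # score(query, positive_code)
neg_scores = [-1.0, 0.25]  # score(query, negative_code)

# First triplet: gap 3.0 exceeds the margin, so it contributes 0.
# Second triplet: gap 0.25 falls short by 0.75.
print(triplet_margin_loss(pos_scores, neg_scores))
```

In PyTorch the same quantity is available as `torch.nn.MarginRankingLoss` with a target of `1` for every pair.
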