---
license: mit
language:
- en
library_name: transformers
tags:
- sentence-transformers
- feature-extraction
- text-embeddings
- semantic-search
- onnx
- transformers.js
- bert
- knowledge-distillation
datasets:
- custom
pipeline_tag: feature-extraction
model-index:
- name: typelevel-bert
  results:
  - task:
      type: retrieval
      name: Document Retrieval
    dataset:
      type: custom
      name: FP-Doc Benchmark v1
    metrics:
    - type: ndcg_at_10
      value: 0.853
      name: NDCG@10
    - type: mrr
      value: 0.900
      name: MRR
    - type: recall_at_10
      value: 0.967
      name: Recall@10
---
|
|
|
|
|
# Typelevel-BERT |
|
|
|
|
|
A compact, browser-deployable text embedding model specialized for searching Typelevel/FP documentation. Distilled from [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) to achieve fast client-side inference. |
|
|
|
|
|
## Highlights

- **93.3%** of teacher-model quality (NDCG@10)
- **30x smaller** than the teacher (11M vs. 335M parameters)
- **10.7 MB** quantized ONNX model
- **1.5 ms** inference latency (CPU, seq_len=128)
- Optimized for Cats, Cats Effect, FS2, http4s, Doobie, and Circe documentation
|
|
|
|
|
## Model Details

| Property | Value |
|----------|-------|
| **Model Type** | BERT encoder (text embedding) |
| **Architecture** | 4-layer transformer |
| **Hidden Size** | 256 |
| **Attention Heads** | 4 |
| **Parameters** | 11.2M |
| **Embedding Dimension** | 256 |
| **Max Sequence Length** | 512 |
| **Vocabulary** | bert-base-uncased (30,522 tokens) |
| **Pooling** | Mean pooling |
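The model pools with an attention-mask-aware mean over token embeddings. As a point of reference, here is a minimal NumPy sketch of that pooling step (an illustration of the technique, not the model's actual compute graph):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid divide-by-zero
    return summed / counts

# Toy check: two real tokens, one padded token that must not affect the result
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # (1, 3, 2)
mask = np.array([[1, 1, 0]])
print(mean_pool(emb, mask))  # [[2. 3.]]
```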
|
|
|
|
|
## Usage

### Browser/Node.js (transformers.js)

```javascript
import { pipeline } from '@huggingface/transformers';

// Load the model (downloads automatically)
const extractor = await pipeline('feature-extraction', 'djspiewak/typelevel-bert', {
  dtype: 'q8', // use the INT8 quantized model (10.7 MB)
});

// Generate embeddings
const embedding = await extractor('How to sequence effects in cats-effect', {
  pooling: 'mean',
  normalize: true,
});

console.log(embedding.data); // Float32Array(256)
```
|
|
|
|
|
### Python (ONNX Runtime)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download and load the quantized model
model_path = hf_hub_download("djspiewak/typelevel-bert", "onnx/model_quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained("djspiewak/typelevel-bert")
session = ort.InferenceSession(model_path)

# Tokenize input
text = "Resource management and safe cleanup"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference
outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
})

# The model outputs pooled embeddings; L2-normalize them for cosine similarity
embedding = outputs[0]  # (1, 256)
embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
```
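Once embeddings are L2-normalized, cosine similarity reduces to a dot product, so retrieval is a matrix-vector product followed by a sort. A minimal ranking sketch (the `rank_documents` helper and the toy unit vectors are illustrative, not part of the released model):

```python
import numpy as np

def rank_documents(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 10):
    """Return (index, score) pairs for the top_k most similar documents.

    query_emb: (dim,) L2-normalized query embedding
    doc_embs:  (n_docs, dim) L2-normalized document embeddings
    """
    scores = doc_embs @ query_emb      # cosine similarity via dot product
    top = np.argsort(-scores)[:top_k]  # highest scores first
    return [(int(i), float(scores[i])) for i in top]

# Toy example with hand-made unit vectors
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.6, 0.8])
print(rank_documents(query, docs, top_k=2))  # best match (index 2) first
```

In a real application, `doc_embs` would hold the pre-computed embeddings of your documentation chunks, built with the same model and normalization as the query.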
|
|
|
|
|
## Performance

| Metric | Typelevel-BERT | Teacher (BGE-large) | % of Teacher |
|--------|----------------|---------------------|--------------|
| NDCG@10 | 0.853 | 0.915 | 93.3% |
| MRR | 0.900 | 0.963 | 93.5% |
| Recall@10 | 96.7% | 96.7% | 100% |
| Parameters | 11.2M | 335M | 3.3% |
| Model Size | 10.7 MB | ~1.2 GB | 0.9% |
| Latency (CPU) | 1.5 ms | ~15 ms | 10x faster |
|
|
|
|
|
## Training

- **Teacher Model**: BAAI/bge-large-en-v1.5 (335M parameters, 1024-dim embeddings)
- **Training Data**: 30,598 text chunks from Typelevel ecosystem documentation
- **Distillation Method**: Knowledge distillation with a combined MSE + cosine-similarity loss
- **Hardware**: Apple M3 Max (MPS)
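The distillation objective combines the two terms listed above. Because the teacher emits 1024-dim embeddings and the student 256-dim, the student output must be aligned to the teacher space; the sketch below assumes a learned linear projection into teacher space and an equal weighting of the two terms, both illustrative assumptions rather than the exact training recipe:

```python
import numpy as np

def distill_loss(student_emb, teacher_emb, proj, alpha=0.5):
    """MSE + cosine-similarity distillation loss (illustrative NumPy version).

    student_emb: (batch, 256) student embeddings
    teacher_emb: (batch, 1024) teacher embeddings
    proj:        (256, 1024) assumed learned projection into teacher space
    """
    projected = student_emb @ proj  # (batch, 1024)
    mse = np.mean((projected - teacher_emb) ** 2)
    # Cosine term: 1 - cos(projected, teacher), averaged over the batch
    p_norm = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    t_norm = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    cos_loss = np.mean(1.0 - np.sum(p_norm * t_norm, axis=1))
    return alpha * mse + (1 - alpha) * cos_loss

rng = np.random.default_rng(0)
s = rng.normal(size=(4, 256))
t = rng.normal(size=(4, 1024))
W = rng.normal(size=(256, 1024)) * 0.01
print(distill_loss(s, t, W))  # non-negative scalar
```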
|
|
|
|
|
## Intended Use

This model is designed for:

- **Semantic search** in functional programming documentation
- **Document retrieval** for Typelevel ecosystem libraries (Cats, Cats Effect, FS2, http4s, Doobie, Circe)
- **Browser-based inference** via transformers.js or ONNX Runtime Web
- **Client-side embeddings** for privacy-preserving search applications
|
|
|
|
|
## Limitations

1. **Domain Specialization**: Optimized for FP documentation; may underperform on general text
2. **English Only**: Trained exclusively on English documentation
3. **Vocabulary**: Uses the bert-base-uncased vocabulary, so some FP-specific terms may be split into suboptimal subword tokens
|
|
|
|
|
## Files

| File | Size | Description |
|------|------|-------------|
| `model.safetensors` | 42.6 MB | PyTorch weights |
| `onnx/model.onnx` | 42.4 MB | Full-precision ONNX |
| `onnx/model_quantized.onnx` | 10.7 MB | INT8 quantized ONNX |
| `config.json` | - | Model configuration |
| `tokenizer.json` | - | Fast tokenizer |
| `vocab.txt` | - | Vocabulary file |
|
|
|
|
|
## Citation

```bibtex
@misc{typelevel-bert,
  title={Typelevel-BERT: Distilled Text Embeddings for FP Documentation Search},
  author={Daniel Spiewak},
  year={2025},
  url={https://huggingface.co/djspiewak/typelevel-bert}
}
```
|
|
|
|
|
## License

MIT
|
|
|