djspiewak committed on
Commit 1c75a25 · verified · 1 Parent(s): 4dd52df

Upload folder using huggingface_hub
README.md ADDED
---
license: mit
language:
- en
library_name: transformers
tags:
- sentence-transformers
- feature-extraction
- text-embeddings
- semantic-search
- onnx
- transformers.js
- bert
- knowledge-distillation
datasets:
- custom
pipeline_tag: feature-extraction
model-index:
- name: typelevel-bert
  results:
  - task:
      type: retrieval
      name: Document Retrieval
    dataset:
      type: custom
      name: FP-Doc Benchmark v1
    metrics:
    - type: ndcg_at_10
      value: 0.853
      name: NDCG@10
    - type: mrr
      value: 0.900
      name: MRR
    - type: recall_at_10
      value: 0.967
      name: Recall@10
---

# Typelevel-BERT

A compact, browser-deployable text embedding model specialized for searching Typelevel/FP documentation. Distilled from [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) for fast client-side inference.

## Highlights

- **93.3%** of teacher model quality (NDCG@10)
- **30x smaller** than the teacher (11M vs 335M parameters)
- **10.7 MB** quantized ONNX model
- **1.5 ms** inference latency (CPU, seq_len=128)
- Optimized for Cats, Cats Effect, FS2, http4s, Doobie, and Circe documentation

## Model Details

| Property | Value |
|----------|-------|
| **Model Type** | BERT encoder (text embedding) |
| **Architecture** | 4-layer transformer |
| **Hidden Size** | 256 |
| **Attention Heads** | 4 |
| **Parameters** | 11.2M |
| **Embedding Dimension** | 256 |
| **Max Sequence Length** | 512 |
| **Vocabulary** | bert-base-uncased (30,522 tokens) |
| **Pooling** | Mean pooling |

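As a sanity check, the 11.2M parameter figure can be roughly reproduced from the architecture values above, assuming a standard `BertModel` layout (the pooler head included here is an assumption; the exact total depends on which heads are counted):

```python
# Back-of-the-envelope parameter count for the 4-layer, 256-dim BERT above.
vocab, hidden, inter, layers, max_pos = 30522, 256, 1024, 4, 512

embeddings = (vocab + max_pos + 2) * hidden + 2 * hidden  # word/position/type tables + LayerNorm
per_layer = (
    4 * (hidden * hidden + hidden)   # Q, K, V, and output projections (+ biases)
    + (hidden * inter + inter)       # FFN up-projection
    + (inter * hidden + hidden)      # FFN down-projection
    + 2 * 2 * hidden                 # two LayerNorms per layer
)
pooler = hidden * hidden + hidden    # standard BertModel pooler head (assumed present)

total = embeddings + layers * per_layer + pooler
print(f"{total:,}")  # 11,170,560 ≈ 11.2M
```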
## Usage

### Browser/Node.js (transformers.js)

```javascript
import { pipeline } from '@huggingface/transformers';

// Load the model (downloads automatically)
const extractor = await pipeline('feature-extraction', 'djspiewak/typelevel-bert', {
  dtype: 'q8', // Use the INT8 quantized model (10.7 MB)
});

// Generate embeddings
const embedding = await extractor("How to sequence effects in cats-effect", {
  pooling: 'mean',
  normalize: true,
});

console.log(embedding.data); // Float32Array(256)
```
85
+
86
+ ### Python (ONNX Runtime)
87
+
88
+ ```python
89
+ import onnxruntime as ort
90
+ import numpy as np
91
+ from transformers import AutoTokenizer
92
+ from huggingface_hub import hf_hub_download
93
+
94
+ # Download and load quantized model
95
+ model_path = hf_hub_download("djspiewak/typelevel-bert", "onnx/model_quantized.onnx")
96
+ tokenizer = AutoTokenizer.from_pretrained("djspiewak/typelevel-bert")
97
+ session = ort.InferenceSession(model_path)
98
+
99
+ # Tokenize input
100
+ text = "Resource management and safe cleanup"
101
+ inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)
102
+
103
+ # Run inference
104
+ outputs = session.run(None, {
105
+ "input_ids": inputs["input_ids"].astype(np.int64),
106
+ "attention_mask": inputs["attention_mask"].astype(np.int64),
107
+ })
108
+
109
+ # Mean pooling
110
+ hidden_states = outputs[0] # (1, seq_len, 256)
111
+ attention_mask = inputs["attention_mask"]
112
+ mask_expanded = np.expand_dims(attention_mask, -1)
113
+ sum_embeddings = np.sum(hidden_states * mask_expanded, axis=1)
114
+ sum_mask = np.sum(mask_expanded, axis=1)
115
+ embedding = sum_embeddings / sum_mask # (1, 256)
116
+
117
+ # L2 normalize
118
+ embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
119
+ ```
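
Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, so ranking a document corpus against a query is a single matrix multiply. A minimal sketch (the `rank` helper and the toy unit vectors below are illustrative stand-ins for real embeddings produced as above):

```python
import numpy as np

def rank(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Return indices of the top_k documents by cosine similarity.

    Assumes query_emb (256,) and doc_embs (n_docs, 256) are already
    L2-normalized, so the dot product equals cosine similarity.
    """
    scores = doc_embs @ query_emb      # (n_docs,) similarity scores
    return np.argsort(-scores)[:top_k]

# Toy example: four orthogonal "documents" instead of real embeddings
docs = np.eye(4)
query = np.array([0.0, 1.0, 0.0, 0.0])
print(rank(query, docs, top_k=2))      # document 1 ranks first
```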

## Performance

| Metric | Typelevel-BERT | Teacher (BGE-large) | % of Teacher |
|--------|----------------|---------------------|--------------|
| NDCG@10 | 0.853 | 0.915 | 93.3% |
| MRR | 0.900 | 0.963 | 93.5% |
| Recall@10 | 96.7% | 96.7% | 100% |
| Parameters | 11.2M | 335M | 3.3% |
| Model Size | 10.7 MB | ~1.2 GB | 0.9% |
| Latency (CPU) | 1.5 ms | ~15 ms | 10x faster |

## Training

- **Teacher Model**: BAAI/bge-large-en-v1.5 (335M parameters, 1024-dim embeddings)
- **Training Data**: 30,598 text chunks from Typelevel ecosystem documentation
- **Distillation Method**: Knowledge distillation with MSE + cosine similarity loss
- **Hardware**: Apple M3 Max (MPS)

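The combined MSE + cosine loss named above can be sketched as follows. This is an illustrative reconstruction, not the actual training code: in particular, the `proj` matrix bridging the teacher's 1024 dims to the student's 256 is an assumption standing in for whatever learned projection the real setup used.

```python
import numpy as np

def distill_loss(student: np.ndarray, teacher: np.ndarray, proj: np.ndarray,
                 alpha: float = 0.5) -> float:
    """Weighted MSE + (1 - cosine similarity) distillation loss on a batch.

    student: (batch, 256), teacher: (batch, 1024).
    proj: (1024, 256) dimension-bridging projection (assumed; in real
    training this would be a learned layer).
    """
    target = teacher @ proj                          # (batch, 256)
    mse = np.mean((student - target) ** 2)
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    cos_loss = np.mean(1.0 - np.sum(s * t, axis=1))  # 1 - cosine similarity
    return alpha * mse + (1.0 - alpha) * cos_loss

# A "perfect" student that exactly matches the projected teacher scores ~0
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 1024))
proj = rng.normal(size=(1024, 256)) / np.sqrt(1024)
student = teacher @ proj
print(distill_loss(student, teacher, proj))  # ≈ 0 for a perfect student
```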

## Intended Use

This model is designed for:
- **Semantic search** in functional programming documentation
- **Document retrieval** for Typelevel ecosystem libraries (Cats, Cats Effect, FS2, http4s, Doobie, Circe)
- **Browser-based inference** via transformers.js or ONNX Runtime Web
- **Client-side embeddings** for privacy-preserving search applications

## Limitations

1. **Domain Specialization**: Optimized for FP documentation; may underperform on general text
2. **English Only**: Trained exclusively on English documentation
3. **Vocabulary**: Uses the bert-base-uncased vocabulary; some FP-specific terms may be suboptimally tokenized

## Files

| File | Size | Description |
|------|------|-------------|
| `model.safetensors` | 42.6 MB | PyTorch weights |
| `onnx/model.onnx` | 42.4 MB | Full precision ONNX |
| `onnx/model_quantized.onnx` | 10.7 MB | INT8 quantized ONNX |
| `config.json` | - | Model configuration |
| `tokenizer.json` | - | Fast tokenizer |
| `vocab.txt` | - | Vocabulary file |

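The file sizes line up with the parameter count: assuming roughly 11.2M parameters, float32 weights (4 bytes each) land near the full-precision size and INT8 weights (1 byte each) near the quantized size, with small deltas from non-quantized tensors and file overhead:

```python
params = 11_170_560                     # ≈ 11.2M parameters (rough estimate)
fp32_mb = params * 4 / 1024 / 1024      # float32: 4 bytes per parameter
int8_mb = params * 1 / 1024 / 1024      # INT8: 1 byte per parameter
print(f"{fp32_mb:.1f} MB, {int8_mb:.1f} MB")  # ~42.6 MB and ~10.7 MB
```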
## Citation

```bibtex
@misc{typelevel-bert,
  title={Typelevel-BERT: Distilled Text Embeddings for FP Documentation Search},
  author={Daniel Spiewak},
  year={2025},
  url={https://huggingface.co/djspiewak/typelevel-bert}
}
```

## License

MIT
config.json ADDED
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:62d83e918217f992d91973030d36865eb216388512066f99d4101edb27d7e7a2
size 44690024
onnx/model.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:601798ea0a04bc7b3119e061c1a80e081b7b4347f2f064f1e232bc313c5d19cc
size 44459423
onnx/model_quantized.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:341648c8f684aaf372d224ba83c34e17b709aa62309f1bfd79400a3b6ad7bc90
size 11248141
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{
  "backend": "tokenizers",
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "is_local": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff