arjunsah21 committed eadc010 (verified; parent: 8d75d0c)

Update README.md

Files changed (1): README.md (+71 -3). The commit replaces the old frontmatter (license: mit) with the expanded model card below.
---
language: en
license: apache-2.0
tags:
- sentence-embeddings
- transformers
- bert
- contrastive-learning
datasets:
- snli
---

# Embedder (SNLI)

## Model Description
A lightweight BERT-style encoder trained from scratch on SNLI entailment pairs using an in-batch contrastive loss and mean pooling.

## Training Data
- Dataset: SNLI (loaded via the Hugging Face `datasets` library)
- Filter: keep only pairs labeled entailment
- Subsample: 50,000 pairs from the training split
- Tokenizer corpus: the premises and hypotheses of the filtered pairs; see the sketch below
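
A minimal sketch of this preparation, assuming the Hugging Face `datasets` library (the shuffle seed is an illustrative assumption, not documented by this card):

```python
from datasets import load_dataset

# Load SNLI and keep only entailment pairs (label 0 in SNLI).
snli = load_dataset("snli", split="train")
pairs = snli.filter(lambda ex: ex["label"] == 0)

# Subsample 50,000 premise/hypothesis pairs (seed is assumed).
pairs = pairs.shuffle(seed=42).select(range(50_000))

# Tokenizer corpus: the premises plus hypotheses of the filtered pairs.
corpus = pairs["premise"] + pairs["hypothesis"]
```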

## Tokenizer
- Type: WordPiece
- Vocabulary size: 30,000
- Minimum token frequency: 2
- Special tokens: `[PAD] [UNK] [CLS] [SEP] [MASK]`
- Max sequence length: 128
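
One way to reproduce a tokenizer with these settings, using the `tokenizers` library (a sketch; the `lowercase=True` choice is an assumption not stated by the card):

```python
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece tokenizer on the SNLI corpus built above.
tokenizer = BertWordPieceTokenizer(lowercase=True)  # lowercasing is assumed
tokenizer.train_from_iterator(
    corpus,
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt to the current directory
```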

## Architecture
- Model: `BertModel` (trained from scratch)
- Layers: 6
- Hidden size: 384
- Attention heads: 6
- Intermediate size: 1536
- Max position embeddings: 128
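
The corresponding configuration, instantiated directly from the numbers above:

```python
from transformers import BertConfig, BertModel

# Small BERT configuration matching the card; trained from scratch (no pretrained weights).
config = BertConfig(
    vocab_size=30_000,
    hidden_size=384,
    num_hidden_layers=6,
    num_attention_heads=6,
    intermediate_size=1536,
    max_position_embeddings=128,
)
model = BertModel(config)
```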

## Training Procedure
- Loss: in-batch contrastive loss (temperature = 0.05); see the sketch below
- Pooling: mean pooling over token embeddings
- Normalization: L2-normalize sentence embeddings
- Optimizer: AdamW
- Learning rate: 3e-4
- Batch size: 64
- Epochs: 2
- Device: CUDA if available, else MPS on macOS, else CPU
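
A minimal sketch of the pooling and loss described above (function names are illustrative; this is an InfoNCE-style objective in which each premise's matching hypothesis is the positive and all other hypotheses in the batch are negatives):

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def in_batch_contrastive_loss(premise_emb, hypothesis_emb, temperature=0.05):
    # L2-normalize, then score every premise against every hypothesis in the batch.
    a = F.normalize(premise_emb, dim=-1)
    b = F.normalize(hypothesis_emb, dim=-1)
    logits = a @ b.T / temperature                      # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal entries are the positives
    return F.cross_entropy(logits, targets)
```

The device fallback in the last bullet is the usual chain, e.g. `"cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"`.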

## Intended Use
- Learning and demo purposes: experimenting with embedding training and similarity search
- Not intended for production use

## Limitations
- Trained from scratch, so quality is lower than that of pretrained encoders
- Trained only on SNLI entailment pairs
- No downstream evaluation is provided

## How to Use
```python
from transformers import BertModel, BertTokenizerFast

model_id = "your-username/embedder-snli"  # placeholder repo id
tokenizer = BertTokenizerFast.from_pretrained(model_id)
model = BertModel.from_pretrained(model_id)
```
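
To get sentence embeddings, apply the same mean pooling and L2 normalization used during training (a sketch; the example sentences are illustrative):

```python
import torch
import torch.nn.functional as F

sentences = ["A man is playing a guitar.", "Someone is playing an instrument."]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# Mean pooling over non-padding tokens, then L2 normalization.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
embeddings = F.normalize(embeddings, dim=-1)

# Cosine similarity between the two sentences (dot product of unit vectors).
print(embeddings[0] @ embeddings[1])
```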

## Citation
```bibtex
@inproceedings{bowman2015snli,
  title={A large annotated corpus for learning natural language inference},
  author={Bowman, Samuel R. and Angeli, Gabor and Potts, Christopher and Manning, Christopher D.},
  booktitle={Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
  year={2015}
}
```