---
language: en
license: apache-2.0
tags:
- sentence-embeddings
- transformers
- bert
- contrastive-learning
datasets:
- snli
---

# Embedder (SNLI)

## Model Description

A lightweight BERT-style encoder trained from scratch on SNLI entailment pairs using an in-batch contrastive loss and mean pooling.

## Training Data

- Dataset: SNLI (`datasets` library)
- Filter: label == entailment
- Subsample: 50,000 pairs from the training split
- Corpus for tokenizer: premises + hypotheses from the filtered pairs

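
The filtering and subsampling steps above can be sketched roughly as follows. This is an illustration, not the original training script: the helper names `select_entailment_pairs` and `tokenizer_corpus`, the random shuffle, and the seed are assumptions, and a small in-memory list stands in for the SNLI training split. The field names (`premise`, `hypothesis`, `label`) and the label id `0` for entailment follow the Hugging Face `snli` dataset.

```python
import random

ENTAILMENT = 0  # label id for "entailment" in the Hugging Face `snli` dataset


def select_entailment_pairs(examples, n_pairs=50_000, seed=0):
    """Keep entailment examples and subsample at most `n_pairs` of them."""
    pairs = [
        (ex["premise"], ex["hypothesis"])
        for ex in examples
        if ex["label"] == ENTAILMENT
    ]
    # Shuffling with a fixed seed is an assumption; the card does not say
    # how the 50,000 pairs were chosen.
    rng = random.Random(seed)
    rng.shuffle(pairs)
    return pairs[:n_pairs]


def tokenizer_corpus(pairs):
    """Flatten (premise, hypothesis) pairs into one list of sentences."""
    return [p for p, _ in pairs] + [h for _, h in pairs]


# Tiny in-memory stand-in for the SNLI training split.
toy_split = [
    {"premise": "A man plays guitar.", "hypothesis": "A person makes music.", "label": 0},
    {"premise": "A dog runs.", "hypothesis": "A cat sleeps.", "label": 2},
    {"premise": "Kids play outside.", "hypothesis": "Children are outdoors.", "label": 0},
    {"premise": "A woman reads.", "hypothesis": "She is at a library.", "label": 1},
]
pairs = select_entailment_pairs(toy_split, n_pairs=50_000)
corpus = tokenizer_corpus(pairs)
```

With the real dataset, the same logic would typically be expressed with the `datasets` methods `filter`, `shuffle`, and `select` on the training split.
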
## Tokenizer

- Type: WordPiece
- Vocab size: 30,000
- Min frequency: 2
- Special tokens: `[PAD] [UNK] [CLS] [SEP] [MASK]`
- Max sequence length: 128

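
One way to train a tokenizer with these settings is the `BertWordPieceTokenizer` class from the `tokenizers` library. The sketch below is an assumption about how this could be done, not a record of the original script: the helper name `train_wordpiece`, the lowercasing choice, and the toy corpus are illustrative.

```python
from tokenizers import BertWordPieceTokenizer

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def train_wordpiece(corpus, vocab_size=30_000, min_frequency=2, max_length=128):
    """Train a WordPiece tokenizer on an iterable of sentences."""
    # Lowercasing is an assumption; the card does not state the casing.
    tokenizer = BertWordPieceTokenizer(lowercase=True)
    tokenizer.train_from_iterator(
        corpus,
        vocab_size=vocab_size,        # upper bound; small corpora yield fewer tokens
        min_frequency=min_frequency,
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    # Truncate encoded sequences to the maximum sequence length.
    tokenizer.enable_truncation(max_length=max_length)
    return tokenizer


# Toy corpus standing in for the premises + hypotheses of the filtered pairs.
toy_corpus = ["a man plays guitar", "a man plays music", "a dog plays outside"] * 2
tokenizer = train_wordpiece(toy_corpus)
```

Note that `vocab_size` is a ceiling: the trained vocabulary can be smaller than 30,000 if the corpus does not contain enough distinct pieces. Calling `tokenizer.save_model(directory)` writes a `vocab.txt` that `BertTokenizerFast` can load.
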
## Architecture

- Model: `BertModel` (trained from scratch)
- Layers: 6
- Hidden size: 384
- Attention heads: 6
- Intermediate size: 1536
- Max position embeddings: 128

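
The hyperparameters above map directly onto a `BertConfig`. The following sketch shows how such an encoder could be instantiated with random weights; the helper name `build_encoder` is hypothetical, and the default `vocab_size` assumes the tokenizer fills its 30,000-token budget (in practice, pass the actual tokenizer vocabulary size).

```python
from transformers import BertConfig, BertModel


def build_encoder(vocab_size=30_000):
    """Build a randomly initialized BERT encoder with the card's dimensions."""
    config = BertConfig(
        vocab_size=vocab_size,
        hidden_size=384,
        num_hidden_layers=6,
        num_attention_heads=6,
        intermediate_size=1536,
        max_position_embeddings=128,
    )
    # Constructing BertModel from a config does not load pretrained weights.
    return BertModel(config)


model = build_encoder()
n_params = sum(p.numel() for p in model.parameters())
# Each attention head works on hidden_size / num_attention_heads dimensions.
head_dim = model.config.hidden_size // model.config.num_attention_heads
```

With 6 heads over a hidden size of 384, each head operates on 64 dimensions.
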
## Training Procedure

- Loss: in-batch contrastive loss (temperature = 0.05)
- Pooling: mean pooling over token embeddings
- Normalization: L2 normalize sentence embeddings
- Optimizer: AdamW
- Learning rate: 3e-4
- Batch size: 64
- Epochs: 2
- Device: CUDA if available, else MPS on macOS, else CPU

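
A minimal PyTorch sketch of the loss and device selection listed above is given below. It assumes a one-directional formulation in which each premise must pick out its own hypothesis among all hypotheses in the batch; the card does not say whether the original loss was one-directional or symmetric. The function names `in_batch_contrastive_loss` and `pick_device` are illustrative.

```python
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(premise_emb, hypothesis_emb, temperature=0.05):
    """Cross-entropy over in-batch similarities; row i's positive is column i."""
    # L2-normalize so that dot products are cosine similarities.
    a = F.normalize(premise_emb, p=2, dim=-1)
    b = F.normalize(hypothesis_emb, p=2, dim=-1)
    logits = a @ b.T / temperature  # shape: (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


def pick_device():
    """Prefer CUDA, then MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


# Random pooled embeddings standing in for one batch of 64 pairs.
premises = torch.randn(64, 384)
hypotheses = torch.randn(64, 384)
loss = in_batch_contrastive_loss(premises, hypotheses)
```

In a training loop, the model parameters would be updated with `torch.optim.AdamW(model.parameters(), lr=3e-4)`, matching the optimizer and learning rate listed above.
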
## Intended Use

- Learning/demo purposes for embedding training and similarity search
- Not intended for production use

## Limitations

- Trained from scratch; quality is likely lower than that of pretrained encoders
- Trained only on SNLI entailment pairs
- No downstream evaluation provided

## How to Use

```python
from transformers import BertModel, BertTokenizerFast

# Placeholder repository ID; replace it with the actual Hub ID of this model.
model_id = "your-username/embedder-snli"
tokenizer = BertTokenizerFast.from_pretrained(model_id)
model = BertModel.from_pretrained(model_id)
```

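
Because the model is a plain `BertModel`, sentence embeddings have to be computed by hand. The sketch below mirrors the training setup (mean pooling over non-padding tokens, then L2 normalization); the helper names `mean_pool` and `embed` are illustrative and not part of the released code.

```python
import torch
import torch.nn.functional as F


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


@torch.no_grad()
def embed(sentences, tokenizer, model, max_length=128):
    """Return L2-normalized sentence embeddings of shape (n, hidden_size)."""
    model.eval()
    batch = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    output = model(**batch)
    pooled = mean_pool(output.last_hidden_state, batch["attention_mask"])
    return F.normalize(pooled, p=2, dim=-1)
```

With a loaded `tokenizer` and `model`, `emb = embed(["A man plays guitar.", "A person makes music."], tokenizer, model)` yields unit-length vectors, so `emb @ emb.T` gives the pairwise cosine similarities.
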
## Citation

```bibtex
@inproceedings{bowman2015snli,
  title={A large annotated corpus for learning natural language inference},
  author={Bowman, Samuel R. and Angeli, Gabor and Potts, Christopher and Manning, Christopher D.},
  booktitle={Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
  year={2015}
}
```