| Our modification to SentEval: | |
| 1. Add the `all` setting to all STS tasks. | |
| 2. Change STS-B and SICK-R to not use an additional regressor. | |
| # SentEval: evaluation toolkit for sentence embeddings | |
| SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks. **SentEval currently includes 17 downstream tasks**. We also include a suite of **10 probing tasks** which evaluate what linguistic properties are encoded in sentence embeddings. Our goal is to ease the study and the development of general-purpose fixed-size sentence representations. | |
| **(04/22) SentEval new tasks: Added probing tasks for evaluating what linguistic properties are encoded in sentence embeddings** | |
| **(10/04) SentEval example scripts for three sentence encoders: [SkipThought-LN](https://github.com/ryankiros/layer-norm#skip-thoughts)/[GenSen](https://github.com/Maluuba/gensen)/[Google-USE](https://tfhub.dev/google/universal-sentence-encoder/1)** | |
| ## Dependencies | |
| This code is written in python. The dependencies are: | |
| * Python 2/3 with [NumPy](http://www.numpy.org/)/[SciPy](http://www.scipy.org/) | |
| * [Pytorch](http://pytorch.org/)>=0.4 | |
| * [scikit-learn](http://scikit-learn.org/stable/index.html)>=0.18.0 | |
| ## Transfer tasks | |
| ### Downstream tasks | |
| SentEval allows you to evaluate your sentence embeddings as features for the following *downstream* tasks: | |
| | Task | Type | #train | #test | needs_train | set_classifier | | |
| |---------- |------------------------------ |-----------:|----------:|:-----------:|:----------:| | |
| | [MR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | movie review | 11k | 11k | 1 | 1 | | |
| | [CR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | product review | 4k | 4k | 1 | 1 | | |
| | [SUBJ](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | subjectivity status | 10k | 10k | 1 | 1 | | |
| | [MPQA](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | opinion-polarity | 11k | 11k | 1 | 1 | | |
| | [SST](https://nlp.stanford.edu/sentiment/index.html) | binary sentiment analysis | 67k | 1.8k | 1 | 1 | | |
| | **[SST](https://nlp.stanford.edu/sentiment/index.html)** | **fine-grained sentiment analysis** | 8.5k | 2.2k | 1 | 1 | | |
| | [TREC](http://cogcomp.cs.illinois.edu/Data/QA/QC/) | question-type classification | 6k | 0.5k | 1 | 1 | | |
| | [SICK-E](http://clic.cimec.unitn.it/composes/sick.html) | natural language inference | 4.5k | 4.9k | 1 | 1 | | |
| | [SNLI](https://nlp.stanford.edu/projects/snli/) | natural language inference | 550k | 9.8k | 1 | 1 | | |
| | [MRPC](https://aclweb.org/aclwiki/Paraphrase_Identification_(State_of_the_art)) | paraphrase detection | 4.1k | 1.7k | 1 | 1 | | |
| | [STS 2012](https://www.cs.york.ac.uk/semeval-2012/task6/) | semantic textual similarity | N/A | 3.1k | 0 | 0 | | |
| | [STS 2013](http://ixa2.si.ehu.es/sts/) | semantic textual similarity | N/A | 1.5k | 0 | 0 | | |
| | [STS 2014](http://alt.qcri.org/semeval2014/task10/) | semantic textual similarity | N/A | 3.7k | 0 | 0 | | |
| | [STS 2015](http://alt.qcri.org/semeval2015/task2/) | semantic textual similarity | N/A | 8.5k | 0 | 0 | | |
| | [STS 2016](http://alt.qcri.org/semeval2016/task1/) | semantic textual similarity | N/A | 9.2k | 0 | 0 | | |
| | [STS B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results) | semantic textual similarity | 5.7k | 1.4k | 1 | 0 | | |
| | [SICK-R](http://clic.cimec.unitn.it/composes/sick.html) | semantic textual similarity | 4.5k | 4.9k | 1 | 0 | | |
| | [COCO](http://mscoco.org/) | image-caption retrieval | 567k | 5*1k | 1 | 0 | | |
| where **needs_train** means a model with parameters is learned on top of the sentence embeddings, and **set_classifier** means you can define the parameters of the classifier in the case of a classification task (see below). | |
| Note: COCO comes with ResNet-101 2048d image embeddings. [More details on the tasks.](https://arxiv.org/pdf/1705.02364.pdf) | |
| ### Probing tasks | |
| SentEval also includes a series of [*probing* tasks](https://github.com/facebookresearch/SentEval/tree/master/data/probing) to evaluate what linguistic properties are encoded in your sentence embeddings: | |
| | Task | Type | #train | #test | needs_train | set_classifier | | |
| |---------- |------------------------------ |-----------:|----------:|:-----------:|:----------:| | |
| | [SentLen](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Length prediction | 100k | 10k | 1 | 1 | | |
| | [WC](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Word Content analysis | 100k | 10k | 1 | 1 | | |
| | [TreeDepth](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Tree depth prediction | 100k | 10k | 1 | 1 | | |
| | [TopConst](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Top Constituents prediction | 100k | 10k | 1 | 1 | | |
| | [BShift](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Word order analysis | 100k | 10k | 1 | 1 | | |
| | [Tense](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Verb tense prediction | 100k | 10k | 1 | 1 | | |
| | [SubjNum](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Subject number prediction | 100k | 10k | 1 | 1 | | |
| | [ObjNum](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Object number prediction | 100k | 10k | 1 | 1 | | |
| | [SOMO](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Semantic odd man out | 100k | 10k | 1 | 1 | | |
| | [CoordInv](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Coordination Inversion | 100k | 10k | 1 | 1 | | |
| ## Download datasets | |
| To get all the transfer tasks datasets, run (in data/downstream/): | |
| ```bash | |
| ./get_transfer_data.bash | |
| ``` | |
| This will automatically download and preprocess the downstream datasets, and store them in data/downstream (warning: for MacOS users, you may have to use p7zip instead of unzip). The probing tasks are already in data/probing by default. | |
| ## How to use SentEval: examples | |
| ### examples/bow.py | |
| In examples/bow.py, we evaluate the quality of the average of word embeddings. | |
| To download state-of-the-art fastText embeddings: | |
| ```bash | |
| curl -Lo glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip | |
| curl -Lo crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip | |
| ``` | |
| To reproduce the results for bag-of-vectors, run (in examples/): | |
| ```bash | |
| python bow.py | |
| ``` | |
| As required by SentEval, this script implements two functions: **prepare** (optional) and **batcher** (required) that turn text sentences into sentence embeddings. Then SentEval takes care of the evaluation on the transfer tasks using the embeddings as features. | |
| ### examples/infersent.py | |
| To get the **[InferSent](https://www.github.com/facebookresearch/InferSent)** model and reproduce our results, download our best models and run infersent.py (in examples/): | |
| ```bash | |
| curl -Lo examples/infersent1.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent1.pkl | |
| curl -Lo examples/infersent2.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent2.pkl | |
| ``` | |
| ### examples/skipthought.py - examples/gensen.py - examples/googleuse.py | |
| We also provide example scripts for three other encoders: | |
| * [SkipThought with Layer-Normalization](https://github.com/ryankiros/layer-norm#skip-thoughts) in Theano | |
| * [GenSen encoder](https://github.com/Maluuba/gensen) in Pytorch | |
| * [Google encoder](https://tfhub.dev/google/universal-sentence-encoder/1) in TensorFlow | |
| Note that for SkipThought and GenSen, following the steps of the associated githubs is necessary. | |
| The Google encoder script should work as-is. | |
| ## How to use SentEval | |
| To evaluate your sentence embeddings, SentEval requires that you implement two functions: | |
| 1. **prepare** (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc) | |
| 2. **batcher** (transforms a batch of text sentences into sentence embeddings) | |
| ### 1.) prepare(params, samples) (optional) | |
| *batcher* only sees one batch at a time while the *samples* argument of *prepare* contains all the sentences of a task. | |
| ``` | |
| prepare(params, samples) | |
| ``` | |
| * *params*: senteval parameters. | |
| * *samples*: list of all sentences from the tranfer task. | |
| * *output*: No output. Arguments stored in "params" can further be used by *batcher*. | |
| *Example*: in bow.py, prepare is is used to build the vocabulary of words and construct the "params.word_vect* dictionary of word vectors. | |
| ### 2.) batcher(params, batch) | |
| ``` | |
| batcher(params, batch) | |
| ``` | |
| * *params*: senteval parameters. | |
| * *batch*: numpy array of text sentences (of size params.batch_size) | |
| * *output*: numpy array of sentence embeddings (of size params.batch_size) | |
| *Example*: in bow.py, batcher is used to compute the mean of the word vectors for each sentence in the batch using params.word_vec. Use your own encoder in that function to encode sentences. | |
| ### 3.) evaluation on transfer tasks | |
| After having implemented the batch and prepare function for your own sentence encoder, | |
| 1) to perform the actual evaluation, first import senteval and set its parameters: | |
| ```python | |
| import senteval | |
| params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10} | |
| ``` | |
| 2) (optional) set the parameters of the classifier (when applicable): | |
| ```python | |
| params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64, | |
| 'tenacity': 5, 'epoch_size': 4} | |
| ``` | |
| You can choose **nhid=0** (Logistic Regression) or **nhid>0** (MLP) and define the parameters for training. | |
| 3) Create an instance of the class SE: | |
| ```python | |
| se = senteval.engine.SE(params, batcher, prepare) | |
| ``` | |
| 4) define the set of transfer tasks and run the evaluation: | |
| ```python | |
| transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark'] | |
| results = se.eval(transfer_tasks) | |
| ``` | |
| The current list of available tasks is: | |
| ```python | |
| ['CR', 'MR', 'MPQA', 'SUBJ', 'SST2', 'SST5', 'TREC', 'MRPC', 'SNLI', | |
| 'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval', | |
| 'STS12', 'STS13', 'STS14', 'STS15', 'STS16', | |
| 'Length', 'WordContent', 'Depth', 'TopConstituents','BigramShift', 'Tense', | |
| 'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion'] | |
| ``` | |
| ## SentEval parameters | |
| Global parameters of SentEval: | |
| ```bash | |
| # senteval parameters | |
| task_path # path to SentEval datasets (required) | |
| seed # seed | |
| usepytorch # use cuda-pytorch (else scikit-learn) where possible | |
| kfold # k-fold validation for MR/CR/SUB/MPQA. | |
| ``` | |
| Parameters of the classifier: | |
| ```bash | |
| nhid: # number of hidden units (0: Logistic Regression, >0: MLP); Default nonlinearity: Tanh | |
| optim: # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..) | |
| tenacity: # how many times dev acc does not increase before training stops | |
| epoch_size: # each epoch corresponds to epoch_size pass on the train set | |
| max_epoch: # max number of epoches | |
| dropout: # dropout for MLP | |
| ``` | |
| Note that to get a proxy of the results while **dramatically reducing computation time**, | |
| we suggest the **prototyping config**: | |
| ```python | |
| params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5} | |
| params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128, | |
| 'tenacity': 3, 'epoch_size': 2} | |
| ``` | |
| which will results in a 5 times speedup for classification tasks. | |
| To produce results that are **comparable to the literature**, use the **default config**: | |
| ```python | |
| params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10} | |
| params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64, | |
| 'tenacity': 5, 'epoch_size': 4} | |
| ``` | |
| which takes longer but will produce better and comparable results. | |
| For probing tasks, we used an MLP with a Sigmoid nonlinearity and and tuned the nhid (in [50, 100, 200]) and dropout (in [0.0, 0.1, 0.2]) on the dev set. | |
| ## References | |
| Please considering citing [[1]](https://arxiv.org/abs/1803.05449) if using this code for evaluating sentence embedding methods. | |
| ### SentEval: An Evaluation Toolkit for Universal Sentence Representations | |
| [1] A. Conneau, D. Kiela, [*SentEval: An Evaluation Toolkit for Universal Sentence Representations*](https://arxiv.org/abs/1803.05449) | |
| ``` | |
| @article{conneau2018senteval, | |
| title={SentEval: An Evaluation Toolkit for Universal Sentence Representations}, | |
| author={Conneau, Alexis and Kiela, Douwe}, | |
| journal={arXiv preprint arXiv:1803.05449}, | |
| year={2018} | |
| } | |
| ``` | |
| Contact: [aconneau@fb.com](mailto:aconneau@fb.com), [dkiela@fb.com](mailto:dkiela@fb.com) | |
| ### Related work | |
| * [J. R Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, S. Fidler - SkipThought Vectors, NIPS 2015](https://arxiv.org/abs/1506.06726) | |
| * [S. Arora, Y. Liang, T. Ma - A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017](https://openreview.net/pdf?id=SyK00v5xx) | |
| * [Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg - Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, ICLR 2017](https://arxiv.org/abs/1608.04207) | |
| * [A. Conneau, D. Kiela, L. Barrault, H. Schwenk, A. Bordes - Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, EMNLP 2017](https://arxiv.org/abs/1705.02364) | |
| * [S. Subramanian, A. Trischler, Y. Bengio, C. J Pal - Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning, ICLR 2018](https://arxiv.org/abs/1804.00079) | |
| * [A. Nie, E. D. Bennett, N. D. Goodman - DisSent: Sentence Representation Learning from Explicit Discourse Relations, 2018](https://arxiv.org/abs/1710.04334) | |
| * [D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil - Universal Sentence Encoder, 2018](https://arxiv.org/abs/1803.11175) | |
| * [A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni - What you can cram into a single vector: Probing sentence embeddings for linguistic properties, ACL 2018](https://arxiv.org/abs/1805.01070) | |