|
|
---
{}
---
|
|
|
|
|
# Model Card for #Encoder
|
|
|
|
|
|
|
|
|
|
This is #Encoder, the retrieval encoder from *HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding*.
|
|
The model encodes a tweet into a topic-level embedding, which can be used to estimate **topic-level similarity** between tweets.
|
|
|
|
|
## Model Details |
|
|
|
|
|
#Encoder leverages hashtags to learn inter-post topic relevance (for retrieval) via contrastive learning over 179M tweets.
|
|
It was pre-trained on pairwise posts, with contrastive learning guiding the model to learn topic relevance by identifying posts that share the same hashtag.
|
|
We randomly noise the hashtags during pre-training to avoid trivial representations.
|
|
Please refer to https://github.com/albertan017/HICL for more details.
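For intuition, here is a minimal sketch of hashtag noising; the mask token, regex, and corruption probability are illustrative assumptions, not the exact recipe used for #Encoder:

```python
import random
import re

HASHTAG = re.compile(r"#\w+")

def noise_hashtags(text: str, p: float = 0.5, mask: str = "<mask>") -> str:
    # Randomly replace hashtags so the encoder cannot rely on them as a shortcut
    return HASHTAG.sub(lambda m: mask if random.random() < p else m.group(0), text)

print(noise_hashtags("great comeback tonight #nba #basketball"))
```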
|
|
|
|
|
 |
|
|
|
|
|
### Model Description |
|
|
|
|
|
|
|
|
|
|
- **Developed by:** Hanzhuo Tan, Department of Computing, The Hong Kong Polytechnic University
|
|
- **Model type:** RoBERTa
|
|
- **Language(s) (NLP):** English |
|
|
- **License:** N/A
|
|
- **Finetuned from model:** BERTweet
|
|
|
|
|
### Model Sources
|
|
|
|
|
|
|
|
|
|
- **Repository:** https://github.com/albertan017/HICL |
|
|
- **Paper:** HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding
|
|
|
|
|
## Uses |
|
|
|
|
|
|
|
|
|
|
Load the model with the `transformers` library and encode a tweet:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load #Encoder and its tokenizer
hashencoder = AutoModel.from_pretrained("albertan017/hashencoder")
tokenizer = AutoTokenizer.from_pretrained("albertan017/hashencoder")

tweet = "here's a sample tweet for encoding"
input_ids = torch.tensor([tokenizer.encode(tweet)])

with torch.no_grad():
    features = hashencoder(input_ids)  # encoder outputs, e.g. last_hidden_state
```
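To estimate topic-level similarity between two tweets, compare their embeddings. Below is a minimal sketch that mean-pools the last hidden states and scores pairs with cosine similarity; the pooling choice and example tweets are illustrative assumptions, not necessarily the exact setup used in the paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

hashencoder = AutoModel.from_pretrained("albertan017/hashencoder")
tokenizer = AutoTokenizer.from_pretrained("albertan017/hashencoder")

def embed(tweet: str) -> torch.Tensor:
    # Tokenize, encode, and mean-pool the token embeddings (illustrative pooling)
    inputs = tokenizer(tweet, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = hashencoder(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1)                             # (1, dim)

a = embed("the new phone camera is unreal #tech")
b = embed("just tested the latest smartphone camera #tech")
print(F.cosine_similarity(a, b).item())  # higher = more topically similar
```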
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
|
|
|
|
|
We do not enforce semantic similarity: #Encoder captures topic-level relevance, so two tweets it scores as similar may share a topic without being close in meaning.
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
|
|
|
|
|
#Encoder is pre-trained on 15 GB of plain text from 179 million tweets (around 4 billion tokens).
|
|
Following the pre-training practice of BERTweet, the raw data was collected from the archived Twitter stream, which contains 4TB of sampled tweets from January 2013 to June 2021.
|
|
For data pre-processing, we ran the following steps.

First, we employed fastText to identify English tweets and kept only those containing hashtags.

Then, low-frequency hashtags appearing in fewer than 100 tweets were filtered out to alleviate sparsity.

After that, we obtained a large-scale dataset of 179M tweets, each with at least one hashtag, covering 180K distinct hashtags in total.
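A minimal sketch of this filtering pipeline is shown below, assuming fastText's `lid.176.bin` language-ID model and a toy list of tweets; both are illustrative stand-ins, not the exact pipeline code.

```python
import re
from collections import Counter

import fasttext  # pip install fasttext; download lid.176.bin separately

lang_model = fasttext.load_model("lid.176.bin")
HASHTAG = re.compile(r"#\w+")

raw_tweets = [
    "loving the new season #tvshows",
    "bonjour tout le monde #bonjour",
    "no hashtags in this one",
]

# Step 1: keep English tweets that contain at least one hashtag
kept, counts = [], Counter()
for tweet in raw_tweets:
    tags = HASHTAG.findall(tweet)
    labels, _ = lang_model.predict(tweet)
    if tags and labels[0] == "__label__en":
        kept.append(tweet)
        counts.update(tag.lower() for tag in tags)

# Step 2: drop hashtags appearing in fewer than 100 tweets to alleviate sparsity
frequent = {tag for tag, c in counts.items() if c >= 100}
corpus = [t for t in kept if any(tag.lower() in frequent for tag in HASHTAG.findall(t))]
```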
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
|
|
|
|
|
To leverage hashtag-gathered context in pre-training, we exploit contrastive learning: #Encoder is trained to identify pairwise posts sharing the same hashtag, thereby gaining topic relevance.
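For intuition, here is a minimal sketch of a hashtag-driven contrastive objective (an in-batch InfoNCE loss over posts paired by a shared hashtag); the loss form, temperature, and random stand-in embeddings are illustrative assumptions, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def hashtag_contrastive_loss(anchor_emb: torch.Tensor,
                             positive_emb: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    # Row i of both tensors embeds two posts sharing a hashtag;
    # every other row in the batch serves as an in-batch negative.
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))       # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: random vectors stand in for #Encoder outputs
a, p = torch.randn(8, 768), torch.randn(8, 768)
print(hashtag_contrastive_loss(a, p))
```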
|
|
|
|
|
|
|
|
## Citation
|
|
|
|
|
|
|
|
|
|
**BibTeX:** |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
**APA:** |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
|
|
|
|
|
|
|