How to use zen-E/bert-mini-sentence-distil-unsupervised with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="zen-E/bert-mini-sentence-distil-unsupervised")

# Load model directly
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("zen-E/bert-mini-sentence-distil-unsupervised")
model = AutoModel.from_pretrained("zen-E/bert-mini-sentence-distil-unsupervised")

The model is trained by knowledge distillation from the teacher "princeton-nlp/unsup-simcse-roberta-large" to the student "prajjwal1/bert-mini" on the "ffgcc/NEWS5M" dataset.
Inference can be run with AutoModel.
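As a minimal sketch of inference with AutoModel, the helpers below encode sentences and compare them by cosine similarity. The function names `encode`, `pairwise_cosine`, and `demo` are illustrative and not part of the model card. Using `pooler_output` as the sentence embedding is an assumption, chosen to match the `_, o = self.bert(**sentences)` unpacking in the training forward function.

```python
import torch
import torch.nn.functional as F


def encode(sentences, tokenizer, model):
    """Return one embedding per sentence (assumption: the pooled output is used)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return outputs.pooler_output


def pairwise_cosine(a, b):
    """Cosine similarity between every row of a and every row of b."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T


def demo():
    """Downloads the model from the Hub, so it needs network access."""
    from transformers import AutoTokenizer, AutoModel

    name = "zen-E/bert-mini-sentence-distil-unsupervised"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    sentences = ["A man is playing a guitar.", "Someone plays an instrument."]
    embs = encode(sentences, tokenizer, model)
    print(pairwise_cosine(embs, embs))
```

Calling `demo()` prints a 2 x 2 similarity matrix for the two example sentences.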
On the STS-B test set, the model achieves a Pearson correlation (pearsonr) of 0.825 and a Spearman correlation (spearmanr) of 0.83.
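These scores are typically computed as correlations between the model's predicted similarities and the human-annotated gold scores. The sketch below shows the two metrics in plain Python. The helpers `pearson`, `ranks`, and `spearman` are illustrative stand-ins for library routines, and the numbers are toy values, not STS-B data.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy numbers for illustration only; not real STS-B data.
gold = [0.0, 1.0, 2.5, 4.0, 5.0]
pred = [0.10, 0.20, 0.55, 0.70, 0.95]
print(pearson(pred, gold), spearman(pred, gold))
```

Spearman depends only on the ordering of the predictions, so it is insensitive to the scale of the cosine similarities, whereas Pearson also rewards a linear relationship with the gold scores.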
For more training details, the training config and the PyTorch forward function are as follows:
config = {
    'epoch': 200,
    'learning_rate': 3e-4,
    'batch_size': 12288,
    'temperature': 0.05,
}
import torch.nn as nn
import torch.nn.functional as F

# Method of the training module; `cosine_sim` is a helper that is not shown here.
def forward_cos_mse_kd_unsup(self, sentences, teacher_sentence_embs):
    """Forward function for the unsupervised NEWS5M dataset."""
    # o is the student sentence embedding; this unpacking assumes that
    # self.bert returns a (sequence_output, pooled_output) tuple
    _, o = self.bert(**sentences)

    # cosine similarity between the first half batch and the second half batch
    half_batch = o.size(0) // 2
    higher_half = half_batch * 2  # skip the last datapoint when the batch size is odd

    cos_sim = cosine_sim(o[:half_batch], o[half_batch:higher_half])
    cos_sim_teacher = cosine_sim(teacher_sentence_embs[:half_batch],
                                 teacher_sentence_embs[half_batch:higher_half])

    # KL divergence between student and teacher probabilities
    soft_teacher_probs = F.softmax(cos_sim_teacher / self.temperature, dim=1)
    kd_contrastive_loss = F.kl_div(F.log_softmax(cos_sim / self.temperature, dim=1),
                                   soft_teacher_probs,
                                   reduction='batchmean')

    # MSE loss between student and teacher embeddings
    kd_mse_loss = nn.MSELoss()(o, teacher_sentence_embs) / 3

    # equal weight for the two losses
    total_loss = kd_contrastive_loss * 0.5 + kd_mse_loss * 0.5
    return total_loss, kd_contrastive_loss, kd_mse_loss
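
The forward function relies on a `cosine_sim` helper that is not shown. The sketch below is one plausible reading, not the author's code: since the softmax is taken over `dim=1`, `cosine_sim` is assumed to return a pairwise similarity matrix between the two half batches. The standalone `cos_mse_kd_loss` function, the batch size of 7, and the embedding size of 256 are illustrative choices, and student and teacher embeddings are assumed to have the same dimensionality.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cosine_sim(a, b):
    """Hypothetical helper: (n, d) and (m, d) inputs -> (n, m) cosine similarities."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T


def cos_mse_kd_loss(student, teacher, temperature=0.05):
    """Standalone version of the loss, taking precomputed embeddings."""
    half = student.size(0) // 2
    upper = half * 2  # drops the last row when the batch size is odd
    sim_student = cosine_sim(student[:half], student[half:upper])
    sim_teacher = cosine_sim(teacher[:half], teacher[half:upper])

    # KL divergence between the temperature-scaled similarity distributions
    kd = F.kl_div(F.log_softmax(sim_student / temperature, dim=1),
                  F.softmax(sim_teacher / temperature, dim=1),
                  reduction="batchmean")
    # MSE between the embeddings themselves, scaled by 1/3 as in the original
    mse = nn.MSELoss()(student, teacher) / 3
    return 0.5 * kd + 0.5 * mse, kd, mse


# Random stand-ins for real embeddings, for illustration only.
torch.manual_seed(0)
teacher = torch.randn(7, 256)
student = teacher + 0.1 * torch.randn(7, 256)
total, kd, mse = cos_mse_kd_loss(student, teacher)
print(total.item(), kd.item(), mse.item())
```

Both terms vanish when the student reproduces the teacher embeddings exactly. The KL term only constrains the relative similarities between sentences, while the MSE term pulls the student embeddings toward the teacher's coordinates.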