Alibaba-NLP/gte-en-mlm-large gets worse MLM loss than Alibaba-NLP/gte-en-mlm-base
#2 — opened by zhichao-geng
In my downstream fine-tuning task, I noticed that gte-large suffers model collapse during training. I then checked the MLM loss of both models and found that gte-large consistently gets a larger MLM loss than gte-base.
Here is an example using msmarco docs:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
all_docs = [
"The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.",
"The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.",
"Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade.",
"The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 … 2-1946 under the control of the U.S. Army Corps of Engineers, under the administration of General Leslie R. Groves.",
"versions of each volume as well as complementary websites. The first website—The Manhattan Project: An Interactive History—is available on the Office of History and Heritage Resources website, http://www.cfo. doe.gov/me70/history. The Office of History and Heritage Resources and the National Nuclear Security"
]
for model_id in ["Alibaba-NLP/gte-en-mlm-base", "Alibaba-NLP/gte-en-mlm-large"]:
    model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
    print(model_id)
    for input_text in all_docs:
        inputs = tokenizer(input_text, return_tensors="pt")
        # labels are the unmasked inputs, so the loss here is computed over
        # all token positions rather than only masked ones
        inputs["labels"] = inputs["input_ids"].clone()
        outputs = model(**inputs)
        loss = outputs.loss
        print(loss)
# output
Alibaba-NLP/gte-en-mlm-base
tensor(0.7704, grad_fn=<NllLossBackward0>)
tensor(1.0996, grad_fn=<NllLossBackward0>)
tensor(0.8022, grad_fn=<NllLossBackward0>)
tensor(0.7335, grad_fn=<NllLossBackward0>)
tensor(0.7789, grad_fn=<NllLossBackward0>)
Alibaba-NLP/gte-en-mlm-large
tensor(1.8980, grad_fn=<NllLossBackward0>)
tensor(2.6543, grad_fn=<NllLossBackward0>)
tensor(1.9553, grad_fn=<NllLossBackward0>)
tensor(1.5670, grad_fn=<NllLossBackward0>)
tensor(2.3082, grad_fn=<NllLossBackward0>)
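(Side note on methodology: in the snippet above the labels equal the unmasked inputs, so the reported loss scores every token position rather than only masked ones. The base/large gap should still be meaningful, but a comparison closer to the MLM pre-training objective would mask a fraction of tokens and score only those. A minimal sketch in plain PyTorch, where `mask_for_mlm` is a hypothetical helper and `mask_token_id` would come from `tokenizer.mask_token_id`:)

```python
import torch

def mask_for_mlm(input_ids, mask_token_id, mlm_probability=0.15, generator=None):
    """Return (masked_input_ids, labels): each token is replaced by [MASK]
    with probability `mlm_probability`; labels are -100 (ignored by the
    cross-entropy loss) everywhere except the masked positions."""
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix, generator=generator).bool()
    labels[~masked_indices] = -100            # score only the masked positions
    masked_input_ids = input_ids.clone()
    masked_input_ids[masked_indices] = mask_token_id
    return masked_input_ids, labels

# usage with the loop above would be roughly:
#   ids, labels = mask_for_mlm(inputs["input_ids"], tokenizer.mask_token_id)
#   loss = model(input_ids=ids, labels=labels).loss
```

(This mirrors the masking step of `transformers`' `DataCollatorForLanguageModeling`, minus its 80/10/10 mask/random/keep split, which I've omitted for brevity.)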
This result is counterintuitive. Can you confirm that the checkpoint currently being served is the correct one?