---
language:
- en
tags:
- fill-mask
- masked-lm
- feature-extraction
- semantic-similarity
- historical-text
- newspapers
license: mit
pipeline_tag: fill-mask
---
# NewsBERT
NewsBERT is a domain-adapted masked language model based on `google-bert/bert-base-uncased`. It was fine-tuned with a masked language modeling (MLM) objective on historical English newspaper text (1800-1920) from the following two collections:

NewsBERT retains the architecture and WordPiece vocabulary of BERT-base (uncased); only the weights were adapted to these datasets.
## Model Details
- **Model type:** `BertForMaskedLM`
- **Base model:** `google-bert/bert-base-uncased`
- **Vocabulary:** WordPiece (30,522 tokens)
- **Hidden size:** 768
- **Layers:** 12
- **Attention heads:** 12
- **Max sequence length:** 512
- **Fine-tuning objective:** Masked language modeling (MLM)
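As a sanity check, the hyperparameters above imply the usual BERT-base parameter budget. A back-of-the-envelope count (pure Python, no model download needed; the FFN width of 3072, i.e. 4x the hidden size, is the standard BERT-base value and an assumption here, since it is not listed above):

```python
# Rough parameter count implied by the hyperparameters above (BERT-base).
vocab, hidden, layers, ffn, max_len = 30522, 768, 12, 3072, 512

# Embeddings: token + position + segment (2 types) tables, each of width `hidden`.
embeddings = (vocab + max_len + 2) * hidden

# Per encoder layer: Q/K/V/O projections, two FFN matrices, biases, LayerNorms.
attention = 4 * (hidden * hidden + hidden)
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
layer_norms = 2 * (2 * hidden)
per_layer = attention + feed_forward + layer_norms

total = embeddings + layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # close to the familiar ~110M of BERT-base
```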
## How to Use
### 1. Fill-Mask Pipeline
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "npedrazzini/NewsBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text = "The [MASK] was published in the newspaper."
preds = fill_mask(text)
for p in preds:
    print(f"{p['sequence']} (score={p['score']:.4f})")
```
### 2. Use as an Encoder (CLS Embeddings)
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "npedrazzini/NewsBERT"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

def encode(text, max_length=512):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
    ).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token
    return embedding.squeeze(0).cpu()  # shape: [768]

embedding = encode("Example newspaper article text...")
print(embedding.shape)  # torch.Size([768])
```
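The CLS vector is one common choice; mean pooling over all token states (excluding padding) is a frequent alternative for sentence-level similarity. A minimal sketch, with dummy tensors standing in for the model's `last_hidden_state` and the tokenizer's `attention_mask`:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).float()      # [batch, seq, 1]
    summed = (last_hidden_state * mask).sum(dim=1)   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# Dummy tensors in place of real model outputs: batch of 2, seq len 6, hidden 768.
hidden = torch.randn(2, 6, 768)
attn = torch.tensor([[1, 1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1, 1]])
pooled = mean_pool(hidden, attn)
print(pooled.shape)  # torch.Size([2, 768])
```

With the real model, `hidden` would be `outputs.last_hidden_state` and `attn` would be `inputs["attention_mask"]`.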
### 3. Compute Similarity Between Two Articles
```python
import torch.nn.functional as F

e1 = encode("Article text one...")
e2 = encode("Another article...")

cos_sim = F.cosine_similarity(e1, e2, dim=0)
print("Cosine similarity:", cos_sim.item())
```
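To compare many articles at once, stacking the embeddings and unit-normalizing each row gives the full pairwise cosine-similarity matrix in a single matrix multiply. A sketch with random vectors standing in for embeddings produced by `encode()`:

```python
import torch
import torch.nn.functional as F

emb = torch.randn(4, 768)         # e.g. four article embeddings, stacked row-wise
normed = F.normalize(emb, dim=1)  # unit-normalize each row
sim = normed @ normed.T           # [4, 4] cosine-similarity matrix
print(sim.shape)                  # torch.Size([4, 4])
```

The diagonal is 1 (each article compared with itself) and the matrix is symmetric.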