---
language:
- en
tags:
- fill-mask
- masked-lm
- feature-extraction
- semantic-similarity
- historical-text
- newspapers
license: mit
pipeline_tag: fill-mask
---
# NewsBERT_1800-1920

**NewsBERT_1800-1920** is a domain-adapted masked language model based on [`google-bert/bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased). It has been fine-tuned with a **masked language modeling (MLM)** objective on all of the **historical English newspaper text** (1800-1920) in the following two collections:
- [HMD14](https://bl.iro.bl.uk/concern/datasets/2800eb7d-8b49-4398-a6e9-c2c5692a1304)
- [LwM](https://bl.iro.bl.uk/concern/datasets/99dc570a-9460-48ac-baed-9d2b8c4c13c0?locale=en)

NewsBERT_1800-1920 retains the architecture and vocabulary of BERT-base (uncased); only the weights were adapted to these datasets.

---

## Model Details

- **Model type:** `BertForMaskedLM`
- **Base model:** `google-bert/bert-base-uncased`
- **Vocabulary:** WordPiece (30,522 tokens)
- **Hidden size:** 768
- **Layers:** 12
- **Heads:** 12
- **Max sequence length:** 512
- **Fine-tuning objective:** Masked language modeling (MLM)
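
The figures above are enough to estimate the size of the encoder, which comes to roughly 110 million weights. The following is a minimal sketch of that arithmetic, assuming the standard BERT-base values that this card does not list (feed-forward size 3,072 and two segment types) and counting the encoder as loaded by `AutoModel`, pooler included. The function name is illustrative.

```python
def bert_encoder_params(vocab=30522, hidden=768, layers=12, max_len=512,
                        ffn=3072, segments=2):
    """Count the weights of a BERT encoder from its hyperparameters.

    `ffn` and `segments` are assumed BERT-base defaults, not values
    stated in this model card.
    """
    layer_norm = 2 * hidden  # scale and bias

    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_len + segments) * hidden + layer_norm

    # Self-attention: query, key, value and output projections (with biases).
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Feed-forward block: hidden -> ffn -> hidden (with biases).
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    # Pooler: one dense layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


print(f"{bert_encoder_params():,}")  # 109,482,240
```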

---

## How to Use

### 1. Fill-Mask Pipeline

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "TextMachineProject/NewsBERT_1800-1920"

# Load the tokenizer and the model with its MLM head from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# The input must contain the tokenizer's mask token, [MASK].
text = "The [MASK] was published in the newspaper."
preds = fill_mask(text)

# Each prediction holds the completed sentence and its score.
for p in preds:
    print(f"{p['sequence']} (score={p['score']:.4f})")
```
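
When only the candidate words are needed, the prediction dictionaries can be reduced to (token, score) pairs. A minimal sketch, assuming the usual fill-mask output keys `token_str` and `score`; `top_tokens` and `min_score` are illustrative names, and the sample predictions are invented placeholders rather than real model output.

```python
def top_tokens(preds, min_score=0.0):
    """Return (token, score) pairs with score >= min_score, best first."""
    pairs = [(p["token_str"], p["score"]) for p in preds if p["score"] >= min_score]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Placeholder predictions in the shape returned by the pipeline
# (invented values, not real model output).
example_preds = [
    {"token_str": "letter", "score": 0.12,
     "sequence": "the letter was published in the newspaper."},
    {"token_str": "report", "score": 0.30,
     "sequence": "the report was published in the newspaper."},
    {"token_str": "poem", "score": 0.02,
     "sequence": "the poem was published in the newspaper."},
]

for token, score in top_tokens(example_preds, min_score=0.05):
    print(f"{token}: {score:.2f}")
```

With a loaded pipeline, `top_tokens(fill_mask(text, top_k=10))` would list the ten highest-scoring fillers; `top_k` is an argument of the fill-mask pipeline call.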

### 2. Use as an Encoder (CLS Embeddings)

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "TextMachineProject/NewsBERT_1800-1920"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
# AutoModel loads the encoder only; the MLM head is not needed for embeddings.
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

def encode(text, max_length=512):
    """Return the [CLS] embedding of `text` as a 768-dimensional CPU tensor."""
    # Input longer than max_length tokens is truncated.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)
        embedding = outputs.last_hidden_state[:, 0, :]  # CLS token

    return embedding.squeeze(0).cpu()  # [768]

embedding = encode("Example newspaper article text...")
print(embedding.shape)  # torch.Size([768])
```
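
`encode` truncates its input at 512 tokens, so the tail of a long article does not contribute to the embedding. One possible workaround (not described by the model authors) is to split the token IDs into overlapping windows, encode each window, and average the resulting vectors. The sketch below shows only the windowing step in plain Python; `sliding_windows`, `window`, and `overlap` are illustrative names, and the default of 510 leaves room for the `[CLS]` and `[SEP]` tokens.

```python
def sliding_windows(token_ids, window=510, overlap=64):
    """Split a list of token IDs into overlapping windows.

    Each window holds at most `window` IDs, and consecutive windows share
    `overlap` IDs so that text at a boundary is seen with context on both sides.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window reached the end of the article
        start += step
    return windows


# Stand-in IDs for a 1,200-token article (integers only, no tokenizer needed).
article_ids = list(range(1200))
chunks = sliding_windows(article_ids)
print([len(c) for c in chunks])  # [510, 510, 308]
```

Token IDs without special tokens can be obtained with `tokenizer(text, add_special_tokens=False)["input_ids"]`; each window then needs `[CLS]` and `[SEP]` added before it is passed to the model.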

### 3. Compute Similarity Between Two Articles

```python
import torch.nn.functional as F

# Reuses encode() from the previous section.
e1 = encode("Article text one...")
e2 = encode("Another article...")

# Both embeddings are 1-D tensors of shape [768], so compare along dim 0.
cos_sim = F.cosine_similarity(e1, e2, dim=0)
print("Cosine similarity:", cos_sim.item())
```
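
To compare one article against several, the same cosine formula can be applied to each candidate and the results sorted. The following dependency-free sketch spells out the arithmetic on plain lists of floats (for example, `encode(text).tolist()`); `cosine` and `rank_by_similarity` are illustrative names, and the three-dimensional vectors are toy stand-ins for 768-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Sort {name: vector} candidates by similarity to `query`, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for article embeddings.
query = [1.0, 0.0, 0.0]
candidates = {
    "article_a": [2.0, 0.0, 0.0],  # same direction as the query
    "article_b": [1.0, 1.0, 0.0],  # 45 degrees from the query
    "article_c": [0.0, 3.0, 0.0],  # orthogonal to the query
}

for name, score in rank_by_similarity(query, candidates):
    print(f"{name}: {score:.3f}")
```

For large collections, stacking the embeddings into one tensor and computing all similarities with a single matrix product is much faster than this loop.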