---
language:
- en
tags:
- fill-mask
- masked-lm
- feature-extraction
- semantic-similarity
- historical-text
- newspapers
license: mit
pipeline_tag: fill-mask
---

# NewsBERT_1800-1920

**NewsBERT_1800-1920** is a domain-adapted masked language model based on [`google-bert/bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased). It has been fine-tuned with a **masked language modeling (MLM)** objective on all **historical English newspaper text** (1800-1920) from the following two collections:

- [HMD14](https://bl.iro.bl.uk/concern/datasets/2800eb7d-8b49-4398-a6e9-c2c5692a1304)
- [LwM](https://bl.iro.bl.uk/concern/datasets/99dc570a-9460-48ac-baed-9d2b8c4c13c0?locale=en)

NewsBERT_1800-1920 retains the architecture and vocabulary of BERT-base (uncased); only the weights have been adapted to these datasets.

---

## Model Details

- **Model type:** `BertForMaskedLM`
- **Base model:** `google-bert/bert-base-uncased`
- **Vocabulary:** WordPiece (30,522 tokens)
- **Hidden size:** 768
- **Layers:** 12
- **Heads:** 12
- **Max sequence length:** 512
- **Fine-tuning objective:** Masked language modeling (MLM)

---

## How to Use

### 1. Fill-Mask Pipeline

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "TextMachineProject/NewsBERT_1800-1920"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text = "The [MASK] was published in the newspaper."
preds = fill_mask(text)

# Print each candidate completion with its probability
for p in preds:
    print(f"{p['sequence']} (score={p['score']:.4f})")
```

### 2. Use as an Encoder (CLS Embeddings)

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "TextMachineProject/NewsBERT_1800-1920"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

def encode(text, max_length=512):
    # Text longer than max_length tokens is truncated
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    embedding = outputs.last_hidden_state[:, 0, :]  # CLS token
    return embedding.squeeze(0).cpu()  # [768]

embedding = encode("Example newspaper article text...")
print(embedding.shape)  # torch.Size([768])
```

### 3. Compute Similarity Between Two Articles

```python
import torch.nn.functional as F

# Reuses encode() from the previous example
e1 = encode("Article text one...")
e2 = encode("Another article...")

cos_sim = F.cosine_similarity(e1, e2, dim=0)
print("Cosine similarity:", cos_sim.item())
```
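
Because the maximum sequence length is 512 tokens, the `encode` helper above truncates longer articles and ignores everything past the cutoff. One possible workaround, shown here only as a minimal sketch, is to split an article's token ids into overlapping windows and encode each window separately. The function name `split_into_windows` and its parameters `stride` and `num_special_tokens` are hypothetical and not part of this model or of `transformers`; the default overlap of 128 tokens is an arbitrary assumption.

```python
def split_into_windows(token_ids, max_length=512, stride=128, num_special_tokens=2):
    """Split token ids into overlapping windows that fit the model's input limit.

    Hypothetical helper (not part of the model or of transformers).
    Each window holds at most max_length - num_special_tokens ids, which
    leaves room for [CLS] and [SEP]. Consecutive windows share `stride` ids.
    """
    window = max_length - num_special_tokens
    if window <= 0:
        raise ValueError("max_length must be larger than num_special_tokens")
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    if not token_ids:
        return []

    step = window - stride  # how far the window start advances each time
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Stand-in ids for illustration; real ids would come from the tokenizer
example_ids = list(range(1000))
example_windows = split_into_windows(example_ids)
print([len(w) for w in example_windows])
```

In practice the ids could come from `tokenizer(text, add_special_tokens=False)["input_ids"]`, with `tokenizer.cls_token_id` and `tokenizer.sep_token_id` added around each window before it is passed to the model. How the per-window CLS embeddings are then combined (for example by averaging) is a design choice this model card does not prescribe. The tokenizer's own `return_overflowing_tokens=True` and `stride` arguments may offer a similar result without a custom helper.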