---
language:
- en
tags:
- fill-mask
- masked-lm
- feature-extraction
- semantic-similarity
- historical-text
- newspapers
license: mit
pipeline_tag: fill-mask
---
# NewsBERT_1800-1920
**NewsBERT_1800-1920** is a domain-adapted masked language model based on [`google-bert/bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased). It has been fine-tuned with a **masked language modeling (MLM)** objective on all **historical English newspaper text** (1800-1920) from the following two collections:
- [HMD14](https://bl.iro.bl.uk/concern/datasets/2800eb7d-8b49-4398-a6e9-c2c5692a1304)
- [LwM](https://bl.iro.bl.uk/concern/datasets/99dc570a-9460-48ac-baed-9d2b8c4c13c0?locale=en)
NewsBERT_1800-1920 retains the architecture and vocabulary of BERT-base (uncased); only the weights were updated during fine-tuning on these datasets.
---
## Model Details
- **Model type:** `BertForMaskedLM`
- **Base model:** `google-bert/bert-base-uncased`
- **Vocabulary:** WordPiece (30,522 tokens)
- **Hidden size:** 768
- **Layers:** 12
- **Heads:** 12
- **Max sequence length:** 512
- **Fine-tuning objective:** Masked language modeling (MLM)
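
As a quick sanity check on the figures above, the per-head dimension and the size of the embedding tables follow directly from them. This is only an arithmetic sketch; the constant names are illustrative and not part of the model's API.

```python
# Values copied from the Model Details list above.
VOCAB_SIZE = 30_522
HIDDEN_SIZE = 768
NUM_HEADS = 12
MAX_POSITIONS = 512

# Each attention head works on an equal slice of the hidden state.
head_dim = HIDDEN_SIZE // NUM_HEADS

# One row of HIDDEN_SIZE weights per vocabulary entry and per position.
token_embedding_params = VOCAB_SIZE * HIDDEN_SIZE
position_embedding_params = MAX_POSITIONS * HIDDEN_SIZE

print(head_dim)                   # 64
print(token_embedding_params)     # 23440896
print(position_embedding_params)  # 393216
```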
---
## How to Use
### 1. Fill-Mask Pipeline
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
model_id = "TextMachineProject/NewsBERT_1800-1920"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = "The [MASK] was published in the newspaper."
preds = fill_mask(text)
for p in preds:
    print(f"{p['sequence']} (score={p['score']:.4f})")
```
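
When probing the model on real newspaper sentences, it can be convenient to build the masked input programmatically rather than by hand. The helper below is a minimal sketch: `mask_word` is a hypothetical name, and the default `"[MASK]"` assumes BERT's mask token (in practice `tokenizer.mask_token` can be passed instead).

```python
import re


def mask_word(text, word, mask_token="[MASK]"):
    """Replace the first whole-word occurrence of `word` with the mask token.

    Matching is case-insensitive, in line with the uncased vocabulary.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
    # A function is used as the replacement so the token is inserted literally.
    masked, count = pattern.subn(lambda m: mask_token, text, count=1)
    if count == 0:
        raise ValueError(f"{word!r} not found in text")
    return masked


print(mask_word("The report was published in the newspaper.", "report"))
# The [MASK] was published in the newspaper.
```

The returned string can be passed to `fill_mask` as in the example above, which makes it easy to check whether the model recovers the original word.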
### 2. Use as an Encoder (CLS Embeddings)
```python
import torch
from transformers import AutoTokenizer, AutoModel
model_id = "TextMachineProject/NewsBERT_1800-1920"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()
def encode(text, max_length=512):
    # Tokenize, truncating anything beyond the model's maximum sequence length
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length
    ).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    embedding = outputs.last_hidden_state[:, 0, :]  # CLS token
    return embedding.squeeze(0).cpu()  # [768]

embedding = encode("Example newspaper article text...")
print(embedding.shape)  # torch.Size([768])
```
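
Because `encode` truncates at 512 tokens, longer articles lose everything past that point. One possible workaround, sketched here as an assumption rather than a method used for this model, is to split the token ids into overlapping windows and embed each window separately. `chunk_token_ids` is a hypothetical helper; the default window of 510 leaves room for the `[CLS]` and `[SEP]` tokens.

```python
def chunk_token_ids(token_ids, window=510, stride=255):
    """Split a list of token ids into overlapping windows.

    Consecutive windows start `stride` ids apart, so with the defaults
    each window overlaps the previous one by half.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(token_ids):
            break
        start += stride
    return chunks


# Stand-in ids; real ids would come from the tokenizer.
chunks = chunk_token_ids(list(range(1200)))
print([len(c) for c in chunks])  # [510, 510, 510, 435]
```

The ids themselves could be obtained with `tokenizer(text, add_special_tokens=False)["input_ids"]`. How the per-window embeddings are combined afterwards (for example by averaging) is a design choice that this card does not prescribe.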
### 3. Compute Similarity Between Two Articles
```python
# Reuses encode() from the encoder example above
import torch.nn.functional as F
e1 = encode("Article text one...")
e2 = encode("Another article...")
cos_sim = F.cosine_similarity(e1, e2, dim=0)
print("Cosine similarity:", cos_sim.item())
```
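
To make the similarity score concrete, the following is a plain-Python sketch of the cosine computation, extended to rank several candidate articles against a query. The function names are illustrative, and returning `0.0` for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors standing in for 768-dimensional embeddings.
ranking = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print([i for i, _ in ranking])  # [1, 2, 0]
```

Embeddings returned by `encode` are tensors, so they would need converting with `.tolist()` before being passed to these functions.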