---
language:
- en
tags:
- fill-mask
- masked-lm
- feature-extraction
- semantic-similarity
- historical-text
- newspapers
license: mit
pipeline_tag: fill-mask
---

# NewsBERT_1800-1920

**NewsBERT_1800-1920** is a domain-adapted masked language model based on [`google-bert/bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased). It was fine-tuned with a **masked language modeling (MLM)** objective on all of the **historical English newspaper text** (1800-1920) in the following two collections:
- [HMD14](https://bl.iro.bl.uk/concern/datasets/2800eb7d-8b49-4398-a6e9-c2c5692a1304)
- [LwM](https://bl.iro.bl.uk/concern/datasets/99dc570a-9460-48ac-baed-9d2b8c4c13c0?locale=en)

NewsBERT_1800-1920 retains the architecture and vocabulary of BERT-base (uncased); only the weights were adapted to these datasets.

---

## Model Details

- **Model type:** `BertForMaskedLM`
- **Base model:** `google-bert/bert-base-uncased`
- **Vocabulary:** WordPiece (30,522 tokens)
- **Hidden size:** 768
- **Layers:** 12
- **Heads:** 12
- **Max sequence length:** 512
- **Fine-tuning objective:** Masked language modeling (MLM)
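
As a rough sanity check on these dimensions, the size of the encoder can be estimated by hand. The sketch below assumes the standard BERT-base settings that are not listed above (a feed-forward size of 3072, two token-type embeddings, and a pooler layer); `count_bert_params` is a hypothetical helper written for illustration, not part of `transformers`.

```python
def count_bert_params(vocab=30522, hidden=768, layers=12, max_pos=512,
                      intermediate=3072, type_vocab=2):
    """Estimate the parameter count of a BERT encoder (without the MLM head).

    `intermediate` and `type_vocab` are assumed BERT-base defaults; they are
    not stated in this model card.
    """
    layer_norm = 2 * hidden  # scale + bias

    # Word, position, and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections (+ LayerNorm).
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward: up-projection and down-projection (+ LayerNorm).
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    encoder = layers * (attention + feed_forward)

    pooler = hidden * hidden + hidden
    return embeddings + encoder + pooler


total = count_bert_params()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

The number of attention heads does not appear in the count because the heads partition the same 768-dimensional projections rather than adding new ones.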

---

## How to Use

### 1. Fill-Mask Pipeline

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "TextMachineProject/NewsBERT_1800-1920"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

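# [MASK] is the mask token of the BERT tokenizer.
text = "The [MASK] was published in the newspaper."
preds = fill_mask(text)

# Each prediction holds the completed sentence and its probability.
for p in preds:
    print(f"{p['sequence']} (score={p['score']:.4f})")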
```

### 2. Use as an Encoder (CLS Embeddings)

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "TextMachineProject/NewsBERT_1800-1920"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

def encode(text, max_length=512):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)
        embedding = outputs.last_hidden_state[:, 0, :]  # CLS token

    return embedding.squeeze(0).cpu()  # [768]

embedding = encode("Example newspaper article text...")
print(embedding.shape)  # torch.Size([768])
```
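
`encode` truncates its input at `max_length` tokens, so the part of a long article beyond the 512-token limit is ignored. One common workaround, which is not part of this model card's original recipe, is to split the token ids into overlapping windows, encode each window, and average the resulting vectors. The sketch below shows only the windowing step on a plain list of ids; `split_into_windows`, `window`, and `stride` are hypothetical names chosen for illustration.

```python
def split_into_windows(token_ids, window=510, stride=255):
    """Split a list of token ids into overlapping windows.

    window=510 leaves room for the [CLS] and [SEP] tokens within the
    512-token limit; stride=255 (half-window overlap) is an arbitrary choice.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]

    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window reached the end of the article
        start += stride
    return windows


# Stand-in ids 0..1199 for an article that is 1200 tokens long.
chunks = split_into_windows(list(range(1200)))
print([(c[0], c[-1]) for c in chunks])
```

Token ids without special tokens can be obtained with `tokenizer(text, add_special_tokens=False)["input_ids"]`; each window would then need `[CLS]` and `[SEP]` added before being passed to the model.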


### 3. Compute Similarity Between Two Articles

```python
import torch.nn.functional as F

# Reuses encode() from section 2; each call returns a [768] tensor.
e1 = encode("Article text one...")
e2 = encode("Another article...")

cos_sim = F.cosine_similarity(e1, e2, dim=0)
print("Cosine similarity:", cos_sim.item())
```
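
To compare one article against several, the same cosine similarity can be computed in a single batched call. This is a minimal sketch that works on any 1-D embedding tensors, such as those returned by `encode` above; `rank_by_similarity` is a hypothetical helper name, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import torch
import torch.nn.functional as F


def rank_by_similarity(query, candidates):
    """Return (index, cosine similarity) pairs, most similar first.

    query: 1-D tensor; candidates: list of 1-D tensors of the same size.
    """
    stacked = torch.stack(candidates)  # [n, dim]
    # Broadcast the query against every candidate row.
    sims = F.cosine_similarity(query.unsqueeze(0), stacked, dim=1)  # [n]
    order = torch.argsort(sims, descending=True)
    return [(int(i), float(sims[i])) for i in order]


# Toy vectors for illustration only (not real model output).
query = torch.tensor([1.0, 0.0, 0.0])
candidates = [
    torch.tensor([0.0, 1.0, 0.0]),   # orthogonal to the query
    torch.tensor([2.0, 0.0, 0.0]),   # same direction as the query
    torch.tensor([-1.0, 0.0, 0.0]),  # opposite direction
]
for idx, sim in rank_by_similarity(query, candidates):
    print(idx, round(sim, 4))
```

Cosine similarity ignores vector length, so the second candidate ranks first even though its norm differs from the query's.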