	---
	language:
	- en
	tags:
	- fill-mask
	- masked-lm
	- feature-extraction
	- semantic-similarity
	- historical-text
	- newspapers
	license: mit
	pipeline_tag: fill-mask
	---

	# NewsBERT_1800-1920

	NewsBERT_1800-1920 is a domain-adapted masked language model based on [`google-bert/bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased). It was fine-tuned with a masked language modeling (MLM) objective on all of the historical English newspaper text (1800-1920) in the following two collections:
	- [HMD14](https://bl.iro.bl.uk/concern/datasets/2800eb7d-8b49-4398-a6e9-c2c5692a1304)
	- [LwM](https://bl.iro.bl.uk/concern/datasets/99dc570a-9460-48ac-baed-9d2b8c4c13c0?locale=en)

	NewsBERT_1800-1920 retains the architecture and vocabulary of BERT-base (uncased); only the weights were updated during fine-tuning on these datasets.

	---

	## Model Details

	- Model type: `BertForMaskedLM`
	- Base model: `google-bert/bert-base-uncased`
	- Vocabulary: WordPiece (30,522 tokens)
	- Hidden size: 768
	- Layers: 12
	- Heads: 12
	- Max sequence length: 512
	- Fine-tuning objective: Masked language modeling (MLM)
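
As a rough sanity check on the figures above, the encoder's parameter count can be worked out by hand. This is a sketch, not output from the model itself: the feed-forward width (`INTERMEDIATE`) and the number of segment embeddings (`TYPE_VOCAB`) are not stated in this card and are assumed to be the standard BERT-base values. The total covers the encoder as loaded with `AutoModel` (including the pooler) and excludes the MLM head.

```python
# Rough parameter count for a BERT-base encoder, from the dimensions listed above.
VOCAB_SIZE = 30_522      # from the model card
HIDDEN = 768             # from the model card
LAYERS = 12              # from the model card
HEADS = 12               # from the model card
MAX_POSITIONS = 512      # from the model card
INTERMEDIATE = 3_072     # assumption: standard BERT-base feed-forward width (4 x hidden)
TYPE_VOCAB = 2           # assumption: standard BERT segment-embedding count


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


LAYER_NORM = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + LAYER_NORM
attention = 4 * linear(HIDDEN, HIDDEN) + LAYER_NORM  # Q, K, V, output projection
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + LAYER_NORM
encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)

total = embeddings + encoder + pooler
head_dim = HIDDEN // HEADS

print(f"per-head dimension: {head_dim}")
print(f"encoder parameters: {total:,}")
```

Under these assumptions each attention head works in 64 dimensions and the encoder has about 109.5 million parameters, most of them in the 12 Transformer layers.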

	---

	## How to Use

	### 1. Fill-Mask Pipeline

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

	model_id = "TextMachineProject/NewsBERT_1800-1920"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForMaskedLM.from_pretrained(model_id)

	fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

	text = "The [MASK] was published in the newspaper."
	preds = fill_mask(text)

	# Print each candidate completion with its score.
	for p in preds:
	    print(f"{p['sequence']} (score={p['score']:.4f})")
	```

	### 2. Use as an Encoder (CLS Embeddings)

	```python
	import torch
	from transformers import AutoTokenizer, AutoModel

	model_id = "TextMachineProject/NewsBERT_1800-1920"
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModel.from_pretrained(model_id).to(device)
	model.eval()

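	def encode(text, max_length=512):
	    """Return the final-layer [CLS] embedding of `text` as a 1-D tensor of size 768."""
	    # Inputs longer than max_length tokens are truncated.
	    inputs = tokenizer(
	        text,
	        return_tensors="pt",
	        truncation=True,
	        max_length=max_length,
	    ).to(device)

	    with torch.no_grad():
	        outputs = model(**inputs)
	    embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token, shape [1, 768]

	    return embedding.squeeze(0).cpu()  # shape [768]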

	embedding = encode("Example newspaper article text...")
	print(embedding.shape) # torch.Size([768])
	```
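
Because `encode` truncates at 512 tokens, a long article is represented only by its opening. One common workaround, which this card does not describe or evaluate, is to split the token sequence into overlapping windows, encode each window, and average the resulting vectors. The sketch below computes only the window boundaries; `window_spans`, its `stride` default, and the averaging strategy are illustrative assumptions rather than part of this model's API.

```python
def window_spans(n_tokens, max_length=512, stride=128, n_special=2):
    """Return (start, end) token spans that cover a sequence with overlapping windows.

    max_length : model limit, including special tokens
    stride     : number of tokens shared by consecutive windows (hypothetical default)
    n_special  : slots reserved for [CLS] and [SEP] in each window
    """
    body = max_length - n_special  # content tokens per window
    if not 0 <= stride < body:
        raise ValueError("stride must be non-negative and smaller than the window body")
    if n_tokens <= 0:
        return []

    spans, start = [], 0
    while True:
        end = min(start + body, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start = end - stride  # step back so consecutive windows overlap
    return spans


# A 1,200-token article needs three windows of at most 510 content tokens.
print(window_spans(1200))  # [(0, 510), (382, 892), (764, 1200)]
```

Each span would then be applied to the token ids of the article (tokenized without special tokens), with `[CLS]` and `[SEP]` added back before the window is passed to the model. Whether averaged window embeddings work well for a given task would need to be checked empirically.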


	### 3. Compute Similarity Between Two Articles

	```python
	import torch.nn.functional as F

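	# encode() is the helper defined in the encoder example above.
	e1 = encode("Article text one...")
	e2 = encode("Another article...")

	# Both embeddings are 1-D tensors of size 768, so compare along dim 0.
	cos_sim = F.cosine_similarity(e1, e2, dim=0)
	print("Cosine similarity:", cos_sim.item())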
	```
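
The pairwise comparison above extends naturally to ranking several articles against one query article. The following is a minimal sketch of that idea, written in plain Python rather than PyTorch so that it runs without downloading the model. The helper names `cosine` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings; they are not model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar to `query`."""
    scores = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only; real inputs would come from encode(...).tolist().
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print(ranking)
```

Here the second candidate ranks first because it points in the same direction as the query, regardless of its length. For larger collections, stacking the embeddings into one tensor and calling `F.cosine_similarity` with `dim=1` would avoid the Python loop.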