	---
	base_model:
	- microsoft/deberta-v3-base
	pipeline_tag: token-classification
	tags:
	- links
	---
	# Link Anchor Detection Model

	A fine-tuned DeBERTa v3 model that predicts which words in text should be hyperlinks. Trained on 10,273 pages scraped from [The Keyword](https://blog.google/) (Google's official blog), where editorial linking decisions serve as ground truth labels.

	## How It Works

	Given raw text, the model performs token-level binary classification — each token is labeled `LINK` or `O` (not a link). This identifies anchor text candidates: words that a human editor would likely hyperlink.
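To make the labeling concrete, here is a minimal sketch of how character-level link spans could be projected onto tokens. It assumes the tokenizer returns a `(start, end)` character offset per token, with `(0, 0)` for special tokens. The function name `align_labels`, the sample offsets, and the `-100` ignore label (the default `ignore_index` of PyTorch's cross-entropy loss) are illustrative assumptions, not taken from `_prep.py`.

```python
def align_labels(offsets, link_spans):
    """Label each token 1 (LINK) if it overlaps a link span, else 0 (O).

    offsets    -- list of (start, end) character offsets, one per token
    link_spans -- list of (start, end) character ranges that were hyperlinked
    Special tokens (zero-width offsets) get -100 so a loss can ignore them.
    """
    labels = []
    for start, end in offsets:
        if start == end:
            labels.append(-100)  # special token such as [CLS] or [SEP]
            continue
        overlaps = any(start < s_end and end > s_start
                       for s_start, s_end in link_spans)
        labels.append(1 if overlaps else 0)
    return labels


text = "Try Google Search today."
# Hypothetical offsets for: [CLS] Try Google Search today . [SEP]
offsets = [(0, 0), (0, 3), (4, 10), (11, 17), (18, 23), (23, 24), (0, 0)]
link_spans = [(4, 17)]  # "Google Search" was the anchor text

labels = align_labels(offsets, link_spans)
print(labels)  # [-100, 0, 1, 1, 0, 0, -100]
```

A token that only partly overlaps a link span is still labeled `LINK` here; a stricter rule (full containment) would be an equally plausible choice.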

	## Pipeline

	```
	sitemap.xml (10,274 URLs from blog.google)
	│
	▼
	scrape.py ──► scraped.db (SQLite, 10,273 pages with markdown + inline links)
	│
	▼
	_prep.py ──► train_windows.jsonl / val_windows.jsonl
	│ • Strip markdown, annotate link spans as [LINK_START]...[LINK_END]
	│ • Tokenize with DeBERTa, align labels to tokens
	│ • Sliding windows (512 tokens, stride 128)
	│ • 90/10 doc-level split
	▼
	train.py ──► model_link_token_cls/
	│ • Fine-tune microsoft/mdeberta-v3-base
	│ • Weighted cross-entropy (~25x for minority class)
	│ • 3 epochs, lr 2e-5, batch 16
	▼
	app.py ──► Streamlit UI
	• Sliding-window inference (handles any text length)
	• Word-level highlighting with confidence scores
	```
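The sliding-window step in the diagram can be sketched as below. This is an assumption-laden illustration rather than the code in `_prep.py`: it reads "stride 128" as the distance between consecutive window starts, and it ignores the positions taken by special tokens. If the stride is instead meant as the overlap between windows (the convention used by Hugging Face tokenizers), pass `step=512 - 128`.

```python
def sliding_windows(n_tokens, window=512, step=128):
    """Return (start, end) token ranges that together cover n_tokens."""
    if n_tokens <= window:
        return [(0, n_tokens)]  # short documents fit in a single window
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # last window reaches the end of the document
            break
        start += step
    return spans


short_doc = sliding_windows(300)
long_doc = sliding_windows(700)
long_doc_overlap = sliding_windows(700, step=512 - 128)

print(short_doc)         # [(0, 300)]
print(long_doc)          # [(0, 512), (128, 640), (256, 700)]
print(long_doc_overlap)  # [(0, 512), (384, 700)]
```

The same ranges can be reused at inference time: run the model on each window and combine the per-token probabilities wherever windows overlap, for example by averaging.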

	## Data

	Source: [blog.google](https://blog.google/) sitemap (The Keyword — Google's product and technology blog).

	| Metric | Value |
	|---|---|
	| Pages scraped | 10,273 |
	| Total tokens | 8.2M |
	| Link tokens | 286,799 (3.48%) |
	| Training windows | 21,264 |
	| Validation windows | 2,402 |

	The class imbalance (96.5% non-link vs 3.5% link) is handled with weighted cross-entropy loss during training.
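As a rough illustration of what the weighting does, the sketch below implements a two-class weighted cross-entropy in plain Python. It is not the loss code from `train.py`: the function name and the example logits are made up, and the weight of 25 for `LINK` simply mirrors the "~25x" figure in the pipeline diagram. The normalization (dividing by the sum of the applied weights) follows the behavior of PyTorch's `CrossEntropyLoss` with `weight` set and mean reduction.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean cross-entropy over tokens for a 2-class problem.

    logits        -- list of (score_O, score_LINK) pairs, one per token
    labels        -- list of 0 (O) or 1 (LINK)
    class_weights -- (weight_O, weight_LINK)
    """
    total, weight_sum = 0.0, 0.0
    for scores, y in zip(logits, labels):
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        log_p = scores[y] - log_norm  # log-probability of the true class
        total += -class_weights[y] * log_p
        weight_sum += class_weights[y]
    return total / weight_sum


# Two tokens, both confidently predicted as O; the second is really a LINK.
logits = [(3.0, 0.0), (3.0, 0.0)]
labels = [0, 1]

unweighted = weighted_cross_entropy(logits, labels, (1.0, 1.0))
weighted = weighted_cross_entropy(logits, labels, (1.0, 25.0))
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")

# Inverse-frequency ratio implied by the 3.48% link-token share in the table.
inverse_freq_ratio = (1 - 0.0348) / 0.0348
print(f"non-link to link ratio: {inverse_freq_ratio:.1f}")
```

With the heavier `LINK` weight, the missed link token dominates the loss, which pushes the model toward higher recall on the minority class. The inverse-frequency ratio of roughly 27.7 is in the same range as the ~25x weight.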

	## Model

	- Base: `microsoft/mdeberta-v3-base` (DebertaV2ForTokenClassification)
	- Labels: `O` (0), `LINK` (1)
	- Max position: 512 tokens
	- Parameters: 12 layers, 768 hidden, 12 attention heads

	### Evaluation Results

	| Metric | Value |
	|---|---|
	| Accuracy | 95.6% |
	| Precision | 42.4% |
	| Recall | 79.5% |
	| F1 | 0.553 |

	Recall of 79.5% means the model recovers most of the tokens that editors actually linked. Precision of 42.4% is lower partly because linking is inherently ambiguous: many words could reasonably be linked, so some "false positives" may still be sensible anchor candidates.
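The reported F1 follows directly from the precision and recall in the table, as the short check below shows. The majority-class baseline is an added illustration: it assumes the validation set has about the same 3.48% link-token share as the full dataset, which the README does not state.

```python
precision = 0.424
recall = 0.795

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.553

# Accuracy of a trivial model that labels every token O,
# assuming 3.48% of tokens are links.
link_share = 0.0348
all_o_accuracy = 1 - link_share
print(f"all-O accuracy = {all_o_accuracy:.3f}")  # all-O accuracy = 0.965
```

Because an all-`O` predictor would already score about 96.5% accuracy under that assumption, precision, recall, and F1 on the `LINK` class are the more informative numbers here.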

	## Usage

	### Streamlit App

	```bash
	streamlit run app.py
	```

	Paste text, adjust the confidence threshold, and see predicted link anchors highlighted in green.

	### Python

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch
	import torch.nn.functional as F

	# Load the fine-tuned checkpoint from the local training output directory
	tokenizer = AutoTokenizer.from_pretrained("model_link_token_cls")
	model = AutoModelForTokenClassification.from_pretrained("model_link_token_cls")
	model.eval()

	text = "Google announced new features for Search and Gmail today."
	# Offsets map each token back to its character range in `text`
	enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
	with torch.no_grad():
	    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
	probs = F.softmax(logits, dim=-1)[0, :, 1]  # P(LINK) per token

	# Convert tensors to plain Python values before slicing the string
	for (start, end), p in zip(enc["offset_mapping"][0].tolist(), probs.tolist()):
	    if start == end:
	        continue  # skip special tokens
	    if p > 0.5:
	        print(f" LINK: {text[start:end]} ({p:.2%})")
	```
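The loop above prints one line per token. To get whole anchor phrases, adjacent `LINK` tokens can be merged into character spans. The following is a minimal sketch, not the logic used in `app.py`: `merge_link_spans` is a hypothetical helper, and the offsets and probabilities are invented values standing in for real model output.

```python
def merge_link_spans(offsets, probs, threshold=0.5):
    """Merge runs of consecutive LINK tokens into (start, end, score) spans.

    The score of a span is the highest token probability inside it.
    """
    spans = []
    prev_was_link = False
    for (start, end), p in zip(offsets, probs):
        if start == end:  # special token: breaks any running span
            prev_was_link = False
            continue
        is_link = p > threshold
        if is_link and prev_was_link:
            span_start, _, best = spans[-1]
            spans[-1] = (span_start, end, max(best, p))  # extend the span
        elif is_link:
            spans.append((start, end, p))  # start a new span
        prev_was_link = is_link
    return spans


text = "Google announced new features for Search and Gmail today."
# Hypothetical per-token character offsets and P(LINK) values
offsets = [(0, 0), (0, 6), (7, 16), (17, 20), (21, 29), (30, 33),
           (34, 40), (41, 44), (45, 50), (51, 56), (56, 57), (0, 0)]
probs = [0.0, 0.2, 0.1, 0.7, 0.9, 0.1, 0.8, 0.1, 0.6, 0.1, 0.0, 0.0]

spans = merge_link_spans(offsets, probs)
anchors = [text[s:e] for s, e, _ in spans]
print(anchors)  # ['new features', 'Search', 'Gmail']
```

Taking the maximum probability as the span score is one option; averaging the token probabilities would be a more conservative alternative.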

	## Scripts

	| File | Purpose |
	|---|---|
	| `scrape.py` | Async Playwright scraper; reads sitemap.xml, saves to SQLite + markdown files |
	| `_prep.py` | Cleans markdown, annotates link spans, tokenizes, creates sliding windows |
	| `train.py` | Fine-tunes DeBERTa with weighted loss, W&B tracking |
	| `app.py` | Streamlit inference app with sliding-window support |
	| `_count.py` | Token length analysis utility |
	| `_detok.py` | Token ID decoder (Streamlit) |

	## Requirements

	- Python 3.8+
	- PyTorch
	- Transformers
	- Playwright (for scraping)
	- Streamlit (for inference app)