---
base_model:
- microsoft/deberta-v3-base
pipeline_tag: token-classification
tags:
- links
---
# Link Anchor Detection Model

A fine-tuned DeBERTa v3 model that predicts which words in text should be hyperlinks. Trained on 10,273 pages scraped from [The Keyword](https://blog.google/) (Google's official blog), where editorial linking decisions serve as ground-truth labels.

## How It Works

Given raw text, the model performs token-level binary classification: each token is labeled `LINK` or `O` (not a link). This identifies anchor text candidates: words that a human editor would likely hyperlink.
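
As a concrete illustration of the labeling scheme, an editor-linked sentence yields per-token labels like this (the alignment shown is illustrative, not actual model output):

```
Google  announced  new  features  for  Search  and  Gmail  today
O       O          O    O         O    LINK    O    LINK   O
```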

## Pipeline

```
sitemap.xml (10,274 URLs from blog.google)
        │
        ▼
scrape.py ───► scraped.db (SQLite, 10,273 pages with markdown + inline links)
        │
        ▼
_prep.py ───► train_windows.jsonl / val_windows.jsonl
        │       • Strip markdown, annotate link spans as [LINK_START]...[LINK_END]
        │       • Tokenize with DeBERTa, align labels to tokens
        │       • Sliding windows (512 tokens, stride 128)
        │       • 90/10 doc-level split
        ▼
train.py ───► model_link_token_cls/
        │       • Fine-tune microsoft/mdeberta-v3-base
        │       • Weighted cross-entropy (~25x for minority class)
        │       • 3 epochs, lr 2e-5, batch 16
        ▼
app.py ───► Streamlit UI
                • Sliding-window inference (handles any text length)
                • Word-level highlighting with confidence scores
```
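
The windowing step above can be sketched as follows. This is a minimal sketch, not the actual `_prep.py` code: it assumes "stride 128" means a 128-token overlap between consecutive 512-token windows, and the function name is illustrative.

```python
def make_windows(token_ids, labels, max_len=512, stride=128):
    """Split one document's aligned tokens/labels into overlapping windows.

    Assumes `stride` is the overlap between consecutive windows,
    so each new window advances by (max_len - stride) tokens.
    """
    windows = []
    step = max_len - stride  # 384 fresh tokens per window
    for start in range(0, max(len(token_ids) - stride, 1), step):
        windows.append({
            "input_ids": token_ids[start:start + max_len],
            "labels": labels[start:start + max_len],
        })
        if start + max_len >= len(token_ids):
            break  # last window already reaches the end of the document
    return windows
```

Documents shorter than 512 tokens produce a single window; longer documents produce overlapping windows that together cover every token.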

## Data

Source: [blog.google](https://blog.google/) sitemap (The Keyword, Google's product and technology blog).

| Metric | Value |
|---|---|
| Pages scraped | 10,273 |
| Total tokens | 8.2M |
| Link tokens | 286,799 (3.48%) |
| Training windows | 21,264 |
| Validation windows | 2,402 |

The class imbalance (96.5% non-link vs. 3.5% link) is handled with weighted cross-entropy loss during training.
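
In PyTorch, that weighting amounts to passing per-class weights to the loss. The sketch below uses a flat 25x weight for the `LINK` class to match the "~25x" figure above; the exact values and wiring in `train.py` may differ.

```python
import torch
import torch.nn as nn

# Hypothetical weights approximating the ~25x minority-class weighting;
# index 0 = O (non-link), index 1 = LINK.
class_weights = torch.tensor([1.0, 25.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

torch.manual_seed(0)
logits = torch.randn(2, 16, 2)         # (batch, seq_len, num_labels)
labels = torch.randint(0, 2, (2, 16))  # token labels; -100 would mark padding
# Flatten tokens across the batch before computing the weighted loss.
loss = loss_fn(logits.view(-1, 2), labels.view(-1))
```

The `ignore_index=-100` convention matches how Hugging Face token-classification pipelines mask out padding and special tokens.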

## Model

- **Base**: `microsoft/mdeberta-v3-base` (DebertaV2ForTokenClassification)
- **Labels**: `O` (0), `LINK` (1)
- **Max position**: 512 tokens
- **Parameters**: 12 layers, 768 hidden, 12 attention heads

### Evaluation Results

| Metric | Value |
|---|---|
| Accuracy | 95.6% |
| Precision | 42.4% |
| Recall | 79.5% |
| F1 | 0.553 |

High recall means the model catches most link-worthy text. The lower precision reflects the task's inherent ambiguity: many words *could* be linked, so "false positives" are often reasonable candidates.
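
Token-level metrics like those above follow directly from per-token `P(LINK)` scores and gold labels; a minimal, self-contained sketch (function name and threshold are illustrative):

```python
def token_prf(probs, gold, threshold=0.5):
    """Precision/recall/F1 for the LINK class at a given threshold."""
    pred = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Raising the threshold trades recall for precision, which is why the Streamlit app exposes it as a slider.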

## Usage

### Streamlit App

```bash
streamlit run app.py
```

Paste text, adjust the confidence threshold, and see predicted link anchors highlighted in green.

### Python

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("model_link_token_cls")
model = AutoModelForTokenClassification.from_pretrained("model_link_token_cls")
model.eval()

text = "Google announced new features for Search and Gmail today."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
with torch.no_grad():
    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
probs = F.softmax(logits, dim=-1)[0, :, 1]  # P(LINK) per token

for token, offset, p in zip(
    tokenizer.convert_ids_to_tokens(enc["input_ids"][0]),
    enc["offset_mapping"][0],
    probs,
):
    if offset[0] == offset[1]:
        continue  # skip special tokens
    if p > 0.5:
        print(f"  LINK: {text[offset[0]:offset[1]]} ({p:.2%})")
```
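
The snippet above prints one line per subword token. To get the word-level highlighting the app describes, per-token scores have to be aggregated over each word's character span; a sketch under the assumption that a word's score is the max over its overlapping tokens (the app may aggregate differently):

```python
def word_scores(text, offsets, probs):
    """Aggregate per-token P(LINK) into per-word (word, score) pairs.

    `offsets` are (start, end) character spans per token, with
    (0, 0) marking special tokens, as returned by the tokenizer.
    """
    # Find whitespace-delimited word spans in the raw text.
    words, start = [], None
    for i, ch in enumerate(text + " "):
        if not ch.isspace() and start is None:
            start = i
        elif ch.isspace() and start is not None:
            words.append((start, i))
            start = None
    scores = []
    for ws, we in words:
        toks = [p for (ts, te), p in zip(offsets, probs)
                if ts < we and te > ws and ts != te]  # overlap, skip specials
        scores.append((text[ws:we], max(toks) if toks else 0.0))
    return scores
```

Feeding it the `offsets` and `probs` from the snippet above yields one confidence score per whitespace-delimited word, ready for thresholded highlighting.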

## Scripts

| File | Purpose |
|---|---|
| `scrape.py` | Async Playwright scraper; reads sitemap.xml, saves to SQLite + markdown files |
| `_prep.py` | Cleans markdown, annotates link spans, tokenizes, creates sliding windows |
| `train.py` | Fine-tunes DeBERTa with weighted loss, W&B tracking |
| `app.py` | Streamlit inference app with sliding-window support |
| `_count.py` | Token length analysis utility |
| `_detok.py` | Token ID decoder (Streamlit) |

## Requirements

- Python 3.8+
- PyTorch
- Transformers
- Playwright (for scraping)
- Streamlit (for inference app)