Upload folder using huggingface_hub

Files changed (8)

README.md +213 -3
config.json +41 -0
infer.py +104 -0
model.py +469 -0
pytorch_model.bin +3 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +945 -0

README.md CHANGED

@@ -1,3 +1,213 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- en
+pipeline_tag: token-classification
+tags:
+- named-entity-recognition
+- ner
+- span-ner
+- globalpointer
+- pytorch
+library_name: transformers
+model_name: EcomBert_NER_V1
+---
+# EcomBert_NER_V1
+## Model description
+`EcomBert_NER_V1` is a span-based Named Entity Recognition (NER) model: a BERT-style encoder (`config.json` records ModernBERT-large) with an EfficientGlobalPointer span-classification head.
+The repository stores the model in a lightweight Hugging Face-style folder layout:
+- `config.json`
+- `pytorch_model.bin`
+- tokenizer files saved by `transformers.AutoTokenizer.save_pretrained(...)`
+**Parameter size**: ~0.4B parameters (the exported `pytorch_model.bin` is about 1.58 GB).
+## Intended uses & limitations
+### Intended uses
+- Extracting entity spans from short-to-medium English texts (e.g., product titles, user queries, support tickets).
+- Offline batch inference and evaluation.
+### Limitations
+- This is a span-scoring model: it predicts `(label, start, end)` spans. Overlapping spans are possible.
+- Output quality depends heavily on:
+  - the training dataset schema and label definitions
+  - the decision threshold (`threshold`)
+  - tokenization behavior (subword boundaries)
+- Long inputs are truncated to `max_length` tokens (`infer.py` defaults to 256).
+## How to use
+### 1) Train and export
+During training, the best checkpoint is exported to a HuggingFace-style directory (by default `checkpoints/hf_export`).
+Example:
+```bash
+python train.py \
+  --splits_dir ./data2/splits \
+  --output_dir checkpoints \
+  --model_name answerdotai/ModernBERT-large \
+  --hf_export_dir hf_export
+```
+This produces:
+- `checkpoints/hf_export/config.json`
+- `checkpoints/hf_export/pytorch_model.bin`
+- `checkpoints/hf_export/tokenizer.*`
+### 2) Inference (CLI)
+```bash
+python infer.py \
+  --model_dir checkpoints/hf_export \
+  --text "Apple released a new iPhone in California."
+```
+You can optionally override the threshold:
+```bash
+python infer.py \
+  --model_dir checkpoints/hf_export \
+  --text "Apple released a new iPhone in California." \
+  --threshold 0.55
+```
+### 3) Inference (Python)
+```python
+import torch
+from transformers import AutoTokenizer
+from model import EcomBertNER
+model_dir = "checkpoints/hf_export"
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# from_pretrained returns the model (already in eval mode) and the config dict.
+model, cfg = EcomBertNER.from_pretrained(model_dir, device=device)
+tokenizer = AutoTokenizer.from_pretrained(model_dir)
+text = "Apple released a new iPhone in California."
+enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
+input_ids = enc["input_ids"].to(device)
+attention_mask = enc["attention_mask"].to(device)
+with torch.no_grad():  # inference only, no gradients needed
+    o = model(input_ids=input_ids, attention_mask=attention_mask)
+logits = o["logits"][0]  # (C, L, L)
+probs = torch.sigmoid(logits)
+threshold = float(cfg.get("threshold", 0.5))
+# Each row of hits is (label_index, start_token, end_token).
+hits = (probs > threshold).nonzero(as_tuple=False)
+print(hits[:10])
+```
+## Labels and few-shot examples
+The model predicts spans over the following **23 labels**:
+| Label | Description |
+|---|---|
+| `MAIN_PRODUCT` | Primary product being searched/described |
+| `SUB_PRODUCT` | Secondary / accessory product |
+| `BRAND` | Brand name |
+| `MODEL` | Model number or name |
+| `IP` | IP / licensed character / franchise |
+| `MATERIAL` | Material composition |
+| `COLOR` | Color attribute |
+| `SHAPE` | Shape attribute |
+| `PATTERN` | Pattern or print |
+| `STYLE` | Style descriptor |
+| `FUNCTION` | Function or use-case |
+| `ATTRIBUTE` | Other product attribute |
+| `COMPATIBILITY` | Compatible device / platform |
+| `CROWD` | Target audience |
+| `OCCASION` | Use occasion or scene |
+| `LOCATION` | Geographic / location reference |
+| `MEASUREMENT` | Size, dimension, capacity |
+| `TIME` | Time reference |
+| `QUANTITY` | Count or amount |
+| `SALE` | Promotion or sale information |
+| `SHOP` | Shop or seller name |
+| `CONJ` | Conjunction linking entities |
+| `PREP` | Preposition linking entities |
+---
+### Example 1
+**Input**:
+```
+"Nike running shoes for men, breathable mesh upper, size 42"
+```
+**Expected entities**:
+- `BRAND`: "Nike"
+- `MAIN_PRODUCT`: "running shoes"
+- `CROWD`: "men"
+- `MATERIAL`: "breathable mesh"
+- `MEASUREMENT`: "size 42"
+---
+### Example 2
+**Input**:
+```
+"iPhone 15 Pro compatible leather case, black, for outdoor use"
+```
+**Expected entities**:
+- `COMPATIBILITY`: "iPhone 15 Pro"
+- `MAIN_PRODUCT`: "leather case"
+- `MATERIAL`: "leather"
+- `COLOR`: "black"
+- `OCCASION`: "outdoor use"
+---
+### Example 3
+**Input**:
+```
+"Disney Mickey pattern kids cotton pajamas, 3-piece set, buy 2 get 1 free"
+```
+**Expected entities**:
+- `IP`: "Disney Mickey"
+- `PATTERN`: "Mickey pattern"
+- `CROWD`: "kids"
+- `MATERIAL`: "cotton"
+- `MAIN_PRODUCT`: "pajamas"
+- `QUANTITY`: "3-piece set"
+- `SALE`: "buy 2 get 1 free"
+## Training data
+Not provided in this repository model card.
+## Evaluation
+`evaluate.py` evaluates `.pt` checkpoints produced during training; the script is not among the files in this upload. `config.json` records `best_f1` = 0.7364 at `best_epoch` = 5.
+## Environmental impact
+Not measured.
+## Citation
+If you use this work, consider citing your dataset and the BERT/Transformer literature relevant to your setup.
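
The Python example in the README stops at raw `hits` indices, each a `(label_index, start_token, end_token)` triple. The following is a minimal sketch of how such triples could be mapped back to character spans using the tokenizer's offset mapping. It uses nested Python lists instead of tensors so it runs without `torch`; the function name `decode_spans` and the toy inputs are assumptions for illustration, not part of the repository. Spans that start or end on a special token are dropped by checking each token's own offsets, since fast tokenizers report `(0, 0)` for special tokens.

```python
from typing import Dict, List, Sequence, Tuple


def decode_spans(
    probs: Sequence[Sequence[Sequence[float]]],  # [C][L][L] span probabilities
    offsets: Sequence[Tuple[int, int]],          # per-token (char_start, char_end)
    labels: Sequence[str],
    text: str,
    threshold: float = 0.5,
) -> List[Dict]:
    """Turn (label, start_token, end_token) hits into character spans."""
    results = []
    for c, matrix in enumerate(probs):
        for s, row in enumerate(matrix):
            for e, p in enumerate(row):
                # Only the upper triangle (start <= end) can hold a span.
                if e < s or p <= threshold:
                    continue
                s0, s1 = offsets[s]
                e0, e1 = offsets[e]
                # Special tokens carry empty offsets; skip spans touching them.
                if s0 == s1 or e0 == e1:
                    continue
                results.append({"label": labels[c], "span": (s0, e1),
                                "text": text[s0:e1], "score": p})
    results.sort(key=lambda r: -r["score"])
    return results


# Toy input (hypothetical): one label, four tokens [CLS] "Nike" "shoes" [SEP].
text = "Nike shoes"
offsets = [(0, 0), (0, 4), (5, 10), (0, 0)]
probs = [[[0.0] * 4 for _ in range(4)]]
probs[0][1][1] = 0.9   # span covering "Nike"
probs[0][0][2] = 0.8   # starts on [CLS], so it is dropped
spans = decode_spans(probs, offsets, ["BRAND"], text, threshold=0.45)
print(spans)
```

Checking the start and end token separately matters: a span that begins on `[CLS]` and ends on a real token has a non-empty character range and would otherwise be reported as starting at character 0.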

config.json ADDED

	@@ -0,0 +1,41 @@

+{
+  "architectures": [
+    "EcomBertNER"
+  ],
+  "model_name": "/home/jovyan/work/models/answerdotai/ModernBERT-large",
+  "num_labels": 23,
+  "head_size": 64,
+  "loss_type": "circle",
+  "use_rope": true,
+  "dropout": 0.1,
+  "circle_margin": 0.25,
+  "circle_gamma": 32.0,
+  "best_epoch": 5,
+  "best_f1": 0.7364,
+  "threshold": 0.45,
+  "label_list": [
+    "MAIN_PRODUCT",
+    "SUB_PRODUCT",
+    "BRAND",
+    "MODEL",
+    "IP",
+    "MATERIAL",
+    "COLOR",
+    "SHAPE",
+    "PATTERN",
+    "STYLE",
+    "FUNCTION",
+    "ATTRIBUTE",
+    "COMPATIBILITY",
+    "CROWD",
+    "OCCASION",
+    "LOCATION",
+    "MEASUREMENT",
+    "TIME",
+    "QUANTITY",
+    "SALE",
+    "SHOP",
+    "CONJ",
+    "PREP"
+  ]
+}
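
`EcomBertNER.from_pretrained` passes the `model_name` field of this file straight to `AutoModel.from_pretrained`, and the value recorded here is an absolute path from the training machine. A small sanity check before loading might look like the sketch below. The helper name `check_config`, the warning wording, and the sample dictionary are assumptions made for illustration; only the field names come from the exported `config.json`.

```python
from pathlib import Path


def check_config(cfg: dict) -> list:
    """Return human-readable warnings for an exported config dict."""
    warnings = []
    labels = cfg.get("label_list") or []
    if len(labels) != int(cfg.get("num_labels", len(labels))):
        warnings.append("num_labels does not match len(label_list)")
    if len(set(labels)) != len(labels):
        warnings.append("label_list contains duplicates")
    threshold = float(cfg.get("threshold", 0.5))
    if not 0.0 < threshold < 1.0:
        warnings.append("threshold should lie strictly between 0 and 1")
    name = str(cfg.get("model_name", ""))
    # An absolute path only resolves on the machine that wrote the config.
    if name.startswith("/") and not Path(name).exists():
        warnings.append(f"model_name is a local path that does not exist here: {name}")
    return warnings


# Hypothetical config with two deliberate problems.
cfg = {
    "model_name": "/nonexistent/models/answerdotai/ModernBERT-large",
    "num_labels": 23,
    "threshold": 0.45,
    "label_list": ["MAIN_PRODUCT", "BRAND"],  # truncated on purpose
}
warnings = check_config(cfg)
for w in warnings:
    print("warning:", w)
```

If the path warning fires, one option is to edit `model_name` so it points at an encoder checkpoint that is available locally or on the Hub before calling `from_pretrained`.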

infer.py ADDED

	@@ -0,0 +1,104 @@

+"""infer.py — load exported HF-style directory and run NER inference.
+Usage:
+  python infer.py --model_dir checkpoints/hf_export --text "..."
+Notes:
+  - This repo exports a lightweight HF-style folder:
+      config.json
+      pytorch_model.bin
+      tokenizer files (via transformers AutoTokenizer.save_pretrained)
+  - The model class is local (EcomBertNER in model.py).
+"""
+import argparse
+from pathlib import Path
+import torch
+from transformers import AutoTokenizer
+from model import EcomBertNER
+def parse_args():
+    p = argparse.ArgumentParser(description="Inference with exported HF-style NER model")
+    p.add_argument("--model_dir", type=str, required=True, help="Path to HF export dir")
+    p.add_argument("--text", type=str, required=True, help="Input text")
+    p.add_argument("--max_length", type=int, default=256)
+    p.add_argument("--threshold", type=float, default=None, help="Override threshold (default: config.json or 0.5)")
+    p.add_argument("--device", type=str, default=None, help="cuda / cpu; default auto")
+    p.add_argument("--cache_dir", type=str, default=None)
+    return p.parse_args()
+@torch.no_grad()
+def main():
+    args = parse_args()
+    model_dir = Path(args.model_dir)
+    if not model_dir.exists():
+        raise FileNotFoundError(f"model_dir not found: {model_dir}")
+    if args.device is None:
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    else:
+        device = torch.device(args.device)
+    model, cfg = EcomBertNER.from_pretrained(model_dir, device=device, cache_dir=args.cache_dir)
+    tokenizer = AutoTokenizer.from_pretrained(model_dir, cache_dir=args.cache_dir)
+    threshold = args.threshold
+    if threshold is None:
+        threshold = float(cfg.get("threshold", 0.5))
+    enc = tokenizer(
+        args.text,
+        max_length=args.max_length,
+        truncation=True,
+        padding=False,
+        return_tensors="pt",
+        return_offsets_mapping=True,
+    )
+    input_ids = enc["input_ids"].to(device)
+    attention_mask = enc["attention_mask"].to(device)
+    offsets = enc["offset_mapping"][0].tolist()
+    out = model(input_ids=input_ids, attention_mask=attention_mask)
+    logits = out["logits"][0]  # (C, L, L)
+    probs = torch.sigmoid(logits)
+    label_list = cfg.get("label_list")
+    if not label_list:
+        label_list = [str(i) for i in range(int(cfg.get("num_labels", probs.size(0))))]
+    hits = (probs > threshold).nonzero(as_tuple=False)
+    results = []
+    for c, s, e in hits.tolist():
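+        if s >= len(offsets) or e >= len(offsets):
+            continue
+        # Special tokens ([CLS], [SEP]) have empty (0, 0) offsets; skip spans touching them.
+        if offsets[s][0] == offsets[s][1] or offsets[e][0] == offsets[e][1]:
+            continue
+        char_s = offsets[s][0]
+        char_e = offsets[e][1]
+        if char_s < 0 or char_e <= char_s:
+            continue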
+        ent_text = args.text[char_s:char_e]
+        results.append({
+            "label": label_list[c] if c < len(label_list) else str(c),
+            "span": [char_s, char_e],
+            "text": ent_text,
+            "score": float(probs[c, s, e].item()),
+        })
+    results.sort(key=lambda x: (-x["score"], x["span"][0], x["span"][1]))
+    print(f"device={device} threshold={threshold}")
+    for r in results:
+        print(f"{r['label']}: {r['text']}  span={r['span']}  score={r['score']:.4f}")
+if __name__ == "__main__":
+    main()
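
`infer.py` prints every span above the threshold, so several overlapping candidates with the same label can appear together. The README examples deliberately nest spans of different labels (for instance `MAIN_PRODUCT` "leather case" and `MATERIAL` "leather"), so a post-processing step should probably only resolve overlaps within a label. The sketch below shows one such greedy filter over the `results` dictionaries that `infer.py` builds. The function name `filter_overlaps_per_label` and the scores are illustrative assumptions, not repository code or model output.

```python
from typing import Dict, List


def filter_overlaps_per_label(results: List[Dict]) -> List[Dict]:
    """Greedy filter: within each label, keep the highest-scoring span and
    drop any lower-scoring span that shares characters with a kept one."""
    kept: List[Dict] = []
    for r in sorted(results, key=lambda x: -x["score"]):
        start, end = r["span"]
        clash = any(
            k["label"] == r["label"]
            and start < k["span"][1] and k["span"][0] < end
            for k in kept
        )
        if not clash:
            kept.append(r)
    return kept


text = "iPhone 15 Pro compatible leather case, black, for outdoor use"
# Spans are [char_start, char_end) as in infer.py; scores are made up.
results = [
    {"label": "MAIN_PRODUCT", "span": [25, 37], "text": text[25:37], "score": 0.91},
    {"label": "MAIN_PRODUCT", "span": [33, 37], "text": text[33:37], "score": 0.62},
    {"label": "MATERIAL", "span": [25, 32], "text": text[25:32], "score": 0.88},
]
kept = filter_overlaps_per_label(results)
for r in kept:
    print(r["label"], repr(r["text"]), r["score"])
```

Filtering per label keeps the nested `MATERIAL` span while removing the weaker `MAIN_PRODUCT` candidate "case"; whether that trade-off is right depends on the annotation scheme.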

model.py ADDED

	@@ -0,0 +1,469 @@

+"""
+model.py — GlobalPointer-based NER model on top of BERT
+Changes vs previous version:
+  [FIX-1] Circle Loss: correct two-term formulation (Su Jianlin style),
+           with margin (m) and scale (gamma) params; no more logaddexp merging.
+  [FIX-2] Numerical safety: negated pos_logits no longer turns -1e9 → +1e9;
+           we apply the mask BEFORE negation.
+  [FIX-3] labels .float() cast inside forward (no silent runtime error / nan).
+  [FIX-4] valid_mask (bool, B×L) replaces attention_mask for span masking;
+           attention_mask is still passed to the encoder for self-attention.
+  [FIX-5] use_rope flag for GlobalPointer's span-level RoPE (independent of
+           BERT encoder internals).
+"""
+import json
+from pathlib import Path
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import AutoModel
+# ════════════════════════════════════════════════════════════════════════════
+#  EfficientGlobalPointer head
+#    - shared q/k projection (hidden -> 2D)
+#    - per-label token bias (hidden -> 2C) as start/end bias
+#    - final logits: base_span + start_bias + end_bias
+# ════════════════════════════════════════════════════════════════════════════
+class EfficientGlobalPointer(nn.Module):
+    """
+    EfficientGlobalPointer span scorer (Su Jianlin style).
+    Differences vs standard GlobalPointer:
+      - q/k are shared across labels:  hidden -> 2 * head_size
+      - label-specific bias per token: hidden -> 2 * num_labels
+        (start_bias and end_bias for each label)
+      - logits: (q @ k^T)/sqrt(D)  expanded to C labels, then add biases
+    Output shape: (B, C, L, L)
+    """
+    def __init__(
+        self,
+        hidden_size: int,
+        num_labels:  int,
+        head_size:   int = 64,
+        use_rope:    bool = True,
+        dropout:     float = 0.1,
+    ):
+        super().__init__()
+        self.num_labels = num_labels
+        self.head_size  = head_size
+        self.use_rope   = use_rope
+        self.dropout = nn.Dropout(dropout)
+        # shared q/k: (H -> 2D)
+        self.dense_qk = nn.Linear(hidden_size, head_size * 2)
+        # label bias: (H -> 2C) => per token: start_bias + end_bias
+        self.dense_bias = nn.Linear(hidden_size, num_labels * 2)
+        if use_rope:
+            self.rope = RotaryEmbedding(head_size)
+    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
+        """
+        hidden: (B, L, H)
+        returns logits: (B, C, L, L)
+        """
+        B, L, _ = hidden.shape
+        C = self.num_labels
+        D = self.head_size
+        hidden = self.dropout(hidden)
+        # ── shared q/k ───────────────────────────────────────────────────────
+        qk = self.dense_qk(hidden)            # (B, L, 2D)
+        q, k = qk[..., :D], qk[..., D:]       # each (B, L, D)
+        if self.use_rope:
+            emb  = self.rope(L, hidden.device)     # (L, D)
+            cos_ = emb.cos()[None, :, :]           # (1, L, D)
+            sin_ = emb.sin()[None, :, :]
+            q    = apply_rotary(q, cos_, sin_)     # (B, L, D)
+            k    = apply_rotary(k, cos_, sin_)     # (B, L, D)
+        # base span score (shared across labels): (B, L, L)
+        base = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(D)
+        # ── per-label start/end bias ────────────────────────────────────────
+        bias = self.dense_bias(hidden)             # (B, L, 2C)
+        bias = bias.view(B, L, C, 2)               # (B, L, C, 2)
+        # start/end: (B, C, L)
+        start_bias = bias[..., 0].permute(0, 2, 1)  # (B, C, L)
+        end_bias   = bias[..., 1].permute(0, 2, 1)  # (B, C, L)
+        # combine:
+        # base: (B, 1, L, L)
+        # start_bias: (B, C, L, 1)
+        # end_bias:   (B, C, 1, L)
+        logits = (
+            base[:, None, :, :] +
+            start_bias[:, :, :, None] +
+            end_bias[:, :, None, :]
+        )  # (B, C, L, L)
+        return logits
+# ════════════════════════════════════════════════════════════════════════════
+#  RoPE helper (span-level, applied to GlobalPointer q/k)
+# ════════════════════════════════════════════════════════════════════════════
+class RotaryEmbedding(nn.Module):
+    """Rotary Position Embedding for GlobalPointer span scoring."""
+    def __init__(self, dim: int):
+        super().__init__()
+        assert dim % 2 == 0, "RoPE dim must be even"
+        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
+        self.register_buffer("inv_freq", inv_freq)
+    def forward(self, seq_len: int, device: torch.device) -> torch.Tensor:
+        """Returns the rotation angles, shape (seq_len, dim); the caller applies cos/sin."""
+        t   = torch.arange(seq_len, device=device).float()
+        freqs = torch.outer(t, self.inv_freq)          # (L, dim/2)
+        emb   = torch.cat([freqs, freqs], dim=-1)      # (L, dim)
+        return emb                                     # caller does cos/sin
+def rotate_half(x: torch.Tensor) -> torch.Tensor:
+    half = x.shape[-1] // 2
+    x1, x2 = x[..., :half], x[..., half:]
+    return torch.cat([-x2, x1], dim=-1)
+def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
+    """x: (..., D); cos/sin: broadcastable to x, with last dim D."""
+    return x * cos + rotate_half(x) * sin
+# ════════════════════════════════════════════════════════════════════════════
+#  Loss functions
+# ════════════════════════════════════════════════════════════════════════════
+def multilabel_circle_loss(
+    logits:  torch.Tensor,   # (B, C, L, L)  raw scores
+    labels:  torch.Tensor,   # (B, C, L, L)  float 0/1
+    mask2d:  torch.Tensor,   # (B, 1, L, L)  bool — True = valid span position
+    margin:  float = 0.25,
+    gamma:   float = 32.0,
+) -> torch.Tensor:
+    """
+    Su Jianlin–style Circle Loss for multi-label span classification.
+    L = log(1 + Σ exp(γ·(s_neg + m))) + log(1 + Σ exp(−γ·(s_pos − m)))
+    Two independent logsumexp terms keep the original loss geometry intact.
+    Mask is applied BEFORE any sign flip to avoid ±1e9 explosions.
+    Args:
+        logits:  raw span scores, shape (B, C, L, L)
+        labels:  float tensor {0, 1}, same shape
+        mask2d:  bool (B, 1, L, L) — True where span is valid (upper-tri + valid tokens)
+        margin:  additive margin (default 0.25)
+        gamma:   temperature / scale (default 32)
+    """
+    B, C, L, _ = logits.shape
+    # ── expand mask to (B, C, L, L) ─────────────────────────────────────────
+    mask = mask2d.expand(B, C, L, L)   # broadcast over C
+    # ── positions that are valid positive / valid negative ───────────────────
+    pos_mask = mask & (labels > 0.5)   # bool
+    neg_mask = mask & (labels < 0.5)   # bool
+    # ── scale logits ─────────────────────────────────────────────────────────
+    s = logits * gamma                  # (B, C, L, L)
+    # ── negative term: log(1 + Σ exp(s_neg + γ·m)) ──────────────────────────
+    # Fill invalid & positive positions with -inf so they don't contribute
+    neg_scores = s.masked_fill(~neg_mask, float("-inf"))
+    # logsumexp over (L, L) for each (b, c)
+    neg_lse = torch.logsumexp(neg_scores.view(B, C, -1), dim=-1)   # (B, C)
+    loss_neg = F.softplus(neg_lse + gamma * margin)                  # log(1+exp(...))
+    # ── positive term: log(1 + Σ exp(−(s_pos − γ·m))) ───────────────────────
+    # Fill invalid & negative positions with -inf (in the negated domain)
+    # To avoid -(-1e9) = +1e9: we mask FIRST, then negate.
+    pos_scores = s.masked_fill(~pos_mask, float("-inf"))
+    neg_pos_scores = (-pos_scores).masked_fill(~pos_mask, float("-inf"))
+    pos_lse = torch.logsumexp(neg_pos_scores.view(B, C, -1), dim=-1)  # (B, C)
+    loss_pos = F.softplus(pos_lse + gamma * margin)
+    # ── mean over (batch, label); a term with no valid entries is softplus(-inf) = 0 ──
+    loss = (loss_neg + loss_pos).mean()
+    return loss
+def multilabel_bce_loss(
+    logits: torch.Tensor,   # (B, C, L, L)
+    labels: torch.Tensor,   # (B, C, L, L)  float
+    mask2d: torch.Tensor,   # (B, 1, L, L)  bool
+) -> torch.Tensor:
+    mask = mask2d.expand_as(logits)
+    loss = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
+    loss = loss * mask.float()
+    return loss.sum() / mask.float().sum().clamp(min=1)
+# ════════════════════════════════════════════════════════════════════════════
+#  GlobalPointer head
+# ════════════════════════════════════════════════════════════════════════════
+class GlobalPointer(nn.Module):
+    """
+    GlobalPointer span scorer.
+    Projects encoder hidden states to per-label (q, k) vectors and computes
+    an (L×L) score matrix per label.  Optionally applies span-level RoPE.
+    Note: encoder internals (inside self-attention layers) are entirely
+    separate from this span-level RoPE — both can be active simultaneously.
+    """
+    def __init__(
+        self,
+        hidden_size: int,
+        num_labels:  int,
+        head_size:   int = 64,
+        use_rope:    bool = True,
+        dropout:     float = 0.1,
+    ):
+        super().__init__()
+        self.num_labels = num_labels
+        self.head_size  = head_size
+        self.use_rope   = use_rope
+        self.dropout = nn.Dropout(dropout)
+        # Project to 2 * num_labels * head_size  (q and k for every label)
+        self.dense   = nn.Linear(hidden_size, num_labels * head_size * 2)
+        if use_rope:
+            self.rope = RotaryEmbedding(head_size)
+    def forward(
+        self,
+        hidden: torch.Tensor,   # (B, L, H)
+    ) -> torch.Tensor:          # (B, C, L, L)
+        B, L, H = hidden.shape
+        C       = self.num_labels
+        D       = self.head_size
+        hidden = self.dropout(hidden)
+        proj   = self.dense(hidden)                         # (B, L, C*D*2)
+        proj   = proj.view(B, L, C, D * 2)                 # (B, L, C, D*2)
+        q, k   = proj[..., :D], proj[..., D:]              # each (B, L, C, D)
+        if self.use_rope:
+            emb  = self.rope(L, hidden.device)             # (L, D)
+            cos_ = emb.cos()[None, :, None, :]             # (1, L, 1, D)
+            sin_ = emb.sin()[None, :, None, :]
+            q    = apply_rotary(q, cos_, sin_)
+            k    = apply_rotary(k, cos_, sin_)
+        # q: (B, L, C, D) → (B, C, L, D)
+        q = q.permute(0, 2, 1, 3)
+        k = k.permute(0, 2, 1, 3)
+        # Score matrix: (B, C, L, D) × (B, C, D, L) → (B, C, L, L)
+        logits = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(D)
+        return logits
+# ════════════════════════════════════════════════════════════════════════════
+#  Full model
+# ════════════════════════════════════════════════════════════════════════════
+class EcomBertNER(nn.Module):
+    """
+    BERT-style encoder + EfficientGlobalPointer head for span-based NER.
+    forward() signature:
+        input_ids      (B, L)   — token ids
+        attention_mask (B, L)   — passed to encoder (1=real, 0=pad)
+        labels         (B, C, L, L) torch.bool, optional
+        valid_mask     (B, L)   torch.bool, optional — True = valid token
+                                (excludes CLS/SEP/PAD; from dataset collate_fn)
+    If valid_mask is not provided, falls back to attention_mask.bool()
+    (slightly less precise — includes CLS/SEP as negative spans).
+    """
+    def __init__(
+        self,
+        model_name:  str   = "bert-base-chinese",
+        num_labels:  int   = 23,
+        head_size:   int   = 64,
+        loss_type:   str   = "circle",   # "circle" | "bce"
+        use_rope:    bool  = True,
+        dropout:     float = 0.1,
+        cache_dir:   str   = None,
+        # Circle Loss hyper-params (ignored for BCE)
+        circle_margin: float = 0.25,
+        circle_gamma:  float = 32.0,
+    ):
+        super().__init__()
+        assert loss_type in ("circle", "bce"), \
+            f"loss_type must be 'circle' or 'bce', got {loss_type!r}"
+        self.loss_type     = loss_type
+        self.circle_margin = circle_margin
+        self.circle_gamma  = circle_gamma
+        self.encoder = AutoModel.from_pretrained(
+            model_name, cache_dir=cache_dir
+        )
+        hidden_size = self.encoder.config.hidden_size
+        self.global_pointer = EfficientGlobalPointer(
+            hidden_size = hidden_size,
+            num_labels  = num_labels,
+            head_size   = head_size,
+            use_rope    = use_rope,
+            dropout     = dropout,
+        )
+        self.model_name = model_name
+        self.num_labels = num_labels
+        self.head_size  = head_size
+        self.use_rope   = use_rope
+        self.dropout    = dropout
+    # ── span validity mask ────────────────────────────────────────────────────
+    @staticmethod
+    def _build_span_mask(
+        valid_mask: torch.Tensor,   # (B, L) bool
+    ) -> torch.Tensor:
+        """
+        Returns upper-triangular span mask (B, 1, L, L) where
+        mask[b,0,i,j] = True iff i<=j and both token i and j are valid.
+        """
+        # row mask (B, 1, L, 1) & col mask (B, 1, 1, L) → (B, 1, L, L)
+        row = valid_mask[:, None, :, None]    # (B, 1, L, 1)
+        col = valid_mask[:, None, None, :]    # (B, 1, 1, L)
+        pair_mask = row & col                 # (B, 1, L, L)
+        L = valid_mask.size(1)
+        upper_tri = torch.triu(
+            torch.ones(L, L, dtype=torch.bool, device=valid_mask.device)
+        )  # (L, L)
+        return pair_mask & upper_tri          # (B, 1, L, L)
+    # ── forward ───────────────────────────────────────────────────────────────
+    def forward(
+        self,
+        input_ids:      torch.Tensor,                    # (B, L)
+        attention_mask: torch.Tensor,                    # (B, L)
+        labels:         torch.Tensor = None,             # (B, C, L, L) bool
+        valid_mask:     torch.Tensor = None,             # (B, L) bool
+    ) -> dict:
+        # ── encoder ─────────────────────────────────────────────────────────
+        encoder_out = self.encoder(
+            input_ids      = input_ids,
+            attention_mask = attention_mask,
+        )
+        hidden = encoder_out.last_hidden_state   # (B, L, H)
+        # ── GlobalPointer logits ─────────────────────────────────────────────
+        logits = self.global_pointer(hidden)     # (B, C, L, L)
+        # ── span validity mask ───────────────────────────────────────────────
+        # [FIX-4] prefer valid_mask (excludes CLS/SEP) over attention_mask
+        if valid_mask is None:
+            valid_mask = attention_mask.bool()
+        mask2d = self._build_span_mask(valid_mask)   # (B, 1, L, L)
+        # Apply mask to logits for inference (fill invalid with -1e4)
+        logits_masked = logits.masked_fill(
+            ~mask2d.expand_as(logits), -1e4
+        )
+        # ── loss ─────────────────────────────────────────────────────────────
+        loss = None
+        if labels is not None:
+            # [FIX-3] ensure float regardless of bool input from dataset
+            labels_f = labels.float()
+            if self.loss_type == "circle":
+                loss = multilabel_circle_loss(
+                    logits  = logits,         # raw (unmasked) scores
+                    labels  = labels_f,
+                    mask2d  = mask2d,
+                    margin  = self.circle_margin,
+                    gamma   = self.circle_gamma,
+                )
+            else:
+                loss = multilabel_bce_loss(
+                    logits = logits,
+                    labels = labels_f,
+                    mask2d = mask2d,
+                )
+        return {
+            "loss":   loss,
+            "logits": logits_masked,   # (B, C, L, L)
+        }
+    def save_pretrained(self, save_directory: str | Path, *, extra_config: dict | None = None) -> None:
+        save_dir = Path(save_directory)
+        save_dir.mkdir(parents=True, exist_ok=True)
+        config = {
+            "architectures": [self.__class__.__name__],
+            "model_name": self.model_name,
+            "num_labels": self.num_labels,
+            "head_size": self.head_size,
+            "loss_type": self.loss_type,
+            "use_rope": self.use_rope,
+            "dropout": self.dropout,
+            "circle_margin": self.circle_margin,
+            "circle_gamma": self.circle_gamma,
+        }
+        if extra_config:
+            config.update(extra_config)
+        with open(save_dir / "config.json", "w", encoding="utf-8") as f:
+            json.dump(config, f, indent=2, ensure_ascii=False)
+        torch.save(self.state_dict(), save_dir / "pytorch_model.bin")
+    @classmethod
+    def from_pretrained(
+        cls,
+        model_dir: str | Path,
+        *,
+        device: torch.device | str | None = None,
+        cache_dir: str | None = None,
+    ) -> tuple["EcomBertNER", dict]:
+        model_dir = Path(model_dir)
+        with open(model_dir / "config.json", "r", encoding="utf-8") as f:
+            cfg = json.load(f)
+        model = cls(
+            model_name=cfg.get("model_name", "bert-base-chinese"),
+            num_labels=int(cfg.get("num_labels", 23)),
+            head_size=int(cfg.get("head_size", 64)),
+            loss_type=str(cfg.get("loss_type", "circle")),
+            use_rope=bool(cfg.get("use_rope", True)),
+            dropout=float(cfg.get("dropout", 0.1)),
+            cache_dir=cache_dir,
+            circle_margin=float(cfg.get("circle_margin", 0.25)),
+            circle_gamma=float(cfg.get("circle_gamma", 32.0)),
+        )
+        # The file holds a plain state_dict, so restrict unpickling to tensors.
+        state = torch.load(model_dir / "pytorch_model.bin", map_location="cpu", weights_only=True)
+        model.load_state_dict(state)
+        if device is not None:
+            model.to(device)
+        model.eval()
+        return model, cfg
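
The docstring of `multilabel_circle_loss` gives the loss as `log(1 + Σ exp(γ·(s_neg + m))) + log(1 + Σ exp(−γ·(s_pos − m)))`. To make the effect of the margin `m` and scale `γ` concrete, here is a minimal pure-Python sketch of that formula for a single (batch, label) slice. The function name `circle_loss_1d` and the example scores are assumptions for illustration; the defaults mirror `circle_margin` and `circle_gamma` in `config.json`.

```python
import math
from typing import Sequence


def circle_loss_1d(pos: Sequence[float], neg: Sequence[float],
                   margin: float = 0.25, gamma: float = 32.0) -> float:
    """log(1 + sum exp(gamma*(s_neg + m))) + log(1 + sum exp(-gamma*(s_pos - m)))."""
    neg_sum = sum(math.exp(gamma * (s + margin)) for s in neg)
    pos_sum = sum(math.exp(-gamma * (s - margin)) for s in pos)
    return math.log1p(neg_sum) + math.log1p(pos_sum)


# Well-separated scores: positive far above +m, negative far below -m.
loss_separated = circle_loss_1d(pos=[1.0], neg=[-1.0])
# Undecided scores: both sit at 0, inside the margin band.
loss_undecided = circle_loss_1d(pos=[0.0], neg=[0.0])
# A label with no positive span: the positive term is log(1 + 0) = 0.
loss_no_pos = circle_loss_1d(pos=[], neg=[-1.0])

print(f"separated={loss_separated:.3e}")
print(f"undecided={loss_undecided:.4f}")
print(f"no positives={loss_no_pos:.3e}")
```

With these defaults a negative span contributes noticeably once its score rises above `-m`, and a positive span once its score falls below `+m`, so the loss pushes the two groups at least `2m` apart.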

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4f68f8b26a690304cb7dc1513c3107cc1919e2a0bd3b76b832b2695a89369fd7
+size 1579917023
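
The size in this LFS pointer allows a rough cross-check of the "~0.4B parameters" figure in the README. The arithmetic below assumes the weights are stored as 32-bit floats, which the upload does not state; serialization overhead and non-parameter buffers are ignored, so the result is an upper bound under that assumption.

```python
size_bytes = 1_579_917_023   # "size" field of the LFS pointer
bytes_per_param = 4          # assumption: float32 weights

approx_params = size_bytes / bytes_per_param
print(f"{approx_params / 1e9:.3f}B parameters at most, if stored as float32")
```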

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,945 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "|||IP_ADDRESS|||",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "1": {
+      "content": "<|padding|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50254": {
+      "content": "                        ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50255": {
+      "content": "                       ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50256": {
+      "content": "                      ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50257": {
+      "content": "                     ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50258": {
+      "content": "                    ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50259": {
+      "content": "                   ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50260": {
+      "content": "                  ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50261": {
+      "content": "                 ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50262": {
+      "content": "                ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50263": {
+      "content": "               ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50264": {
+      "content": "              ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50265": {
+      "content": "             ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50266": {
+      "content": "            ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50267": {
+      "content": "           ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50268": {
+      "content": "          ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50269": {
+      "content": "         ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50270": {
+      "content": "        ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50271": {
+      "content": "       ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50272": {
+      "content": "      ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50273": {
+      "content": "     ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50274": {
+      "content": "    ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50275": {
+      "content": "   ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50276": {
+      "content": "  ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50277": {
+      "content": "|||EMAIL_ADDRESS|||",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50278": {
+      "content": "|||PHONE_NUMBER|||",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50279": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50280": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50281": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50282": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50283": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50284": {
+      "content": "[MASK]",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50285": {
+      "content": "[unused0]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50286": {
+      "content": "[unused1]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50287": {
+      "content": "[unused2]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50288": {
+      "content": "[unused3]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50289": {
+      "content": "[unused4]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50290": {
+      "content": "[unused5]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50291": {
+      "content": "[unused6]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50292": {
+      "content": "[unused7]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50293": {
+      "content": "[unused8]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50294": {
+      "content": "[unused9]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50295": {
+      "content": "[unused10]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50296": {
+      "content": "[unused11]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50297": {
+      "content": "[unused12]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50298": {
+      "content": "[unused13]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50299": {
+      "content": "[unused14]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50300": {
+      "content": "[unused15]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50301": {
+      "content": "[unused16]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50302": {
+      "content": "[unused17]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50303": {
+      "content": "[unused18]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50304": {
+      "content": "[unused19]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50305": {
+      "content": "[unused20]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50306": {
+      "content": "[unused21]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50307": {
+      "content": "[unused22]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50308": {
+      "content": "[unused23]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50309": {
+      "content": "[unused24]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50310": {
+      "content": "[unused25]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50311": {
+      "content": "[unused26]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50312": {
+      "content": "[unused27]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50313": {
+      "content": "[unused28]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50314": {
+      "content": "[unused29]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50315": {
+      "content": "[unused30]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50316": {
+      "content": "[unused31]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50317": {
+      "content": "[unused32]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50318": {
+      "content": "[unused33]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50319": {
+      "content": "[unused34]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50320": {
+      "content": "[unused35]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50321": {
+      "content": "[unused36]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50322": {
+      "content": "[unused37]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50323": {
+      "content": "[unused38]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50324": {
+      "content": "[unused39]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50325": {
+      "content": "[unused40]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50326": {
+      "content": "[unused41]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50327": {
+      "content": "[unused42]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50328": {
+      "content": "[unused43]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50329": {
+      "content": "[unused44]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50330": {
+      "content": "[unused45]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50331": {
+      "content": "[unused46]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50332": {
+      "content": "[unused47]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50333": {
+      "content": "[unused48]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50334": {
+      "content": "[unused49]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50335": {
+      "content": "[unused50]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50336": {
+      "content": "[unused51]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50337": {
+      "content": "[unused52]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50338": {
+      "content": "[unused53]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50339": {
+      "content": "[unused54]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50340": {
+      "content": "[unused55]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50341": {
+      "content": "[unused56]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50342": {
+      "content": "[unused57]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50343": {
+      "content": "[unused58]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50344": {
+      "content": "[unused59]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50345": {
+      "content": "[unused60]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50346": {
+      "content": "[unused61]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50347": {
+      "content": "[unused62]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50348": {
+      "content": "[unused63]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50349": {
+      "content": "[unused64]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50350": {
+      "content": "[unused65]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50351": {
+      "content": "[unused66]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50352": {
+      "content": "[unused67]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50353": {
+      "content": "[unused68]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50354": {
+      "content": "[unused69]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50355": {
+      "content": "[unused70]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50356": {
+      "content": "[unused71]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50357": {
+      "content": "[unused72]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50358": {
+      "content": "[unused73]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50359": {
+      "content": "[unused74]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50360": {
+      "content": "[unused75]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50361": {
+      "content": "[unused76]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50362": {
+      "content": "[unused77]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50363": {
+      "content": "[unused78]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50364": {
+      "content": "[unused79]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50365": {
+      "content": "[unused80]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50366": {
+      "content": "[unused81]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50367": {
+      "content": "[unused82]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 8192,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "[UNK]"
+}
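
The `added_tokens_decoder` map above mixes two kinds of entries: tokens flagged `"special": true` (such as `[CLS]` and `<|endoftext|>`) and ordinary added tokens (whitespace runs, `|||EMAIL_ADDRESS|||`, the `[unusedN]` placeholders). As a minimal sketch of how one might separate the two with only the standard library, the snippet below parses a small hand-copied excerpt of the file. The `EXCERPT` string and the helper name `split_added_tokens` are illustrative assumptions, not part of the repository; with the real file you would presumably call `json.load` on `tokenizer_config.json` instead.

```python
import json

# Hand-copied excerpt of tokenizer_config.json (assumption: trimmed to the
# fields used here; the real file carries many more entries and flags).
EXCERPT = """
{
  "added_tokens_decoder": {
    "50276": {"content": "  ", "special": false},
    "50277": {"content": "|||EMAIL_ADDRESS|||", "special": false},
    "50279": {"content": "<|endoftext|>", "special": true},
    "50281": {"content": "[CLS]", "special": true},
    "50285": {"content": "[unused0]", "special": false}
  },
  "cls_token": "[CLS]",
  "model_max_length": 8192
}
"""


def split_added_tokens(config):
    """Return (special, regular) dicts mapping integer token id -> content."""
    special, regular = {}, {}
    for token_id, entry in config["added_tokens_decoder"].items():
        # JSON object keys are strings, so convert the id to int.
        target = special if entry["special"] else regular
        target[int(token_id)] = entry["content"]
    return special, regular


config = json.loads(EXCERPT)
special, regular = split_added_tokens(config)

print("special ids:", sorted(special))
print("regular ids:", sorted(regular))
```

Keeping the ids as integers makes it straightforward to compare them against `input_ids` produced by the tokenizer.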