Upload README.md with huggingface_hub

README.md +387 -3

@@ -1,3 +1,387 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+  - en
+tags:
+  - text-classification
+  - token-classification
+  - multi-label-classification
+  - named-entity-recognition
+  - multitask
+  - finance
+  - energy
+  - news
+  - distilbert
+  - quantbridge
+pipeline_tag: token-classification
+base_model: QuantBridge/energy-intelligence-multitask-custom-ner
+datasets:
+  - ag_news
+  - reuters-21578
+  - rmisra/news-category-dataset
+---
+# Energy Intelligence Multitask Model
+**QuantBridge / energy-intelligence-multitask**
+A single DistilBERT model with a **shared encoder** and **two task heads** for energy and financial news analysis. One forward pass returns both named entities **and** topic labels simultaneously.
+| Head | Task | Output shape |
+|------|------|-------------|
+| **NER** | Named entity recognition (BIO) | `(batch, seq_len, 19)` |
+| **CLS** | Multi-label topic classification | `(batch, 10)` |
+---
+## Architecture
+```
+Input headline
+      |
+BertTokenizer  (do_lower_case=True, max_length=128)
+      |
+DistilBERT encoder  (6 layers · 768 dim · 12 heads · ~67M params)
+[weights from QuantBridge/energy-intelligence-multitask-custom-ner]
+      |
+      +──────────────────────────────────────────┐
+      |                                          |
+  all token hidden states                   [CLS] hidden state
+      |                                          |
+  Dropout(0.1)                         Linear(768→768) + ReLU
+      |                                     Dropout(0.2)
+  Linear(768→19)                        Linear(768→10)
+      |                                          |
+  NER logits                             CLS logits
+  argmax → BIO entity tags           sigmoid → topic probabilities
+```
+---
+## NER Label Space — 19 BIO tags
+| Entity Type | Example extractions from test set |
+|---|---|
+| `COMPANY` | ExxonMobil, Gazprom, Maersk, Shell, Chevron, BP, Equinor |
+| `ORGANIZATION` | OPEC+, US Treasury, Federal Reserve, IMF, FERC, IAEA |
+| `COUNTRY` | Saudi Arabia, Russia, China, Iran, Venezuela, Germany |
+| `COMMODITY` | crude oil, natural gas, LNG, methane, aluminum, hydrogen |
+| `LOCATION` | Strait of Hormuz, Red Sea, Gulf of Mexico, North Sea, Kollsnes |
+| `MARKET` | S&P 500, Brent, WTI |
+| `EVENT` | Hurricane Ida, Houthi attacks |
+| `PERSON` | Elon Musk, Jerome Powell |
+| `INFRASTRUCTURE` | pipelines, refineries, terminals |
+Each type uses standard BIO tagging: `B-<TYPE>` starts a span, `I-<TYPE>` continues it, `O` marks non-entities.
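As a minimal sketch, the 19-tag count follows from the nine entity types above plus `O`. The helper name `build_bio_labels` and the id ordering (`O` first, then `B-`/`I-` pairs) are assumptions for illustration; the authoritative mapping is whatever `model.config.ner_id2label` contains.

```python
# Nine entity types from the table above.
ENTITY_TYPES = [
    "COMPANY", "ORGANIZATION", "COUNTRY", "COMMODITY", "LOCATION",
    "MARKET", "EVENT", "PERSON", "INFRASTRUCTURE",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


# Hypothetical id order; read the real one from model.config.ner_id2label.
BIO_LABELS = build_bio_labels(ENTITY_TYPES)
print(len(BIO_LABELS))  # 9 types * 2 + 1 = 19
```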
+---
+## Classification Label Space — 10 topic labels
+| Label | Description | Avg score (test set) |
+|-------|-------------|---------------------|
+| `macro` | GDP, inflation, central bank policy | **0.323** |
+| `politics` | Government policy, sanctions, diplomacy | **0.307** |
+| `business` | Corporate earnings, M&A, operations | 0.219 |
+| `technology` | Tech, innovation, clean-tech | 0.155 |
+| `energy` | Oil, gas, renewables, power grid | 0.070 |
+| `trade` | Tariffs, import/export, agreements | 0.046 |
+| `shipping` | Maritime logistics, ports | 0.038 |
+| `stocks` | Equity markets, share prices | 0.015 |
+| `regulation` | Compliance, legislation, rules | 0.013 |
+| `risk` | Crises, geopolitical tension | 0.013 |
+> **Note on classification scores:** The classification head was trained on AG News + Reuters + Kaggle — datasets dominated by general `business` and `macro` content. Domain-specific labels (`energy`, `shipping`, `risk`, `regulation`, `stocks`) score lower as a result. The relative ranking of scores is semantically meaningful even when raw values are low. See [Limitations](#limitations).
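Because raw scores are low, it can help to look at the ranking alongside the thresholded labels. The sketch below uses a hypothetical helper, `rank_topics`, and feeds it the test-set averages from the table above purely as sample input; it is not part of the model's API.

```python
# Test-set average scores copied from the table above (sample input only).
AVG_SCORES = {
    "macro": 0.323, "politics": 0.307, "business": 0.219, "technology": 0.155,
    "energy": 0.070, "trade": 0.046, "shipping": 0.038, "stocks": 0.015,
    "regulation": 0.013, "risk": 0.013,
}


def rank_topics(scores, threshold=0.20, top_k=5):
    """Return (labels at or above threshold, top_k labels by score)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    active = [label for label, score in ranked if score >= threshold]
    top = [label for label, _ in ranked[:top_k]]
    return active, top


active, top = rank_topics(AVG_SCORES)
print(active)  # ['macro', 'politics', 'business']
print(top)     # ['macro', 'politics', 'business', 'technology', 'energy']
```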
+---
+## Test Results
+Evaluated on **40 real-world energy & financial news headlines** across 10 domain groups (ENERGY, GEOPOLITICAL, SHIPPING, TRADE, MACRO, CORPORATE, REGULATION, TECHNOLOGY, STOCKS, RISK).
+### NER Results
+| Metric | Value |
+|--------|-------|
+| Total entities detected | **86** across 40 headlines |
+| Average entities per headline | **2.15** |
+| Entity types fired | **7 / 9** |
+**Entity type frequency:**
+| Entity Type | Detections | Example extractions |
+|---|---|---|
+| COMMODITY | 20 | oil production, crude, LNG, natural gas, aluminum, methane, hydrogen |
+| COUNTRY | 19 | Saudi Arabia, Russia, China, Iran, Venezuela, Poland, Bulgaria, UK |
+| ORGANIZATION | 15 | OPEC+, US Treasury, Federal Reserve, IMF, G7, FERC, IAEA, SEC |
+| COMPANY | 15 | ExxonMobil, Gazprom, Maersk, Shell, Chevron, Equinor, BP, Tesla, Vestas |
+| LOCATION | 14 | Kollsnes, Strait of Hormuz, Red Sea, Panama Canal, North Sea, Gulf of Mexico |
+| EVENT | 2 | Hurricane Ida, Houthi (attacks) |
+| MARKET | 1 | S&P 500 |
+| PERSON | 0 | — (not fired on this test set) |
+| INFRASTRUCTURE | 0 | — (not fired on this test set) |
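A tally like the one above can be produced by counting the `type` field of each decoded entity. The sketch below assumes entities are dicts with `text` and `type` keys, as returned by the `predict` function in the Usage section; `count_entity_types` and the two-headline sample are illustrative, not the authors' evaluation script. The last lines re-derive the summary figures from the reported per-type counts.

```python
from collections import Counter


def count_entity_types(entities_per_headline):
    """Count entity types across a list of per-headline entity lists."""
    counts = Counter()
    for entities in entities_per_headline:
        counts.update(entity["type"] for entity in entities)
    return counts


# Made-up sample in the same shape as predict()'s entity output.
sample = [
    [{"text": "russia", "type": "COUNTRY"}, {"text": "natural gas", "type": "COMMODITY"}],
    [{"text": "poland", "type": "COUNTRY"}],
]
sample_counts = count_entity_types(sample)

# Per-type detections reported in the table above.
REPORTED = {
    "COMMODITY": 20, "COUNTRY": 19, "ORGANIZATION": 15, "COMPANY": 15,
    "LOCATION": 14, "EVENT": 2, "MARKET": 1, "PERSON": 0, "INFRASTRUCTURE": 0,
}
total = sum(REPORTED.values())
types_fired = sum(1 for n in REPORTED.values() if n > 0)
print(total, total / 40, types_fired)  # 86 2.15 7
```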
+**Key NER observations:**
+- COMMODITY is the top entity type — the model reliably extracts energy goods (oil, crude, LNG, natural gas, hydrogen) and commodities (aluminum, solar panels)
+- COUNTRY and ORGANIZATION fire consistently across all domain groups
+- COMPANY detection is accurate: correctly identifies both energy majors (ExxonMobil, Shell, BP) and non-energy companies (Tesla, Maersk, Vestas)
+- LOCATION captures geopolitically important hotspots correctly (Red Sea, Strait of Hormuz, Gulf of Mexico, North Sea)
+- MARKET fires on "S&P 500" but misses "Brent" and "WTI" — likely a tokenisation artefact where these are split sub-words during BertTokenizer processing
+- PERSON and INFRASTRUCTURE did not fire on this test set; both types are in the model's label vocabulary and can be predicted, but this evaluation does not measure how reliably
+### Classification Results (threshold = 0.20)
+**Label activation frequency across 40 headlines:**
+| Label | Active headlines | % | Avg score |
+|-------|-----------------|---|-----------|
+| macro | 14 / 40 | 35% | 0.323 |
+| politics | 9 / 40 | 22.5% | 0.307 |
+| business | 1 / 40 | 2.5% | 0.219 |
+| technology | 0 / 40 | 0% | 0.155 |
+| energy | 0 / 40 | 0% | 0.070 |
+| trade | 0 / 40 | 0% | 0.046 |
+| shipping | 0 / 40 | 0% | 0.038 |
+| stocks | 0 / 40 | 0% | 0.015 |
+| regulation | 0 / 40 | 0% | 0.013 |
+| risk | 0 / 40 | 0% | 0.013 |
+**Domain-group heatmap** (`>>>` = group average score ≥ 0.35):
+| Domain group | energy | politics | trade | stocks | regulation | shipping | macro | business | technology | risk |
+|---|---|---|---|---|---|---|---|---|---|---|
+| ENERGY | 0.09 | 0.28 | 0.07 | 0.02 | 0.02 | 0.05 | **>>>** | 0.26 | 0.17 | 0.02 |
+| GEOPOLITICAL | 0.06 | 0.30 | 0.04 | 0.01 | 0.01 | 0.03 | 0.30 | 0.19 | 0.12 | 0.01 |
+| SHIPPING | 0.07 | 0.31 | 0.04 | 0.01 | 0.01 | 0.05 | 0.28 | 0.23 | 0.14 | 0.01 |
+| TRADE | 0.06 | 0.30 | 0.05 | 0.01 | 0.01 | 0.04 | 0.31 | 0.23 | 0.17 | 0.01 |
+| MACRO | 0.07 | 0.30 | 0.06 | 0.02 | 0.02 | 0.04 | **>>>** | 0.22 | 0.17 | 0.02 |
+| CORPORATE | 0.09 | 0.33 | 0.05 | 0.02 | 0.02 | 0.04 | **>>>** | 0.23 | 0.16 | 0.02 |
+| REGULATION | 0.04 | 0.26 | 0.03 | 0.01 | 0.01 | 0.02 | 0.32 | 0.26 | 0.18 | 0.01 |
+| TECHNOLOGY | 0.07 | **>>>** | 0.04 | 0.01 | 0.01 | 0.04 | 0.28 | 0.17 | 0.14 | 0.01 |
+| STOCKS | 0.08 | 0.30 | 0.04 | 0.01 | 0.01 | 0.03 | 0.32 | 0.21 | 0.16 | 0.01 |
+| RISK | 0.08 | 0.32 | 0.04 | 0.01 | 0.01 | 0.04 | 0.34 | 0.19 | 0.14 | 0.01 |
+**Key classification observations:**
+- `macro` has the highest average score in most domain groups (it ties with `politics` in GEOPOLITICAL and trails it in SHIPPING and TECHNOLOGY), a direct consequence of training data composition: the AG News World category and the Kaggle WORLD NEWS category both map to `macro`
+- `politics` averages at least 0.26 in every group and peaks in TECHNOLOGY, which is semantically reasonable for headlines about government energy policy and sanctions
+- Domain-specific labels (`energy`, `shipping`, `risk`, `regulation`, `stocks`) score consistently low — these categories are underrepresented in training data
+- The score **ranking** is meaningful: for ENERGY headlines, `energy` consistently ranks above `trade`, `shipping`, and `regulation` even when below threshold, which suggests the model has learned the relative associations among these labels
+---
+## Usage
+> **Important:** This model uses custom architecture files. Always pass `trust_remote_code=True`.
+### Installation
+```bash
+pip install transformers torch
+```
+### Full inference — NER + Classification
+```python
+import torch
+import numpy as np
+from transformers import AutoTokenizer, AutoModel
+MODEL_ID = "QuantBridge/energy-intelligence-multitask"
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+model     = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
+model.eval()
+def sigmoid(x):
+    return 1 / (1 + np.exp(-x))
+def predict(text: str, cls_threshold: float = 0.20):
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
+    inputs.pop("token_type_ids", None)   # DistilBERT does not use these
+    with torch.no_grad():
+        output = model(**inputs)
+    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+    # ── Named Entity Recognition ──────────────────────────────────────────
+    ner_id2label = {int(k): v for k, v in model.config.ner_id2label.items()}
+    tag_ids = output.ner_logits[0].argmax(-1).tolist()
+    entities = []
+    current = None
+    for token, tag_id in zip(tokens, tag_ids):
+        if token in ("[CLS]", "[SEP]", "[PAD]"):
+            if current: entities.append(current); current = None
+            continue
+        tag = ner_id2label[tag_id]
+        if tag.startswith("B-"):
+            if current: entities.append(current)
+            current = {"text": token.replace("##", ""), "type": tag[2:]}
+        elif tag.startswith("I-") and current:
+            current["text"] += token[2:] if token.startswith("##") else f" {token}"
+        else:
+            if current: entities.append(current); current = None
+    if current: entities.append(current)
+    # ── Topic Classification ──────────────────────────────────────────────
+    cls_id2label = {int(k): v for k, v in model.config.cls_id2label.items()}
+    probs  = sigmoid(output.cls_logits[0].numpy())
+    topics = {cls_id2label[i]: float(probs[i]) for i in range(len(probs))}
+    active = {lbl: p for lbl, p in topics.items() if p >= cls_threshold}
+    return entities, active
+# Example
+headline = "Russia cuts natural gas flows to Poland and Bulgaria following payment dispute"
+entities, topics = predict(headline)
+print("Entities found:")
+for e in entities:
+    print(f"  [{e['type']}]  {e['text']}")
+print("\nActive topic labels:")
+for topic, score in sorted(topics.items(), key=lambda x: -x[1]):
+    print(f"  {topic}: {score:.3f}")
+```
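+**Expected output** (spacing aligned and entities grouped by type for readability; the code above prints entities in order of appearance and, with `do_lower_case=True`, in lowercase):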
+```
+Entities found:
+  [COUNTRY]   Russia
+  [COUNTRY]   Poland
+  [COUNTRY]   Bulgaria
+  [COMMODITY] natural gas
+Active topic labels:
+  politics: 0.362
+  macro: 0.357
+```
+### NER only — decode all entity spans
+```python
+def get_entities(text: str) -> list[dict]:
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
+    inputs.pop("token_type_ids", None)
+    with torch.no_grad():
+        output = model(**inputs)
+    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+    ner_id2label = {int(k): v for k, v in model.config.ner_id2label.items()}
+    tag_ids = output.ner_logits[0].argmax(-1).tolist()
+    # ... (decode as shown above)
+```
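The decode step elided in `get_entities` can be factored into a standalone helper. The sketch below mirrors the span-merging loop from the full inference example; the name `decode_bio` is hypothetical, and the tokens and tags in the demo are hand-written sample data, not model output.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def decode_bio(tokens, tags):
    """Merge WordPiece tokens and their BIO tags into entity spans."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        is_special = token in SPECIAL_TOKENS
        if not is_special and tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"text": token.replace("##", ""), "type": tag[2:]}
        elif not is_special and tag.startswith("I-") and current:
            # "##" marks a word-internal piece: glue it on without a space.
            current["text"] += token[2:] if token.startswith("##") else f" {token}"
        else:
            # "O", a special token, or a stray "I-" closes any open span.
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


tokens = ["[CLS]", "russia", "cuts", "natural", "gas", "flows", "[SEP]"]
tags = ["O", "B-COUNTRY", "O", "B-COMMODITY", "I-COMMODITY", "O", "O"]
print(decode_bio(tokens, tags))
```

Inside `get_entities`, the final line could then be `return decode_bio(tokens, [ner_id2label[i] for i in tag_ids])`.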
+### Classification only — get all label scores
+```python
+def get_topic_scores(text: str) -> dict[str, float]:
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
+    inputs.pop("token_type_ids", None)
+    with torch.no_grad():
+        output = model(**inputs)
+    cls_id2label = {int(k): v for k, v in model.config.cls_id2label.items()}
+    probs = sigmoid(output.cls_logits[0].numpy())
+    return {cls_id2label[i]: float(probs[i]) for i in range(len(probs))}
+```
+---
+## Training Details
+### Encoder
+Transferred from [`QuantBridge/energy-intelligence-multitask-custom-ner`](https://huggingface.co/QuantBridge/energy-intelligence-multitask-custom-ner) — DistilBERT fine-tuned on energy and financial news for BIO entity recognition.
+### NER Head
+Weights transferred directly from the NER backbone (`classifier.*` → `ner_classifier.*`). No additional NER training was performed.
+### Classification Head
+Trained separately from scratch on a merged corpus:
+| Source | HF / NLTK id | Categories used | Mapped to |
+|--------|-------------|-----------------|-----------|
+| AG News | `ag_news` | World (0), Business (2), Sci/Tech (3) | `macro`, `business`, `technology` |
+| Reuters-21578 | `nltk.corpus.reuters` | crude, gas, ship, trade, money-fx, interest, earn, acq | `energy`, `shipping`, `trade`, `macro`, `business` |
+| Kaggle News Category | `rmisra/news-category-dataset` | POLITICS, BUSINESS, TECH, WORLD NEWS | `politics`, `business`, `technology`, `macro` |
+Training split: 80% train / 10% validation / 10% test, seed 42.
+**Hyperparameters:**
+| Parameter | Value |
+|---|---|
+| Epochs | 10 |
+| Train batch size | 32 |
+| Learning rate | 2e-5 |
+| Warmup steps | 500 |
+| Weight decay | 0.01 |
+| Max sequence length | 128 tokens |
+| Loss | BCEWithLogitsLoss |
+| Best checkpoint selected by | micro-F1 on validation set |
+| Hardware | NVIDIA T4 16 GB |
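For reference, `BCEWithLogitsLoss` treats each of the 10 topic labels as an independent binary decision on the raw logit. The pure-Python sketch below computes the same quantity in its numerically stable form; mean reduction (PyTorch's default) and the absence of `pos_weight` are assumptions, since the training configuration above does not state them.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# A logit of 0 means probability 0.5, so each label costs ln(2).
print(round(bce_with_logits([0.0, 0.0], [1.0, 0.0]), 4))  # 0.6931
```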
+---
+## Model Files
+```
+energy-intelligence-multitask/
+  configuration_energy_multitask.py   # EnergyMultitaskConfig (DistilBertConfig subclass)
+  modeling_energy_multitask.py        # EnergyMultitaskModel  (two-head architecture)
+  config.json                         # Serialised config with auto_map
+  model.safetensors                   # Combined weights (~256 MB)
+  tokenizer.json                      # Fast tokenizer
+  tokenizer_config.json               # Tokenizer settings
+```
+---
+## Limitations
+- **English only** — trained exclusively on English-language news text.
+- **Classification data bias** — training corpora (AG News, Kaggle) are dominated by `business` and `macro` content. Domain-specific labels (`energy`, `shipping`, `risk`, `regulation`, `stocks`) score lower across the board and may not cross common thresholds even when semantically correct. A recommended threshold for this model is **0.20** rather than the default 0.50.
+- **NER on headlines** — the NER head was fine-tuned on short news headlines; performance may be lower on long-form documents.
+- **Max length** — inputs are truncated to 128 tokens. Longer texts should be chunked.
+- **PERSON / INFRASTRUCTURE** — these entity types exist in the label vocabulary but did not fire at all on the 40-headline test set, so their accuracy is unverified here.
+- **Not for trading** — this model is intended as an intelligence tagging layer, not for real-time trading or financial decision-making.
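For the max-length limitation, one simple approach is to split long text into overlapping word windows and run each window through the model separately. The sketch below is an assumption-laden illustration: `chunk_words` is a hypothetical helper, and `max_words=80` is only a heuristic, because a word can expand into several WordPiece tokens and the tokenizer's 128-token truncation still applies to each chunk.

```python
def chunk_words(text, max_words=80, overlap=10):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


long_text = " ".join(f"w{i}" for i in range(200))
chunks = chunk_words(long_text)
print(len(chunks))  # 3
```

The overlap reduces the chance that an entity is cut in half at a chunk boundary; entities found in overlapping regions would need de-duplication downstream.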
+---
+## Intended Use
+This model is the **tagging layer** in an energy intelligence pipeline:
+```
+Raw news headline
+       ↓
+EnergyMultitaskModel (this model)
+       ↓
+  entities  ──────────────────→  who / what / where
+  topic labels  ──────────────→  energy / risk / trade / macro / ...
+       ↓
+Structured intelligence signal for downstream analysis
+```
+---
+## Related Models
+- [`QuantBridge/energy-intelligence-multitask-custom-ner`](https://huggingface.co/QuantBridge/energy-intelligence-multitask-custom-ner) — NER backbone (encoder source)
+- [`QuantBridge/energy-news-classifier-ner-multitask`](https://huggingface.co/QuantBridge/energy-news-classifier-ner-multitask) — Classification-only model (single head)
+---
+## Citation
+```bibtex
+@misc{quantbridge2025multitask,
+  author       = {QuantBridge},
+  title        = {Energy Intelligence Multitask Model (NER + Classification)},
+  year         = {2025},
+  publisher    = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/QuantBridge/energy-intelligence-multitask}},
+}
+```