Update README.md
README.md
@@ -145,6 +145,15 @@ The model classifies documents into this hierarchy:
| :--- | :--- | :--- |
| • M&A Activity<br>• Legal Proceedings Report | • Capital/Financing Update<br>• Interest Rate Notice | • Net Asset Value (NAV)<br>• Fund Factsheet |

## 📜 The Standard: Financial Reporting Classification Framework (FRCF)

The taxonomy used by this model is based on the **[Financial Reporting Classification Framework (FRCF)](https://financialreports.eu/financial-reporting-classification-framework/)**, an open-source standard designed to organize corporate disclosures in a consistent, cross-jurisdictional format.

Unlike fragmented regulatory schemes, the FRCF organizes disclosures by **functional purpose**, ensuring comparability across markets (e.g., mapping a US *10-K* and a European *Annual Financial Report* to the same standardized `Annual Report` category).

* **[Explore the Framework](https://financialreports.eu/financial-reporting-classification-framework/)**
* **[Download Methodology (PDF)](https://financialreports.eu/download/frcf-methodology.pdf)**
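
The "functional purpose" idea described above can be pictured as a simple lookup. This is a minimal sketch, not the FRCF's actual mapping table: the `FORM_TO_CATEGORY` dictionary, the jurisdiction keys, and the `to_frcf_category` helper are hypothetical names, and only the two form types mentioned in the text are included.

```python
from typing import Optional

# Illustrative only: a tiny lookup from (jurisdiction, local form name) to a
# standardized category. The two entries mirror the example in the text; the
# real taxonomy is defined in the linked FRCF methodology, not here.
FORM_TO_CATEGORY = {
    ("US", "10-K"): "Annual Report",
    ("EU", "Annual Financial Report"): "Annual Report",
}


def to_frcf_category(jurisdiction: str, form_name: str) -> Optional[str]:
    """Return the standardized category, or None if the form is not mapped."""
    return FORM_TO_CATEGORY.get((jurisdiction, form_name))


if __name__ == "__main__":
    # Both local form types resolve to the same standardized label.
    print(to_frcf_category("US", "10-K"))
    print(to_frcf_category("EU", "Annual Financial Report"))
```

In practice the classifier predicts the category from document content, so a table like this would at most serve as a sanity check on labels.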
## 📚 Training Data

The model was trained on a proprietary **Golden Dataset of 27,671 financial filings**, manually curated to represent the diverse landscape of global corporate reporting.

* **Multilingual:** Includes documents in English, French, German, and other major European languages (leveraging the multilingual capabilities of Jina-V3).
* **Diversity:** The dataset preserves the natural "long-tail" distribution of financial data, ranging from massive 500+ page **Annual Reports** to single-page **Press Releases** and complex **ESG Disclosures**.
* **Quality Control:** Mapped to a strict 2-level hierarchy to resolve semantic ambiguities common in regulatory filings (e.g., distinguishing a *Share Buyback* announcement from a *Director's Dealing* notification).

## ⚙️ Deployment & Hardware

This model is optimized for **GPU inference** because of the heavy 8192-token context window of the Jina encoder. CPU inference is possible but significantly slower.

### Recommended Configuration

| Component | Recommendation | Notes |
| :--- | :--- | :--- |
| **GPU** | **NVIDIA T4 (16 GB)** | The "sweet spot" for cost/performance. Capable of up to ~50 docs/sec in batch mode. |
| **Alternative** | NVIDIA L4 / A10 | Recommended for high-concurrency production APIs. |
| **VRAM** | 16 GB minimum | Required to embed long documents without OOM errors. |
| **System RAM** | 16 GB+ | Standard requirement for PyTorch + XGBoost overhead. |

### Critical Environment Settings

To load the underlying Jina-V3 model, you **must** allow remote code execution by setting the following environment variable in your deployment environment (Docker, Kubernetes, or Hugging Face Endpoints):

```bash
HF_TRUST_REMOTE_CODE=True
```
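
If you load the encoder from your own Python code instead of a managed endpoint, the same opt-in can be made explicit. The sketch below is an assumption-laden illustration, not this repository's loading code: `remote_code_allowed` and `load_encoder` are hypothetical helpers, the accepted truthy spellings are a guess, and the model id `jinaai/jina-embeddings-v3` is assumed to be the "Jina-V3" encoder referred to above. The `trust_remote_code=True` argument to `AutoModel.from_pretrained` is the standard Transformers switch for models that ship custom code.

```python
import os


def remote_code_allowed(env=None) -> bool:
    """Interpret HF_TRUST_REMOTE_CODE as a boolean flag.

    The accepted spellings ("1", "true", "yes") are an assumption.
    """
    env = os.environ if env is None else env
    value = env.get("HF_TRUST_REMOTE_CODE", "")
    return value.strip().lower() in {"1", "true", "yes"}


def load_encoder(model_id: str = "jinaai/jina-embeddings-v3"):
    """Load the encoder, refusing to proceed unless remote code is allowed."""
    if not remote_code_allowed():
        raise RuntimeError("Set HF_TRUST_REMOTE_CODE=True before loading the model.")
    # Imported lazily so the flag check works without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(model_id, trust_remote_code=True)
```

Failing fast on a missing flag gives a clearer error at startup than a load failure deep inside the model code.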
### Throughput Benchmarks (T4 GPU)
* **Live API Latency:** ~200–500 ms per document.
* **Batch Processing:** ~40–50 documents per second (batch size: 64).
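
The batch figure above translates directly into a rough job-duration estimate. The helper below is a back-of-the-envelope sketch: `batch_time_range` is a hypothetical name, and it assumes the quoted 40 to 50 documents per second holds for your document mix, which will vary with document length.

```python
def batch_time_range(num_docs: int, slow_dps: float = 40.0, fast_dps: float = 50.0):
    """Return (best_case_seconds, worst_case_seconds) for a batch job.

    The default rates are the T4 batch throughput bounds quoted above.
    """
    if num_docs < 0 or slow_dps <= 0 or fast_dps <= 0:
        raise ValueError("num_docs must be >= 0 and rates must be positive")
    return num_docs / fast_dps, num_docs / slow_dps


if __name__ == "__main__":
    # Example: a corpus the size of the training set (27,671 filings).
    best, worst = batch_time_range(27_671)
    print(f"{best / 60:.1f} to {worst / 60:.1f} minutes")
```

For a corpus the size of the training set, this works out to roughly 9 to 12 minutes on a single T4.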