Commit 1933e19 (verified) by silashundhausen · 1 Parent(s): e39aaa1

Update README.md

  1. README.md +34 -0
README.md CHANGED
@@ -145,6 +145,15 @@ The model classifies documents into this hierarchy:
  | :--- | :--- | :--- |
  | • M&A Activity<br>• Legal Proceedings Report | • Capital/Financing Update<br>• Interest Rate Notice | • Net Asset Value (NAV)<br>• Fund Factsheet |
 
+ ## 📜 The Standard: Financial Reporting Classification Framework (FRCF)
+
+ The taxonomy used by this model is based on the **[Financial Reporting Classification Framework (FRCF)](https://financialreports.eu/financial-reporting-classification-framework/)**, an open-source standard designed to organize corporate disclosures in a consistent, cross-jurisdictional format.
+
+ Unlike fragmented regulatory schemes, the FRCF organizes disclosures by **functional purpose**, ensuring comparability across markets (e.g., mapping a US *10-K* and a European *Annual Financial Report* to the same standardized `Annual Report` category).
+
+ * **[Explore the Framework](https://financialreports.eu/financial-reporting-classification-framework/)**
+ * **[Download Methodology (PDF)](https://financialreports.eu/download/frcf-methodology.pdf)**
+
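The functional mapping described above can be pictured as a lookup from jurisdiction-specific form names to one shared category. The following is a minimal sketch of that idea only, not of how the model classifies documents. The `FORM_TO_FRCF` table and the `to_frcf_category` helper are hypothetical names; only the *10-K* / *Annual Financial Report* pairing comes from the README, and the authoritative mapping is the one defined by the FRCF itself.

```python
# Hypothetical lookup table; only these two entries are taken from the README.
FORM_TO_FRCF = {
    ("US", "10-K"): "Annual Report",
    ("EU", "Annual Financial Report"): "Annual Report",
}


def to_frcf_category(jurisdiction: str, form_name: str) -> str:
    """Return the shared category for a local form name, or "Unmapped"."""
    return FORM_TO_FRCF.get((jurisdiction, form_name), "Unmapped")


us = to_frcf_category("US", "10-K")
eu = to_frcf_category("EU", "Annual Financial Report")
print(us == eu)  # both filings resolve to the same functional category
```

Keying on the jurisdiction as well as the form name keeps identically named forms from different markets apart, which is one plausible way to reflect the cross-jurisdictional intent.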
  ## 📚 Training Data
 
  The model was trained on a proprietary **Golden Dataset of 27,671 financial filings**, manually curated to represent the diverse landscape of global corporate reporting.
@@ -153,3 +162,28 @@ The model was trained on a proprietary **Golden Dataset of 27,671 financial fili
  * **Multilingual:** Includes documents in English, French, German, and other major European languages (leveraging the multilingual capabilities of Jina-V3).
  * **Diversity:** The dataset preserves the natural "long-tail" distribution of financial data, ranging from massive 500+ page **Annual Reports** to single-page **Press Releases** and complex **ESG Disclosures**.
  * **Quality Control:** Mapped to a strict 2-level hierarchy to resolve semantic ambiguities common in regulatory filings (e.g., distinguishing a *Share Buyback* announcement from a *Director's Dealing* notification).
+
+ ## ⚙️ Deployment & Hardware
+
+ This model is optimized for **GPU inference** because the Jina encoder's 8192-token context window is compute-intensive. CPU inference is possible but significantly slower.
+
+ ### Recommended Configuration
+
+ | Component | Recommendation | Notes |
+ | :--- | :--- | :--- |
+ | **GPU** | **NVIDIA T4 (16 GB)** | The "sweet spot" for cost/performance. Capable of up to ~50 docs/sec in batch mode. |
+ | **Alternative** | NVIDIA L4 / A10 | Recommended for high-concurrency production APIs. |
+ | **VRAM** | 16 GB minimum | Required to embed long documents without out-of-memory (OOM) errors. |
+ | **System RAM** | 16 GB+ | Standard requirement for PyTorch + XGBoost overhead. |
+
+ ### Critical Environment Settings
+
+ To load the underlying Jina-V3 model, you **must** enable remote code execution by setting the following environment variable in your deployment (Docker, Kubernetes, or Hugging Face Endpoints):
+
+ ```bash
+ HF_TRUST_REMOTE_CODE=True
+ ```
+
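When the encoder is loaded from your own Python code rather than a managed endpoint, the environment variable may not be picked up automatically, because `transformers` takes this opt-in through the `trust_remote_code` argument of `from_pretrained`. The sketch below shows one way to bridge the two. The helper names, the accepted values (`1`, `true`), and the `jinaai/jina-embeddings-v3` model ID are assumptions for illustration, not part of this repository.

```python
import os


def trust_remote_code_enabled(env=None) -> bool:
    """Interpret HF_TRUST_REMOTE_CODE as a boolean flag (assumed parsing rule)."""
    env = os.environ if env is None else env
    return env.get("HF_TRUST_REMOTE_CODE", "").strip().lower() in {"1", "true"}


def load_encoder(model_id: str = "jinaai/jina-embeddings-v3"):
    """Load the encoder, passing the opt-in flag explicitly (model ID assumed)."""
    # Imported lazily so the flag helper works without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        model_id, trust_remote_code=trust_remote_code_enabled()
    )
```

Passing the flag explicitly keeps the opt-in visible in code review, which matters because enabling it runs Python code shipped with the model repository.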
+ ### Throughput Benchmarks (T4 GPU)
+ * **Live API Latency:** ~200–500 ms per document.
+ * **Batch Processing:** ~40–50 documents per second (Batch Size: 64).
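For capacity planning, the batch figures above translate into wall-clock time with simple division. This is a rough sketch that assumes throughput stays constant at the quoted 40 to 50 documents per second and ignores model loading and I/O. The `batch_seconds` helper is a hypothetical name, and the 27,671 figure is just the training-set size reused as an example workload.

```python
def batch_seconds(n_docs: int, docs_per_sec: float) -> float:
    """Seconds needed to process n_docs at a constant throughput."""
    return n_docs / docs_per_sec


slow, fast = 40.0, 50.0  # quoted T4 batch throughput range, docs/sec

# Time for a single batch of 64 documents.
per_batch = (batch_seconds(64, fast), batch_seconds(64, slow))

# Time for a corpus the size of the training set (27,671 filings).
full_set = (batch_seconds(27_671, fast), batch_seconds(27_671, slow))

print(f"one batch of 64: {per_batch[0]:.2f}-{per_batch[1]:.2f} s")
print(f"27,671 filings: {full_set[0] / 60:.1f}-{full_set[1] / 60:.1f} min")
```

Under these assumptions a batch of 64 takes roughly 1.3 to 1.6 seconds, and a corpus of that size takes on the order of 9 to 12 minutes.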