---
license: apache-2.0
language: en
tags:
- text-classification
- finance
- accounting
- financial-text
- boilerplate-detection
- analyst-reports
pipeline_tag: text-classification
---

# Boilerplate Detection for Financial Text

This model identifies boilerplate (formulaic, repetitive) language in financial analyst reports and distinguishes it from substantive business content.

## Model Description

The model combines a frozen sentence transformer (all-mpnet-base-v2) with a lightweight classification head to identify boilerplate text segments. The training data come from analyst reports published between 2000 and 2020, with boilerplate defined as segments that a brokerage house repeats frequently across its reports. To construct the training set, we sampled reports and counted how often each segment recurred. A segment is labeled a positive example if it is among the top 10% most frequently repeated segments and appears at least five times in reports from the same broker within the same year. Negative examples are segments drawn at random from those with no repetition in each broker-year sample.
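
The labeling rule can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used to build the training set: the function name `label_segments`, its parameters, and the choice to rank segments within each broker-year group are assumptions made for the example, and ties at the top-10% cutoff are broken arbitrarily.

```python
import random
from collections import Counter, defaultdict


def label_segments(records, top_fraction=0.10, min_count=5, n_negative=None, seed=0):
    """Label segments from (broker, year, segment_text) records.

    Hypothetical helper: the ranking scope and tie handling are assumptions.
    """
    # Count how often each segment appears within each broker-year group.
    by_group = defaultdict(Counter)
    for broker, year, segment in records:
        by_group[(broker, year)][segment] += 1

    rng = random.Random(seed)
    positives, negatives = [], []
    for group, counts in by_group.items():
        ranked = counts.most_common()  # most frequently repeated first
        n_top = max(1, int(len(ranked) * top_fraction))
        # Positive: in the top fraction AND repeated at least min_count times.
        for segment, count in ranked[:n_top]:
            if count >= min_count:
                positives.append((group, segment))
        # Negative: random draw from segments that appear exactly once.
        unique = sorted(seg for seg, c in counts.items() if c == 1)
        k = len(unique) if n_negative is None else min(n_negative, len(unique))
        negatives.extend((group, seg) for seg in rng.sample(unique, k))
    return positives, negatives


# Toy records (synthetic, for illustration only).
records = (
    [("A", 2010, "disclaimer")] * 6
    + [("A", 2010, f"u{i}") for i in range(9)]
    + [("B", 2010, "x")] * 3
    + [("B", 2010, "y")]
)
positives, negatives = label_segments(records, n_negative=3)
```

In the toy data, only the segment repeated six times by broker `A` satisfies both conditions. The segment repeated three times by broker `B` falls short of the five-occurrence minimum, so it is neither a positive nor a negative example.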

The architecture combines mean-pooled embeddings from the sentence transformer with a simple 3-layer neural network (768 → 16 → 8 → 2) for classification.
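
As a rough sketch of that design, the following PyTorch code mean-pools token embeddings under an attention mask and passes the result through a 768 → 16 → 8 → 2 stack of linear layers. The names `mean_pool` and `ClassificationHead` are hypothetical, and the ReLU activations between layers are an assumption; the implementation shipped with the model lives in `modeling_boilerplate.py` and may differ.

```python
import torch
import torch.nn as nn


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


class ClassificationHead(nn.Module):
    """Hypothetical 3-layer head; the activation choice is an assumption."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 2),
        )

    def forward(self, token_embeddings, attention_mask):
        return self.net(mean_pool(token_embeddings, attention_mask))


# Random stand-in for the frozen encoder's output: 2 texts, 5 tokens, 768 dims.
token_embeddings = torch.randn(2, 5, 768)
attention_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
logits = ClassificationHead()(token_embeddings, attention_mask)  # shape (2, 2)
```

Because the encoder is frozen, only the small head would be trained in a setup like this, which keeps the number of trainable parameters low.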

## Usage

Because this model uses a custom architecture, load it directly rather than through the `pipeline` interface:

```python
import sys
import huggingface_hub
from transformers import AutoTokenizer
import torch

# Load model components
model_path = huggingface_hub.snapshot_download('maifeng/boilerplate_detection')
sys.path.insert(0, model_path)  # make the downloaded modeling file importable

from modeling_boilerplate import BoilerplateDetector, BoilerplateConfig

# Initialize model
config = BoilerplateConfig.from_pretrained('maifeng/boilerplate_detection')
model = BoilerplateDetector.from_pretrained('maifeng/boilerplate_detection')
tokenizer = AutoTokenizer.from_pretrained('maifeng/boilerplate_detection')

# Move model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()

# Classify texts
texts = [
    "The securities and related financial instruments described herein may not be eligible for sale in all jurisdictions or to certain categories of investors. This material is not intended as an offer or solicitation for the purchase or sale of any security or other financial instrument.",
    "Morgan Stanley & Co. LLC and its affiliates disclaim any and all liability relating to these materials, including, without limitation, any express or implied representations or warranties for statements or errors contained in, or omissions from, these materials.",
    "And while we acknowledge the company has made significant progress on the cost side, Harman will have to consistently execute on those cost cutting initiatives for the next several quarters to help prop-up its low-price and low-margin customized business.",
    "Microsoft's Azure cloud revenue grew 29% year-over-year in constant currency, with particular strength in AI services where usage increased 180% quarter-over-quarter. The company signed 15 new enterprise AI contracts worth over $100 million each during the quarter."
]

# Classification threshold (default 0.5, can be adjusted based on precision/recall requirements)
threshold = 0.5

results = []
for text in texts:
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}  # Move inputs to device

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]

    # Index 1 holds the boilerplate class probability
    boilerplate_prob = probs[1].item()
    label = 'BOILERPLATE' if boilerplate_prob > threshold else 'NOT_BOILERPLATE'

    results.append({'text': text, 'label': label, 'boilerplate_probability': boilerplate_prob})

for result in results:
    print(f"{result['label']:>15}: {result['boilerplate_probability']:.3f} - {result['text'][:80]}...")
```
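
One way to pick a threshold other than 0.5 is to sweep candidate values over a labeled validation set and compare precision and recall. The sketch below assumes you have such a set; the helper `precision_recall_at` is hypothetical, and the probabilities and labels are synthetic toy values, not outputs of this model.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall for the boilerplate class at a given threshold."""
    predicted = [p > threshold for p in probs]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    # Convention: report 1.0 when the denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Synthetic validation data for illustration (1 = boilerplate).
val_probs = [0.95, 0.80, 0.62, 0.55, 0.40, 0.30, 0.10]
val_labels = [1, 1, 1, 0, 1, 0, 0]

for t in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(val_probs, val_labels, t)
    print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Raising the threshold trades recall for precision. A higher value suits cases where wrongly discarding substantive text is costly, and a lower value suits aggressive boilerplate removal.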

## Citation

If you find the model useful, please cite:

```bibtex
@article{li2025dissecting,
  title={Dissecting Corporate Culture Using Generative AI},
  author={Li, Kai and Mai, Feng and Shen, Rui and Yang, Chelsea and Zhang, Tengfei},
  journal={Review of Financial Studies},
  year={2025}
}
```