maifeng committed · verified
Commit cd125b5 · 1 parent: c3614a5

Update README.md

Files changed (1): README.md (+16 −19)
README.md CHANGED
@@ -16,13 +16,13 @@ widget:
 
 # Boilerplate Detection for Financial Text
 
-This model identifies boilerplate (formulaic, repetitive) language in financial documents, distinguishing it from substantive business content. It was developed to preprocess analyst reports for research on corporate culture analysis.
 
 ## Model Description
 
-The model uses a frozen sentence transformer (all-mpnet-base-v2) combined with a lightweight classification head to identify boilerplate text segments. Training data consisted of analyst reports from 2000-2020, where boilerplate examples were identified as frequently repeated segments across reports from the same brokerage house.
 
-The architecture combines mean-pooled embeddings from the sentence transformer with a simple 3-layer neural network (768 → 16 → 8 → 2) for classification. This approach preserves semantic understanding while learning patterns specific to financial boilerplate language.
 
 ## Usage
 
@@ -49,12 +49,15 @@ model.eval()
 
 # Classify texts
 texts = [
-    "The securities described herein may not be eligible for sale in all jurisdictions.",
-    "Revenue increased by 15% this quarter due to strong market demand.",
-    "This report contains forward-looking statements involving risks.",
-    "Our new product line exceeded initial sales expectations significantly."
 ]
 
 results = []
 for text in texts:
     inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
@@ -63,21 +66,19 @@ for text in texts:
     outputs = model(**inputs)
     probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
 
-    label = 'BOILERPLATE' if probs[1] > 0.5 else 'NOT_BOILERPLATE'
-    confidence = probs[1].item() if label == 'BOILERPLATE' else probs[0].item()
 
-    results.append({'text': text, 'label': label, 'confidence': confidence})
 
 for result in results:
-    print(f"{result['label']:>15}: {result['confidence']:.1%} - {result['text'][:60]}...")
 ```
 
-## Model Limitations
-
-This model is specifically trained on financial analyst reports from 2000-2020 and performs best on similar English-language financial documents. It may not generalize well to other domains or document types. The model processes text segments up to 512 tokens and provides binary classification only.
-
 ## Citation
 
 ```bibtex
 @article{li2025dissecting,
   title={Dissecting Corporate Culture Using Generative AI},
@@ -86,7 +87,3 @@ This model is specifically trained on financial analyst reports from 2000-2020 a
   year={2025}
 }
 ```
-
-## License
-
-Apache 2.0
 
 
 # Boilerplate Detection for Financial Text
 
+This model identifies boilerplate (formulaic, repetitive) language in financial analyst reports and distinguishes it from substantive business content.
 
 ## Model Description
 
+The model uses a frozen sentence transformer (all-mpnet-base-v2) combined with a lightweight classification head to identify boilerplate text segments. Training data consisted of analyst reports from 2000-2020, where boilerplate examples were identified as frequently repeated segments across reports from the same brokerage house. To construct the training dataset, we sampled reports to find the most frequently repeated segments. For a segment to be classified as a positive example, it must be among the top 10% most frequently repeated segments and appear at least five times from the same broker within the same year. Negative examples were identified by randomly selecting segments with no repetition in each broker-year sample.
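The broker-year labeling rule described above (top 10% most frequently repeated segments, appearing at least five times) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual pipeline; the `label_segments` helper and its default parameters are assumptions based on the description.

```python
from collections import Counter

def label_segments(segments, top_frac=0.10, min_count=5):
    """Label text segments from one broker-year sample.

    Positive (1, boilerplate): among the top `top_frac` most frequently
    repeated segments AND repeated at least `min_count` times.
    Negative (0): segments that never repeat in this sample (the paper
    samples negatives randomly from these).
    Anything else is left unlabeled (None) and excluded.
    """
    counts = Counter(segments)
    # Rank unique segments by frequency and keep the top fraction as candidates.
    ranked = [seg for seg, _ in counts.most_common()]
    cutoff = max(1, int(len(ranked) * top_frac))
    top_segments = set(ranked[:cutoff])

    labels = {}
    for seg, n in counts.items():
        if seg in top_segments and n >= min_count:
            labels[seg] = 1      # frequently repeated -> boilerplate
        elif n == 1:
            labels[seg] = 0      # never repeated -> negative candidate
        else:
            labels[seg] = None   # ambiguous, excluded from training
    return labels
```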
 
+The architecture combines mean-pooled embeddings from the sentence transformer with a simple 3-layer neural network (768 → 16 → 8 → 2) for classification.
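Because the encoder is frozen, only the head is trained. A quick back-of-the-envelope parameter count for the 768 → 16 → 8 → 2 head, assuming plain fully connected layers with biases (for scale, the frozen all-mpnet-base-v2 encoder itself has on the order of 110M parameters):

```python
# Trainable parameters in the 768 -> 16 -> 8 -> 2 classification head:
# weights plus biases for each of the three linear layers.
dims = [768, 16, 8, 2]
head_params = sum(n_in * n_out + n_out for n_in, n_out in zip(dims, dims[1:]))
print(head_params)  # 12458
```

So the classifier adds only about 12.5K trainable parameters on top of the frozen encoder.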
 
 ## Usage
 
 
 # Classify texts
 texts = [
+    "The securities and related financial instruments described herein may not be eligible for sale in all jurisdictions or to certain categories of investors. This material is not intended as an offer or solicitation for the purchase or sale of any security or other financial instrument.",
+    "Morgan Stanley & Co. LLC and its affiliates disclaim any and all liability relating to these materials, including, without limitation, any express or implied representations or warranties for statements or errors contained in, or omissions from, these materials.",
+    "And while we acknowledge the company has made significant progress on the cost side, Harman will have to consistently execute on those cost cutting initiatives for the next several quarters to help prop-up its low-price and low-margin customized business.",
+    "Microsoft's Azure cloud revenue grew 29% year-over-year in constant currency, with particular strength in AI services where usage increased 180% quarter-over-quarter. The company signed 15 new enterprise AI contracts worth over $100 million each during the quarter."
 ]
 
+# Classification threshold (default 0.5, can be adjusted based on precision/recall requirements)
+threshold = 0.5
+
 results = []
 for text in texts:
     inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
 
     outputs = model(**inputs)
     probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
 
+    boilerplate_prob = probs[1].item()
+    label = 'BOILERPLATE' if boilerplate_prob > threshold else 'NOT_BOILERPLATE'
 
+    results.append({'text': text, 'label': label, 'boilerplate_probability': boilerplate_prob})
 
 for result in results:
+    print(f"{result['label']:>15}: {result['boilerplate_probability']:.3f} - {result['text'][:80]}...")
 ```
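The `threshold` in the usage snippet defaults to 0.5 and can be tuned on labeled validation data to trade precision against recall. A minimal sketch of that tuning loop; the probabilities and labels below are made-up examples, not model output:

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall for the positive (boilerplate) class at a given threshold."""
    preds = [p > threshold for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative validation scores: boilerplate probabilities with true labels.
val_probs = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
val_labels = [1, 1, 0, 1, 0, 0]

# Sweep candidate thresholds and pick one matching your precision/recall target.
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(val_probs, val_labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold keeps only high-confidence boilerplate (higher precision, lower recall), which is usually preferable when the goal is filtering text before downstream analysis.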
 
 ## Citation
 
+If you find the model useful, please cite:
+
 ```bibtex
 @article{li2025dissecting,
   title={Dissecting Corporate Culture Using Generative AI},
 
   year={2025}
 }
 ```