Corp-o-Rate-Community/statement-extractor

Neil Ellis committed on Jan 12

Commit e84207d · verified · 1 Parent(s): 9738283

Upload README.md with huggingface_hub

Files changed (1)

README.md ADDED +114 -0

	@@ -0,0 +1,114 @@

+---
+license: mit
+language:
+- en
+tags:
+- statement-extraction
+- named-entity-recognition
+- t5
+- gemma
+- seq2seq
+- nlp
+- information-extraction
+- corp-o-rate
+pipeline_tag: text2text-generation
+---
+# Statement Extractor (T5-Gemma 2)
+A fine-tuned T5-Gemma 2 model for extracting structured statements from text. Part of [corp-o-rate.com](https://corp-o-rate.com).
+## Model Description
+This model extracts subject-predicate-object triples from unstructured text, with automatic entity type recognition and coreference resolution.
+- **Architecture**: T5-Gemma 2 encoder-decoder (270M encoder + 270M decoder, 540M total parameters)
+- **Training Data**: 77,515 examples from corporate and news documents
+- **Final Eval Loss**: 0.209
+- **Max Input Length**: 4,096 tokens
+- **Max Output Length**: 2,048 tokens
+## Usage
+### Python
+```python
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+import torch
+
+# Load the fine-tuned encoder-decoder model in bfloat16.
+model = AutoModelForSeq2SeqLM.from_pretrained(
+    "Corp-o-Rate-Community/statement-extractor",
+    torch_dtype=torch.bfloat16,
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(
+    "Corp-o-Rate-Community/statement-extractor",
+    trust_remote_code=True,
+)
+
+# Wrap the input text in <page> tags, as the model expects.
+text = "Apple Inc. announced a commitment to carbon neutrality by 2030."
+inputs = tokenizer(f"<page>{text}</page>", return_tensors="pt")
+
+# Generate with beam search, then decode the XML output to a string.
+outputs = model.generate(**inputs, max_new_tokens=2048, num_beams=4)
+result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(result)
+```
+### Input Format
+Wrap your text in `<page>` tags:
+```
+<page>Your text here...</page>
+```
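Documents longer than the 4,096-token input limit need to be split before wrapping. The following is a minimal sketch of one way to do that, not the author's method: `wrap_page` and `split_into_pages` are hypothetical helper names, and `max_words` is an assumed, conservative word budget standing in for a real token count. Whether the model was trained on multi-paragraph pages is not stated in the README, so treat the paragraph grouping as an assumption too.

```python
def wrap_page(text):
    """Wrap raw text in the <page> tags described above."""
    return f"<page>{text.strip()}</page>"


def split_into_pages(text, max_words=1500):
    """Greedily group paragraphs into <page> chunks of at most max_words words.

    max_words is an assumed proxy for the token limit, not a documented value.
    A single paragraph longer than max_words is kept whole, so very long
    paragraphs should still be checked against the tokenizer.
    """
    pages, current, count = [], [], 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        # Start a new page when adding this paragraph would exceed the budget.
        if current and count + n_words > max_words:
            pages.append(wrap_page("\n\n".join(current)))
            current, count = [], 0
        current.append(paragraph)
        count += n_words
    if current:
        pages.append(wrap_page("\n\n".join(current)))
    return pages


pages = split_into_pages("a b c\n\nd e\n\nf", max_words=4)
```

Each element of `pages` can then be passed to the tokenizer in place of the single f-string used in the Python example. For an exact check, compare `len(tokenizer(page)["input_ids"])` against the limit instead of counting words.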
+### Output Format
+The model outputs XML with extracted statements:
+```xml
+<statements>
+  <stmt>
+    <subject type="ORG">Apple Inc.</subject>
+    <object type="EVENT">carbon neutrality by 2030</object>
+    <predicate>committed to</predicate>
+    <text>Apple Inc. committed to achieving carbon neutrality by 2030.</text>
+  </stmt>
+</statements>
+```
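Downstream code will usually want these statements as plain records rather than raw XML. The following is a minimal sketch using only Python's standard `xml.etree.ElementTree`. The `parse_statements` helper and its dictionary keys are illustrative names, not part of the model or any published API, and the sketch assumes the decoded output is well-formed XML in the shape shown above.

```python
import xml.etree.ElementTree as ET


def _entity(element):
    """Return (text, type) for a <subject> or <object> element, or (None, None)."""
    if element is None:
        return None, None
    return (element.text or "").strip(), element.get("type")


def parse_statements(xml_text):
    """Parse model output XML into a list of dicts (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    statements = []
    for stmt in root.iter("stmt"):
        subject, subject_type = _entity(stmt.find("subject"))
        obj, object_type = _entity(stmt.find("object"))
        statements.append({
            "subject": subject,
            "subject_type": subject_type,
            "predicate": (stmt.findtext("predicate") or "").strip(),
            "object": obj,
            "object_type": object_type,
            "text": (stmt.findtext("text") or "").strip(),
        })
    return statements


example = """
<statements>
  <stmt>
    <subject type="ORG">Apple Inc.</subject>
    <object type="EVENT">carbon neutrality by 2030</object>
    <predicate>committed to</predicate>
    <text>Apple Inc. committed to achieving carbon neutrality by 2030.</text>
  </stmt>
</statements>
"""
records = parse_statements(example)
```

`ET.fromstring` raises `xml.etree.ElementTree.ParseError` on malformed input, so callers may want to catch that exception in case generation is cut off at the output length limit.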
+## Entity Types
+| Type | Description |
+|------|-------------|
+| ORG | Organizations (companies, agencies) |
+| PERSON | People (names, titles) |
+| GPE | Geopolitical entities (countries, cities) |
+| LOC | Locations (mountains, rivers) |
+| PRODUCT | Products (devices, services) |
+| EVENT | Events (announcements, meetings) |
+| WORK_OF_ART | Creative works (reports, books) |
+| LAW | Legal documents |
+| DATE | Dates and time periods |
+| MONEY | Monetary values |
+| PERCENT | Percentages |
+| QUANTITY | Quantities and measurements |
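When consuming extracted entities, it can help to check their labels against the table above. The sketch below is a defensive check under stated assumptions: the README does not say whether the model ever emits a label outside this table, and `validate_entities` and the `"CITY"` label in the sample data are made up for illustration.

```python
# The twelve entity types listed in the table above.
ENTITY_TYPES = frozenset({
    "ORG", "PERSON", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "DATE", "MONEY", "PERCENT", "QUANTITY",
})


def validate_entities(entities):
    """Split (text, type) pairs into those with known and unknown types."""
    known, unknown = [], []
    for text, entity_type in entities:
        target = known if entity_type in ENTITY_TYPES else unknown
        target.append((text, entity_type))
    return known, unknown


entities = [
    ("Apple Inc.", "ORG"),
    ("carbon neutrality by 2030", "EVENT"),
    ("Cupertino", "CITY"),  # hypothetical label that is not in the table
]
known, unknown = validate_entities(entities)
```

Entities that land in `unknown` can be logged or dropped, depending on how strict the consuming application needs to be.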
+## Demo
+Try the interactive demo at [statement-extractor.vercel.app](https://statement-extractor.vercel.app).
+## Training
+- Base model: `google/t5gemma-2-270m-270m`
+- Training examples: 77,515
+- Final eval loss: 0.209
+- Training included a refinement phase (learning rate 1e-6, 0.2 epochs)
+- Beam search decoding: `num_beams=4`
+## About corp-o-rate
+This model is part of [corp-o-rate.com](https://corp-o-rate.com), an AI-powered platform for ESG analysis and corporate accountability.
+## License
+MIT License