---
license: mit
language:
- en
tags:
- statement-extraction
- named-entity-recognition
- t5
- gemma
- seq2seq
- nlp
- information-extraction
- corp-o-rate
pipeline_tag: text2text-generation
---
# Statement Extractor (T5-Gemma 2)
A fine-tuned T5-Gemma 2 model for extracting structured statements from text. Part of [corp-o-rate.com](https://corp-o-rate.com).
## Model Description
This model extracts subject-predicate-object triples from unstructured text, with automatic entity type recognition and coreference resolution.
- **Architecture**: T5-Gemma 2 (270M-270M, 540M total parameters)
- **Training Data**: 77,515 examples from corporate and news documents
- **Final Eval Loss**: 0.209
- **Max Input Length**: 4,096 tokens
- **Max Output Length**: 2,048 tokens
## Usage
### Python
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Load the fine-tuned model in bfloat16 to reduce memory use.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "Corp-o-Rate-Community/statement-extractor",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Corp-o-Rate-Community/statement-extractor",
    trust_remote_code=True,
)

text = "Apple Inc. announced a commitment to carbon neutrality by 2030."

# The model expects the input text wrapped in <page> tags.
inputs = tokenizer(f"<page>{text}</page>", return_tensors="pt")

# Generate with beam search; 2048 matches the model's max output length.
outputs = model.generate(**inputs, max_new_tokens=2048, num_beams=4)

# Decode the first (best) sequence into the XML result string.
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
### Input Format
Wrap your text in `<page>` tags:
```
<page>Your text here...</page>
```
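Documents longer than the 4,096-token input limit need to be split before they are wrapped. The model card does not describe a splitting strategy, so the following is only a minimal sketch under one assumption: that breaking on blank-line paragraph boundaries is acceptable for your text. The helper names `wrap_page` and `chunk_paragraphs` and the `count_tokens` callback are hypothetical, not part of the model's API.

```python
from typing import Callable, List

MAX_INPUT_TOKENS = 4096  # input limit stated in the model description


def wrap_page(text: str) -> str:
    """Wrap raw text in the <page> tags the model expects."""
    return f"<page>{text}</page>"


def chunk_paragraphs(
    text: str,
    count_tokens: Callable[[str], int],
    limit: int = MAX_INPUT_TOKENS,
) -> List[str]:
    """Greedily pack paragraphs into <page>-wrapped chunks of at most `limit` tokens.

    Assumption: paragraphs are separated by blank lines. A single paragraph
    that exceeds `limit` on its own is emitted as one oversized chunk and is
    NOT split further.
    """
    chunks: List[str] = []
    current: List[str] = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = "\n\n".join(current + [para])
        # Flush the current chunk if adding this paragraph would exceed the limit.
        if current and count_tokens(wrap_page(candidate)) > limit:
            chunks.append(wrap_page("\n\n".join(current)))
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append(wrap_page("\n\n".join(current)))
    return chunks
```

With the tokenizer from the Python example, a suitable counter would be `lambda s: len(tokenizer(s)["input_ids"])`. Each returned chunk can then be passed to `tokenizer(..., return_tensors="pt")` in place of the hand-wrapped string.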
### Output Format
The model outputs XML with extracted statements:
```xml
<statements>
<stmt>
<subject type="ORG">Apple Inc.</subject>
<object type="EVENT">carbon neutrality by 2030</object>
<predicate>committed to</predicate>
<text>Apple Inc. committed to achieving carbon neutrality by 2030.</text>
</stmt>
</statements>
```
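For downstream use, the XML can be turned into plain Python objects. The sketch below uses the standard library's `xml.etree.ElementTree` and assumes the output is well-formed and follows the element names shown above; the `Statement` dataclass and `parse_statements` function are illustrative names, not something shipped with the model.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Statement:
    subject: str
    subject_type: Optional[str]
    predicate: str
    obj: str
    obj_type: Optional[str]
    text: str


def parse_statements(xml_output: str) -> List[Statement]:
    """Parse the model's XML output into a list of Statement records.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed
    (for example, if generation was cut off mid-element).
    """
    root = ET.fromstring(xml_output)
    statements: List[Statement] = []
    for stmt in root.iter("stmt"):
        subj = stmt.find("subject")
        obj = stmt.find("object")
        # Skip statements missing either end of the triple.
        if subj is None or obj is None:
            continue
        statements.append(
            Statement(
                subject=(subj.text or "").strip(),
                subject_type=subj.get("type"),
                predicate=(stmt.findtext("predicate") or "").strip(),
                obj=(obj.text or "").strip(),
                obj_type=obj.get("type"),
                text=(stmt.findtext("text") or "").strip(),
            )
        )
    return statements
```

Because generated text is not guaranteed to be valid XML, callers may want to wrap `parse_statements(result)` in a `try`/`except ET.ParseError` block and decide how to handle failures.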
## Entity Types
| Type | Description |
|------|-------------|
| ORG | Organizations (companies, agencies) |
| PERSON | People (names, titles) |
| GPE | Geopolitical entities (countries, cities) |
| LOC | Locations (mountains, rivers) |
| PRODUCT | Products (devices, services) |
| EVENT | Events (announcements, meetings) |
| WORK_OF_ART | Creative works (reports, books) |
| LAW | Legal documents |
| DATE | Dates and time periods |
| MONEY | Monetary values |
| PERCENT | Percentages |
| QUANTITY | Quantities and measurements |
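
When post-processing output, it can help to tally the `type` attributes and flag any label that is not in the table above. The following is a minimal sketch that assumes well-formed XML in the documented output format; `ENTITY_TYPES`, `count_entity_types`, and `unexpected_types` are hypothetical names, and whether the model ever emits labels outside this table is not stated here.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List

# Entity types listed in the table above.
ENTITY_TYPES = {
    "ORG": "Organizations (companies, agencies)",
    "PERSON": "People (names, titles)",
    "GPE": "Geopolitical entities (countries, cities)",
    "LOC": "Locations (mountains, rivers)",
    "PRODUCT": "Products (devices, services)",
    "EVENT": "Events (announcements, meetings)",
    "WORK_OF_ART": "Creative works (reports, books)",
    "LAW": "Legal documents",
    "DATE": "Dates and time periods",
    "MONEY": "Monetary values",
    "PERCENT": "Percentages",
    "QUANTITY": "Quantities and measurements",
}


def count_entity_types(xml_output: str) -> Counter:
    """Count the `type` attribute across all <subject> and <object> elements."""
    root = ET.fromstring(xml_output)
    counts: Counter = Counter()
    for tag in ("subject", "object"):
        for element in root.iter(tag):
            # Elements without a type attribute are counted as "UNKNOWN".
            counts[element.get("type", "UNKNOWN")] += 1
    return counts


def unexpected_types(counts: Counter) -> List[str]:
    """Return the sorted labels in `counts` that are not in ENTITY_TYPES."""
    return sorted(label for label in counts if label not in ENTITY_TYPES)
```

For example, `unexpected_types(count_entity_types(result))` returns an empty list when every subject and object carries one of the twelve documented labels.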
## Demo
Try the interactive demo at [statement-extractor.vercel.app](https://statement-extractor.vercel.app).
## Training
- Base model: `google/t5gemma-2-270m-270m`
- Training examples: 77,515
- Final eval loss: 0.209
- Training included a refinement phase (learning rate 1e-6, 0.2 epochs)
- Beam search for generation: `num_beams=4`
## About corp-o-rate
This model is part of [corp-o-rate.com](https://corp-o-rate.com), an AI-powered platform for ESG analysis and corporate accountability.
## License
MIT License