---
license: mit
language:
- en
tags:
- statement-extraction
- named-entity-recognition
- t5
- gemma
- seq2seq
- nlp
- information-extraction
- corp-o-rate
pipeline_tag: text2text-generation
---

# Statement Extractor (T5-Gemma 2)

A fine-tuned T5-Gemma 2 model for extracting structured statements from text. Part of [corp-o-rate.com](https://corp-o-rate.com).

## Model Description

This model extracts subject-predicate-object triples from unstructured text, with automatic entity type recognition and coreference resolution.

- **Architecture**: T5-Gemma 2 (270M-270M, 540M total parameters)
- **Training Data**: 77,515 examples from corporate and news documents
- **Final Eval Loss**: 0.209
- **Max Input Length**: 4,096 tokens
- **Max Output Length**: 2,048 tokens

## Usage

### Python

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForSeq2SeqLM.from_pretrained(
    "Corp-o-Rate-Community/statement-extractor",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Corp-o-Rate-Community/statement-extractor",
    trust_remote_code=True,
)

# Wrap the input text in <page> tags before tokenizing
text = "Apple Inc. announced a commitment to carbon neutrality by 2030."
inputs = tokenizer(f"<page>{text}</page>", return_tensors="pt")

# Generate with beam search, then decode the XML output
outputs = model.generate(**inputs, max_new_tokens=2048, num_beams=4)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

### Input Format

Wrap your text in `<page>` tags:

```
<page>Your text here...</page>
```
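
A small helper can keep the wrapping consistent across calls. This is a minimal sketch: `wrap_page` is a hypothetical name, and stripping surrounding whitespace is an assumption rather than a documented requirement of the model.

```python
def wrap_page(text: str) -> str:
    """Wrap raw text in the <page> tags the model expects as input."""
    # Assumption: trimming leading/trailing whitespace is harmless for this model.
    return f"<page>{text.strip()}</page>"


print(wrap_page("  Apple Inc. announced a commitment to carbon neutrality by 2030.\n"))
```

The returned string can be passed to the tokenizer in place of the inline f-string used in the Python example.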

### Output Format

The model outputs XML with extracted statements:

```xml
<statements>
  <stmt>
    <subject type="ORG">Apple Inc.</subject>
    <object type="EVENT">carbon neutrality by 2030</object>
    <predicate>committed to</predicate>
    <text>Apple Inc. committed to achieving carbon neutrality by 2030.</text>
  </stmt>
</statements>
```
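
To work with the triples programmatically, the XML can be parsed with Python's standard library. The sketch below assumes the output matches the structure shown above; `parse_statements` and the dictionary keys are hypothetical names, and returning an empty list on malformed XML is one possible design choice, not documented behavior.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def _text(parent: ET.Element, tag: str) -> str:
    """Return the stripped text of a child element, or '' if it is absent."""
    child = parent.find(tag)
    return (child.text or "").strip() if child is not None else ""


def _type(parent: ET.Element, tag: str) -> Optional[str]:
    """Return the `type` attribute of a child element, or None if it is absent."""
    child = parent.find(tag)
    return child.get("type") if child is not None else None


def parse_statements(xml_output: str) -> List[Dict[str, Optional[str]]]:
    """Parse the model's XML output into a list of plain dictionaries."""
    try:
        root = ET.fromstring(xml_output)
    except ET.ParseError:
        # Assumption: generated XML may occasionally be malformed; give up quietly.
        return []
    return [
        {
            "subject": _text(stmt, "subject"),
            "subject_type": _type(stmt, "subject"),
            "predicate": _text(stmt, "predicate"),
            "object": _text(stmt, "object"),
            "object_type": _type(stmt, "object"),
            "text": _text(stmt, "text"),
        }
        for stmt in root.iter("stmt")
    ]


example = (
    "<statements><stmt>"
    '<subject type="ORG">Apple Inc.</subject>'
    '<object type="EVENT">carbon neutrality by 2030</object>'
    "<predicate>committed to</predicate>"
    "<text>Apple Inc. committed to achieving carbon neutrality by 2030.</text>"
    "</stmt></statements>"
)
for s in parse_statements(example):
    print(s["subject"], "|", s["predicate"], "|", s["object"])
```

Plain dictionaries keep the result easy to serialize to JSON; a dataclass would work equally well if stricter typing is preferred.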

## Entity Types

| Type | Description |
|------|-------------|
| ORG | Organizations (companies, agencies) |
| PERSON | People (names, titles) |
| GPE | Geopolitical entities (countries, cities) |
| LOC | Locations (mountains, rivers) |
| PRODUCT | Products (devices, services) |
| EVENT | Events (announcements, meetings) |
| WORK_OF_ART | Creative works (reports, books) |
| LAW | Legal documents |
| DATE | Dates and time periods |
| MONEY | Monetary values |
| PERCENT | Percentages |
| QUANTITY | Quantities and measurements |
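
Downstream code may want to check `type` attributes against the table above. The following is a minimal sketch: `ENTITY_TYPES` and `normalize_entity_type` are hypothetical names, and the idea that the model might emit a type in a different case or outside this list is an assumption, not something the card states.

```python
from typing import Optional

# The twelve entity types listed in the table above.
ENTITY_TYPES = frozenset({
    "ORG", "PERSON", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "DATE", "MONEY", "PERCENT", "QUANTITY",
})


def normalize_entity_type(raw: Optional[str]) -> Optional[str]:
    """Return the upper-cased type if it appears in the table, else None."""
    if raw is None:
        return None
    candidate = raw.strip().upper()
    return candidate if candidate in ENTITY_TYPES else None


print(normalize_entity_type("org"))
print(normalize_entity_type("UNKNOWN"))
```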

## Demo

Try the interactive demo at [statement-extractor.vercel.app](https://statement-extractor.vercel.app).

## Training

- Base model: `google/t5gemma-2-270m-270m`
- Training examples: 77,515
- Final eval loss: 0.209
- Training included a refinement phase (LR=1e-6, epochs=0.2)
- Beam search: num_beams=4

## About corp-o-rate

This model is part of [corp-o-rate.com](https://corp-o-rate.com), an AI-powered platform for ESG analysis and corporate accountability.

## License

MIT License