
	---
	license: mit
	language:
	- en
	tags:
	- statement-extraction
	- named-entity-recognition
	- t5
	- gemma
	- seq2seq
	- nlp
	- information-extraction
	- corp-o-rate
	pipeline_tag: text2text-generation
	---

	# Statement Extractor (T5-Gemma 2)

	A fine-tuned T5-Gemma 2 model for extracting structured statements from text. Part of [corp-o-rate.com](https://corp-o-rate.com).

	## Model Description

	This model extracts subject-predicate-object triples from unstructured text, with automatic entity type recognition and coreference resolution.

	- Architecture: T5-Gemma 2 (270M-270M, 540M total parameters)
	- Training Data: 77,515 examples from corporate and news documents
	- Final Eval Loss: 0.209
	- Max Input Length: 4,096 tokens
	- Max Output Length: 2,048 tokens

	## Usage

	### Python

	```python
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
	import torch

	# Load the fine-tuned model in bfloat16 to reduce memory use.
	model = AutoModelForSeq2SeqLM.from_pretrained(
	    "Corp-o-Rate-Community/statement-extractor",
	    torch_dtype=torch.bfloat16,
	    trust_remote_code=True,
	)
	tokenizer = AutoTokenizer.from_pretrained(
	    "Corp-o-Rate-Community/statement-extractor",
	    trust_remote_code=True,
	)

	# Wrap the input text in <page> tags, as the model expects.
	text = "Apple Inc. announced a commitment to carbon neutrality by 2030."
	inputs = tokenizer(f"<page>{text}</page>", return_tensors="pt")

	# Generate with beam search, up to the maximum output length.
	outputs = model.generate(**inputs, max_new_tokens=2048, num_beams=4)

	# Decode the generated tokens into the XML statement string.
	result = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(result)
	```

	### Input Format

	Wrap your text in `<page>` tags:

	```
	<page>Your text here...</page>
	```
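For documents longer than the 4,096-token input limit, one option is to split the text into several `<page>` inputs and run the model on each. The sketch below is a minimal illustration and not part of this repository: `wrap_pages` and its `count_tokens` parameter are hypothetical names, and the default whitespace word count is only a rough stand-in for the real tokenizer.

```python
from typing import Callable, List


def wrap_pages(
    text: str,
    max_tokens: int = 4096,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Split text on blank lines and wrap each chunk in <page> tags.

    Hypothetical helper: count_tokens defaults to a whitespace word count,
    which only approximates the model's tokenizer. A single paragraph that
    exceeds max_tokens is kept whole rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pages: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this paragraph would exceed the budget: close the page.
            pages.append("<page>" + "\n\n".join(current) + "</page>")
            current = [para]
        else:
            current.append(para)
    if current:
        pages.append("<page>" + "\n\n".join(current) + "</page>")
    return pages
```

For accurate limits, pass a counter based on the model's tokenizer, for example `lambda s: len(tokenizer(s).input_ids)`.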

	### Output Format

	The model outputs XML with extracted statements:

	```xml
	<statements>
	  <stmt>
	    <subject type="ORG">Apple Inc.</subject>
	    <object type="EVENT">carbon neutrality by 2030</object>
	    <predicate>committed to</predicate>
	    <text>Apple Inc. committed to achieving carbon neutrality by 2030.</text>
	  </stmt>
	</statements>
	```
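Downstream code will usually want the statements as plain Python data rather than as an XML string. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`, assuming the output is well-formed and follows the structure shown above. `parse_statements` is a hypothetical helper, not part of this repository, and it returns an empty list when the output cannot be parsed (for example, if generation was truncated).

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def parse_statements(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Convert the model's XML output into a list of statement dicts.

    Hypothetical helper: assumes the <statements>/<stmt> layout shown in
    the model card. Missing elements are reported as None.
    """
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError:
        # Malformed or truncated output: nothing can be recovered safely.
        return []

    results: List[Dict[str, Optional[str]]] = []
    for stmt in root.iter("stmt"):
        subject = stmt.find("subject")
        obj = stmt.find("object")
        results.append({
            "subject": subject.text if subject is not None else None,
            "subject_type": subject.get("type") if subject is not None else None,
            "predicate": stmt.findtext("predicate"),
            "object": obj.text if obj is not None else None,
            "object_type": obj.get("type") if obj is not None else None,
            "text": stmt.findtext("text"),
        })
    return results
```

Calling `parse_statements(result)` on the decoded output then yields one dictionary per `<stmt>` element.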

	## Entity Types

	| Type | Description |
	|------|-------------|
	| ORG | Organizations (companies, agencies) |
	| PERSON | People (names, titles) |
	| GPE | Geopolitical entities (countries, cities) |
	| LOC | Locations (mountains, rivers) |
	| PRODUCT | Products (devices, services) |
	| EVENT | Events (announcements, meetings) |
	| WORK_OF_ART | Creative works (reports, books) |
	| LAW | Legal documents |
	| DATE | Dates and time periods |
	| MONEY | Monetary values |
	| PERCENT | Percentages |
	| QUANTITY | Quantities and measurements |
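
When consuming the `type` attributes, it can help to check them against the labels in the table above. The sketch below is illustrative only: `ENTITY_TYPES` and `group_by_type` are hypothetical names, and whether the model ever emits a label outside this list is not documented here, so unrecognized labels are collected under `"UNKNOWN"` rather than discarded.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# The twelve entity labels listed in the model card.
ENTITY_TYPES = frozenset({
    "ORG", "PERSON", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "DATE", "MONEY", "PERCENT", "QUANTITY",
})


def group_by_type(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (entity text, type label) pairs by label.

    Hypothetical helper: labels not listed in ENTITY_TYPES are grouped
    under "UNKNOWN" so unexpected output stays visible.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in entities:
        key = label if label in ENTITY_TYPES else "UNKNOWN"
        grouped[key].append(text)
    return dict(grouped)
```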

	## Demo

	Try the interactive demo at [statement-extractor.vercel.app](https://statement-extractor.vercel.app).

	## Training

	- Base model: `google/t5gemma-2-270m-270m`
	- Training examples: 77,515
	- Final eval loss: 0.209
	- Trained with a refinement phase (learning rate 1e-6, 0.2 epochs)
	- Beam search for generation: `num_beams=4`

	## About corp-o-rate

	This model is part of [corp-o-rate.com](https://corp-o-rate.com), an AI-powered platform for ESG analysis and corporate accountability.

	## License

	MIT License