---
license: mit
language:
- en
tags:
- statement-extraction
- named-entity-recognition
- t5
- gemma
- seq2seq
- nlp
- information-extraction
- corp-o-rate
pipeline_tag: text2text-generation
---

# Statement Extractor (T5-Gemma 2)

A fine-tuned T5-Gemma 2 model for extracting structured statements from text. Part of [corp-o-rate.com](https://corp-o-rate.com).

## Model Description

This model extracts subject-predicate-object triples from unstructured text, with automatic entity type recognition and coreference resolution.

- **Architecture**: T5-Gemma 2 (270M-270M, 540M total parameters)
- **Training Data**: 77,515 examples from corporate and news documents
- **Final Eval Loss**: 0.209
- **Max Input Length**: 4,096 tokens
- **Max Output Length**: 2,048 tokens

## Usage

### Python

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model = AutoModelForSeq2SeqLM.from_pretrained(
    "Corp-o-Rate-Community/statement-extractor",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Corp-o-Rate-Community/statement-extractor",
    trust_remote_code=True,
)

text = "Apple Inc. announced a commitment to carbon neutrality by 2030."
inputs = tokenizer(f"{text}", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=2048, num_beams=4)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

### Input Format

Wrap your text in `` tags:

```
Your text here...
```

### Output Format

The model outputs XML with extracted statements:

```xml
Apple Inc. carbon neutrality by 2030 committed to Apple Inc. committed to achieving carbon neutrality by 2030.
```

## Entity Types

| Type | Description |
|------|-------------|
| ORG | Organizations (companies, agencies) |
| PERSON | People (names, titles) |
| GPE | Geopolitical entities (countries, cities) |
| LOC | Locations (mountains, rivers) |
| PRODUCT | Products (devices, services) |
| EVENT | Events (announcements, meetings) |
| WORK_OF_ART | Creative works (reports, books) |
| LAW | Legal documents |
| DATE | Dates and time periods |
| MONEY | Monetary values |
| PERCENT | Percentages |
| QUANTITY | Quantities and measurements |

## Demo

Try the interactive demo at [statement-extractor.vercel.app](https://statement-extractor.vercel.app).

## Training

- Base model: `google/t5gemma-2-270m-270m`
- Training examples: 77,515
- Final eval loss: 0.209
- Training with refinement phase (LR=1e-6, epochs=0.2)
- Beam search: num_beams=4

## About corp-o-rate

This model is part of [corp-o-rate.com](https://corp-o-rate.com), an AI-powered platform for ESG analysis and corporate accountability.

## License

MIT License
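
## Handling Long Inputs

The model card lists a maximum input length of 4,096 tokens, so longer documents probably need to be split before extraction. The sketch below is one possible approach, not part of the released code: it greedily packs pre-split sentences into chunks under a token budget. The function name `chunk_sentences` and the `count_tokens` callback are hypothetical, and the demo counts whitespace-separated words only so that it runs without downloading the model.

```python
from typing import Callable, List


def chunk_sentences(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 4096,
) -> List[str]:
    """Greedily pack sentences into chunks that stay within max_tokens.

    Hypothetical helper. Token counts are summed per sentence, which only
    approximates the count of the joined chunk, and a single sentence that
    exceeds the budget is emitted as its own over-budget chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush the current chunk when this sentence would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Demo with a stand-in counter (whitespace words, NOT the model's tokenizer).
demo = chunk_sentences(
    ["a b c", "d e", "f g h i"],
    count_tokens=lambda s: len(s.split()),
    max_tokens=5,
)
print(demo)  # ['a b c d e', 'f g h i']
```

In practice the counter would more likely be built on the model's tokenizer, for example `lambda s: len(tokenizer(s, add_special_tokens=False)["input_ids"])`, with `max_tokens` set somewhat below 4,096 to leave headroom for the wrapper tags and special tokens. Each chunk can then be passed through the generation code in the Usage section.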