neileko committed on
Commit e84207d · verified · 1 Parent(s): 9738283

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +114 -0
README.md ADDED
@@ -0,0 +1,114 @@
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - statement-extraction
+ - named-entity-recognition
+ - t5
+ - gemma
+ - seq2seq
+ - nlp
+ - information-extraction
+ - corp-o-rate
+ pipeline_tag: text2text-generation
+ ---
+
+ # Statement Extractor (T5-Gemma 2)
+
+ A fine-tuned T5-Gemma 2 model for extracting structured statements from text. Part of [corp-o-rate.com](https://corp-o-rate.com).
+
+ ## Model Description
+
+ This model extracts subject-predicate-object triples from unstructured text, with automatic entity type recognition and coreference resolution.
+
+ - **Architecture**: T5-Gemma 2 (270M-270M, 540M total parameters)
+ - **Training Data**: 77,515 examples from corporate and news documents
+ - **Final Eval Loss**: 0.209
+ - **Max Input Length**: 4,096 tokens
+ - **Max Output Length**: 2,048 tokens
+
+ ## Usage
+
+ ### Python
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ import torch
+
+ # Load the fine-tuned model and its tokenizer from the Hub.
+ model = AutoModelForSeq2SeqLM.from_pretrained(
+     "Corp-o-Rate-Community/statement-extractor",
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(
+     "Corp-o-Rate-Community/statement-extractor",
+     trust_remote_code=True,
+ )
+
+ # Wrap the input in <page> tags (see "Input Format").
+ text = "Apple Inc. announced a commitment to carbon neutrality by 2030."
+ inputs = tokenizer(f"<page>{text}</page>", return_tensors="pt")
+ # Generate with beam search, then decode the XML output.
+ outputs = model.generate(**inputs, max_new_tokens=2048, num_beams=4)
+ result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(result)
+ ```
+
+ ### Input Format
+
+ Wrap your text in `<page>` tags:
+
+ ```
+ <page>Your text here...</page>
+ ```
+
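For longer documents, the wrapping step could be sketched as below. This is only an illustration: `to_pages` and its `max_chars` budget are hypothetical names, and the character budget is a rough stand-in for the 4,096-token input limit, not something the README specifies. Counting tokens with the model's tokenizer would be more accurate.

```python
from typing import List


def to_pages(text: str, max_chars: int = 8000) -> List[str]:
    """Split text on blank lines and pack paragraphs into <page>-wrapped chunks.

    max_chars is an assumed proxy for the model's token limit.
    A single paragraph longer than max_chars is kept whole in its own page.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pages: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the page.
            pages.append(f"<page>{current}</page>")
            current = para
        else:
            current = candidate
    if current:
        pages.append(f"<page>{current}</page>")
    return pages
```

Each returned string can then be passed to the tokenizer in place of the single `f"<page>{text}</page>"` input shown in the usage example.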
+ ### Output Format
+
+ The model outputs XML with extracted statements:
+
+ ```xml
+ <statements>
+   <stmt>
+     <subject type="ORG">Apple Inc.</subject>
+     <object type="EVENT">carbon neutrality by 2030</object>
+     <predicate>committed to</predicate>
+     <text>Apple Inc. committed to achieving carbon neutrality by 2030.</text>
+   </stmt>
+ </statements>
+ ```
+
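One way to turn this XML into Python objects is sketched below with the standard library's `xml.etree.ElementTree`. The `Statement` dataclass and `parse_statements` helper are hypothetical names for illustration, and the sketch assumes the output is well-formed XML in the shape of the example above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Statement:
    subject: str
    subject_type: Optional[str]
    predicate: str
    object_: str
    object_type: Optional[str]
    text: str


def parse_statements(xml_output: str) -> List[Statement]:
    """Parse a <statements> document into Statement records.

    Raises xml.etree.ElementTree.ParseError if the XML is malformed,
    for example when generation was cut off mid-tag.
    """
    root = ET.fromstring(xml_output)
    statements: List[Statement] = []
    for stmt in root.iter("stmt"):
        subject = stmt.find("subject")
        obj = stmt.find("object")
        if subject is None or obj is None:
            continue  # skip statements missing a subject or object
        statements.append(
            Statement(
                subject=(subject.text or "").strip(),
                subject_type=subject.get("type"),
                predicate=(stmt.findtext("predicate") or "").strip(),
                object_=(obj.text or "").strip(),
                object_type=obj.get("type"),
                text=(stmt.findtext("text") or "").strip(),
            )
        )
    return statements
```

Calling `parse_statements(result)` on the decoded string from the usage example would yield one `Statement` per `<stmt>` element. Wrapping the call in a `try`/`except ET.ParseError` is advisable, since generated XML is not guaranteed to be well-formed.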
+ ## Entity Types
+
+ | Type | Description |
+ |------|-------------|
+ | ORG | Organizations (companies, agencies) |
+ | PERSON | People (names, titles) |
+ | GPE | Geopolitical entities (countries, cities) |
+ | LOC | Locations (mountains, rivers) |
+ | PRODUCT | Products (devices, services) |
+ | EVENT | Events (announcements, meetings) |
+ | WORK_OF_ART | Creative works (reports, books) |
+ | LAW | Legal documents |
+ | DATE | Dates and time periods |
+ | MONEY | Monetary values |
+ | PERCENT | Percentages |
+ | QUANTITY | Quantities and measurements |
+
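The table can be mirrored in code to check the `type` attributes the model emits. The sketch below assumes the twelve labels listed above are the full label set, which the README does not state explicitly. `ENTITY_TYPES` and `group_by_type` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Labels and descriptions copied from the Entity Types table.
ENTITY_TYPES: Dict[str, str] = {
    "ORG": "Organizations (companies, agencies)",
    "PERSON": "People (names, titles)",
    "GPE": "Geopolitical entities (countries, cities)",
    "LOC": "Locations (mountains, rivers)",
    "PRODUCT": "Products (devices, services)",
    "EVENT": "Events (announcements, meetings)",
    "WORK_OF_ART": "Creative works (reports, books)",
    "LAW": "Legal documents",
    "DATE": "Dates and time periods",
    "MONEY": "Monetary values",
    "PERCENT": "Percentages",
    "QUANTITY": "Quantities and measurements",
}


def group_by_type(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (label, value) pairs by label; unlisted labels go under 'UNKNOWN'."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for label, value in entities:
        key = label if label in ENTITY_TYPES else "UNKNOWN"
        grouped[key].append(value)
    return dict(grouped)
```

Routing unlisted labels to an `"UNKNOWN"` bucket rather than raising an error keeps downstream code working if the model produces a type outside the table.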
+ ## Demo
+
+ Try the interactive demo at [statement-extractor.vercel.app](https://statement-extractor.vercel.app).
+
+ ## Training
+
+ - Base model: `google/t5gemma-2-270m-270m`
+ - Training examples: 77,515
+ - Final eval loss: 0.209
+ - Training with refinement phase (LR=1e-6, epochs=0.2)
+ - Beam search: num_beams=4
+
+ ## About corp-o-rate
+
+ This model is part of [corp-o-rate.com](https://corp-o-rate.com), an AI-powered platform for ESG analysis and corporate accountability.
+
+ ## License
+
+ MIT License