Commit 76f11d3 (verified), committed by jangoepfert
1 Parent(s): 28d68c5

Update README.md

Files changed (1)
  1. README.md +112 -3
README.md CHANGED
@@ -1,3 +1,112 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ base_model:
+ - nasa-impact/nasa-smd-ibm-distil-v0.1
+ pipeline_tag: token-classification
+ library_name: transformers
+ tags:
+ - quantity span identification
+ - quantity extraction
+ - quantity mention detection
+ - quantitative information extraction
+ - measurement extraction
+ - numeric
+ - number
+ - unit
+ ---
+
+
+ # Model Card for quinex-quantity-v0-30M
+
+ `quinex-quantity-v0-30M` is based on the [NASA/IBM INDUS-Small model](https://huggingface.co/nasa-impact/nasa-smd-ibm-distil-v0.1) (also known as `nasa-smd-ibm-distil-v0.1`), which is a distilled version of [INDUS](https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1), an encoder-only transformer model based on RoBERTa that was pre-trained on scientific literature and Wikipedia. We further fine-tuned this model to identify quantities (i.e., a number and, if applicable, a unit) in text. For more details, please refer to our paper *"Quinex: Quantitative Information Extraction from Text using Open and Lightweight LLMs"* (to be published soon).
+
+
+ ## Uses
+
+ This model is intended for detecting quantity mentions in text using sequence labeling. Please note that quantity modifiers (e.g., 'approximately', 'about', 'more than', 'not') are not considered part of quantity spans in this work.
+
+
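A minimal sketch of how the model could be called for this purpose with the `transformers` token-classification pipeline. The Hub ID `JuelichSystemsAnalysis/quinex-quantity-v0-30M` is an assumption (inferred from the organization name in the link to the larger variant, not confirmed here), and the helper name `find_quantities` is purely illustrative.

```python
# Assumption: this Hub ID is a guess and may need to be adjusted.
MODEL_ID = "JuelichSystemsAnalysis/quinex-quantity-v0-30M"


def find_quantities(text, model_id=MODEL_ID):
    """Return (surface form, start, end) for each predicted quantity span."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge B-/I- tagged tokens into one span
    )
    # Each prediction carries character offsets into the input text.
    return [(text[p["start"]:p["end"]], p["start"], p["end"]) for p in tagger(text)]
```

Calling `find_quantities("The plant has a capacity of approximately 2,500 megawatts.")` would download the weights on first use. Since modifiers are excluded from quantity spans, 'approximately' is expected to fall outside the returned span.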
+ ## Example output
+
+ | Token          | NER tag    |
+ |----------------|------------|
+ | The            | O          |
+ | hydroelectric  | O          |
+ | complex        | O          |
+ | has            | O          |
+ | a              | O          |
+ | capacity       | O          |
+ | of             | O          |
+ | approximately  | O          |
+ | 2,500          | B-Quantity |
+ | megawatts      | I-Quantity |
+ | and            | O          |
+ | produces       | O          |
+ | about          | O          |
+ | 4.9            | B-Quantity |
+ | terawatt       | I-Quantity |
+ | -              | I-Quantity |
+ | hours          | I-Quantity |
+ | yearly         | O          |
+ | (              | O          |
+ | see            | O          |
+ | Figure         | O          |
+ | 2              | O          |
+ | )              | O          |
+ | .              | O          |
+
+
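The tags in the table follow the BIO scheme: `B-Quantity` opens a span, `I-Quantity` continues it, and `O` marks tokens outside any span. The following sketch shows one way such tags could be grouped into quantity spans. The function name `bio_to_spans` is hypothetical and not part of the model or its tooling.

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B-Quantity / I-Quantity into lists of tokens."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-Quantity":
            # A new span starts; close the previous one if it is still open.
            if current:
                spans.append(current)
            current = [token]
        elif tag == "I-Quantity" and current:
            current.append(token)
        else:
            # "O" (or a stray I- tag without a preceding B-) closes any open span.
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans


# Excerpt of the tokens and tags from the example table.
tokens = ["approximately", "2,500", "megawatts", "and", "produces", "about",
          "4.9", "terawatt", "-", "hours", "yearly"]
tags = ["O", "B-Quantity", "I-Quantity", "O", "O", "O",
        "B-Quantity", "I-Quantity", "I-Quantity", "I-Quantity", "O"]

spans = bio_to_spans(tokens, tags)
print(spans)  # [['2,500', 'megawatts'], ['4.9', 'terawatt', '-', 'hours']]
```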
+ ## Model details
+
+ - **Base Model**: [INDUS-Small](https://huggingface.co/nasa-impact/nasa-smd-ibm-distil-v0.1)
+ - **Tokenizer**: INDUS-Small
+ - **Parameters**: 30M
+
+
+ ## Fine-tuning data
+
+ The model was first fine-tuned on non-curated examples from a filtered variant of [Wiki-Quantities](https://doi.org/10.5281/zenodo.15462002) and subsequently on a [combination of datasets for quantity span identification](https://github.com/FZJ-IEK3-VSA/quinex-datasets), including:
+ * Wiki-Quantities (small variant, curated examples only)
+ * SOFC-Exp (relabeled)
+ * Grobid-quantities (relabeled)
+ * MeasEval (relabeled)
+ * Custom quinex data
+
+
+ ## Evaluation results
+
+ Evaluation results on the test set as described in the paper:
+
+ | F1    | Precision | Recall | Accuracy |
+ |-------|-----------|--------|----------|
+ | 93.53 | 93.00     | 94.06  | 99.02    |
+
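As a quick consistency check, the F1 score in the table can be recomputed as the harmonic mean of the listed precision and recall. This is only a sketch based on the rounded table values, not the evaluation code used for the paper.

```python
precision = 93.00
recall = 94.06

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 93.53
```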
+ Note that these are the scores of this specific checkpoint, which differ slightly from the scores averaged over multiple seeds reported in the paper.
+
+ Also note that these scores do not account for alternative correct answers (e.g., '1.2 kW and 1.4 kW' could be labeled as a list or individually) or debatable cases (e.g., whether 'bi-layer' or 'quartet' should be considered a quantity). Counting such cases as correct would yield higher scores.
+
+ For better performance, refer to the larger [quinex-quantity-v0-124M](https://huggingface.co/JuelichSystemsAnalysis/quinex-quantity-v0-124M) variant.
+
+
+ ## Citation
+
+ If you use this model in your research, please cite the following paper:
+
+ ```bibtex
+ @article{quinex2025,
+ title = {{Quinex: Quantitative Information Extraction from Text using Open and Lightweight LLMs}},
+ author = {Göpfert, Jan and Kuckertz, Patrick and Müller, Gian and Lütz, Luna and Körner, Celine and Khuat, Hang and Stolten, Detlef and Weinand, Jann M.},
+ month = oct,
+ year = {2025},
+ }
+ ```
+
+
+ ### Framework versions
+
+ - Transformers 4.36.2
+ - PyTorch 2.1.2
+ - Datasets 2.16.1
+ - Tokenizers 0.15.0