c-ho committed on
Commit 1d37b85 · verified · Parent: 84b6fce

Update README.md

Files changed (1): README.md (+101 −3)

README.md CHANGED
@@ -4,6 +4,7 @@ license: mit
  base_model: FacebookAI/xlm-roberta-base
  tags:
  - generated_from_trainer
+ language: en
  metrics:
  - accuracy
  - precision
@@ -12,12 +13,20 @@ metrics:
  model-index:
  - name: academic_main_text_classifier_en
  results: []
+ widget:
+ - text: "In the case of (ioii) and (1 lii), the passive transformation will apply to the embedded sentence, and in all four cases other operations will give the final surface forms of (8) and (g)."
+ - text: "(10) (i) Noun Phrase — Verb — Noun Phrase — Sentence (/ — persuaded — a specialist — a specialist will examine John) (ii) Noun Phrase — Verb — Noun Phrase — Sentence (/ — persuaded — John — a specialist will examine John)"
+ - text: "184 SOME RESIDUAL PROBLEMS"
+ - text: "Peshkovskii, A. M. (1956). Russkii Sintaksis v Nauchnom Osveshchenii. Moscow."
+ - text: "S -» NP^Aux^VP"
+ - text: "(sincerity, [+N, —Count, +Abstract]) (boy, [+N, —Count, +Common, +Animate, +Human]) (may, [+M])"
+
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

- # academic_main_text_classifier_en
+ # Academic Main Text Classifier (en)

  This model is a fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on the None dataset.
  It achieves the following results on the evaluation set:
@@ -29,11 +38,100 @@ It achieves the following results on the evaluation set:

  ## Model description

- More information needed
+ The model is fine-tuned on academic publications in linguistics to classify text from publications into four classes, as a filter for other tasks. Sentence-based data obtained from OCR-processed PDF files was annotated manually with the following classes:
+
+ 0: out of scope - material of low significance, e.g. page numbers, page headers, and noise from OCR/PDF-to-text conversion
+ 1: main text - the main body text of the publication, to be used for downstream tasks
+ 2: examples - captions of figures, quotes, or excerpts
+ 3: references - the publication's references, excluding in-text citations

  ## Intended uses & limitations

- More information needed
+ Intended uses:
+
+ - to extract the main text of academic publications for downstream tasks
+
+ Limitations:
+
+ - training and evaluation data is limited to English academic texts in linguistics (though the model remains usable to some extent for German texts)
+
+ ## How to run
+
+ ```python
+ from transformers import pipeline
+
+ # define model name
+ model_name = "ubffm/academic_text_filter"
+
+ # run the model with an hf pipeline
+ ## return output for the best label only,
+ ## e.g. [{'label': 'EXAMPLE', 'score': 0.9601941108703613}]
+ classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)
+
+ ## return output for all labels (top_k=None replaces the deprecated return_all_scores=True),
+ ## e.g. [[{'label': 'OUT OF SCOPE', 'score': 0.007808608002960682}, {'label': 'MAIN TEXT', 'score': 0.028077520430088043}, {'label': 'EXAMPLE', 'score': 0.9601941108703613}, {'label': 'REFERENCE', 'score': 0.003919811453670263}]]
+ classifier = pipeline("text-classification", model=model_name, tokenizer=model_name, top_k=None)
+
+ # perform inference on your input text
+ your_text = "your text here."
+ result = classifier(your_text)
+
+ print(result)
+ ```
+
+ ## Try it yourself with the following examples (not in training/evaluation data)
+
+ Excerpts from Chomsky, N. (2014). Aspects of the Theory of Syntax (No. 11). MIT Press.
+ Retrieved from https://apps.dtic.mil/sti/pdfs/AD0616323.pdf
+
+ - In the case of (ioii) and (1 lii), the passive transformation will
+ apply to the embedded sentence, and in all four cases other
+ operations will give the final surface forms of (8) and (g).
+
+
+ - (10) (i) Noun Phrase — Verb — Noun Phrase — Sentence
+ (/ — persuaded — a specialist — a specialist will examine
+ John)
+ (ii) Noun Phrase — Verb — Noun Phrase — Sentence
+ (/ — persuaded — John — a specialist will examine John)
+
+
+ - (13) S
+ Det
+ Predicate-Phrase
+ [+Definite] nom VP
+ their
+ F1...Fm Det N
+ destroy [+Definite] G, ... G,
+ the property
+
+ - 184 SOME RESIDUAL PROBLEMS
+
+ - Peshkovskii, A. M. (1956). Russkii Sintaksis v Nauchnom Osveshchenii.
+ Moscow.
+
+ - S -» NP^Aux^VP
+
+ - (sincerity, [+N, —Count, +Abstract])
+ (boy, [+N, —Count, +Common, +Animate, +Human])
+ (may, [+M])
+
+
+ ## Problematic cases
+
+ Definitions or findings written in point form are challenging for the model. For example:
+
+ - (2) (i) the string (1) is a Sentence (S); frighten the boy is a Verb
+ Phrase (VP) consisting of the Verb (V) frighten and the
+ Noun Phrase (NP) the boy; sincerity is also an NP; the
+ NP the boy consists of the Determiner (Det) the, followed
+ by a Noun (N); the NP sincerity consists of just an N;
+ the is, furthermore, an Article (Art); may is a Verbal
+ Auxiliary (Aux) and, furthermore, a Modal (M).
+
+ - (v) specification of a function m such that m(i) is an integer
+ associated with the grammar G4 as its value (with, let us
+ say, lower value indicated by higher number)

  ## Training and evaluation data

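The intended use the new model card describes, extracting main text for downstream tasks, amounts to keeping only the sentences the classifier labels "MAIN TEXT". A minimal sketch of that filtering step follows; `filter_main_text` is a hypothetical helper (not part of the model card), and the label strings match the pipeline outputs shown in the card's "How to run" section.

```python
# Hypothetical helper: keep only sentences whose top prediction is `label`.
# `predictions` is expected in the hf pipeline's single-label output format,
# i.e. one {'label': ..., 'score': ...} dict per input sentence.
def filter_main_text(sentences, predictions, label="MAIN TEXT"):
    return [s for s, p in zip(sentences, predictions) if p["label"] == label]

# In practice the predictions would come from the model, e.g.:
#   from transformers import pipeline
#   classifier = pipeline("text-classification", model="ubffm/academic_text_filter")
#   predictions = classifier(sentences)

# Illustrative, hard-coded predictions in that output format:
sentences = [
    "184 SOME RESIDUAL PROBLEMS",
    "The passive transformation applies to the embedded sentence.",
]
predictions = [
    {"label": "OUT OF SCOPE", "score": 0.98},
    {"label": "MAIN TEXT", "score": 0.95},
]
print(filter_main_text(sentences, predictions))
# -> ['The passive transformation applies to the embedded sentence.']
```

The helper deliberately takes predictions as an argument rather than calling the pipeline itself, so page headers, references, and OCR noise can be dropped in one pass over already-computed scores.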