c-ho committed on
Commit 5be109b · verified · 1 Parent(s): 29d84c0

Update README.md

Files changed (1): README.md +51 -5
README.md CHANGED
@@ -4,6 +4,7 @@ license: mit
  base_model: FacebookAI/xlm-roberta-base
  tags:
  - generated_from_trainer
  metrics:
  - accuracy
  - precision
@@ -17,9 +18,9 @@ model-index:
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

- # academic_main_text_classifier_de

- This model is a fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on the None dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.2342
  - Accuracy: 0.9385
@@ -29,15 +30,60 @@ It achieves the following results on the evaluation set:

  ## Model description

- More information needed

  ## Intended uses & limitations

- More information needed

  ## Training and evaluation data

- More information needed

  ## Training procedure

  base_model: FacebookAI/xlm-roberta-base
  tags:
  - generated_from_trainer
+ language: de
  metrics:
  - accuracy
  - precision
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

+ # Academic Main Text Classifier (de)

+ This model is a fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on a labelled dataset of publications in the Bibliography of Linguistic Literature.
  It achieves the following results on the evaluation set:
  - Loss: 0.2342
  - Accuracy: 0.9385
 

  ## Model description

+ The model is fine-tuned on academic publications in linguistics to classify text segments from publications into 4 classes, serving as a filter for other tasks. Sentence-based data obtained from OCR-processed PDF files was annotated manually with the following classes:
+
+ - 0: out of scope - material of low significance, e.g. page numbers, page headers, and noise from OCR/PDF-to-text conversion
+ - 1: main text - text belonging to the main body of the publication, to be used for downstream tasks
+ - 2: examples - figure captions, quotes, or excerpts
+ - 3: references - the publication's references, excluding in-text citations
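As a minimal sketch, the scheme above can be kept next to inference code as a plain id-to-label mapping. The label strings are taken from the example pipeline outputs shown in this card; the mapping itself is an illustrative helper, and its agreement with the model's `id2label` config is an assumption, not something this card guarantees.

```python
# Illustrative mapping of the annotation scheme to the label strings
# the pipeline emits (assumed to match the model's id2label config).
LABELS = {
    0: "OUT OF SCOPE",  # page numbers, headers, OCR noise
    1: "MAIN TEXT",     # main body text, used for downstream tasks
    2: "EXAMPLE",       # figure captions, quotes, excerpts
    3: "REFERENCE",     # bibliography entries, not in-text citations
}

print(LABELS[1])  # MAIN TEXT
```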
 
  ## Intended uses & limitations

+ Intended uses:
+
+ - filter out noise from the OCR of academic texts (conference papers, journals, books, etc.)
+ - extract the main text of academic publications for downstream NLP tasks
+
+ Limitations:
+
+ - the training and evaluation data are limited to English-language academic texts in linguistics (though the model is still usable to a fair extent for German texts)
+
+ ## How to run
+
+ ```python
+ from transformers import pipeline
+
+ # define the model name
+ model_name = "ubffm/academic_text_filter_de"
+
+ # build a classifier that returns only the best label,
+ # e.g. [{'label': 'EXAMPLE', 'score': 0.9601941108703613}]
+ classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)
+
+ # build a classifier that returns scores for all labels,
+ # e.g. [[{'label': 'OUT OF SCOPE', 'score': 0.007808608002960682}, {'label': 'MAIN TEXT', 'score': 0.028077520430088043}, {'label': 'EXAMPLE', 'score': 0.9601941108703613}, {'label': 'REFERENCE', 'score': 0.003919811453670263}]]
+ # (return_all_scores is deprecated in newer transformers versions; use top_k=None there)
+ classifier_all = pipeline("text-classification", model=model_name, tokenizer=model_name, return_all_scores=True)
+
+ # perform inference on your input text
+ your_text = "your text here."
+ result = classifier(your_text)
+
+ print(result)
+ ```
+
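The noise-filtering use case can be sketched as below. `keep_main_text` is a hypothetical helper, not part of this repository, and `fake_classify` is a stub standing in for the Hugging Face pipeline above so the snippet runs without downloading the model; in practice you would pass the real `classifier` instead.

```python
def keep_main_text(sentences, classify):
    """Keep only the sentences whose top label is MAIN TEXT."""
    kept = []
    for sentence in sentences:
        top = classify(sentence)[0]  # pipeline returns [{'label': ..., 'score': ...}]
        if top["label"] == "MAIN TEXT":
            kept.append(sentence)
    return kept

# Stub standing in for the pipeline: it crudely treats very short
# strings as OCR noise, purely for illustration.
def fake_classify(text):
    label = "MAIN TEXT" if len(text.split()) > 3 else "OUT OF SCOPE"
    return [{"label": label, "score": 1.0}]

lines = ["12", "Chapter 3", "This sentence belongs to the main body of the paper."]
print(keep_main_text(lines, fake_classify))
# ['This sentence belongs to the main body of the paper.']
```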
+ ## Try it yourself with the following examples (not in training/evaluation data)
+
+ ## Problematic cases

  ## Training and evaluation data

+ ### Labelled dataset from open-access publications of the Bibliography of Linguistic Literature (BLL)
+
+ The Bibliography of Linguistic Literature (BLL) is one of the most comprehensive sources of bibliographic information for general linguistics, its subdomains, and neighbouring disciplines, as well as for English, German, and Romance linguistics. The subject bibliography is based mainly on the library's holdings on linguistics. It lists monographs, dissertations, articles from periodicals, collective works, conference contributions, unpublished research papers, etc. The printed edition is published annually (at the end of each year) and covers the literature of the previous year plus some supplements; it usually includes about 10,000 references per year. (Frankfurt a. M.: Klostermann, 1.1971/75(1976) - 47.2021 (2022))
+
+ (See more at https://www.ub.uni-frankfurt.de/linguistik/sammlung_en.html)
+

  ## Training procedure