tobiasnorlund committed
Commit 53d94a1 · verified · 1 Parent(s): 0ad68e5

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
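The added line routes `tokenizer.json` through Git LFS like the other large artifacts tracked above. As an illustration only, gitattributes-style glob matching can be roughly approximated with Python's `fnmatch` (a simplification; real gitattributes matching has additional rules for paths and precedence):

```python
from fnmatch import fnmatch

# Patterns taken from the hunk above
lfs_patterns = ["*.zip", "*.zst", "*tfevents*", "tokenizer.json"]

def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does any LFS pattern match this filename?"""
    return any(fnmatch(filename, pattern) for pattern in lfs_patterns)

print(is_lfs_tracked("tokenizer.json"))          # -> True
print(is_lfs_tracked("config.json"))             # -> False
print(is_lfs_tracked("events.out.tfevents.123")) # -> True
```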
README.md ADDED
@@ -0,0 +1,117 @@
---
library_name: transformers
license: apache-2.0
base_model:
- severinsimmler/xlm-roberta-longformer-base-16384
---

# SWEb Markdown Extractor

This model was developed by the NLU team at [AI Sweden](https://www.ai.se) as a **primary content extractor** for web pages, and was used to produce the [SWEb dataset](https://huggingface.co/datasets/AI-Sweden-Models/SWEb).
For more details, please see [the SWEb paper](https://arxiv.org/abs/2410.04456) and the [SWEb source code](https://github.com/aidotse/SWEb/tree/main).

In the source code, you'll find:

- Our training and test extraction data
- An annotation tool for annotating additional data
- Training and inference scripts

## Model Details

### Model Description

The model can be used to extract the primary content from websites.
The example below shows how it can be used (taken from [here](https://github.com/aidotse/SWEb/tree/main)):

```python
import os
import requests
from torch.nn.functional import sigmoid
from pipeline.warc_processing import ConvertToMarkdown
from transformers import AutoTokenizer, AutoModelForTokenClassification

# 1. Download a webpage
resp = requests.get("https://www.ai.se/sv/nyheter/nobelpriset-i-fysik-och-kemi-till-banbrytande-ai-forskning")

# 2. Convert the HTML to markdown using pandoc
markdown = ConvertToMarkdown.convert_html_to_markdown(resp.content, pandoc_path=f"{os.environ['HOME']}/bin/pandoc")  # path to pandoc 2.9.2.1, see INSTALL.md

# 3. Extract text by classifying each line with the trained model
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor")
model = AutoModelForTokenClassification.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor").eval()
tokens = tokenizer(markdown.replace("\n", tokenizer.sep_token), return_tensors="pt", add_special_tokens=False, truncation=True)
tokens["line_sep_token_ids"] = (tokens.input_ids[0] == tokenizer.sep_token_id).nonzero()[None, :, 0]
logits = model(**tokens)[0]
extracted_lines = [
    line for line, pred in zip(markdown.split("\n"), sigmoid(logits))
    if pred > 0.05
]

# Print the extracted text
print("\n".join(extracted_lines))
```
outputs:

```markdown
# Nobelpriset i fysik och kemi till banbrytande AI-forskning

tisdag, oktober 8, 2024

Två Nobelpris till AI 2024\! Det i fysik går till forskning som lagt grunden till maskininlärning och artificiell intelligens, och det i kemi till Google DeepMinds AlphaFold2

*– Det är fantastiskt att det här viktiga arbetet får ett sådant erkännande. Särskilt den tillämpade AI som uppmärksammas i Kemipriset*, säger Johanna Bergman, Director of Strategic Initiatives på AI Sweden.

...
```
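Conceptually, the extraction above is per-line binary classification: markdown lines are joined with the tokenizer's sep token, the model emits one logit per separator, and lines whose sigmoid score exceeds a threshold (0.05 in the example) are kept. A minimal pure-Python sketch of just the thresholding step, using hypothetical logits rather than real model output:

```python
import math

def extract_lines(lines, logits, threshold=0.05):
    """Keep lines whose sigmoid(logit) exceeds the threshold."""
    assert len(lines) == len(logits)
    kept = []
    for line, logit in zip(lines, logits):
        prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
        if prob > threshold:
            kept.append(line)
    return kept

# Hypothetical logits: strongly negative -> boilerplate, positive -> content
lines = ["# Heading", "Cookie banner", "Main paragraph."]
logits = [2.0, -8.0, 1.5]
print(extract_lines(lines, logits))  # -> ['# Heading', 'Main paragraph.']
```

The low threshold favors recall: a line is dropped only when the model is quite confident it is boilerplate.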
### Model Sources

- **Repository:** https://github.com/aidotse/SWEb/tree/main
- **Paper:** https://arxiv.org/abs/2410.04456

## Uses

We propose using model-based extractors, as they provide more flexibility: the extraction is shaped through data rather than rules.
This model was trained on Scandinavian webpages in particular, so we expect extraction to work better for webpages in these languages than in others.
However, annotating additional data with our [tool](https://github.com/aidotse/SWEb/tree/main/annotation_tool) is quick, and the model learns from small amounts of data.
## Bias, Risks, and Limitations

The text extraction model presented here is designed to extract primary content from webpages, but it is important to acknowledge its inherent biases, risks, and limitations. The following aspects should be considered when using this model and the datasets derived from it.

- Incomplete Context and Information: Webpages often contain a mix of primary content, supplementary information, and context in surrounding elements (such as comments, metadata, or links). The text extraction model focuses on extracting the "main" content, which can lead to a loss of nuance or essential context. This limitation may affect the quality and usefulness of the pretraining datasets, especially in scenarios where contextual information is crucial for understanding.
- Domain-Specific Limitations: The effectiveness of the text extraction model may vary depending on the domain or structure of the webpages. For example, pages with heavy advertisements, complex layouts, or dynamically generated content might lead to extraction errors or incomplete outputs. These limitations can lead to a dataset that underrepresents content from such domains or introduces noise due to incorrect extraction.
- Content Filtering and Ethical Concerns: The extracted text may include offensive, explicit, or otherwise harmful content. Without adequate content filtering, this material could end up in pretraining datasets, affecting the behavior of downstream language models. Users must be aware of the ethical implications and potential harms of training models on unfiltered web data.
- Regional and Language Bias: The model's training data predominantly comes from Scandinavian languages and regions, which can lead to an overrepresentation of these languages in the extracted data.

These biases, risks, and limitations emphasize the need for careful curation, filtering, and post-processing of the extracted content to mitigate negative impacts on downstream applications. Users of this model should consider integrating diverse sources, employing bias mitigation techniques, and conducting ongoing evaluations to reduce the potential harms associated with large-scale pretraining.
92
+ ## Training Details
93
+
94
+ ### Training Data
95
+
96
+ Please find our training and test data [here](https://github.com/aidotse/SWEb/blob/main/annotation_tool/backend/data/data.jsonl)
97
+
98
+ ### Training Script
99
+
100
+ Please find our training script [here](https://github.com/aidotse/SWEb/blob/main/pipeline/line_classification/train.py)
101
+
102
+
103
+ ## Citation
104
+
105
+ To cite this work, please use the following:
106
+
107
+ ```
108
+ @misc{norlund2024sweblargewebdataset,
109
+ title={SWEb: A Large Web Dataset for the Scandinavian Languages},
110
+ author={Tobias Norlund and Tim Isbister and Amaru Cuba Gyllensten and Paul Dos Santos and Danila Petrelli and Ariel Ekgren and Magnus Sahlgren},
111
+ year={2024},
112
+ eprint={2410.04456},
113
+ archivePrefix={arXiv},
114
+ primaryClass={cs.CL},
115
+ url={https://arxiv.org/abs/2410.04456},
116
+ }
117
+ ```
config.json ADDED
@@ -0,0 +1,48 @@
{
  "_name_or_path": "severinsimmler/xlm-roberta-longformer-base-16384",
  "architectures": [
    "LongformerModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256
  ],
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 16386,
  "model_type": "longformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "onnx_export": false,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "sep_token_id": 2,
  "torch_dtype": "float32",
  "transformers_version": "4.41.1",
  "type_vocab_size": 1,
  "vocab_size": 250002
}
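Note the off-by-two between `max_position_embeddings` (16386) here and the tokenizer's `model_max_length` (16384, see tokenizer_config.json): RoBERTa-style models offset position indices by `padding_idx + 1`, reserving two embedding slots. A quick sanity check of that relationship, assuming the standard RoBERTa convention:

```python
# Values copied from config.json / tokenizer_config.json in this commit
max_position_embeddings = 16386
pad_token_id = 1
model_max_length = 16384

# RoBERTa-style position ids start at padding_idx + 1,
# so (pad_token_id + 1) embedding slots are reserved
reserved_positions = pad_token_id + 1
usable_positions = max_position_embeddings - reserved_positions
print(usable_positions)  # -> 16384
```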
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:45558dbdcd1613d514e4f886624ec717137f1c06f5fb8b6fbe85eec9115e8fc6
size 1243653612
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
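The sep token defined here (`</s>`) is what the README's extraction example substitutes for newlines, so each markdown line boundary becomes one classifiable separator. A minimal sketch of that substitution with plain strings (no tokenizer dependency):

```python
sep_token = "</s>"  # from special_tokens_map.json above

markdown = "# Heading\nFirst line\nSecond line"
joined = markdown.replace("\n", sep_token)
print(joined)  # -> # Heading</s>First line</s>Second line

# One separator per line boundary, so the model can emit one logit per line
assert joined.count(sep_token) == len(markdown.split("\n")) - 1
```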
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3a56def25aa40facc030ea8b0b87f3688e4b3c39eb8b45d5702b3a1300fe2a20
size 17082734
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 16384,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}