tobiasnorlund committed
Commit 53d94a1 · verified · 1 Parent(s): 0ad68e5

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
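The added line routes `tokenizer.json` through Git LFS like the other large artifacts tracked above. As an illustration only, gitattributes-style glob matching can be roughly approximated with Python's `fnmatch` (a simplification; real gitattributes matching has additional rules for paths and precedence):

```python
from fnmatch import fnmatch

# Patterns taken from the hunk above
lfs_patterns = ["*.zip", "*.zst", "*tfevents*", "tokenizer.json"]

def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does any LFS pattern match this filename?"""
    return any(fnmatch(filename, pattern) for pattern in lfs_patterns)

print(is_lfs_tracked("tokenizer.json"))          # -> True
print(is_lfs_tracked("config.json"))             # -> False
print(is_lfs_tracked("events.out.tfevents.123")) # -> True
```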
README.md ADDED
@@ -0,0 +1,117 @@
---
library_name: transformers
license: apache-2.0
base_model:
- severinsimmler/xlm-roberta-longformer-base-16384
---

# SWEb Markdown Extractor

This model was developed by the NLU team at [AI Sweden](https://www.ai.se) as a **primary content extractor** for web pages, and was used to produce the [SWEb dataset](https://huggingface.co/datasets/AI-Sweden-Models/SWEb).
For more details, please see [the SWEb paper](https://arxiv.org/abs/2410.04456) and the [SWEb source code](https://github.com/aidotse/SWEb/tree/main).

In the source code, you'll find:

- Our training and test extraction data
- An annotation tool for annotating additional data
- Training and inference scripts

## Model Details

### Model Description

The model can be used to extract the primary content from websites.
The example below shows how it can be used (taken from [here](https://github.com/aidotse/SWEb/tree/main)):

```python
import os
import requests
from torch.nn.functional import sigmoid
from pipeline.warc_processing import ConvertToMarkdown
from transformers import AutoTokenizer, AutoModelForTokenClassification

# 1. Download a webpage
resp = requests.get("https://www.ai.se/sv/nyheter/nobelpriset-i-fysik-och-kemi-till-banbrytande-ai-forskning")

# 2. Convert the HTML to markdown using pandoc
markdown = ConvertToMarkdown.convert_html_to_markdown(resp.content, pandoc_path=f"{os.environ['HOME']}/bin/pandoc")  # path to pandoc 2.9.2.1, see INSTALL.md

# 3. Extract text by classifying each line with the trained model
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor")
model = AutoModelForTokenClassification.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor").eval()
tokens = tokenizer(markdown.replace("\n", tokenizer.sep_token), return_tensors="pt", add_special_tokens=False, truncation=True)
tokens["line_sep_token_ids"] = (tokens.input_ids[0] == tokenizer.sep_token_id).nonzero()[None, :, 0]
logits = model(**tokens)[0]
extracted_lines = [
    line for line, pred in zip(markdown.split("\n"), sigmoid(logits))
    if pred > 0.05
]

# Print the extracted text
print("\n".join(extracted_lines))
```
outputs:

```markdown
# Nobelpriset i fysik och kemi till banbrytande AI-forskning

tisdag, oktober 8, 2024

Två Nobelpris till AI 2024\! Det i fysik går till forskning som lagt grunden till maskininlärning och artificiell intelligens, och det i kemi till Google DeepMinds AlphaFold2

*– Det är fantastiskt att det här viktiga arbetet får ett sådant erkännande. Särskilt den tillämpade AI som uppmärksammas i Kemipriset*, säger Johanna Bergman, Director of Strategic Initiatives på AI Sweden.

...
```
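Conceptually, the extraction above is per-line binary classification: markdown lines are joined with the tokenizer's sep token, the model emits one logit per separator, and lines whose sigmoid score exceeds a threshold (0.05 in the example) are kept. A minimal pure-Python sketch of just the thresholding step, using hypothetical logits rather than real model output:

```python
import math

def extract_lines(lines, logits, threshold=0.05):
    """Keep lines whose sigmoid(logit) exceeds the threshold."""
    assert len(lines) == len(logits)
    kept = []
    for line, logit in zip(lines, logits):
        prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
        if prob > threshold:
            kept.append(line)
    return kept

# Hypothetical logits: strongly negative -> boilerplate, positive -> content
lines = ["# Heading", "Cookie banner", "Main paragraph."]
logits = [2.0, -8.0, 1.5]
print(extract_lines(lines, logits))  # -> ['# Heading', 'Main paragraph.']
```

The low threshold favors recall: a line is dropped only when the model is quite confident it is boilerplate.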
### Model Sources

- **Repository:** https://github.com/aidotse/SWEb/tree/main
- **Paper:** https://arxiv.org/abs/2410.04456

## Uses

We propose using model-based extractors, as they provide more flexibility: the extraction is shaped through data rather than rules.
This model was trained on Scandinavian webpages in particular, so we expect extraction to work better for webpages in these languages than in others.
However, annotating additional data with our [tool](https://github.com/aidotse/SWEb/tree/main/annotation_tool) is quick, and the model learns from small amounts of data.
## Bias, Risks, and Limitations

The text extraction model presented here is designed to extract primary content from webpages, but it is important to acknowledge its inherent biases, risks, and limitations. The following aspects should be considered when using this model and the datasets derived from it.

- Incomplete Context and Information: Webpages often contain a mix of primary content, supplementary information, and context in surrounding elements (such as comments, metadata, or links). The text extraction model focuses on extracting the "main" content, which can lead to a loss of nuance or essential context. This limitation may affect the quality and usefulness of the pretraining datasets, especially in scenarios where contextual information is crucial for understanding.
- Domain-Specific Limitations: The effectiveness of the text extraction model may vary depending on the domain or structure of the webpages. For example, pages with heavy advertisements, complex layouts, or dynamically generated content might lead to extraction errors or incomplete outputs. These limitations can lead to a dataset that underrepresents content from such domains or introduces noise due to incorrect extraction.
- Content Filtering and Ethical Concerns: The extracted text may include offensive, explicit, or otherwise harmful content. Without adequate content filtering, this material could end up in pretraining datasets, affecting the behavior of downstream language models. Users must be aware of the ethical implications and potential harms of training models on unfiltered web data.
- Regional and Language Bias: The model's training data predominantly comes from Scandinavian languages and regions, which can lead to an overrepresentation of these languages in the extracted data.

These biases, risks, and limitations emphasize the need for careful curation, filtering, and post-processing of the extracted content to mitigate negative impacts on downstream applications. Users of this model should consider integrating diverse sources, employing bias mitigation techniques, and conducting ongoing evaluations to reduce the potential harms associated with large-scale pretraining.
92
+ ## Training Details
93
+
94
+ ### Training Data
95
+
96
+ Please find our training and test data [here](https://github.com/aidotse/SWEb/blob/main/annotation_tool/backend/data/data.jsonl)
97
+
98
+ ### Training Script
99
+
100
+ Please find our training script [here](https://github.com/aidotse/SWEb/blob/main/pipeline/line_classification/train.py)
101
+
102
+
103
+ ## Citation
104
+
105
+ To cite this work, please use the following:
106
+
107
+ ```
108
+ @misc{norlund2024sweblargewebdataset,
109
+ title={SWEb: A Large Web Dataset for the Scandinavian Languages},
110
+ author={Tobias Norlund and Tim Isbister and Amaru Cuba Gyllensten and Paul Dos Santos and Danila Petrelli and Ariel Ekgren and Magnus Sahlgren},
111
+ year={2024},
112
+ eprint={2410.04456},
113
+ archivePrefix={arXiv},
114
+ primaryClass={cs.CL},
115
+ url={https://arxiv.org/abs/2410.04456},
116
+ }
117
+ ```
config.json ADDED
@@ -0,0 +1,48 @@
{
  "_name_or_path": "severinsimmler/xlm-roberta-longformer-base-16384",
  "architectures": [
    "LongformerModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256
  ],
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 16386,
  "model_type": "longformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "onnx_export": false,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "sep_token_id": 2,
  "torch_dtype": "float32",
  "transformers_version": "4.41.1",
  "type_vocab_size": 1,
  "vocab_size": 250002
}
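Note the off-by-two between `max_position_embeddings` (16386) here and the tokenizer's `model_max_length` (16384, see tokenizer_config.json): RoBERTa-style models offset position indices by `padding_idx + 1`, reserving two embedding slots. A quick sanity check of that relationship, assuming the standard RoBERTa convention:

```python
# Values copied from config.json / tokenizer_config.json in this commit
max_position_embeddings = 16386
pad_token_id = 1
model_max_length = 16384

# RoBERTa-style position ids start at padding_idx + 1,
# so (pad_token_id + 1) embedding slots are reserved
reserved_positions = pad_token_id + 1
usable_positions = max_position_embeddings - reserved_positions
print(usable_positions)  # -> 16384
```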
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:45558dbdcd1613d514e4f886624ec717137f1c06f5fb8b6fbe85eec9115e8fc6
size 1243653612
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
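The sep token defined here (`</s>`) is what the README's extraction example substitutes for newlines, so each markdown line boundary becomes one classifiable separator. A minimal sketch of that substitution with plain strings (no tokenizer dependency):

```python
sep_token = "</s>"  # from special_tokens_map.json above

markdown = "# Heading\nFirst line\nSecond line"
joined = markdown.replace("\n", sep_token)
print(joined)  # -> # Heading</s>First line</s>Second line

# One separator per line boundary, so the model can emit one logit per line
assert joined.count(sep_token) == len(markdown.split("\n")) - 1
```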
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3a56def25aa40facc030ea8b0b87f3688e4b3c39eb8b45d5702b3a1300fe2a20
size 17082734
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 16384,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}