Add layoutlm-camembertv2-qa

Files changed (7)

README +60 -0
config.json +21 -0
model.safetensors +3 -0
pytorch_model.bin +3 -0
special_tokens_map.json +16 -0
tokenizer.json +0 -0
tokenizer_config.json +64 -0

README ADDED

	@@ -0,0 +1,60 @@

+# layoutlm-camembertv2-qa
+This repository contains **layoutlm-camembertv2-qa** weights exported to `safetensors` format.
+## Source
+These weights are derived from pretrained models:
+- **Layout encoder (LayoutLM)**: [`microsoft/layoutlm-base-uncased`](https://huggingface.co/microsoft/layoutlm-base-uncased), pretrained on IIT-CDIP with the masked visual-language modeling objective (see the LayoutLM paper)
+- **Text encoder**: [`almanach/camembertv2-base`](https://huggingface.co/almanach/camembertv2-base), a French language model (RoBERTa-like architecture)
+## Methodology
+This checkpoint was produced by **weight merging**, not end-to-end training.
+1. Load the pretrained LayoutLM layout encoder weights and keep them intact
+2. Replace the text encoder weights (embeddings, attention layers, FFN) with those from the French model
+3. Update the tokenizer and vocabulary configuration accordingly
+No training or fine-tuning was performed at this stage.
+This checkpoint is intended as a **starting point** for downstream fine-tuning on French document understanding tasks (NER, token classification, extractive QA…).
+## Files
+| File | Description |
+|------|-------------|
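+| `model.safetensors` | Model weights (`safetensors` format) |
+| `pytorch_model.bin` | Model weights (PyTorch format) |
+| `config.json` | Model configuration |
+| `tokenizer.json` | Tokenizer vocabulary and rules |
+| `tokenizer_config.json` | Tokenizer configuration |
+| `special_tokens_map.json` | Special token mapping |
+| `README.md` | This model card |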
+## Usage
+```python
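+from transformers import AutoTokenizer, LayoutLMForQuestionAnswering
+
+# Replace USERNAME/MODEL_NAME with this repository's Hub ID.
+tokenizer = AutoTokenizer.from_pretrained("USERNAME/MODEL_NAME")
+# config.json declares LayoutLMForQuestionAnswering; AutoModel would load only the base encoder.
+model = LayoutLMForQuestionAnswering.from_pretrained("USERNAME/MODEL_NAME")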
+```
+## Limitations
+- This model has **not been fine-tuned** on any French document dataset
+- Performance on downstream tasks is **not guaranteed** without task-specific fine-tuning
+- Intended for research and experimentation purposes
+## License
+Weights are derived from models released under the MIT and Apache-2.0 licenses.
+Please refer to the original repositories for full license terms.
+## Acknowledgements
+- [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) — Xu et al., 2020
+- [`microsoft/layoutlm-base-uncased`](https://huggingface.co/microsoft/layoutlm-base-uncased)
+> **Note**: This is not an official release from any of the above organizations.
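
The weight-merging steps listed under Methodology in the README can be illustrated with a minimal sketch. This is not the script used to build the checkpoint: the function name `merge_text_encoder`, the `LAYOUT_ONLY_MARKERS` tuple, and the toy parameter names are assumptions for illustration, and the sketch assumes the French model's parameters have already been renamed to match the layout model's keys. Plain lists stand in for tensors so the example runs without PyTorch.

```python
# Hypothetical sketch of the weight-merging idea; not the author's actual script.

# Substrings identifying parameters that exist only in the layout model
# (2D position embeddings). These names are assumptions for illustration.
LAYOUT_ONLY_MARKERS = (
    "x_position_embeddings",
    "y_position_embeddings",
    "h_position_embeddings",
    "w_position_embeddings",
)


def merge_text_encoder(layout_state, text_state):
    """Return (merged state dict, names replaced from the text model)."""
    merged = {}
    replaced = []
    for name, value in layout_state.items():
        is_layout_only = any(marker in name for marker in LAYOUT_ONLY_MARKERS)
        if not is_layout_only and name in text_state:
            merged[name] = text_state[name]  # take the French model's weight
            replaced.append(name)
        else:
            merged[name] = value  # keep the layout model's weight
    return merged, replaced


# Toy state dicts: lists stand in for tensors.
layout_state = {
    "embeddings.word_embeddings.weight": [0.1, 0.2],
    "embeddings.x_position_embeddings.weight": [0.3],
    "encoder.layer.0.attention.self.query.weight": [0.4],
}
text_state = {
    "embeddings.word_embeddings.weight": [1.1, 1.2, 1.3],
    "encoder.layer.0.attention.self.query.weight": [1.4],
}

merged, replaced = merge_text_encoder(layout_state, text_state)
print(sorted(replaced))
```

In the toy data the replaced word-embedding entry has a different length than the one it replaces, mirroring the fact that swapping in another model's vocabulary changes the embedding matrix shape, so `vocab_size` in the configuration has to be updated along with the weights.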

config.json ADDED

	@@ -0,0 +1,21 @@

+{
+  "architectures": [
+    "LayoutLMForQuestionAnswering"
+  ],
+  "model_type": "layoutlm",
+  "hidden_size": 768,
+  "num_hidden_layers": 12,
+  "num_attention_heads": 12,
+  "intermediate_size": 3072,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "attention_probs_dropout_prob": 0.1,
+  "max_position_embeddings": 1025,
+  "max_2d_position_embeddings": 1024,
+  "type_vocab_size": 1,
+  "vocab_size": 32768,
+  "pad_token_id": 0,
+  "layer_norm_eps": 1e-12,
+  "initializer_range": 0.02,
+  "num_labels": 2
+}
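
A few of the values in this configuration depend on each other and on the tokenizer settings added in the same commit. The sketch below shows one way such relationships could be checked before fine-tuning. The helper `check_config` and the particular checks chosen are illustrative assumptions, not part of the repository.

```python
import json

# Subset of the config.json added in this commit.
config = json.loads("""
{
  "hidden_size": 768,
  "num_attention_heads": 12,
  "max_position_embeddings": 1025,
  "max_2d_position_embeddings": 1024,
  "vocab_size": 32768,
  "pad_token_id": 0
}
""")

# "model_max_length" from tokenizer_config.json in the same commit.
MODEL_MAX_LENGTH = 1024


def check_config(cfg, model_max_length):
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    if cfg["hidden_size"] % cfg["num_attention_heads"] != 0:
        problems.append("hidden_size is not divisible by num_attention_heads")
    if model_max_length > cfg["max_position_embeddings"]:
        problems.append("tokenizer allows sequences longer than the position table")
    if not 0 <= cfg["pad_token_id"] < cfg["vocab_size"]:
        problems.append("pad_token_id is outside the vocabulary")
    return problems


# Per-head dimension implied by the configuration.
head_dim = config["hidden_size"] // config["num_attention_heads"]
problems = check_config(config, MODEL_MAX_LENGTH)
print(head_dim, problems)
```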

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6ee3fb60eec62b6a9962c0737e20f45ba42e1f0e956a148483945c07ec0f9e45
+size 457440704

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:886011a878e2cd1ec388342141aec26820337dc2e7d7595d5d1c72bca987d770
+size 457484503
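
The two weight files are stored as Git LFS pointers, small text files of `key value` lines that record the hash and byte size of the real object. As a rough illustration, the sketch below parses both pointers shown in this commit and compares their sizes. The helper name `parse_lfs_pointer` and the returned field names are assumptions made for this example.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file (lines of 'key value') into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from this commit.
SAFETENSORS_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6ee3fb60eec62b6a9962c0737e20f45ba42e1f0e956a148483945c07ec0f9e45
size 457440704
"""
BIN_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:886011a878e2cd1ec388342141aec26820337dc2e7d7595d5d1c72bca987d770
size 457484503
"""

safetensors_ptr = parse_lfs_pointer(SAFETENSORS_POINTER)
bin_ptr = parse_lfs_pointer(BIN_POINTER)

# Difference in bytes between the two serializations of the same weights.
size_gap = bin_ptr["size"] - safetensors_ptr["size"]
print(safetensors_ptr["hash_algo"], size_gap)
```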

special_tokens_map.json ADDED

	@@ -0,0 +1,16 @@

+{
+  "additional_special_tokens": [
+    "[PAD]",
+    "[CLS]",
+    "[SEP]",
+    "[UNK]",
+    "[MASK]"
+  ],
+  "bos_token": "[CLS]",
+  "cls_token": "[CLS]",
+  "eos_token": "[SEP]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,64 @@

+{
+  "add_prefix_space": true,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "[PAD]",
+    "[CLS]",
+    "[SEP]",
+    "[UNK]",
+    "[MASK]"
+  ],
+  "bos_token": "[CLS]",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "eos_token": "[SEP]",
+  "errors": "replace",
+  "mask_token": "[MASK]",
+  "model_max_length": 1024,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "trim_offsets": true,
+  "unk_token": "[UNK]"
+}
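
The `added_tokens_decoder` table maps token IDs to special tokens, and `config.json` separately declares `pad_token_id`. The sketch below shows one way to confirm the two agree, using an abbreviated copy of the tokenizer configuration. The variable names are illustrative assumptions, and only the `content` field of each entry is reproduced.

```python
import json

# Abbreviated copy of tokenizer_config.json from this commit.
tokenizer_config = json.loads("""
{
  "added_tokens_decoder": {
    "0": {"content": "[PAD]"},
    "1": {"content": "[CLS]"},
    "2": {"content": "[SEP]"},
    "3": {"content": "[UNK]"},
    "4": {"content": "[MASK]"}
  },
  "pad_token": "[PAD]",
  "cls_token": "[CLS]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]",
  "mask_token": "[MASK]"
}
""")

CONFIG_PAD_TOKEN_ID = 0  # "pad_token_id" in config.json

# Invert the id -> token table so tokens can be looked up by content.
token_to_id = {
    entry["content"]: int(token_id)
    for token_id, entry in tokenizer_config["added_tokens_decoder"].items()
}

# Resolve each special-token role to its integer ID.
special_ids = {
    role: token_to_id[tokenizer_config[role]]
    for role in ("pad_token", "cls_token", "sep_token", "unk_token", "mask_token")
}
pad_matches = special_ids["pad_token"] == CONFIG_PAD_TOKEN_ID
print(special_ids, pad_matches)
```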