---
library_name: transformers
license: mit
datasets:
- chenghao/sec-material-contracts-qa-splitted
- chenghao/sec-material-contracts-qa
- jordyvl/DUDE_subset_100val
language:
- en
pipeline_tag: document-question-answering
---

# Idefics2-EDGAR

Idefics2 8B fine-tuned on 800+ multi-page documents for Visual DocQA. Make sure you have recent versions of `transformers` and `peft` installed before loading the model (the example below also uses `bitsandbytes` for 4-bit loading). A GPU is required for inference.

Compared to the base model, it achieves a 53% lower edit distance (micro average) on the test set. Per-category edit distances are below; Δ is the relative reduction in edit distance, so higher is better.

| Category                    | Idefics2-8B | Idefics2-8B-EDGAR | Δ (↑)  |
|:----------------------------|------------:|------------------:|-------:|
| agreement_date              |    0.878489 |         0.0999479 | 88.62% |
| agreement_term              |    0.907067 |          0.438816 | 51.62% |
| auto_renewal                |    0.634946 |         0.0516129 | 91.87% |
| contract_value              |    0.474438 |          0.418815 | 11.72% |
| counterparty_address        |    0.771387 |           0.59835 | 22.43% |
| counterparty_name           |    0.825491 |          0.633359 | 23.27% |
| counterparty_signer_name    |    0.842091 |          0.480444 | 42.95% |
| counterparty_signer_title   |     0.61746 |          0.496041 | 19.66% |
| effective_date              |    0.903268 |          0.125641 | 86.09% |
| expiration_date             |     0.88673 |          0.235197 | 73.48% |
| governing_law               |    0.881037 |          0.308771 | 64.95% |
| opt_out_length              |    0.431548 |          0.047619 | 88.97% |
| party_address               |    0.730897 |          0.608301 | 16.77% |
| party_name                  |    0.726411 |          0.490194 | 32.52% |
| payment_frequency           |    0.686123 |          0.373724 | 45.53% |
| payment_term                |    0.854552 |          0.593333 | 30.57% |
| renewal_term                |     0.92829 |         0.0595238 | 93.59% |
| termination_for_cause       |       0.436 |             0.048 | 88.99% |
| termination_for_convenience |    0.628261 |          0.156522 | 75.09% |
| termination_notice_period   |    0.329748 |          0.178394 | 45.90% |
| venue                       |    0.781417 |           0.61403 | 21.42% |

![](images/example.jpeg)

## Model Details

### Model Description

Fine-tuned from [Idefics2](https://huggingface.co/docs/transformers/main/en/model_doc/idefics2).

## Uses

```python
import torch
from transformers import AutoProcessor, Idefics2ForConditionalGeneration, BitsAndBytesConfig
from datasets import load_from_disk

base_model = "HuggingFaceM4/idefics2-8b"
peft_model_id = "chenghao/idefics2-edgar"

# Load the fine-tuned checkpoint in 4-bit NF4 so it fits on a single GPU.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = Idefics2ForConditionalGeneration.from_pretrained(
    peft_model_id,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
model.eval()

# The processor comes from the base model; image splitting is enabled for inference.
processor = AutoProcessor.from_pretrained(
    base_model,
    do_image_splitting=True,
    size={"longest_edge": 490, "shortest_edge": 350},
)

dataset = load_from_disk("local-dataset")
test_example = dataset["test"][30]
images, question, answer = test_example["images"], test_example["question"], test_example["answer"]

# One image placeholder per page, followed by the question.
messages = [
    {
        "role": "user",
        "content": [{"type": "image"} for _ in range(len(images))] + [{"type": "text", "text": question}],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt").to("cuda")

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
preds = [t.split("Assistant:", 1)[-1].strip() for t in generated_texts]
print(f"""
Question: {question}
Answer: {answer}
Prediction: {preds[0] if preds else 'N/A'}
""")
```
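
The example above relies on the `transformers`/`peft` integration to resolve the adapter checkpoint directly from its repo id. Loading the adapter explicitly is equivalent; a minimal sketch, assuming the checkpoint is a PEFT adapter on top of the base model:

```python
from peft import PeftModel

# Quantized base model, then the adapter weights on top of it.
base = Idefics2ForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
model = PeftModel.from_pretrained(base, peft_model_id)
model.eval()
```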

## Training Details

### Training Data

[SEC Contract QA](https://huggingface.co/datasets/chenghao/sec-material-contracts-qa)
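
It can be loaded straight from the Hub with the standard `datasets` API; a minimal example (field names are assumed to match the inference snippet above):

```python
from datasets import load_dataset

# Each example pairs a question/answer with the page images of a contract.
ds = load_dataset("chenghao/sec-material-contracts-qa")
print(ds)
```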

### Training Procedure

Trained for 10 epochs with QLoRA on a single A100-80GB for about 10 hours. Training code: [GitHub](https://github.com/ChenghaoMou/idefics2-contract-qa). The training configuration:

```python
MAX_LENGTH = 1024
USE_LORA = False
USE_QLORA = True
MAX_PAGE = 5

config = {
    "max_epochs": 10,
    # "val_check_interval": 0.2,
    "check_val_every_n_epoch": 1,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": 12,
    "lr": 1e-4,
    "batch_size": 2,
    "precision": "16-mixed",
    "seed": 42,
    "warmup_steps": 50,
    "result_path": "./result",
    "verbose": True,
}
```
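
Most of these keys map one-to-one onto PyTorch Lightning `Trainer` arguments; a minimal sketch of the wiring, assuming a Lightning training loop (the linked repo is authoritative):

```python
import lightning as L

# Only Trainer-level keys are shown; lr, batch_size, warmup_steps, and seed
# would be consumed by the LightningModule and dataloaders in the training script.
trainer = L.Trainer(
    max_epochs=config["max_epochs"],
    check_val_every_n_epoch=config["check_val_every_n_epoch"],
    gradient_clip_val=config["gradient_clip_val"],
    accumulate_grad_batches=config["accumulate_grad_batches"],
    precision=config["precision"],
)
```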

#### Preprocessing

Image splitting was disabled during training due to memory limits (the inference example above re-enables it):

```python
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
    size={"longest_edge": 490, "shortest_edge": 350},
)
```

#### Training Hyperparameters

```python
import torch
from transformers import BitsAndBytesConfig, Idefics2ForConditionalGeneration

# 4-bit NF4 base model with double quantization (the QLoRA setup).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
```
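
Only the 4-bit base-model setup is shown above; the LoRA adapter configuration lives in the training repo. For illustration, a typical QLoRA adapter setup with `peft` (the rank, alpha, and `target_modules` below are assumptions, not the values used for this checkpoint):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Cast norms/embeddings appropriately and enable gradient checkpointing hooks
# for training on a quantized base model.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    init_lora_weights="gaussian",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```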

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

20% of the dataset.

#### Metrics

Edit distance, computed with `nltk` and normalized (lower is better).
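
For reference, a minimal sketch of the metric, assuming normalization by the longer string's length (which matches the 0–1 range of the table above):

```python
from nltk import edit_distance

def normalized_edit_distance(prediction: str, target: str) -> float:
    # Levenshtein distance scaled into [0, 1]; 0 means an exact match.
    return edit_distance(prediction, target) / max(len(prediction), len(target), 1)
```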

### Results

See the per-category table above.