|
|
--- |
|
|
library_name: transformers |
|
|
license: mit |
|
|
base_model: microsoft/layoutlmv3-base |
|
|
tags: |
|
|
- generated_from_trainer |
|
|
metrics: |
|
|
- precision |
|
|
- recall |
|
|
- f1 |
|
|
- accuracy |
|
|
model-index: |
|
|
- name: layoutlmv3-xfund |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
|
|
|
|
|
# layoutlmv3-xfund |
|
|
|
|
|
This model is a fine-tuned version of [microsoft/layoutlmv3-base](https://huggingface.co/microsoft/layoutlmv3-base) on the [XFUND](https://github.com/doc-analysis/XFUND) form-understanding dataset.
|
|
It achieves the following results on the evaluation set: |
|
|
- Loss: 0.6625 |
|
|
- Precision: 0.7711 |
|
|
- Recall: 0.8476 |
|
|
- F1: 0.8075 |
|
|
- Accuracy: 0.8030 |
|
|
|
|
|
## Model description |
|
|
|
|
|
[LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) is a multimodal Transformer for Document AI that jointly encodes text, layout (bounding boxes), and image patches. This checkpoint fine-tunes it for token classification on form-style documents, tagging each OCR'd word as a question (key), answer (value), or header field.
|
|
|
|
|
## Intended uses & limitations |
|
|
|
|
|
The model is intended for form understanding / key-value extraction on document images: given an image together with OCR words and their bounding boxes, it assigns each word a `B-/I-QUESTION`, `B-/I-ANSWER`, `B-/I-HEADER`, or `O` tag. It does not perform OCR itself, so an external engine such as Tesseract is required at inference time (see the Inference section below). As with any model fine-tuned on a specific form corpus, performance may degrade on layouts or languages that differ from the training data.
|
|
|
|
|
## Training and evaluation data |
|
|
|
|
|
The model was fine-tuned and evaluated on XFUND, a multilingual form-understanding benchmark annotated with QUESTION/ANSWER/HEADER entities. The specific language subset(s) used for this run are not recorded here.
|
|
|
|
|
## Training procedure |
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training: |
|
|
- learning_rate: 3e-05 |
|
|
- train_batch_size: 2 |
|
|
- eval_batch_size: 2 |
|
|
- seed: 42 |
|
|
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
|
|
- lr_scheduler_type: linear |
|
|
- num_epochs: 5 |
|
|
- mixed_precision_training: Native AMP |
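
For reference, these settings map onto the `Trainer` API roughly as follows (a minimal sketch: the `output_dir` value is a placeholder, and dataset loading, the data collator, and the `Trainer` itself are omitted):

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above; not the original training script.
training_args = TrainingArguments(
    output_dir="layoutlmv3-xfund",   # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    num_train_epochs=5,
    fp16=True,                       # Native AMP mixed precision
)
```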
|
|
|
|
|
### Training results |
|
|
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy | |
|
|
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:| |
|
|
| 0.7142 | 1.0 | 522 | 0.7296 | 0.6225 | 0.7066 | 0.6619 | 0.7212 | |
|
|
| 0.5881 | 2.0 | 1044 | 0.6032 | 0.6841 | 0.8100 | 0.7417 | 0.7688 | |
|
|
| 0.4179 | 3.0 | 1566 | 0.5904 | 0.7204 | 0.8222 | 0.7679 | 0.7858 | |
|
|
| 0.3507 | 4.0 | 2088 | 0.6088 | 0.7600 | 0.8458 | 0.8006 | 0.7979 | |
|
|
| 0.2618 | 5.0 | 2610 | 0.6625 | 0.7711 | 0.8476 | 0.8075 | 0.8030 | |
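
Validation loss reaches its minimum at epoch 3 and rises thereafter, while precision, recall, F1, and accuracy keep improving through epoch 5.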
|
|
|
|
|
|
|
|
|
|
|
### Inference |
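
The following end-to-end example OCRs a document image with Tesseract, normalizes the word bounding boxes, runs token classification, and draws the predicted entity types onto the image. First install the OCR dependencies: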
|
|
|
|
|
```bash |
|
|
# Install the Python wrapper
pip install pytesseract pillow

# Install the Tesseract engine on a Debian/Ubuntu-based system
# (prefix both commands with "!" when running in a notebook such as Colab)
sudo apt install tesseract-ocr
|
|
``` |
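
If the Tesseract binary is installed in a non-standard location, point the wrapper at it explicitly, e.g. `pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"`.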
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoProcessor, AutoModelForTokenClassification |
|
|
from PIL import Image, ImageDraw, ImageFont |
|
|
import pytesseract |
|
|
import numpy as np |
|
|
import os # For setting environment variable |
|
|
|
|
|
# Optional: surface CUDA errors at the failing kernel (debugging only; slows execution)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
|
|
|
|
|
# --- Helper: normalize pixel boxes to LayoutLMv3's 0-1000 coordinate scale ---
|
|
def normalize_bbox(bbox, width, height): |
|
|
return [ |
|
|
int(1000 * min(max(bbox[0] / width, 0), 1)), |
|
|
int(1000 * min(max(bbox[1] / height, 0), 1)), |
|
|
int(1000 * min(max(bbox[2] / width, 0), 1)), |
|
|
int(1000 * min(max(bbox[3] / height, 0), 1)) |
|
|
] |
|
|
``` |
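
LayoutLMv3 expects every bounding box on a fixed 0–1000 scale regardless of the source image resolution, so the raw pixel coordinates returned by Tesseract must be rescaled (and clamped to the valid range) before encoding.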
|
|
|
|
|
```python |
|
|
# --- 1. Load your Fine-Tuned Model and Processor --- |
|
|
MODEL_ID = "nnul/layoutlmv3-xfund" |
|
|
|
|
|
print("Loading processor...") |
|
|
processor = AutoProcessor.from_pretrained(MODEL_ID) |
|
|
print("Loading model...") |
|
|
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID) |
|
|
|
|
|
print("Moving model to device...") |
|
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
model.to(device) |
|
|
print("Model moved successfully.") |
|
|
``` |
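
Passing `apply_ocr=False` matters here: by default the LayoutLMv3 processor runs its own built-in OCR and will reject externally supplied words and boxes like the ones produced in step 3.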
|
|
|
|
|
```python |
|
|
# --- 2. Load the Image --- |
|
|
image_path = "your_image.png" |
|
|
image = Image.open(image_path).convert("RGB") |
|
|
width, height = image.size |
|
|
``` |
|
|
|
|
|
```python |
|
|
# --- 3. Perform OCR and NORMALIZE Bounding Boxes --- |
|
|
print("Performing OCR...") |
|
|
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT) |
|
|
words = [] |
|
|
unnormalized_boxes = [] |
|
|
normalized_boxes = [] |
|
|
|
|
|
for i in range(len(data['text'])): |
|
|
if int(data['conf'][i]) > 30 and data['text'][i].strip() != '': |
|
|
word = data['text'][i] |
|
|
x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i] |
|
|
|
|
|
actual_box = [x, y, x + w, y + h] |
|
|
unnormalized_boxes.append(actual_box) |
|
|
|
|
|
normalized_box = normalize_bbox(actual_box, width, height) |
|
|
normalized_boxes.append(normalized_box) |
|
|
|
|
|
words.append(word) |
|
|
|
|
|
print(f"OCR found {len(words)} words.") |
|
|
``` |
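
The confidence cutoff of 30 is a heuristic: raising it discards more noisy detections at the cost of dropping faint but valid words.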
|
|
|
|
|
```python |
|
|
# --- 4. Manually Preprocess and Predict --- |
|
|
print("Preprocessing inputs...") |
|
|
encoding = processor( |
|
|
image, |
|
|
words, |
|
|
boxes=normalized_boxes, |
|
|
return_tensors="pt", |
|
|
truncation=True |
|
|
) |
|
|
|
|
|
print("Moving inputs to device...") |
|
|
for k, v in encoding.items(): |
|
|
encoding[k] = v.to(device) |
|
|
|
|
|
print("Running inference...") |
|
|
with torch.no_grad(): |
|
|
outputs = model(**encoding) |
|
|
|
|
|
logits = outputs.logits |
|
|
predictions_indices = logits.argmax(-1).squeeze().tolist() |
|
|
|
|
|
word_ids = encoding.word_ids() |
|
|
previous_word_id = None |
|
|
word_predictions = [] |
|
|
for idx, word_id in enumerate(word_ids): |
|
|
if word_id is not None and word_id != previous_word_id: |
|
|
label_id = predictions_indices[idx] |
|
|
word_predictions.append(model.config.id2label[label_id]) |
|
|
previous_word_id = word_id |
|
|
``` |
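
Because the tokenizer splits words into sub-tokens, only the prediction for the first sub-token of each word is kept, so `word_predictions` lines up one-to-one with `words` (and with `unnormalized_boxes`).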
|
|
|
|
|
```python |
|
|
def visualize_predictions(image, words, boxes, predictions): |
|
|
label2color = { |
|
|
"B-QUESTION": "blue", "I-QUESTION": "blue", |
|
|
"B-ANSWER": "green", "I-ANSWER": "green", |
|
|
"B-HEADER": "orange", "I-HEADER": "orange", |
|
|
"O": "gray" |
|
|
} |
|
|
draw_image = image.copy() |
|
|
draw = ImageDraw.Draw(draw_image) |
|
|
try: |
|
|
font = ImageFont.truetype("arial.ttf", 12) |
|
|
except IOError: |
|
|
font = ImageFont.load_default() |
|
|
for word, box, label in zip(words, boxes, predictions): |
|
|
color = label2color.get(label, 'red') |
|
|
draw.rectangle(box, outline=color, width=2) |
|
|
entity_type = label.split('-')[1] if '-' in label else 'OTHER' |
|
|
if entity_type != 'OTHER': |
|
|
draw.text((box[0], box[1] - 10), entity_type, fill=color, font=font) |
|
|
return draw_image |
|
|
``` |
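
The color map covers the XFUND entity labels; any unexpected label falls back to red, and `O` (non-entity) words are outlined in gray without a text tag.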
|
|
|
|
|
```python |
|
|
print("Visualizing results...") |
|
|
visualized_image = visualize_predictions(image, words, unnormalized_boxes, word_predictions) |
|
|
display(visualized_image)  # IPython/Jupyter only; use visualized_image.show() elsewhere
|
|
visualized_image.save("result_visualization_manual.png") |
|
|
print("Saved visualization to result_visualization_manual.png") |
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.52.4 |
|
|
- Pytorch 2.6.0+cu124 |
|
|
- Datasets 3.6.0 |
|
|
- Tokenizers 0.21.1 |