# BERT-Base Quantized Model for Relation Extraction

This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

---

## Model Details

- **Model Name:** BERT-Base Chinese
- **Model Architecture:** BERT Base
- **Task:** Relation Extraction/Classification
- **Dataset:** Chinese Entity-Relation Dataset
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Usage

### Installation

```bash
pip install transformers torch evaluate
```
### Loading the Quantized Model

```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
# torch_dtype="auto" keeps the precision stored in the checkpoint (Float16 here)
model = BertForSequenceClassification.from_pretrained(model_path, torch_dtype="auto")
model.eval()

# Example input with entity markers (the subject name is a placeholder)
# "Pen name: [SUBJ] ... [/SUBJ]  Birthplace: [OBJ] Chengdu [/OBJ]"
text = "笔名：[SUBJ] 杨某 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Map prediction to relation label (birthplace, birth date, ethnicity, occupation)
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}  # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
```
---

## Performance Metrics

- **Accuracy:** 0.970222
- **F1 Score:** 0.964973
- **Training Loss:** 0.130104
- **Validation Loss:** 0.066986
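For reference, accuracy and F1 can be computed with the `evaluate` library from the installation step (both metrics also require `scikit-learn` under the hood). A minimal sketch with made-up predictions and labels, not the exact evaluation code behind the numbers above:

```python
import evaluate

# Made-up predictions and gold labels, for illustration only
predictions = [0, 1, 2, 3, 2]
references = [0, 1, 2, 3, 3]

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

print(accuracy.compute(predictions=predictions, references=references))
# Multi-class F1 needs an explicit averaging strategy
print(f1.compute(predictions=predictions, references=references, average="weighted"))
```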
---

## Fine-Tuning Details

### Dataset

The model was fine-tuned on a Chinese entity-relation dataset with:
- Entity pairs marked with special tokens `[SUBJ]` and `[OBJ]`
- Text preprocessed to include entity boundaries
- Multiple relation types including biographical information
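Since the marker tokens `[SUBJ]`, `[/SUBJ]`, `[OBJ]`, and `[/OBJ]` are not part of the base BERT vocabulary, they have to be registered with the tokenizer and the embedding matrix resized before fine-tuning (the repository's `added_tokens.json` reflects this). A minimal sketch, assuming the standard `bert-base-chinese` checkpoint and an illustrative label count:

```python
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)  # num_labels is illustrative

# Register the entity markers so the tokenizer treats them as single tokens
special_tokens = ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

# Grow the embedding matrix to cover the newly added tokens
model.resize_token_embeddings(len(tokenizer))
```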
### Training Configuration

- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens
- **Evaluation Strategy:** epoch
- **Weight Decay:** 0.01
- **Optimizer:** AdamW
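A minimal sketch of how the configuration above maps onto Hugging Face `TrainingArguments` and `Trainer`. The tiny in-memory dataset, checkpoint name, and output directory are illustrative, not the actual training pipeline:

```python
import torch
from torch.utils.data import Dataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

class RelationDataset(Dataset):
    """Tiny in-memory dataset of pre-marked texts, for illustration only."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=128, return_tensors="pt")
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)
# In the real pipeline the marker tokens are registered first, as sketched above

texts = ["笔名：[SUBJ] 杨某 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]"]  # illustrative example
train_ds = eval_ds = RelationDataset(texts, [0], tokenizer)

args = TrainingArguments(
    output_dir="./results",            # placeholder output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",             # called evaluation_strategy in older releases
)                                      # AdamW is the Trainer default optimizer

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```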
### Data Processing

The original SPO (Subject-Predicate-Object) format was converted to relation classification:
- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification
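A minimal sketch of that conversion. The record layout, field names, and label set below are assumptions for illustration; the real dataset schema may differ:

```python
# Hypothetical SPO record; real field names depend on the dataset
record = {
    "text": "笔名：杨某 出生地：成都",
    "spo_list": [{"subject": "杨某", "predicate": "出生地", "object": "成都"}],
}

label2id = {"出生地": 0, "出生日期": 1, "民族": 2, "职业": 3}  # illustrative subset

examples = []
for spo in record["spo_list"]:
    # Wrap the subject and object with the entity-marker tokens
    marked = record["text"].replace(spo["subject"], f"[SUBJ] {spo['subject']} [/SUBJ]")
    marked = marked.replace(spo["object"], f"[OBJ] {spo['object']} [/OBJ]")
    # Encode the predicate as a numerical class label
    examples.append({"text": marked, "label": label2id[spo["predicate"]]})

print(examples[0]["text"])
# 笔名：[SUBJ] 杨某 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]
```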
### Quantization

Post-training quantization was applied using PyTorch's Float16 precision to reduce the model size and improve inference efficiency.
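A minimal sketch of that step, assuming the fine-tuned full-precision model directory from this repository; the output path is a placeholder:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("final_relation_extraction_model")

# Cast all floating-point weights to Float16 (post-training quantization)
model = model.half()

# Persist the quantized checkpoint; roughly halves the size on disk
model.save_pretrained("final_relation_extraction_model_fp16")  # placeholder path
```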
---

## Repository Structure

```
.
├── final_relation_extraction_model/
│   ├── config.json
│   ├── pytorch_model.bin          # Fine-tuned model weights
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── vocab.txt
│   └── added_tokens.json
├── relationship-extraction.ipynb  # Training notebook
└── README.md                      # Model documentation
```
---

## Entity Marking Format

The model expects input text with entities marked using special tokens:
- Subject entities: `[SUBJ] entity_name [/SUBJ]`
- Object entities: `[OBJ] entity_name [/OBJ]`

Example (entity names shown as placeholders):
```
Input:  "笔名：[SUBJ] 杨某 [/SUBJ] 原名：杨某某 民族：[OBJ] 回族 [/OBJ]"
        (Pen name: [SUBJ] ... [/SUBJ] Original name: ... Ethnicity: [OBJ] Hui [/OBJ])
Output: "民族" (ethnicity relation)
```
---

## Supported Relations

The model can classify various biographical and factual relations in Chinese text, including:
- 出生地 (Birthplace)
- 出生日期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- And many more based on the training dataset
---

## Limitations

- The model is specifically trained for Chinese text and may not work well with other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may result in minor accuracy degradation compared to full-precision models
| ## Training Environment | |
| - **Platform:** Kaggle Notebooks with GPU acceleration | |
| - **GPU:** NVIDIA Tesla T4 | |
| - **Training Time:** Approximately 1 hour 5 minutes | |
| - **Framework:** Hugging Face Transformers with PyTorch backend | |
| --- | |
| ## Contributing | |
| Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions. | |
| --- |