# BERT-Base Quantized Model for Relation Extraction

This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

---

## Model Details

- **Model Name:** BERT-Base Chinese
- **Model Architecture:** BERT Base
- **Task:** Relation Extraction/Classification
- **Dataset:** Chinese Entity-Relation Dataset
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Usage

### Installation

```bash
pip install transformers torch evaluate
```
### Loading the Quantized Model

```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
# torch_dtype="auto" keeps the precision stored in the checkpoint (Float16 here)
model = BertForSequenceClassification.from_pretrained(model_path, torch_dtype="auto")
model.eval()

# Example input with entity markers (the subject name is a placeholder)
# "Pen name: [SUBJ] ... [/SUBJ]  Birthplace: [OBJ] Chengdu [/OBJ]"
text = "笔名：[SUBJ] 杨某 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Map prediction to relation label (birthplace, birth date, ethnicity, occupation)
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}  # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
```
---

## Performance Metrics

- **Accuracy:** 0.970222
- **F1 Score:** 0.964973
- **Training Loss:** 0.130104
- **Validation Loss:** 0.066986
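For reference, accuracy and F1 can be computed with the `evaluate` library from the installation step (both metrics also require `scikit-learn` under the hood). A minimal sketch with made-up predictions and labels, not the exact evaluation code behind the numbers above:

```python
import evaluate

# Made-up predictions and gold labels, for illustration only
predictions = [0, 1, 2, 3, 2]
references = [0, 1, 2, 3, 3]

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

print(accuracy.compute(predictions=predictions, references=references))
# Multi-class F1 needs an explicit averaging strategy
print(f1.compute(predictions=predictions, references=references, average="weighted"))
```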
---

## Fine-Tuning Details

### Dataset

The model was fine-tuned on a Chinese entity-relation dataset with:
- Entity pairs marked with special tokens `[SUBJ]` and `[OBJ]`
- Text preprocessed to include entity boundaries
- Multiple relation types including biographical information
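Since the marker tokens `[SUBJ]`, `[/SUBJ]`, `[OBJ]`, and `[/OBJ]` are not part of the base BERT vocabulary, they have to be registered with the tokenizer and the embedding matrix resized before fine-tuning (the repository's `added_tokens.json` reflects this). A minimal sketch, assuming the standard `bert-base-chinese` checkpoint and an illustrative label count:

```python
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)  # num_labels is illustrative

# Register the entity markers so the tokenizer treats them as single tokens
special_tokens = ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

# Grow the embedding matrix to cover the newly added tokens
model.resize_token_embeddings(len(tokenizer))
```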
### Training Configuration

- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens
- **Evaluation Strategy:** epoch
- **Weight Decay:** 0.01
- **Optimizer:** AdamW
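A minimal sketch of how the configuration above maps onto Hugging Face `TrainingArguments` and `Trainer`. The tiny in-memory dataset, checkpoint name, and output directory are illustrative, not the actual training pipeline:

```python
import torch
from torch.utils.data import Dataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

class RelationDataset(Dataset):
    """Tiny in-memory dataset of pre-marked texts, for illustration only."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=128, return_tensors="pt")
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)
# In the real pipeline the marker tokens are registered first, as sketched above

texts = ["笔名：[SUBJ] 杨某 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]"]  # illustrative example
train_ds = eval_ds = RelationDataset(texts, [0], tokenizer)

args = TrainingArguments(
    output_dir="./results",            # placeholder output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",             # called evaluation_strategy in older releases
)                                      # AdamW is the Trainer default optimizer

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```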
### Data Processing

The original SPO (Subject-Predicate-Object) format was converted to relation classification:
- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification
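A minimal sketch of that conversion. The record layout, field names, and label set below are assumptions for illustration; the real dataset schema may differ:

```python
# Hypothetical SPO record; real field names depend on the dataset
record = {
    "text": "笔名：杨某 出生地：成都",
    "spo_list": [{"subject": "杨某", "predicate": "出生地", "object": "成都"}],
}

label2id = {"出生地": 0, "出生日期": 1, "民族": 2, "职业": 3}  # illustrative subset

examples = []
for spo in record["spo_list"]:
    # Wrap the subject and object with the entity-marker tokens
    marked = record["text"].replace(spo["subject"], f"[SUBJ] {spo['subject']} [/SUBJ]")
    marked = marked.replace(spo["object"], f"[OBJ] {spo['object']} [/OBJ]")
    # Encode the predicate as a numerical class label
    examples.append({"text": marked, "label": label2id[spo["predicate"]]})

print(examples[0]["text"])
# 笔名：[SUBJ] 杨某 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]
```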
### Quantization

Post-training quantization was applied using PyTorch's Float16 precision to reduce the model size and improve inference efficiency.
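A minimal sketch of that step, assuming the fine-tuned full-precision model directory from this repository; the output path is a placeholder:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("final_relation_extraction_model")

# Cast all floating-point weights to Float16 (post-training quantization)
model = model.half()

# Persist the quantized checkpoint; roughly halves the size on disk
model.save_pretrained("final_relation_extraction_model_fp16")  # placeholder path
```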
---

## Repository Structure

```
.
├── final_relation_extraction_model/
│   ├── config.json
│   ├── pytorch_model.bin          # Fine-tuned model weights
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── vocab.txt
│   └── added_tokens.json
├── relationship-extraction.ipynb  # Training notebook
└── README.md                      # Model documentation
```
---

## Entity Marking Format

The model expects input text with entities marked using special tokens:
- Subject entities: `[SUBJ] entity_name [/SUBJ]`
- Object entities: `[OBJ] entity_name [/OBJ]`

Example (entity names shown as placeholders):
```
Input:  "笔名：[SUBJ] 杨某 [/SUBJ] 原名：杨某某 民族：[OBJ] 回族 [/OBJ]"
        (Pen name: [SUBJ] ... [/SUBJ] Original name: ... Ethnicity: [OBJ] Hui [/OBJ])
Output: "民族" (ethnicity relation)
```
---

## Supported Relations

The model can classify various biographical and factual relations in Chinese text, including:
- 出生地 (Birthplace)
- 出生日期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- And many more based on the training dataset
---

## Limitations

- The model is specifically trained for Chinese text and may not work well with other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may result in minor accuracy degradation compared to full-precision models
| ## Training Environment | |
| - **Platform:** Kaggle Notebooks with GPU acceleration | |
| - **GPU:** NVIDIA Tesla T4 | |
| - **Training Time:** Approximately 1 hour 5 minutes | |
| - **Framework:** Hugging Face Transformers with PyTorch backend | |
| --- | |
| ## Contributing | |
| Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions. | |
| --- |