ntphuc149 committed on
Commit 1561b75 · verified · 1 Parent(s): 841540e

Update README.md

  1. README.md +140 -3
README.md CHANGED
@@ -1,3 +1,140 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ datasets:
+ - ntphuc149/ViBidLQA
+ language:
+ - vi
+ metrics:
+ - exact_match
+ - f1
+ base_model: nguyenvulebinh/vi-mrc-large
+ pipeline_tag: question-answering
+ library_name: transformers
+ tags:
+ - legal
+ - question-answering
+ - machine-reading-comprehension
+ - vietnamese
+ - bidding-law
+ ---
+
+ # ViBidLEQA_large: A Vietnamese Bidding Law Extractive Question Answering Model
+
+ ## Overview
+ ViBidLEQA_large is an extractive question answering (EQA) model for the Vietnamese bidding law domain. It is fine-tuned from the nguyenvulebinh/vi-mrc-large checkpoint on the ViBidLQA dataset and extracts answer spans from legal passages, reaching 88.30 Exact Match and 94.25 F1 on the ViBidLQA test set.
+
+ ## Model Description
+
+ - **Task**: Extractive Question Answering
+ - **Domain**: Vietnamese Bidding Law
+ - **Base Model**: nguyenvulebinh/vi-mrc-large
+ - **Approach**: Fine-tuning
+ - **Language**: Vietnamese
+
+ ## Dataset
+
+ The ViBidLQA dataset consists of:
+ - **Training set**: 5,300 samples
+ - **Test set**: 1,000 samples
+ - **Data Creation Process**:
+   - Training data was automatically generated using Claude 3.5 Sonnet and validated by two legal experts
+   - The test set was manually created and verified by two Vietnamese legal experts
+   - All samples focus on Vietnamese bidding law content
+
+ ## Performance
+
+ Results on the 1,000-sample test set:
+
+ | Metric | Score |
+ |--------|-------|
+ | Exact Match | 88.30 |
+ | F1-Score | 94.25 |
+
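The README does not say how Exact Match and F1 are computed. The sketch below assumes the common SQuAD-style definitions (string equality after normalization, and token-overlap F1); the `normalize` helper and its lowercase-and-split rule are illustrative assumptions, not the authors' evaluation script.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (an assumed, simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions, the reported scores would be the mean of each function over all test samples, multiplied by 100.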
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForQuestionAnswering
+ import torch
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViBidLEQA_large")
+ model = AutoModelForQuestionAnswering.from_pretrained("ntphuc149/ViBidLEQA_large")
+
+ # Example usage (inputs stay in Vietnamese; the model is Vietnamese-only)
+ # Question: "What is limited bidding?"
+ question = "Thế nào là đấu thầu hạn chế?"
+ # Context: "Limited bidding is a contractor selection method in which only some contractors
+ # that meet the capacity and experience requirements are invited by the procuring entity."
+ context = "Đấu thầu hạn chế là phương thức lựa chọn nhà thầu trong đó chỉ một số nhà thầu đáp ứng yêu cầu về năng lực và kinh nghiệm được bên mời thầu mời tham gia."
+
+ # Tokenize the question/context pair, truncating to the 512-token limit
+ inputs = tokenizer(
+     question,
+     context,
+     return_tensors="pt",
+     max_length=512,
+     truncation=True,
+     padding=True
+ )
+
+ # Get model predictions without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Get answer span: most likely start and end token (end index is exclusive)
+ answer_start = torch.argmax(outputs.start_logits)
+ answer_end = torch.argmax(outputs.end_logits) + 1
+
+ # Decode the selected tokens back to text
+ answer = tokenizer.decode(inputs.input_ids[0][answer_start:answer_end], skip_special_tokens=True)
+ print(f"Question: {question}")
+ print(f"Answer: {answer}")
+ ```
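Taking the two argmax values independently, as in the snippet above, can yield an end position before the start position, which decodes to an empty answer. A common remedy is to score start/end pairs jointly; the sketch below is one way to do that. The `best_span` helper and its `max_answer_len=30` default are hypothetical and not part of the model repository.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair, end inclusive, that maximizes
    start_logits[start] + end_logits[end] subject to start <= end
    and a span of at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best
```

With the tensors from the usage example, this could be called as `best_span(outputs.start_logits[0].tolist(), outputs.end_logits[0].tolist())`, decoding `inputs.input_ids[0][start:end + 1]` afterwards. The sketch does not exclude question or special tokens from the search; a fuller version would restrict candidates to context positions.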
+
+ ## Applications
+
+ This model is suited to:
+ - Legal document analysis systems
+ - Bidding law information retrieval systems
+ - Legal advisory chatbots
+ - Automated question-answering systems for bidding law
+ - Legal research and documentation tools
+
+ ## Limitations
+
+ - Domain Specificity: The model is trained specifically for Vietnamese bidding law and may not generalize well to other legal domains
+ - Language Constraint: Optimized for Vietnamese only
+ - Context Length: Maximum input length is 512 tokens
+ - Legal Disclaimer: Should be used as a reference tool, not as a replacement for professional legal advice
+
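Legal passages longer than the 512-token limit listed above would be cut off by truncation. One common workaround is to split the context into overlapping windows, run the model on each, and keep the highest-scoring answer. The sketch below shows only the windowing step on an already tokenized context; `sliding_windows`, `max_context_len`, and `stride` are illustrative names, and the window size has to leave room for the question and special tokens within the 512-token budget.

```python
def sliding_windows(context_tokens, max_context_len, stride):
    """Split a token list into overlapping windows.

    Consecutive windows share `stride` tokens, so an answer that straddles
    a boundary can still appear whole in one window.
    Returns a list of (offset, window_tokens) pairs.
    """
    if stride >= max_context_len:
        raise ValueError("stride must be smaller than max_context_len")
    windows = []
    start = 0
    while True:
        end = min(start + max_context_len, len(context_tokens))
        windows.append((start, context_tokens[start:end]))
        if end == len(context_tokens):
            break
        # Step back by `stride` tokens so the next window overlaps this one.
        start = end - stride
    return windows
```

Hugging Face tokenizers can produce similar windows directly through the `stride` and `return_overflowing_tokens=True` arguments, which is likely the more practical route in real use.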
125
+ ## Citation
126
+
127
+ ```bibtex
128
+ comming soon...
129
+ ```
130
+
131
+ ## Contact
132
+
133
+ For questions, feedback, or collaborations:
134
+ - Email: nguyentruongphuc_12421TN@utehy.edu.vn
135
+ - GitHub Issues: [@ntphuc149](https://github.com/ntphuc149)
136
+ - HuggingFace: [@ntphuc149](https://huggingface.co/ntphuc149)
137
+
138
+ ## License
139
+
140
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.