Create README.md
README.md
ADDED
# Model Card for BioGPT-FineTuned-MedicalTextbooks-FP16

# Model Overview
This model is a fine-tuned version of the microsoft/biogpt model, tailored for medical text understanding. It was fine-tuned on the dmedhi/medical-textbooks dataset from Hugging Face and subsequently converted to FP16 (half precision), which this card refers to as quantization, to reduce memory usage and improve inference speed with little expected loss of accuracy. The model is designed for tasks like keyword extraction from medical texts and generative tasks in the biomedical domain.

# Model Details
```
Base Model: microsoft/biogpt
Fine-Tuning Dataset: dmedhi/medical-textbooks (15,970 rows)
Quantization: FP16 (half-precision) using PyTorch's .half() method
Model Type: Causal Language Model
Language: English
```
# Intended Use
This model is intended for:

- Keyword Extraction: Extracting relevant lines containing specific keywords (e.g., "anatomy") from medical textbooks, along with metadata like book names.
- Generative Tasks: Generating short explanations or summaries in the biomedical domain (e.g., answering questions like "What is anatomy?").
- Research and Education: Assisting researchers, students, and educators in exploring medical texts and generating insights.
# Out of Scope
- Real-time clinical decision-making or medical diagnosis (not evaluated for such tasks).
- Non-English text processing (not tested on other languages).
- Tasks requiring high precision in generative output without human oversight.
# Training Details
# Dataset
The model was fine-tuned on the dmedhi/medical-textbooks dataset, which contains excerpts from medical textbooks with two attributes:

- **text:** The content of the excerpt.
- **book:** The name of the book (e.g., "Gray's Anatomy").
# Dataset Splits:
- Original split: train (15,970 rows).
- Custom splits: 80% train (12,776 rows), 20% validation (3,194 rows).
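
The card does not say how the 80/20 split was produced. As a minimal sketch, assuming a seeded random shuffle of row indices, the reported row counts can be reproduced as follows. The helper `split_indices`, the seed, and the shuffle itself are illustrative assumptions, not the author's procedure.

```python
import random


def split_indices(n_rows, val_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/validation.

    Hypothetical helper: the seed and the shuffle are assumptions for illustration.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = round(n_rows * val_fraction)
    # The first n_val shuffled indices form the validation set, the rest are training rows.
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(15_970)
print(len(train_idx), len(val_idx))
```

With the `datasets` library, `Dataset.train_test_split(test_size=0.2, seed=...)` is a common way to obtain a split with the same proportions.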
# Training Procedure
# Preprocessing:

- Tokenized the text field using the BioGPT tokenizer (microsoft/biogpt).
- Set max_length=512, with truncation and padding.
- Used input_ids as labels for causal language modeling.
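
The labeling step above can be sketched on plain Python lists. The function below copies `input_ids` into `labels`, as the card describes. The optional `mask_padding` flag is an addition for illustration only: it replaces padded positions with -100, the value PyTorch's cross-entropy loss ignores by default, so padding does not contribute to the loss. Nothing in the card indicates that padding was masked during the actual run, and `add_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def add_labels(batch, mask_padding=False):
    """Attach causal-LM labels to a tokenized batch (dict of lists).

    mask_padding=False mirrors the card: labels are a copy of input_ids.
    mask_padding=True is an assumed refinement that hides padded positions.
    """
    labels = []
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        if mask_padding:
            labels.append([tok if keep else IGNORE_INDEX for tok, keep in zip(ids, mask)])
        else:
            labels.append(list(ids))
    return {**batch, "labels": labels}


# Toy batch: three real tokens followed by two padding tokens (id 1 is arbitrary here).
batch = {"input_ids": [[5, 6, 7, 1, 1]], "attention_mask": [[1, 1, 1, 0, 0]]}
plain = add_labels(batch)
masked = add_labels(batch, mask_padding=True)
print(plain["labels"])
print(masked["labels"])
```

A function of this shape can be passed to `Dataset.map(..., batched=True)` after tokenization.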
# Fine-Tuning:
- Fine-tuned microsoft/biogpt using Hugging Face's Trainer API.
```
Training arguments:
Epochs: 1
Batch size: 4 per device
Learning rate: 2e-5
Mixed precision: FP16 (fp16=True)
Evaluation strategy: Steps (every 1000 steps)
Training loss decreased from 2.8409 to 2.7006 over 3,194 steps.
Validation loss decreased from 2.7317 to 2.6512.
```
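
The step count reported above follows from the training split size and the batch size. The arithmetic below is a sketch that assumes a single GPU and no gradient accumulation, neither of which is stated in the card; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(n_train_rows, per_device_batch_size, n_devices=1, grad_accum=1):
    """Optimizer steps in one epoch (assumes the last partial batch is kept)."""
    effective_batch = per_device_batch_size * n_devices * grad_accum
    return math.ceil(n_train_rows / effective_batch)


steps = steps_per_epoch(n_train_rows=12_776, per_device_batch_size=4)
# Steps at which an evaluation every 1000 steps would fall within the epoch.
eval_points = list(range(1000, steps + 1, 1000))
print(steps, eval_points)
```

Under these assumptions the epoch has 3,194 steps, matching the figure in the training log, with step-based evaluations at steps 1000, 2000, and 3000.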
# Quantization:
- Converted the fine-tuned model to FP16 using PyTorch's .half() method.
- Saved as ./biogpt_finetuned/final_model_fp16.
# Compute Infrastructure
- Hardware: 12 GB GPU (NVIDIA)
- Environment: Jupyter Notebook on Windows
- Framework: PyTorch, Hugging Face Transformers
- Training Time: Approximately 27 minutes for 1 epoch
# Evaluation
**Metrics**
```
Training Loss: Decreased from 2.8409 to 2.7006.
Validation Loss: Decreased from 2.7317 to 2.6512.
Memory Usage: Post-quantization memory usage reported as ~661 MB (FP16), though actual savings may vary due to buffers and non-weight tensors.
```
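
As a back-of-the-envelope check on the FP16 figure, weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes about 347 million parameters, the size commonly quoted for microsoft/biogpt and not a number taken from this card; `weight_memory_mib` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# Assumption: approximate parameter count of microsoft/biogpt, not stated in this card.
N_PARAMS = 347_000_000


def weight_memory_mib(n_params, dtype):
    """Memory taken by the weights alone, in MiB (1 MiB = 2**20 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**20


fp32_mib = weight_memory_mib(N_PARAMS, "fp32")
fp16_mib = weight_memory_mib(N_PARAMS, "fp16")
print(f"FP32: {fp32_mib:.0f} MiB, FP16: {fp16_mib:.0f} MiB")
```

If the assumed parameter count is accurate, the FP16 weights come to roughly 662 MiB, in line with the ~661 MB reported above and half the FP32 footprint. Activations, buffers, and framework overhead are not included in this estimate.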
# Qualitative Testing
**Generative Task:** Generated a response to "What is anatomy?" with reasonable output: "What is anatomy? Anatomy is the basis of medicine..."

**Keyword Extraction:** Successfully extracted up to 10 lines containing keywords (e.g., "anatomy") with corresponding book names (e.g., "Gray's Anatomy").

# Usage
**Installation**

Ensure you have the required libraries installed:

```
pip install transformers torch datasets sacremoses
```
# Loading the Model
Load the quantized FP16 model and tokenizer:
```
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "path/to/biogpt_finetuned/final_model_fp16"  # Update with your HF repo path

# Use the GPU when available; fall back to FP32 on CPU, where half-precision support is limited.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if device.type == "cuda" else torch.float32

# Request the dtype explicitly; otherwise the weights may be loaded as FP32.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.to(device)
model.eval()  # inference mode: disables dropout
```
# Example 1: Generative Inference
Generate text with the quantized model:
```
input_text = "What is anatomy?"
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}  # move tensors to the model's device
with torch.no_grad():  # gradients are not needed for inference
    outputs = model.generate(**inputs, max_length=50)  # max_length counts prompt plus generated tokens
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
```
# Example 2: Keyword Extraction
```
from datasets import load_from_disk

original_datasets = load_from_disk('path/to/original_medical_textbooks')

def extract_lines_with_keyword(keyword, dataset_split='train', max_results=10):
    """Return up to max_results lines containing keyword (case-insensitive), with book names."""
    dataset = original_datasets[dataset_split]
    matching_lines = []
    for entry in dataset:
        text = entry['text']
        book = entry['book']
        lines = text.split('\n')
        for line in lines:
            if keyword.lower() in line.lower():
                matching_lines.append({'text': line.strip(), 'book': book})
                # Stop as soon as enough matches have been collected.
                if len(matching_lines) >= max_results:
                    return matching_lines
    return matching_lines

keyword = "anatomy"
matching_lines = extract_lines_with_keyword(keyword)
for i, match in enumerate(matching_lines, 1):
    print(f"{i}. Text: {match['text']}")
    print(f"   Book: {match['book']}\n")
```
# Limitations
- Quantization Trade-offs: FP16 quantization may lead to minor accuracy degradation; this has not been extensively evaluated.
- Dataset Bias: Fine-tuned only on dmedhi/medical-textbooks, which may not cover all medical domains or topics.
- Generative Quality: Generative outputs may require human oversight for correctness.
- Scalability: Keyword extraction relies on string matching, not semantic understanding, limiting its ability to capture nuanced relationships.
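
The string-matching limitation is easy to see on a few sample lines. The sketch below contrasts the case-insensitive substring test used in the keyword extraction example with a whole-word regular expression. The sample lines and both helper names are made up for illustration and are not drawn from the dataset.

```python
import re


def substring_match(keyword, line):
    """Case-insensitive substring test, as in the keyword extraction example."""
    return keyword.lower() in line.lower()


def whole_word_match(keyword, line):
    """Stricter variant (illustrative): the keyword must appear as a whole word."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, line, flags=re.IGNORECASE) is not None


sample_lines = [
    "Surface anatomy of the thorax",    # exact word
    "Neuroanatomy of the brainstem",    # keyword embedded in a longer word
    "The standard anatomical position", # related form that contains no 'anatomy'
]
for line in sample_lines:
    print(line, substring_match("anatomy", line), whole_word_match("anatomy", line))
```

Substring matching picks up "Neuroanatomy" but neither approach finds "anatomical", which is the kind of related term that only stemming or a semantic method would catch.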