AventIQ-AI
/

BioGPT-MedText

Safetensors

biogpt


YashikaNagpal committed on Mar 6, 2025

Commit

3c4965f

verified ·

1 Parent(s): 1bdb91b

Create README.md


README.md +133 -0

README.md ADDED Viewed

	@@ -0,0 +1,133 @@

+# Model Card for BioGPT-FineTuned-MedicalTextbooks-FP16
+# Model Overview
+This model is a fine-tuned and quantized version of microsoft/biogpt, tailored for medical text understanding. It was fine-tuned on the dmedhi/medical-textbooks dataset from Hugging Face and then quantized to FP16 (half precision) to reduce memory usage and improve inference speed. The impact on accuracy is expected to be minor but has not been extensively evaluated (see Limitations). The model is designed for tasks such as keyword extraction from medical texts and generative tasks in the biomedical domain.
+# Model Details
+```
+Base Model: microsoft/biogpt
+Fine-Tuning Dataset: dmedhi/medical-textbooks (15,970 rows)
+Quantization: FP16 (half-precision) using PyTorch's .half() method
+Model Type: Causal Language Model
+Language: English
+```
+# Intended Use
+This model is intended for:
+- Keyword Extraction: Extracting relevant lines containing specific keywords (e.g., "anatomy") from medical textbooks, along with metadata like book names.
+- Generative Tasks: Generating short explanations or summaries in the biomedical domain (e.g., answering questions like "What is anatomy?").
+- Research and Education: Assisting researchers, students, and educators in exploring medical texts and generating insights.
+# Out of Scope
+- Real-time clinical decision-making or medical diagnosis (not evaluated for such tasks).
+- Non-English text processing (not tested on other languages).
+- Tasks requiring high precision in generative output without human oversight.
+# Training Details
+# Dataset
+The model was fine-tuned on the dmedhi/medical-textbooks dataset, which contains excerpts from medical textbooks with two attributes:
+- **text:** The content of the excerpt.
+- **book:** The name of the book (e.g., "Gray's Anatomy").
+# Dataset Splits:
+- Original split: train (15,970 rows).
+- Custom splits: 80% train (12,776 rows), 20% validation (3,194 rows).
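The split sizes above can be reproduced with a short sketch. The card does not say how the split was produced, so the shuffling approach and the `SEED` value below are assumptions for illustration only; the row count and the 80/20 ratio come from the card.

```python
import random

NUM_ROWS = 15_970   # rows in the original train split (from the card)
VAL_PERCENT = 20    # validation share in percent (from the card)
SEED = 42           # hypothetical: the card does not state a seed

# Shuffle row indices deterministically, then cut off the validation share.
indices = list(range(NUM_ROWS))
random.Random(SEED).shuffle(indices)

# Integer arithmetic avoids floating-point rounding in the split size.
num_val = NUM_ROWS * VAL_PERCENT // 100
val_indices = indices[:num_val]
train_indices = indices[num_val:]

print(len(train_indices), len(val_indices))  # 12776 3194
```

The resulting index lists could then be used to select rows from the dataset; any library split helper that yields the same sizes would serve equally well.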
+# Training Procedure
+# Preprocessing:
+- Tokenized the text field using the BioGPT tokenizer (microsoft/biogpt).
+- Set max_length=512, with truncation and padding.
+- Used input_ids as labels for causal language modeling.
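The preprocessing steps above can be illustrated without the real tokenizer. The sketch below operates on plain lists of token ids; `build_example`, the `pad_id` value, right-side padding, and the optional `ignore_pad_in_loss` switch are all hypothetical and not taken from the original training code. The card only states that `input_ids` were used as labels. Hugging Face causal LM heads shift the labels internally, which is why a plain copy is sufficient.

```python
def build_example(token_ids, max_length=512, pad_id=0, ignore_pad_in_loss=False):
    """Truncate and right-pad token ids, then derive labels from input_ids.

    pad_id=0 is a placeholder; the real value comes from the tokenizer.
    """
    ids = list(token_ids)[:max_length]            # truncation
    num_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * num_pad          # padding to a fixed length
    attention_mask = [1] * len(ids) + [0] * num_pad

    # As described in the card: labels are a copy of input_ids.
    labels = list(input_ids)
    if ignore_pad_in_loss:
        # Optional variant (an assumption, not stated in the card):
        # -100 is the index ignored by PyTorch's cross-entropy loss.
        labels = [tok if keep else -100 for tok, keep in zip(input_ids, attention_mask)]

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = build_example([5, 6, 7], max_length=6)
masked = build_example([5, 6, 7], max_length=6, ignore_pad_in_loss=True)
print(example["labels"])  # [5, 6, 7, 0, 0, 0]
print(masked["labels"])   # [5, 6, 7, -100, -100, -100]
```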
+# Fine-Tuning:
+- Fine-tuned microsoft/biogpt using Hugging Face's Trainer API.
+```
+Training arguments:
+Epochs: 1
+Batch size: 4 per device
+Learning rate: 2e-5
+Mixed precision: FP16 (fp16=True)
+Evaluation strategy: Steps (every 1000 steps)
+Training loss decreased from 2.8409 to 2.7006 over 3,194 steps.
+Validation loss decreased from 2.7317 to 2.6512.
+```
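The reported 3,194 training steps are consistent with the split size and batch size listed above. A quick check, assuming a single device and no gradient accumulation (neither is stated explicitly in the card):

```python
import math

train_rows = 12_776          # from the custom train split
per_device_batch_size = 4    # from the training arguments
epochs = 1                   # from the training arguments
eval_every = 1000            # evaluation interval in steps
num_devices = 1              # assumption: single GPU
grad_accum_steps = 1         # assumption: no gradient accumulation

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

# Steps at which an evaluation would be triggered during training.
eval_steps = list(range(eval_every, total_steps + 1, eval_every))

print(total_steps, eval_steps)  # 3194 [1000, 2000, 3000]
```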
+# Quantization:
+- Converted the fine-tuned model to FP16 using PyTorch's .half() method.
+- Saved as ./biogpt_finetuned/final_model_fp16.
+# Compute Infrastructure
+- Hardware: 12 GB GPU (NVIDIA)
+- Environment: Jupyter Notebook on Windows
+- Framework: PyTorch, Hugging Face Transformers
+- Training Time: Approximately 27 minutes for 1 epoch
+# Evaluation
+**Metrics**
+```
+Training Loss: Decreased from 2.8409 to 2.7006.
+Validation Loss: Decreased from 2.7317 to 2.6512.
+Memory Usage: Post-quantization memory usage reported as ~661 MB (FP16), though actual savings may vary due to buffers and non-weight tensors.
+```
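To make the memory figure concrete, the sketch below shows how `.half()` changes the storage of a module's parameters and buffers. It uses a small stand-in network rather than BioGPT, and `footprint_bytes` is a hypothetical helper, so the numbers it prints are not comparable to the ~661 MB reported above.

```python
import torch
from torch import nn


def footprint_bytes(module: nn.Module) -> int:
    """Sum the storage of all parameters and buffers, in bytes."""
    tensors = list(module.parameters()) + list(module.buffers())
    return sum(t.numel() * t.element_size() for t in tensors)


# Toy stand-in for the fine-tuned model (hypothetical; not BioGPT itself).
toy = nn.Sequential(nn.Linear(256, 512), nn.LayerNorm(512), nn.Linear(512, 256))

fp32_bytes = footprint_bytes(toy)
toy_fp16 = toy.half()  # casts floating-point parameters and buffers to FP16 in place
fp16_bytes = footprint_bytes(toy_fp16)

print(f"FP32: {fp32_bytes / 1e6:.2f} MB, FP16: {fp16_bytes / 1e6:.2f} MB")
```

In this toy case every tensor is floating point, so the footprint halves exactly. Integer buffers are not affected by `.half()`, which is one reason real savings can differ from a clean 50%.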
+# Qualitative Testing
+- **Generative Task:** Generated a response to "What is anatomy?" with reasonable output: "What is anatomy? Anatomy is the basis of medicine..."
+- **Keyword Extraction:** Successfully extracted up to 10 lines containing keywords (e.g., "anatomy") with corresponding book names (e.g., "Gray's Anatomy").
+# Usage
+**Installation**
+- Ensure you have the required libraries installed:
+```
+pip install transformers torch datasets sacremoses
+```
+# Loading the Model
+- Load the quantized FP16 model and tokenizer:
+```
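+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model_path = "path/to/biogpt_finetuned/final_model_fp16"  # Update with your HF repo path
+# Request FP16 explicitly: depending on the Transformers version, from_pretrained
+# may otherwise load the saved half-precision weights as float32.
+model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+# FP16 inference is intended for GPU; on CPU it may be slow or unsupported for some ops.
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
+model.eval()  # disable dropout for inference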
+```
+# Example 1: Generative Inference
+- Generate text with the quantized model:
+```
+input_text = "What is anatomy?"
+# Tokenize the prompt and move the tensors to the model's device.
+inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
+inputs = {k: v.to(device) for k, v in inputs.items()}
+with torch.no_grad():
+    # max_length caps the total sequence length, prompt tokens included.
+    outputs = model.generate(**inputs, max_length=50)
+output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(output_text)
+```
+# Example 2: Keyword Extraction
+```
+from datasets import load_from_disk
+
+# Load a local copy of the dataset (update the path to your own copy).
+original_datasets = load_from_disk('path/to/original_medical_textbooks')
+
+def extract_lines_with_keyword(keyword, dataset_split='train', max_results=10):
+    """Return up to max_results lines that contain keyword (case-insensitive)."""
+    dataset = original_datasets[dataset_split]
+    matching_lines = []
+    for entry in dataset:
+        text = entry['text']
+        book = entry['book']
+        # Check each line of the excerpt separately.
+        lines = text.split('\n')
+        for line in lines:
+            if keyword.lower() in line.lower():
+                matching_lines.append({'text': line.strip(), 'book': book})
+                # Stop as soon as enough matches have been collected.
+                if len(matching_lines) >= max_results:
+                    return matching_lines
+    return matching_lines
+
+keyword = "anatomy"
+matching_lines = extract_lines_with_keyword(keyword)
+for i, match in enumerate(matching_lines, 1):
+    print(f"{i}. Text: {match['text']}")
+    print(f"   Book: {match['book']}\n")
+```
+# Limitations
+- Quantization Trade-offs: FP16 quantization may cause minor accuracy degradation; this has not been extensively evaluated.
+- Dataset Bias: Fine-tuned only on dmedhi/medical-textbooks, which may not cover all medical domains or topics.
+- Generative Quality: Generative outputs may require human oversight for correctness.
+- Keyword Extraction: Relies on string matching, not semantic understanding, which limits its ability to capture nuanced relationships.
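The string-matching limitation can be seen on a small example. The records below are made up to mirror the dataset's `text` and `book` fields, and `match_lines` with its `whole_word` option is a hypothetical variant, not part of the original code. It shows that plain substring matching also picks up longer words that merely contain the keyword, while a word-boundary pattern is stricter but still purely lexical.

```python
import re

# Hypothetical in-memory records mirroring the dataset's `text` and `book` fields.
records = [
    {"text": "Anatomy is the study of structure.\nPhysiology studies function.", "book": "Book A"},
    {"text": "Neuroanatomy covers the nervous system.", "book": "Book B"},
]


def match_lines(keyword, rows, whole_word=False):
    """Return lines containing `keyword`, case-insensitively.

    With whole_word=True, only standalone occurrences of the keyword match.
    """
    escaped = re.escape(keyword)
    pattern = re.compile(rf"\b{escaped}\b" if whole_word else escaped, re.IGNORECASE)
    hits = []
    for row in rows:
        for line in row["text"].split("\n"):
            if pattern.search(line):
                hits.append({"text": line.strip(), "book": row["book"]})
    return hits


substring_hits = match_lines("anatomy", records)
whole_word_hits = match_lines("anatomy", records, whole_word=True)
print(len(substring_hits), len(whole_word_hits))  # 2 1
```

Neither variant recognizes related terms that do not share the literal keyword, which is the gap a semantic approach would need to close.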