| """ | |
| Breast Cancer Histopathology Classification using Path Foundation Model | |
| This module implements a comprehensive deep learning pipeline for breast cancer classification | |
| from histopathology images using Google's Path Foundation model as a feature extractor. The | |
| system supports multiple datasets including BreakHis, PatchCamelyon (PCam), and BACH, employing | |
| transfer learning to achieve high classification accuracy. | |
| **Overview:** | |
| This system leverages Google's Path Foundation model, which is pre-trained on a large corpus | |
| of pathology images, to extract meaningful features from breast cancer histopathology images. | |
| The approach uses transfer learning where the foundation model serves as a frozen feature | |
| extractor, followed by a trainable classification head for binary classification (benign vs malignant). | |
| **Model Architecture:** | |
| - Foundation Model: Google's Path Foundation (pre-trained on pathology images) | |
| - Transfer Learning Approach: Feature extraction with frozen foundation model + trainable classifier head | |
| - Classification Head: Multi-layer dense network with regularisation and dropout | |
| - Optimisation: AdamW optimiser with learning rate scheduling and early stopping | |
| **Workflow:** | |
| 1. Authentication & Model Loading: Authenticate with Hugging Face and load Path Foundation | |
| 2. Data Loading: Load and preprocess histopathology datasets | |
| 3. Feature Extraction: Extract embeddings using frozen foundation model | |
| 4. Classifier Training: Train dense neural network on extracted features | |
| 5. Evaluation: Comprehensive performance analysis with multiple metrics and visualisations | |
| **Supported Datasets:** | |
| - BreakHis: Breast cancer histopathology images at multiple magnifications | |
| - PatchCamelyon (PCam): Lymph node metastasis detection patches | |
| - BACH: ICIAR 2018 Breast Cancer Histology Challenge dataset | |
| - Combined: Pooled combination of all three datasets for more robust training | |
| **Key Features:** | |
| - Multiple dataset support with consistent pre-processing | |
| - Robust error handling and fallback mechanisms | |
| - Comprehensive evaluation metrics and visualisation | |
| - Memory-efficient batch processing | |
| - Data augmentation capabilities | |
| - Model persistence and deployment support | |
| Author: Research Team | |
| Date: 2024 | |
| License: MIT | |
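| Example (a minimal usage sketch following the Workflow steps above; it assumes the datasets | |
| described below are already downloaded locally): | |
| >>> from model2 import BreastCancerClassifier, load_combined_data | |
| >>> clf = BreastCancerClassifier(fine_tune=False) | |
| >>> clf.authenticate_huggingface() and clf.load_path_foundation() | |
| >>> data = load_combined_data("breakhis", max_samples=2000) | |
| >>> X_train, y_train = data['train'] | |
| >>> X_val, y_val = data['valid'] | |
| >>> emb_train = clf.extract_embeddings(X_train) | |
| >>> emb_val = clf.extract_embeddings(X_val) | |
| >>> clf.build_classifier() | |
| >>> history = clf.train_model(emb_train, y_train, emb_val, y_val, epochs=20) | |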
| """ | |
| # Import required libraries and configure environment | |
| import os | |
| import tensorflow as tf | |
| import numpy as np | |
| from PIL import Image | |
| from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score | |
| from pathlib import Path | |
| import h5py | |
| from sklearn.model_selection import train_test_split | |
| from sklearn.utils.class_weight import compute_class_weight | |
| from tensorflow.keras import regularizers | |
| import matplotlib | |
| # Use a non-interactive backend to prevent blocking on plt.show() | |
| matplotlib.use('Agg') | |
| import matplotlib.pyplot as plt | |
| import seaborn as sns | |
| # Suppress TensorFlow logging for cleaner output | |
| os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' | |
| # Configure TensorFlow logging for cleaner output | |
| try: | |
| tf.get_logger().setLevel('ERROR') | |
| except AttributeError: | |
| import logging | |
| logging.getLogger('tensorflow').setLevel(logging.ERROR) | |
| # Configure Hugging Face Hub integration with fallback mechanisms | |
| # This section handles the loading of Google's Path Foundation model from Hugging Face Hub | |
| # with multiple fallback methods to ensure compatibility across different environments | |
| try: | |
| from huggingface_hub import login, hf_hub_download, snapshot_download | |
| # Try different methods for loading Keras models from HF Hub | |
| # Method 1: Direct Keras loading (preferred) | |
| try: | |
| from huggingface_hub import from_pretrained_keras | |
| KERAS_METHOD = "from_pretrained_keras" | |
| except ImportError: | |
| # Method 2: Transformers library fallback | |
| try: | |
| from transformers import TFAutoModel | |
| KERAS_METHOD = "transformers" | |
| except ImportError: | |
| # Method 3: Manual download and TFSMLayer | |
| KERAS_METHOD = "manual" | |
| HF_AVAILABLE = True | |
| print(f"Hugging Face Hub loaded successfully (method: {KERAS_METHOD})") | |
| except ImportError as e: | |
| print(f"Hugging Face Hub unavailable: {e}") | |
| print("Please install required packages: pip install huggingface_hub transformers") | |
| HF_AVAILABLE = False | |
| KERAS_METHOD = None | |
| class BreastCancerClassifier: | |
| """ | |
| A comprehensive breast cancer classification system using Path Foundation model. | |
| This class implements a transfer learning approach where Google's Path Foundation | |
| model serves as a feature extractor, followed by a trainable classification head. | |
| The system supports both feature extraction (frozen foundation model) and | |
| fine-tuning approaches for maximum flexibility. | |
| The classifier can work with multiple histopathology datasets and provides | |
| comprehensive evaluation capabilities including confusion matrices, classification | |
| reports, and performance metrics. | |
| Attributes: | |
| fine_tune (bool): Whether to fine-tune the foundation model or use it frozen | |
| model (tf.keras.Model): The complete classification model | |
| path_foundation: The loaded Path Foundation model from Hugging Face Hub | |
| history: Training history from model.fit() containing loss and accuracy curves | |
| embedding_dim (int): Dimensionality of extracted embeddings from foundation model | |
| num_classes (int): Number of output classes (default: 2 for binary classification) | |
| Example: | |
| >>> classifier = BreastCancerClassifier(fine_tune=False) | |
| >>> classifier.authenticate_huggingface() | |
| >>> classifier.load_path_foundation() | |
| >>> # Load data and train... | |
| """ | |
| def __init__(self, fine_tune=False): | |
| """ | |
| Initialise the breast cancer classifier. | |
| Args: | |
| fine_tune (bool): If True, allows fine-tuning of foundation model. | |
| If False, uses foundation model as frozen feature extractor. | |
| Note: Fine-tuning requires more computational resources and | |
| may lead to overfitting on smaller datasets. Feature extraction | |
| (fine_tune=False) is recommended for most use-cases. | |
| """ | |
| self.fine_tune = fine_tune | |
| self.model = None | |
| self.path_foundation = None | |
| self.history = None | |
| self.embedding_dim = None | |
| self.num_classes = 2 # Binary classification: benign vs malignant | |
| def authenticate_huggingface(self, token=None): | |
| """ | |
| Authenticate with Hugging Face Hub to access Path Foundation model. | |
| This method handles authentication with Hugging Face Hub, which is required | |
| to download and use Google's Path Foundation model. It supports multiple | |
| token sources and provides fallback mechanisms. | |
| Args: | |
| token (str, optional): Hugging Face access token. If None, the method | |
| will attempt to use environment variables: | |
| - HF_TOKEN | |
| - HUGGINGFACE_HUB_TOKEN | |
| Returns: | |
| bool: True if authentication successful, False otherwise | |
| Note: | |
| You can obtain a Hugging Face token by: | |
| 1. Creating an account at https://huggingface.co | |
| 2. Going to Settings > Access Tokens | |
| 3. Creating a new token with read permissions | |
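| 4. Passing the token to this method, or exporting it beforehand (for example | |
| `export HF_TOKEN=hf_xxxxxxxxxxxx`), since HF_TOKEN and HUGGINGFACE_HUB_TOKEN are read automatically | |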
| Example: | |
| >>> classifier = BreastCancerClassifier() | |
| >>> success = classifier.authenticate_huggingface("hf_xxxxxxxxxxxx") | |
| >>> if success: | |
| ... print("Authentication successful") | |
| """ | |
| if not HF_AVAILABLE: | |
| print("Cannot authenticate - Hugging Face Hub not available") | |
| return False | |
| try: | |
| # Try multiple token sources: parameter, environment variables | |
| final_token = token or os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN") | |
| if final_token: | |
| login(token=final_token, add_to_git_credential=False) | |
| print("Hugging Face authentication successful") | |
| return True | |
| else: | |
| # No explicit token: rely on a previously cached `huggingface-cli login`; | |
| # model download will fail later if no cached credentials exist. | |
| print("No token provided, attempting to use cached login") | |
| return True | |
| except Exception as e: | |
| print(f"Authentication failed: {e}") | |
| return False | |
| def load_path_foundation(self): | |
| """ | |
| Load Google's Path Foundation model with multiple fallback mechanisms. | |
| This method attempts to load the Path Foundation model using three different | |
| approaches to ensure maximum compatibility across different environments: | |
| 1. Direct Keras loading via huggingface_hub (preferred) | |
| 2. Transformers library loading (fallback) | |
| 3. Manual download and TFSMLayer loading (last resort) | |
| The method also configures the model's training behavior based on the | |
| fine_tune parameter set during initialization. | |
| Returns: | |
| bool: True if model loaded successfully, False otherwise | |
| Raises: | |
| Various exceptions may be raised during the loading process, but they | |
| are caught and handled gracefully with informative error messages. | |
| Note: | |
| The Path Foundation model is a large pre-trained model (~1GB) that will | |
| be downloaded on first use. Subsequent runs will use the cached version. | |
| Example: | |
| >>> classifier = BreastCancerClassifier(fine_tune=False) | |
| >>> if classifier.load_path_foundation(): | |
| ... print("Model loaded successfully") | |
| ... else: | |
| ... print("Failed to load model") | |
| """ | |
| if not HF_AVAILABLE: | |
| print("Cannot load model - Hugging Face Hub unavailable") | |
| return False | |
| try: | |
| print("Loading Path Foundation model...") | |
| loaded = False | |
| # Method 1: Direct Keras loading (preferred method) | |
| if KERAS_METHOD == "from_pretrained_keras": | |
| try: | |
| self.path_foundation = from_pretrained_keras("google/path-foundation") | |
| loaded = True | |
| print("Successfully loaded via from_pretrained_keras") | |
| except Exception as e: | |
| print(f"Keras loading failed: {e}") | |
| # Method 2: Transformers library fallback | |
| if not loaded and KERAS_METHOD == "transformers": | |
| try: | |
| print("Attempting transformers fallback...") | |
| self.path_foundation = TFAutoModel.from_pretrained("google/path-foundation") | |
| loaded = True | |
| print("Successfully loaded via transformers") | |
| except Exception as e: | |
| print(f"Transformers loading failed: {e}") | |
| # Method 3: Manual download and TFSMLayer (last resort) | |
| if not loaded: | |
| try: | |
| try: | |
| import keras as _standalone_keras | |
| except ImportError as _e: | |
| print(f"Keras 3 not installed: {_e}") | |
| return False | |
| print("Attempting manual download and TFSMLayer loading...") | |
| local_dir = snapshot_download(repo_id="google/path-foundation") | |
| self.path_foundation = _standalone_keras.layers.TFSMLayer( | |
| local_dir, call_endpoint="serving_default" | |
| ) | |
| loaded = True | |
| print("Successfully loaded via TFSMLayer") | |
| except Exception as e: | |
| print(f"TFSMLayer loading failed: {e}") | |
| return False | |
| # Configure training behavior based on fine_tune setting | |
| if self.fine_tune: | |
| self.path_foundation.trainable = True | |
| try: | |
| # Only fine-tune the last 3 layers for stability | |
| for layer in self.path_foundation.layers[:-3]: | |
| layer.trainable = False | |
| print("Fine-tuning enabled: last 3 layers trainable") | |
| except Exception: | |
| print("Fine-tuning enabled: full model trainable") | |
| else: | |
| self.path_foundation.trainable = False | |
| print("Model frozen for feature extraction") | |
| return True | |
| except Exception as e: | |
| print(f"Failed to load Path Foundation model: {e}") | |
| return False | |
| def preprocess_image_batch(self, images): | |
| """ | |
| Pre-process a batch of images for Path Foundation model input. | |
| This method handles multiple input formats and ensures all images are properly | |
| formatted for the Path Foundation model. It performs the following operations: | |
| - Resizes all images to 224x224 pixels (required by Path Foundation) | |
| - Converts images to RGB format | |
| - Normalises pixel values to [0, 1] range | |
| - Handles both file paths and numpy arrays | |
| Args: | |
| images: List or array of images in various formats: | |
| - File paths (strings) pointing to image files | |
| - PIL Images | |
| - NumPy arrays (various shapes and value ranges) | |
| Returns: | |
| np.ndarray: Preprocessed batch of shape (batch_size, 224, 224, 3) | |
| with pixel values normalized to [0, 1] range | |
| Note: | |
| The method automatically handles different input formats and value ranges. | |
| Images are resized using PIL's resize method with default interpolation. | |
| Example: | |
| >>> # Process file paths | |
| >>> image_paths = ['image1.jpg', 'image2.png'] | |
| >>> processed = classifier.preprocess_image_batch(image_paths) | |
| >>> print(processed.shape) # (2, 224, 224, 3) | |
| >>> # Process numpy arrays | |
| >>> image_arrays = [np.random.rand(100, 100, 3) for _ in range(5)] | |
| >>> processed = classifier.preprocess_image_batch(image_arrays) | |
| >>> print(processed.shape) # (5, 224, 224, 3) | |
| """ | |
| processed = [] | |
| for img in images: | |
| if isinstance(img, str): | |
| # Handle file paths | |
| img = Image.open(img).convert('RGB') | |
| img = img.resize((224, 224)) | |
| img_array = np.array(img) / 255.0 | |
| elif isinstance(img, Image.Image): | |
| # Handle PIL Images (listed as a supported input in the docstring) | |
| img = img.convert('RGB').resize((224, 224)) | |
| img_array = np.array(img) / 255.0 | |
| else: | |
| # Handle numpy arrays | |
| if img.shape[:2] != (224, 224): | |
| # Resize if necessary | |
| if img.max() <= 1: | |
| img_pil = Image.fromarray((img * 255).astype('uint8')) | |
| else: | |
| img_pil = Image.fromarray(img.astype('uint8')) | |
| img_pil = img_pil.resize((224, 224)) | |
| img_array = np.array(img_pil) / 255.0 | |
| else: | |
| img_array = img.astype('float32') | |
| if img_array.max() > 1: | |
| img_array = img_array / 255.0 | |
| processed.append(img_array) | |
| return np.array(processed) | |
| def extract_embeddings(self, images, batch_size=16): | |
| """ | |
| Extract feature embeddings from images using Path Foundation model. | |
| This method processes images in batches to extract high-level feature representations | |
| using the pre-trained Path Foundation model. The embeddings capture semantic information | |
| about the histopathology images that can be used for classification. | |
| The method handles different model interface types and provides progress tracking | |
| for large datasets. It automatically determines the embedding dimension on first use. | |
| Args: | |
| images: Array of preprocessed images or list of image paths | |
| batch_size (int): Number of images to process per batch. Smaller batches | |
| use less memory but may be slower. Default: 16 | |
| Returns: | |
| np.ndarray: Extracted embeddings of shape (num_images, embedding_dim) | |
| where embedding_dim is determined by the Path Foundation model | |
| Raises: | |
| ValueError: If no embeddings are successfully extracted | |
| RuntimeError: If the Path Foundation model is not loaded | |
| Note: | |
| The embedding dimension is automatically determined from the first successful | |
| batch and stored in self.embedding_dim for use in classifier construction. | |
| Example: | |
| >>> # Extract embeddings from a dataset | |
| >>> embeddings = classifier.extract_embeddings(images, batch_size=32) | |
| >>> print(f"Extracted {embeddings.shape[0]} embeddings of dimension {embeddings.shape[1]}") | |
| >>> # Process with smaller batch size for memory-constrained environments | |
| >>> embeddings = classifier.extract_embeddings(images, batch_size=8) | |
| """ | |
| print(f"Extracting embeddings from {len(images)} images...") | |
| embeddings = [] | |
| num_batches = (len(images) + batch_size - 1) // batch_size | |
| for i in range(0, len(images), batch_size): | |
| batch_num = i // batch_size + 1 # Defined up front so the except block can report failures for this batch | |
| batch = images[i:i + batch_size] | |
| processed_batch = self.preprocess_image_batch(batch) | |
| try: | |
| # Handle different model interface types | |
| if hasattr(self.path_foundation, 'signatures') and "serving_default" in self.path_foundation.signatures: | |
| # TensorFlow SavedModel format | |
| infer = self.path_foundation.signatures["serving_default"] | |
| batch_embeddings = infer(tf.constant(processed_batch)) | |
| elif hasattr(self.path_foundation, 'predict'): | |
| # Standard Keras model | |
| batch_embeddings = self.path_foundation.predict(processed_batch, verbose=0) | |
| else: | |
| # Direct callable | |
| batch_embeddings = self.path_foundation(processed_batch) | |
| # Handle different output formats | |
| if isinstance(batch_embeddings, dict): | |
| key = list(batch_embeddings.keys())[0] | |
| if hasattr(batch_embeddings[key], 'numpy'): | |
| batch_embeddings = batch_embeddings[key].numpy() | |
| else: | |
| batch_embeddings = batch_embeddings[key] | |
| elif hasattr(batch_embeddings, 'numpy'): | |
| batch_embeddings = batch_embeddings.numpy() | |
| embeddings.append(batch_embeddings) | |
| # Progress reporting | |
| if batch_num % 10 == 0: | |
| print(f"Processed batch {batch_num}/{num_batches}") | |
| except Exception as e: | |
| print(f"Error processing batch {batch_num}: {e}") | |
| continue | |
| if not embeddings: | |
| raise ValueError("No embeddings extracted successfully") | |
| final_embeddings = np.vstack(embeddings) | |
| # Set embedding dimension for classifier head | |
| if self.embedding_dim is None: | |
| self.embedding_dim = final_embeddings.shape[1] | |
| print(f"Embedding dimension: {self.embedding_dim}") | |
| print(f"Final embeddings shape: {final_embeddings.shape}") | |
| return final_embeddings | |
| def build_classifier(self): | |
| """ | |
| Build the classification head architecture. | |
| This method constructs the neural network architecture for breast cancer classification. | |
| It creates different architectures based on the fine_tune setting: | |
| 1. End-to-end model (fine_tune=True): Input -> Path Foundation -> Classifier -> Output | |
| 2. Feature-based model (fine_tune=False): Embeddings -> Classifier -> Output | |
| The architecture includes: | |
| - Progressive dimensionality reduction (768 -> 384 -> 192 -> 2) | |
| - L2 regularisation for weight decay and overfitting prevention | |
| - Batch normalisation for training stability and faster convergence | |
| - Dropout layers for regularization | |
| - AdamW optimizer with appropriate learning rates | |
| Returns: | |
| None: The model is stored in self.model and compiled | |
| Raises: | |
| ValueError: If embedding dimension is not set (run extract_embeddings first) | |
| Note: | |
| The method automatically selects appropriate learning rates: | |
| - Lower learning rate (1e-5) for fine-tuning to preserve pre-trained features | |
| - Higher learning rate (0.001) for training from scratch on embeddings | |
| Architecture Details: | |
| - Input: Either raw images (224x224x3) or embeddings (embedding_dim,) | |
| - Hidden layers: 768 -> 384 -> 192 neurons with ReLU activation | |
| - Output: 2 neurons with softmax activation (benign/malignant) | |
| - Regularisation: L2 weight decay (1e-4), Dropout (0.5, 0.3, 0.2) | |
| - Normalisation: Batch normalisation after each dense layer | |
| Example: | |
| >>> classifier = BreastCancerClassifier(fine_tune=False) | |
| >>> classifier.load_path_foundation() | |
| >>> embeddings = classifier.extract_embeddings(images) | |
| >>> classifier.build_classifier() | |
| >>> print(f"Model has {classifier.model.count_params():,} parameters") | |
| """ | |
| if self.embedding_dim is None: | |
| raise ValueError("Embedding dimension not set - run extract_embeddings first") | |
| if self.fine_tune: | |
| # End-to-end fine-tuning architecture | |
| inputs = tf.keras.Input(shape=(224, 224, 3)) | |
| x = self.path_foundation(inputs) | |
| # Classification head with regularization | |
| x = tf.keras.layers.Dense(768, activation='relu', | |
| kernel_regularizer=regularizers.l2(1e-4))(x) | |
| x = tf.keras.layers.BatchNormalization()(x) | |
| x = tf.keras.layers.Dropout(0.5)(x) | |
| x = tf.keras.layers.Dense(384, activation='relu', | |
| kernel_regularizer=regularizers.l2(1e-4))(x) | |
| x = tf.keras.layers.BatchNormalization()(x) | |
| x = tf.keras.layers.Dropout(0.3)(x) | |
| x = tf.keras.layers.Dense(192, activation='relu', | |
| kernel_regularizer=regularizers.l2(1e-4))(x) | |
| x = tf.keras.layers.Dropout(0.2)(x) | |
| outputs = tf.keras.layers.Dense(self.num_classes, activation='softmax')(x) | |
| self.model = tf.keras.Model(inputs, outputs) | |
| # Lower learning rate for fine-tuning to preserve pre-trained features | |
| optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-5, weight_decay=1e-5) | |
| else: | |
| # Feature extraction architecture (recommended approach) | |
| self.model = tf.keras.Sequential([ | |
| tf.keras.layers.Input(shape=(self.embedding_dim,)), | |
| # First dense block | |
| tf.keras.layers.Dense(768, activation='relu', | |
| kernel_regularizer=regularizers.l2(1e-4)), | |
| tf.keras.layers.BatchNormalization(), | |
| tf.keras.layers.Dropout(0.5), | |
| # Second dense block | |
| tf.keras.layers.Dense(384, activation='relu', | |
| kernel_regularizer=regularizers.l2(1e-4)), | |
| tf.keras.layers.BatchNormalization(), | |
| tf.keras.layers.Dropout(0.3), | |
| # Third dense block | |
| tf.keras.layers.Dense(192, activation='relu', | |
| kernel_regularizer=regularizers.l2(1e-4)), | |
| tf.keras.layers.Dropout(0.2), | |
| # Output layer | |
| tf.keras.layers.Dense(self.num_classes, activation='softmax') | |
| ]) | |
| # Higher learning rate for training from scratch | |
| optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=1e-5) | |
| # Compile model with sparse categorical crossentropy for integer labels | |
| self.model.compile( | |
| optimizer=optimizer, | |
| loss=tf.keras.losses.SparseCategoricalCrossentropy(), | |
| metrics=['accuracy'] | |
| ) | |
| print(f"Model architecture built - Fine-tuning: {self.fine_tune}") | |
| print(f"Total parameters: {self.model.count_params():,}") | |
| def train_model(self, X_train, y_train, X_val, y_val, epochs=50): | |
| """ | |
| Train the classification model with advanced techniques and callbacks. | |
| This method implements a comprehensive training pipeline with: | |
| - Class balancing to handle imbalanced datasets | |
| - Early stopping to prevent overfitting | |
| - Learning rate reduction on plateau | |
| - Model checkpointing to save best weights | |
| - Adaptive batch sizing based on training mode | |
| Args: | |
| X_train: Training features (embeddings or images) | |
| y_train: Training labels (0 for benign, 1 for malignant) | |
| X_val: Validation features | |
| y_val: Validation labels | |
| epochs (int): Maximum number of training epochs. Default: 50 | |
| Returns: | |
| tf.keras.callbacks.History: Training history containing loss and accuracy curves | |
| Note: | |
| The method automatically handles class imbalance by computing balanced weights. | |
| Training uses different batch sizes: 32 for fine-tuning, 64 for feature extraction. | |
| Callbacks Used: | |
| - EarlyStopping: Stops training if validation accuracy doesn't improve for 10 epochs | |
| - ReduceLROnPlateau: Reduces learning rate by 50% if validation loss plateaus | |
| - ModelCheckpoint: Saves the best model based on validation accuracy | |
| Example: | |
| >>> # Train the model | |
| >>> history = classifier.train_model(X_train, y_train, X_val, y_val, epochs=30) | |
| >>> | |
| >>> # Access training metrics | |
| >>> print(f"Final training accuracy: {history.history['accuracy'][-1]:.4f}") | |
| >>> print(f"Final validation accuracy: {history.history['val_accuracy'][-1]:.4f}") | |
| """ | |
| # Compute class weights to handle imbalanced datasets | |
| try: | |
| classes = np.unique(y_train) | |
| weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train) | |
| class_weight = {int(c): float(w) for c, w in zip(classes, weights)} | |
| print(f"Class weights computed: {class_weight}") | |
| except Exception: | |
| class_weight = None | |
| print("Using uniform class weights") | |
| # Define training callbacks for robust training | |
| callbacks = [ | |
| tf.keras.callbacks.EarlyStopping( | |
| monitor='val_accuracy', | |
| patience=10, | |
| restore_best_weights=True, | |
| verbose=1 | |
| ), | |
| tf.keras.callbacks.ReduceLROnPlateau( | |
| monitor='val_loss', | |
| factor=0.5, | |
| patience=5, | |
| min_lr=1e-7, | |
| verbose=1 | |
| ), | |
| tf.keras.callbacks.ModelCheckpoint( | |
| 'best_model.keras', | |
| monitor='val_accuracy', | |
| save_best_only=True, | |
| verbose=0 | |
| ) | |
| ] | |
| print("Starting model training...") | |
| print(f"Training samples: {len(X_train)}, Validation samples: {len(X_val)}") | |
| # Adaptive batch sizing based on training mode | |
| batch_size = 32 if self.fine_tune else 64 | |
| print(f"Using batch size: {batch_size}") | |
| # Train the model | |
| self.history = self.model.fit( | |
| X_train, y_train, | |
| validation_data=(X_val, y_val), | |
| epochs=epochs, | |
| batch_size=batch_size, | |
| callbacks=callbacks, | |
| verbose=1, | |
| class_weight=class_weight | |
| ) | |
| print("Training completed successfully!") | |
| return self.history | |
| def evaluate_model(self, X_test, y_test): | |
| """ | |
| Comprehensive model evaluation with multiple performance metrics and visualisations. | |
| This method provides a thorough evaluation of the trained model including: | |
| - Accuracy, Precision, Recall, and F1-score calculations | |
| - Detailed classification report with per-class metrics | |
| - Confusion matrix visualisation and analysis | |
| - Model predictions and probabilities for further analysis | |
| Args: | |
| X_test: Test features (embeddings or images) | |
| y_test: True test labels (0 for benign, 1 for malignant) | |
| Returns: | |
| dict: Dictionary containing comprehensive evaluation results: | |
| - 'accuracy': Overall accuracy score | |
| - 'precision': Weighted average precision | |
| - 'recall': Weighted average recall | |
| - 'f1': Weighted average F1-score | |
| - 'predictions': Predicted class labels | |
| - 'probabilities': Prediction probabilities for each class | |
| - 'confusion_matrix': 2x2 confusion matrix | |
| Note: | |
| The method generates and saves a confusion matrix plot as 'confusion_matrix.png' | |
| and displays it using matplotlib. The plot uses a blue color scheme for clarity. | |
| Metrics Explanation: | |
| - Accuracy: Overall correctness of predictions | |
| - Precision: True positives / (True positives + False positives) | |
| - Recall: True positives / (True positives + False negatives) | |
| - F1-score: Harmonic mean of precision and recall | |
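| For example (illustrative numbers), with TP=40, FP=5, FN=10, TN=45: accuracy = 85/100 = 0.85, | |
| precision = 40/45 ≈ 0.889, recall = 40/50 = 0.800, and F1 = 2(0.889)(0.800)/(0.889+0.800) ≈ 0.842. | |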
| Example: | |
| >>> # Evaluate the trained model | |
| >>> results = classifier.evaluate_model(X_test, y_test) | |
| >>> | |
| >>> # Access specific metrics | |
| >>> print(f"Test Accuracy: {results['accuracy']:.4f}") | |
| >>> print(f"F1-Score: {results['f1']:.4f}") | |
| >>> | |
| >>> # Analyze predictions | |
| >>> predictions = results['predictions'] | |
| >>> probabilities = results['probabilities'] | |
| """ | |
| print("Evaluating model performance...") | |
| # Generate predictions and probabilities | |
| y_pred_proba = self.model.predict(X_test) | |
| y_pred = np.argmax(y_pred_proba, axis=1) | |
| # Calculate comprehensive metrics | |
| accuracy = accuracy_score(y_test, y_pred) | |
| precision = precision_score(y_test, y_pred, average='weighted') | |
| recall = recall_score(y_test, y_pred, average='weighted') | |
| f1 = f1_score(y_test, y_pred, average='weighted') | |
| # Display results | |
| print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)") | |
| print(f"Precision: {precision:.4f}") | |
| print(f"Recall: {recall:.4f}") | |
| print(f"F1-Score: {f1:.4f}") | |
| # Detailed classification report | |
| class_names = ['Benign', 'Malignant'] | |
| print("\nDetailed Classification Report:") | |
| print(classification_report(y_test, y_pred, target_names=class_names)) | |
| # Generate and display confusion matrix | |
| cm = confusion_matrix(y_test, y_pred) | |
| # Create confusion matrix visualization | |
| plt.figure(figsize=(8, 6)) | |
| sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', | |
| xticklabels=class_names, yticklabels=class_names) | |
| plt.title('Confusion Matrix - Breast Cancer Classification') | |
| plt.xlabel('Predicted Label') | |
| plt.ylabel('True Label') | |
| plt.tight_layout() | |
| plt.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight') | |
| # Close the figure to free resources and avoid blocking | |
| plt.close() | |
| # Print confusion matrix in text format | |
| print("\nConfusion Matrix:") | |
| print(f" Predicted") | |
| print(f" {class_names[0]:>8} {class_names[1]:>8}") | |
| print(f"Actual {class_names[0]:>6} {cm[0,0]:>8} {cm[0,1]:>8}") | |
| print(f" {class_names[1]:>6} {cm[1,0]:>8} {cm[1,1]:>8}") | |
| return { | |
| 'accuracy': accuracy, | |
| 'precision': precision, | |
| 'recall': recall, | |
| 'f1': f1, | |
| 'predictions': y_pred, | |
| 'probabilities': y_pred_proba, | |
| 'confusion_matrix': cm | |
| } | |
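| # Illustrative helper (a minimal sketch, not called anywhere in this module): the History object | |
| # returned by BreastCancerClassifier.train_model() can be visualised with the matplotlib imports | |
| # above. The filename 'training_history.png' is an arbitrary choice. | |
| def plot_training_history(history, output_path='training_history.png'): | |
|     """Save training/validation loss and accuracy curves from a Keras History object.""" | |
|     fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4)) | |
|     ax_loss.plot(history.history['loss'], label='train') | |
|     ax_loss.plot(history.history['val_loss'], label='validation') | |
|     ax_loss.set_title('Loss') | |
|     ax_loss.set_xlabel('Epoch') | |
|     ax_loss.legend() | |
|     ax_acc.plot(history.history['accuracy'], label='train') | |
|     ax_acc.plot(history.history['val_accuracy'], label='validation') | |
|     ax_acc.set_title('Accuracy') | |
|     ax_acc.set_xlabel('Epoch') | |
|     ax_acc.legend() | |
|     fig.tight_layout() | |
|     fig.savefig(output_path, dpi=300, bbox_inches='tight') | |
|     plt.close(fig) | |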
| def load_breakhis_data(data_dir="datasets/breakhis/histology_slides/breast", max_samples_per_class=2000, magnification="40X"): | |
| """ | |
| Load and preprocess the BreakHis breast cancer histopathology dataset. | |
| The BreakHis dataset contains microscopic images of breast tumor tissue | |
| collected from clinical studies. Images are organized by: | |
| - Tumor type (benign/malignant) | |
| - Specific histological type (adenosis, fibroadenoma, etc.) | |
| - Patient ID | |
| - Magnification level (40X, 100X, 200X, 400X) | |
| This function loads images from the specified magnification level and | |
| preprocesses them for use with the Path Foundation model. | |
| Args: | |
| data_dir (str): Path to BreakHis dataset root directory. Default structure: | |
| datasets/breakhis/histology_slides/breast/ | |
| max_samples_per_class (int): Maximum images to load per class (benign/malignant). | |
| Helps manage memory usage for large datasets. | |
| magnification (str): Magnification level to use. Options: "40X", "100X", "200X", "400X". | |
| Higher magnifications provide more detail but larger file sizes. | |
| Returns: | |
| tuple: (images, labels) as numpy arrays | |
| - images: Array of shape (num_images, 224, 224, 3) with normalized pixel values | |
| - labels: Array of shape (num_images,) with 0 for benign, 1 for malignant | |
| Dataset Structure: | |
| The function expects the following directory structure: | |
| data_dir/ | |
| ├── benign/SOB/ | |
| │   ├── adenosis/ | |
| │   ├── fibroadenoma/ | |
| │   ├── phyllodes_tumor/ | |
| │   └── tubular_adenoma/ | |
| └── malignant/SOB/ | |
|     ├── ductal_carcinoma/ | |
|     ├── lobular_carcinoma/ | |
|     ├── mucinous_carcinoma/ | |
|     └── papillary_carcinoma/ | |
| Note: | |
| Images are automatically resized to 224x224 pixels and normalized to [0,1] range. | |
| The function handles various image formats (PNG, JPG, JPEG, TIF, TIFF). | |
| Example: | |
| >>> # Load BreakHis dataset with 40X magnification | |
| >>> images, labels = load_breakhis_data( | |
| ... data_dir="datasets/breakhis/histology_slides/breast", | |
| ... max_samples_per_class=1000, | |
| ... magnification="40X" | |
| ... ) | |
| >>> print(f"Loaded {len(images)} images") | |
| >>> print(f"Benign: {np.sum(labels == 0)}, Malignant: {np.sum(labels == 1)}") | |
| """ | |
| print(f"Loading BreakHis dataset (magnification: {magnification})...") | |
| benign_dir = os.path.join(data_dir, "benign", "SOB") | |
| malignant_dir = os.path.join(data_dir, "malignant", "SOB") | |
| images = [] | |
| labels = [] | |
| def load_images_from_category(base_dir, label, max_count): | |
| """ | |
| Helper function to load images from a specific category (benign/malignant). | |
| Traverses the directory structure: base_dir/tumor_type/patient_id/magnification/images | |
| and loads images with progress reporting. | |
| """ | |
| if not os.path.exists(base_dir): | |
| print(f"Warning: Directory {base_dir} not found") | |
| return 0 | |
| count = 0 | |
| # Traverse: base_dir/tumor_type/patient_id/magnification/images | |
| for tumor_type in os.listdir(base_dir): | |
| tumor_dir = os.path.join(base_dir, tumor_type) | |
| if not os.path.isdir(tumor_dir): | |
| continue | |
| for patient_id in os.listdir(tumor_dir): | |
| patient_dir = os.path.join(tumor_dir, patient_id) | |
| if not os.path.isdir(patient_dir): | |
| continue | |
| mag_dir = os.path.join(patient_dir, magnification) | |
| if not os.path.exists(mag_dir): | |
| continue | |
| for filename in os.listdir(mag_dir): | |
| if count >= max_count: | |
| return count | |
| if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.tif', '.tiff')): | |
| try: | |
| img_path = os.path.join(mag_dir, filename) | |
| img = Image.open(img_path).convert('RGB') | |
| img = img.resize((224, 224)) | |
| img_array = np.array(img).astype('float32') / 255.0 | |
| images.append(img_array) | |
| labels.append(label) | |
| count += 1 | |
| if count % 100 == 0: | |
| category = 'benign' if label == 0 else 'malignant' | |
| print(f"Loaded {count} {category} images") | |
| except Exception as e: | |
| print(f"Error loading {filename}: {e}") | |
| continue | |
| return count | |
| # Load both categories | |
| benign_count = load_images_from_category(benign_dir, 0, max_samples_per_class) | |
| malignant_count = load_images_from_category(malignant_dir, 1, max_samples_per_class) | |
| print(f"BreakHis dataset loaded: {benign_count} benign, {malignant_count} malignant images") | |
| return np.array(images), np.array(labels) | |
| def load_pcam_data(data_dir="datasets/pcam", label_dir="datasets/Labels", max_samples=3000, augment=True): | |
| """ | |
| Load and preprocess the PatchCamelyon (PCam) dataset. | |
| PCam contains 96x96 pixel patches extracted from histopathologic scans | |
| of lymph node sections. Each patch is labeled with the presence of | |
| metastatic tissue. This function includes data augmentation capabilities | |
| to improve model generalization. | |
| The dataset is stored in HDF5 format with separate files for images and labels, | |
| and comes pre-split into training, validation, and test sets. | |
| Args: | |
| data_dir (str): Path to PCam image data directory containing: | |
| - training_split.h5 | |
| - validation_split.h5 | |
| - test_split.h5 | |
| label_dir (str): Path to PCam label files directory containing: | |
| - camelyonpatch_level_2_split_train_y.h5 | |
| - camelyonpatch_level_2_split_valid_y.h5 | |
| - camelyonpatch_level_2_split_test_y.h5 | |
| max_samples (int): Maximum total samples to load across all splits. | |
| Distributed as: train=50%, val=25%, test=25% | |
| augment (bool): Whether to apply data augmentation to training set. | |
| Augmentation includes: horizontal flip, rotation, brightness adjustment | |
| Returns: | |
| dict: Dictionary with 'train', 'valid', 'test' keys containing (images, labels) tuples | |
| - 'train': (train_images, train_labels) - Training data with optional augmentation | |
| - 'valid': (val_images, val_labels) - Validation data | |
| - 'test': (test_images, test_labels) - Test data | |
| Dataset Details: | |
| - Original patch size: 96x96 pixels | |
| - Resized to: 224x224 pixels for Path Foundation compatibility | |
| - Labels: 0 (normal tissue), 1 (metastatic tissue) | |
| - Format: HDF5 files with 'x' key for images, 'y' key for labels | |
| Data Augmentation (if enabled): | |
| - Horizontal flip: 50% probability | |
| - Rotation: Random 0°, 90°, 180°, or 270° rotation | |
| - Brightness adjustment: 30% probability, factor between 0.9-1.1 | |
| Note: | |
| The function automatically handles HDF5 file loading and memory management. | |
| Images are resized from 96x96 to 224x224 pixels and normalized to [0,1] range. | |
| Example: | |
| >>> # Load PCam dataset with augmentation | |
| >>> pcam_data = load_pcam_data( | |
| ... data_dir="datasets/pcam", | |
| ... label_dir="datasets/Labels", | |
| ... max_samples=2000, | |
| ... augment=True | |
| ... ) | |
| >>> | |
| >>> # Access training data | |
| >>> train_images, train_labels = pcam_data['train'] | |
| >>> print(f"Training samples: {len(train_images)}") | |
| >>> print(f"Image shape: {train_images[0].shape}") | |
| """ | |
| print("Loading PatchCamelyon (PCam) dataset...") | |
| # Define file paths | |
| train_file = os.path.join(data_dir, "training_split.h5") | |
| val_file = os.path.join(data_dir, "validation_split.h5") | |
| test_file = os.path.join(data_dir, "test_split.h5") | |
| train_label_file = os.path.join(label_dir, "camelyonpatch_level_2_split_train_y.h5") | |
| val_label_file = os.path.join(label_dir, "camelyonpatch_level_2_split_valid_y.h5") | |
| test_label_file = os.path.join(label_dir, "camelyonpatch_level_2_split_test_y.h5") | |
| def preprocess(images): | |
| """Resize and normalize images from 96x96 to 224x224 pixels.""" | |
| processed = [] | |
| for img in images: | |
| im = Image.fromarray(img) | |
| im = im.resize((224, 224)) # Resize to match Path Foundation input | |
| arr = np.array(im).astype('float32') / 255.0 | |
| processed.append(arr) | |
| return np.array(processed) | |
| def safe_load(img_file, label_file, limit): | |
| """Safely load data from HDF5 files with memory management.""" | |
| with h5py.File(img_file, 'r') as f_img, h5py.File(label_file, 'r') as f_lbl: | |
| x = f_img['x'][:limit] | |
| y = f_lbl['y'][:limit] | |
| y = y.reshape(-1) # Ensure 1D label array | |
| return x, y | |
| # Load data splits with sample limits | |
| train_images, train_labels = safe_load(train_file, train_label_file, max_samples//2) | |
| val_images, val_labels = safe_load(val_file, val_label_file, max_samples//4) | |
| test_images, test_labels = safe_load(test_file, test_label_file, max_samples//4) | |
| # Preprocess all splits | |
| train_images = preprocess(train_images) | |
| val_images = preprocess(val_images) | |
| test_images = preprocess(test_images) | |
| # Apply data augmentation to training set | |
| if augment: | |
| print("Applying data augmentation to training set...") | |
| for i in range(len(train_images)): | |
| # Random horizontal flip | |
| if np.random.rand() > 0.5: | |
| train_images[i] = np.fliplr(train_images[i]) | |
| # Random rotation (0, 90, 180, 270 degrees) | |
| k = np.random.randint(0, 4) | |
| if k: | |
| train_images[i] = np.rot90(train_images[i], k) | |
| # Random brightness adjustment | |
| if np.random.rand() > 0.7: | |
| im = Image.fromarray((train_images[i] * 255).astype('uint8')) | |
| brightness_factor = 0.9 + 0.2 * np.random.rand() | |
| im = Image.fromarray( | |
| np.clip(np.array(im, dtype=np.float32) * brightness_factor, 0, 255).astype('uint8') | |
| ) | |
| train_images[i] = np.array(im).astype('float32') / 255.0 | |
| print(f"PCam dataset loaded - Train: {len(train_images)}, Val: {len(val_images)}, Test: {len(test_images)}") | |
| return { | |
| 'train': (train_images, train_labels), | |
| 'valid': (val_images, val_labels), | |
| 'test': (test_images, test_labels) | |
| } | |
| def load_bach_data(data_dir="datasets/BACH/ICIAR2018_BACH_Challenge/Photos", max_samples=400, augment=True): | |
| """ | |
| Load and preprocess the BACH (ICIAR 2018) breast cancer histology dataset. | |
| BACH contains microscopy images classified into four categories: | |
| - Normal tissue | |
| - Benign lesions | |
| - In situ carcinoma | |
| - Invasive carcinoma | |
| For binary classification, this function maps: | |
| - Normal + Benign → Benign (label 0) | |
| - In situ + Invasive → Malignant (label 1) | |
| Args: | |
| data_dir (str): Path to BACH dataset directory containing class subdirectories: | |
| - Normal/ | |
| - Benign/ | |
| - InSitu/ | |
| - Invasive/ | |
| max_samples (int): Maximum total samples to load across all classes. | |
| Distributed evenly across the 4 classes. | |
| augment (bool): Whether to apply data augmentation (currently not implemented | |
| for BACH dataset but parameter kept for consistency) | |
| Returns: | |
| dict: Dictionary with 'train', 'valid', 'test' keys containing (images, labels) tuples | |
| - 'train': (train_images, train_labels) - Training data | |
| - 'valid': (val_images, val_labels) - Validation data | |
| - 'test': (test_images, test_labels) - Test data | |
| Dataset Details: | |
| - Original categories: 4 classes (Normal, Benign, InSitu, Invasive) | |
| - Binary mapping: Normal(0), Benign(1) → Benign(0); InSitu(2), Invasive(3) → Malignant(1) | |
| - Image format: TIF, TIFF, PNG, JPG, JPEG | |
| - Resized to: 224x224 pixels for Path Foundation compatibility | |
| - Normalized to: [0, 1] range | |
| Data Splitting: | |
| - Test set: 20% of total data | |
| - Training set: 60% of total data (75% of remaining after test split) | |
| - Validation set: 20% of total data (25% of remaining after test split) | |
| - Stratified splitting to maintain class distribution | |
| Note: | |
| The function automatically handles the 4-class to binary classification mapping. | |
| Images are resized to 224x224 pixels and normalized to [0,1] range. | |
| The augment parameter is kept for API consistency but augmentation is not | |
| currently implemented for the BACH dataset. | |
| Example: | |
| >>> # Load BACH dataset | |
| >>> bach_data = load_bach_data( | |
| ... data_dir="datasets/BACH/ICIAR2018_BACH_Challenge/Photos", | |
| ... max_samples=400, | |
| ... augment=True | |
| ... ) | |
| >>> | |
| >>> # Access training data | |
| >>> train_images, train_labels = bach_data['train'] | |
| >>> print(f"Training samples: {len(train_images)}") | |
| >>> print(f"Class distribution: Benign={np.sum(train_labels==0)}, Malignant={np.sum(train_labels==1)}") | |
| """ | |
| print("Loading BACH (ICIAR 2018) dataset...") | |
| # Original BACH categories mapped to binary classification | |
| class_dirs = { | |
| 'Normal': 0, # Normal tissue → Benign | |
| 'Benign': 1, # Benign lesions → Benign | |
| 'InSitu': 2, # In situ carcinoma → Malignant | |
| 'Invasive': 3, # Invasive carcinoma → Malignant | |
| } | |
| images = [] | |
| labels = [] | |
| per_class_limit = None if not max_samples else max_samples // 4 | |
| counters = {0: 0, 1: 0, 2: 0, 3: 0} | |
| # Load images from each category | |
| for cls_name, cls_label in class_dirs.items(): | |
| cls_path = os.path.join(data_dir, cls_name) | |
| if not os.path.isdir(cls_path): | |
| print(f"Warning: Directory {cls_path} not found") | |
| continue | |
| for fname in os.listdir(cls_path): | |
| if per_class_limit and counters[cls_label] >= per_class_limit: | |
| break | |
| if not fname.lower().endswith((".tif", ".tiff", ".png", ".jpg", ".jpeg")): | |
| continue | |
| fpath = os.path.join(cls_path, fname) | |
| try: | |
| im = Image.open(fpath).convert('RGB') | |
| im = im.resize((224, 224)) | |
| arr = np.array(im).astype('float32') / 255.0 | |
| images.append(arr) | |
| labels.append(cls_label) | |
| counters[cls_label] += 1 | |
| except Exception as e: | |
| print(f"Error loading {fname}: {e}") | |
| continue | |
| images = np.array(images) | |
| labels = np.array(labels) | |
| # Convert 4-class to binary classification | |
| if labels.size > 0: | |
| # Map: Normal(0), Benign(1) → Benign(0); InSitu(2), Invasive(3) → Malignant(1) | |
| labels = np.where(np.isin(labels, [0, 1]), 0, 1) | |
| print(f"BACH dataset loaded: {len(images)} images") | |
| print(f"Class distribution - Benign: {np.sum(labels == 0)}, Malignant: {np.sum(labels == 1)}") | |
| # Split into train/validation/test sets | |
| X_temp, X_test, y_temp, y_test = train_test_split( | |
| images, labels, test_size=0.2, | |
| stratify=labels if len(set(labels)) > 1 else None, | |
| random_state=42 | |
| ) | |
| X_train, X_val, y_train, y_val = train_test_split( | |
| X_temp, y_temp, test_size=0.25, | |
| stratify=y_temp if len(set(y_temp)) > 1 else None, | |
| random_state=42 | |
| ) | |
| return { | |
| 'train': (X_train, y_train), | |
| 'valid': (X_val, y_val), | |
| 'test': (X_test, y_test) | |
| } | |
| def load_combined_data(dataset_choice="breakhis", max_samples=5000): | |
| """ | |
| Unified data loading function supporting multiple datasets and combinations. | |
| This function serves as the main entry point for data loading, supporting: | |
| - Individual datasets (BreakHis, PCam, BACH) | |
| - Combined dataset training for improved generalization | |
| - Consistent data splitting and preprocessing across all datasets | |
| The combined dataset approach leverages multiple histopathology datasets to | |
| create a more robust and generalizable model by training on diverse data sources. | |
| Args: | |
| dataset_choice (str): Dataset to load. Options: | |
| - "breakhis": BreakHis breast cancer histopathology dataset | |
| - "pcam": PatchCamelyon lymph node metastasis dataset | |
| - "bach": BACH ICIAR 2018 breast cancer histology dataset | |
| - "combined": Ensemble of all three datasets for robust training | |
| max_samples (int): Maximum total samples to load. For individual datasets, | |
| this is the total limit. For combined datasets, this is | |
| distributed across the constituent datasets. | |
| Returns: | |
| dict: Dictionary with 'train', 'valid', 'test' keys containing (images, labels) tuples | |
| - 'train': (train_images, train_labels) - Training data | |
| - 'valid': (val_images, val_labels) - Validation data | |
| - 'test': (test_images, test_labels) - Test data | |
| Dataset Combinations: | |
| When dataset_choice="combined", the function: | |
| 1. Loads BreakHis, PCam, and BACH datasets | |
| 2. Combines their training data | |
| 3. Shuffles the combined dataset | |
| 4. Splits into train/validation/test sets | |
| 5. Maintains class balance through stratified splitting | |
| Sample Distribution (for combined datasets): | |
| - BreakHis: max_samples // 6 (per-class limit) | |
| - PCam: max_samples // 3 (total limit) | |
| - BACH: max_samples // 3 (total limit) | |
| Data Splitting: | |
| - Test set: 20% of total data | |
| - Training set: 60% of total data (75% of remaining after test split) | |
| - Validation set: 20% of total data (25% of remaining after test split) | |
| - Stratified splitting to maintain class distribution | |
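| For example, 6,000 pooled images yield roughly 3,600 training, 1,200 validation, and 1,200 test samples. | |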
| Note: | |
| All datasets are automatically preprocessed to 224x224 pixels and normalized | |
| to [0,1] range for compatibility with the Path Foundation model. | |
| Example: | |
| >>> # Load individual dataset | |
| >>> data = load_combined_data("breakhis", max_samples=2000) | |
| >>> | |
| >>> # Load combined dataset for robust training | |
| >>> combined_data = load_combined_data("combined", max_samples=6000) | |
| >>> | |
| >>> # Access training data | |
| >>> train_images, train_labels = combined_data['train'] | |
| >>> print(f"Combined training samples: {len(train_images)}") | |
| """ | |
| if dataset_choice.lower() == "breakhis": | |
| print("Loading BreakHis dataset only...") | |
| images, labels = load_breakhis_data(max_samples_per_class=max_samples//2) | |
| # Split into train/validation/test | |
| X_temp, X_test, y_temp, y_test = train_test_split( | |
| images, labels, test_size=0.2, stratify=labels, random_state=42 | |
| ) | |
| X_train, X_val, y_train, y_val = train_test_split( | |
| X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42 | |
| ) | |
| return { | |
| 'train': (X_train, y_train), | |
| 'valid': (X_val, y_val), | |
| 'test': (X_test, y_test) | |
| } | |
| elif dataset_choice.lower() == "pcam": | |
| return load_pcam_data(max_samples=max_samples) | |
| elif dataset_choice.lower() == "bach": | |
| return load_bach_data(max_samples=max_samples) | |
| elif dataset_choice.lower() == "combined": | |
| print("Loading combined datasets for enhanced generalization...") | |
| # Distribute samples across datasets | |
| if max_samples is None: | |
| per_bh = None | |
| per_pc = None | |
| per_ba = None | |
| else: | |
| per_dataset = max(1, max_samples // 3) | |
| per_bh = per_dataset // 2 # BreakHis uses per-class limit | |
| per_pc = per_dataset | |
| per_ba = per_dataset | |
| # Load individual datasets | |
| print("Loading BreakHis component...") | |
| bh_images, bh_labels = load_breakhis_data( | |
| max_samples_per_class=per_bh if per_bh else 10**9 | |
| ) | |
| print("Loading PCam component...") | |
| pcam = load_pcam_data(max_samples=per_pc, augment=True) | |
| pc_train_images, pc_train_labels = pcam["train"] | |
| print("Loading BACH component...") | |
| bach = load_bach_data(max_samples=per_ba, augment=True) | |
| b_train_images, b_train_labels = bach["train"] | |
| # Combine all datasets | |
| images = np.concatenate([bh_images, pc_train_images, b_train_images], axis=0) | |
| labels = np.concatenate([bh_labels, pc_train_labels, b_train_labels], axis=0) | |
| print(f"Combined dataset: {len(images)} total images") | |
| print(f"Final distribution - Benign: {np.sum(labels == 0)}, Malignant: {np.sum(labels == 1)}") | |
| # Shuffle combined data | |
| idx = np.arange(len(images)) | |
| np.random.shuffle(idx) | |
| images, labels = images[idx], labels[idx] | |
| # Split combined data | |
| X_temp, X_test, y_temp, y_test = train_test_split( | |
| images, labels, test_size=0.2, | |
| stratify=labels if len(set(labels)) > 1 else None, | |
| random_state=42 | |
| ) | |
| X_train, X_val, y_train, y_val = train_test_split( | |
| X_temp, y_temp, test_size=0.25, | |
| stratify=y_temp if len(set(y_temp)) > 1 else None, | |
| random_state=42 | |
| ) | |
| return { | |
| 'train': (X_train, y_train), | |
| 'valid': (X_val, y_val), | |
| 'test': (X_test, y_test) | |
| } | |
| else: | |
| raise ValueError(f"Unknown dataset choice: {dataset_choice}. " | |
| f"Choose from: 'breakhis', 'pcam', 'bach', 'combined'") | |
| def main(): | |
| """ | |
| Execute the complete breast cancer classification pipeline. | |
| This function coordinates all components of the machine learning workflow: | |
| 1. Environment validation and setup | |
| 2. Model authentication and loading | |
| 3. Dataset loading and preprocessing | |
| 4. Feature extraction using Path Foundation | |
| 5. Classifier training with advanced techniques | |
| 6. Comprehensive model evaluation | |
| 7. Model persistence for future use | |
| The pipeline implements a robust transfer learning approach using Google's | |
| Path Foundation model as a feature extractor, followed by a trainable | |
| classification head for binary breast cancer classification. | |
| Returns: | |
| tuple: (classifier_instance, evaluation_results) or (None, None) if failed | |
| - classifier_instance: Trained BreastCancerClassifier object | |
| - evaluation_results: Dictionary containing performance metrics and predictions | |
| Configuration: | |
| The function uses global variables for configuration (can be modified): | |
| - DATASET_CHOICE: Dataset to use ("breakhis", "pcam", "bach", "combined") | |
| - MAX_SAMPLES: Maximum samples to load (adjust based on available memory) | |
| - EPOCHS: Number of training epochs (default: 50) | |
| - HF_TOKEN: Hugging Face authentication token (optional) | |
| Pipeline Steps: | |
| 1. Prerequisites Check: Validates required packages and dependencies | |
| 2. Authentication: Authenticates with Hugging Face Hub | |
| 3. Model Loading: Downloads and loads Path Foundation model | |
| 4. Data Loading: Loads and preprocesses histopathology dataset | |
| 5. Feature Extraction: Extracts embeddings using frozen foundation model | |
| 6. Classifier Building: Constructs trainable classification head | |
| 7. Training: Trains classifier with callbacks and monitoring | |
| 8. Evaluation: Comprehensive performance assessment | |
| 9. Model Saving: Persists trained model for future use | |
| Error Handling: | |
| The function includes comprehensive error handling with detailed error messages | |
| and stack traces to aid in debugging and troubleshooting. | |
| Example: | |
| >>> # Run the complete pipeline | |
| >>> classifier, results = main() | |
| >>> | |
| >>> if results: | |
| ... print(f"Pipeline successful! Accuracy: {results['accuracy']:.4f}") | |
| ... # Use the trained classifier for inference | |
| ... else: | |
| ... print("Pipeline failed - check error messages") | |
| Note: | |
| This function is designed to be run as a standalone script or imported | |
| and called from other modules. It provides a complete end-to-end | |
| machine learning pipeline for breast cancer classification. | |
| """ | |
| print("="*60) | |
| print("BREAST CANCER CLASSIFICATION WITH PATH FOUNDATION") | |
| print("="*60) | |
| # Validate prerequisites | |
| if not HF_AVAILABLE: | |
| print("ERROR: Prerequisites not met") | |
| print("Required installations: pip install tensorflow huggingface_hub transformers") | |
| return None, None | |
| # Configuration parameters | |
| EPOCHS = 50 | |
| HF_TOKEN = None # Set your Hugging Face token here if needed | |
| # Global configuration (can be modified in notebook) | |
| # Use notebook-level overrides when present; assigning inside `if ... not in globals()` branches | |
| # would make these names function-local and raise UnboundLocalError whenever an override exists. | |
| DATASET_CHOICE = globals().get('DATASET_CHOICE', 'combined') # Options: 'breakhis', 'pcam', 'bach', 'combined' | |
| MAX_SAMPLES = globals().get('MAX_SAMPLES', 4000) | |
| print(f"Configuration:") | |
| print(f" - Epochs: {EPOCHS}") | |
| print(f" - Dataset: {DATASET_CHOICE}") | |
| print(f" - Max samples: {MAX_SAMPLES}") | |
| print(f" - Method: Feature extraction (frozen foundation model)") | |
| try: | |
| # Initialize classifier in feature extraction mode | |
| classifier = BreastCancerClassifier(fine_tune=False) | |
| print("\n" + "="*40) | |
| print("STEP 1: HUGGING FACE AUTHENTICATION") | |
| print("="*40) | |
| if not classifier.authenticate_huggingface(HF_TOKEN): | |
| raise Exception("Authentication failed - check your HF token") | |
| print("\n" + "="*40) | |
| print("STEP 2: LOADING PATH FOUNDATION MODEL") | |
| print("="*40) | |
| if not classifier.load_path_foundation(): | |
| raise Exception("Model loading failed - check network connection") | |
| print("\n" + "="*40) | |
| print(f"STEP 3: LOADING {DATASET_CHOICE.upper()} DATASET") | |
| print("="*40) | |
| data = load_combined_data(DATASET_CHOICE, MAX_SAMPLES) | |
| X_train, y_train = data['train'] | |
| X_val, y_val = data['valid'] | |
| X_test, y_test = data['test'] | |
| print(f"Dataset splits:") | |
| print(f" - Training: {len(X_train)} samples") | |
| print(f" - Validation: {len(X_val)} samples") | |
| print(f" - Test: {len(X_test)} samples") | |
| print("\n" + "="*40) | |
| print("STEP 4: EXTRACTING FEATURE EMBEDDINGS") | |
| print("="*40) | |
| print("Extracting training embeddings...") | |
| X_train = classifier.extract_embeddings(X_train) | |
| print("Extracting validation embeddings...") | |
| X_val = classifier.extract_embeddings(X_val) | |
| print("Extracting test embeddings...") | |
| X_test = classifier.extract_embeddings(X_test) | |
| print("\n" + "="*40) | |
| print("STEP 5: BUILDING CLASSIFICATION HEAD") | |
| print("="*40) | |
| classifier.num_classes = 2 | |
| classifier.build_classifier() | |
| print("\n" + "="*40) | |
| print("STEP 6: TRAINING CLASSIFIER") | |
| print("="*40) | |
| classifier.train_model(X_train, y_train, X_val, y_val, EPOCHS) | |
| print("\n" + "="*40) | |
| print("STEP 7: MODEL EVALUATION") | |
| print("="*40) | |
| results = classifier.evaluate_model(X_test, y_test) | |
| # Save trained model | |
| model_name = f"{DATASET_CHOICE}_breast_cancer_classifier.keras" | |
| classifier.model.save(model_name) | |
| print(f"\nModel saved as: {model_name}") | |
| print("\n" + "="*60) | |
| print("PIPELINE COMPLETED SUCCESSFULLY") | |
| print("="*60) | |
| print(f"Final Performance Metrics:") | |
| print(f" - Accuracy: {results['accuracy']:.4f} ({results['accuracy']*100:.2f}%)") | |
| print(f" - F1-Score: {results['f1']:.4f}") | |
| print(f" - Precision: {results['precision']:.4f}") | |
| print(f" - Recall: {results['recall']:.4f}") | |
| return classifier, results | |
| except Exception as e: | |
| print(f"\nERROR: Pipeline failed - {e}") | |
| import traceback | |
| traceback.print_exc() | |
| return None, None | |
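| # Illustrative inference sketch (a minimal example, not part of the pipeline above). It assumes a | |
| # classifier head saved by main() exists on disk; the default filename matches DATASET_CHOICE='combined', | |
| # and image_paths is whatever list of image file paths the caller supplies. | |
| def predict_images(image_paths, model_path="combined_breast_cancer_classifier.keras"): | |
|     """Reuse a saved classifier head for inference on new histopathology images.""" | |
|     clf = BreastCancerClassifier(fine_tune=False) | |
|     if not (clf.authenticate_huggingface() and clf.load_path_foundation()): | |
|         raise RuntimeError("Path Foundation model could not be loaded") | |
|     embeddings = clf.extract_embeddings(image_paths) | |
|     head = tf.keras.models.load_model(model_path) | |
|     probabilities = head.predict(embeddings) | |
|     predictions = np.argmax(probabilities, axis=1)  # 0 = benign, 1 = malignant | |
|     return predictions, probabilities | |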
| # Script execution section | |
| if __name__ == "__main__": | |
| """ | |
| Main execution block for running the breast cancer classification pipeline. | |
| This section is executed when the script is run directly (not imported). | |
| It provides a simple interface to run the complete machine learning pipeline | |
| and displays the final results. | |
| Usage: | |
| python model2.py | |
| The script will: | |
| 1. Initialize and run the complete pipeline | |
| 2. Display progress and intermediate results | |
| 3. Show final performance metrics | |
| 4. Save the trained model for future use | |
| """ | |
| print("Starting Breast Cancer Classification Pipeline...") | |
| print("This may take several minutes depending on your hardware and dataset size.") | |
| print("="*60) | |
| # Execute the complete pipeline | |
| classifier, results = main() | |
| # Display final results | |
| if results: | |
| print("\n" + "="*60) | |
| print("π PIPELINE EXECUTION SUCCESSFUL! π") | |
| print("="*60) | |
| print(f"Final Accuracy: {results['accuracy']:.4f} ({results['accuracy']*100:.2f}%)") | |
| print(f"F1-Score: {results['f1']:.4f}") | |
| print(f"Precision: {results['precision']:.4f}") | |
| print(f"Recall: {results['recall']:.4f}") | |
| print("\nThe trained model has been saved and is ready for inference!") | |
| print("You can now use the classifier for breast cancer classification tasks.") | |
| else: | |
| print("\n" + "="*60) | |
| print("β PIPELINE EXECUTION FAILED β") | |
| print("="*60) | |
| print("Please check the error messages above for troubleshooting.") | |
| print("Common issues:") | |
| print("- Missing dependencies (install with: pip install tensorflow huggingface_hub transformers)") | |
| print("- Network connectivity issues (for downloading Path Foundation model)") | |
| print("- Insufficient memory (reduce MAX_SAMPLES parameter)") | |
| print("- Invalid dataset paths (check dataset directory structure)") |