Histolab

LC25000 Histopathology Classification

A custom CNN architecture optimized for histopathological image classification without using pretrained weights. This implementation provides three model variants with increasing complexity and performance, specifically designed for the LC25000 dataset.

Overview

This project implements custom convolutional neural networks for classifying histopathological images into five distinct categories. The models are trained from scratch without transfer learning, demonstrating the effectiveness of carefully designed architectures for medical image analysis.

Dataset

LC25000 Lung and Colon Histopathological Image Dataset

  • Total Images: 25,000
  • Image Size: 768 x 768 pixels (resized to 224 x 224)
  • Number of Classes: 5
  • Format: RGB histopathological images

Classes

  1. Colon Adenocarcinoma
  2. Colon Benign Tissue
  3. Lung Adenocarcinoma
  4. Lung Benign Tissue
  5. Lung Squamous Cell Carcinoma
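When loading the dataset with `tf.keras.utils.image_dataset_from_directory`, class indices follow the alphabetical order of the subdirectory names. A minimal sketch of the index-to-label mapping, assuming the standard LC25000 folder names (`colon_aca`, `colon_n`, `lung_aca`, `lung_n`, `lung_scc`); adjust to match the actual folders on disk:

```python
# Index-to-label mapping; indices follow alphabetical subdirectory order.
# Folder names in the comments are assumptions -- verify against your copy
# of the dataset.
CLASS_NAMES = [
    "Colon Adenocarcinoma",           # colon_aca
    "Colon Benign Tissue",            # colon_n
    "Lung Adenocarcinoma",            # lung_aca
    "Lung Benign Tissue",             # lung_n
    "Lung Squamous Cell Carcinoma",   # lung_scc
]
```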

Model Architectures

Three distinct architectures are provided, each with different complexity-performance tradeoffs:

Version 1: Simple CNN

  • Architecture: Classic VGG-style sequential convolutions
  • Training Time: Fast
  • Expected Accuracy: 90-93%
  • Parameters: ~15M
  • Best For: Baseline experiments, quick iterations

Version 2: Residual Network

  • Architecture: ResNet-inspired with residual connections
  • Training Time: Moderate
  • Expected Accuracy: 92-95%
  • Parameters: ~20M
  • Best For: Balanced performance and training efficiency
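The residual connections in Version 2 let each block learn a correction on top of its input, which stabilizes training of deeper networks. A minimal sketch of such a block in Keras; `residual_block` is a hypothetical helper, not necessarily the name used in this repository:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Illustrative basic residual block: two 3x3 convs plus a skip path."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same",
                      use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut with a 1x1 conv when the spatial size or
    # channel count changes, so the two tensors can be added.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride,
                                 use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```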

Version 3: Attention Network (Recommended)

  • Architecture: Advanced design with Squeeze-and-Excitation blocks
  • Training Time: Longer
  • Expected Accuracy: 94-97%
  • Parameters: ~25M
  • Best For: Maximum performance
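A Squeeze-and-Excitation block, as used in Version 3, recalibrates channel responses: it pools each feature map to a single value ("squeeze"), passes the result through a small bottleneck MLP, and rescales the channels by the resulting sigmoid weights ("excitation"). A minimal sketch; `se_block` and the `ratio` default are assumptions, not necessarily the repository's implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio=16):
    """Illustrative Squeeze-and-Excitation block (channel reweighting)."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                # squeeze
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)   # excitation
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                      # channel-wise scale
```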

Key Features

  • Data Augmentation: Comprehensive augmentation pipeline including random flips, rotations, zoom, translation, contrast, and brightness adjustments
  • Regularization: L2 weight decay, dropout, and label smoothing
  • Optimization: AdamW optimizer with learning rate scheduling
  • Callbacks: Early stopping, learning rate reduction on plateau, model checkpointing
  • Metrics: Accuracy, AUC, Precision, Recall
  • Visualization: Training history plots and confusion matrices
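The augmentation pipeline listed above could be built from Keras preprocessing layers roughly as follows. This is a hedged sketch, the exact layers and strengths used in the repository may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation stack; applied only when training=True.
# The specific factors (0.1, 0.2) are assumptions, not the repo's values.
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
    layers.RandomContrast(0.2),
    layers.RandomBrightness(0.2),
])
```

Histopathology slides have no canonical orientation, so both horizontal and vertical flips are safe augmentations here.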

Installation

pip install "tensorflow>=2.13.0"
pip install numpy
pip install matplotlib
pip install seaborn
pip install scikit-learn

Usage

Basic Training

from lc25000_classifier import main

# First edit the dataset paths inside main(), e.g.:
#   train_directory = 'path/to/train'
#   test_directory = 'path/to/test'

# Run training
main()

Model Selection

Select an architecture by keeping only the desired builder call active (comment out the others):

# Simple CNN (faster training)
model = build_model_v1_simple()

# Residual Network (balanced)
model = build_model_v2_residual()

# Attention Network (best performance)
model = build_model_v3_attention()

Custom Training

from lc25000_classifier import load_datasets, build_model_v3_attention, compile_model, get_callbacks

# Load data
train_ds, val_ds, test_ds = load_datasets(train_dir, test_dir)

# Build and compile model
model = build_model_v3_attention()
model = compile_model(model)

# Train
history = model.fit(
    train_ds,
    epochs=150,
    validation_data=val_ds,
    callbacks=get_callbacks()
)

Configuration

Key hyperparameters can be adjusted in the CONFIG dictionary:

CONFIG = {
    'image_size': (224, 224),
    'batch_size': 32,
    'epochs': 150,
    'initial_lr': 0.001,
    'weight_decay': 1e-4,
    'dropout_rate': 0.4,
    'num_classes': 5,
    'seed': 42
}

Training Details

Optimization Strategy

  • Optimizer: AdamW with weight decay
  • Initial Learning Rate: 0.001
  • Learning Rate Schedule: ReduceLROnPlateau (factor=0.5, patience=7)
  • Loss Function: Categorical Crossentropy with label smoothing (0.1)
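Putting these choices together, `compile_model` might look roughly like the sketch below (the actual implementation in the repository may differ in detail):

```python
import tensorflow as tf

def compile_model(model):
    """Illustrative compile step: AdamW + label-smoothed crossentropy."""
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-3,
                                            weight_decay=1e-4),
        loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1),
        metrics=[
            "accuracy",
            tf.keras.metrics.AUC(name="auc"),
            tf.keras.metrics.Precision(name="precision"),
            tf.keras.metrics.Recall(name="recall"),
        ],
    )
    return model
```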

Regularization Techniques

  • L2 weight regularization (1e-4)
  • Dropout (0.4 in classifier, 0.2-0.3 in feature extractor)
  • Batch normalization after each convolution
  • Label smoothing

Training Strategy

  • Early stopping (patience=15)
  • Model checkpointing (saves best model based on validation accuracy)
  • TensorBoard logging for monitoring
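A sketch of what `get_callbacks` might return, combining the early stopping, LR reduction, checkpointing, and TensorBoard logging described above (monitored quantities are assumptions; the repository's implementation may differ):

```python
import tensorflow as tf

def get_callbacks():
    """Illustrative callback list matching the documented training strategy."""
    return [
        tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                         patience=15,
                                         restore_best_weights=True),
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                             factor=0.5, patience=7),
        tf.keras.callbacks.ModelCheckpoint("lc25000_scratch_best.keras",
                                           monitor="val_accuracy",
                                           save_best_only=True),
        tf.keras.callbacks.TensorBoard(log_dir="./logs"),
    ]
```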

Performance Metrics

The model is evaluated using multiple metrics:

  • Accuracy: Overall classification accuracy
  • AUC: Area under the ROC curve (computed one-vs-rest across classes)
  • Precision: Positive predictive value
  • Recall: Sensitivity
  • Confusion Matrix: Detailed per-class performance
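Per-class evaluation over a `tf.data` test set could be computed as in the sketch below; `evaluate_confusion` is a hypothetical helper, assuming one-hot labels as produced by `image_dataset_from_directory(label_mode="categorical")`:

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import confusion_matrix, classification_report

def evaluate_confusion(model, test_ds, class_names):
    """Collect predictions over a batched dataset; return the confusion matrix."""
    y_true, y_pred = [], []
    for images, labels in test_ds:
        probs = model.predict(images, verbose=0)
        y_pred.extend(np.argmax(probs, axis=1))
        y_true.extend(np.argmax(labels.numpy(), axis=1))  # one-hot labels
    label_ids = list(range(len(class_names)))
    print(classification_report(y_true, y_pred, labels=label_ids,
                                target_names=class_names, zero_division=0))
    return confusion_matrix(y_true, y_pred, labels=label_ids)
```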

Model Output

Training produces the following artifacts:

  • lc25000_scratch_best.keras: Best model checkpoint
  • lc25000_scratch_final.keras: Final trained model
  • lc25000_scratch_weights.weights.h5: Model weights only
  • training_history.png: Visualization of training metrics
  • confusion_matrix.png: Confusion matrix heatmap
  • ./logs/: TensorBoard logs

Inference Example

import tensorflow as tf
import numpy as np

CLASS_NAMES = ['Colon Adenocarcinoma', 'Colon Benign Tissue',
               'Lung Adenocarcinoma', 'Lung Benign Tissue',
               'Lung Squamous Cell Carcinoma']

# Load model
model = tf.keras.models.load_model('lc25000_scratch_final.keras')

# Load and preprocess image
image = tf.keras.utils.load_img('path/to/image.png', target_size=(224, 224))
image_array = tf.keras.utils.img_to_array(image)
image_array = np.expand_dims(image_array, axis=0)

# Predict
predictions = model.predict(image_array)[0]
predicted_class = CLASS_NAMES[np.argmax(predictions)]

print(f"Predicted class: {predicted_class}")
print(f"Confidence: {np.max(predictions):.2%}")

Requirements

  • Python 3.8+
  • TensorFlow 2.13+
  • NumPy
  • Matplotlib
  • Seaborn
  • scikit-learn

Hardware Recommendations

  • Minimum: 8GB RAM, CPU training (slow)
  • Recommended: 16GB RAM, NVIDIA GPU with 8GB+ VRAM
  • Optimal: 32GB RAM, NVIDIA GPU with 16GB+ VRAM

Training time varies by architecture and hardware:

  • Simple CNN: ~2-4 hours (GPU)
  • Residual Network: ~4-6 hours (GPU)
  • Attention Network: ~6-10 hours (GPU)

Citation

If you use this implementation, please cite the LC25000 dataset:

@misc{lc25000,
  title    = {LC25000 Lung and Colon Histopathological Image Dataset},
  author   = {Borkowski, Andrew A. and Bui, Marilyn M. and Thomas, L. Brannon and Wilson, Catherine P. and DeLand, Lauren A. and Mastorides, Stephen M.},
  keywords = {cancer, histopathology},
  url      = {https://github.com/tampapath/lung_colon_image_set}
}

License

This implementation is provided for research and educational purposes. Please refer to the LC25000 dataset license for data usage terms.

Acknowledgments

  • LC25000 dataset creators for providing high-quality histopathological images
  • TensorFlow team for the deep learning framework
  • Medical imaging community for advancing computational pathology

Contact

For questions, issues, or contributions, please open an issue in the repository.


Note: This model is intended for research purposes only and should not be used for clinical diagnosis without proper validation and regulatory approval.
