Histolab

LC25000 Histopathology Classification

A custom CNN architecture optimized for histopathological image classification without using pretrained weights. This implementation provides three model variants with increasing complexity and performance, specifically designed for the LC25000 dataset.

Overview

This project implements custom convolutional neural networks for classifying histopathological images into five distinct categories. The models are trained from scratch without transfer learning, demonstrating the effectiveness of carefully designed architectures for medical image analysis.

Dataset

LC25000 Lung and Colon Histopathological Image Dataset

  • Total Images: 25,000
  • Image Size: 768 x 768 pixels (resized to 224 x 224)
  • Number of Classes: 5
  • Format: RGB histopathological images

Classes

  1. Colon Adenocarcinoma
  2. Colon Benign Tissue
  3. Lung Adenocarcinoma
  4. Lung Benign Tissue
  5. Lung Squamous Cell Carcinoma
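When loading the dataset with `tf.keras.utils.image_dataset_from_directory`, class indices follow the alphabetical order of the subdirectory names. A minimal sketch of the index-to-label mapping, assuming the standard LC25000 folder names (`colon_aca`, `colon_n`, `lung_aca`, `lung_n`, `lung_scc`); adjust to match the actual folders on disk:

```python
# Index-to-label mapping; indices follow alphabetical subdirectory order.
# Folder names in the comments are assumptions -- verify against your copy
# of the dataset.
CLASS_NAMES = [
    "Colon Adenocarcinoma",           # colon_aca
    "Colon Benign Tissue",            # colon_n
    "Lung Adenocarcinoma",            # lung_aca
    "Lung Benign Tissue",             # lung_n
    "Lung Squamous Cell Carcinoma",   # lung_scc
]
```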

Model Architectures

Three distinct architectures are provided, each with different complexity-performance tradeoffs:

Version 1: Simple CNN

  • Architecture: Classic VGG-style sequential convolutions
  • Training Time: Fast
  • Expected Accuracy: 90-93%
  • Parameters: ~15M
  • Best For: Baseline experiments, quick iterations

Version 2: Residual Network

  • Architecture: ResNet-inspired with residual connections
  • Training Time: Moderate
  • Expected Accuracy: 92-95%
  • Parameters: ~20M
  • Best For: Balanced performance and training efficiency
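The residual connections in Version 2 let each block learn a correction on top of its input, which stabilizes training of deeper networks. A minimal sketch of such a block in Keras; `residual_block` is a hypothetical helper, not necessarily the name used in this repository:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Illustrative basic residual block: two 3x3 convs plus a skip path."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same",
                      use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut with a 1x1 conv when the spatial size or
    # channel count changes, so the two tensors can be added.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride,
                                 use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```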

Version 3: Attention Network (Recommended)

  • Architecture: Advanced design with Squeeze-and-Excitation blocks
  • Training Time: Longer
  • Expected Accuracy: 94-97%
  • Parameters: ~25M
  • Best For: Maximum performance
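A Squeeze-and-Excitation block, as used in Version 3, recalibrates channel responses: it pools each feature map to a single value ("squeeze"), passes the result through a small bottleneck MLP, and rescales the channels by the resulting sigmoid weights ("excitation"). A minimal sketch; `se_block` and the `ratio` default are assumptions, not necessarily the repository's implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio=16):
    """Illustrative Squeeze-and-Excitation block (channel reweighting)."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                # squeeze
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)   # excitation
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                      # channel-wise scale
```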

Key Features

  • Data Augmentation: Comprehensive augmentation pipeline including random flips, rotations, zoom, translation, contrast, and brightness adjustments
  • Regularization: L2 weight decay, dropout, and label smoothing
  • Optimization: AdamW optimizer with learning rate scheduling
  • Callbacks: Early stopping, learning rate reduction on plateau, model checkpointing
  • Metrics: Accuracy, AUC, Precision, Recall
  • Visualization: Training history plots and confusion matrices
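The augmentation pipeline listed above could be built from Keras preprocessing layers roughly as follows. This is a hedged sketch, the exact layers and strengths used in the repository may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation stack; applied only when training=True.
# The specific factors (0.1, 0.2) are assumptions, not the repo's values.
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
    layers.RandomContrast(0.2),
    layers.RandomBrightness(0.2),
])
```

Histopathology slides have no canonical orientation, so both horizontal and vertical flips are safe augmentations here.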

Installation

pip install "tensorflow>=2.13.0"
pip install numpy
pip install matplotlib
pip install seaborn
pip install scikit-learn

Usage

Basic Training

from lc25000_classifier import main

# First edit the dataset paths inside main(), e.g.:
#   train_directory = 'path/to/train'
#   test_directory = 'path/to/test'

# Run training
main()

Model Selection

Select an architecture by keeping only the desired builder call active (comment out the others):

# Simple CNN (faster training)
model = build_model_v1_simple()

# Residual Network (balanced)
model = build_model_v2_residual()

# Attention Network (best performance)
model = build_model_v3_attention()

Custom Training

from lc25000_classifier import load_datasets, build_model_v3_attention, compile_model, get_callbacks

# Load data
train_ds, val_ds, test_ds = load_datasets(train_dir, test_dir)

# Build and compile model
model = build_model_v3_attention()
model = compile_model(model)

# Train
history = model.fit(
    train_ds,
    epochs=150,
    validation_data=val_ds,
    callbacks=get_callbacks()
)

Configuration

Key hyperparameters can be adjusted in the CONFIG dictionary:

CONFIG = {
    'image_size': (224, 224),
    'batch_size': 32,
    'epochs': 150,
    'initial_lr': 0.001,
    'weight_decay': 1e-4,
    'dropout_rate': 0.4,
    'num_classes': 5,
    'seed': 42
}

Training Details

Optimization Strategy

  • Optimizer: AdamW with weight decay
  • Initial Learning Rate: 0.001
  • Learning Rate Schedule: ReduceLROnPlateau (factor=0.5, patience=7)
  • Loss Function: Categorical Crossentropy with label smoothing (0.1)
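Putting these choices together, `compile_model` might look roughly like the sketch below (the actual implementation in the repository may differ in detail):

```python
import tensorflow as tf

def compile_model(model):
    """Illustrative compile step: AdamW + label-smoothed crossentropy."""
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-3,
                                            weight_decay=1e-4),
        loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1),
        metrics=[
            "accuracy",
            tf.keras.metrics.AUC(name="auc"),
            tf.keras.metrics.Precision(name="precision"),
            tf.keras.metrics.Recall(name="recall"),
        ],
    )
    return model
```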

Regularization Techniques

  • L2 weight regularization (1e-4)
  • Dropout (0.4 in classifier, 0.2-0.3 in feature extractor)
  • Batch normalization after each convolution
  • Label smoothing

Training Strategy

  • Early stopping (patience=15)
  • Model checkpointing (saves best model based on validation accuracy)
  • TensorBoard logging for monitoring
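A sketch of what `get_callbacks` might return, combining the early stopping, LR reduction, checkpointing, and TensorBoard logging described above (monitored quantities are assumptions; the repository's implementation may differ):

```python
import tensorflow as tf

def get_callbacks():
    """Illustrative callback list matching the documented training strategy."""
    return [
        tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                         patience=15,
                                         restore_best_weights=True),
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                             factor=0.5, patience=7),
        tf.keras.callbacks.ModelCheckpoint("lc25000_scratch_best.keras",
                                           monitor="val_accuracy",
                                           save_best_only=True),
        tf.keras.callbacks.TensorBoard(log_dir="./logs"),
    ]
```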

Performance Metrics

The model is evaluated using multiple metrics:

  • Accuracy: Overall classification accuracy
  • AUC: Area under the ROC curve (computed one-vs-rest across classes)
  • Precision: Positive predictive value
  • Recall: Sensitivity
  • Confusion Matrix: Detailed per-class performance
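Per-class evaluation over a `tf.data` test set could be computed as in the sketch below; `evaluate_confusion` is a hypothetical helper, assuming one-hot labels as produced by `image_dataset_from_directory(label_mode="categorical")`:

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import confusion_matrix, classification_report

def evaluate_confusion(model, test_ds, class_names):
    """Collect predictions over a batched dataset; return the confusion matrix."""
    y_true, y_pred = [], []
    for images, labels in test_ds:
        probs = model.predict(images, verbose=0)
        y_pred.extend(np.argmax(probs, axis=1))
        y_true.extend(np.argmax(labels.numpy(), axis=1))  # one-hot labels
    label_ids = list(range(len(class_names)))
    print(classification_report(y_true, y_pred, labels=label_ids,
                                target_names=class_names, zero_division=0))
    return confusion_matrix(y_true, y_pred, labels=label_ids)
```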

Model Output

Training produces the following artifacts:

  • lc25000_scratch_best.keras: Best model checkpoint
  • lc25000_scratch_final.keras: Final trained model
  • lc25000_scratch_weights.weights.h5: Model weights only
  • training_history.png: Visualization of training metrics
  • confusion_matrix.png: Confusion matrix heatmap
  • ./logs/: TensorBoard logs

Inference Example

import tensorflow as tf
import numpy as np

CLASS_NAMES = ['Colon Adenocarcinoma', 'Colon Benign Tissue',
               'Lung Adenocarcinoma', 'Lung Benign Tissue',
               'Lung Squamous Cell Carcinoma']

# Load model
model = tf.keras.models.load_model('lc25000_scratch_final.keras')

# Load and preprocess image
image = tf.keras.utils.load_img('path/to/image.png', target_size=(224, 224))
image_array = tf.keras.utils.img_to_array(image)
image_array = np.expand_dims(image_array, axis=0)

# Predict
predictions = model.predict(image_array)[0]
predicted_class = CLASS_NAMES[np.argmax(predictions)]

print(f"Predicted class: {predicted_class}")
print(f"Confidence: {np.max(predictions):.2%}")

Requirements

  • Python 3.8+
  • TensorFlow 2.13+
  • NumPy
  • Matplotlib
  • Seaborn
  • scikit-learn

Hardware Recommendations

  • Minimum: 8GB RAM, CPU training (slow)
  • Recommended: 16GB RAM, NVIDIA GPU with 8GB+ VRAM
  • Optimal: 32GB RAM, NVIDIA GPU with 16GB+ VRAM

Training time varies by architecture and hardware:

  • Simple CNN: ~2-4 hours (GPU)
  • Residual Network: ~4-6 hours (GPU)
  • Attention Network: ~6-10 hours (GPU)

Citation

If you use this implementation, please cite the LC25000 dataset:

@misc{lc25000,
  title    = {LC25000 Lung and Colon Histopathological Image Dataset},
  author   = {Borkowski, Andrew A. and Bui, Marilyn M. and Thomas, L. Brannon and Wilson, Catherine P. and DeLand, Lauren A. and Mastorides, Stephen M.},
  keywords = {cancer, histopathology},
  url      = {https://github.com/tampapath/lung_colon_image_set}
}

License

This implementation is provided for research and educational purposes. Please refer to the LC25000 dataset license for data usage terms.

Acknowledgments

  • LC25000 dataset creators for providing high-quality histopathological images
  • TensorFlow team for the deep learning framework
  • Medical imaging community for advancing computational pathology

Contact

For questions, issues, or contributions, please open an issue in the repository.


Note: This model is intended for research purposes only and should not be used for clinical diagnosis without proper validation and regulatory approval.
