# Histolab

## LC25000 Histopathology Classification

A custom CNN architecture for histopathological image classification, trained without pretrained weights. This implementation provides three model variants of increasing complexity and performance, designed specifically for the LC25000 dataset.
## Overview

This project implements custom convolutional neural networks for classifying histopathological images into five distinct categories. The models are trained from scratch without transfer learning, demonstrating the effectiveness of carefully designed architectures for medical image analysis.

## Dataset

**LC25000 Lung and Colon Histopathological Image Dataset**
- Total Images: 25,000
- Image Size: 768 x 768 pixels (resized to 224 x 224)
- Number of Classes: 5
- Format: RGB histopathological images
### Classes
- Colon Adenocarcinoma
- Colon Benign Tissue
- Lung Adenocarcinoma
- Lung Benign Tissue
- Lung Squamous Cell Carcinoma
## Model Architectures

Three distinct architectures are provided, each with different complexity-performance tradeoffs:

### Version 1: Simple CNN
- Architecture: Classic VGG-style sequential convolutions
- Training Time: Fast
- Expected Accuracy: 90-93%
- Parameters: ~15M
- Best For: Baseline experiments, quick iterations
### Version 2: Residual Network
- Architecture: ResNet-inspired with residual connections
- Training Time: Moderate
- Expected Accuracy: 92-95%
- Parameters: ~20M
- Best For: Balanced performance and training efficiency
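The exact layout of the residual variant is defined in the project code; as a rough illustration of the idea, a ResNet-style block in Keras (layer sizes and names here are assumptions, not the project's actual implementation) might look like:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Two 3x3 convolutions with a skip connection (illustrative sketch)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut with a 1x1 conv when the shape changes,
    # so the element-wise add is valid
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = residual_block(inputs, 64, stride=2)
model = tf.keras.Model(inputs, outputs)
```

The skip connection lets gradients flow past each pair of convolutions, which is what allows this variant to train deeper than the plain VGG-style baseline.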
### Version 3: Attention Network (Recommended)
- Architecture: Advanced design with Squeeze-and-Excitation blocks
- Training Time: Longer
- Expected Accuracy: 94-97%
- Parameters: ~25M
- Best For: Maximum performance
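A Squeeze-and-Excitation block, as named above, recalibrates channels by learning a per-channel gate. A minimal Keras sketch (the reduction ratio and placement in the network are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio=16):
    """Squeeze-and-Excitation: reweight channels by learned importance."""
    channels = x.shape[-1]
    # Squeeze: global average pool to one scalar per channel
    s = layers.GlobalAveragePooling2D()(x)
    # Excite: bottleneck MLP producing per-channel gates in (0, 1)
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    # Scale the original feature map channel-wise
    return layers.Multiply()([x, s])

x = tf.random.normal((2, 8, 8, 32))
y = se_block(x)
```

The block is cheap (two small dense layers) but lets the network emphasize stain- or texture-specific channels, which is plausibly why this variant reaches the highest accuracy.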
## Key Features
- Data Augmentation: Comprehensive augmentation pipeline including random flips, rotations, zoom, translation, contrast, and brightness adjustments
- Regularization: L2 weight decay, dropout, and label smoothing
- Optimization: AdamW optimizer with learning rate scheduling
- Callbacks: Early stopping, learning rate reduction on plateau, model checkpointing
- Metrics: Accuracy, AUC, Precision, Recall
- Visualization: Training history plots and confusion matrices
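The augmentation pipeline listed above can be sketched with Keras preprocessing layers; the exact factors used in this project may differ from the values shown here:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation stack matching the transforms listed above
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.1),        # up to +/- 10% of a full turn
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
    layers.RandomContrast(0.2),
    layers.RandomBrightness(0.2),
])

batch = tf.random.uniform((4, 224, 224, 3))
augmented = augmentation(batch, training=True)
```

These layers are active only when called with `training=True`, so validation and test batches pass through unchanged.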
## Installation

```bash
pip install "tensorflow>=2.13.0"
pip install numpy matplotlib seaborn scikit-learn
```

Note that the version specifier must be quoted, since an unquoted `>=` is interpreted by the shell as a redirection.
## Usage

### Basic Training

```python
from lc25000_classifier import main

# Update paths in the main() function before running
train_directory = 'path/to/train'
test_directory = 'path/to/test'

# Run training
main()
```
### Model Selection

Choose your desired architecture by uncommenting the appropriate line:

```python
# Simple CNN (faster training)
model = build_model_v1_simple()

# Residual Network (balanced)
# model = build_model_v2_residual()

# Attention Network (best performance)
# model = build_model_v3_attention()
```
### Custom Training

```python
from lc25000_classifier import (
    load_datasets, build_model_v3_attention, compile_model, get_callbacks
)

# Load data
train_ds, val_ds, test_ds = load_datasets(train_dir, test_dir)

# Build and compile model
model = build_model_v3_attention()
model = compile_model(model)

# Train
history = model.fit(
    train_ds,
    epochs=150,
    validation_data=val_ds,
    callbacks=get_callbacks(),
)
```
## Configuration

Key hyperparameters can be adjusted in the CONFIG dictionary:

```python
CONFIG = {
    'image_size': (224, 224),
    'batch_size': 32,
    'epochs': 150,
    'initial_lr': 0.001,
    'weight_decay': 1e-4,
    'dropout_rate': 0.4,
    'num_classes': 5,
    'seed': 42,
}
```
## Training Details

### Optimization Strategy
- Optimizer: AdamW with weight decay
- Initial Learning Rate: 0.001
- Learning Rate Schedule: ReduceLROnPlateau (factor=0.5, patience=7)
- Loss Function: Categorical Crossentropy with label smoothing (0.1)
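Putting these settings together, the compile step might look like the sketch below. The tiny `Sequential` model here is only a stand-in for the real `build_model_v*()` architectures, and the metric names are assumptions:

```python
import tensorflow as tf

# Stand-in model; the real ones come from build_model_v1/v2/v3
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])

model.compile(
    # AdamW applies decoupled weight decay, as listed above
    optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4),
    # Label smoothing of 0.1 softens the one-hot targets
    loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1),
    metrics=["accuracy",
             tf.keras.metrics.AUC(name="auc"),
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")],
)
```

Label smoothing keeps the network from becoming overconfident on any single class, which tends to help calibration on held-out histopathology slides.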
### Regularization Techniques
- L2 weight regularization (1e-4)
- Dropout (0.4 in classifier, 0.2-0.3 in feature extractor)
- Batch normalization after each convolution
- Label smoothing
### Training Strategy
- Early stopping (patience=15)
- Model checkpointing (saves best model based on validation accuracy)
- TensorBoard logging for monitoring
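A `get_callbacks()` that matches the settings listed in this section could be sketched as follows; the monitored quantities and file names mirror the rest of this README, but the exact defaults in the project code may differ:

```python
import tensorflow as tf

def get_callbacks(checkpoint_path="lc25000_scratch_best.keras"):
    """Early stopping, LR reduction, checkpointing, and TensorBoard logging."""
    return [
        # Stop after 15 epochs without validation-accuracy improvement
        tf.keras.callbacks.EarlyStopping(
            monitor="val_accuracy", patience=15, restore_best_weights=True),
        # Halve the learning rate after 7 stagnant epochs
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", factor=0.5, patience=7, min_lr=1e-6),
        # Keep only the best model by validation accuracy
        tf.keras.callbacks.ModelCheckpoint(
            checkpoint_path, monitor="val_accuracy", save_best_only=True),
        # Log scalars for TensorBoard monitoring
        tf.keras.callbacks.TensorBoard(log_dir="./logs"),
    ]

callbacks = get_callbacks()
```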
## Performance Metrics

The model is evaluated using multiple metrics:

- Accuracy: Overall classification accuracy
- AUC: Area under the ROC curve, computed per class (one-vs-rest)
- Precision: Positive predictive value
- Recall: Sensitivity
- Confusion Matrix: Detailed per-class performance
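Per-class performance can be computed from predicted and true label indices with scikit-learn; the arrays below are hypothetical stand-ins for real model outputs:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true and predicted class indices (0-4 = the five classes)
y_true = np.array([0, 1, 2, 2, 3, 4])
y_pred = np.array([0, 1, 2, 1, 3, 4])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4])
print(cm)
print(classification_report(y_true, y_pred, zero_division=0))
```

The diagonal of `cm` counts correct predictions per class; off-diagonal cells reveal which tissue types the model confuses, which is usually more informative than overall accuracy for medical data.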
## Model Output

Training produces the following artifacts:

- `lc25000_scratch_best.keras`: Best model checkpoint
- `lc25000_scratch_final.keras`: Final trained model
- `lc25000_scratch_weights.weights.h5`: Model weights only
- `training_history.png`: Visualization of training metrics
- `confusion_matrix.png`: Confusion matrix heatmap
- `./logs/`: TensorBoard logs
## Inference Example

```python
import numpy as np
import tensorflow as tf

# Class order must match the label order used during training
CLASS_NAMES = ['Colon Adenocarcinoma', 'Colon Benign Tissue',
               'Lung Adenocarcinoma', 'Lung Benign Tissue',
               'Lung Squamous Cell Carcinoma']

# Load model
model = tf.keras.models.load_model('lc25000_scratch_final.keras')

# Load and preprocess image
image = tf.keras.utils.load_img('path/to/image.png', target_size=(224, 224))
image_array = tf.keras.utils.img_to_array(image)
image_array = np.expand_dims(image_array, axis=0)

# Predict
predictions = model.predict(image_array)
predicted_class = CLASS_NAMES[int(np.argmax(predictions))]
print(f"Predicted class: {predicted_class}")
print(f"Confidence: {np.max(predictions):.2%}")
```
## Requirements
- Python 3.8+
- TensorFlow 2.13+
- NumPy
- Matplotlib
- Seaborn
- scikit-learn
## Hardware Recommendations
- Minimum: 8GB RAM, CPU training (slow)
- Recommended: 16GB RAM, NVIDIA GPU with 8GB+ VRAM
- Optimal: 32GB RAM, NVIDIA GPU with 16GB+ VRAM
Training time varies by architecture and hardware:
- Simple CNN: ~2-4 hours (GPU)
- Residual Network: ~4-6 hours (GPU)
- Attention Network: ~6-10 hours (GPU)
## Citation

If you use this implementation, please cite the LC25000 dataset:

```bibtex
@article{borkowski2019lc25000,
  title    = {LC25000 Lung and colon histopathological image dataset},
  author   = {Borkowski, Andrew A. and Bui, Marilyn M. and Thomas, L. Brannon and Wilson, Catherine P. and DeLand, Lauren A. and Mastorides, Stephen M.},
  year     = {2019},
  keywords = {cancer, histopathology},
  url      = {https://github.com/tampapath/lung_colon_image_set}
}
```
## License
This implementation is provided for research and educational purposes. Please refer to the LC25000 dataset license for data usage terms.
## Acknowledgments
- LC25000 dataset creators for providing high-quality histopathological images
- TensorFlow team for the deep learning framework
- Medical imaging community for advancing computational pathology
## Contact

For questions, issues, or contributions, please open an issue in the repository.

**Note:** This model is intended for research purposes only and should not be used for clinical diagnosis without proper validation and regulatory approval.