---
license: apache-2.0
language:
- en
metrics:
- accuracy
- f1
- precision
pipeline_tag: image-classification
tags:
- histopathology
- lung
- colon
- cancer
---

# Histolab

## LC25000 Histopathology Classification

A custom CNN architecture optimized for histopathological image classification without pretrained weights. This implementation provides three model variants of increasing complexity and performance, designed specifically for the LC25000 dataset.

## Overview

This project implements custom convolutional neural networks that classify histopathological images into five categories. The models are trained from scratch, without transfer learning, demonstrating the effectiveness of carefully designed architectures for medical image analysis.

## Dataset

**LC25000 Lung and Colon Histopathological Image Dataset**

- Total images: 25,000 (5,000 per class)
- Image size: 768 x 768 pixels (resized to 224 x 224)
- Number of classes: 5
- Format: RGB histopathological images

### Classes

1. Colon Adenocarcinoma
2. Colon Benign Tissue
3. Lung Adenocarcinoma
4. Lung Benign Tissue
5. Lung Squamous Cell Carcinoma
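The training code expects a class-per-folder layout, as consumed by `tf.keras.utils.image_dataset_from_directory`. The folder names below are illustrative only; use whichever class directory names your own train/test split contains:

```
train/
├── colon_adenocarcinoma/
├── colon_benign_tissue/
├── lung_adenocarcinoma/
├── lung_benign_tissue/
└── lung_squamous_cell_carcinoma/
test/
└── (same five class folders)
```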

## Model Architectures

Three architectures are provided, each with a different complexity-performance tradeoff:

### Version 1: Simple CNN
- **Architecture**: Classic VGG-style sequential convolutions
- **Training Time**: Fast
- **Expected Accuracy**: 90-93%
- **Parameters**: ~15M
- **Best For**: Baseline experiments, quick iterations

### Version 2: Residual Network
- **Architecture**: ResNet-inspired with residual connections
- **Training Time**: Moderate
- **Expected Accuracy**: 92-95%
- **Parameters**: ~20M
- **Best For**: Balanced performance and training efficiency

### Version 3: Attention Network (Recommended)
- **Architecture**: Advanced design with Squeeze-and-Excitation blocks
- **Training Time**: Longer
- **Expected Accuracy**: 94-97%
- **Parameters**: ~25M
- **Best For**: Maximum performance
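Version 3's Squeeze-and-Excitation blocks can be sketched as follows. This is an illustrative Keras implementation, not necessarily the exact block inside `build_model_v3_attention`; in particular the `reduction=16` ratio is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=16):
    """Squeeze-and-Excitation: reweight channels using global context."""
    channels = x.shape[-1]
    # Squeeze: global average pool to one value per channel
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: bottleneck MLP producing per-channel gates in (0, 1)
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    # Scale: broadcast the gates over the spatial dimensions
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])

# Minimal demonstration on a dummy feature map
inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = se_block(inputs)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 56, 56, 64)
```

Placing a block like this after each convolutional stage lets the network recalibrate channel responses at negligible parameter cost.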

## Key Features

- **Data Augmentation**: Random flips, rotations, zoom, translation, contrast, and brightness adjustments
- **Regularization**: L2 weight decay, dropout, and label smoothing
- **Optimization**: AdamW optimizer with learning rate scheduling
- **Callbacks**: Early stopping, learning rate reduction on plateau, model checkpointing
- **Metrics**: Accuracy, AUC, precision, recall
- **Visualization**: Training history plots and confusion matrices
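The augmentation pipeline can be sketched with Keras preprocessing layers. The factors below are hypothetical placeholders, not the repository's exact values:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation stack; tune the factors for your data.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
    layers.RandomContrast(0.1),
    layers.RandomBrightness(0.1),
])

images = tf.random.uniform((4, 224, 224, 3))
# Augmentation layers are active only when training=True;
# at inference time they pass inputs through unchanged.
augmented = augment(images, training=True)
```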

## Installation

```bash
pip install "tensorflow>=2.13.0"
pip install numpy
pip install matplotlib
pip install seaborn
pip install scikit-learn
```

## Usage

### Basic Training

```python
from lc25000_classifier import main

# Update paths in main() function
train_directory = 'path/to/train'
test_directory = 'path/to/test'

# Run training
main()
```

### Model Selection

Choose an architecture by uncommenting the corresponding line:

```python
# Simple CNN (faster training)
# model = build_model_v1_simple()

# Residual Network (balanced)
# model = build_model_v2_residual()

# Attention Network (best performance)
model = build_model_v3_attention()
```

### Custom Training

```python
from lc25000_classifier import load_datasets, build_model_v3_attention, compile_model, get_callbacks

# Point these at your dataset split
train_dir = 'path/to/train'
test_dir = 'path/to/test'

# Load data
train_ds, val_ds, test_ds = load_datasets(train_dir, test_dir)

# Build and compile model
model = build_model_v3_attention()
model = compile_model(model)

# Train
history = model.fit(
    train_ds,
    epochs=150,
    validation_data=val_ds,
    callbacks=get_callbacks()
)
```

## Configuration

Key hyperparameters can be adjusted in the `CONFIG` dictionary:

```python
CONFIG = {
    'image_size': (224, 224),
    'batch_size': 32,
    'epochs': 150,
    'initial_lr': 0.001,
    'weight_decay': 1e-4,
    'dropout_rate': 0.4,
    'num_classes': 5,
    'seed': 42
}
```

## Training Details

### Optimization Strategy
- **Optimizer**: AdamW with weight decay
- **Initial Learning Rate**: 0.001
- **Learning Rate Schedule**: ReduceLROnPlateau (factor=0.5, patience=7)
- **Loss Function**: Categorical crossentropy with label smoothing (0.1)

### Regularization Techniques
- L2 weight regularization (1e-4)
- Dropout (0.4 in the classifier head, 0.2-0.3 in the feature extractor)
- Batch normalization after each convolution
- Label smoothing

### Training Strategy
- Early stopping (patience=15)
- Model checkpointing (saves the best model by validation accuracy)
- TensorBoard logging for monitoring
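With the hyperparameters stated above, `compile_model` and `get_callbacks` might look roughly like this sketch. The monitored quantities are assumptions; the checkpoint filename and log directory are taken from the Model Output section:

```python
import tensorflow as tf

def compile_model(model):
    # AdamW with the learning rate and weight decay listed above
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4),
        loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1),
        metrics=["accuracy", tf.keras.metrics.AUC(name="auc"),
                 tf.keras.metrics.Precision(name="precision"),
                 tf.keras.metrics.Recall(name="recall")],
    )
    return model

def get_callbacks():
    return [
        tf.keras.callbacks.EarlyStopping(
            monitor="val_accuracy", patience=15, restore_best_weights=True),
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", factor=0.5, patience=7),
        tf.keras.callbacks.ModelCheckpoint(
            "lc25000_scratch_best.keras", monitor="val_accuracy", save_best_only=True),
        tf.keras.callbacks.TensorBoard(log_dir="./logs"),
    ]
```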

## Performance Metrics

The model is evaluated with multiple metrics:
- **Accuracy**: Overall classification accuracy
- **AUC**: Area under the ROC curve (one-vs-rest per class)
- **Precision**: Positive predictive value
- **Recall**: Sensitivity
- **Confusion Matrix**: Detailed per-class performance
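Per-class metrics and the confusion matrix can be computed from test-set predictions with scikit-learn; the labels below are toy stand-ins for real model output:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score

# Toy integer labels standing in for argmax'd test-set predictions (5 classes)
y_true = np.array([0, 1, 2, 3, 4, 0, 1, 2])
y_pred = np.array([0, 1, 2, 3, 4, 1, 1, 2])

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted

print(f"accuracy={acc:.3f}")  # 7 of 8 correct -> accuracy=0.875
```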

## Model Output

Training produces the following artifacts:
- `lc25000_scratch_best.keras`: Best model checkpoint
- `lc25000_scratch_final.keras`: Final trained model
- `lc25000_scratch_weights.weights.h5`: Model weights only
- `training_history.png`: Training-metric curves
- `confusion_matrix.png`: Confusion matrix heatmap
- `./logs/`: TensorBoard logs

## Inference Example

```python
import tensorflow as tf
import numpy as np

# Class order must match the training dataset's label order
# (alphabetical by directory name when using image_dataset_from_directory)
CLASS_NAMES = [
    'Colon Adenocarcinoma', 'Colon Benign Tissue', 'Lung Adenocarcinoma',
    'Lung Benign Tissue', 'Lung Squamous Cell Carcinoma',
]

# Load model
model = tf.keras.models.load_model('lc25000_scratch_final.keras')

# Load and preprocess image
image = tf.keras.utils.load_img('path/to/image.png', target_size=(224, 224))
image_array = tf.keras.utils.img_to_array(image)
image_array = np.expand_dims(image_array, axis=0)

# Predict
predictions = model.predict(image_array)
predicted_class = CLASS_NAMES[int(np.argmax(predictions))]

print(f"Predicted class: {predicted_class}")
print(f"Confidence: {np.max(predictions):.2%}")
```

## Requirements

- Python 3.8+
- TensorFlow 2.13+
- NumPy
- Matplotlib
- Seaborn
- scikit-learn

## Hardware Recommendations

- **Minimum**: 8GB RAM, CPU training (slow)
- **Recommended**: 16GB RAM, NVIDIA GPU with 8GB+ VRAM
- **Optimal**: 32GB RAM, NVIDIA GPU with 16GB+ VRAM

Training time varies by architecture and hardware:
- Simple CNN: ~2-4 hours (GPU)
- Residual Network: ~4-6 hours (GPU)
- Attention Network: ~6-10 hours (GPU)

## Citation

If you use this implementation, please cite the LC25000 dataset:

```bibtex
@misc{borkowski2019lc25000,
  title    = {LC25000 Lung and Colon Histopathological Image Dataset},
  author   = {Borkowski, Andrew A. and Bui, Marilyn M. and Thomas, L. Brannon and
              Wilson, Catherine P. and DeLand, Lauren A. and Mastorides, Stephen M.},
  year     = {2019},
  keywords = {cancer, histopathology},
  url      = {https://github.com/tampapath/lung_colon_image_set}
}
```

## License

This implementation is provided for research and educational purposes. Please refer to the LC25000 dataset license for data usage terms.

## Acknowledgments

- LC25000 dataset creators for providing high-quality histopathological images
- TensorFlow team for the deep learning framework
- Medical imaging community for advancing computational pathology

## Contact

For questions, issues, or contributions, please open an issue in the repository.

---

**Note**: This model is intended for research purposes only and should not be used for clinical diagnosis without proper validation and regulatory approval.