---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: image-classification
library_name: transformers
tags:
- biology
- med
- chemistry
- code
---

---

# Multi-Cancer Lymphoma Classification with Convolutional Neural Networks (CNN)

## Overview

This repository contains an end-to-end deep learning pipeline written in **Python** with **TensorFlow** and **Keras** for automated classification of lymphoma subtypes from a multi-cancer dataset. It uses **Convolutional Neural Networks (CNNs)** for supervised classification of histopathological cancer images and is meant as a baseline for medical imaging analysis.

The pipeline encompasses:

* Data ingestion and preprocessing with **ImageDataGenerator**
* Training/validation split (90% / 10%)
* Definition and compilation of a deep CNN architecture
* Training with per-epoch validation monitoring
* Model persistence (`.h5` file format) for later inference
* Custom prediction utility with visualization

This repository is intended for **medical AI researchers**, **machine learning engineers**, and **healthcare data scientists** who want to apply convolutional neural networks to diagnostic support in oncology.

---

## Dataset Information

The dataset used in this project is located at:

```
/kaggle/input/multi-cancer/Multi Cancer/Multi Cancer/Lymphoma
```

This directory contains subfolders representing different classes of lymphoma and potentially other cancer subtypes. The **directory structure** is expected to be of the form:

```
Lymphoma/
├── Class_A/
│   ├── img_1.jpg
│   ├── img_2.jpg
│   └── ...
├── Class_B/
│   ├── img_3.jpg
│   └── ...
└── Class_C/
    ├── img_4.jpg
    └── ...
```

* Each subfolder corresponds to one diagnostic class.
* The data generator infers class labels automatically from these subdirectory names.

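
To make the label inference concrete, here is a minimal sketch of how folder names can be turned into class indices. It assumes the alphanumeric ordering that Keras' `flow_from_directory` uses; the helper name `infer_class_indices` is hypothetical, and the demo uses the placeholder class names from the tree above rather than the real dataset.

```python
import os
import tempfile


def infer_class_indices(data_dir):
    """Map each class subfolder name to an integer index, in sorted order."""
    class_names = sorted(
        entry for entry in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, entry))
    )
    return {name: index for index, name in enumerate(class_names)}


# Demo on a throwaway directory tree with placeholder class folders.
with tempfile.TemporaryDirectory() as root:
    for name in ("Class_B", "Class_A", "Class_C"):
        os.makedirs(os.path.join(root, name))
    class_indices = infer_class_indices(root)

print(class_indices)  # {'Class_A': 0, 'Class_B': 1, 'Class_C': 2}
```

The resulting dictionary has the same shape as `train_generator.class_indices`, which the prediction utility uses to turn an output index back into a class name.
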

---

## ⚙️ Dependencies

This project requires the following core dependencies:

* **Python 3.8+**
* **TensorFlow 2.x**
* **Keras (integrated with TensorFlow)**
* **NumPy**
* **Matplotlib**

To install dependencies:

```bash
pip install tensorflow numpy matplotlib
```

If running on Kaggle or Google Colab, these libraries are pre-installed.

---

## 🧩 Code Structure

The main script (`train.py` or notebook cell) is divided into logical sections:

1. **Imports**

   * Standard library (`os`)
   * Scientific libraries (`numpy`, `matplotlib`)
   * Deep learning libraries (`tensorflow`, `keras`, `layers`)

2. **Data Pipeline**

   * Data preprocessing with `ImageDataGenerator`
   * Automatic normalization of pixel intensities (`rescale=1./255`)
   * Splitting into training (90%) and validation (10%)

3. **Model Architecture**

   * A sequential CNN architecture with the following layers:

     * `Conv2D` (32 filters, 3×3 kernel, ReLU)
     * `MaxPooling2D` (2×2)
     * `Conv2D` (64 filters, ReLU)
     * `MaxPooling2D` (2×2)
     * `Conv2D` (128 filters, ReLU)
     * `MaxPooling2D` (2×2)
     * `Flatten`
     * `Dense` (512 units, ReLU)
     * `Dense` (softmax output for multi-class classification)

4. **Compilation**

   * Optimizer: **Adam**
   * Loss Function: **Categorical Crossentropy**
   * Metrics: **Accuracy**

5. **Training**

   * Training via `model.fit()`
   * `epochs=10`
   * Validation data monitoring

6. **Model Persistence**

   * Final trained model is saved as `model5.h5`

7. **Prediction Utility** (`guess()` function)

   * Takes an input image path
   * Resizes and normalizes the image
   * Runs a forward pass through the trained model
   * Outputs the predicted class with a corresponding visualization

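
The steps above can be sketched roughly as follows. This is an illustration assembled from the description in this README, not the repository's actual script: the function names (`make_generators`, `build_model`, `train`) are hypothetical, the 3×3 kernels for the second and third convolutions and `class_mode="categorical"` are assumptions, and the 150×150 input size and batch size of 32 are taken from the training settings listed in this card.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

DATA_DIR = "/kaggle/input/multi-cancer/Multi Cancer/Multi Cancer/Lymphoma"
IMAGE_SIZE = (150, 150)
BATCH_SIZE = 32


def make_generators(data_dir=DATA_DIR):
    """Return (train, validation) generators with a 90/10 split and 1/255 rescaling."""
    datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.1)
    common = dict(target_size=IMAGE_SIZE, batch_size=BATCH_SIZE, class_mode="categorical")
    train_gen = datagen.flow_from_directory(data_dir, subset="training", **common)
    val_gen = datagen.flow_from_directory(data_dir, subset="validation", **common)
    return train_gen, val_gen


def build_model(num_classes):
    """Three Conv2D/MaxPooling2D stages, then a 512-unit dense layer and a softmax output."""
    model = models.Sequential([
        layers.Input(shape=IMAGE_SIZE + (3,)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),   # kernel size assumed
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),  # kernel size assumed
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


def train(data_dir=DATA_DIR, epochs=10, output_path="model5.h5"):
    """Fit the baseline model and save it; returns the model and its training history."""
    train_gen, val_gen = make_generators(data_dir)
    model = build_model(train_gen.num_classes)
    history = model.fit(train_gen, epochs=epochs, validation_data=val_gen)
    model.save(output_path)
    return model, history
```

Nothing runs at import time; calling `train()` in a session where the dataset path exists would perform the full fit and write `model5.h5`.
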

---

## 🔬 Methodology

The approach relies on **supervised learning** using CNNs for image recognition.

* **Feature Extraction:** Convolutional and pooling layers learn hierarchical spatial representations of cancerous tissue patterns.
* **Classification:** Dense layers map these features to class probabilities.
* **Normalization:** All pixel values are rescaled to `[0, 1]` for more stable gradient descent.
* **Generalization:** A held-out validation set (10%) is used to monitor overfitting and to estimate out-of-sample performance.

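
To see how the convolution and pooling stages shrink the image before the dense layers take over, the feature-map sizes can be traced by hand. The sketch below assumes a 150×150 input, 3×3 kernels in every convolution, stride 1, and Keras' default `'valid'` (no padding) mode; the helper names are illustrative only.

```python
def conv_out(size, kernel=3):
    """Side length after an unpadded ('valid'), stride-1 convolution."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Side length after non-overlapping max pooling (floor division)."""
    return size // pool


side = 150
trace = [side]
for _ in range(3):  # three Conv2D + MaxPooling2D stages
    side = conv_out(side)
    trace.append(side)
    side = pool_out(side)
    trace.append(side)

flattened = side * side * 128          # last stage has 128 filters
dense_params = flattened * 512 + 512   # weights plus biases of the 512-unit layer

print(trace)         # [150, 148, 74, 72, 36, 34, 17]
print(flattened)     # 36992
print(dense_params)  # 18940416
```

Under these assumptions almost all of the model's parameters sit in the first dense layer, which is one reason regularization and transfer learning are natural next steps.
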

This is a **baseline model**, and can be extended with:

* **Data Augmentation** (rotation, zoom, shear, flips)
* **Transfer Learning** (e.g., VGG16, ResNet50, EfficientNet)
* **Regularization** (Dropout, L2 penalty)
* **Hyperparameter Optimization** (learning rate, batch size tuning)

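
As a rough sketch of the first and third extensions, the generator and the dense head could be configured as below. All numeric values (rotation of 15 degrees, 0.1 zoom and shear, dropout of 0.5, L2 weight of 1e-4) are illustrative guesses rather than tuned settings, and `regularized_head` is a hypothetical helper, not part of the repository.

```python
from tensorflow.keras import layers, regularizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmenting generator; every range here is an illustrative value.
augmented_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    zoom_range=0.1,
    shear_range=0.1,
    horizontal_flip=True,
    vertical_flip=True,
)


def regularized_head(num_classes, dropout_rate=0.5, l2_weight=1e-4):
    """Layers that could stand in for the baseline's two final Dense layers."""
    return [
        layers.Dense(512, activation="relu",
                     kernel_regularizer=regularizers.l2(l2_weight)),
        layers.Dropout(dropout_rate),
        layers.Dense(num_classes, activation="softmax"),
    ]
```

Augmentation is normally applied to training images only, so a separate generator that only rescales is the usual choice for the validation data.
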

---

## Training Performance

* **Epochs:** 10
* **Batch Size:** 32
* **Image Size:** 150×150 (RGB channels)
* **Optimizer:** Adam (adaptive learning rate)
* **Loss Function:** Categorical Crossentropy
* **Evaluation Metric:** Accuracy

Accuracy and loss are printed for each epoch during training and can be plotted afterwards, for example as training/validation accuracy and loss curves.

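
For readers who want to check what the reported loss and accuracy numbers mean, here is a small NumPy sketch of both quantities. The two samples are made up for illustration, and the clipping constant `eps` mirrors the kind of numerical safeguard Keras applies rather than reproducing its implementation.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))


def accuracy(y_true, y_pred):
    """Fraction of samples whose highest-probability class matches the label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))


# Two made-up samples over three classes.
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(round(loss, 4), acc)  # 0.7803 0.5
```

The first sample is classified correctly and the second is not, so accuracy is 0.5, while the loss averages `-ln(0.7)` and `-ln(0.3)`.
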

---

## 🧪 Inference Example

Using the custom `guess()` function:

```python
from tensorflow.keras.models import load_model

# Load the trained model saved by the training script
model = load_model("model5.h5")

# Predict on a new image. `guess()` and `train_generator` are defined in the
# training script and must already be in scope.
guess("example_image.jpg", model, train_generator.class_indices)
```

Expected Output:

* The image is displayed.
* The title above the image indicates the **predicted lymphoma subtype**.

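
The `guess()` function itself is not reproduced in this card. The sketch below shows only what its prediction step might look like, assuming the image has already been loaded and resized to a 150×150 RGB array; loading from disk and the Matplotlib display are left out. `predict_label` and `StubModel` are hypothetical names, and the stub stands in for the real model so the example runs without a trained `.h5` file.

```python
import numpy as np


def predict_label(image, model, class_indices):
    """Return the predicted class name for one HxWx3 uint8 image array."""
    # Scale to [0, 1] and add a batch dimension: shape (1, H, W, 3).
    batch = np.expand_dims(image.astype("float32") / 255.0, axis=0)
    probabilities = model.predict(batch)[0]
    # Invert the name -> index mapping so the argmax can be turned into a name.
    index_to_class = {index: name for name, index in class_indices.items()}
    return index_to_class[int(np.argmax(probabilities))]


class StubModel:
    """Hypothetical stand-in for the loaded Keras model; always favors class 1."""

    def predict(self, batch):
        return np.tile([0.1, 0.8, 0.1], (len(batch), 1))


image = np.zeros((150, 150, 3), dtype=np.uint8)
label = predict_label(image, StubModel(), {"Class_A": 0, "Class_B": 1, "Class_C": 2})
print(label)  # Class_B
```

With the real model, `class_indices` would be `train_generator.class_indices`, and the returned label would be used as the plot title.
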

---

## Applications

* **Medical Decision Support:** Assisting oncologists in rapid and preliminary diagnosis of lymphoma subtypes.
* **Research:** Benchmarking CNN performance on histopathological datasets.
* **Education:** Teaching medical students and engineers about AI applications in pathology.

⚠️ **Disclaimer:** This model is for **research and educational purposes only**. It is **not a substitute for professional medical diagnosis**. Clinical deployment requires extensive validation, regulatory approval, and rigorous testing.

---

## Future Improvements

1. Integrating **transfer learning** for improved accuracy.
2. Expanding dataset size and diversity.
3. Hyperparameter optimization with automated search tools.
4. Deploying as a web application (e.g., Flask, FastAPI, Streamlit).
5. Exporting to **TensorFlow Lite** or **ONNX** for mobile/edge deployment.

---

## Conclusion

This project demonstrates a reproducible, CNN-based baseline classifier for multi-cancer (lymphoma) image analysis. It provides a **solid foundation** for further work in AI-driven oncology research.

By following the modular design of this repository, researchers can:

* Reproduce experiments
* Extend the architecture
* Adapt the pipeline for other cancer datasets

This repository connects **machine learning engineering** with **medical research**, contributing towards a future where AI supports healthcare professionals in delivering faster and more reliable diagnoses.

---