---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: image-classification
library_name: transformers
tags:
- biology
- med
- chemistry
- code
---
---
# Multi-Cancer Lymphoma Classification with Convolutional Neural Networks (CNN)
## 📌 Overview
This repository contains an end-to-end deep learning pipeline, written in **Python** with **TensorFlow** and **Keras**, that classifies lymphoma subtypes using the Lymphoma subset of a multi-cancer image dataset. A **Convolutional Neural Network (CNN)** is trained with supervised learning on histopathological images and is intended as a baseline for medical imaging analysis.
The pipeline encompasses:
* Data ingestion and preprocessing with **ImageDataGenerator**
* Training/validation split (augmentation is listed under Methodology as a possible extension)
* Definition and compilation of a deep CNN architecture
* Training with real-time performance evaluation
* Model persistence (`.h5` file format) for later inference
* Custom prediction utility with visualization
This repository is intended for **medical AI researchers**, **machine learning engineers**, and **healthcare data scientists** who seek to apply convolutional neural networks for diagnostic support in oncology.
---
## 📂 Dataset Information
The dataset used in this project is located at:
```
/kaggle/input/multi-cancer/Multi Cancer/Multi Cancer/Lymphoma
```
This directory contains subfolders representing different classes of lymphoma and potentially other cancer subtypes. The **directory structure** is expected to be of the form:
```
Lymphoma/
├── Class_A/
│   ├── img_1.jpg
│   ├── img_2.jpg
│   └── ...
├── Class_B/
│   ├── img_3.jpg
│   └── ...
└── Class_C/
    ├── img_4.jpg
    └── ...
```
* Each subfolder corresponds to one diagnostic class.
* The model automatically infers class labels from these subdirectories.
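
Before training, it can help to check which classes the folder layout actually yields. The sketch below uses only the standard library; the function name `summarize_dataset` and the extension list are assumptions made for illustration, not part of the original script. It mirrors the alphanumeric ordering that Keras `flow_from_directory` uses when it maps subfolder names to label indices.

```python
from pathlib import Path

# Hypothetical list of extensions to count; adjust to the files in your copy.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def summarize_dataset(root):
    """Return (class_indices, counts) for a class-per-subfolder dataset.

    class_indices maps each subfolder name to an integer label, assigned in
    sorted (alphanumeric) order of the folder names.
    counts maps each subfolder name to the number of image files directly
    inside it (nested folders are not searched).
    """
    root = Path(root)
    class_dirs = sorted(
        (p for p in root.iterdir() if p.is_dir()), key=lambda p: p.name
    )
    class_indices = {d.name: i for i, d in enumerate(class_dirs)}
    counts = {
        d.name: sum(
            1
            for f in d.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
        for d in class_dirs
    }
    return class_indices, counts
```

Calling `summarize_dataset` on the dataset path above returns the label mapping and per-class image counts, which makes class imbalance visible before any training starts.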
---
## ⚙️ Dependencies
This project requires the following core dependencies:
* **Python 3.8+**
* **TensorFlow 2.x**
* **Keras (integrated with TensorFlow)**
* **NumPy**
* **Matplotlib**
To install dependencies:
```bash
pip install tensorflow numpy matplotlib
```
On Kaggle or Google Colab, these libraries are typically pre-installed.
---
## 🧩 Code Structure
The main script (`train.py` or notebook cell) is divided into logical sections:
1. **Imports**
   * Standard libraries (`os`, `numpy`)
   * Scientific libraries (`matplotlib`)
   * Deep learning libraries (`tensorflow`, `keras`, `layers`)
2. **Data Pipeline**
   * Data preprocessing with `ImageDataGenerator`
   * Automatic normalization of pixel intensities (`rescale=1./255`)
   * Splitting into training (90%) and validation (10%)
3. **Model Architecture**
   * A sequential CNN architecture with the following layers:
     * `Conv2D` (32 filters, 3×3 kernel, ReLU)
     * `MaxPooling2D` (2×2)
     * `Conv2D` (64 filters, ReLU)
     * `MaxPooling2D` (2×2)
     * `Conv2D` (128 filters, ReLU)
     * `MaxPooling2D` (2×2)
     * `Flatten`
     * `Dense` (512 units, ReLU)
     * `Dense` (softmax output for multi-class classification)
4. **Compilation**
   * Optimizer: **Adam**
   * Loss Function: **Categorical Crossentropy**
   * Metrics: **Accuracy**
5. **Training**
   * Training via `model.fit()`
   * `epochs=10`
   * Validation data monitoring
6. **Model Persistence**
   * Final trained model is saved as `model5.h5`
7. **Prediction Utility** (`guess()` function)
   * Takes an input image path
   * Resizes and normalizes the image
   * Performs forward propagation using the trained model
   * Outputs the predicted class with corresponding visualization
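
Steps 2 through 6 could be sketched roughly as follows. This is a reconstruction from the description above, not the repository's actual script. The function names (`make_generators`, `build_model`, `train`), the `data_dir` argument, and the 3×3 kernel size for the second and third convolutions are assumptions; the image size (150×150) and batch size (32) are taken from the Training Performance section.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (150, 150)   # from the Training Performance section
BATCH_SIZE = 32         # from the Training Performance section


def make_generators(data_dir):
    """Build training (90%) and validation (10%) generators with [0, 1] scaling."""
    datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.1)
    common = dict(
        target_size=IMG_SIZE, batch_size=BATCH_SIZE, class_mode="categorical"
    )
    train_gen = datagen.flow_from_directory(data_dir, subset="training", **common)
    val_gen = datagen.flow_from_directory(data_dir, subset="validation", **common)
    return train_gen, val_gen


def build_model(num_classes):
    """Three Conv2D/MaxPooling2D stages followed by two dense layers."""
    model = keras.Sequential([
        keras.Input(shape=IMG_SIZE + (3,)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        # Kernel size for the next two Conv2D layers is assumed to be 3x3.
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def train(data_dir, epochs=10, out_path="model5.h5"):
    """Fit the model, save it, and return what inference needs later."""
    train_gen, val_gen = make_generators(data_dir)
    model = build_model(train_gen.num_classes)
    history = model.fit(train_gen, validation_data=val_gen, epochs=epochs)
    model.save(out_path)
    return model, history, train_gen.class_indices
```

Returning `class_indices` alongside the model keeps the label mapping available for the prediction utility without rebuilding the generator.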
---
## 🔬 Methodology
The approach relies on **supervised learning** using CNNs for image recognition.
* **Feature Extraction:** Convolutional and pooling layers learn hierarchical spatial representations of cancerous tissue patterns.
* **Classification:** Dense layers map these features into probabilistic class predictions.
* **Normalization:** All images are rescaled to `[0,1]` for stable gradient descent.
* **Generalization:** A held-out validation set (10%) is used to monitor overfitting. It gives an estimate of out-of-sample performance, not a guarantee.
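
To make the feature-extraction step concrete, the following sketch traces how the spatial size shrinks through the convolution and pooling stages and where the parameters end up. It is plain arithmetic, not the training code, and it assumes Keras defaults that the description does not state explicitly: stride 1, `valid` padding, and 3×3 kernels in all three convolutions. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Width/height after a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def pool_out(size, window=2):
    """Width/height after non-overlapping max pooling (floor division)."""
    return size // window


def trace_cnn(size=150, channels=3, filters=(32, 64, 128), kernel=3,
              dense_units=512):
    """Return (shapes, flat_features, conv_params, dense_params)."""
    shapes = [(size, size, channels)]
    conv_params = 0
    for n_filters in filters:
        # weights (kernel * kernel * in_channels * out_channels) + biases
        conv_params += kernel * kernel * channels * n_filters + n_filters
        size = conv_out(size, kernel)
        shapes.append((size, size, n_filters))
        size = pool_out(size)
        shapes.append((size, size, n_filters))
        channels = n_filters
    flat = size * size * channels
    dense_params = flat * dense_units + dense_units
    return shapes, flat, conv_params, dense_params


if __name__ == "__main__":
    shapes, flat, conv_params, dense_params = trace_cnn()
    print(shapes[-1])     # (17, 17, 128)
    print(flat)           # 36992
    print(conv_params)    # 93248
    print(dense_params)   # 18940416
```

Under these assumptions, almost all trainable parameters sit in the first dense layer rather than in the convolutions. That is one reason the regularization and transfer-learning extensions listed here are worth considering.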
This is a **baseline model**, and can be extended with:
* **Data Augmentation** (rotation, zoom, shear, flips)
* **Transfer Learning** (e.g., VGG16, ResNet50, EfficientNet)
* **Regularization** (Dropout, L2 penalty)
* **Hyperparameter Optimization** (learning rate, batch size tuning)
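
As an illustration of the first extension only, augmentation could be added along these lines. The specific ranges are placeholder values rather than tuned settings, and the function names are hypothetical. Two generators are used so that random transforms are applied to training images but not to validation images.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator


def make_datagens(validation_split=0.1):
    """Return (train_datagen, val_datagen); only the first one augments."""
    train_datagen = ImageDataGenerator(
        rescale=1.0 / 255,
        rotation_range=20,       # placeholder value
        zoom_range=0.2,          # placeholder value
        shear_range=0.2,         # placeholder value
        horizontal_flip=True,
        vertical_flip=True,
        validation_split=validation_split,
    )
    val_datagen = ImageDataGenerator(
        rescale=1.0 / 255, validation_split=validation_split
    )
    return train_datagen, val_datagen


def make_augmented_generators(data_dir, target_size=(150, 150), batch_size=32):
    """Augmented training generator plus an unaugmented validation generator."""
    train_datagen, val_datagen = make_datagens()
    common = dict(
        target_size=target_size, batch_size=batch_size, class_mode="categorical"
    )
    train_gen = train_datagen.flow_from_directory(
        data_dir, subset="training", **common
    )
    val_gen = val_datagen.flow_from_directory(
        data_dir, subset="validation", shuffle=False, **common
    )
    return train_gen, val_gen
```

Both generators share the same `validation_split`. Keras derives the split from the sorted file list rather than from random sampling, so the two subsets should not overlap, though this is worth confirming on the installed version.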
---
## 📊 Training Performance
* **Epochs:** 10
* **Batch Size:** 32
* **Image Size:** 150×150 (RGB channels)
* **Optimizer:** Adam (adaptive learning rate)
* **Loss Function:** Categorical Crossentropy
* **Evaluation Metric:** Accuracy
Keras prints loss and accuracy for each epoch during training. The values returned by `model.fit()` can then be plotted as training/validation accuracy and loss curves.
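
One possible way to inspect those curves is sketched below. It expects the dictionary stored in `History.history` by `model.fit()`, with the keys Keras uses when the model is compiled with `metrics=["accuracy"]`. The function names are assumptions, and no results from this repository are implied.

```python
def summarize_history(history):
    """Summarize a Keras-style history dict.

    Expects the keys 'accuracy' and 'val_accuracy'. Returns the epoch
    (1-based) with the best validation accuracy, that accuracy, and the
    train/validation accuracy gap at the final epoch.
    """
    val_acc = history["val_accuracy"]
    best = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return {
        "best_epoch": best + 1,
        "best_val_accuracy": val_acc[best],
        "final_gap": history["accuracy"][-1] - val_acc[-1],
    }


def plot_history(history):
    """Plot accuracy and loss curves side by side and return the figure."""
    # Imported here so summarize_history() works without Matplotlib installed.
    import matplotlib.pyplot as plt

    epochs = range(1, len(history["loss"]) + 1)
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

    ax_acc.plot(epochs, history["accuracy"], label="train")
    ax_acc.plot(epochs, history["val_accuracy"], label="validation")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()

    ax_loss.plot(epochs, history["loss"], label="train")
    ax_loss.plot(epochs, history["val_loss"], label="validation")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_loss.legend()

    fig.tight_layout()
    return fig
```

A large positive `final_gap`, or a validation loss that rises while training loss keeps falling, is the usual sign of overfitting in this setup.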
---
## 🧪 Inference Example
Using the custom `guess()` function:
```python
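from tensorflow.keras.models import load_model

# Load the model saved at the end of training
model = load_model("model5.h5")

# Predict on a new image.
# `guess` and `train_generator` come from the training script, so both must
# already be defined in the current session.
guess("example_image.jpg", model, train_generator.class_indices)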
Expected Output:
* The image is displayed.
* The title above the image indicates the **predicted lymphoma subtype**.
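
The body of `guess()` is not shown in this README. A minimal sketch consistent with the description (resize, normalize, predict, display) might look like the following. The helper `decode_prediction`, the `target_size` default, and the return value are assumptions. In recent TensorFlow 2.x releases `load_img` and `img_to_array` live in `tensorflow.keras.utils`; older releases expose them under `tensorflow.keras.preprocessing.image`.

```python
def decode_prediction(probs, class_indices):
    """Map a probability vector to (class_name, probability).

    class_indices is the {class_name: index} dict exposed by the training
    generator; it is inverted here to look up the winning index.
    """
    index_to_class = {index: name for name, index in class_indices.items()}
    best = max(range(len(probs)), key=lambda i: probs[i])
    return index_to_class[best], float(probs[best])


def guess(image_path, model, class_indices, target_size=(150, 150)):
    """Predict the class of one image and show it with the label as title."""
    # Imported here so decode_prediction() stays usable without these packages.
    import numpy as np
    import matplotlib.pyplot as plt
    from tensorflow.keras.utils import img_to_array, load_img

    img = load_img(image_path, target_size=target_size)
    x = img_to_array(img) / 255.0            # same [0, 1] scaling as training
    probs = model.predict(np.expand_dims(x, axis=0))[0]
    label, confidence = decode_prediction(probs, class_indices)

    plt.imshow(img)
    plt.title(f"{label} ({confidence:.1%})")
    plt.axis("off")
    plt.show()
    return label, confidence
```

The input must be resized and scaled the same way as the training images; otherwise the softmax output is not meaningful.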
---
## 📌 Applications
* **Medical Decision Support:** Assisting oncologists in rapid and preliminary diagnosis of lymphoma subtypes.
* **Research:** Benchmarking CNN performance on histopathological datasets.
* **Education:** Teaching medical students and engineers about AI applications in pathology.
⚠️ **Disclaimer:** This model is for **research and educational purposes only**. It is **not a substitute for professional medical diagnosis**. Clinical deployment requires extensive validation, regulatory approval, and rigorous testing.
---
## 🚀 Future Improvements
1. Integrating **transfer learning** for improved accuracy.
2. Expanding dataset size and diversity.
3. Hyperparameter optimization with automated search tools.
4. Deploying as a web application (e.g., Flask, FastAPI, Streamlit).
5. Exporting to **TensorFlow Lite** or **ONNX** for mobile/edge deployment.
---
## 🏆 Conclusion
This project demonstrates a compact, reproducible CNN baseline for classifying lymphoma images from a multi-cancer dataset. It provides a **solid foundation** for further work in AI-driven oncology research.
By following the modular design of this repository, researchers can:
* Reproduce experiments
* Extend the architecture
* Adapt the pipeline for other cancer datasets
This repository bridges the gap between **machine learning engineering** and **medical research**, contributing towards a future where AI supports healthcare professionals in delivering faster, more accurate, and more reliable diagnoses.
---