---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: image-classification
library_name: keras
tags:
- biology
- med
- chemistry
- code
---


---

# Multi-Cancer Lymphoma Classification with Convolutional Neural Networks (CNN)

## 📌 Overview

This repository contains an end-to-end deep learning pipeline, written in **Python** with **TensorFlow** and **Keras**, for automated classification of lymphoma subtypes from the Lymphoma subset of a multi-cancer dataset. It uses **Convolutional Neural Networks (CNNs)** for supervised classification of histopathological images and is intended as a baseline for medical imaging analysis.

The pipeline encompasses:

* Data ingestion and preprocessing with **ImageDataGenerator**
* Training/validation split (90% / 10%)
* Definition and compilation of a deep CNN architecture
* Training with real-time performance evaluation
* Model persistence (`.h5` file format) for later inference
* Custom prediction utility with visualization

This repository is intended for **medical AI researchers**, **machine learning engineers**, and **healthcare data scientists** who seek to apply convolutional neural networks for diagnostic support in oncology.

---

## 📂 Dataset Information

The dataset used in this project is located at:

```
/kaggle/input/multi-cancer/Multi Cancer/Multi Cancer/Lymphoma
```

This directory contains subfolders representing different classes of lymphoma and potentially other cancer subtypes. The **directory structure** is expected to be of the form:

```
Lymphoma/
    ├── Class_A/
    │   ├── img_1.jpg
    │   ├── img_2.jpg
    │   └── ...
    ├── Class_B/
    │   ├── img_3.jpg
    │   └── ...
    └── Class_C/
        ├── img_4.jpg
        └── ...
```

* Each subfolder corresponds to one diagnostic class.
* The data generator infers class labels automatically from the names of these subdirectories.
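
Before training, it can help to confirm that the folder layout matches the structure above and to see how balanced the classes are. The following is a minimal sketch using only the standard library; the helper name `count_images_per_class` and the list of accepted file extensions are assumptions made for illustration, not part of the original script.

```python
import os

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def count_images_per_class(root_dir):
    """Return {class_name: image_count} for each subfolder of root_dir."""
    counts = {}
    for entry in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, entry)
        if not os.path.isdir(class_dir):
            continue  # skip stray files next to the class folders
        counts[entry] = sum(
            1
            for name in os.listdir(class_dir)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts


# Example call (path taken from the section above):
# print(count_images_per_class(
#     "/kaggle/input/multi-cancer/Multi Cancer/Multi Cancer/Lymphoma"))
```

A strongly imbalanced result would suggest looking beyond plain accuracy when evaluating the model.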

---

## ⚙️ Dependencies

This project requires the following core dependencies:

* **Python 3.8+**
* **TensorFlow 2.x**
* **Keras (integrated with TensorFlow)**
* **NumPy**
* **Matplotlib**

To install dependencies:

```bash
pip install tensorflow numpy matplotlib
```

If you are running on Kaggle or Google Colab, these libraries are typically pre-installed.

---

## 🧩 Code Structure

The main script (`train.py` or notebook cell) is divided into logical sections:

1. **Imports**

   * Standard libraries (`os`, `numpy`)
   * Scientific libraries (`matplotlib`)
   * Deep learning libraries (`tensorflow`, `keras`, `layers`)

2. **Data Pipeline**

   * Data preprocessing with `ImageDataGenerator`
   * Automatic normalization of pixel intensities (`rescale=1./255`)
   * Splitting into training (90%) and validation (10%)

3. **Model Architecture**

   * A sequential CNN architecture with the following layers:

     * `Conv2D` (32 filters, 3×3 kernel, ReLU)
     * `MaxPooling2D` (2×2)
     * `Conv2D` (64 filters, ReLU)
     * `MaxPooling2D` (2×2)
     * `Conv2D` (128 filters, ReLU)
     * `MaxPooling2D` (2×2)
     * `Flatten`
     * `Dense` (512 units, ReLU)
     * `Dense` (softmax output for multi-class classification)

4. **Compilation**

   * Optimizer: **Adam**
   * Loss Function: **Categorical Crossentropy**
   * Metrics: **Accuracy**

5. **Training**

   * Training via `model.fit()`
   * `epochs=10`
   * Validation data monitoring

6. **Model Persistence**

   * Final trained model is saved as `model5.h5`

7. **Prediction Utility** (`guess()` function)

   * Takes an input image path
   * Resizes and normalizes the image
   * Performs forward propagation using the trained model
   * Outputs the predicted class with corresponding visualization
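
The data pipeline, architecture, and compilation steps listed above can be sketched as follows. This is an illustrative reconstruction, not the original script: the function names `make_generators` and `build_model` are hypothetical, the use of `flow_from_directory` with `class_mode="categorical"` is assumed from the directory layout and the loss function, and the 3×3 kernel size for the second and third `Conv2D` layers is assumed because the list only states it for the first.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (150, 150)   # image size stated in this README
BATCH_SIZE = 32         # batch size stated in this README


def make_generators(data_dir):
    """Build training/validation iterators with a 90/10 split (hypothetical helper)."""
    datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.1)
    common = dict(
        target_size=IMG_SIZE,
        batch_size=BATCH_SIZE,
        class_mode="categorical",  # one-hot labels for categorical crossentropy
    )
    train_gen = datagen.flow_from_directory(data_dir, subset="training", **common)
    val_gen = datagen.flow_from_directory(data_dir, subset="validation", **common)
    return train_gen, val_gen


def build_model(num_classes, input_shape=(150, 150, 3)):
    """Sequential CNN following the layer list above (hypothetical helper)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),   # 3x3 kernel is an assumption
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),  # 3x3 kernel is an assumption
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Typical usage, assuming DATA_DIR points at the Lymphoma folder:
# train_gen, val_gen = make_generators(DATA_DIR)
# model = build_model(num_classes=train_gen.num_classes)
# history = model.fit(train_gen, validation_data=val_gen, epochs=10)
# model.save("model5.h5")
```

Taking the number of output units from `train_gen.num_classes` keeps the softmax layer in step with however many class folders the dataset contains.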

---

## 🔬 Methodology

The approach relies on **supervised learning** using CNNs for image recognition.

* **Feature Extraction:** Convolutional and pooling layers learn hierarchical spatial representations of cancerous tissue patterns.
* **Classification:** Dense layers map these features into probabilistic class predictions.
* **Normalization:** All images are rescaled to `[0,1]` for stable gradient descent.
* **Generalization:** A held-out validation set (10%) is used to monitor overfitting and to estimate performance on unseen images.
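
To make the feature-extraction step more concrete, the short calculation below traces how a 150×150 input shrinks through three convolution and pooling stages. It is a sketch under stated assumptions: every convolution uses a 3×3 kernel with stride 1 and no padding (the Keras `Conv2D` defaults), and every pooling layer is 2×2. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output width of a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output width of non-overlapping max pooling (floor division)."""
    return size // pool


def trace_feature_map(size=150, filters=(32, 64, 128)):
    """Return the (height, width, channels) shape after each conv + pool stage."""
    shapes = []
    for n_filters in filters:
        size = pool_out(conv_out(size))
        shapes.append((size, size, n_filters))
    return shapes


shapes = trace_feature_map()
height, width, channels = shapes[-1]
flat_features = height * width * channels

print(shapes)         # [(74, 74, 32), (36, 36, 64), (17, 17, 128)]
print(flat_features)  # 36992
```

Under these assumptions the `Flatten` layer hands 36,992 values to the 512-unit dense layer, which is why that layer holds most of the model's parameters.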

This is a **baseline model**, and can be extended with:

* **Data Augmentation** (rotation, zoom, shear, flips)
* **Transfer Learning** (e.g., VGG16, ResNet50, EfficientNet)
* **Regularization** (Dropout, L2 penalty)
* **Hyperparameter Optimization** (learning rate, batch size tuning)
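
As one example of these extensions, augmentation can be added through `ImageDataGenerator` arguments. The sketch below is an assumption about how this might be wired up, not part of the baseline: the helper name `make_augmented_datagens` and the specific ranges are illustrative. It uses two generators so that random transformations apply to training images only, while validation images are just rescaled.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator


def make_augmented_datagens(validation_split=0.1):
    """Return (train_datagen, val_datagen); only the first one augments."""
    train_datagen = ImageDataGenerator(
        rescale=1.0 / 255,
        rotation_range=20,       # illustrative value, in degrees
        zoom_range=0.2,          # illustrative value
        shear_range=0.2,         # illustrative value
        horizontal_flip=True,
        vertical_flip=True,
        validation_split=validation_split,
    )
    # Validation images are rescaled but never randomly transformed.
    val_datagen = ImageDataGenerator(
        rescale=1.0 / 255,
        validation_split=validation_split,
    )
    return train_datagen, val_datagen


# Usage, assuming DATA_DIR points at the Lymphoma folder:
# train_datagen, val_datagen = make_augmented_datagens()
# train_gen = train_datagen.flow_from_directory(
#     DATA_DIR, target_size=(150, 150), batch_size=32,
#     class_mode="categorical", subset="training")
# val_gen = val_datagen.flow_from_directory(
#     DATA_DIR, target_size=(150, 150), batch_size=32,
#     class_mode="categorical", subset="validation")
```

Both generators must use the same `validation_split` so that the training and validation subsets do not overlap. The random affine transformations rely on SciPy being installed.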

---

## 📊 Training Performance

* **Epochs:** 10
* **Batch Size:** 32
* **Image Size:** 150×150 (RGB channels)
* **Optimizer:** Adam (adaptive learning rate)
* **Loss Function:** Categorical Crossentropy
* **Evaluation Metric:** Accuracy

Keras prints training and validation accuracy and loss after each epoch. The same values can be plotted as accuracy and loss curves.
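
The curves can be drawn from the `History` object returned by `model.fit()`. The following is a minimal sketch; the helper name `plot_history` is hypothetical, and the keys `accuracy` and `val_accuracy` are assumed to match a model compiled with `metrics=["accuracy"]` in TensorFlow 2.x.

```python
import matplotlib.pyplot as plt


def plot_history(history_dict):
    """Plot accuracy and loss curves from a Keras `History.history` dict."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
    epochs = range(1, len(history_dict["loss"]) + 1)

    for ax, key, title in ((ax_acc, "accuracy", "Accuracy"),
                           (ax_loss, "loss", "Loss")):
        ax.plot(epochs, history_dict[key], label="training")
        val_key = "val_" + key
        if val_key in history_dict:  # present only if validation data was given
            ax.plot(epochs, history_dict[val_key], label="validation")
        ax.set_xlabel("Epoch")
        ax.set_ylabel(title)
        ax.legend()

    fig.tight_layout()
    return fig


# Usage after training:
# history = model.fit(train_gen, validation_data=val_gen, epochs=10)
# plot_history(history.history)
# plt.show()
```

A widening gap between the training and validation curves is the usual sign of overfitting.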

---

## 🧪 Inference Example

Using the custom `guess()` function:

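```python
from tensorflow.keras.models import load_model

# Load the trained model saved at the end of training
model = load_model("model5.h5")

# Predict on a new image.
# `guess` and `train_generator` must already be defined in the session
# (i.e., the training script or notebook cells have been run);
# `class_indices` maps class names to output indices.
guess("example_image.jpg", model, train_generator.class_indices)
```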

Expected Output:

* The image is displayed.
* The title above the image indicates the **predicted lymphoma subtype**.
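
For readers without access to the original script, a function with the behavior described above could look roughly like this. It is a sketch, not the repository's actual `guess()`: the signature is inferred from the call shown above, `decode_prediction` is a hypothetical helper, and the default `target_size` of 150×150 is taken from the training configuration.

```python
import numpy as np
import matplotlib.pyplot as plt


def decode_prediction(probs, class_indices):
    """Map a 1-D probability vector to (label, probability)."""
    # class_indices maps label -> index; invert it for lookup by index.
    index_to_label = {idx: label for label, idx in class_indices.items()}
    best = int(np.argmax(probs))
    return index_to_label[best], float(probs[best])


def guess(image_path, model, class_indices, target_size=(150, 150)):
    """Predict the class of one image and display it with the label as title."""
    # Imported here so decode_prediction stays usable without TensorFlow.
    from tensorflow.keras.utils import load_img, img_to_array

    img = load_img(image_path, target_size=target_size)  # resize on load
    batch = np.expand_dims(img_to_array(img) / 255.0, axis=0)  # shape (1, H, W, 3)
    probs = model.predict(batch)[0]
    label, _ = decode_prediction(probs, class_indices)

    plt.imshow(img)
    plt.title(label)
    plt.axis("off")
    plt.show()
    return label
```

The image is divided by 255 so that inference sees the same `[0,1]` scaling that was applied during training.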

---

## 📌 Applications

* **Medical Decision Support:** Assisting oncologists in rapid and preliminary diagnosis of lymphoma subtypes.
* **Research:** Benchmarking CNN performance on histopathological datasets.
* **Education:** Teaching medical students and engineers about AI applications in pathology.

⚠️ **Disclaimer:** This model is for **research and educational purposes only**. It is **not a substitute for professional medical diagnosis**. Clinical deployment requires extensive validation, regulatory approval, and rigorous testing.

---

## 🚀 Future Improvements

1. Integrating **transfer learning** for improved accuracy.
2. Expanding dataset size and diversity.
3. Hyperparameter optimization with automated search tools.
4. Deploying as a web application (e.g., Flask, FastAPI, Streamlit).
5. Exporting to **TensorFlow Lite** or **ONNX** for mobile/edge deployment.

---

## 🏆 Conclusion

This project demonstrates a reproducible baseline CNN classifier for lymphoma images from a multi-cancer dataset. It provides a **solid foundation** for further work in AI-driven oncology research.

By following the modular design of this repository, researchers can:

* Reproduce experiments
* Extend the architecture
* Adapt the pipeline for other cancer datasets

This repository bridges the gap between **machine learning engineering** and **medical research**, contributing towards a future where AI supports healthcare professionals in delivering faster, more accurate, and more reliable diagnoses.

---