---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: image-classification
library_name: transformers
tags:
- biology
- med
- chemistry
- code
---

---

# Multi-Cancer Lymphoma Classification with Convolutional Neural Networks (CNN)

## 📌 Overview

This repository contains an end-to-end deep learning pipeline developed in **Python** using **TensorFlow** and **Keras** for the automated classification of lymphoma subtypes within a multi-cancer dataset. The project leverages **Convolutional Neural Networks (CNNs)** to perform supervised image classification on histopathological cancer images, aiming to provide a robust and scalable solution for medical imaging analysis.

The pipeline encompasses:

* Data ingestion and preprocessing with **ImageDataGenerator**
* Training/validation split and augmentation
* Definition and compilation of a deep CNN architecture
* Training with real-time performance evaluation
* Model persistence (`.h5` file format) for later inference
* Custom prediction utility with visualization

This repository is intended for **medical AI researchers**, **machine learning engineers**, and **healthcare data scientists** who seek to apply convolutional neural networks for diagnostic support in oncology.

---

## 📂 Dataset Information

The dataset used in this project is located at:

```
/kaggle/input/multi-cancer/Multi Cancer/Multi Cancer/Lymphoma
```

This directory contains subfolders representing different classes of lymphoma and potentially other cancer subtypes. The **directory structure** is expected to be of the form:

```
Lymphoma/
├── Class_A/
│   ├── img_1.jpg
│   ├── img_2.jpg
│   └── ...
├── Class_B/
│   ├── img_3.jpg
│   └── ...
└── Class_C/
    ├── img_4.jpg
    └── ...
```

* Each subfolder corresponds to one diagnostic class.
* The model automatically infers class labels from these subdirectories.
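
The label inference described above can be illustrated with a small standard-library sketch. It is not code from this repository: `infer_class_indices` and `count_images_per_class` are hypothetical helper names, and the sorted ordering assumes the alphanumeric ordering of subfolder names that Keras's `flow_from_directory` applies when building `class_indices`. It can be useful for checking class balance before training.

```python
import os


def infer_class_indices(root_dir):
    """Map each class subfolder name to an integer index, in sorted order.

    Assumption: sorted subfolder names mirror the ordering Keras uses
    when it builds `class_indices` from a directory.
    """
    class_names = sorted(
        entry for entry in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, entry))
    )
    return {name: index for index, name in enumerate(class_names)}


def count_images_per_class(root_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Count image files directly inside each class subfolder."""
    counts = {}
    for name in infer_class_indices(root_dir):
        class_dir = os.path.join(root_dir, name)
        counts[name] = sum(
            1 for filename in os.listdir(class_dir)
            if filename.lower().endswith(extensions)
        )
    return counts


if __name__ == "__main__":
    # Dataset path as given in this README; adjust for other environments.
    data_dir = "/kaggle/input/multi-cancer/Multi Cancer/Multi Cancer/Lymphoma"
    if os.path.isdir(data_dir):
        print(infer_class_indices(data_dir))
        print(count_images_per_class(data_dir))
    else:
        print(f"Dataset directory not found: {data_dir}")
```

Files that sit directly in the dataset root (rather than in a class subfolder) are ignored by both helpers, so a stray file there does not create a spurious class.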

---

## ⚙️ Dependencies

This project requires the following core dependencies:

* **Python 3.8+**
* **TensorFlow 2.x**
* **Keras (integrated with TensorFlow)**
* **NumPy**
* **Matplotlib**

To install dependencies:

```bash
pip install tensorflow numpy matplotlib
```

If running on Kaggle or Google Colab, these libraries are already pre-installed.

---

## 🧩 Code Structure

The main script (`train.py` or notebook cell) is divided into logical sections:

1. **Imports**
   * Standard libraries (`os`, `numpy`)
   * Scientific libraries (`matplotlib`)
   * Deep learning libraries (`tensorflow`, `keras`, `layers`)
2. **Data Pipeline**
   * Data preprocessing with `ImageDataGenerator`
   * Automatic normalization of pixel intensities (`rescale=1./255`)
   * Splitting into training (90%) and validation (10%)
3. **Model Architecture**
   * A sequential CNN architecture with the following layers:
     * `Conv2D` (32 filters, 3×3 kernel, ReLU)
     * `MaxPooling2D` (2×2)
     * `Conv2D` (64 filters, ReLU)
     * `MaxPooling2D` (2×2)
     * `Conv2D` (128 filters, ReLU)
     * `MaxPooling2D` (2×2)
     * `Flatten`
     * `Dense` (512 units, ReLU)
     * `Dense` (softmax output for multi-class classification)
4. **Compilation**
   * Optimizer: **Adam**
   * Loss Function: **Categorical Crossentropy**
   * Metrics: **Accuracy**
5. **Training**
   * Training via `model.fit()`
   * `epochs=10`
   * Validation data monitoring
6. **Model Persistence**
   * Final trained model is saved as `model5.h5`
7. **Prediction Utility** (`guess()` function)
   * Takes an input image path
   * Resizes and normalizes the image
   * Performs forward propagation using the trained model
   * Outputs the predicted class with corresponding visualization

---

## 🔬 Methodology

The approach relies on **supervised learning** using CNNs for image recognition.

* **Feature Extraction:** Convolutional and pooling layers learn hierarchical spatial representations of cancerous tissue patterns.
* **Classification:** Dense layers map these features into probabilistic class predictions.
* **Normalization:** All images are rescaled to `[0,1]` for stable gradient descent.
* **Generalization:** A held-out validation set (10%) is used to monitor overfitting and to estimate out-of-sample performance.

This is a **baseline model**, and can be extended with:

* **Data Augmentation** (rotation, zoom, shear, flips)
* **Transfer Learning** (e.g., VGG16, ResNet50, EfficientNet)
* **Regularization** (Dropout, L2 penalty)
* **Hyperparameter Optimization** (learning rate, batch size tuning)

---

## 📊 Training Performance

* **Epochs:** 10
* **Batch Size:** 32
* **Image Size:** 150×150 (RGB channels)
* **Optimizer:** Adam (adaptive learning rate)
* **Loss Function:** Categorical Crossentropy
* **Evaluation Metric:** Accuracy

Performance metrics are printed during training and can be plotted for visualization. Example outputs include training/validation accuracy and loss curves.

---

## 🧪 Inference Example

Using the custom `guess()` function:

```python
from tensorflow.keras.models import load_model

# guess() and train_generator come from the training script/notebook,
# so they must already be defined in the current session.

# Load model
model = load_model("model5.h5")

# Predict on new image
guess("example_image.jpg", model, train_generator.class_indices)
```

Expected Output:

* The image is displayed.
* The title above the image indicates the **predicted lymphoma subtype**.

---

## 📌 Applications

* **Medical Decision Support:** Assisting oncologists in rapid and preliminary diagnosis of lymphoma subtypes.
* **Research:** Benchmarking CNN performance on histopathological datasets.
* **Education:** Teaching medical students and engineers about AI applications in pathology.

⚠️ **Disclaimer:** This model is for **research and educational purposes only**. It is **not a substitute for professional medical diagnosis**. Clinical deployment requires extensive validation, regulatory approval, and rigorous testing.

---

## 🚀 Future Improvements

1. Integrating **transfer learning** for improved accuracy.
2. Expanding dataset size and diversity.
3. Hyperparameter optimization with automated search tools.
4. Deploying as a web application (e.g., Flask, FastAPI, Streamlit).
5. Exporting to **TensorFlow Lite** or **ONNX** for mobile/edge deployment.

---

## 🏆 Conclusion

This project demonstrates the development of a robust, reproducible, and interpretable CNN-based classification model for multi-cancer (lymphoma) image analysis. It provides a **solid foundation** for further advancements in AI-driven oncology research.

By following the modular design of this repository, researchers can:

* Reproduce experiments
* Extend the architecture
* Adapt the pipeline for other cancer datasets

This repository bridges the gap between **machine learning engineering** and **medical research**, contributing towards a future where AI supports healthcare professionals in delivering faster, more accurate, and more reliable diagnoses.

---