jeremysean committed on
Commit
fe602ca
·
verified ·
1 Parent(s): 9e7f63b

Update README.md

  1. README.md +260 -0
README.md CHANGED
@@ -1,3 +1,263 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
+ metrics:
6
+ - accuracy
7
+ - f1
8
+ - precision
9
+ pipeline_tag: image-classification
10
+ tags:
11
+ - histopathology
12
+ - lung
13
+ - colon
14
+ - cancer
15
  ---
16
+ # Histolab
17
+ ## LC25000 Histopathology Classification
18
+
19
+ Custom CNN architectures for histopathological image classification, trained from scratch without pretrained weights. Three model variants of increasing complexity and expected accuracy are provided, all designed for the LC25000 dataset.
20
+
21
+ ## Overview
22
+
23
+ This project implements custom convolutional neural networks for classifying histopathological images into five distinct categories. The models are trained from scratch without transfer learning, demonstrating the effectiveness of carefully designed architectures for medical image analysis.
24
+
25
+ ## Dataset
26
+
27
+ **LC25000 Lung and Colon Histopathological Image Dataset**
28
+
29
+ - Total Images: 25,000
30
+ - Image Size: 768 x 768 pixels (resized to 224 x 224)
31
+ - Number of Classes: 5
32
+ - Format: RGB histopathological images
33
+
34
+ ### Classes
35
+
36
+ 1. Colon Adenocarcinoma
37
+ 2. Colon Benign Tissue
38
+ 3. Lung Adenocarcinoma
39
+ 4. Lung Benign Tissue
40
+ 5. Lung Squamous Cell Carcinoma
41
+
42
+ ## Model Architectures
43
+
44
+ Three distinct architectures are provided, each with different complexity-performance tradeoffs:
45
+
46
+ ### Version 1: Simple CNN
47
+ - **Architecture**: Classic VGG-style sequential convolutions
48
+ - **Training Time**: Fast
49
+ - **Expected Accuracy**: 90-93%
50
+ - **Parameters**: ~15M
51
+ - **Best For**: Baseline experiments, quick iterations
52
+
53
+ ### Version 2: Residual Network
54
+ - **Architecture**: ResNet-inspired with residual connections
55
+ - **Training Time**: Moderate
56
+ - **Expected Accuracy**: 92-95%
57
+ - **Parameters**: ~20M
58
+ - **Best For**: Balanced performance and training efficiency
59
+
60
+ ### Version 3: Attention Network (Recommended)
61
+ - **Architecture**: Advanced design with Squeeze-and-Excitation blocks
62
+ - **Training Time**: Longer
63
+ - **Expected Accuracy**: 94-97%
64
+ - **Parameters**: ~25M
65
+ - **Best For**: Maximum performance
66
+
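The Squeeze-and-Excitation idea behind Version 3 can be illustrated without any framework. The sketch below is not the code from `lc25000_classifier`; it only shows the arithmetic of one SE block on a tiny feature map. The function name `se_block`, the weight matrices `w1` and `w2`, and the omission of biases are all assumptions made for illustration.

```python
import math

def se_block(feature_map, w1, w2):
    """Apply one Squeeze-and-Excitation step (illustrative, biases omitted).

    feature_map: list of C channels, each an H x W list of lists.
    w1: (C // r) x C weights of the reduction layer (hypothetical values).
    w2: C x (C // r) weights of the expansion layer (hypothetical values).
    """
    # Squeeze: global average pooling yields one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: bottleneck layer with ReLU, then expansion layer with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate in (0, 1).
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates

# Two 2x2 channels: one active, one silent.
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, -1.0]]        # reduce 2 channels to 1 hidden unit
w2 = [[2.0], [-2.0]]      # expand back to 2 channel gates
scaled, gates = se_block(feature_map, w1, w2)
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, which is the channel reweighting that the attention variant relies on.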
67
+ ## Key Features
68
+
69
+ - **Data Augmentation**: Comprehensive augmentation pipeline including random flips, rotations, zoom, translation, contrast, and brightness adjustments
70
+ - **Regularization**: L2 weight decay, dropout, and label smoothing
71
+ - **Optimization**: AdamW optimizer with learning rate scheduling
72
+ - **Callbacks**: Early stopping, learning rate reduction on plateau, model checkpointing
73
+ - **Metrics**: Accuracy, AUC, Precision, Recall
74
+ - **Visualization**: Training history plots and confusion matrices
75
+
76
+ ## Installation
77
+
78
+ ```bash
79
+ pip install "tensorflow>=2.13.0"
80
+ pip install numpy
81
+ pip install matplotlib
82
+ pip install seaborn
83
+ pip install scikit-learn
84
+ ```
85
+
86
+ ## Usage
87
+
88
+ ### Basic Training
89
+
90
+ ```python
91
+ from lc25000_classifier import main
92
+
93
+ # Edit these two paths inside main() in lc25000_classifier.py; assigning them here does not change what main() uses
94
+ train_directory = 'path/to/train'
95
+ test_directory = 'path/to/test'
96
+
97
+ # Run training
98
+ main()
99
+ ```
100
+
101
+ ### Model Selection
102
+
103
+ Choose your desired architecture by uncommenting the appropriate line:
104
+
105
+ ```python
106
+ # Simple CNN (faster training)
107
+ # model = build_model_v1_simple()
108
+
109
+ # Residual Network (balanced)
110
+ # model = build_model_v2_residual()
111
+
112
+ # Attention Network (best performance)
113
+ model = build_model_v3_attention()
114
+ ```
115
+
116
+ ### Custom Training
117
+
118
+ ```python
119
+ from lc25000_classifier import load_datasets, build_model_v3_attention, compile_model, get_callbacks
120
+
121
+ # Load data
122
+ train_ds, val_ds, test_ds = load_datasets('path/to/train', 'path/to/test')
123
+
124
+ # Build and compile model
125
+ model = build_model_v3_attention()
126
+ model = compile_model(model)
127
+
128
+ # Train
129
+ history = model.fit(
130
+     train_ds,
131
+     epochs=150,
132
+     validation_data=val_ds,
133
+     callbacks=get_callbacks()
134
+ )
135
+ ```
136
+
137
+ ## Configuration
138
+
139
+ Key hyperparameters can be adjusted in the CONFIG dictionary:
140
+
141
+ ```python
142
+ CONFIG = {
143
+     'image_size': (224, 224),
144
+     'batch_size': 32,
145
+     'epochs': 150,
146
+     'initial_lr': 0.001,
147
+     'weight_decay': 1e-4,
148
+     'dropout_rate': 0.4,
149
+     'num_classes': 5,
150
+     'seed': 42
151
+ }
152
+ ```
153
+
154
+ ## Training Details
155
+
156
+ ### Optimization Strategy
157
+ - **Optimizer**: AdamW with weight decay
158
+ - **Initial Learning Rate**: 0.001
159
+ - **Learning Rate Schedule**: ReduceLROnPlateau (factor=0.5, patience=7)
160
+ - **Loss Function**: Categorical Crossentropy with label smoothing (0.1)
161
+
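To make the schedule concrete, here is a minimal simulation of how a reduce-on-plateau rule with `factor=0.5` and `patience=7` changes the learning rate. It is a simplified sketch, not the Keras callback itself: it assumes the monitored quantity is validation loss and ignores the `min_delta` and `cooldown` options. The function name and the loss values are made up for illustration.

```python
def simulate_reduce_on_plateau(val_losses, initial_lr=0.001, factor=0.5,
                               patience=7, min_lr=0.0):
    """Return the learning rate in effect during each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    rates = []
    for loss in val_losses:
        rates.append(lr)              # rate used while training this epoch
        if loss < best:               # improvement resets the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:      # plateau: shrink the rate, start over
                lr = max(lr * factor, min_lr)
                wait = 0
    return rates

# Illustrative losses: two improving epochs, seven flat epochs, then one more.
losses = [1.0, 0.8] + [0.8] * 7 + [0.7]
rates = simulate_reduce_on_plateau(losses)
```

In this toy run the rate stays at 0.001 through the seven flat epochs and is halved to 0.0005 for the epoch that follows them.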
162
+ ### Regularization Techniques
163
+ - L2 weight regularization (1e-4)
164
+ - Dropout (0.4 in classifier, 0.2-0.3 in feature extractor)
165
+ - Batch normalization after each convolution
166
+ - Label smoothing
167
+
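As a worked example of label smoothing with five classes and a smoothing value of 0.1, the sketch below builds the smoothed target using the rule `y * (1 - smoothing) + smoothing / num_classes`, which is how Keras documents its `label_smoothing` argument. The helper names and the example probabilities are illustrative only.

```python
import math

def smooth_labels(true_index, num_classes=5, smoothing=0.1):
    """One-hot target for true_index, softened by label smoothing."""
    off_value = smoothing / num_classes
    return [
        (1.0 - smoothing) + off_value if i == true_index else off_value
        for i in range(num_classes)
    ]

def cross_entropy(target, probabilities):
    """Categorical cross-entropy between a target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probabilities))

hard_target = [0.0, 0.0, 1.0, 0.0, 0.0]
soft_target = smooth_labels(2)   # true class gets 0.92, every other class 0.02

# A very confident (illustrative) prediction for the true class.
confident = [0.001, 0.001, 0.996, 0.001, 0.001]
hard_loss = cross_entropy(hard_target, confident)
soft_loss = cross_entropy(soft_target, confident)
```

The smoothed loss is larger than the hard-label loss for the same overconfident prediction, which is the intended regularizing effect: the model is discouraged from pushing probabilities all the way to 0 and 1.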
168
+ ### Training Strategy
169
+ - Early stopping (patience=15)
170
+ - Model checkpointing (saves best model based on validation accuracy)
171
+ - TensorBoard logging for monitoring
172
+
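The interaction between early stopping and checkpointing can be sketched in a few lines. This is a simplified model of the behavior, not the Keras callbacks: it assumes both rules monitor validation accuracy (the README states this only for checkpointing), and the accuracy values are invented for illustration.

```python
def run_with_early_stopping(val_accuracies, patience=15):
    """Return (last_epoch_run, best_epoch) for a sequence of validation accuracies."""
    best_accuracy = float("-inf")
    best_epoch = -1
    wait = 0
    for epoch, accuracy in enumerate(val_accuracies):
        if accuracy > best_accuracy:
            # A checkpoint of the best model would be written here.
            best_accuracy, best_epoch, wait = accuracy, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # stop: no improvement for `patience` epochs
    return len(val_accuracies) - 1, best_epoch

# Illustrative curve: improves until epoch 1, then stays slightly lower.
accuracies = [0.80, 0.90] + [0.89] * 20
stopped_at, best_epoch = run_with_early_stopping(accuracies)
```

Here training stops at epoch 16 (15 epochs after the last improvement), while the saved checkpoint corresponds to epoch 1. This is why the best checkpoint and the final model can differ.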
173
+ ## Performance Metrics
174
+
175
+ The model is evaluated using multiple metrics:
176
+ - **Accuracy**: Overall classification accuracy
177
+ - **AUC**: Area under the ROC curve, computed across the five classes (multi-label mode)
178
+ - **Precision**: Positive predictive value
179
+ - **Recall**: Sensitivity
180
+ - **Confusion Matrix**: Detailed per-class performance
181
+
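The relationship between the confusion matrix and the other metrics can be shown with a small framework-free sketch. The counts below are made up purely for illustration and are not results from these models. Rows are assumed to be true classes and columns predicted classes, in the class order listed earlier.

```python
def metrics_from_confusion(confusion):
    """Return overall accuracy and a (precision, recall) pair per class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(num_classes)) / total
    report = []
    for k in range(num_classes):
        true_positives = confusion[k][k]
        predicted_as_k = sum(confusion[i][k] for i in range(num_classes))
        actually_k = sum(confusion[k])
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        report.append((precision, recall))
    return accuracy, report

# Hypothetical counts, 50 samples per class.
confusion = [
    [48, 2, 0, 0, 0],    # Colon Adenocarcinoma
    [1, 49, 0, 0, 0],    # Colon Benign Tissue
    [0, 0, 45, 0, 5],    # Lung Adenocarcinoma
    [0, 0, 0, 50, 0],    # Lung Benign Tissue
    [0, 0, 4, 0, 46],    # Lung Squamous Cell Carcinoma
]
accuracy, report = metrics_from_confusion(confusion)
lung_aca_precision, lung_aca_recall = report[2]
```

In this toy matrix the overall accuracy is 238/250, while the per-class pairs show where the errors sit: the third class has recall 45/50 and precision 45/49.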
182
+ ## Model Output
183
+
184
+ Training produces the following artifacts:
185
+ - `lc25000_scratch_best.keras`: Best model checkpoint
186
+ - `lc25000_scratch_final.keras`: Final trained model
187
+ - `lc25000_scratch_weights.weights.h5`: Model weights only
188
+ - `training_history.png`: Visualization of training metrics
189
+ - `confusion_matrix.png`: Confusion matrix heatmap
190
+ - `./logs/`: TensorBoard logs
191
+
192
+ ## Inference Example
193
+
194
+ ```python
195
+ import tensorflow as tf
196
+ import numpy as np
197
+
198
+ # Load model
199
+ model = tf.keras.models.load_model('lc25000_scratch_final.keras')
200
+
201
+ # Load and preprocess image
202
+ image = tf.keras.utils.load_img('path/to/image.png', target_size=(224, 224))
203
+ image_array = tf.keras.utils.img_to_array(image)
204
+ image_array = np.expand_dims(image_array, axis=0)
205
+
206
+ # Predict (CLASS_NAMES must be defined first: the five class labels in the order used during training)
207
+ predictions = model.predict(image_array)
208
+ predicted_class = CLASS_NAMES[np.argmax(predictions)]
209
+
210
+ print(f"Predicted class: {predicted_class}")
211
+ print(f"Confidence: {np.max(predictions):.2%}")
212
+ ```
213
+
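The inference snippet relies on a `CLASS_NAMES` list that it does not define. A minimal sketch of that mapping follows. The order shown simply mirrors the class list in this README and is an assumption: the real order is whatever index assignment the training data loader used, so verify it before trusting the labels. The helper `top_predictions` and the example probabilities are hypothetical.

```python
# Assumed order (matches the "Classes" list above; verify against training).
CLASS_NAMES = [
    "Colon Adenocarcinoma",
    "Colon Benign Tissue",
    "Lung Adenocarcinoma",
    "Lung Benign Tissue",
    "Lung Squamous Cell Carcinoma",
]

def top_predictions(probabilities, class_names, k=3):
    """Return the k most probable (class_name, probability) pairs, best first."""
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Example with one row of softmax output, e.g. predictions[0] from model.predict.
example_probabilities = [0.01, 0.02, 0.90, 0.02, 0.05]
ranked = top_predictions(example_probabilities, CLASS_NAMES)
best_label, best_probability = ranked[0]
```

Reporting the top few classes instead of only the argmax makes borderline cases visible, for example when two lung classes receive similar probabilities.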
214
+ ## Requirements
215
+
216
+ - Python 3.8+
217
+ - TensorFlow 2.13+
218
+ - NumPy
219
+ - Matplotlib
220
+ - Seaborn
221
+ - scikit-learn
222
+
223
+ ## Hardware Recommendations
224
+
225
+ - **Minimum**: 8GB RAM, CPU training (slow)
226
+ - **Recommended**: 16GB RAM, NVIDIA GPU with 8GB+ VRAM
227
+ - **Optimal**: 32GB RAM, NVIDIA GPU with 16GB+ VRAM
228
+
229
+ Training time varies by architecture and hardware:
230
+ - Simple CNN: ~2-4 hours (GPU)
231
+ - Residual Network: ~4-6 hours (GPU)
232
+ - Attention Network: ~6-10 hours (GPU)
233
+
234
+ ## Citation
235
+
236
+ If you use this implementation, please cite the LC25000 dataset:
237
+
238
+ ```
239
+ @misc{borkowski_lc25000,
240
+   title = {LC25000 Lung and colon histopathological image dataset},
241
+   keywords = {cancer, histopathology},
242
+   author = {Andrew A. Borkowski and Marilyn M. Bui and L. Brannon Thomas and Catherine P. Wilson and Lauren A. DeLand and Stephen M. Mastorides},
243
+   url = {https://github.com/tampapath/lung_colon_image_set}
244
+ }
245
+ ```
246
+
247
+ ## License
248
+
249
+ This implementation is released under the Apache 2.0 license, as declared in the model card metadata, and is intended for research and educational purposes. Please refer to the LC25000 dataset license for data usage terms.
250
+
251
+ ## Acknowledgments
252
+
253
+ - LC25000 dataset creators for providing high-quality histopathological images
254
+ - TensorFlow team for the deep learning framework
255
+ - Medical imaging community for advancing computational pathology
256
+
257
+ ## Contact
258
+
259
+ For questions, issues, or contributions, please open an issue in the repository.
260
+
261
+ ---
262
+
263
+ **Note**: This model is intended for research purposes only and should not be used for clinical diagnosis without proper validation and regulatory approval.