mohakapoor committed on
Commit a1eb0d1 · 1 Parent(s): 6a6a772

update readme

Files changed (1)
  1. README.md +37 -81
README.md CHANGED
@@ -128,16 +128,6 @@ python inference.py
128
  - **Shows both overall and character accuracy**
129
  - **Creates visualization plots** in Metrics folder
130
 
131
- ### Local Development (GTX 1650)
132
- - Use `Dataset_test` (1k images)
133
- - Batch size: 32
134
- - Good for rapid iteration and testing
135
-
136
- ### Colab Training (Tesla T4)
137
- - Use `Dataset` (100k images)
138
- - Batch size: 128
139
- - Expected training time: 2-4 hours for 40 epochs
140
-
141
  ## 🔬 Technical Details
142
 
143
  ### Model Architecture (CRNN)
@@ -149,41 +139,46 @@ graph TD
149
  %% Input Layer
150
  A[Input Image<br/>60x256x1] --> B[CNN Encoder<br/>SmallCNN]
151
 
152
- %% CNN Encoder Details
153
- B --> C[Conv1 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 2x2]
154
- C --> D[Channels: 1 to 64<br/>Spatial: 60x256 to 30x128]
155
-
156
- D --> E[Conv2 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 1x2]
157
- E --> F[Channels: 64 to 128<br/>Spatial: 30x128 to 30x64]
158
-
159
- F --> G[Residual Block<br/>3x3 Conv + BatchNorm + ReLU<br/>3x3 Conv + BatchNorm<br/>+ Skip Connection]
160
- G --> H[Maintains: 128 channels, 30x64 spatial]
161
-
162
- H --> I[Height Pooling<br/>AdaptiveAvgPool2d 1xNone]
163
- I --> J[Squeeze Height<br/>30x64 to 1x64]
164
-
165
- J --> K[Permute & Reshape<br/>B,128,1,64 to 64,B,128]
166
-
167
- %% RNN Decoder
168
- K --> L[RNN Decoder<br/>2-Layer BiLSTM]
169
- L --> M[Hidden Size: 320 per direction<br/>Total: 640 features]
170
- M --> N[Output: 64,B,640]
171
 
172
- %% Output Layer
173
- N --> O[LayerNorm<br/>Stabilize 640D features]
174
- O --> P[Linear Layer<br/>640 to 63 classes]
175
- P --> Q[Output Logits<br/>64,B,63]
176
 
177
- %% CTC Processing
178
- Q --> R[CTC Decoding<br/>Remove duplicates & blanks]
179
- R --> S[Final Prediction<br/>Character sequence]
180
 
181
- %% Styling
182
- classDef inputLayer fill:#e1f5fe,stroke:#01579b,stroke-width:2px
183
- classDef cnnLayer fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
184
- classDef rnnLayer fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
185
- classDef outputLayer fill:#fff3e0,stroke:#e65100,stroke-width:2px
186
- classDef ctcLayer fill:#fce4ec,stroke:#880e4f,stroke-width:2px
187
 
188
  class A inputLayer
189
  class B,C,D,E,F,G,H,I,J,K cnnLayer
@@ -192,45 +187,6 @@ graph TD
192
  class R,S ctcLayer
193
  ```
194
 
195
- **Data Flow Summary:**
196
- - **Input**: `[B, 1, 60, 256]` - Batch of grayscale images
197
- - **CNN Output**: `[64, B, 128]` - 64 timesteps, batch size, 128 features
198
- - **RNN Output**: `[64, B, 640]` - 64 timesteps, batch size, 640 features
199
- - **Final Output**: `[64, B, 63]` - 64 timesteps, batch size, 63 classes
200
- - **CTC Decode**: Character sequence (a-z, A-Z, 0-9)
201
-
202
- #### **CNN Encoder (SmallCNN)**
203
- ```
204
- Input: [B, 1, 60, 256] → Output: [64, B, 128]
205
- ```
206
- - **Conv1 Block**: 3×3 conv → BatchNorm → ReLU → MaxPool(2×2)
207
- - Channels: 1 → 64
208
- - Spatial: 60×256 → 30×128
209
- - **Conv2 Block**: 3×3 conv → BatchNorm → ReLU → MaxPool(1×2)
210
- - Channels: 64 → 128
211
- - Spatial: 30×128 → 30×64
212
- - **Residual Block**: 3×3 conv → BatchNorm → ReLU → 3×3 conv → BatchNorm + Skip Connection
213
- - Maintains 128 channels and 30×64 spatial dimensions
214
- - **Height Pooling**: AdaptiveAvgPool2d(1, None) → squeeze(2)
215
- - Spatial: 30×64 → 1×64 → 64 timesteps
216
- - Final: [64, B, 128] where T=64, B=batch_size, C=128
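The shape arithmetic in the removed CNN Encoder notes can be checked with a few lines of plain Python. This is only a sketch of the bookkeeping, not the repository's `SmallCNN`: the helper names `pooled` and `cnn_output_shape` are hypothetical, and it assumes the 3×3 convolutions are padded so that only the pooling layers change the spatial size.

```python
def pooled(size, kernel):
    """Length after a max-pool whose stride equals its kernel size."""
    return size // kernel


def cnn_output_shape(batch, height=60, width=256, channels=128):
    """Trace [B, 1, H, W] through the encoder and return (T, B, C)."""
    # Conv1 block: MaxPool 2x2 halves height and width (60x256 -> 30x128).
    h, w = pooled(height, 2), pooled(width, 2)
    # Conv2 block: MaxPool 1x2 halves the width only (30x128 -> 30x64).
    h, w = pooled(h, 1), pooled(w, 2)
    # Residual block: assumed to keep 128 channels and the 30x64 grid.
    # Height pooling collapses the height axis to 1; squeeze then drops it.
    h = 1
    # [B, C, 1, W] -> [B, C, W] -> permute -> [W, B, C], so T equals W.
    return (w, batch, channels)


print(cnn_output_shape(batch=8))  # (64, 8, 128)
```

The width is divided by 2 twice, which is where the total stride of 4 (256 → 64 timesteps) comes from.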
217
-
218
- #### **RNN Decoder (BiLSTM)**
219
- ```
220
- Input: [64, B, 128] → Output: [64, B, 640]
221
- ```
222
- - **Architecture**: 2-layer bidirectional LSTM
223
- - **Hidden Size**: 320 per direction (total 640)
224
- - **Dropout**: 0.05 between layers
225
- - **Output**: [T, B, 2×hidden] = [64, B, 640]
226
-
227
- #### **Output Layer**
228
- ```
229
- Input: [64, B, 640] → Output: [64, B, 63]
230
- ```
231
- - **LayerNorm**: Stabilizes 640-dimensional features
232
- - **Linear**: Maps to vocabulary size (62 chars + 1 blank token)
233
- - **Final Shape**: [T=64, B=batch_size, V=63]
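To illustrate how logits of shape `[T, B, 63]` become a character sequence, here is a minimal sketch of greedy CTC decoding for one sample. It is not taken from the repository. The blank index (`0`), the character order (a-z, A-Z, 0-9), and the function name `ctc_greedy_decode` are assumptions made for illustration. The input is the per-timestep argmax over the 63 classes.

```python
import string

# Assumption: index 0 is the CTC blank, indices 1..62 map to the characters below.
BLANK = 0
CHARSET = string.ascii_lowercase + string.ascii_uppercase + string.digits  # 62 chars


def ctc_greedy_decode(best_path):
    """Collapse repeated indices, then drop blanks, for one sample's argmax path."""
    chars = []
    prev = BLANK
    for idx in best_path:
        # Emit only when the class changes and the new class is not the blank.
        if idx != prev and idx != BLANK:
            chars.append(CHARSET[idx - 1])
        prev = idx
    return "".join(chars)


# A blank between two identical indices keeps both characters.
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 0, 53]))  # aab0
```

With 64 timesteps per image, the decoded string can never be longer than 64 characters, and repeated characters need a blank between them.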
234
 
235
  #### **Key Design Features**
236
  - **Total Stride**: 4 (256 → 64 timesteps)
 
128
  - **Shows both overall and character accuracy**
129
  - **Creates visualization plots** in Metrics folder
130
 
131
  ## 🔬 Technical Details
132
 
133
  ### Model Architecture (CRNN)
 
139
  %% Input Layer
140
  A[Input Image<br/>60x256x1] --> B[CNN Encoder<br/>SmallCNN]
141
 
142
+ %% CNN Subgraph - Top to Bottom
143
+ subgraph CNN ["CNN Encoder Layer"]
144
+ B --> C[Conv1 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 2x2]
145
+ C --> D[Channels: 1 to 64<br/>Spatial: 60x256 to 30x128]
146
+
147
+ D --> E[Conv2 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 1x2]
148
+ E --> F[Channels: 64 to 128<br/>Spatial: 30x128 to 30x64]
149
+
150
+ F --> G[Residual Block<br/>3x3 Conv + BatchNorm + ReLU<br/>3x3 Conv + BatchNorm<br/>+ Skip Connection]
151
+ G --> H[Maintains: 128 channels, 30x64 spatial]
152
+
153
+ H --> I[Height Pooling<br/>AdaptiveAvgPool2d 1xNone]
154
+ I --> J[Squeeze Height<br/>30x64 to 1x64]
155
+
156
+ J --> K[Permute & Reshape<br/>B,128,1,64 to 64,B,128]
157
+ end
158
 
159
+ %% RNN Subgraph - Top to Bottom
160
+ subgraph RNN ["RNN Decoder Layer"]
161
+ K --> L[RNN Decoder<br/>2-Layer BiLSTM]
162
+ L --> M[Hidden Size: 320 per direction<br/>Total: 640 features]
163
+ M --> N[Output: 64,B,640]
164
+ end
165
 
166
+ %% Output Subgraph - Top to Bottom
167
+ subgraph OUTPUT ["Output & CTC Layer"]
168
+ N --> O[LayerNorm<br/>Stabilize 640D features]
169
+ O --> P[Linear Layer<br/>640 to 63 classes]
170
+ P --> Q[Output Logits<br/>64,B,63]
171
+
172
+ Q --> R[CTC Decoding<br/>Remove duplicates & blanks]
173
+ R --> S[Final Prediction<br/>Character sequence]
174
+ end
175
 
176
+ %% Styling - Darker colors with black text
177
+ classDef inputLayer fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000000
178
+ classDef cnnLayer fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px,color:#000000
179
+ classDef rnnLayer fill:#bbdefb,stroke:#1565c0,stroke-width:3px,color:#000000
180
+ classDef outputLayer fill:#f8bbd9,stroke:#ad1457,stroke-width:3px,color:#000000
181
+ classDef ctcLayer fill:#d1c4e9,stroke:#512da8,stroke-width:3px,color:#000000
182
 
183
  class A inputLayer
184
  class B,C,D,E,F,G,H,I,J,K cnnLayer
 
187
  class R,S ctcLayer
188
  ```
189
 
190
 
191
  #### **Key Design Features**
192
  - **Total Stride**: 4 (256 → 64 timesteps)