Commit · a1eb0d1
1 Parent(s): 6a6a772
update readme
README.md
CHANGED
|
@@ -128,16 +128,6 @@ python inference.py
|
|
| 128 |
- **Shows both overall and character accuracy**
|
| 129 |
- **Creates visualization plots** in Metrics folder
|
| 130 |
|
| 131 |
-
### Local Development (GTX 1650)
|
| 132 |
-
- Use `Dataset_test` (1k images)
|
| 133 |
-
- Batch size: 32
|
| 134 |
-
- Good for rapid iteration and testing
|
| 135 |
-
|
| 136 |
-
### Colab Training (Tesla T4)
|
| 137 |
-
- Use `Dataset` (100k images)
|
| 138 |
-
- Batch size: 128
|
| 139 |
-
- Expected training time: 2-4 hours for 40 epochs
|
| 140 |
-
|
| 141 |
## 🔬 Technical Details
|
| 142 |
|
| 143 |
### Model Architecture (CRNN)
|
|
@@ -149,41 +139,46 @@ graph TD
|
|
| 149 |
%% Input Layer
|
| 150 |
A[Input Image<br/>60x256x1] --> B[CNN Encoder<br/>SmallCNN]
|
| 151 |
|
| 152 |
-
%% CNN
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
K --> L[RNN Decoder<br/>2-Layer BiLSTM]
|
| 169 |
-
L --> M[Hidden Size: 320 per direction<br/>Total: 640 features]
|
| 170 |
-
M --> N[Output: 64,B,640]
|
| 171 |
|
| 172 |
-
%%
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
|
| 177 |
-
%%
|
| 178 |
-
|
| 179 |
-
|
| 180 |
|
| 181 |
-
%% Styling
|
| 182 |
-
classDef inputLayer fill:#
|
| 183 |
-
classDef cnnLayer fill:#
|
| 184 |
-
classDef rnnLayer fill:#
|
| 185 |
-
classDef outputLayer fill:#
|
| 186 |
-
classDef ctcLayer fill:#
|
| 187 |
|
| 188 |
class A inputLayer
|
| 189 |
class B,C,D,E,F,G,H,I,J,K cnnLayer
|
|
@@ -192,45 +187,6 @@ graph TD
|
|
| 192 |
class R,S ctcLayer
|
| 193 |
```
|
| 194 |
|
| 195 |
-
**Data Flow Summary:**
|
| 196 |
-
- **Input**: `[B, 1, 60, 256]` - Batch of grayscale images
|
| 197 |
-
- **CNN Output**: `[64, B, 128]` - 64 timesteps, batch size, 128 features
|
| 198 |
-
- **RNN Output**: `[64, B, 640]` - 64 timesteps, batch size, 640 features
|
| 199 |
-
- **Final Output**: `[64, B, 63]` - 64 timesteps, batch size, 63 classes
|
| 200 |
-
- **CTC Decode**: Character sequence (a-z, A-Z, 0-9)
|
| 201 |
-
|
| 202 |
-
#### **CNN Encoder (SmallCNN)**
|
| 203 |
-
```
|
| 204 |
-
Input: [B, 1, 60, 256] → Output: [64, B, 128]
|
| 205 |
-
```
|
| 206 |
-
- **Conv1 Block**: 3×3 conv → BatchNorm → ReLU → MaxPool(2×2)
|
| 207 |
-
- Channels: 1 → 64
|
| 208 |
-
- Spatial: 60×256 → 30×128
|
| 209 |
-
- **Conv2 Block**: 3×3 conv → BatchNorm → ReLU → MaxPool(1×2)
|
| 210 |
-
- Channels: 64 → 128
|
| 211 |
-
- Spatial: 30×128 → 30×64
|
| 212 |
-
- **Residual Block**: 3×3 conv → BatchNorm → ReLU → 3×3 conv → BatchNorm + Skip Connection
|
| 213 |
-
- Maintains 128 channels and 30×64 spatial dimensions
|
| 214 |
-
- **Height Pooling**: AdaptiveAvgPool2d(1, None) → squeeze(2)
|
| 215 |
-
- Spatial: 30×64 → 1×64 → 64 timesteps
|
| 216 |
-
- Final: [64, B, 128] where T=64, B=batch_size, C=128
|
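The shape bookkeeping in the CNN Encoder list above can be checked with a few lines of plain Python. This is a minimal sketch, not the repository's code: the helper name `trace_cnn_shapes` is hypothetical, and it assumes that every 3×3 convolution uses padding 1 (so only pooling changes height and width) and that `MaxPool(1×2)` means a window of height 1 and width 2.

```python
def trace_cnn_shapes(height=60, width=256):
    """Trace (channels, height, width) through the SmallCNN stages listed above.

    Assumption: every 3x3 conv uses padding=1, so only pooling changes H and W.
    """
    stages = [("input", 1, height, width)]

    # Conv1 block: 1 -> 64 channels, MaxPool 2x2 halves height and width.
    height, width = height // 2, width // 2
    stages.append(("conv1", 64, height, width))

    # Conv2 block: 64 -> 128 channels, MaxPool 1x2 halves width only.
    width = width // 2
    stages.append(("conv2", 128, height, width))

    # Residual block keeps channels and spatial size unchanged.
    stages.append(("residual", 128, height, width))

    # Height pooling collapses H to 1; each remaining column is one timestep.
    stages.append(("height_pool", 128, 1, width))
    return stages


if __name__ == "__main__":
    for name, c, h, w in trace_cnn_shapes():
        print(f"{name:12s} C={c:3d} H={h:2d} W={w:3d}")
    timesteps = trace_cnn_shapes()[-1][3]
    print("timesteps:", timesteps, "total width stride:", 256 // timesteps)
```

Under these assumptions the width is halved twice (256 to 128 to 64), which matches the 64 timesteps and the total stride of 4 stated in the README.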
| 217 |
-
|
| 218 |
-
#### **RNN Decoder (BiLSTM)**
|
| 219 |
-
```
|
| 220 |
-
Input: [64, B, 128] → Output: [64, B, 640]
|
| 221 |
-
```
|
| 222 |
-
- **Architecture**: 2-layer bidirectional LSTM
|
| 223 |
-
- **Hidden Size**: 320 per direction (total 640)
|
| 224 |
-
- **Dropout**: 0.05 between layers
|
| 225 |
-
- **Output**: [T, B, 2×hidden] = [64, B, 640]
|
| 226 |
-
|
| 227 |
-
#### **Output Layer**
|
| 228 |
-
```
|
| 229 |
-
Input: [64, B, 640] → Output: [64, B, 63]
|
| 230 |
-
```
|
| 231 |
-
- **LayerNorm**: Stabilizes 640-dimensional features
|
| 232 |
-
- **Linear**: Maps to vocabulary size (62 chars + 1 blank token)
|
| 233 |
-
- **Final Shape**: [T=64, B=batch_size, V=63]
|
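The `[T, B, 63]` logits become text through CTC decoding: take the best class at each timestep, merge consecutive repeats, then drop blanks. Below is a minimal greedy-decoding sketch in plain Python for a single sequence. The blank index (0), the character ordering, and the names `ctc_greedy_decode` and `one_hot` are assumptions for illustration; the README only states that the vocabulary is 62 characters (a-z, A-Z, 0-9) plus one blank token.

```python
import string

# Assumed ordering: index 0 is the CTC blank, then a-z, A-Z, 0-9 (62 characters).
BLANK = 0
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits


def ctc_greedy_decode(logits):
    """Decode one sequence of per-timestep scores, shape [T][V] with V = 63."""
    # Best class index at each timestep.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in logits]

    decoded = []
    previous = BLANK
    for index in best_path:
        # Keep a symbol only if it is not blank and differs from the previous step.
        if index != BLANK and index != previous:
            decoded.append(CHARS[index - 1])
        previous = index
    return "".join(decoded)


def one_hot(index, size=len(CHARS) + 1):
    """Helper for the demo: a score vector that peaks at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


if __name__ == "__main__":
    # Path: a a <blank> a b b  ->  "aab" (the blank separates the two a's).
    path = [1, 1, 0, 1, 2, 2]
    print(ctc_greedy_decode([one_hot(i) for i in path]))
```

The blank token is what lets the model emit a doubled character: without a blank between them, two adjacent identical predictions collapse into one.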
| 234 |
|
| 235 |
#### **Key Design Features**
|
| 236 |
- **Total Stride**: 4 (256 → 64 timesteps)
|
|
|
|
| 128 |
- **Shows both overall and character accuracy**
|
| 129 |
- **Creates visualization plots** in Metrics folder
|
| 130 |
|
| 131 |
## 🔬 Technical Details
|
| 132 |
|
| 133 |
### Model Architecture (CRNN)
|
|
|
|
| 139 |
%% Input Layer
|
| 140 |
A[Input Image<br/>60x256x1] --> B[CNN Encoder<br/>SmallCNN]
|
| 141 |
|
| 142 |
+
%% CNN Subgraph - Top to Bottom
|
| 143 |
+
subgraph CNN ["CNN Encoder Layer"]
|
| 144 |
+
B --> C[Conv1 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 2x2]
|
| 145 |
+
C --> D[Channels: 1 to 64<br/>Spatial: 60x256 to 30x128]
|
| 146 |
+
|
| 147 |
+
D --> E[Conv2 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 1x2]
|
| 148 |
+
E --> F[Channels: 64 to 128<br/>Spatial: 30x128 to 30x64]
|
| 149 |
+
|
| 150 |
+
F --> G[Residual Block<br/>3x3 Conv + BatchNorm + ReLU<br/>3x3 Conv + BatchNorm<br/>+ Skip Connection]
|
| 151 |
+
G --> H[Maintains: 128 channels, 30x64 spatial]
|
| 152 |
+
|
| 153 |
+
H --> I[Height Pooling<br/>AdaptiveAvgPool2d 1xNone]
|
| 154 |
+
I --> J[Squeeze Height<br/>30x64 to 1x64]
|
| 155 |
+
|
| 156 |
+
J --> K[Permute & Reshape<br/>B,128,1,64 to 64,B,128]
|
| 157 |
+
end
|
| 158 |
|
| 159 |
+
%% RNN Subgraph - Top to Bottom
|
| 160 |
+
subgraph RNN ["RNN Decoder Layer"]
|
| 161 |
+
K --> L[RNN Decoder<br/>2-Layer BiLSTM]
|
| 162 |
+
L --> M[Hidden Size: 320 per direction<br/>Total: 640 features]
|
| 163 |
+
M --> N[Output: 64,B,640]
|
| 164 |
+
end
|
| 165 |
|
| 166 |
+
%% Output Subgraph - Top to Bottom
|
| 167 |
+
subgraph OUTPUT ["Output & CTC Layer"]
|
| 168 |
+
N --> O[LayerNorm<br/>Stabilize 640D features]
|
| 169 |
+
O --> P[Linear Layer<br/>640 to 63 classes]
|
| 170 |
+
P --> Q[Output Logits<br/>64,B,63]
|
| 171 |
+
|
| 172 |
+
Q --> R[CTC Decoding<br/>Remove duplicates & blanks]
|
| 173 |
+
R --> S[Final Prediction<br/>Character sequence]
|
| 174 |
+
end
|
| 175 |
|
| 176 |
+
%% Styling - Darker colors with black text
|
| 177 |
+
classDef inputLayer fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000000
|
| 178 |
+
classDef cnnLayer fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px,color:#000000
|
| 179 |
+
classDef rnnLayer fill:#bbdefb,stroke:#1565c0,stroke-width:3px,color:#000000
|
| 180 |
+
classDef outputLayer fill:#f8bbd9,stroke:#ad1457,stroke-width:3px,color:#000000
|
| 181 |
+
classDef ctcLayer fill:#d1c4e9,stroke:#512da8,stroke-width:3px,color:#000000
|
| 182 |
|
| 183 |
class A inputLayer
|
| 184 |
class B,C,D,E,F,G,H,I,J,K cnnLayer
|
|
|
|
| 187 |
class R,S ctcLayer
|
| 188 |
```
|
| 189 |
|
| 190 |
|
| 191 |
#### **Key Design Features**
|
| 192 |
- **Total Stride**: 4 (256 → 64 timesteps)
|