# DeepEncoder (Extracted from DeepSeek-OCR)

## Overview

This directory contains the encoder components extracted from DeepSeek-OCR.

## Model Files

- `sam_encoder.pth`: SAM ViT-B encoder (95,569,152 params, 364.6 MB)
- `clip_encoder.pth`: CLIP-Large encoder (303,177,728 params, 1156.6 MB)
- `projector.pth`: Linear projector (2,622,720 params, 10.0 MB)
- `config.json`: Model configuration

**Total:** 401,369,600 parameters
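The total is simply the sum of the three checkpoints (95,569,152 + 303,177,728 + 2,622,720 = 401,369,600). A quick way to confirm the counts from the files themselves, assuming each `.pth` holds a flat state dict of tensors:

```python
import torch

# Sum tensor sizes in each checkpoint (assumes plain state dicts of tensors).
for name in ['sam_encoder.pth', 'clip_encoder.pth', 'projector.pth']:
    state = torch.load(name, map_location='cpu')
    print(f"{name}: {sum(t.numel() for t in state.values()):,} params")
```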
## Architecture

```
Image (1024×1024) → SAM (95M) → 16× Conv → CLIP (303M) → Projector (3M) → 256 vision tokens
```
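The 256-token output follows from patch and compression arithmetic. A minimal sketch, assuming SAM ViT-B's standard 16×16 patch embedding (the patch size is not stated above):

```python
# Token-count arithmetic for the pipeline above.
image_side = 1024
patch_side = 16                               # assumed: SAM ViT-B's standard patch size
sam_tokens = (image_side // patch_side) ** 2  # 64 * 64 = 4096 patch tokens
vision_tokens = sam_tokens // 16              # the "16x Conv" compression stage
print(vision_tokens)                          # 256, matching the diagram
```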
## Usage

```python
import torch
from easydict import EasyDict as adict

from deepencoder import build_sam_vit_b, build_clip_l, MlpProjector

# Load the three components and restore the extracted weights
sam = build_sam_vit_b(checkpoint=None)
sam.load_state_dict(torch.load('sam_encoder.pth', map_location='cpu'))

clip = build_clip_l()
clip.load_state_dict(torch.load('clip_encoder.pth', map_location='cpu'))

projector_cfg = adict({'projector_type': 'linear', 'input_dim': 2048, 'n_embed': 1280})
projector = MlpProjector(projector_cfg)
projector.load_state_dict(torch.load('projector.pth', map_location='cpu'))

sam.eval(); clip.eval(); projector.eval()  # inference mode

# Run the encoder (see the composition sketch below)
vision_tokens = encode(image)  # [1, 256, 1280]
```
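`encode` above is a placeholder; this directory does not define it. Below is a hypothetical composition of the three stages that simply mirrors the Architecture diagram. The exact plumbing (what `sam` returns, and how the projector's 2048-dim input is formed from the CLIP features) is an assumption to verify against the DeepSeek-OCR source.

```python
@torch.no_grad()
def encode(image):
    """Hypothetical sketch; verify each stage's real interface in DeepSeek-OCR."""
    # image: [1, 3, 1024, 1024], already resized and normalized
    sam_feats = sam(image)        # SAM stage (16x conv compression per the diagram)
    clip_feats = clip(sam_feats)  # CLIP-Large stage over the compressed features
    # Assumption: the projector's input_dim of 2048 suggests SAM and CLIP
    # features are combined before projection -- check the original source.
    return projector(clip_feats)  # -> [1, 256, 1280] vision tokens
```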
## Training

These weights are:

- Initialized from pretrained SAM (SA-1B) + CLIP (LAION-2B)
- Fine-tuned together on optical compression/OCR tasks
- Optimized for text preservation in compressed form

## Source

Extracted from: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)