| --- |
| language: |
| - zh |
| - en |
| tags: |
| - sweepgpm |
| - sweepmm |
| - chatglm |
| - multimodal |
| - sweeping-robot |
| - lora |
| - blip2 |
| license: mit |
| --- |
| # SweepGPM |
|
|
| SweepGPM is a multimodal dialogue model for sweeping robots in home scenarios, fine-tuned from [VisualGLM-6B](https://github.com/THUDM/VisualGLM-6B). The language model is based on [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) (6.2B parameters) and the image encoder is [CLIP ViT-L/14](https://github.com/openai/CLIP); both are kept frozen. Only the Q-Former, the fully connected projection layer, and the LoRA adapters (rank 4, applied to the last 2 layers only) are trained, which adapts the model to the sweeping-robot domain. |
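To give a feel for how small the LoRA part of the trainable set is, the sketch below counts the parameters a rank-4 adapter adds to one linear layer. The layer shape (a fused 4096 -> 12288 query/key/value projection) and the choice of adapting exactly one such projection per layer are assumptions for illustration; this card does not state which modules the adapters target.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer.

    LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * d_in + d_out * rank


# Assumed shapes (not stated in this card): one fused QKV projection
# of 4096 -> 12288 per adapted layer.
HIDDEN = 4096
QKV_OUT = 3 * HIDDEN
RANK = 4            # rank given in this card
ADAPTED_LAYERS = 2  # "last 2 layers only"

per_layer = lora_param_count(HIDDEN, QKV_OUT, RANK)
total = per_layer * ADAPTED_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters add roughly 131 thousand parameters, which is tiny next to the 6.2B frozen language-model weights.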
|
|
|
|
| ## Performance |
|
|
| | Downstream Task | Metric | SweepGPM | |
| |----------------|--------|----------| |
| | Room Type Classification | Mean Accuracy | **84.3%** | |
| | Obstacle Detection | mAP@0.5 | **86.1%** | |
| | Lost Item Search | Mean Recall | **80.2%** | |
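For readers unfamiliar with the `mAP@0.5` metric in the table, the sketch below shows the overlap test it is built on: a predicted box counts as a match when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The box layout `(x_min, y_min, x_max, y_max)` and the helper names are assumptions for illustration; the evaluation protocol actually used for SweepGPM is not described here.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """True when the prediction overlaps the ground truth enough to count."""
    return iou(pred, truth) >= threshold


print(is_match((0, 0, 10, 10), (0, 0, 10, 6)))  # IoU is 60 / 100 = 0.6
```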
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModel |
| |
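| # Load the tokenizer and model; trust_remote_code is required for the custom model class. |
| tokenizer = AutoTokenizer.from_pretrained("bazaar-research/sweepgpm", trust_remote_code=True) |
| # Run in half precision on a CUDA GPU. |
| model = AutoModel.from_pretrained("bazaar-research/sweepgpm", trust_remote_code=True).half().cuda() |
| |
| image_path = "your_image.jpg" |
| # First turn: start with an empty history. |
| response, history = model.chat(tokenizer, image_path, "Give the room type in the image.", history=[]) |
| print(response) |
| |
| # Second turn: pass the returned history so the model keeps the conversation context. |
| response, history = model.chat(tokenizer, image_path, "Provide fine-grained bounding boxes for all objects in the image.", history=history) |
| print(response) |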
| ``` |
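The two calls above thread `history` from one turn to the next. A minimal sketch of the same pattern for an arbitrary list of prompts follows. `run_prompts` and `chat_fn` are hypothetical names introduced here, not part of the model's API; `chat_fn` is assumed to wrap `model.chat` as shown in the comment, and a stand-in is used so the sketch runs without a GPU.

```python
from typing import Callable, List, Tuple

# Hypothetical wrapper type: (image_path, prompt, history) -> (response, new_history)
ChatFn = Callable[[str, str, list], Tuple[str, list]]


def run_prompts(chat_fn: ChatFn, image_path: str, prompts: List[str]) -> Tuple[List[str], list]:
    """Ask several questions about one image, carrying history between turns."""
    history: list = []
    responses: List[str] = []
    for prompt in prompts:
        response, history = chat_fn(image_path, prompt, history)
        responses.append(response)
    return responses, history


# With the real model (assumption: same call shape as in the usage example):
#   chat_fn = lambda image, prompt, history: model.chat(tokenizer, image, prompt, history=history)

def stand_in_chat(image_path: str, prompt: str, history: list) -> Tuple[str, list]:
    """Stand-in for model.chat so the sketch runs without loading weights."""
    reply = f"turn {len(history) + 1}: {prompt}"
    return reply, history + [(prompt, reply)]


replies, final_history = run_prompts(
    stand_in_chat,
    "your_image.jpg",
    ["Give the room type in the image.", "List the obstacles in the image."],
)
for reply in replies:
    print(reply)
```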
|
|
| ## Dependencies |
|
|
| ```bash |
| pip install "SwissArmyTransformer>=0.3.6" "torch>=2.0.1" torchvision "transformers>=4.31.0" cpm_kernels "peft>=0.4.0" |
| ``` |
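After installing, a quick way to confirm the minimum versions are met is to query the installed package metadata. The sketch below uses the standard-library `importlib.metadata`; the `parse_version` helper is a simplified comparison written for this example (it only looks at leading numeric components and is not a full PEP 440 parser).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the install command above.
MINIMUMS = {
    "SwissArmyTransformer": "0.3.6",
    "torch": "2.0.1",
    "transformers": "4.31.0",
    "peft": "0.4.0",
}


def parse_version(text: str) -> tuple:
    """Leading numeric components only, e.g. '2.0.1+cu118' -> (2, 0, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check(minimums: dict) -> dict:
    """Map each package name to a short status string."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[name] = f"{installed} (ok)" if ok else f"{installed} (needs >= {minimum})"
    return report


for name, status in check(MINIMUMS).items():
    print(f"{name}: {status}")
```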
|
|