Robotics · Transformers · Safetensors · qwen2_5_vl · image-to-text · text-generation-inference
jan-hq committed (verified) · Commit 57ff4e4 · 1 Parent(s): 6cb08fc

Update README.md

Files changed (1): README.md (+61 −1)
* Model type: Qwen 2.5 3B Instruct, fine-tuned for hand pose estimation
* License: Apache-2.0 license

## How to Get Started

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# 1. Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "path/to/qwen2.5_vl/checkpoint-1500/"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().to(device)

processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,
    trust_remote_code=True,
)

# 2. Prepare your image
image = Image.open("your_hand_image.png").convert("RGB")

# 3. Create messages
messages = [
    {"role": "system", "content": "You are a specialized Vision Language Model designed to accurately estimate joint angles from hand pose images..."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
                "min_pixels": 1003520,
                "max_pixels": 1003520,
            },
            {"type": "text", "text": "<Pose>"},
        ],
    },
]

# 4. Process and get predictions
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(device)

# 5. Generate output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output_text)  # The joint angles in XML format
```

The output will be joint angles in radians in XML format:
```xml
<lh_WRJ2>angle</lh_WRJ2><lh_WRJ1>angle</lh_WRJ1><lh_FFJ4>angle</lh_FFJ4>...
```
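The model returns the angles as a raw tag-per-joint string. If you want them as numbers for downstream use, a minimal parsing sketch could look like the following (the `parse_joint_angles` helper and the sample values are illustrative, not part of the model's API):

```python
import re

def parse_joint_angles(output_text: str) -> dict:
    """Parse '<name>value</name>' pairs into {joint_name: angle_in_radians}."""
    # Group 1 captures the tag name, group 2 the numeric value;
    # the \1 backreference requires a matching closing tag.
    pattern = re.compile(r"<(\w+)>\s*(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*</\1>")
    return {name: float(value) for name, value in pattern.findall(output_text)}

# Illustrative sample output (the angle values here are made up)
sample = "<lh_WRJ2>0.12</lh_WRJ2><lh_WRJ1>-0.05</lh_WRJ1><lh_FFJ4>0.30</lh_FFJ4>"
angles = parse_joint_angles(sample)
print(angles)  # {'lh_WRJ2': 0.12, 'lh_WRJ1': -0.05, 'lh_FFJ4': 0.3}
```

Malformed or unmatched tags are simply skipped by the regex, so a truncated generation degrades to a partial dict rather than an exception.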

## Citation
BibTeX: []