Fixing line break & Critical feedback on missing features (OD/Segmentation)
Why did you close the previous discussion? I thought I'd left a blank line in the middle :)
Of course, you can just insert \n if you want to move the model's output to a new line.
Also, please use an example image URL in the code, not something local like a folder path.
Add flash-attn to the pip install instructions, and also make the code runnable without it.
No, seriously, document the "eager" attention type in the README; I couldn't get the model to run with FlashAttention 2. Support for Transformers v5 would also be great, because, if I understand correctly, the model can already be run there with the "sdpa" attention type.
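A minimal sketch of the fallback being asked for here: pick the attention implementation at load time depending on whether the flash_attn package is actually installed. (The repo ID in the commented usage is a placeholder, not the real model name; check the model card for the correct one.)

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Prefer FlashAttention 2 when the flash_attn package is installed,
    otherwise fall back to the always-available "eager" implementation."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"

# Hypothetical usage with Transformers (repo ID is a placeholder):
# model = AutoModelForCausalLM.from_pretrained(
#     "tencent/Youtu-VL",  # placeholder, not the real repo name
#     attn_implementation=pick_attn_implementation(),
#     torch_dtype="auto",
# )
print(pick_attn_implementation())
```

This way a README snippet runs out of the box on machines where flash-attn refuses to build (e.g. the Blackwell case mentioned below).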
I spent THE ENTIRE NIGHT and THE WHOLE EVENING trying to understand that, at the moment, this is just a regular VL model. Okay, it's a powerful 4B model on par with Qwen 3 4B VL, but... where is the Object Detection? The Semantic Segmentation? And everything else that has benchmarks? The model exists, the benchmarks exist, and yet you're left to sit and guess how to actually use them.
Initially, I thought it required FlashAttention 2. Then, after several attempts to run it in what turned out to be a bad Docker image, I ran the GGUF version, though I had doubts that GGUF offers the same functionality as the Transformers version... On a new instance, I tried to install FlashAttention 2, but on Blackwell it builds forever, so I decided to run without it. It started working, but even after looking at the LLM code implementation and the pull request in Transformers, it was clear that I wouldn't be able to do anything with it right now and would only be reverse-engineering it. Yeah, the GitHub TODO list is a joy. There's no pull request for vLLM yet, but what was the problem with releasing:
- Release recipes for various tasks
- Release evaluation code

from the start, instead of putting them in the TODO list below vLLM?
When I saw the description on Hugging Face, I was amazed that something like this even exists; now I can only take it on faith.
Of course, the fact that I spent so long trying and failing to run it is my own fault as an inexperienced user who didn't know whether flash attention is mandatory. Still, Transformers usually has full, first-class support, unlike GGUF, but here that turned out not to be the case. Or maybe I just imagined it or missed something.
I understand that Chinese New Year is around the corner, and the author of the commit on GitHub (w1oves) might be resting at the moment, or working on it where I, or the community, can't see.
It's a pity I didn't notice the model license from the very beginning: it's essentially MIT (though I might be wrong; in any case, basically free), yet access is prohibited in the European Union... And doesn't Europe itself restrict working with Chinese AI? w1oves' GitHub avatar shows Paris; is that hatred for Europe, or just a joke? :) Although, okay, I see other licenses exclude more than just Europe, fine.
I see they made an HF Space for Youtu-Parsing... Anyway, if for some reason a ZeroGPU Space under the company account wasn't feasible (I don't know; there was no company account), they could have published a Space under a user's name.
Yes, I noticed a response to a pull request in Youtu-Parsing saying that an SDK is required for the models and that full Transformers support isn't there yet... WHY?? THEN WHY did you release these models publicly? To create hype?
I honestly don't know; maybe there's another company that could release a similar model faster, and they want to be first, but... at what cost?
So I had to do this kind of "research" to figure out whether the model is half-baked or really as great as the README describes, but... NOWHERE!!!! NOWHERE IS IT STATED THAT, FOR NOW, THIS IS ESSENTIALLY JUST A VL MODEL AND THAT FULL SUPPORT IS COMING LATER!
Essentially, this is misleading, although you yourselves admit that full support isn't there yet.
In any case, thank you for this and other models from Tencent, I look forward to the opportunity to try this model at its full potential.
I've run into the same problem as you. It seems that this model doesn't achieve functions like object detection and pose estimation.
Actually, I've done some digging and found that the model is capable of these tasks right now, but the documentation here is simply non-existent. You have to use the exact prompt formats hidden in their arXiv paper (2601.19798).
Hello,
Thank you for your detailed feedback and for the significant time you spent exploring our work. We genuinely appreciate the effort you put into testing the models.
To address your concerns, we would like to clarify the capabilities of Youtu-VL and Youtu-Parsing.
1. Regarding Youtu-VL Capabilities
Youtu-VL does feature robust vision-centric capabilities. You can find specific details and prompt formats in our technical report and its appendix.
To help you verify these performance metrics quickly, we have provided a Jupyter notebook:
https://github.com/TencentCloudADP/youtu-vl/blob/main/demo/demo.ipynb
In this notebook, we demonstrate most of the model's vision-centric functionality:
- Task: Grounding
  Prompt: Please provide the bounding box coordinate of the region this sentence describes: a black and white cat sitting on the edge of the bathtub
- Task: Object Detection
  Prompt: Detect all objects in the provided image.
- Task: Referring Segmentation
  Prompt: Can you segment "hotdog on left" in this image?
For more examples, please refer to the paper and the Jupyter notebook.
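As a rough illustration of how these prompts can be wrapped and their answers post-processed, here is a small sketch. The exact textual output format is model-specific; the regex below assumes a hypothetical `(x1,y1),(x2,y2)` style answer, so please check the notebook for the real format before relying on it:

```python
import re

def build_grounding_prompt(description: str) -> str:
    # Grounding prompt format, as listed above
    return ("Please provide the bounding box coordinate of the region "
            f"this sentence describes: {description}")

def parse_box(answer: str) -> "tuple[int, int, int, int] | None":
    """Extract a single (x1, y1, x2, y2) box from a model answer.
    Assumes a hypothetical '(x1,y1),(x2,y2)' textual format."""
    m = re.search(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)", answer)
    if m is None:
        return None
    x1, y1, x2, y2 = map(int, m.groups())
    return (x1, y1, x2, y2)

prompt = build_grounding_prompt(
    "a black and white cat sitting on the edge of the bathtub")
print(prompt)
print(parse_box("The cat is at (120,45),(560,700)."))
```

Wrapping the prompts like this makes it easy to batch grounding queries and to draw the returned boxes on the input image.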
2. Regarding Youtu-Parsing and the SDK
Thank you for your interest in the Youtu-Parsing project. To clarify, Youtu-Parsing is an OCR engineering effort that reuses and builds upon the research from Youtu-VL, specifically tailored for document parsing tasks.
You've rightly pointed out that while the core model is available, a complete end-to-end experience requires an SDK. Here’s why:
The Youtu-Parsing model serves as the core engine, but to handle real-world documents effectively, it relies on a full processing pipeline:
- Pre-processing: Prepares input images through normalization and scaling to ensure optimal data input.
- Multi-task orchestration: Coordinates tasks like layout analysis, text recognition, and structural understanding in a unified workflow.
- Post-processing: Assembles raw model outputs into a practical format—such as Markdown—for ease of use.
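The three stages above could be sketched roughly as follows. This is illustrative only; none of these function names come from the actual SDK, which is the authoritative implementation:

```python
def preprocess(image_size: "tuple[int, int]", target: int = 1024) -> float:
    """Pre-processing sketch: compute a scale factor so the longer
    image side matches the target resolution before inference."""
    w, h = image_size
    return target / max(w, h)

def postprocess(blocks: "list[dict]") -> str:
    """Post-processing sketch: assemble raw per-region outputs
    (e.g. from layout analysis + text recognition) into Markdown."""
    lines = []
    for b in blocks:  # each block: {"type": ..., "text": ...}
        if b["type"] == "title":
            lines.append(f"# {b['text']}")
        elif b["type"] == "paragraph":
            lines.append(b["text"])
    return "\n\n".join(lines)

# Multi-task orchestration would run layout analysis first, feed each
# detected region to the recognizer, then call postprocess() on the results.
demo = [{"type": "title", "text": "Invoice"},
        {"type": "paragraph", "text": "Total: 42 EUR"}]
print(postprocess(demo))
```

The sketch only shows why the pipeline matters: the core model produces raw region-level outputs, and it is the surrounding stages that turn them into a usable document.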
While the core model can be run directly via the Transformers library for basic inference, to fully utilize Youtu-Parsing’s document parsing capabilities—including the complete pipeline with pre- and post-processing—we recommend using the SDK available on GitHub:
https://github.com/TencentCloudADP/youtu-parsing
We truly appreciate your interest and hope this clarifies the current scope and roadmap for Youtu-Parsing.
We are actively working to improve our documentation.
Thank you again for your interest and your rigorous testing. Please stay tuned for our latest progress!
I've run into the same problem as you. It seems that this model doesn't achieve functions like object detection and pose estimation.
Hello! We noticed that you raised a similar issue on GitHub. We have just updated a new tutorial notebook, which you can find here: https://github.com/TencentCloudADP/youtu-vl/blob/main/demo/demo.ipynb
Could you please let us know if the updated information resolves your problem? Feel free to reach out if you have any further questions!
Hello,
@zhixiangwei Thank you for the response and for providing the Jupyter notebook! It definitely clears things up and makes the model much more usable.
I’ve shared this update with the community on Reddit, where the discussion about this release has already gathered over 35k views and significant interest: https://www.reddit.com/r/LocalLLaMA/comments/1qw8ord/why_do_companies_release_sota_models_when_the/
I strongly suggest updating the Hugging Face README to address all the issues mentioned both here and in that thread. It needs proper prompt examples, fixed code snippets, and clear environment requirements.
Looking forward to the project's progress!
Hello,
@zhixiangwei Just one more observation: because the core CV features were effectively undocumented on Hugging Face until now, the public perception of the model is currently very limited.
If you look at the most popular community threads on Reddit (e.g., on r/LocalLLaMA and on r/comfyui), you'll see that people are only testing the basic VL chat/description capabilities. No one is showcasing the model's unique vision-centric features because they simply couldn't find how to run them.
Even on YouTube, the only available content consists of:
- Two videos reviewing the Arxiv paper.
- Two tutorials showing only the standard VL description tasks.
These creators and developers would likely be showcasing the "full power" of Youtu-VL right now if the Hugging Face README included examples for tasks beyond simple chat. This is why I strongly believe that getting the documentation in order is critical for the model's adoption.
https://www.youtube.com/watch?v=ETvq66_STE0
https://www.youtube.com/watch?v=gbqJ8aGHeqI
https://www.youtube.com/watch?v=cPtHF1h0tQ8
https://www.youtube.com/watch?v=Sk5VcXJ2coI


