---
license: apache-2.0
pipeline_tag: keypoint-detection
---
# Talking Points: Describing and Localizing Pixels
This repository contains the official implementation of the Talking Points framework, presented in the paper *Talking Points: Describing and Localizing Pixels*.
Vision-language models have achieved remarkable success in cross-modal understanding, but often remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. Talking Points introduces a novel framework for pixel-level grounding, consisting of two complementary components:
- A Point Descriptor that generates rich, contextual descriptions of individual keypoints.
- A Point Localizer that regresses precise pixel coordinates from these descriptions.
Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context.
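The two-stage describe-then-localize pipeline can be sketched as follows. This is a minimal illustration only: the class names mirror the components named above, but the stub methods, signatures, and returned values are all hypothetical stand-ins for the actual vision-language models, which are available in the official repository.

```python
from dataclasses import dataclass


@dataclass
class Keypoint:
    """A pixel-precise keypoint in image coordinates."""
    x: float
    y: float


class PointDescriptor:
    """Stand-in for the Point Descriptor: maps an (image, keypoint) pair
    to a free-form, coarse-to-fine natural-language description that
    situates the point within its visual context."""

    def describe(self, image, keypoint: Keypoint) -> str:
        # A real model conditions on visual features; this stub returns
        # a fixed example of a coarse-to-fine description.
        return ("On the animal in the upper-left region, "
                "the point at the tip of the left ear.")


class PointLocalizer:
    """Stand-in for the Point Localizer: regresses pixel coordinates
    from a keypoint description and a target image."""

    def localize(self, image, description: str) -> Keypoint:
        # A real model regresses coordinates from the description;
        # this stub returns a fixed point for illustration.
        return Keypoint(x=128.0, y=96.0)


# Round-trip: describe a keypoint in a source image, then recover its
# location from the description alone.
image = None  # placeholder for an actual image array/tensor
descriptor, localizer = PointDescriptor(), PointLocalizer()

description = descriptor.describe(image, Keypoint(130.0, 95.0))
predicted = localizer.localize(image, description)
print(description)
print((predicted.x, predicted.y))
```

Because the localizer consumes only the free-form description (not the original coordinates), the same description can in principle be used to locate the corresponding point in a different image of the same object.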
For more detailed information, including installation instructions, dataset creation, training scripts, and evaluation protocols, please refer to the official GitHub repository.
## Acknowledgments
This repository builds upon and incorporates code from OMG-Seg and OMG-LLaVA, and also uses code from LLaVA.
## License
This project is released under the Apache-2.0 license, in keeping with the licenses of the LLaVA and XTuner codebases.