license: apache-2.0
pipeline_tag: keypoint-detection

Talking Points: Describing and Localizing Pixels

This repository contains the official implementation of the Talking Points framework, presented in the paper Talking Points: Describing and Localizing Pixels.

Vision-language models have achieved remarkable success in cross-modal understanding, but often remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. Talking Points introduces a novel framework for pixel-level grounding, consisting of two complementary components:

  • A Point Descriptor that generates rich, contextual descriptions of individual keypoints.
  • A Point Localizer that regresses precise pixel coordinates from these descriptions.

Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context.
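
The two components can be pictured as a round trip: a pixel is turned into text by the descriptor, and the text is turned back into coordinates by the localizer. The sketch below only illustrates that data flow. The names `Descriptor`, `Localizer`, and `round_trip_error`, their signatures, and the toy stand-in functions are all assumptions made for illustration; they are not the repository's API, and the distance check is not the paper's evaluation protocol.

```python
import re
from math import hypot
from typing import Callable, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates

# Hypothetical interfaces, assumed for illustration only.
Descriptor = Callable[[object, Point], str]   # (image, point) -> description
Localizer = Callable[[object, str], Point]    # (image, description) -> point


def round_trip_error(image, point: Point,
                     describe: Descriptor, localize: Localizer) -> float:
    """Describe a pixel, localize it from the text, return the pixel distance."""
    description = describe(image, point)
    predicted = localize(image, description)
    return hypot(predicted[0] - point[0], predicted[1] - point[1])


# Toy stand-ins so the sketch runs without any model weights.
def toy_describe(image, point: Point) -> str:
    x, y = point
    return f"the pixel at column {round(x)}, row {round(y)}"


def toy_localize(image, description: str) -> Point:
    # Recover the two integers written by toy_describe.
    col, row = (float(n) for n in re.findall(r"\d+", description))
    return (col, row)


if __name__ == "__main__":
    print(round_trip_error(None, (12.4, 30.6), toy_describe, toy_localize))
```

In a real setting the two callables would wrap the trained descriptor and localizer models; the toy versions here exist only so the control flow can be run end to end.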

For installation instructions, dataset creation, training scripts, and evaluation protocols, see the official GitHub repository.

Acknowledgments

This repository builds on and incorporates code from OMG-Seg and OMG-LLaVA. It also uses code from LLaVA.

License

This project is released under the Apache-2.0 license, in keeping with the licenses of the LLaVA and XTuner codebases.