Update README.md
README.md CHANGED
---
license: mit
pipeline_tag: object-detection
---

# Selective Contrastive Learning for Weakly Supervised Affordance Grounding (ICCV 2025)
WonJun Moon*, Hyun Seok Seong*, Jae-Pil Heo (*: equal contribution)

[[Arxiv](https://arxiv.org/abs/2508.07877)] [[Github](https://github.com/hynnsk/SelectiveCL)]
(Code will be released soon.)

## Abstract
> Facilitating an entity's interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating a part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common class-specific patterns that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, by cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method.
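
Since the official code is not yet released, the snippet below is only a minimal, hypothetical sketch of the first step described in the abstract: using an off-the-shelf CLIP model to score image regions against an action prompt as a coarse cue for action-associated objects. The model name, prompt, and image path are illustrative assumptions, and projecting CLIP patch tokens into the text space like this is only a rough approximation, not the paper's actual pipeline.

```python
# Rough sketch (not the authors' implementation): patch-level CLIP similarity
# to an action prompt, as a coarse proxy for action-associated object discovery.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("egocentric_example.jpg")  # hypothetical input image
inputs = processor(
    text=["a photo of an object to hold"],    # illustrative action prompt
    images=image, return_tensors="pt", padding=True,
)

with torch.no_grad():
    # Patch tokens from the CLIP vision encoder (drop the class token),
    # projected into the shared image-text embedding space.
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patch_tokens = vision_out.last_hidden_state[:, 1:, :]
    patch_feats = model.visual_projection(model.vision_model.post_layernorm(patch_tokens))
    text_feats = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity between each patch and the prompt -> coarse relevance map.
patch_feats = patch_feats / patch_feats.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
scores = (patch_feats @ text_feats.T).squeeze(-1)   # (1, num_patches)
side = int(scores.shape[-1] ** 0.5)                 # 14x14 grid for ViT-B/16 at 224px
heatmap = scores.reshape(1, side, side)
```

In the paper, such object-level cues from the egocentric and exocentric views are cross-referenced to mine part-level affordance clues that drive the selective prototypical and pixel contrastive objectives; that step is not reproduced here.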