---
license: llama3
language:
- en
base_model:
- lmms-lab/llama3-llava-next-8b
- CowCorpus/CowCorpus-llama3-llava-next-8b
pipeline_tag: text-generation
tags:
- text-generation
- agent
- cowcorpus
- llava
- personalization
- user-adaptation
metrics:
- accuracy
- f1
- perfect-timing-score
library_name: transformers
---
# Model Card for CowCorpus/Cluster0-Collaborative-Llava
This model is a **specialized fine-tune** of the general [CowCorpus-Llava](https://huggingface.co/CowCorpus/CowCorpus-llama3-llava-next-8b) model.
It was specifically further fine-tuned on **Cluster 0 - Collaborative User** data from the **CowCorpus** dataset to adapt to the specific intervention preferences and behavioral patterns of this user group.
This model is designed for the task of **Human Intervention Prediction** in collaborative web navigation. Unlike standard autonomous agents,
this model predicts *when* a **Collaborative** user (Cluster 0) needs to take control from an AI agent. It uses multimodal inputs (screenshots, DOM trees, and action history)
to distinguish between safe autonomous execution and moments requiring human error correction, preference alignment, or assistance.
## Model Details
### Model Description
- **Developed by:** CowCorpus Team (Huq et al.)
- **Model type:** Multimodal Causal Language Model
- **Parent Model:** [CowCorpus/CowCorpus-llama3-llava-next-8b](https://huggingface.co/CowCorpus/CowCorpus-llama3-llava-next-8b)
- **Base model:** [lmms-lab/llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b)
- **Language:** English
- **License:** [Llama 3 Community License Agreement](https://www.llama.com/llama3/license/)
- **Paper:** *Modeling Distinct Human Interaction in Web Agents*
- **Repository:** [GitHub: oaishi/CowCorpus](https://github.com/oaishi/CowCorpus)
### Input Data
The model is trained on a rich, multimodal state representation:
1. **Visual Screenshot:** The pixel-level view of the current webpage.
2. **UI Structure (AX Tree):** The accessibility tree (a textual representation of the DOM).
3. **Past Trajectory:** The history of actions taken by the agent/human so far.
4. **Proposed Next Action:** The action that the autonomous agent *intends* to take. The model evaluates if this intent is erroneous.
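The four components above can be combined into a single textual prompt (with the screenshot passed separately as the image input). The exact template lives in the GitHub repository; the function name, field headings, and layout below are illustrative assumptions only:

```python
# Illustrative sketch only: the canonical prompt template is in the
# CowCorpus GitHub repository. All headings and names here are assumptions.

def build_intervention_prompt(ax_tree, history, proposed_action):
    """Assemble the textual part of the multimodal state.

    The screenshot is supplied separately as the image input; this covers
    the three textual components (AX tree, trajectory, proposed action).
    """
    history_block = "\n".join(
        f"{i + 1}. {step}" for i, step in enumerate(history)
    )
    return (
        "## UI Structure (AX Tree)\n"
        f"{ax_tree}\n\n"
        "## Past Trajectory\n"
        f"{history_block}\n\n"
        "## Proposed Next Action\n"
        f"{proposed_action}\n\n"
        "Should the human take control before this action executes?"
    )

prompt = build_intervention_prompt(
    ax_tree="button 'Checkout' link 'Edit cart'",
    history=["click('Add to cart')", "scroll(down)"],
    proposed_action="click('Checkout')",
)
```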
## How to Get Started
For inference code, prompt templates, and setup instructions, please refer to our [GitHub Repository](https://github.com/oaishi/CowCorpus).
### Training Data
The model underwent a two-stage training process:
1. **Stage 1 (General Adaptation):** Fine-tuned on the complete CowCorpus dataset.
2. **Stage 2 (User Personalization):** Further fine-tuned on the **User Cluster 0 subset** of CowCorpus, consisting of 101 trajectories and 793 steps.
**User Cluster 0 Characteristics:**
* **Data Source:** A subset of the collaborative trajectories specific to User Group 0.
* **Behavioral Profile:** Collaborative users who intervene rarely and modestly, usually later in the task, with a strong tendency to hand control back to the agent.
### Training Configuration
- **Hyperparameters:**
- Learning Rate: Linear decay from 1e-5 to ~2e-9
- Epochs: 6
- Global Steps: 120
- Batch Size: 1
- Precision: bfloat16
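A minimal sketch of the learning-rate schedule, interpolating linearly between the two endpoints stated above (1e-5 and ~2e-9) over the 120 global steps. The actual scheduler configuration (warmup, per-step granularity) is in the training code and may differ:

```python
# Sketch of the stated linear LR decay: 1e-5 -> ~2e-9 over 120 global steps.
# Assumption: plain linear interpolation with no warmup.
BASE_LR, FINAL_LR, TOTAL_STEPS = 1e-5, 2e-9, 120

def linear_lr(step):
    """Learning rate at a given global step (clamped to the final value)."""
    frac = min(step, TOTAL_STEPS) / TOTAL_STEPS
    return BASE_LR + frac * (FINAL_LR - BASE_LR)
```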
## Evaluation: Cross-Cluster Personalization
We evaluate the model using the **Perfect Timing Score (PTS)**, a metric designed to measure the temporal accuracy of intervention predictions.
Because this is a personalized model, we report **Cross-Cluster PTS**. This measures how well the model (trained on Cluster 0) performs on its own test data versus test data from other user clusters.
High performance on the diagonal (matching train/test groups) indicates successful personalization.
### Cross-Cluster PTS Heatmap
*The table below displays the PTS values. Rows represent the User Cluster the model was trained on, and Columns represent the User Cluster data it was tested on.*
| Trained On (Model) | Tested On: **Collaborative (User 0)** | Tested On: Hands-on (User 2) | Tested On: Takeover (User 3) |
| :--- | :---: | :---: | :---: |
| Collaborative | **0.187** | 0.130 | 0.058 |
| Hands-on | 0.417 | **0.583** | 0.468 |
| Takeover | 0.000 | **0.027** | 0.009 |
*Note: All models are evaluated in a zero-shot setting without reasoning.*
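As a quick sanity check, the heatmap can be read programmatically. The snippet below hard-codes the PTS values from the table and finds, for each personalized model, the test cluster on which it scores highest:

```python
# PTS values copied from the table above: rows = trained-on, cols = tested-on.
PTS = {
    "Collaborative": {"Collaborative": 0.187, "Hands-on": 0.130, "Takeover": 0.058},
    "Hands-on":      {"Collaborative": 0.417, "Hands-on": 0.583, "Takeover": 0.468},
    "Takeover":      {"Collaborative": 0.000, "Hands-on": 0.027, "Takeover": 0.009},
}

def best_test_cluster(trained_on):
    """Test cluster on which a given personalized model scores highest."""
    row = PTS[trained_on]
    return max(row, key=row.get)

# For the Collaborative and Hands-on models the best-scoring test cluster
# is their own training cluster (the diagonal), consistent with successful
# personalization for those groups.
```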
## Citation
If you use this model or dataset, please cite our work. Paper incoming.