|
|
--- |
|
|
license: llama3 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- lmms-lab/llama3-llava-next-8b |
|
|
- CowCorpus/CowCorpus-llama3-llava-next-8b |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- text-generation |
|
|
- agent |
|
|
- cowcorpus |
|
|
- llava |
|
|
- personalization |
|
|
- user-adaptation |
|
|
metrics: |
|
|
- accuracy |
|
|
- f1 |
|
|
- perfect-timing-score |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# Model Card for CowCorpus/Cluster0-Collaborative-Llava |
|
|
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
This model is a **specialized fine-tune** of the general [CowCorpus-Llava](https://huggingface.co/CowCorpus/CowCorpus-llama3-llava-next-8b) model. |
|
|
|
|
|
It was specifically further fine-tuned on **Cluster 0 - Collaborative User** data from the **CowCorpus** dataset to adapt to the specific intervention preferences and behavioral patterns of this user group. |
|
|
|
|
|
This model is designed for the task of **Human Intervention Prediction** in collaborative web navigation. Unlike standard autonomous agents, |
|
|
this model predicts *when* **Collaborative** user (Cluster 0) needs to take control from an AI agent. It utilizes multimodal inputs (screenshots, DOM trees, and action history) |
|
|
to distinguish between safe autonomous execution and moments requiring human error correction, preference alignment, or assistance. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
- **Developed by:** CowCorpus Team (Huq et al.) |
|
|
- **Model type:** Multimodal Causal Language Model |
|
|
- **Parent Model:** [CowCorpus/CowCorpus-llama3-llava-next-8b](https://huggingface.co/CowCorpus/CowCorpus-llama3-llava-next-8b) |
|
|
- **Base model:** [lmms-lab/llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b) |
|
|
- **Language:** English |
|
|
- **License:** [Llama 3 Community License Agreement](https://www.llama.com/llama3/license/) |
|
|
- **Paper:** *Modeling Distinct Human Interaction in Web Agents* |
|
|
- **Repository:** [GitHub: oaishi/CowCorpus](https://github.com/oaishi/CowCorpus) |
|
|
|
|
|
### Input Data |
|
|
The model is trained on a rich, multimodal state representation: |
|
|
1. **Visual Screenshot:** The pixel-level view of the current webpage. |
|
|
2. **UI Structure (AX Tree):** The accessibility tree (textual representation of DOM). |
|
|
3. **Past Trajectory:** The history of actions taken by the agent/human so far. |
|
|
4. **Proposed Next Action:** The action that the autonomous agent *intends* to take. The model evaluates if this intent is erroneous. |
|
|
|
|
|
## How to Get Started |
|
|
|
|
|
For inference code, prompt templates, and setup instructions, please refer to our [GitHub Repository](https://github.com/oaishi/CowCorpus). |
|
|
|
|
|
### Training Data |
|
|
|
|
|
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
The model underwent a two-stage training process: |
|
|
1. **Stage 1 (General Adaptation):** Fine-tuned on the complete CowCorpus dataset. |
|
|
2. **Stage 2 (User Personalization):** Further fine-tuned on the **User Cluster 0 subset** of CowCorpus, consists of 101 trajectories and 793 steps. |
|
|
|
|
|
**User Cluster 0 Characteristics:** |
|
|
* **Data Source:** A subset of the collaborative trajectories specific to User Group 0. |
|
|
* **Behavioral Profile:** Collaborative user, interact with rare, modest interventions, usually later in the task, with a strong tendency to hand control back to the agent. |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
- **Hyperparameters:** |
|
|
- Learning Rate: Linear decay from 1e-5 to ~2e-9 |
|
|
- Epochs: 6 |
|
|
- Global Steps: 120 |
|
|
- Batch Size: 1 |
|
|
- Precision: bfloat16 |
|
|
|
|
|
## Evaluation: Cross-Cluster Personalization |
|
|
|
|
|
We evaluate the model using the **Perfect Timing Score (PTS)**, a metric designed to measure the temporal accuracy of intervention predictions. |
|
|
|
|
|
Because this is a personalized model, we report **Cross-Cluster PTS**. This measures how well the model (trained on Cluster 0) performs on its own test data versus test data from other user clusters. |
|
|
High performance on the diagonal (matching train/test groups) indicates successful personalization. |
|
|
|
|
|
### Cross-Cluster PTS Heatmap |
|
|
|
|
|
*The table below displays the PTS values. Rows represent the User Cluster the model was trained on, and Columns represent the User Cluster data it was tested on.* |
|
|
|
|
|
| Trained On (Model) | Tested On: **Collaborative (User 0)** | Tested On: Hands-on (User 2) | Tested On: Takeover (User 3) | |
|
|
| :--- | :---: | :---: | :---: | |
|
|
| Collaborative | **0.187** | 0.130 | 0.058 | |
|
|
| Hands-on | 0.417 | **0.583** | 0.468 | |
|
|
| Takeover | 0.000 | **0.027** | 0.009 | |
|
|
|
|
|
*Note: All models are evaluated in a zero-shot setting without reasoning.* |
|
|
|
|
|
## Citation [optional] |
|
|
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
If you use this model or dataset, please cite our work: Paper incoming |