Mobile Actions SFT Dataset

A converted version of the google/mobile-actions dataset for supervised fine-tuning (SFT) of Qwen models with tool-calling capabilities.

Dataset Description

This dataset is derived from the google/mobile-actions dataset, which contains human-AI conversations about performing actions on mobile devices. Each conversation has been rendered with the Qwen chat template, so every sample is a single string that can be used to train Qwen models without further formatting.

Conversion Process

The conversion is performed using the convert_mobile_actions.py script, which:

  1. Reads the original JSONL format from /data/datasets/mobile-actions/dataset.jsonl
  2. Reformats the tools and messages to match Qwen's expected format
  3. Applies the Qwen chat template using transformers.AutoTokenizer.apply_chat_template()
  4. Saves the result as a Parquet file with source and text columns
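The steps above can be sketched roughly as follows. This is not the actual convert_mobile_actions.py, which is not reproduced in this card. The function name `convert_lines` is hypothetical, the reformatting in step 2 is left out because the card does not spell it out, and `tokenizer` is assumed to be any object exposing `apply_chat_template` (for example one loaded with `AutoTokenizer.from_pretrained`).

```python
import json
from typing import Any, Dict, Iterable, List, Optional


def convert_lines(
    lines: Iterable[str],
    tokenizer: Any,
    max_samples: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Turn JSONL lines into rows with `source` and `text` columns.

    Assumption: each line is a JSON object with `messages` and `tools` keys,
    already in a shape the tokenizer's chat template accepts.
    """
    rows: List[Dict[str, str]] = []
    for line in lines:
        if max_samples is not None and len(rows) >= max_samples:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        text = tokenizer.apply_chat_template(
            sample["messages"],
            tools=sample.get("tools"),
            tokenize=False,  # return a string rather than token ids
            add_generation_prompt=False,  # the conversation is already complete
        )
        rows.append({"source": "mobile-actions", "text": text})
    return rows
```

With pandas installed, the returned rows could then be written out with `pd.DataFrame(rows).to_parquet(...)` to produce the Parquet file from step 4.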

Dataset Structure

The dataset is stored in Parquet format with the following columns:

| Column | Type | Description |
|--------|------|-------------|
| source | string | Always 'mobile-actions' to identify the data source |
| text | string | The complete conversation formatted using Qwen's chat template |

The data is organized in a single Parquet file:

data/train-00000-of-00001.parquet

Data Format

Each sample in the original dataset contains:

  • tools: List of available tools/functions with their JSON schemas
  • messages: Conversation history with roles (developer, user, assistant)
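Because the original messages use a `developer` role while the converted text starts with a system prompt, the conversion presumably renames that role before templating. The snippet below shows one way this could look. The helper name `normalize_roles` and the `developer` to `system` mapping are assumptions, not something the card confirms.

```python
from typing import Dict, List

# Assumed mapping: the card does not state how `developer` is handled.
ROLE_MAP = {"developer": "system"}


def normalize_roles(messages: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return a copy of `messages` with roles renamed according to ROLE_MAP."""
    normalized = []
    for message in messages:
        updated = dict(message)  # shallow copy so the input is not mutated
        role = updated.get("role")
        updated["role"] = ROLE_MAP.get(role, role)  # unknown roles pass through
        normalized.append(updated)
    return normalized
```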

After conversion, the text column contains the fully formatted conversation ready for language modeling training. The format follows Qwen's tool-calling template, which includes:

  • System prompt with tool definitions
  • User query
  • Assistant response with tool calls (when applicable)
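To inspect the tool calls in a formatted sample, one option is to pull them back out of the text column. The sketch below assumes the Qwen2.5-style convention of writing each call as a JSON object inside `<tool_call>...</tool_call>` tags; other template versions may differ. `extract_tool_calls` is a hypothetical helper, not part of the conversion script.

```python
import json
import re
from typing import Any, Dict, List

# Non-greedy match so each <tool_call> block is captured separately.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return every JSON payload found inside <tool_call> tags in `text`."""
    calls: List[Dict[str, Any]] = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            # The system prompt may contain a non-JSON usage example
            # inside the same tags; skip anything that does not parse.
            continue
    return calls
```

Skipping unparseable payloads is a deliberate choice here: it keeps the instructional placeholder in the system prompt from being reported as a real call.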

Usage Example

import pandas as pd

# Load the dataset
df = pd.read_parquet('/data/datasets/mobile-actions-sft/data/train-00000-of-00001.parquet')

# Sample text
sample_text = df.iloc[0]['text']
print(sample_text[:500])  # Print first 500 characters

# For training with TRL SFTTrainer
# The dataset is in standard language modeling format: {"text": "full_conversation"}

Training with TRL

This dataset is in the standard language modeling format as defined in the TRL documentation:

  • Type: Language modeling
  • Format: Standard (plain text strings)
  • Expected column: text, with one string per row, for example {"text": "The sky is blue."}

It can be used directly with SFTTrainer for supervised fine-tuning of Qwen models.
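A minimal training sketch might look like the following. It assumes recent versions of `datasets` and `trl` are installed, and that default `SFTConfig` settings are acceptable; the function name and its parameters are hypothetical, and hyperparameters are deliberately left at their defaults rather than guessed.

```python
def train_on_mobile_actions(parquet_path: str, model_path: str, output_dir: str) -> None:
    """Fine-tune a causal LM on the `text` column with TRL's SFTTrainer."""
    # Imported lazily so this function can be defined without the training stack.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset("parquet", data_files=parquet_path, split="train")
    # Keep only the training text; `source` is metadata, not model input.
    dataset = dataset.select_columns(["text"])

    # SFTTrainer reads the `text` column by default for language-modeling data.
    config = SFTConfig(output_dir=output_dir)
    trainer = SFTTrainer(model=model_path, args=config, train_dataset=dataset)
    trainer.train()
    trainer.save_model(output_dir)
```

Passing the model as a path string lets the trainer load both the model and its tokenizer; a preloaded model object could be passed instead.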

Original Dataset Information

  • Name: google/mobile-actions
  • Description: Human-AI conversations about performing actions on mobile devices
  • Size: ~100k conversations
  • Tasks: Tool calling, function calling, mobile assistant
  • License: Apache 2.0

Conversion Script

The conversion script, convert_mobile_actions.py, is in the project root. An example invocation showing its key parameters:

python convert_mobile_actions.py \
  --input_path /data/datasets/mobile-actions/dataset.jsonl \
  --output_dir /data/datasets/mobile-actions-sft/data \
  --model_path /data/models/Qwen2.5-0.5B-Instruct \
  --max_samples 1000  # Optional: limit for testing
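For readers who want to reproduce a similar command-line interface, the flags shown above could be declared with `argparse` along these lines. The flag names come from the command above; which flags are required, their help strings, and the function names are assumptions, since the script itself is not reproduced here.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the four flags used in the example invocation."""
    parser = argparse.ArgumentParser(
        description="Convert mobile-actions JSONL to a Qwen chat-template Parquet file."
    )
    parser.add_argument("--input_path", required=True, help="Path to the source JSONL file.")
    parser.add_argument("--output_dir", required=True, help="Directory for the Parquet output.")
    parser.add_argument("--model_path", required=True, help="Model directory that provides the tokenizer.")
    parser.add_argument(
        "--max_samples",
        type=int,
        default=None,  # None means: convert every sample
        help="Optional cap on the number of samples, useful for testing.",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or sys.argv[1:] when `argv` is None."""
    return build_parser().parse_args(argv)
```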

Citation

If you use this dataset, please cite the original work:

@inproceedings{shah2024mobile,
  title={Mobile-actions: A dataset for instruction-based mobile UI navigation},
  author={Shah, Pratyush and Dhekane, Eshaan and Gholami, Saghar and Narayan, Apurva and Wang, Bing and Narayanan, Vijay},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={--},
  year={2024}
}

License

This converted dataset inherits the Apache 2.0 license from the original google/mobile-actions dataset.

Contact

For questions about the conversion process, refer to the convert_mobile_actions.py script documentation.
