12.7 GB
12 files
Updated 18 days ago
Name                       Size
preview/
viewer/
.gitattributes             2.5 kB
README.md                  4.74 kB
audio.zip                  12.7 GB
binary_classification.zip  60.3 kB
dataset_manifest.json      347 Bytes
sft.zip                    4.39 MB
README.md

TeleAntiFraud

Sanitized public release of the TeleAntiFraud audio-text fraud detection dataset.

This repository contains public metadata splits, audio archives, and a small preview set for quick inspection on the dataset page.

License

Copyright 2025 Zhiming Ma. All rights reserved.

Licensed under the Apache License, Version 2.0.

Overview

TeleAntiFraud is a Chinese audio-text fraud detection dataset designed for:

  • binary fraud detection from call audio
  • multi-turn audio-text instruction tuning
  • speech understanding and fraud-risk reasoning

The public release removes machine-specific paths from the original research environment and normalizes audio references to relative paths.

Contents

  • binary_classification.zip
    • train.json: 4,000 binary fraud classification samples
    • test.json: 400 binary fraud classification samples
  • sft.zip
    • train.jsonl: 27,146 multi-turn SFT samples
    • test.jsonl: 6,807 multi-turn SFT samples
  • audio.zip
    • referenced audio files normalized under audio/...
  • dataset_manifest.json
  • preview/
    • a few small MP3 examples for quick listening on the Hub page
  • viewer/
    • lightweight parquet files used by the Hugging Face dataset viewer
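As a minimal sketch of unpacking the archives listed above, the following uses only the Python standard library. The archive names come from the Contents list; the `download_dir` and `out_dir` arguments and the `extract_archives` helper are hypothetical, and the internal folder layout of each zip is an assumption that should be checked after extraction.

```python
import zipfile
from pathlib import Path

# Archive names are taken from the Contents list above.
ARCHIVES = ("binary_classification.zip", "sft.zip", "audio.zip")


def extract_archives(download_dir, out_dir):
    """Extract each known archive found in download_dir into out_dir.

    Returns the names of the archives that were actually extracted.
    """
    download_dir, out_dir = Path(download_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in ARCHIVES:
        archive = download_dir / name
        if not archive.is_file():
            continue  # archive not downloaded yet, skip it
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_dir)
        extracted.append(name)
    return extracted
```

Extracting everything into one directory is meant to keep the relative `audio/...` references resolvable from a single root, assuming `audio.zip` unpacks to a top-level `audio/` folder.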

Splits

Package                    File         Samples  Description
binary_classification.zip  train.json   4,000    binary call-level fraud classification
binary_classification.zip  test.json    400      binary call-level fraud classification
sft.zip                    train.jsonl  27,146   multi-turn SFT data with audio-grounded prompts
sft.zip                    test.jsonl   6,807    multi-turn SFT data with audio-grounded prompts

Schema Summary

Binary classification

Each sample keeps a prompt-style structure and a label:

{
  "prompt": [
    {
      "role": "system",
      "content": "..."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "audio",
          "audio_url": "audio/..."
        },
        {
          "type": "text",
          "text": "..."
        }
      ]
    }
  ],
  "answer": "fraud"
}
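
The structure above can be unpacked with a few lines of Python. This is a sketch that assumes every sample follows the schema shown; `parse_binary_sample` is a hypothetical helper name, and the example record reuses the placeholder values from the schema rather than real data.

```python
import json


def parse_binary_sample(sample):
    """Return (audio_path, instruction_text, label) for one sample."""
    audio_path, text = None, None
    for message in sample["prompt"]:
        content = message["content"]
        # The system message carries a plain string; only the user
        # message holds the list of audio/text parts.
        if message["role"] != "user" or not isinstance(content, list):
            continue
        for part in content:
            if part.get("type") == "audio":
                audio_path = part.get("audio_url")
            elif part.get("type") == "text":
                text = part.get("text")
    return audio_path, text, sample.get("answer")


# Placeholder record copied from the schema summary, not real data.
example = json.loads("""
{
  "prompt": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": [
      {"type": "audio", "audio_url": "audio/..."},
      {"type": "text", "text": "..."}
    ]}
  ],
  "answer": "fraud"
}
""")

print(parse_binary_sample(example))
```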

SFT

Each line in train.jsonl or test.jsonl is a JSON object containing multi-turn messages and audio-grounded prompts for scene understanding, fraud judgment, and related reasoning tasks.
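
Because the SFT files are line-delimited JSON, they can be streamed one record at a time instead of being loaded whole. The sketch below assumes UTF-8 encoding; `iter_jsonl` is a hypothetical helper, and it makes no assumption about the field names inside each record, since the README does not list them.

```python
import json


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad record is easy to find.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
```

For example, `sum(1 for _ in iter_jsonl("train.jsonl"))` counts the records without holding them all in memory.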

Preview

Small preview files are provided for direct listening without downloading the full audio.zip.

Example              Label   Audio  Notes
normal_example.mp3   normal  link   binary classification sample
fraud_example_1.mp3  fraud   link   binary classification sample
fraud_example_2.mp3  fraud   link   binary classification sample

Preview metadata is also available in preview/preview_samples.json.

Viewer Support

The Hugging Face dataset viewer is configured with lightweight parquet files in viewer/train.parquet and viewer/test.parquet. These files expose a stable preview table with:

  • id
  • task
  • audio_path
  • instruction
  • label

Sanitization

  • Absolute local paths from the original research environment were removed.
  • Audio references were normalized to relative paths under audio/.
  • The original field structure was kept whenever possible to avoid breaking downstream scripts.
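
A downstream script may want to confirm that an audio reference follows the normalized form described above. The check below is a sketch of one way to do that; `is_normalized_audio_ref` is a hypothetical helper, and the exact rules (relative, first component `audio`, no `..` components) are an interpretation of the sanitization notes, not a published specification.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_normalized_audio_ref(ref):
    """Return True if ref is a relative path located under audio/."""
    # Reject absolute paths in either POSIX or Windows form.
    if PurePosixPath(ref).is_absolute() or PureWindowsPath(ref).is_absolute():
        return False
    parts = PurePosixPath(ref).parts
    # Require audio/<something> and forbid parent-directory escapes.
    return len(parts) >= 2 and parts[0] == "audio" and ".." not in parts
```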

Usage Notes

  • This release is packaged as zip archives to make distribution of the audio assets more manageable.
  • Audio references inside JSON / JSONL files are relative paths, not absolute local paths.
  • If you unpack audio.zip, the metadata files can be used directly with the normalized audio/... paths.
  • For project code and evaluation scripts, see the GitHub repository below.
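
To illustrate the notes above, the following sketch resolves relative audio references against the directory where `audio.zip` was unpacked and reports any that do not exist on disk. The `missing_audio` helper and the `root` argument are hypothetical; `root` is assumed to be the folder that contains the extracted `audio/` directory.

```python
from pathlib import Path


def missing_audio(refs, root):
    """Return the references that do not resolve to a file under root."""
    root = Path(root)
    return [ref for ref in refs if not (root / ref).is_file()]
```

An empty result suggests the metadata and the extracted audio line up; a non-empty one usually means `audio.zip` was unpacked somewhere other than `root`.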

Related Resources
