Codex12/TeleAntiFraud-bucket123 / Mainbucket

12.7 GB

12 files

Updated 18 days ago

Name	Size	Uploaded	Xet hash
preview		18 days ago	4 items
viewer		18 days ago	2 items
.gitattributes	2.5 kB xet	18 days ago	738f1125
README.md	4.74 kB xet	18 days ago	cd042c24
audio.zip	12.7 GB xet	18 days ago	4d9eebb9
binary_classification.zip	60.3 kB xet	18 days ago	59d3dbd9
dataset_manifest.json	347 Bytes xet	18 days ago	0fee4ca4
sft.zip	4.39 MB xet	18 days ago	5a8d980c

README.md

TeleAntiFraud

Sanitized public release of the TeleAntiFraud audio-text fraud detection dataset.

This repository contains public metadata splits, audio archives, and a small preview set for quick inspection on the dataset page.

License

Licensed under the Apache License, Version 2.0.

Overview

TeleAntiFraud is a Chinese audio-text fraud detection dataset designed for:

binary fraud detection from call audio
multi-turn audio-text instruction tuning
speech understanding and fraud-risk reasoning

The public release removes machine-specific paths from the original research environment and normalizes audio references to relative paths.

binary_classification.zip
- train.json: 4,000 binary fraud classification samples
- test.json: 400 binary fraud classification samples
sft.zip
- train.jsonl: 27,146 multi-turn SFT samples
- test.jsonl: 6,807 multi-turn SFT samples
audio.zip
- referenced audio files normalized under audio/...
dataset_manifest.json
preview/
- a few small MP3 examples for quick listening on the Hub page
viewer/
- lightweight parquet files used by the Hugging Face dataset viewer

Splits

Package	File	Samples	Description
`binary_classification.zip`	`train.json`	4,000	binary call-level fraud classification
`binary_classification.zip`	`test.json`	400	binary call-level fraud classification
`sft.zip`	`train.jsonl`	27,146	multi-turn SFT data with audio-grounded prompts
`sft.zip`	`test.jsonl`	6,807	multi-turn SFT data with audio-grounded prompts
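After downloading the archives, the sample counts in the table above can be checked without extracting anything. The sketch below is one way to do that with the standard library. It assumes that each metadata file sits at the root of its archive under the name shown in the table and that the `.json` splits are single top-level JSON arrays; neither detail is confirmed by this README. `count_samples` and `find_count_mismatches` are hypothetical helper names.

```python
import json
import os
import zipfile

# Sample counts copied from the Splits table.
# Assumption: member names match the table and sit at the archive root.
EXPECTED_COUNTS = {
    ("binary_classification.zip", "train.json"): 4000,
    ("binary_classification.zip", "test.json"): 400,
    ("sft.zip", "train.jsonl"): 27146,
    ("sft.zip", "test.jsonl"): 6807,
}


def count_samples(zip_path, member):
    """Count the records in one metadata file without extracting the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as handle:
            if member.endswith(".jsonl"):
                # JSON Lines: one record per non-empty line.
                return sum(1 for line in handle if line.strip())
            # Assumption: a .json split is one top-level JSON array.
            return len(json.load(handle))


def find_count_mismatches(root, expected=EXPECTED_COUNTS):
    """Return {(package, member): actual_count} for every split that differs."""
    mismatches = {}
    for (package, member), want in expected.items():
        got = count_samples(os.path.join(root, package), member)
        if got != want:
            mismatches[(package, member)] = got
    return mismatches
```

Calling `find_count_mismatches` on the download directory returns an empty dictionary when every split matches the table. If the archives nest their files in a subdirectory, the member names in `EXPECTED_COUNTS` need adjusting.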

Schema Summary

Binary classification

Each sample keeps a prompt-style structure and a label:

{
  "prompt": [
    {
      "role": "system",
      "content": "..."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "audio",
          "audio_url": "audio/..."
        },
        {
          "type": "text",
          "text": "..."
        }
      ]
    }
  ],
  "answer": "fraud"
}
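
To make the structure above concrete, here is a minimal sketch that pulls the audio reference, the user text, and a numeric label out of one sample. It assumes the label strings are the two that appear in this README (`fraud` and `normal`) and that each sample carries a single audio part. `parse_binary_sample` and `LABEL_TO_ID` are hypothetical names, and the example path and prompt text are placeholders rather than real dataset entries.

```python
from typing import Any, Dict, Optional, Tuple

# Assumption: answers are one of the two label strings seen in this README.
LABEL_TO_ID = {"normal": 0, "fraud": 1}


def parse_binary_sample(sample: Dict[str, Any]) -> Tuple[Optional[str], str, int]:
    """Return (audio_path, user_text, label_id) for one binary sample."""
    audio_path = None
    text_parts = []
    for message in sample["prompt"]:
        if message["role"] != "user":
            continue  # the system turn is not needed here
        content = message["content"]
        if isinstance(content, str):
            # Tolerate a plain-string user turn as well as a list of parts.
            text_parts.append(content)
            continue
        for part in content:
            if part["type"] == "audio":
                audio_path = part["audio_url"]
            elif part["type"] == "text":
                text_parts.append(part["text"])
    # Raises KeyError on an unexpected label instead of guessing.
    return audio_path, "\n".join(text_parts), LABEL_TO_ID[sample["answer"]]


# Placeholder sample that follows the documented shape.
example = {
    "prompt": [
        {"role": "system", "content": "You are a fraud detection assistant."},
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": "audio/example.wav"},
                {"type": "text", "text": "Is this call fraudulent?"},
            ],
        },
    ],
    "answer": "fraud",
}

print(parse_binary_sample(example))
```

Failing loudly on an unknown label is a deliberate choice: if the released files use other answer strings, the mapping should be extended explicitly.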

SFT

Each line in train.jsonl or test.jsonl is a JSON object containing multi-turn messages and audio-grounded prompts for scene understanding, fraud judgment, and related reasoning tasks.
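
Because the SFT field names are not listed here, a reasonable first step is to stream the file and look at the keys of the first record. The sketch below does that with the standard library only; `iter_jsonl` and `first_record_keys` are hypothetical helper names, and UTF-8 encoding is an assumption based on the dataset being Chinese text.

```python
import json
from typing import Any, Dict, Iterator, List


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    # Assumption: the files are UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines, e.g. a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error


def first_record_keys(path: str) -> List[str]:
    """Return the sorted top-level keys of the first record, or [] if empty."""
    for record in iter_jsonl(path):
        return sorted(record)
    return []
```

Streaming line by line keeps memory use flat, which matters for the 27,146-sample training split.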

Preview

Small preview files are provided for direct listening without downloading the full audio.zip.

Example	Label	Audio	Notes
`normal_example.mp3`	`normal`	link	binary classification sample
`fraud_example_1.mp3`	`fraud`	link	binary classification sample
`fraud_example_2.mp3`	`fraud`	link	binary classification sample

Preview metadata is also available in preview/preview_samples.json.

Viewer Support

The Hugging Face dataset viewer is configured with lightweight parquet files in viewer/train.parquet and viewer/test.parquet. These files expose a stable preview table with:

id
task
audio_path
instruction
label

Sanitization

Absolute local paths from the original research environment were removed.
Audio references were normalized to relative paths under audio/.
The original field structure was kept whenever possible to avoid breaking downstream scripts.

Usage Notes

This release is packaged as zip archives to make distribution of the audio assets more manageable.
Audio references inside JSON / JSONL files are relative paths, not absolute local paths.
After you unpack audio.zip, the normalized audio/... paths in the metadata files resolve relative to the extraction directory, so no path rewriting is needed.
For project code and evaluation scripts, see the GitHub repository below.
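
The path handling described in these notes can be sketched as follows. The code assumes that unpacking audio.zip produces an `audio/` directory directly under a chosen root directory, which this README implies but does not state outright. `resolve_audio` and `find_missing_audio` are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterable, List


def resolve_audio(root: str, audio_url: str) -> Path:
    """Join a relative audio reference onto the extraction directory."""
    relative = Path(audio_url)
    # The release promises relative paths; reject anything that escapes root.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"expected a relative path under audio/, got {audio_url!r}")
    return Path(root) / relative


def find_missing_audio(root: str, audio_urls: Iterable[str]) -> List[str]:
    """Return the references that do not exist as files under root."""
    return [url for url in audio_urls if not resolve_audio(root, url).is_file()]
```

Running `find_missing_audio` over every `audio_url` in a split before training is a cheap way to confirm that the archive was unpacked where the metadata expects it.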

Related Resources

TeleAntiFraud-28k paper: https://huggingface.co/papers/2503.24115
GitHub: https://github.com/JimmyMa99/TeleAntiFraud
Evaluation scripts: https://github.com/JimmyMa99/TeleAntiFraud/tree/main/evaluation
ModelScope: https://www.modelscope.cn/datasets/JimmyMa99/TeleAntiFraud-28k
SAFE-QAQ (ACL 2026): https://arxiv.org/abs/2601.01392

Last updated: Jun 13
