| license: mit | |
| language: | |
| - si | |
| - ta | |
| - en | |
| tags: | |
| - instruction-finetuning | |
| - reasoning | |
| - tool-use | |
| - trilingual | |
| - singlish | |
| - tanglish | |
| - chat2find | |
| pretty_name: Chat2Find Unified Reasoning & Tool Dataset (Public Sample) | |
| size_categories: | |
| - 1K<n<10K | |
| # Chat2Find Unified Reasoning & Tool Dataset (Public Sample) | |
| This repository contains a **5,000-record public preview** of the **Chat2Find Unified Reasoning & Tool Dataset**. | |
| The full dataset is a commercial instruction dataset for training conversational AI models. It contains **279,260 trilingual records** covering complex problem-solving, chain-of-thought reasoning, and multi-step tool calling in **Sinhala**, **Tamil**, and **English**. | |
| ## Access the Full Dataset | |
| The full 1.8 GB dataset is available as a **Gated Repository** for commercial and advanced research use. | |
| **[Access the Full Dataset Here](https://huggingface.co/datasets/Chat2Find/Chat2Find-Instruct-Reasoning-Dataset)** | |
| ### **How to get a license:** | |
| 1. **Purchase License:** Use our secure Stripe link to purchase a commercial/advanced research license: | |
| **[Buy Full Dataset License (Stripe)](https://buy.stripe.com/14AaEX1EjdODfblcKMawo02)** | |
| 2. **Provide Username:** During the Stripe checkout, please enter your **Hugging Face Username**. | |
| 3. **Approval:** Once payment is confirmed, we will grant your account access to the gated repository within **24 hours**. | |
| --- | |
| ## Sample Details & Composition | |
| The 5,000 records in this preview are carefully curated to reflect the high quality of the full dataset. | |
| **Conversation Flow:** | |
| - **Single-turn (SFT):** 70.0% (3,500 records) | |
| - **Multi-turn (Agentic/Chat):** 30.0% (1,500 records) | |
| **Reasoning & Execution:** | |
| - **Pure Chain-of-Thought Reasoning:** 72.0% | |
| - **Tool Calling & API Interaction:** 28.0% | |
| **Language Breakdown:** | |
| - **Tamil:** 45.7% | |
| - **Sinhala:** 36.3% | |
| - **English:** 18.0% | |
| - *Note: Singlish and Tanglish code-mixed text appears throughout these records, so the data reflects how Sinhala and Tamil speakers actually mix English into conversation.* | |
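
The percentages above can be turned into approximate record counts. The sketch below assumes each share applies to the full 5,000-record sample and that the published figures are rounded; only the conversation-flow counts (3,500 and 1,500) are stated in this card, so the other counts are derived estimates. The names `SAMPLE_SIZE`, `composition`, and `to_counts` are illustrative and not part of the dataset.

```python
# Derive approximate record counts from the percentages listed in this card.
# Assumption: every share is taken over the full 5,000-record sample.
SAMPLE_SIZE = 5_000

composition = {
    "Conversation Flow": {
        "Single-turn (SFT)": 70.0,
        "Multi-turn (Agentic/Chat)": 30.0,
    },
    "Reasoning & Execution": {
        "Pure Chain-of-Thought Reasoning": 72.0,
        "Tool Calling & API Interaction": 28.0,
    },
    "Language Breakdown": {
        "Tamil": 45.7,
        "Sinhala": 36.3,
        "English": 18.0,
    },
}


def to_counts(shares, total):
    """Convert percentage shares into rounded record counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


counts = {group: to_counts(shares, SAMPLE_SIZE) for group, shares in composition.items()}

for group, group_counts in counts.items():
    print(group)
    for name, n in group_counts.items():
        print(f"  {name}: {n:,}")
```

Each group sums to 5,000 under these assumptions, which is a quick consistency check on the published shares.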
| --- | |
| ## What's in the Full Dataset? | |
| The full 1.8 GB dataset contains **279,260 records**, roughly 56 times the size of this 5,000-record sample. | |
| - Over **10,800** deep multi-turn interactions. | |
| - Hundreds of thousands of localized, culturally aware logic puzzles, tool invocations, and code-mixed conversations not found in standard open-source datasets. | |
| --- | |
| **Stay tuned to [chat2find.com](https://chat2find.com) for updates.** | |