	---
	license: mit
	language:
	- si
	- ta
	- en
	tags:
	- instruction-finetuning
	- reasoning
	- tool-use
	- trilingual
	- singlish
	- tanglish
	- chat2find
	pretty_name: Chat2Find Unified Reasoning & Tool Dataset (Public Sample)
	size_categories:
	- 1K<n<10K
	---

	# Chat2Find Unified Reasoning & Tool Dataset (Public Sample)

	This repository contains a 5,000-record public preview of the Chat2Find Unified Reasoning & Tool Dataset.

	The full dataset is a commercial, reasoning-focused instruction dataset for training conversational AI models. It contains 279,260 trilingual records covering complex problem-solving, chain-of-thought reasoning, and multi-step tool calling in Sinhala, Tamil, and English.

	## 📂 Access the Full Dataset

	The full 1.8 GB dataset is available as a Gated Repository for commercial and advanced research use.

	👉 [Access the Full Dataset Here](https://huggingface.co/datasets/Chat2Find/Chat2Find-Instruct-Reasoning-Dataset)

	### How to get a license:
	1. Purchase License: Use our secure Stripe link to purchase a commercial/advanced research license:
	👉 [Buy Full Dataset License (Stripe)](https://buy.stripe.com/14AaEX1EjdODfblcKMawo02)
	2. Provide Username: During the Stripe checkout, enter your Hugging Face username so we know which account to approve.
	3. Approval: Once payment is confirmed, we will grant your account access to the gated repository within 24 hours.

	---

	## 📊 Sample Details & Composition

	The 5,000 records in this preview are carefully curated to reflect the high quality of the full dataset.

	Conversation Flow:
	- Single-turn (SFT): 70.0% (3,500 records)
	- Multi-turn (Agentic/Chat): 30.0% (1,500 records)

	Reasoning & Execution:
	- Pure Chain-of-Thought Reasoning: 72.0%
	- Tool Calling & API Interaction: 28.0%

	Language Breakdown:
	- Tamil: 45.7%
	- Sinhala: 36.3%
	- English: 18.0%
	- Note: Code-mixed Singlish and Tanglish text appears throughout these records, so the data reflects how people in South Asia actually converse.
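
The percentages above can be turned into approximate record counts for the 5,000-record preview. The snippet below is a minimal sketch of that arithmetic only; the names `SAMPLE_SIZE`, `FULL_SIZE`, `COMPOSITION`, and `expected_count` are illustrative and are not part of the dataset or of any library.

```python
# Sizes stated in this README.
SAMPLE_SIZE = 5_000
FULL_SIZE = 279_260

# Percentages copied from the composition lists above.
COMPOSITION = {
    "conversation_flow": {"single_turn": 70.0, "multi_turn": 30.0},
    "reasoning_execution": {"chain_of_thought": 72.0, "tool_calling": 28.0},
    "language": {"tamil": 45.7, "sinhala": 36.3, "english": 18.0},
}


def expected_count(percent: float, total: int = SAMPLE_SIZE) -> int:
    """Convert a percentage share into a whole number of records."""
    return round(total * percent / 100)


def counts_by_category(total: int = SAMPLE_SIZE) -> dict:
    """Return the expected record count for every share in COMPOSITION."""
    return {
        group: {name: expected_count(pct, total) for name, pct in shares.items()}
        for group, shares in COMPOSITION.items()
    }


if __name__ == "__main__":
    for group, counts in counts_by_category().items():
        print(group, counts)
    # Share of the full dataset covered by this preview (a little under 2%).
    print(f"preview share: {SAMPLE_SIZE / FULL_SIZE:.2%}")
```

The single-turn and multi-turn counts computed this way match the 3,500 and 1,500 figures listed above. The counts for the other two groups are derived from the percentages and are not stated separately in this README.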

	---

	## 🌟 What's in the Full Dataset?
	The full 1.8 GB dataset contains 279,260 records, roughly 56 times the size of this sample, covering the same kinds of content.
	- Over 10,800 deep multi-turn interactions.
	- Hundreds of thousands of localized, culturally aware logic puzzles, tool invocations, and code-mixed conversations not found in standard open-source datasets.

	---
	Stay tuned to [chat2find.com](https://chat2find.com) for updates.
