Buckets:

ppenner
/

Edge-Agent-Reasoning-WebSearch-260K-bucket

1.1 GB

271 files

Updated about 2 months ago

Ctrl+K

Name	Size	Uploaded	Xet hash
.DS_Store	36.9 kB xet	about 2 months ago	e41c504a
.gitattributes	2.5 kB xet	about 2 months ago	738f1125
README.md	15.8 kB xet	about 2 months ago	768538b7
Small Agent Working for Frontier.jpg	1.68 MB xet	about 2 months ago	f34c89b5
edge_reasoning_train_0_296.parquet	173 kB xet	about 2 months ago	7e804e9f
edge_reasoning_train_100157_102262.parquet	4.21 MB xet	about 2 months ago	d80c9d86
edge_reasoning_train_10017_30995.parquet	4.17 MB xet	about 2 months ago	a3c739a4
edge_reasoning_train_101275_103537.parquet	4.17 MB xet	about 2 months ago	04587eb8
edge_reasoning_train_10184_11673.parquet	4.14 MB xet	about 2 months ago	a7e3e5d3
edge_reasoning_train_102680_104514.parquet	4.18 MB xet	about 2 months ago	affdcc94
edge_reasoning_train_103097_113845.parquet	4.16 MB xet	about 2 months ago	e2fc3eea
edge_reasoning_train_103816_105471.parquet	4.2 MB xet	about 2 months ago	16b655c1
edge_reasoning_train_105033_106484.parquet	4.18 MB xet	about 2 months ago	e9640ebe
edge_reasoning_train_106016_107518.parquet	4.17 MB xet	about 2 months ago	eca3fce1
edge_reasoning_train_106942_109362.parquet	4.16 MB xet	about 2 months ago	c33d458d
edge_reasoning_train_108412_110689.parquet	4.12 MB xet	about 2 months ago	818f08cd
edge_reasoning_train_109140_139749.parquet	4.19 MB xet	about 2 months ago	66b29318
edge_reasoning_train_109785_111730.parquet	4.16 MB xet	about 2 months ago	edd7d0a6
edge_reasoning_train_11019_12791.parquet	4.15 MB xet	about 2 months ago	75beaf75
edge_reasoning_train_111925_114325.parquet	4.13 MB xet	about 2 months ago	2f0b14a6
edge_reasoning_train_113394_115588.parquet	4.17 MB xet	about 2 months ago	28856dd0
edge_reasoning_train_114152_140999.parquet	4.18 MB xet	about 2 months ago	3d50a271
edge_reasoning_train_114778_116554.parquet	4.2 MB xet	about 2 months ago	525d63fb
edge_reasoning_train_116030_117561.parquet	4.15 MB xet	about 2 months ago	5728ee12
edge_reasoning_train_117077_118551.parquet	4.22 MB xet	about 2 months ago	7a5a8625
edge_reasoning_train_117746_119556.parquet	4.14 MB xet	about 2 months ago	4d629507
edge_reasoning_train_119229_120591.parquet	4.11 MB xet	about 2 months ago	104f1ff6
edge_reasoning_train_119720_121623.parquet	4.17 MB xet	about 2 months ago	228a8042
edge_reasoning_train_120940_122676.parquet	4.17 MB xet	about 2 months ago	2dd43602
edge_reasoning_train_121969_123664.parquet	4.22 MB xet	about 2 months ago	b441c788
edge_reasoning_train_12282_13862.parquet	4.14 MB xet	about 2 months ago	85e19278
edge_reasoning_train_122832_124668.parquet	4.13 MB xet	about 2 months ago	cabe5883
edge_reasoning_train_124162_125655.parquet	4.12 MB xet	about 2 months ago	9b55fbfc
edge_reasoning_train_125113_126661.parquet	4.16 MB xet	about 2 months ago	120adf06
edge_reasoning_train_126203_127688.parquet	4.16 MB xet	about 2 months ago	3541d7a9
edge_reasoning_train_126930_128704.parquet	4.19 MB xet	about 2 months ago	33dd5d11
edge_reasoning_train_128056_129718.parquet	4.2 MB xet	about 2 months ago	8e879459
edge_reasoning_train_129029_130712.parquet	4.13 MB xet	about 2 months ago	1f448e17
edge_reasoning_train_130158_131695.parquet	4.13 MB xet	about 2 months ago	f6228c8a
edge_reasoning_train_131266_132724.parquet	4.16 MB xet	about 2 months ago	4ae55dca
edge_reasoning_train_132255_133702.parquet	4.17 MB xet	about 2 months ago	8222af21
edge_reasoning_train_13281_14951.parquet	4.18 MB xet	about 2 months ago	f1faf750
edge_reasoning_train_133194_134695.parquet	4.2 MB xet	about 2 months ago	32dc565d
edge_reasoning_train_134230_135765.parquet	4.18 MB xet	about 2 months ago	ae54c8d3
edge_reasoning_train_135273_136762.parquet	4.21 MB xet	about 2 months ago	d0051386
edge_reasoning_train_136111_137755.parquet	4.18 MB xet	about 2 months ago	eea245c1
edge_reasoning_train_137309_138753.parquet	4.14 MB xet	about 2 months ago	a9845d9a
edge_reasoning_train_1376_5163.parquet	4.18 MB xet	about 2 months ago	e5100737
edge_reasoning_train_138276_141625.parquet	4.12 MB xet	about 2 months ago	a7183d8a
edge_reasoning_train_140016_180453.parquet	4.21 MB xet	about 2 months ago	b9260b27
edge_reasoning_train_140742_142777.parquet	4.19 MB xet	about 2 months ago	c038ba99
edge_reasoning_train_141241_189820.parquet	4.2 MB xet	about 2 months ago	dc4682df
edge_reasoning_train_141829_144019.parquet	4.16 MB xet	about 2 months ago	d079ab18
edge_reasoning_train_143060_145055.parquet	4.18 MB xet	about 2 months ago	afd1fb49
edge_reasoning_train_14407_16051.parquet	4.1 MB xet	about 2 months ago	4fa1b84f
edge_reasoning_train_144540_146049.parquet	4.14 MB xet	about 2 months ago	8295e4cc
edge_reasoning_train_145194_147047.parquet	4.22 MB xet	about 2 months ago	a08385e6
edge_reasoning_train_146328_148051.parquet	4.14 MB xet	about 2 months ago	c0815abe
edge_reasoning_train_147549_149048.parquet	4.17 MB xet	about 2 months ago	5c534aab
edge_reasoning_train_148505_150055.parquet	4.17 MB xet	about 2 months ago	7c808623
edge_reasoning_train_149128_151043.parquet	4.12 MB xet	about 2 months ago	38115072
edge_reasoning_train_150534_152061.parquet	4.12 MB xet	about 2 months ago	e493e135
edge_reasoning_train_151170_153054.parquet	4.18 MB xet	about 2 months ago	e8cecb4f
edge_reasoning_train_152577_154064.parquet	4.16 MB xet	about 2 months ago	425d2bb8
edge_reasoning_train_153630_155056.parquet	4.19 MB xet	about 2 months ago	1a7e0546
edge_reasoning_train_154250_156078.parquet	4.16 MB xet	about 2 months ago	95fc2303
edge_reasoning_train_15532_17140.parquet	4.13 MB xet	about 2 months ago	fbe60b94
edge_reasoning_train_155704_157067.parquet	4.2 MB xet	about 2 months ago	fdd833ca
edge_reasoning_train_156515_158105.parquet	4.13 MB xet	about 2 months ago	0bb1c189
edge_reasoning_train_157489_159104.parquet	4.18 MB xet	about 2 months ago	4e29f776
edge_reasoning_train_158581_160093.parquet	4.14 MB xet	about 2 months ago	b519cdaf
edge_reasoning_train_159602_161070.parquet	4.18 MB xet	about 2 months ago	cdf86168
edge_reasoning_train_160660_162067.parquet	4.16 MB xet	about 2 months ago	2db43308
edge_reasoning_train_161737_163085.parquet	4.16 MB xet	about 2 months ago	046c4bac
edge_reasoning_train_162412_164131.parquet	4.15 MB xet	about 2 months ago	b11a63b6
edge_reasoning_train_163521_165087.parquet	4.17 MB xet	about 2 months ago	01dd743c
edge_reasoning_train_164157_166131.parquet	4.23 MB xet	about 2 months ago	82756b77
edge_reasoning_train_165163_167156.parquet	4.18 MB xet	about 2 months ago	8e9ea235
edge_reasoning_train_1656_4020.parquet	4.16 MB xet	about 2 months ago	8feb7504
edge_reasoning_train_166584_168162.parquet	4.21 MB xet	about 2 months ago	dbb1827b
edge_reasoning_train_167543_169148.parquet	4.16 MB xet	about 2 months ago	6ddce2ca
edge_reasoning_train_168810_170144.parquet	4.21 MB xet	about 2 months ago	47b8d631
edge_reasoning_train_16942_18247.parquet	4.16 MB xet	about 2 months ago	520e8f73
edge_reasoning_train_169689_171179.parquet	4.15 MB xet	about 2 months ago	d79a6122
edge_reasoning_train_170660_172160.parquet	4.18 MB xet	about 2 months ago	ec630810
edge_reasoning_train_171416_173166.parquet	4.17 MB xet	about 2 months ago	eecf79e9
edge_reasoning_train_172841_174205.parquet	4.22 MB xet	about 2 months ago	49363243
edge_reasoning_train_173253_175213.parquet	4.13 MB xet	about 2 months ago	36fda04d
edge_reasoning_train_174133_176194.parquet	4.14 MB xet	about 2 months ago	54fbc69b
edge_reasoning_train_175384_177248.parquet	4.16 MB xet	about 2 months ago	9fddc7d4
edge_reasoning_train_176334_178255.parquet	4.16 MB xet	about 2 months ago	d374492d
edge_reasoning_train_17683_19335.parquet	4.18 MB xet	about 2 months ago	d34bd30d
edge_reasoning_train_177506_179236.parquet	4.18 MB xet	about 2 months ago	6ec87f36
edge_reasoning_train_178642_180290.parquet	4.19 MB xet	about 2 months ago	1fc8c818
edge_reasoning_train_179458_181597.parquet	4.16 MB xet	about 2 months ago	27ef22a2
edge_reasoning_train_180706_182884.parquet	4.19 MB xet	about 2 months ago	721ec767
edge_reasoning_train_180897_190777.parquet	4.11 MB xet	about 2 months ago	bf6215c7
edge_reasoning_train_181924_183949.parquet	4.17 MB xet	about 2 months ago	9a8c04fd
edge_reasoning_train_182974_184962.parquet	4.2 MB xet	about 2 months ago	5c877778
edge_reasoning_train_184396_185957.parquet	4.18 MB xet	about 2 months ago	a1f2ebdd

README.md

Edge Agent Reasoning WebSearch 260K

Abstract

The Edge-Agent-Reasoning-WebSearch-260K dataset is a massive, synthetically expert-engineered corpus of over 700 Million tokens, designed to train small, local models (SLMs) and edge-deployed agents in advanced problem deconstruction and self-aware reasoning.

Rather than training a model to execute instructions directly—which often leads to hallucinations when context is missing—this dataset trains a model to act as a preparatory router or System 2 thinking agent. When presented with a complex, domain-specific instruction, the agent's job is to systematically break down the request, identify its own knowledge gaps, formulate specific ambiguities, and construct expert-level web search queries. This preparatory reasoning equips a secondary, more capable frontier model with the exact verified context needed to execute the final task flawlessly.

Dataset Statistics (OpenAI cl100k_base)

This collection is built for deep, zero-shot generalization. Rather than focusing on simplistic conversational exchanges, the dataset prioritizes exhaustive, multi-stage reasoning trajectories grounded in rigorous professional constraints. Engineering this level of structural density and internal validation consumed nearly 1.5 Billion tokens in computational bandwidth.

Summary:

User Prompts: 42,585,076 Tokens
Agentic Reasoning: 646,883,262 Tokens
Total Rows (Observations): 260,293
Compute Spend (Generation Cost): ~1.47 Billion Tokens
Grand Total Dataset Tokens (All Schema Columns): ~692.1 Million Tokens
Format: Parquet (.parquet)

What Does This Dataset Solve?

In distributed agentic architectures, delegating raw user instructions directly to an internet-facing frontier model is inefficient. It wastes expensive compute on vague prompts and fails when the model lacks highly localized context (e.g., specific software versions, niche industry constraints, or local OS environments).

Edge-Agent-Reasoning-WebSearch-260K addresses this by teaching models self-auditing and verification planning. It trains models to overcome the common LLM flaw of overconfidence by forcing them to state what they believe they know, immediately followed by what they must verify.

Core Capabilities

Reasoning Fine-Tuning (RFT): Enhancing the step-by-step reasoning capabilities of 7B-14B parameter models, forcing them to "think before they act."
Self-Awareness & Humility: Training models to treat their own confidence as a signal for verification, rather than evidence of correctness.
Search Query Generation: Training retrieved-augmented generation (RAG) routers to formulate dense, expert-level queries rather than naive keyword matching.
Prompt Interception: Training classifiers to intercept poorly constructed or ambiguous user prompts and demand clarification before consuming expensive API credits.

The 5-Stage Reasoning Structure

Every row in the dataset contains a dense, 2,000 to 5,000-word reasoning trajectory (agent_reasoning). This structure is designed to simulate the internal deliberation of an expert actively planning a complex technical task, ensuring the model output is grounded in self-awareness, factual verification, and deep contextual understanding.

The reasoning is broken down into five highly analytical stages:

1. Understanding the request
Teaches the model to correctly identify the core objective while fully internalizing constraints. This stage ensures the model does not gloss over critical environmental factors (e.g., operating system constraints, user role, specific software versions) before formulating a plan.

2. What I believe I know — and what I'm uncertain about
Instills self-awareness and humility. Instead of hallucinating answers, the model is trained to aggressively audit its own internal knowledge base. It learns to cleanly separate established facts from assumptions, treating its own uncertainty as a trigger for external verification rather than a reason to guess.

3. Ambiguities in the request
Trains the model in prompt interception and clarification. It learns to spot missing parameters, vague instructions, or conflicting constraints that would lead to failure if executed blindly. This allows the routing agent to "push back" and ask the user for clarity before wasting compute or causing destructive side-effects.

4. Everything I need to confirm before responding
Establishes a strict verification protocol. The model actively generates an explicit checklist of facts, dependencies, API statuses, and documentation it must review. This stage acts as a blueprint for the final execution, ensuring that every subsequent action is backed by verified reality.

5. Web search queries
Acts as the bridge between internal reasoning and external retrieval. By generating 10 to 20 highly specific, keyword-dense queries, the model sets up a downstream Retrieval-Augmented Generation (RAG) pipeline to feed a frontier model for success. These queries are explicitly designed to bypass generic SEO content and land directly on highly technical documentation, error logs, or source code.

The Combinatorial Matrix & Sampling

To prevent the semantic collapse often seen in synthetic datasets (where models generate repetitive, homogenous scenarios), the prompt instructions were sourced from a custom-built, 7-dimensional combinatorial matrix.

The Matrix Schema:

Industry (e.g., Biotech, Astrophysics, DevOps, Corporate Finance)
Professional Role (Scope-locked to the Industry)
Software Stack (Determined by Domain Purity rules—some roles get single tools, others get mixed workflows)
Task Type (Realistic operations for the tool and role)
Operating System (OS constraints matched to the industry, e.g., Embedded Linux vs. macOS)
Difficulty (Low to Impossible)
Risk Level (Safe to Catastrophic)

Scale & Sampling: The matrix utilizes 7 distinct large prime numbers to cryptographically scramble and hash these dimensions, creating a deterministic search space of 1,000,000,000 (1 Billion) valid permutations.

From this massive possibility space, I sampled only ~260,000 unique rows for this dataset. This extremely low sampling rate (0.026%) virtually guarantees that there are no overlapping duplicates or repetitive thematic loops, resulting in a dataset with an exceptionally high degree of zero-shot diversity.

Dataset Diversity: 200+ Roles

To ensure the resulting models generalize across the entire spectrum of human knowledge work, the dataset is grounded in highly specific, realistic user profiles. It avoids generic "Assistant" personas in favor of explicit professional domains with corresponding environmental constraints.

Operating System Environments

The dataset escapes the trap of generic "web browser apps" by enforcing highly specific local environments, ensuring training covers the full spectrum of modern and legacy deployment targets. Problem-solving trajectories are explicitly contextualized across Apple ecosystems (macOS, macOS Monterey, macOS Ventura, macOS Sonoma, macOS Sequoia, iOS, iOS 16, iOS 17, iOS 18), Windows environments (Windows, Windows 7 (Legacy), Windows 10, Windows 11, Windows Subsystem for Linux), and Server infrastructures (Windows Server, Windows Server 2019, Windows Server 2022). It deeply covers Linux distributions/environments (Linux, Ubuntu, CentOS, RHEL, Rocky, Fedora, Debian, and Embedded Linux) alongside dedicated Cloud terminals (AWS CloudShell, Google Cloud Shell, Azure Cloud Shell, OCI Cloud Shell). The dataset further embeds mobile and specialized hardware constraints, covering Android (Android, Android 12 through Android 16), ChromeOS, and highly specific tablet use-cases like iPadOS (Clinical, Field Work, and for Procreate). This exhaustive coverage forces the routing agent to learn profound cross-platform contextual awareness, tailoring command-line prompts, software troubleshooting, and hardware constraints to the exact operating system of the simulated user.

Professional Roles (Grouped by Frequency)

> 2,000 tasks Unknown, DevOps Engineer, Industrial Engineer, Security Analyst, IT Support Specialist, System Administrator, IT Technician, Security Engineer, Safety Officer, Platform Engineer, Quality Engineer, Electrical Engineer, Maintenance Engineer, Research Scientist, Manufacturing Engineer, Plant Manager, Business Analyst, Project Manager, CEO, HR Manager, Program Manager, Executive Assistant, Office Manager, Product Manager, Talent Acquisition Specialist, Recruiter, Management Consultant, COO, General Manager, Operations Manager, Robotics Engineer, Lab Technician, Supply Chain Analyst, Mechanical Engineer, Data Scientist

1,300 to 2,000 tasks Software Architect, Mobile Developer, Frontend Developer, QA Engineer, Full Stack Developer, Backend Engineer, Engineering Manager, Postdoctoral Researcher, Game Designer, Visual Designer, Artist, Art Director, Photographer, Fashion Designer, Animator, Illustrator, Creative Director, Graphic Designer, Architect, Construction Manager, BIM Manager, Interior Designer, Project Engineer, Student, Content Creator, Site Manager, Genealogist, Teacher, Volunteer, Photographer (Hobbyist), Traveler, Homeowner, Gamer, Parent, Musician (Hobbyist), DIY Enthusiast, University Researcher, Retiree, Blogger, Writer (Hobbyist), Structural Engineer, Hobbyist, Streamer, Professor, Urban Planner

1,000 to 1,300 tasks Medical Doctor, Clinical Data Manager, Clinical Research Associate, Civil Engineer, Procurement Manager, Content Manager, Investment Banker, PR Manager, SEO Specialist, CFO, Social Media Manager, Financial Analyst, Customer Success Manager, Product Marketing Manager, Actuary, Tax Advisor, Real Estate Broker, Marketing Director, Medical Technologist, VP of Sales, Loan Officer, Risk Manager, Wealth Manager, Brand Manager, CMO, Controller, SDR, Accountant, UX Designer, Sales Manager, Nurse, Auditor, Account Executive, Pharmacologist, Bioinformatician, Geneticist, Immunologist, Biochemist, Data Engineer, ML Engineer, Data Analyst, AI Researcher, Anesthesiologist, Pharmacist, Surgeon, Toxicologist, Molecular Biologist

500 to 1,000 tasks Pathologist, PhD Student (Biology), Microbiologist, Epidemiologist, Radiologist, Virologist, Biologist, Film Editor, Video Editor, Director, Motion Designer, Dentist, GIS Analyst, Geologist, Climate Scientist, Remote Sensing Analyst, Environmental Scientist, PhD Student (Physics), Physicist, Podcast Host, Mixing Engineer, Composer, Mastering Engineer, Voice Actor, Sound Designer, Music Producer, Audio Engineer, Meteorologist, Oceanographer, Hydrologist, Materials Scientist, Chemist, Analytical Chemist, Veterinarian, VFX Artist, Soil Scientist, Process Engineer, Quality Control Chemist, Medicinal Chemist, Cinematographer, Chemical Engineer, Colorist, Spectroscopist, Formulation Scientist, Polymer Scientist, Geophysicist, Ecologist, Radio Astronomer, PhD Student (Astronomy)

100 to 500 tasks Observatory Scientist, Astrophysicist, Planetary Scientist, Astronomer, Data Scientist (Astronomy), Cosmologist, Computational Astrophysicist, Observational Astronomer, Space Scientist, Seismologist, Wildlife Biologist, Agronomist, Graphics Programmer, Technical Artist, Game Developer, Gameplay Engineer, Level Designer, Game Programmer, PhD Student (Chemistry), Computational Chemist, Paralegal, Attorney, Compliance Officer, Forensic Analyst, Prosecutor, Detective, Judge, Contract Manager, Legal Assistant, Legal Counsel, Defense Attorney, General Counsel

Data Structure / Schema

The dataset is distributed natively chunked in .parquet files.

Column	Type	Description
`batch_index_id`	int64	Identifier tracking the sample back to the source prompt batch.
`role`	string	The simulated professional role of the user issuing the prompt.
`industry`	string	The conceptual industry sector to which the task belongs.
`os`	string	The operating system environment relevant to the task constraints.
`user_prompt`	string	The raw, initial instruction or query provided by the synthetic user.
`agent_reasoning`	string	The 2,000-5,000 word internal reasoning output.

Developer & Architect

This dataset was created by Yatin Taneja, an AI Systems Engineer, Superintelligence Researcher, Musician (Dubstep Artist), Rapper, and Poet.

When you blend art and engineering, you get systems that can actually think like humans. I built this dataset to break models out of their rigid, robotic patterns and force them to approach problems with the disciplined structure of a researcher, the foresight of an engineer, and the lateral creativity of an artist.

To all the open-source engineers, AI researchers, and builders pushing the boundaries of what AI models can do, I encourage you to use this data to train and task agents that don't just execute blindly, but actually reason. Models that audit their own knowledge, respect their constraints, and solve problems with humility, precision, and nuance will build the future of edge-deployed superintelligence.

Weblinks

IM Superintelligence: Visit my central knowledge hub hosting other massive open datasets and over 2,000 articles exploring Superintelligence, cognitive architectures, quantum computing, distributed networks, algorithmic optimization, and the future of the global education sector, all authored through a custom 8-step multi-model agentic infrastructure I engineered.
Yatin Taneja | Professional Portfolio: View my professional portfolio for a comprehensive overview of my skills, industry experience, and software prototypes as part of my ongoing engineering work in full-stack AI agents and applications.
LinkedIn: Connect on LinkedIn to collaborate on advanced autonomous systems, enterprise AI implementations, or to follow my ongoing research.

License & Usage

This dataset is released under the MIT License.

Designed for open research in multi-agent orchestration, test-time compute scaling (System 2 thinking patterns), and robust SLM fine-tuning. You are free to use this dataset for academic, personal, and commercial model training applications, provided the original license and copyright notice are preserved.

Total size: 1.1 GB

Files: 271

Last updated: Jun 1

Pre-warmed CDN: US EU US EU