Pythonformer

non-profit

Activity Feed

Recent Activity

dsinghvi updated a dataset 12 days ago

pythonformer/nemotron-cc-small-subset-decontaminated-4conditions

dsinghvi published a dataset 12 days ago

pythonformer/nemotron-cc-small-subset-decontaminated-4conditions

AutomatedScientist updated a dataset 22 days ago

pythonformer/agents-learn-runtime-benchmarks

ajibawa-2023

posted an update about 5 hours ago

Shell-Code-Large
Dataset: ajibawa-2023/Shell-Code-Large

Shell-Code-Large is a large-scale corpus of Shell scripting source code comprising approximately 640,000 code samples stored in JSON Lines (.jsonl) format. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, DevOps automation, cloud infrastructure engineering, system administration, and software engineering automation.

By providing a high-volume, language-specific corpus focused exclusively on Shell scripting, Shell-Code-Large enables systematic experimentation in automation workflows, deployment pipelines, infrastructure management, and command-line tooling. These domains remain foundational to Linux systems, cloud-native platforms, CI/CD environments, and modern DevOps practices.

Shell-Code-Large addresses the need for a dedicated Shell-focused dataset at substantial scale, enabling targeted research into scripting patterns, command composition, workflow orchestration, infrastructure automation, and operational engineering practices.
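Because the samples are stored as JSON Lines, each line of a file is one standalone JSON object. The following is a minimal sketch of reading such records with the Python standard library only. The `text` field name and the sample scripts are assumptions made for illustration; the post does not state the dataset's actual schema.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical in-memory sample; the "text" key is an assumed field name.
sample = [
    '{"text": "#!/bin/bash\\necho hello"}',
    '',
    '{"text": "ls -la | grep log"}',
]

records = list(iter_jsonl(sample))
for record in records:
    print(len(record["text"]), "characters")
```

The same generator works on an open file handle, for example `with open(path, encoding="utf-8") as f: for record in iter_jsonl(f): ...`, which avoids loading a whole shard into memory.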

dsinghvi

updated a dataset 12 days ago

pythonformer/nemotron-cc-small-subset-decontaminated-4conditions

Viewer • Updated 12 days ago • 184M • 934

dsinghvi

published a dataset 12 days ago

pythonformer/nemotron-cc-small-subset-decontaminated-4conditions

Viewer • Updated 12 days ago • 184M • 934

AutomatedScientist

updated a dataset 22 days ago

pythonformer/agents-learn-runtime-benchmarks

Updated 22 days ago • 24

AutomatedScientist

updated a dataset 23 days ago

pythonformer/agents-learn-runtime-tasks

Updated 23 days ago • 19

AutomatedScientist

updated a dataset 27 days ago

pythonformer/agents-learn-runtime-train

Viewer • Updated 27 days ago • 6k • 24

ajibawa-2023

updated a dataset about 1 month ago

pythonformer/Trajectory-Stitching-Test-Small

Viewer • Updated May 18 • 128k • 40 • 1

ajibawa-2023

published a dataset about 1 month ago

pythonformer/Trajectory-Stitching-Test-Small

Viewer • Updated May 18 • 128k • 40 • 1

ajibawa-2023

updated a dataset about 1 month ago

ontocord/Test-Data

Viewer • Updated May 9 • 500 • 63

ajibawa-2023

published a dataset about 1 month ago

ontocord/Test-Data

Viewer • Updated May 9 • 500 • 63

ajibawa-2023

updated a dataset about 2 months ago

pythonformer/Trajectory-Stitching-Test-7M

Viewer • Updated May 6 • 5.04M • 1.02k

ajibawa-2023

published a dataset about 2 months ago

pythonformer/Trajectory-Stitching-Test-7M

Viewer • Updated May 6 • 5.04M • 1.02k

ajibawa-2023

posted an update about 2 months ago

Stitched-Reasoning-Trajectories-7M

Dataset: ajibawa-2023/Stitched-Reasoning-Trajectories-7M

Stitched-Reasoning-Trajectories-7M is a massive-scale, synthetic multi-hop reasoning dataset. It was built by algorithmically "stitching" together discrete reasoning traces from the original glaiveai/reasoning-v1-20m dataset into continuous, coherent, and logically structured multi-agent trajectories.

By extracting internal sub-questions from <think> blocks and mapping high-information keyword overlaps, this dataset transforms single-turn Q&A pairs into deep, multi-step research plans. To ensure high quality and eliminate "topic drift," every trajectory has been verified using a dense semantic embedding model (BAAI/bge-large-en-v1.5).

The resulting dataset consists of 709 .jsonl files containing over 7.2 million entirely deduplicated, highly coherent reasoning chains.
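The sub-question extraction and keyword-overlap steps described above could look roughly like the sketch below. This is not the author's pipeline: the regular expressions, the stopword list, the Jaccard scoring, and all function names are assumptions chosen for illustration, and the embedding-based verification step is not shown.

```python
import re
from typing import List, Set

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# Tiny illustrative stopword list (an assumption, not from the dataset card).
STOPWORDS = {"what", "does", "that", "this", "with", "from", "have", "should"}


def extract_sub_questions(response: str) -> List[str]:
    """Return the question sentences found inside <think> blocks."""
    questions = []
    for block in THINK_RE.findall(response):
        for match in re.findall(r"[^.?!\n]*\?", block):
            if match.strip():
                questions.append(match.strip())
    return questions


def keywords(text: str) -> Set[str]:
    """Lowercased words longer than three characters, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 3 and w not in STOPWORDS}


def overlap(a: str, b: str) -> float:
    """Jaccard similarity between the keyword sets of two texts."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


response = (
    "<think>The user asks about tides. "
    "What causes the moon's gravity to vary? I should explain.</think> "
    "Tides are caused by the moon."
)
sub_questions = extract_sub_questions(response)

# Hypothetical candidate questions from other single-turn Q&A pairs.
candidates = [
    "Which sorting algorithm is stable?",
    "How does the moon's gravity affect ocean tides?",
]
# Link the sub-question to the candidate sharing the most keywords.
best = max(candidates, key=lambda c: overlap(sub_questions[0], c))
print(sub_questions[0], "->", best)
```

In a fuller pipeline, a link chosen this way would then be checked for topic drift with a semantic embedding model before the two traces are joined.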