Stack 2.9 Training Data

This directory contains synthetic training data for fine-tuning code generation models.

Directory Structure

training-data/
├── README.md                           # This file
├── tool_examples.jsonl                 # Tool-calling examples (Qwen2.5-Coder format)
├── tool_examples.json                  # Same as above in JSON format
├── code_completion/                    # Pure code completion examples
│   ├── code_completion.jsonl
│   └── code_completion.json
└── training-data-expanded/             # Additional generated data
    └── tool_examples.jsonl             # 5000 expanded tool-calling examples

Data Formats

Tool-Calling Examples

Format: Qwen2.5-Coder style with tool_calls

Each example contains:

  • messages: Array of conversation messages (system, user, assistant, tool)
  • tools: Array of tool definitions

Example structure:

{
  "messages": [
    {"role": "system", "content": "You are a helpful AI assistant..."},
    {"role": "user", "content": "Read the file at src/main.py..."},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_1234",
          "type": "function",
          "function": {
            "name": "FileRead",
            "arguments": "{\"path\": \"src/main.py\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "Successfully read file: src/main.py\n...",
      "tool_call_id": "call_1234",
      "name": "FileRead"
    },
    {"role": "assistant", "content": "Here's the contents..."}
  ],
  "tools": [...]
}

Available Tools:

  • Bash - Execute bash commands
  • FileRead - Read file contents
  • FileWrite - Write/create files
  • WebSearch - Search the web
  • Grep - Search patterns in files
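
The structure above implies a few invariants that can be checked mechanically. The following is a minimal sketch, not part of the repository's scripts: `validate_tool_example` and `KNOWN_TOOLS` are hypothetical names, and treating the tool list above as exhaustive is an assumption.

```python
import json

# Tool names listed in this README; treating the list as exhaustive is an assumption.
KNOWN_TOOLS = {"Bash", "FileRead", "FileWrite", "WebSearch", "Grep"}


def validate_tool_example(example):
    """Return a list of problems found in one tool-calling example (empty if none)."""
    problems = []
    pending = {}  # tool_call id -> tool name, for calls that have no tool reply yet
    for i, msg in enumerate(example.get("messages", [])):
        role = msg.get("role")
        if role == "assistant":
            for call in msg.get("tool_calls") or []:
                fn = call.get("function", {})
                name = fn.get("name")
                if name not in KNOWN_TOOLS:
                    problems.append(f"message {i}: unknown tool {name!r}")
                try:
                    # "arguments" is a JSON-encoded string, not a nested object.
                    json.loads(fn.get("arguments"))
                except (TypeError, json.JSONDecodeError):
                    problems.append(f"message {i}: arguments are not valid JSON")
                pending[call.get("id")] = name
        elif role == "tool":
            call_id = msg.get("tool_call_id")
            if call_id not in pending:
                problems.append(f"message {i}: no matching tool call for {call_id!r}")
            elif pending.pop(call_id) != msg.get("name"):
                problems.append(f"message {i}: tool name differs from the call")
    for call_id in pending:
        problems.append(f"tool call {call_id!r} has no tool response")
    return problems


example = {
    "messages": [
        {"role": "user", "content": "Read the file at src/main.py..."},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call_1234",
                    "type": "function",
                    "function": {
                        "name": "FileRead",
                        "arguments": json.dumps({"path": "src/main.py"}),
                    },
                }
            ],
        },
        {"role": "tool", "content": "...", "tool_call_id": "call_1234", "name": "FileRead"},
    ]
}
print(validate_tool_example(example))  # prints []
```

Returning a list of messages rather than raising on the first problem makes it easier to report every issue in a large file in one pass.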

Code Completion Examples

Format: Chat-based with context and completion

Each example contains:

  • messages: Array of conversation messages
  • language: Programming language (python, javascript, go, rust, typescript)
  • difficulty: easy, medium, hard
  • variant: basic, explain, debug, optimize
  • description: Short summary of what the code does
  • context: The code context to complete
  • completion: The expected completion

Example structure:

{
  "messages": [
    {"role": "system", "content": "You are a helpful AI assistant..."},
    {"role": "user", "content": "Complete the following code:\n```python\ndef greet(name):\n```"},
    {"role": "assistant", "content": "Here's the completed code:\n```python\ndef greet(name):\n    return f\"Hello, {name}!\"\n```"}
  ],
  "language": "python",
  "difficulty": "easy",
  "variant": "basic",
  "description": "Simple function that returns a greeting",
  "context": "def greet(name):",
  "completion": "    return f\"Hello, {name}!\""
}
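
The `context` and `completion` fields can be joined back into a full snippet and, for Python records, checked for syntax errors. This is a minimal sketch under two assumptions: that the fields are meant to be joined with a single newline, and that only Python records are checked. `assemble` and `is_valid_python` are hypothetical helper names.

```python
import ast


def assemble(record):
    """Join context and completion into one source string (newline join is an assumption)."""
    return record["context"].rstrip("\n") + "\n" + record["completion"]


def is_valid_python(source):
    """Return True if the source parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


record = {
    "language": "python",
    "context": "def greet(name):",
    "completion": '    return f"Hello, {name}!"',
}

source = assemble(record)
print(source)
print(is_valid_python(source))
```

`ast.parse` only checks syntax; it does not run the code or confirm that the completion is semantically correct.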

Generation Scripts

Tool Data Generator

python3 scripts/generate_tool_data.py \
    --num-examples 5000 \
    --output-dir training-data-expanded \
    --output-format jsonl

Code Completion Generator

python3 scripts/generate_code_completion_data.py \
    --num-examples 1000 \
    --output-dir training-data/code_completion \
    --languages python javascript go rust typescript \
    --difficulties easy medium hard \
    --variants basic explain debug optimize

Difficulty Levels

Level    Description
-------  -----------------------------------------------------------
easy     Simple functions, basic operations, single concepts
medium   Intermediate patterns, async operations, error handling
hard     Complex algorithms, data structures, design patterns

Variants

Variant   Description
--------  -----------------------------------------
basic     Standard code completion
explain   Code completion with explanation
debug     Bug fixing and completion
optimize  Performance optimization and completion

Supported Languages

  • Python
  • JavaScript
  • Go
  • Rust
  • TypeScript
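
Because every code completion record carries `language`, `difficulty`, and `variant` fields, a subset can be selected with a small filter. The sketch below is illustrative only: `select` is a hypothetical helper and the sample records are made up for the demonstration.

```python
def select(records, language=None, difficulty=None, variant=None):
    """Yield the records that match every filter that is not None."""
    wanted = {"language": language, "difficulty": difficulty, "variant": variant}
    for record in records:
        if all(value is None or record.get(key) == value for key, value in wanted.items()):
            yield record


# Made-up records that carry only the metadata fields used for filtering.
records = [
    {"language": "python", "difficulty": "easy", "variant": "basic"},
    {"language": "python", "difficulty": "hard", "variant": "debug"},
    {"language": "go", "difficulty": "easy", "variant": "basic"},
]

easy_python = list(select(records, language="python", difficulty="easy"))
print(len(easy_python))  # prints 1
```

Writing the filter as a generator means it can be chained directly onto a line-by-line JSONL reader without loading the whole file.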

Usage

Training with MLflow

mlflow run . -P num_examples=5000

Loading Data for Training

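import json

# Load JSONL: one JSON object per line
examples = []
with open("training-data/tool_examples.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip blank lines
            examples.append(json.loads(line))

# Load JSON: the whole file is a single JSON document
with open("training-data/tool_examples.json", "r", encoding="utf-8") as f:
    data = json.load(f)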
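
Since the same examples ship as both JSONL and JSON, converting between the two can be sketched as follows. This assumes the `.json` file is a single top-level array of the same records, which this README does not state explicitly. `jsonl_to_json_array` is a hypothetical helper.

```python
import json


def jsonl_to_json_array(lines):
    """Parse JSONL lines and return the records serialized as one JSON array."""
    records = [json.loads(line) for line in lines if line.strip()]
    return json.dumps(records, indent=2, ensure_ascii=False)


# Works on any iterable of lines, including an open file object.
sample_lines = ['{"id": 1}\n', "\n", '{"id": 2}\n']
print(jsonl_to_json_array(sample_lines))
```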

Augmentation

The tool-calling generator applies augmentation to create diversity:

  • Varying file paths
  • Varying command options
  • Varying search queries
  • Varying code snippets
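
As an illustration of the first item, varying file paths could look like the sketch below. It is not the logic of `scripts/generate_tool_data.py`, which is not shown here: `EXAMPLE_PATHS` and `make_file_read_example` are hypothetical, and the seeded `random.Random` is only there to make runs reproducible.

```python
import json
import random

# Hypothetical pool of paths; the real generator's values are not shown in this README.
EXAMPLE_PATHS = ["src/main.py", "src/utils.py", "tests/test_main.py", "README.md"]


def make_file_read_example(rng):
    """Build one FileRead request with a randomly chosen path and call id."""
    path = rng.choice(EXAMPLE_PATHS)
    call_id = f"call_{rng.randrange(10000):04d}"
    return {
        "messages": [
            {"role": "user", "content": f"Read the file at {path}"},
            {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": call_id,
                        "type": "function",
                        "function": {
                            "name": "FileRead",
                            # Arguments are stored as a JSON-encoded string.
                            "arguments": json.dumps({"path": path}),
                        },
                    }
                ],
            },
        ]
    }


rng = random.Random(0)  # fixed seed so the same examples are produced on every run
examples = [make_file_read_example(rng) for _ in range(5)]
for example in examples:
    print(example["messages"][0]["content"])
```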

Quality Guidelines

  • All generated code is syntactically correct
  • Examples include realistic context
  • Tools have proper arguments and responses
  • Code completions are deterministic and correct