EXOROBOURII commited on
Commit
3fe7d2e
·
verified ·
1 Parent(s): d155231

Update README.md

Browse files

# Python Latent Space Explorer (PLSE) v2.0

**A System for Engineering High-Quality, Pedagogical Code Datasets**



PLSE v2.0 is a data engineering framework designed to address a fundamental constraint in the development of code-intelligent Large Language Models: the quality of training data. Moving beyond the limitations of scraping public code repositories, PLSE employs a "pedagogical by design" philosophy to synthetically generate a vast, instruction-tuned dataset of high-quality code examples, anti-patterns, and architectural blueprints.



### Core Philosophy

This project is built on a set of foundational principles that guide every generated example:

* ✅ **Pedagogical by Design:** Every generated sample is a "lesson" explicitly designed to teach a specific concept, from a simple language feature to a complex architectural pattern.
* 🔬 **Quality-Driven Architecture:** A multi-stage validation pipeline ensures every generated code sample is syntactically correct, stylistically idiomatic, and algorithmically sound.
* 🔀 **Combinatorial Generation:** A powerful Jinja2-based engine creates vast diversity from a small set of rich, modular patterns, allowing us to explore the "latent space" of Python code.
* 🧩 **Community-Ready and Scalable:** The modular, YAML-based pattern library is designed to be easily understood, maintained, and extended by developers.

### Quick Start: Generating Your First Dataset

This guide will walk you through setting up the environment and running the generation engine.

**1. Prerequisites**

Ensure you have `git` and a modern version of Python (`3.10` or newer) installed on your system.

**2. Clone & Setup**

First, clone the repository and navigate into the project directory.

```bash
# Replace with your repository's URL
git clone https://github.com/jonathanrbelanger-lang/plse-v2.git
cd plse-v2
```

Next, create a virtual environment and install the project in editable mode. This will install all necessary dependencies from pyproject.toml.

```bash
python3 -m venv .venv
source venv/bin/activate
pip install -e .
```

3. Run the Generation Engine

Execute the main script to start the dataset generation process.


```bash
python main.py
```

The script will discover all patterns, initialize the engine, and begin generating examples. By default, it will create 100 examples and save them to training_dataset.jsonl.

4. Verify the Output

You can inspect the generated dataset using standard command-line tools. To pretty-print the first example, you can use a tool like jq.


```bash
# (Optional) Install jq if you don't have it
sudo apt-get install jq
```

```bash
# Inspect the first line of the output file
head -n 1 training_dataset.jsonl | jq .
```

# Project Architecture

The PLSE v2.0 system is composed of two main parts: the Core Engine and the Pattern Library.

# The Pattern Library (/patterns)

This is the heart of the project—a collection of YAML files that serve as blueprints for code generation. Each pattern is a self-contained lesson. The library is organized into categories:

/patterns/*.yaml: Best-practice patterns and "tradeoff" patterns that teach expert-level nuance.

/patterns/anti_patterns/*.yaml: Paired patterns that demonstrate a common mistake and its correct solution.

/patterns/orchestration/*.yaml: Advanced patterns that generate complete, runnable scripts and DevOps artifacts (Dockerfile, ci.yml, etc.).

# The Core Engine (src/plse/)

This is the installable Python package that powers the generation process.

schema.py: Defines the Pydantic models that validate the structure of all YAML pattern files.

patterns.py: Defines the internal Python dataclasses the application uses to represent a validated pattern.

registry.py: Discovers, validates, and loads all .yaml patterns from the /patterns directory and its subdirectories.

generator.py: The Jinja2-based engine that renders a pattern with a random parameter context to produce code.

validation.py: The multi-stage pipeline that ensures the quality of generated code through syntax checks, a custom linter, and safe execution of unit tests.

# Developer Workflow

As a contributor to PLSE, your primary workflows will involve creating patterns and validating the library.

# Generating the Dataset

To run the full generation pipeline, simply execute main.py. You can configure the number of examples and the output file directly in the script.

```bash
python main.py
```

# Discovering & Linting Patterns

We have two key utility scripts in the /scripts directory to help manage the pattern library.

List All Available Patterns: To get a quick overview of all patterns in the library without crawling the directories, run:

```bash
python scripts/list_patterns.py
```

To generate a AVAILABLE_PATTERNS.md file, use the --output-markdown flag.

Lint the Pattern Library: To check all patterns for static errors (like flake8 violations in the generated code), run our high-performance parallel linter:


```bash
python scripts/lint_patterns.py
```

This tool will provide a summary of any patterns that need to be fixed.

# Contributing

Contributions are welcome, especially in the form of new patterns. The best way to contribute is to identify a common Python concept, best practice, or anti-pattern that is not yet in our library.

The basic workflow is:

Create a new .yaml file in the appropriate directory (e.g., patterns/anti_patterns/).

Follow the schema defined in src/plse/schema.py. The best way to learn is to study an existing pattern and the meta-pattern (plse.meta.paired_pattern_generator).

Run python scripts/lint_patterns.py to ensure your new pattern generates valid code.

Submit a pull request.

# License

This project is licensed under the MIT License. See the LICENSE file for details.

Files changed (1) hide show
  1. README.md +5 -3
README.md CHANGED
@@ -1,3 +1,5 @@
1
- ---
2
- license: mit
3
- ---
 
 
 
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ ---