---
license: bsd-2-clause
task_categories:
- text-generation
language:
- en
tags:
- common-lisp
- macros
- code-generation
- program-transformation
pretty_name: Common Lisp Macro Transformations
size_categories:
- n<1K
---
# Common Lisp Macro Transformations
A fine-tuning dataset for training models to generate Common Lisp macros. Each example is a **(before-code) → (macro-definition) → (after-expansion)** triple.
## Idea
Instead of fine-tuning a model to "write code", fine-tune it to generate **CL macros**: code that writes code. The model learns to recognize AST patterns and generate transformations, not final output.
## Sources
- **Let Over Lambda**: Doug Hoyte's production macro collection (thephoeron/let-over-lambda)
- **On Lisp**: Paul Graham's classic Common Lisp macro utilities
## Dataset Structure
Each record contains:
- `instruction`: Task description with the code pattern to address
- `input`: The "before" code showing the pattern that needs a macro
- `output`: The `defmacro` form that solves it
- `category`: Macro category (capture-management, anaphoric, dispatch, control-flow, DSL, compiler-macro, efficiency, scope)
- `technique`: Comma-separated techniques used (gensym, nested-backquote, dlambda, anaphor, code-walking, symbol-macrolet, defsetf, tagbody-go, once-only, macrolet, compiler-macro, recursive-expansion)
- `complexity`: basic, intermediate, or advanced
- `quality_score`: Classifier score from 0.0 to 1.0
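To make the record layout concrete, here is a minimal sketch of one record and a sanity check over the fields listed above. The example values (the `aif` record, its instruction text, and the 0.9 score) are hypothetical illustrations rather than rows copied from the dataset, and `validate_record` and `split_techniques` are helper names invented for this sketch.

```python
import json

# Field names and allowed values are taken from the lists in this README.
REQUIRED_FIELDS = (
    "instruction", "input", "output", "category",
    "technique", "complexity", "quality_score",
)
CATEGORIES = {
    "capture-management", "anaphoric", "dispatch", "control-flow",
    "DSL", "compiler-macro", "efficiency", "scope",
}
COMPLEXITIES = {"basic", "intermediate", "advanced"}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems
    if record["category"] not in CATEGORIES:
        problems.append(f"unknown category: {record['category']}")
    if record["complexity"] not in COMPLEXITIES:
        problems.append(f"unknown complexity: {record['complexity']}")
    if not 0.0 <= float(record["quality_score"]) <= 1.0:
        problems.append("quality_score outside [0.0, 1.0]")
    return problems


def split_techniques(record):
    """Turn the comma-separated `technique` string into a list of names."""
    return [name.strip() for name in record["technique"].split(",") if name.strip()]


# Hypothetical record, serialized as one JSONL line and parsed back.
example_line = json.dumps({
    "instruction": "Write a macro that binds the test result to `it` so the branches can reuse it.",
    "input": "(let ((it (big-long-calculation))) (if it (foo it) nil))",
    "output": "(defmacro aif (test-form then-form &optional else-form) "
              "`(let ((it ,test-form)) (if it ,then-form ,else-form)))",
    "category": "anaphoric",
    "technique": "anaphor",
    "complexity": "basic",
    "quality_score": 0.9,  # made-up score for illustration
})
example = json.loads(example_line)

print(validate_record(example))   # []
print(split_techniques(example))  # ['anaphor']
```

Splitting `technique` on commas is useful when filtering for records that use a particular technique, since a single record may list several.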
## Categories
| Category | Description | Examples |
|---|---|---|
| capture-management | Hygienic macro writing utilities | defmacro/g!, defmacro!, with-gensyms |
| anaphoric | Deliberate variable capture for conciseness | aif, alambda, alet, aand |
| dispatch | Keyword-based dispatch and inter-closure protocols | dlambda, pandoriclet, with-pandoric |
| control-flow | New evaluation semantics via macros | nlet-tail, condlet, if-match, choose |
| DSL | Domain-specific embedded languages | defunits, _f (generalized setf), dbind |
| compiler-macro | Compile-time optimization of function calls | fformat compiler macro |
| efficiency | Performance-oriented macro techniques | sortf (sorting networks) |
| scope | Lexical scope manipulation | pandoric-eval |
## Use for Fine-tuning
The data is in instruction-input-output JSONL format, ready for fine-tuning:
```python
from datasets import load_dataset
ds = load_dataset("j14i/cl-macros", split="train")
```
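Before training, records usually need to be filtered and turned into prompt/completion pairs. The sketch below shows one possible way to do that with plain dictionaries. The `### Instruction` / `### Input` / `### Response` template, the 0.8 score threshold, the function names, and the two sample rows are all assumptions made for illustration; the dataset does not prescribe any of them.

```python
def keep_record(record, min_score=0.8,
                complexities=("basic", "intermediate", "advanced")):
    """Keep records above an (assumed) quality threshold and in the chosen tiers."""
    return (record["quality_score"] >= min_score
            and record["complexity"] in complexities)


def to_prompt_completion(record):
    """Format one record as a prompt/completion pair (template is an assumption)."""
    prompt = (
        f"### Instruction\n{record['instruction'].strip()}\n\n"
        f"### Input\n{record['input'].strip()}\n\n"
        "### Response\n"
    )
    return {"prompt": prompt, "completion": record["output"].strip()}


# Hypothetical rows standing in for dataset records.
rows = [
    {
        "instruction": "Write a macro that binds the test result to `it`.",
        "input": "(let ((it (big-long-calculation))) (if it (foo it) nil))",
        "output": "(defmacro aif (test-form then-form &optional else-form) "
                  "`(let ((it ,test-form)) (if it ,then-form ,else-form)))",
        "complexity": "basic",
        "quality_score": 0.9,
    },
    {
        "instruction": "Placeholder instruction for a low-scoring record.",
        "input": "(placeholder)",
        "output": "(defmacro placeholder ())",
        "complexity": "advanced",
        "quality_score": 0.3,
    },
]

pairs = [to_prompt_completion(row) for row in rows if keep_record(row)]
print(len(pairs))
print(pairs[0]["prompt"])
```

With the `datasets` library, the same functions could be applied to the loaded split through `ds.filter(keep_record)` and `ds.map(to_prompt_completion)`.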
Target model size: ≤ 30B parameters. The domain is narrow (pattern matching on ASTs and transformations), so a smaller model suffices.