metadata
title: CodeCraftLab
emoji: πŸ‘
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.57.0
app_file: app.py
pinned: false
license: mit
short_description: A fine-tuning platform
datasets:
  - angie-chen55/python-github-code
  - sdiazlor/python-reasoning-dataset
  - MatrixStudio/Codeforces-Python-Submissions

CodeCraftLab

A production-grade platform for fine-tuning, evaluating, and serving code generation models. Built on FastAPI + React with a hardened training pipeline, structured logging, and HuggingFace Hub integration.

What It Does

| Capability | Detail |
| --- | --- |
| Dataset management | Upload, validate, and preprocess Python code datasets via REST API |
| Fine-tuning | Configure and run training jobs with Pydantic-validated configs |
| Evaluation | Automated eval hooks: pass@k, BLEU, execution accuracy |
| Model serving | Authenticated inference endpoints for trained models |
| HF Hub sync | Push/pull models and datasets to/from HuggingFace Hub |

Quick Start

Requirements: Python 3.11+, Docker, CUDA-capable GPU (optional, CPU fallback available)

git clone https://github.com/your-org/codecraftlab.git
cd codecraftlab

# Copy and configure environment
cp .env.example .env
# Edit .env: set HF_TOKEN, SECRET_KEY, DATABASE_URL

# Start with Docker Compose
docker compose up --build

# API available at http://localhost:8000
# Docs at http://localhost:8000/docs

Without Docker:

pip install uv
uv sync
uv run uvicorn app:app --reload --port 8000

API Overview

Every endpoint other than POST /auth/token requires a Bearer token. Request one from POST /auth/token first.

# Authenticate
curl -X POST http://localhost:8000/auth/token \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "your-password"}'

# Upload a dataset
curl -X POST http://localhost:8000/datasets/upload \
  -H "Authorization: Bearer <token>" \
  -F "file=@data/train.jsonl"

# Launch a training job
curl -X POST http://localhost:8000/training/jobs \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d @configs/example_job.json

# Check job status (replace <job_id> with a real job ID)
curl "http://localhost:8000/training/jobs/<job_id>" \
  -H "Authorization: Bearer <token>"

Full interactive docs: http://localhost:8000/docs
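
The same calls can be scripted from Python. The sketch below is a minimal, standard-library-only illustration of the authenticate-then-query flow shown in the curl examples. The helper names (`token_request`, `job_status_request`, `send`) are hypothetical and not part of the project, and the shape of the responses is not documented here, so `send` only assumes that responses are JSON.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # local address from Quick Start


def token_request(username: str, password: str) -> urllib.request.Request:
    """Build the POST /auth/token request with a JSON body."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/auth/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def job_status_request(token: str, job_id: str) -> urllib.request.Request:
    """Build the GET /training/jobs/<job_id> request with a Bearer header."""
    safe_id = urllib.parse.quote(job_id, safe="")  # keep the ID a single path segment
    return urllib.request.Request(
        f"{BASE_URL}/training/jobs/{safe_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the body (assumption: the API answers with JSON)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

The name of the field that carries the token in the /auth/token response is not stated in this README; check the interactive docs before reading it out of the result of `send(token_request(...))`.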


Training Configuration

Jobs are defined as JSON and validated against Pydantic v2 schemas:

```json
{
  "job_name": "codegen-finetune-v1",
  "base_model": "Salesforce/codegen-350M-mono",
  "dataset_id": "ds_abc123",
  "training": {
    "num_epochs": 3,
    "batch_size": 8,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.1,
    "max_seq_length": 1024,
    "gradient_accumulation_steps": 4
  },
  "evaluation": {
    "enabled": true,
    "strategy": "epoch",
    "metrics": ["pass_at_1", "pass_at_10", "bleu"]
  },
  "hub": {
    "push_to_hub": true,
    "repo_id": "your-org/codegen-finetune-v1"
  }
}
```
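
To make the "training" block of the example config concrete, here is a rough sketch of how it could be validated and how batch size interacts with gradient accumulation. It uses a standard-library dataclass as a stand-in, not the project's Pydantic v2 schemas in training/config.py. The class name `TrainingParams`, the helper `parse_training_params`, and the bounds checked in `__post_init__` are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingParams:
    """Stdlib stand-in for the "training" block of a job config."""

    num_epochs: int
    batch_size: int
    learning_rate: float
    warmup_ratio: float
    max_seq_length: int
    gradient_accumulation_steps: int = 1

    def __post_init__(self) -> None:
        # Hypothetical bounds; the project's real schema may differ.
        for name in ("num_epochs", "batch_size", "max_seq_length",
                     "gradient_accumulation_steps"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be > 0")
        if not 0.0 <= self.warmup_ratio <= 1.0:
            raise ValueError("warmup_ratio must be in [0, 1]")

    @property
    def effective_batch_size(self) -> int:
        """Samples contributing to each optimizer step on a single device."""
        return self.batch_size * self.gradient_accumulation_steps


def parse_training_params(job_json: str) -> TrainingParams:
    """Extract and validate the "training" block from a job config string."""
    return TrainingParams(**json.loads(job_json)["training"])
```

With the example values (batch_size 8, gradient_accumulation_steps 4), each optimizer step sees 8 × 4 = 32 samples per device.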

Evaluation Metrics

| Metric | Description |
| --- | --- |
| pass@k | Fraction of problems solved by at least 1 of k samples |
| BLEU | N-gram overlap against reference completions |
| execution_accuracy | Fraction of generated code that runs without error |
| exact_match | Exact string match against reference outputs |

Eval results are logged to structured JSON and optionally pushed to HF Hub model cards.
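
For reference, pass@k is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021): generate n ≥ k samples per problem, count the c samples that pass, and evaluate 1 − C(n−c, k) / C(n, k). The sketch below implements that estimator. Whether training/evaluators.py uses this exact formula is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. comb(n - c, k) is 0 whenever n - c < k.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When n equals k the estimator reduces to the definition in the table: a problem scores 1 if any of its k samples passes and 0 otherwise.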

Architecture

codecraftlab/
├── app.py                  # FastAPI entrypoint
├── routers/
│   ├── auth.py             # JWT auth
│   ├── datasets.py         # Upload, validate, preprocess
│   ├── training.py         # Job management
│   └── inference.py        # Model serving
├── training/
│   ├── config.py           # Pydantic v2 training configs
│   ├── pipeline.py         # Fine-tuning pipeline + eval hooks
│   └── evaluators.py       # Metric implementations
├── models/                 # SQLAlchemy ORM models
├── core/
│   ├── auth.py             # JWT utils
│   ├── logging.py          # structlog setup
│   └── settings.py         # Pydantic settings
├── Dockerfile
├── docker-compose.yml
└── pyproject.toml

HuggingFace Space Config: Audit Notes

The original Space was configured as sdk: streamlit. This repo now runs on FastAPI via Docker:

| Field | Before | After | Reason |
| --- | --- | --- | --- |
| sdk | streamlit | docker | FastAPI served via Uvicorn |
| sdk_version | 1.57.0 | (removed) | Not applicable for Docker SDK |
| app_port | (missing) | 8000 | Docker Spaces default to port 7860; the app listens on 8000 |
| pinned | false | true | Production Space; keep it pinned on the profile |
| short_description | Generic | Specific | Better discoverability on HF Hub |
| tags | (missing) | Added | Enables HF search indexing |
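
As an illustration of the changes in the audit table, the sketch below applies them to a metadata mapping. The function name `migrate_space_metadata` is hypothetical. It leaves short_description and tags untouched because the table does not give their new values.

```python
def migrate_space_metadata(metadata: dict) -> dict:
    """Return a copy of Space metadata with the Docker-related changes applied."""
    updated = dict(metadata)           # do not mutate the caller's mapping
    updated["sdk"] = "docker"          # FastAPI served via Uvicorn in a container
    updated.pop("sdk_version", None)   # only meaningful for non-Docker SDKs
    updated["app_port"] = 8000         # port the app listens on
    updated["pinned"] = True
    return updated
```

For example, passing `{"sdk": "streamlit", "sdk_version": "1.57.0", "pinned": False}` yields a mapping with sdk set to docker, no sdk_version key, app_port 8000, and pinned set to True.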

Development

# Run tests
uv run pytest tests/ -v --cov=. --cov-report=term-missing

# Lint
uv run ruff check .
uv run mypy . --strict

# Format
uv run ruff format .

Test a training run locally (CPU, minimal config):

```bash
uv run python -m training.pipeline \
  --config configs/smoke_test.json \
  --dry-run
```

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| SECRET_KEY | Yes | JWT signing secret (min 32 chars) |
| HF_TOKEN | Yes | HuggingFace token with write access |
| DATABASE_URL | Yes | PostgreSQL connection string |
| LOG_LEVEL | No | DEBUG/INFO/WARNING (default: INFO) |
| MAX_CONCURRENT_JOBS | No | Max parallel training jobs (default: 2) |
| MODEL_CACHE_DIR | No | Local model cache path (default: ./cache) |
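
The required/optional split and the defaults in the table can be expressed as a small loader. The sketch below uses only the standard library as a stand-in for the Pydantic settings in core/settings.py. The `Settings` class and `load_settings` function are assumptions for illustration, not the project's code.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("SECRET_KEY", "HF_TOKEN", "DATABASE_URL")


@dataclass(frozen=True)
class Settings:
    secret_key: str
    hf_token: str
    database_url: str
    log_level: str = "INFO"
    max_concurrent_jobs: int = 2
    model_cache_dir: str = "./cache"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, applying documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    if len(env["SECRET_KEY"]) < 32:
        raise ValueError("SECRET_KEY must be at least 32 characters")
    return Settings(
        secret_key=env["SECRET_KEY"],
        hf_token=env["HF_TOKEN"],
        database_url=env["DATABASE_URL"],
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        max_concurrent_jobs=int(env.get("MAX_CONCURRENT_JOBS", "2")),
        model_cache_dir=env.get("MODEL_CACHE_DIR", "./cache"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.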

License

MIT. See LICENSE.