---
title: CodeCraftLab
emoji: πŸ‘
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.57.0
app_file: app.py
pinned: false
license: mit
short_description: A fine-tuning platform
datasets:
- angie-chen55/python-github-code
- sdiazlor/python-reasoning-dataset
- MatrixStudio/Codeforces-Python-Submissions
---
# CodeCraftLab
A production-grade platform for fine-tuning, evaluating, and serving code generation models. Built on FastAPI + React with a hardened training pipeline, structured logging, and HuggingFace Hub integration.
---
## What It Does
| Capability | Detail |
| --- | --- |
| Dataset management | Upload, validate, and preprocess Python code datasets via REST API |
| Fine-tuning | Configure and run training jobs with Pydantic-validated configs |
| Evaluation | Automated eval hooks: pass@k, BLEU, execution accuracy |
| Model serving | Authenticated inference endpoints for trained models |
| HF Hub sync | Push/pull models and datasets to/from HuggingFace Hub |

---
## Quick Start
Requirements: Python 3.11+, Docker, CUDA-capable GPU (optional, CPU fallback available)
```bash
git clone https://github.com/your-org/codecraftlab.git
cd codecraftlab
# Copy and configure environment
cp .env.example .env
# Edit .env: set HF_TOKEN, SECRET_KEY, DATABASE_URL
# Start with Docker Compose
docker compose up --build
# API available at http://localhost:8000
# Docs at http://localhost:8000/docs
```
### Without Docker:
```bash
pip install uv
uv sync
uv run uvicorn app:app --reload --port 8000
```
---
## API Overview
All endpoints except `POST /auth/token` require a Bearer token. Call `POST /auth/token` first to obtain one.
```bash
# Authenticate
curl -X POST http://localhost:8000/auth/token \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "your-password"}'

# Upload a dataset
curl -X POST http://localhost:8000/datasets/upload \
  -H "Authorization: Bearer <token>" \
  -F "file=@data/train.jsonl"

# Launch a training job
curl -X POST http://localhost:8000/training/jobs \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d @configs/example_job.json

# Check job status
curl http://localhost:8000/training/jobs/{job_id} \
  -H "Authorization: Bearer <token>"
```
Full interactive docs: `http://localhost:8000/docs`
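
For scripted use, the same calls can be made from Python. The sketch below only builds the requests with the standard library and does not send them; the endpoint paths and headers are taken from the curl examples above, while the function names `build_job_request` and `build_status_request` are hypothetical and not part of the project.

```python
import json
import urllib.request

# Default address from the Quick Start section.
BASE_URL = "http://localhost:8000"


def build_job_request(token: str, job_config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /training/jobs request."""
    body = json.dumps(job_config).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/training/jobs",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def build_status_request(token: str, job_id: str) -> urllib.request.Request:
    """Build (but do not send) the GET /training/jobs/{job_id} request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/training/jobs/{job_id}",
        method="GET",
        headers={"Authorization": f"Bearer {token}"},
    )
```

To send a request, pass it to `urllib.request.urlopen(...)` and read the response body. The shape of the JSON responses is not documented here, so check the interactive docs before parsing them.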
---
## Training Configuration
Jobs are defined as JSON and validated against Pydantic v2 schemas:
```json
{
  "job_name": "codegen-finetune-v1",
  "base_model": "Salesforce/codegen-350M-mono",
  "dataset_id": "ds_abc123",
  "training": {
    "num_epochs": 3,
    "batch_size": 8,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.1,
    "max_seq_length": 1024,
    "gradient_accumulation_steps": 4
  },
  "evaluation": {
    "enabled": true,
    "strategy": "epoch",
    "metrics": ["pass_at_1", "pass_at_10", "bleu"]
  },
  "hub": {
    "push_to_hub": true,
    "repo_id": "your-org/codegen-finetune-v1"
  }
}
```
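
To see how the `training` fields interact, the sketch below derives the effective batch size and an approximate step schedule from the example config. It assumes a single device, a hypothetical dataset of 10,000 examples, and the common convention that warmup covers `warmup_ratio` of all optimizer steps; the exact rounding in the project's pipeline may differ. The helper name `schedule_summary` is illustrative only.

```python
import math


def schedule_summary(training: dict, num_examples: int) -> dict:
    """Approximate step counts for a training config (single-device assumption)."""
    # Examples consumed per optimizer step.
    effective_batch = training["batch_size"] * training["gradient_accumulation_steps"]
    # Optimizer steps per pass over the data; the last partial batch counts as a step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * training["num_epochs"]
    # Steps spent ramping the learning rate up to its configured value.
    warmup_steps = math.ceil(total_steps * training["warmup_ratio"])
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


example_training = {
    "num_epochs": 3,
    "batch_size": 8,
    "warmup_ratio": 0.1,
    "gradient_accumulation_steps": 4,
}
summary = schedule_summary(example_training, num_examples=10_000)
```

With these inputs the effective batch size is 32, giving 313 optimizer steps per epoch, 939 steps in total, and 94 warmup steps.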
---
## Evaluation Metrics
| Metric | Description |
| --- | --- |
| `pass@k` | Fraction of problems solved by at least one of k samples |
| `BLEU` | N-gram overlap against reference completions |
| `execution_accuracy` | Fraction of generated code that runs without error |
| `exact_match` | Exact string match against reference outputs |

Eval results are logged to structured JSON and optionally pushed to HF Hub model cards.
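
As an illustration of two of the metrics above, the sketch below computes `pass@k` with the widely used unbiased estimator (for each problem, draw `n` samples, count the `c` that pass, and estimate the chance that a random subset of `k` contains at least one pass) and `exact_match` as a plain string comparison. This is a minimal sketch, not necessarily what `training/evaluators.py` does; the function names are assumptions.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets made only of failing samples;
    # it is 0 when fewer than k samples failed, which yields 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference (no normalization)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)
```

Note that `pass_at_10` requires at least 10 samples per problem, so the sampling budget at evaluation time has to cover the largest `k` requested in the job config.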
---
## Architecture
```
codecraftlab/
├── app.py                  # FastAPI entrypoint
├── routers/
│   ├── auth.py             # JWT auth
│   ├── datasets.py         # Upload, validate, preprocess
│   ├── training.py         # Job management
│   └── inference.py        # Model serving
├── training/
│   ├── config.py           # Pydantic v2 training configs
│   ├── pipeline.py         # Fine-tuning pipeline + eval hooks
│   └── evaluators.py       # Metric implementations
├── models/                 # SQLAlchemy ORM models
├── core/
│   ├── auth.py             # JWT utils
│   ├── logging.py          # structlog setup
│   └── settings.py         # Pydantic settings
├── Dockerfile
├── docker-compose.yml
└── pyproject.toml
```
---
### HuggingFace Space Config: Audit Notes
The original Space was configured as `sdk: streamlit`. This repo now runs on FastAPI via Docker:

| Field | Before | After | Reason |
| --- | --- | --- | --- |
| `sdk` | `streamlit` | `docker` | FastAPI served via Uvicorn |
| `sdk_version` | `1.57.0` | (removed) | Not applicable for Docker SDK |
| `app_port` | (missing) | `8000` | Docker Spaces default to port 7860; the app listens on 8000 |
| `pinned` | `false` | `true` | Production Space, should persist |
| `short_description` | Generic | Specific | Better discoverability on HF Hub |
| `tags` | (missing) | Added | Enables HF search indexing |

---
## Development
```bash
# Run tests
uv run pytest tests/ -v --cov=. --cov-report=term-missing
# Lint
uv run ruff check .
uv run mypy . --strict
# Format
uv run ruff format .
```
Test a training run locally (CPU, minimal config):
```bash
uv run python -m training.pipeline \
--config configs/smoke_test.json \
--dry-run
```
---
### Environment Variables
| Variable | Required | Description |
| --- | --- | --- |
| `SECRET_KEY` | Yes | JWT signing secret (min 32 chars) |
| `HF_TOKEN` | Yes | HuggingFace token with write access |
| `DATABASE_URL` | Yes | PostgreSQL connection string |
| `LOG_LEVEL` | No | `DEBUG`/`INFO`/`WARNING` (default: `INFO`) |
| `MAX_CONCURRENT_JOBS` | No | Max parallel training jobs (default: `2`) |
| `MODEL_CACHE_DIR` | No | Local model cache path (default: `./cache`) |

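
The rules in the table can be checked at startup. The sketch below is a dependency-free illustration that uses a standard-library dataclass; the project itself lists Pydantic settings in `core/settings.py`, so the `Settings` class and `load_settings` function here are hypothetical stand-ins rather than the actual implementation.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("SECRET_KEY", "HF_TOKEN", "DATABASE_URL")
LOG_LEVELS = {"DEBUG", "INFO", "WARNING"}


@dataclass(frozen=True)
class Settings:
    secret_key: str
    hf_token: str
    database_url: str
    log_level: str = "INFO"
    max_concurrent_jobs: int = 2
    model_cache_dir: str = "./cache"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    if len(env["SECRET_KEY"]) < 32:
        raise ValueError("SECRET_KEY must be at least 32 characters")
    log_level = env.get("LOG_LEVEL", "INFO").upper()
    if log_level not in LOG_LEVELS:
        raise ValueError(f"LOG_LEVEL must be one of {sorted(LOG_LEVELS)}")
    return Settings(
        secret_key=env["SECRET_KEY"],
        hf_token=env["HF_TOKEN"],
        database_url=env["DATABASE_URL"],
        log_level=log_level,
        max_concurrent_jobs=int(env.get("MAX_CONCURRENT_JOBS", "2")),
        model_cache_dir=env.get("MODEL_CACHE_DIR", "./cache"),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the loader easy to test, and failing fast on a short `SECRET_KEY` surfaces misconfiguration before the API starts serving requests.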
---
## License
MIT. See `LICENSE`.