	---
	title: CodeCraftLab
	emoji: 👁
	colorFrom: pink
	colorTo: purple
	sdk: streamlit
	sdk_version: 1.57.0
	app_file: app.py
	pinned: false
	license: mit
	short_description: A fine-tuning platform
	datasets:
	- angie-chen55/python-github-code
	- sdiazlor/python-reasoning-dataset
	- MatrixStudio/Codeforces-Python-Submissions
	---

	# CodeCraftLab
	A production-grade platform for fine-tuning, evaluating, and serving code generation models. Built on FastAPI + React with a hardened training pipeline, structured logging, and HuggingFace Hub integration.

	---
	## What It Does
	| Capability | Detail |
	| --- | --- |
	| Dataset management | Upload, validate, and preprocess Python code datasets via REST API |
	| Fine-tuning | Configure and run training jobs with Pydantic-validated configs |
	| Evaluation | Automated eval hooks: pass@k, BLEU, execution accuracy |
	| Model serving | Authenticated inference endpoints for trained models |
	| HF Hub sync | Push/pull models and datasets to/from HuggingFace Hub |

	---
	## Quick Start
	Requirements: Python 3.11+, Docker, CUDA-capable GPU (optional, CPU fallback available)
	```bash
	git clone https://github.com/your-org/codecraftlab.git
	cd codecraftlab

	# Copy and configure environment
	cp .env.example .env
	# Edit .env: set HF_TOKEN, SECRET_KEY, DATABASE_URL

	# Start with Docker Compose
	docker compose up --build

	# API available at http://localhost:8000
	# Docs at http://localhost:8000/docs
	```
	### Without Docker:
	```bash
	pip install uv
	uv sync
	uv run uvicorn app:app --reload --port 8000
	```
	---
	## API Overview
	All API endpoints require a Bearer token. Obtain one via `POST /auth/token`, which is the only call made without a token.
	```bash
	# Authenticate
	curl -X POST http://localhost:8000/auth/token \
	  -H "Content-Type: application/json" \
	  -d '{"username": "admin", "password": "your-password"}'

	# Upload a dataset
	curl -X POST http://localhost:8000/datasets/upload \
	  -H "Authorization: Bearer <token>" \
	  -F "file=@data/train.jsonl"

	# Launch a training job
	curl -X POST http://localhost:8000/training/jobs \
	  -H "Authorization: Bearer <token>" \
	  -H "Content-Type: application/json" \
	  -d @configs/example_job.json

	# Check job status (replace <job_id> with your job's ID; the URL is quoted
	# so the shell and curl do not interpret the placeholder characters)
	curl "http://localhost:8000/training/jobs/<job_id>" \
	  -H "Authorization: Bearer <token>"
	```
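
The same calls can be made from Python. The sketch below uses only the standard library and builds the requests without sending them, so it can be inspected offline. The helper names are illustrative, and the `access_token` field mentioned in the trailing comment is an assumption: this README does not specify the token response shape, so check `/docs` on a running instance.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from Quick Start


def token_request(username: str, password: str) -> urllib.request.Request:
    """Build the POST /auth/token request shown in the curl example."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/auth/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def job_status_request(job_id: str, token: str) -> urllib.request.Request:
    """Build the GET /training/jobs/<job_id> request with a Bearer header."""
    return urllib.request.Request(
        f"{BASE_URL}/training/jobs/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def send(req: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# With a server running, usage might look like this
# ("access_token" is an assumed field name):
#   token = send(token_request("admin", "your-password"))["access_token"]
#   status = send(job_status_request("my-job-id", token))
```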
	Full interactive docs: `http://localhost:8000/docs`

	---
	## Training Configuration
	Jobs are defined as JSON and validated against Pydantic v2 schemas:
	```json
	{
	  "job_name": "codegen-finetune-v1",
	  "base_model": "Salesforce/codegen-350M-mono",
	  "dataset_id": "ds_abc123",
	  "training": {
	    "num_epochs": 3,
	    "batch_size": 8,
	    "learning_rate": 2e-5,
	    "warmup_ratio": 0.1,
	    "max_seq_length": 1024,
	    "gradient_accumulation_steps": 4
	  },
	  "evaluation": {
	    "enabled": true,
	    "strategy": "epoch",
	    "metrics": ["pass_at_1", "pass_at_10", "bleu"]
	  },
	  "hub": {
	    "push_to_hub": true,
	    "repo_id": "your-org/codegen-finetune-v1"
	  }
	}
	```
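
To make the interaction between `batch_size`, `gradient_accumulation_steps`, and `warmup_ratio` concrete, the sketch below derives the effective batch size and warmup step count from the training values in the example config. It assumes a single device and the conventional meaning of these fields; `num_examples` is a hypothetical dataset size, `schedule_summary` is an illustrative name, and the pipeline's own step accounting may differ.

```python
import json
import math

# Training values copied from the example job config.
CONFIG = json.loads("""
{
  "training": {
    "num_epochs": 3,
    "batch_size": 8,
    "warmup_ratio": 0.1,
    "gradient_accumulation_steps": 4
  }
}
""")


def schedule_summary(training: dict, num_examples: int) -> dict:
    """Derive step counts from a training config (single device assumed)."""
    # Examples consumed per optimizer update.
    effective_batch = training["batch_size"] * training["gradient_accumulation_steps"]
    # Optimizer updates per epoch, counting a final partial batch.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * training["num_epochs"]
    # Number of updates over which the learning rate ramps up.
    warmup_steps = math.ceil(total_steps * training["warmup_ratio"])
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# num_examples is a made-up dataset size used only for illustration.
summary = schedule_summary(CONFIG["training"], num_examples=10_000)
print(summary)
```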
	---
	## Evaluation Metrics
	| Metric | Description |
	| --- | --- |
	| `pass@k` | Fraction of problems solved by at least one of k samples |
	| `BLEU` | N-gram overlap against reference completions |
	| `execution_accuracy` | Fraction of generated code that runs without error |
	| `exact_match` | Exact string match against reference outputs |

	Eval results are logged to structured JSON and optionally pushed to HF Hub model cards.

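
For reference, `pass@k` is commonly computed with an unbiased estimator: generate n samples per problem (n at least k), count the c samples that pass, and average 1 - C(n-c, k) / C(n, k) over all problems. The sketch below implements that estimator. Whether `training/evaluators.py` uses this exact form is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random size-k
    # subset of the samples contains no passing sample. comb returns 0 when
    # k > n - c, so the estimate is 1.0 whenever a miss is impossible.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```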
	---
	## Architecture
	```
	codecraftlab/
	├── app.py                  # FastAPI entrypoint
	├── routers/
	│   ├── auth.py             # JWT auth
	│   ├── datasets.py         # Upload, validate, preprocess
	│   ├── training.py         # Job management
	│   └── inference.py        # Model serving
	├── training/
	│   ├── config.py           # Pydantic v2 training configs
	│   ├── pipeline.py         # Fine-tuning pipeline + eval hooks
	│   └── evaluators.py       # Metric implementations
	├── models/                 # SQLAlchemy ORM models
	├── core/
	│   ├── auth.py             # JWT utils
	│   ├── logging.py          # structlog setup
	│   └── settings.py         # Pydantic settings
	├── Dockerfile
	├── docker-compose.yml
	└── pyproject.toml
	```
	---
	## HuggingFace Space Config: Audit Notes
	The original Space was configured as `sdk: streamlit`. This repo now runs on FastAPI via Docker, so the Space metadata changes as follows:

	| Field | Before | After | Reason |
	| --- | --- | --- | --- |
	| `sdk` | `streamlit` | `docker` | FastAPI served via Uvicorn |
	| `sdk_version` | `1.57.0` | (removed) | Not applicable for Docker SDK |
	| `app_port` | (missing) | `8000` | Uvicorn listens on 8000; Docker Spaces otherwise default to port 7860 |
	| `pinned` | `false` | `true` | Production Space, should persist |
	| `short_description` | Generic | Specific | Better discoverability on HF Hub |
	| `tags` | (missing) | Added | Enables HF search indexing |

	---
	## Development
	```bash
	# Run tests
	uv run pytest tests/ -v --cov=. --cov-report=term-missing

	# Lint
	uv run ruff check .
	uv run mypy . --strict

	# Format
	uv run ruff format .
	```
	Test a training run locally (CPU, minimal config):
	```bash
	uv run python -m training.pipeline \
	--config configs/smoke_test.json \
	--dry-run
	```
	---
	## Environment Variables

	| Variable | Required | Description |
	| --- | --- | --- |
	| `SECRET_KEY` | Yes | JWT signing secret (min 32 chars) |
	| `HF_TOKEN` | Yes | HuggingFace token with write access |
	| `DATABASE_URL` | Yes | PostgreSQL connection string |
	| `LOG_LEVEL` | No | `DEBUG`/`INFO`/`WARNING` (default: `INFO`) |
	| `MAX_CONCURRENT_JOBS` | No | Max parallel training jobs (default: `2`) |
	| `MODEL_CACHE_DIR` | No | Local model cache path (default: `./cache`) |

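
The variables above can be mirrored in a small loader. The sketch below is a standard-library stand-in for the Pydantic settings class in `core/settings.py`, whose actual fields are not shown here. It enforces the required variables, the 32-character minimum for `SECRET_KEY`, and the documented defaults. `Settings` and `load_settings` are illustrative names.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("SECRET_KEY", "HF_TOKEN", "DATABASE_URL")
LOG_LEVELS = {"DEBUG", "INFO", "WARNING"}  # the levels listed for LOG_LEVEL


@dataclass(frozen=True)
class Settings:
    secret_key: str
    hf_token: str
    database_url: str
    log_level: str = "INFO"
    max_concurrent_jobs: int = 2
    model_cache_dir: str = "./cache"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (the process environment by default)."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    if len(env["SECRET_KEY"]) < 32:
        raise ValueError("SECRET_KEY must be at least 32 characters")
    log_level = env.get("LOG_LEVEL", "INFO").upper()
    if log_level not in LOG_LEVELS:
        raise ValueError(f"LOG_LEVEL must be one of {sorted(LOG_LEVELS)}")
    return Settings(
        secret_key=env["SECRET_KEY"],
        hf_token=env["HF_TOKEN"],
        database_url=env["DATABASE_URL"],
        log_level=log_level,
        max_concurrent_jobs=int(env.get("MAX_CONCURRENT_JOBS", "2")),
        model_cache_dir=env.get("MODEL_CACHE_DIR", "./cache"),
    )
```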
	---
	## License
	MIT — see LICENSE