--- title: CodeCraftLab emoji: 👁 colorFrom: pink colorTo: purple sdk: streamlit sdk_version: 1.57.0 app_file: app.py pinned: false license: mit short_description: A fine-tuning platform datasets: - angie-chen55/python-github-code - sdiazlor/python-reasoning-dataset - MatrixStudio/Codeforces-Python-Submissions --- # CodeCraftLab A production-grade platform for fine-tuning, evaluating, and serving code generation models. Built on FastAPI + React with a hardened training pipeline, structured logging, and HuggingFace Hub integration. --- ## What It Does ''' Capability Detail Dataset management Upload, validate, and preprocess Python code datasets via REST API Fine-tuning Configure and run training jobs with Pydantic-validated configs Evaluation Automated eval hooks — pass@k, BLEU, execution accuracy Model serving Authenticated inference endpoints for trained models HF Hub sync Push/pull models and datasets to/from HuggingFace Hub ''' --- ## Quick Start Requirements: Python 3.11+, Docker, CUDA-capable GPU (optional, CPU fallback available) ```bash git clone https://github.com/your-org/codecraftlab.git cd codecraftlab # Copy and configure environment cp .env.example .env # Edit .env: set HF_TOKEN, SECRET_KEY, DATABASE_URL # Start with Docker Compose docker compose up --build # API available at http://localhost:8000 # Docs at http://localhost:8000/docs ``` ### Without Docker: ```bash pip install uv uv sync uv run uvicorn app:app --reload --port 8000 ``` --- ## API Overview All endpoints require a Bearer token. Get one via `POST /auth/token`. ```bash # Authenticate curl -X POST http://localhost:8000/auth/token \ -H "Content-Type: application/json" \ -d '{"username": "admin", "password": "your-password"}' # Upload a dataset curl -X POST http://localhost:8000/datasets/upload \ -H "Authorization: Bearer " \ -F "file=@data/train.jsonl" # Launch a training job curl -X POST http://localhost:8000/training/jobs \ -H "Authorization: Bearer " \ -H "Content-Type: application/json" \ -d @configs/example_job.json # Check job status curl http://localhost:8000/training/jobs/{job_id} \ -H "Authorization: Bearer " ``` ## Full interactive docs: `http://localhost:8000/docs` --- ## Training Configuration Jobs are defined as JSON and validated against Pydantic v2 schemas: ```json { "job_name": "codegen-finetune-v1", "base_model": "Salesforce/codegen-350M-mono", "dataset_id": "ds_abc123", "training": { "num_epochs": 3, "batch_size": 8, "learning_rate": 2e-5, "warmup_ratio": 0.1, "max_seq_length": 1024, "gradient_accumulation_steps": 4 }, "evaluation": { "enabled": true, "strategy": "epoch", "metrics": ["pass_at_1", "pass_at_10", "bleu"] }, "hub": { "push_to_hub": true, "repo_id": "your-org/codegen-finetune-v1" } } ``` --- ## Evaluation Metrics ### Metric Description `pass@k` Fraction of problems solved by at least 1 of k samples `BLEU` N-gram overlap against reference completions `execution_accuracy` Fraction of generated code that runs without error `exact_match` Exact string match against reference outputs Eval results are logged to structured JSON and optionally pushed to HF Hub model cards. --- ## Architecture ``` codecraftlab/ ├── app.py # FastAPI entrypoint ├── routers/ │ ├── auth.py # JWT auth │ ├── datasets.py # Upload, validate, preprocess │ ├── training.py # Job management │ └── inference.py # Model serving ├── training/ │ ├── config.py # Pydantic v2 training configs │ ├── pipeline.py # Fine-tuning pipeline + eval hooks │ └── evaluators.py # Metric implementations ├── models/ # SQLAlchemy ORM models ├── core/ │ ├── auth.py # JWT utils │ ├── logging.py # structlog setup │ └── settings.py # Pydantic settings ├── Dockerfile ├── docker-compose.yml └── pyproject.toml ``` --- ### HuggingFace Space Config — Audit Notes The original Space was configured as `sdk: streamlit`. This repo now runs on FastAPI via Docker: Field Before After Reason `sdk` `streamlit` `docker` FastAPI served via Uvicorn `sdk_version` `1.57.0` (removed) Not applicable for Docker SDK `app_port` (missing) `8000` Required for Docker SDK `pinned` `false` `true` Production Space, should persist `short_description` Generic Specific Better discoverability on HF Hub `tags` (missing) Added Enables HF search indexing --- ## Development ```bash # Run tests uv run pytest tests/ -v --cov=. --cov-report=term-missing # Lint uv run ruff check . uv run mypy . --strict # Format uv run ruff format . ``` Test a training run locally (CPU, minimal config): ```bash uv run python -m training.pipeline \ --config configs/smoke_test.json \ --dry-run ``` --- ### Environment Variables Variable Required Description `SECRET_KEY` Yes JWT signing secret (min 32 chars) `HF_TOKEN` Yes HuggingFace token with write access `DATABASE_URL` Yes PostgreSQL connection string `LOG_LEVEL` No `DEBUG`/`INFO`/`WARNING` (default: `INFO`) `MAX_CONCURRENT_JOBS` No Max parallel training jobs (default: `2`) `MODEL_CACHE_DIR` No Local model cache path (default: `./cache`) --- ## License MIT — see LICENSE