Spaces: S-Dreamer/CodeCraftLab

S-Dreamer committed on May 16

Commit 1be4869 (verified) · 1 Parent(s): b9ed97d

Update README.md

Files changed (1)

README.md +151 -39

@@ -15,50 +15,162 @@ datasets:
 - MatrixStudio/Codeforces-Python-Submissions
 ---
-# CodeGen Hub 🚀
-[![Run on Replit](https://replit.com/badge?caption=Run%20on%20Replit)](https://replit.com/@replit/CodeGen-Hub) ![Status](https://img.shields.io/badge/status-active-success) ![Python](https://img.shields.io/badge/python-v3.11-blue)
-A streamlined platform for training and using code generation models with Hugging Face integration 🤗
-## ✨ Features
-- 📊 Upload and preprocess Python code datasets
-- 🛠️ Configure and train models with customizable parameters
-- 💡 Generate code predictions using trained models
-- 📈 Monitor training progress with visualizations
-- 🔄 Seamless integration with Hugging Face Hub
-## 🚀 Getting Started
-1. Run the Streamlit app
-2. Upload your Python code dataset in the Dataset Management section
-3. Train your model in the Model Training section
-4. Generate code using your trained models in the Code Generation section
-## 🛠️ Technology Stack
-- Streamlit for the web interface
-- PyTorch for model training
-- Hugging Face Transformers for code generation
-- Pandas for data handling
-- Plotly for visualizations
-## 💻 Development
-Run linting and tests:
 ```bash
-./scripts/lint.sh
 ```
-## 📝 License
-MIT License - feel free to use and modify!
-## 🤝 Contributing
-Contributions welcome! Please check our contribution guidelines.
 ---
-Made with 💖 using [Replit](https://replit.com)

+# CodeCraftLab
+A production-grade platform for fine-tuning, evaluating, and serving code generation models. Built on FastAPI + React with a hardened training pipeline, structured logging, and HuggingFace Hub integration.
+---
+## What It Does
+| Capability | Detail |
+| --- | --- |
+| Dataset management | Upload, validate, and preprocess Python code datasets via REST API |
+| Fine-tuning | Configure and run training jobs with Pydantic-validated configs |
+| Evaluation | Automated eval hooks: pass@k, BLEU, execution accuracy |
+| Model serving | Authenticated inference endpoints for trained models |
+| HF Hub sync | Push/pull models and datasets to/from HuggingFace Hub |
+---
+## Quick Start
+Requirements: Python 3.11+, Docker, CUDA-capable GPU (optional, CPU fallback available)
+```bash
+git clone https://github.com/your-org/codecraftlab.git
+cd codecraftlab
+# Copy and configure environment
+cp .env.example .env
+# Edit .env: set HF_TOKEN, SECRET_KEY, DATABASE_URL
+# Start with Docker Compose
+docker compose up --build
+# API available at http://localhost:8000
+# Docs at http://localhost:8000/docs
+```
+### Without Docker:
 ```bash
+pip install uv
+uv sync
+uv run uvicorn app:app --reload --port 8000
 ```
+---
+## API Overview
+All endpoints require a Bearer token. Get one via `POST /auth/token`.
+```bash
+# Authenticate
+curl -X POST http://localhost:8000/auth/token \
+  -H "Content-Type: application/json" \
+  -d '{"username": "admin", "password": "your-password"}'
+# Upload a dataset
+curl -X POST http://localhost:8000/datasets/upload \
+  -H "Authorization: Bearer <token>" \
+  -F "file=@data/train.jsonl"
+# Launch a training job
+curl -X POST http://localhost:8000/training/jobs \
+  -H "Authorization: Bearer <token>" \
+  -H "Content-Type: application/json" \
+  -d @configs/example_job.json
+# Check job status
+curl "http://localhost:8000/training/jobs/<job_id>" \
+  -H "Authorization: Bearer <token>"
+```
+Full interactive docs: `http://localhost:8000/docs`
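The curl calls above can also be driven from Python. The following is a minimal sketch using only the standard library; it builds the requests without sending them, so it runs without a live server. The helper names `token_request` and `job_status_request` and the `BASE_URL` constant are illustrative assumptions, and the shape of the token response is not specified in this README, so it is not parsed here.

```python
import json
from urllib.request import Request

# Assumption: the API is reachable at the address given in Quick Start.
BASE_URL = "http://localhost:8000"


def token_request(username: str, password: str) -> Request:
    """Build the POST /auth/token request with a JSON credential body."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return Request(
        f"{BASE_URL}/auth/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def job_status_request(job_id: str, token: str) -> Request:
    """Build the GET /training/jobs/<job_id> request with a Bearer token."""
    return Request(
        f"{BASE_URL}/training/jobs/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Passing either object to `urllib.request.urlopen` sends it; keeping construction separate from sending makes the headers and body easy to inspect in tests.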
+---
+## Training Configuration
+Jobs are defined as JSON and validated against Pydantic v2 schemas:
+```json
+{
+  "job_name": "codegen-finetune-v1",
+  "base_model": "Salesforce/codegen-350M-mono",
+  "dataset_id": "ds_abc123",
+  "training": {
+    "num_epochs": 3,
+    "batch_size": 8,
+    "learning_rate": 2e-5,
+    "warmup_ratio": 0.1,
+    "max_seq_length": 1024,
+    "gradient_accumulation_steps": 4
+  },
+  "evaluation": {
+    "enabled": true,
+    "strategy": "epoch",
+    "metrics": ["pass_at_1", "pass_at_10", "bleu"]
+  },
+  "hub": {
+    "push_to_hub": true,
+    "repo_id": "your-org/codegen-finetune-v1"
+  }
+}
+```
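To illustrate the kind of checks a validated config implies, here is a small sketch of the `training` block using a standard-library dataclass. It is a stand-in, not the project's Pydantic v2 schema: the class name `TrainingParams`, the helper `parse_training`, and the specific constraints are assumptions made for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingParams:
    num_epochs: int
    batch_size: int
    learning_rate: float
    warmup_ratio: float
    max_seq_length: int
    gradient_accumulation_steps: int

    def __post_init__(self) -> None:
        # Hypothetical constraints; the project's real rules may differ.
        if self.num_epochs < 1 or self.batch_size < 1:
            raise ValueError("num_epochs and batch_size must be >= 1")
        if self.gradient_accumulation_steps < 1:
            raise ValueError("gradient_accumulation_steps must be >= 1")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if not 0.0 <= self.warmup_ratio <= 1.0:
            raise ValueError("warmup_ratio must be in [0, 1]")

    @property
    def effective_batch_size(self) -> int:
        """Samples contributing to one optimizer step on a single device."""
        return self.batch_size * self.gradient_accumulation_steps


def parse_training(raw: str) -> TrainingParams:
    """Parse the "training" block of a job config JSON document."""
    return TrainingParams(**json.loads(raw)["training"])
```

With the example job above, `batch_size` 8 and `gradient_accumulation_steps` 4 give an effective batch size of 32 per device.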
+---
+## Evaluation Metrics
+| Metric | Description |
+| --- | --- |
+| `pass@k` | Fraction of problems solved by at least 1 of k samples |
+| `BLEU` | N-gram overlap against reference completions |
+| `execution_accuracy` | Fraction of generated code that runs without error |
+| `exact_match` | Exact string match against reference outputs |
+Eval results are logged to structured JSON and optionally pushed to HF Hub model cards.
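As a worked illustration of two of these metrics, the sketch below computes `pass@k` with the unbiased estimator popularized by the HumanEval benchmark (draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k is correct), plus a plain `exact_match`. Whether `training/evaluators.py` uses this estimator is not stated here; the function names are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that k samples chosen
    # without replacement are all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)
```

For example, a problem with 10 samples of which 5 pass gives `pass_at_k(10, 5, 1) == 0.5`.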
+---
+## Architecture
+```
+codecraftlab/
+├── app.py                  # FastAPI entrypoint
+├── routers/
+│   ├── auth.py             # JWT auth
+│   ├── datasets.py         # Upload, validate, preprocess
+│   ├── training.py         # Job management
+│   └── inference.py        # Model serving
+├── training/
+│   ├── config.py           # Pydantic v2 training configs
+│   ├── pipeline.py         # Fine-tuning pipeline + eval hooks
+│   └── evaluators.py       # Metric implementations
+├── models/                 # SQLAlchemy ORM models
+├── core/
+│   ├── auth.py             # JWT utils
+│   ├── logging.py          # structlog setup
+│   └── settings.py         # Pydantic settings
+├── Dockerfile
+├── docker-compose.yml
+└── pyproject.toml
+```
+---
+### HuggingFace Space Config — Audit Notes
+The original Space was configured as `sdk: streamlit`. This repo now runs on FastAPI via Docker:
+| Field | Before | After | Reason |
+| --- | --- | --- | --- |
+| `sdk` | `streamlit` | `docker` | FastAPI served via Uvicorn |
+| `sdk_version` | `1.57.0` | (removed) | Not applicable for Docker SDK |
+| `app_port` | (missing) | `8000` | Required for Docker SDK |
+| `pinned` | `false` | `true` | Production Space; pinning keeps it at the top of the profile |
+| `short_description` | Generic | Specific | Better discoverability on HF Hub |
+| `tags` | (missing) | Added | Enables HF search indexing |
+---
+## Development
+```bash
+# Run tests
+uv run pytest tests/ -v --cov=. --cov-report=term-missing
+# Lint
+uv run ruff check .
+uv run mypy . --strict
+# Format
+uv run ruff format .
+```
+Test a training run locally (CPU, minimal config):
+```bash
+uv run python -m training.pipeline \
+  --config configs/smoke_test.json \
+  --dry-run
+```
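The command above implies a small command-line entrypoint in `training/pipeline.py`. The sketch below shows one way the `--config` and `--dry-run` flags could be wired up with `argparse`; the behavior of a dry run (parse the config, then exit before training) and the function names `build_parser` and `main` are assumptions, not the project's actual implementation.

```python
import argparse
import json
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="training.pipeline")
    parser.add_argument("--config", type=Path, required=True,
                        help="Path to a job config JSON file")
    parser.add_argument("--dry-run", action="store_true",
                        help="Parse the config and exit without training")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    config = json.loads(args.config.read_text(encoding="utf-8"))
    print(f"Loaded job {config.get('job_name', '<unnamed>')!r}")
    if args.dry_run:
        # Assumed semantics: stop after the config has been read successfully.
        print("Dry run: config parsed, skipping training")
        return 0
    raise NotImplementedError("the training loop is outside this sketch")
```

Returning an exit code from `main` keeps the function easy to call from tests with an explicit argument list.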
+---
+### Environment Variables
+| Variable | Required | Description |
+| --- | --- | --- |
+| `SECRET_KEY` | Yes | JWT signing secret (min 32 chars) |
+| `HF_TOKEN` | Yes | HuggingFace token with write access |
+| `DATABASE_URL` | Yes | PostgreSQL connection string |
+| `LOG_LEVEL` | No | `DEBUG`/`INFO`/`WARNING` (default: `INFO`) |
+| `MAX_CONCURRENT_JOBS` | No | Max parallel training jobs (default: `2`) |
+| `MODEL_CACHE_DIR` | No | Local model cache path (default: `./cache`) |
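The rules in this table (three required variables, a 32-character minimum for `SECRET_KEY`, and the listed defaults) can be sketched as a loader. The architecture lists `core/settings.py` as Pydantic settings; the version below is a standard-library stand-in, and the names `Settings` and `load_settings` are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("SECRET_KEY", "HF_TOKEN", "DATABASE_URL")
LOG_LEVELS = {"DEBUG", "INFO", "WARNING"}


@dataclass(frozen=True)
class Settings:
    secret_key: str
    hf_token: str
    database_url: str
    log_level: str = "INFO"
    max_concurrent_jobs: int = 2
    model_cache_dir: str = "./cache"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, applying the documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    if len(env["SECRET_KEY"]) < 32:
        raise ValueError("SECRET_KEY must be at least 32 characters")
    log_level = env.get("LOG_LEVEL", "INFO").upper()
    if log_level not in LOG_LEVELS:
        raise ValueError(f"LOG_LEVEL must be one of {sorted(LOG_LEVELS)}")
    return Settings(
        secret_key=env["SECRET_KEY"],
        hf_token=env["HF_TOKEN"],
        database_url=env["DATABASE_URL"],
        log_level=log_level,
        max_concurrent_jobs=int(env.get("MAX_CONCURRENT_JOBS", "2")),
        model_cache_dir=env.get("MODEL_CACHE_DIR", "./cache"),
    )
```

Accepting any mapping rather than reading `os.environ` directly lets tests pass a plain dictionary.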
 ---
+## License
+MIT — see LICENSE