S-Dreamer committed on
Commit 1be4869 · verified · 1 Parent(s): b9ed97d

Update README.md

Files changed (1)
  1. README.md +151 -39
README.md CHANGED
@@ -15,50 +15,162 @@ datasets:
  - MatrixStudio/Codeforces-Python-Submissions
  ---

- # CodeGen Hub 🚀
-
- [![Run on Replit](https://replit.com/badge?caption=Run%20on%20Replit)](https://replit.com/@replit/CodeGen-Hub) ![Status](https://img.shields.io/badge/status-active-success) ![Python](https://img.shields.io/badge/python-v3.11-blue)
-
- A streamlined platform for training and using code generation models with Hugging Face integration 🤗
-
- ## ✨ Features
-
- - 📊 Upload and preprocess Python code datasets
- - 🛠️ Configure and train models with customizable parameters
- - 💡 Generate code predictions using trained models
- - 📈 Monitor training progress with visualizations
- - 🔄 Seamless integration with Hugging Face Hub
-
- ## 🚀 Getting Started
-
- 1. Run the Streamlit app
- 2. Upload your Python code dataset in the Dataset Management section
- 3. Train your model in the Model Training section
- 4. Generate code using your trained models in the Code Generation section
-
- ## 🛠️ Technology Stack
-
- - Streamlit for the web interface
- - PyTorch for model training
- - Hugging Face Transformers for code generation
- - Pandas for data handling
- - Plotly for visualizations

- ## 💻 Development

- Run linting and tests:

  ```bash
- ./scripts/lint.sh
  ```

- ## 📝 License
-
- MIT License - feel free to use and modify!
-
- ## 🤝 Contributing
-
- Contributions welcome! Please check our contribution guidelines.

  ---
- Made with 💖 using [Replit](https://replit.com)
+ # CodeCraftLab
+ A production-grade platform for fine-tuning, evaluating, and serving code generation models. Built on FastAPI + React with a hardened training pipeline, structured logging, and HuggingFace Hub integration.
+ ---
+ ## What It Does
+ | Capability | Detail |
+ | --- | --- |
+ | Dataset management | Upload, validate, and preprocess Python code datasets via REST API |
+ | Fine-tuning | Configure and run training jobs with Pydantic-validated configs |
+ | Evaluation | Automated eval hooks: pass@k, BLEU, execution accuracy |
+ | Model serving | Authenticated inference endpoints for trained models |
+ | HF Hub sync | Push/pull models and datasets to/from HuggingFace Hub |
+ ---
+ ## Quick Start
+ Requirements: Python 3.11+, Docker, CUDA-capable GPU (optional, CPU fallback available)
+ ```bash
+ git clone https://github.com/your-org/codecraftlab.git
+ cd codecraftlab

+ # Copy and configure environment
+ cp .env.example .env
+ # Edit .env: set HF_TOKEN, SECRET_KEY, DATABASE_URL

+ # Start with Docker Compose
+ docker compose up --build

+ # API available at http://localhost:8000
+ # Docs at http://localhost:8000/docs
+ ```
+ ### Without Docker
  ```bash
+ pip install uv
+ uv sync
+ uv run uvicorn app:app --reload --port 8000
  ```
+ ---
+ ## API Overview
+ All endpoints except `POST /auth/token` require a Bearer token; call that endpoint first to obtain one.
+ ```bash
+ # Authenticate
+ curl -X POST http://localhost:8000/auth/token \
+   -H "Content-Type: application/json" \
+   -d '{"username": "admin", "password": "your-password"}'
+
+ # Upload a dataset
+ curl -X POST http://localhost:8000/datasets/upload \
+   -H "Authorization: Bearer <token>" \
+   -F "file=@data/train.jsonl"
+
+ # Launch a training job
+ curl -X POST http://localhost:8000/training/jobs \
+   -H "Authorization: Bearer <token>" \
+   -H "Content-Type: application/json" \
+   -d @configs/example_job.json
+
+ # Check job status (replace <job_id> with the job's ID)
+ curl "http://localhost:8000/training/jobs/<job_id>" \
+   -H "Authorization: Bearer <token>"
+ ```
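The same calls can be scripted instead of typed into a shell. The following is a minimal sketch using only the Python standard library. It assumes the base URL from the Quick Start and the routes shown in the curl examples; the helper names (`build_token_request`, `build_job_status_request`, `send`) are hypothetical and not part of the project. The README does not document the shape of the JSON responses, so the sketch returns the decoded body as-is rather than guessing at field names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server from the Quick Start


def build_token_request(username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the POST /auth/token request."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/auth/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_job_status_request(job_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one training job."""
    return urllib.request.Request(
        f"{BASE_URL}/training/jobs/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON body. Requires a running server."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the header and URL logic testable without a live API; `send(build_token_request("admin", "your-password"))` would perform the actual call.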
77
+ ## Full interactive docs: `http://localhost:8000/docs`
78
+ ---
79
+ ## Training Configuration
80
+ Jobs are defined as JSON and validated against Pydantic v2 schemas:
81
+ ```json
82
+ {
83
+ "job_name": "codegen-finetune-v1",
84
+ "base_model": "Salesforce/codegen-350M-mono",
85
+ "dataset_id": "ds_abc123",
86
+ "training": {
87
+ "num_epochs": 3,
88
+ "batch_size": 8,
89
+ "learning_rate": 2e-5,
90
+ "warmup_ratio": 0.1,
91
+ "max_seq_length": 1024,
92
+ "gradient_accumulation_steps": 4
93
+ },
94
+ "evaluation": {
95
+ "enabled": true,
96
+ "strategy": "epoch",
97
+ "metrics": ["pass_at_1", "pass_at_10", "bleu"]
98
+ },
99
+ "hub": {
100
+ "push_to_hub": true,
101
+ "repo_id": "your-org/codegen-finetune-v1"
102
+ }
103
+ }
104
+ ```
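To make the interaction between `batch_size`, `gradient_accumulation_steps`, and `warmup_ratio` concrete, here is a small worked calculation on the `training` block above. It is a sketch under stated assumptions, not the project's scheduler: it assumes a single device, one optimizer step per `gradient_accumulation_steps` batches, a partial final batch that is kept, and a hypothetical dataset of 10,000 examples. The function name `schedule_summary` is illustrative only.

```python
import json
import math

# The "training" block copied from the example job config.
TRAINING = json.loads("""
{
  "num_epochs": 3,
  "batch_size": 8,
  "learning_rate": 2e-5,
  "warmup_ratio": 0.1,
  "max_seq_length": 1024,
  "gradient_accumulation_steps": 4
}
""")


def schedule_summary(training: dict, num_examples: int) -> dict:
    """Derive step counts for a dataset with num_examples rows (assumed rounding)."""
    # Examples consumed per optimizer step.
    effective_batch = training["batch_size"] * training["gradient_accumulation_steps"]
    # Assumption: the last partial batch still triggers an optimizer step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * training["num_epochs"]
    warmup_steps = math.ceil(total_steps * training["warmup_ratio"])
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


print(schedule_summary(TRAINING, num_examples=10_000))
```

With the example values, each optimizer step sees 8 × 4 = 32 sequences. The actual pipeline may round step counts differently.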
+ ---
+ ## Evaluation Metrics
+ | Metric | Description |
+ | --- | --- |
+ | `pass@k` | Fraction of problems solved by at least 1 of k samples |
+ | `BLEU` | N-gram overlap against reference completions |
+ | `execution_accuracy` | Fraction of generated code that runs without error |
+ | `exact_match` | Exact string match against reference outputs |
+
+ Eval results are logged to structured JSON and optionally pushed to HF Hub model cards.
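As an illustration of the `pass@k` definition, the sketch below uses the widely used unbiased estimator: draw n samples per problem, count the c that pass, and compute the probability that a random subset of k samples contains at least one pass. Whether `training/evaluators.py` uses this estimator is an assumption; the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # contains no passing sample; comb returns 0 when n - c < k.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the plain fraction of passing samples. Sampling more than k completions per problem (n > k) lowers the variance of the estimate.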
+ ---
+ ## Architecture
+ ```
+ codecraftlab/
+ ├── app.py                  # FastAPI entrypoint
+ ├── routers/
+ │   ├── auth.py             # JWT auth
+ │   ├── datasets.py         # Upload, validate, preprocess
+ │   ├── training.py         # Job management
+ │   └── inference.py        # Model serving
+ ├── training/
+ │   ├── config.py           # Pydantic v2 training configs
+ │   ├── pipeline.py         # Fine-tuning pipeline + eval hooks
+ │   └── evaluators.py       # Metric implementations
+ ├── models/                 # SQLAlchemy ORM models
+ ├── core/
+ │   ├── auth.py             # JWT utils
+ │   ├── logging.py          # structlog setup
+ │   └── settings.py         # Pydantic settings
+ ├── Dockerfile
+ ├── docker-compose.yml
+ └── pyproject.toml
+ ```
+ ---
+ ### HuggingFace Space Config: Audit Notes
+ The original Space was configured as `sdk: streamlit`. This repo now runs on FastAPI via Docker:
+ | Field | Before | After | Reason |
+ | --- | --- | --- | --- |
+ | `sdk` | `streamlit` | `docker` | FastAPI served via Uvicorn |
+ | `sdk_version` | `1.57.0` | (removed) | Not applicable for Docker SDK |
+ | `app_port` | (missing) | `8000` | Needed because the app listens on 8000 (Docker Spaces default to 7860) |
+ | `pinned` | `false` | `true` | Pins the production Space on the owner's profile |
+ | `short_description` | Generic | Specific | Better discoverability on HF Hub |
+ | `tags` | (missing) | Added | Enables HF search indexing |
+ ---
+ ## Development
+ ```bash
+ # Run tests
+ uv run pytest tests/ -v --cov=. --cov-report=term-missing

+ # Lint and type-check
+ uv run ruff check .
+ uv run mypy . --strict

+ # Format
+ uv run ruff format .
+ ```
+ Test a training run locally (CPU, minimal config):
+ ```bash
+ uv run python -m training.pipeline \
+   --config configs/smoke_test.json \
+   --dry-run
+ ```
+ ---
+ ### Environment Variables
+ | Variable | Required | Description |
+ | --- | --- | --- |
+ | `SECRET_KEY` | Yes | JWT signing secret (min 32 chars) |
+ | `HF_TOKEN` | Yes | HuggingFace token with write access |
+ | `DATABASE_URL` | Yes | PostgreSQL connection string |
+ | `LOG_LEVEL` | No | `DEBUG`/`INFO`/`WARNING` (default: `INFO`) |
+ | `MAX_CONCURRENT_JOBS` | No | Max parallel training jobs (default: `2`) |
+ | `MODEL_CACHE_DIR` | No | Local model cache path (default: `./cache`) |
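The rules in this table (three required variables, a 32-character minimum for `SECRET_KEY`, and the listed defaults) can be expressed as a small loader. The project keeps its settings in `core/settings.py` using Pydantic settings; the version below is a standard-library sketch of the same checks, and the `Settings` class and `load_settings` function are hypothetical names, not the project's code.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("SECRET_KEY", "HF_TOKEN", "DATABASE_URL")
LOG_LEVELS = {"DEBUG", "INFO", "WARNING"}


@dataclass(frozen=True)
class Settings:
    secret_key: str
    hf_token: str
    database_url: str
    log_level: str = "INFO"
    max_concurrent_jobs: int = 2
    model_cache_dir: str = "./cache"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate environment variables and apply the documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    if len(env["SECRET_KEY"]) < 32:
        raise ValueError("SECRET_KEY must be at least 32 characters")
    log_level = env.get("LOG_LEVEL", "INFO").upper()
    if log_level not in LOG_LEVELS:
        raise ValueError(f"LOG_LEVEL must be one of {sorted(LOG_LEVELS)}")
    return Settings(
        secret_key=env["SECRET_KEY"],
        hf_token=env["HF_TOKEN"],
        database_url=env["DATABASE_URL"],
        log_level=log_level,
        max_concurrent_jobs=int(env.get("MAX_CONCURRENT_JOBS", "2")),
        model_cache_dir=env.get("MODEL_CACHE_DIR", "./cache"),
    )
```

Accepting any mapping rather than reading `os.environ` directly makes the loader easy to exercise in tests with a plain dictionary.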
  ---
+ ## License
+ MIT, see LICENSE