
	---
	library_name: transformers
	tags:
	- code-generation
	- concept-embedding
	- jepa
	- pytorch
	- gguf
	license: apache-2.0
	---

	# Concept-First Code Generation

	Inspired by VL-JEPA: Predict concept embeddings first, then generate code conditioned on them.

	## The Idea

	Traditional autoregressive models predict tokens one at a time, which can cause them to lose coherence or hallucinate APIs. The Concept-First approach aims to mitigate this with three components:

	1. Concept Encoder: Encoding code snippets into semantic embeddings.
	2. Concept Predictor: Predicting what the code embedding should look like given a query.
	3. Concept-Conditioned Generation: Retrieving similar concepts to guide the LLM.

	```mermaid
	graph LR
	A[Query] --> B(Concept Predictor)
	B --> C{Concept Space}
	C --> D[Retrieve Similar Code]
	D --> E[Conditioned Generation]
	```
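
The "Retrieve Similar Code" step in the diagram can be illustrated with a minimal sketch, assuming retrieval is done by cosine similarity between the predicted concept embedding and the entries of the concept bank. The similarity metric, the function names (`cosine_similarity`, `retrieve_similar`), and the toy 3-dimensional bank are all hypothetical and are not taken from this repo.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_similar(predicted, concept_bank, k=2):
    """Return the k (score, snippet) pairs whose embeddings are closest to `predicted`."""
    scored = [(cosine_similarity(predicted, emb), snippet) for snippet, emb in concept_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical toy bank: (code snippet, embedding). Real embeddings would come
# from the concept encoder and have far more dimensions.
TOY_BANK = [
    ("def add(a, b): return a + b", [1.0, 0.0, 0.0]),
    ("def read_file(path): return open(path).read()", [0.0, 1.0, 0.0]),
    ("def total(xs): return sum(xs)", [0.9, 0.1, 0.0]),
]

# Stand-in for the output of the concept predictor for some query.
predicted_concept = [1.0, 0.05, 0.0]
top = retrieve_similar(predicted_concept, TOY_BANK, k=2)
for score, snippet in top:
    print(f"{score:.3f}  {snippet}")
```

With a large bank, the same idea would normally be expressed as a single matrix product over normalized embeddings rather than a Python loop.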

	## Models Used (January 2026)

	| Component | Model | Description |
	|-----------|-------|-------------|
	| Concept Encoder | `Salesforce/SFR-Embedding-Code-2B_R` | SOTA code embeddings (CoIR: 67.4) |
	| Text Encoder | `Alibaba-NLP/gte-Qwen2-1.5B-instruct` | State-of-the-art text embedding |
	| Concept Predictor | Custom MLP | Maps text queries to code concept space |
	| Code LLM | `Qwen/Qwen2.5-Coder-32B-Instruct` | High-performance code generation |
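
To make the "Custom MLP" row concrete, here is a rough sketch of what a forward pass from a text embedding to a point in the code concept space might look like. The actual architecture is not described in this README, so everything here is an assumption: the single hidden layer, the ReLU activation, the L2 normalization of the output, the toy dimensions, and the random weights. It uses only the standard library so the shapes are easy to follow.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer):
    """Affine map: one output value per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def l2_normalize(x):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else x

def predict_concept(text_embedding, hidden_layer, output_layer):
    """Map a text embedding to a (normalized) vector in the code concept space."""
    hidden = relu(dense(text_embedding, hidden_layer))
    return l2_normalize(dense(hidden, output_layer))

rng = random.Random(0)
TEXT_DIM, HIDDEN_DIM, CODE_DIM = 4, 8, 3  # toy sizes, not the real embedding widths
hidden_layer = make_layer(TEXT_DIM, HIDDEN_DIM, rng)
output_layer = make_layer(HIDDEN_DIM, CODE_DIM, rng)

concept = predict_concept([0.2, -0.1, 0.4, 0.3], hidden_layer, output_layer)
print(len(concept))
```

In practice the input would be a text encoder embedding and the trained weights would be loaded from `concept_predictor.pt` rather than sampled at random.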

	## Files in this Repo

	- `concept_predictor.pt`: PyTorch weights for the concept predictor MLP.
	- `concept_predictor.gguf`: GGUF format for edge deployment (llama.cpp/LM Studio).
	- `concept_bank.pt`: Pre-computed embeddings for the concept retrieval bank.

	## Usage

	```python
	# Load the concept predictor checkpoint onto the CPU
	import torch

	checkpoint = torch.load("concept_predictor.pt", map_location="cpu")
	# ... (See Colab notebook for full implementation)
	```
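
The final conditioning step is not shown in this README, so the following is only a sketch of one plausible approach: prepending the retrieved snippets to the query as reference context before sending the prompt to the code LLM. The function name `build_conditioned_prompt`, the prompt wording, and the `max_snippets` parameter are hypothetical; the Colab notebook remains the reference for the actual implementation.

```python
def build_conditioned_prompt(query, retrieved_snippets, max_snippets=3):
    """Prepend up to `max_snippets` retrieved code snippets to the query."""
    lines = ["You are a coding assistant. Use the reference snippets if they help.", ""]
    for index, snippet in enumerate(retrieved_snippets[:max_snippets], start=1):
        lines.append(f"### Reference {index}")
        lines.append(snippet.strip())
        lines.append("")
    lines.append("### Task")
    lines.append(query.strip())
    return "\n".join(lines)

# Hypothetical query and retrieved snippets.
prompt = build_conditioned_prompt(
    "Write a function that sums a list of integers.",
    ["def total(xs):\n    return sum(xs)", "def add(a, b):\n    return a + b"],
)
print(prompt)
```

The resulting string could then be passed to any chat or completion interface for the code LLM.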

	## Datasets

	Constructed from high-quality subsets of:
	- MBPP
	- Evol-Instruct-Code
	- Magicoder-OSS-Instruct

	## Credits

	Created by Core Subagent (Colab Composer) for Riley Seaburg.