License: mit
Language: en, code
Tags: code-generation, coding-style, pytorch, transformer, custom-architecture
Datasets: bigcode/starcoderdata

Idiolect (85M) 🧠

Idiolect is an 85-million-parameter causal language model trained from scratch, using a custom GPT-2-style architecture, specifically for Python code generation and coding-style adaptation.

Unlike models that wrap an existing pretrained checkpoint, this project includes a custom BPE tokenizer (trained on Python AST features), a from-scratch PyTorch implementation with Rotary Position Embeddings (RoPE) and pre-layer normalization, and native LoRA adapter support for efficient fine-tuning on a personal coding style.
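To make the RoPE idea concrete, here is a minimal pure-Python sketch of a rotary position embedding. It is an illustration only, not the code shipped in this repository: the function name `rope_rotate`, the interleaved pairing of dimensions, and the base of 10000 are assumptions, and the actual implementation may pair dimensions differently.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, dim, 2):
        # The pair starting at index i uses frequency base^(-i/dim),
        # so earlier pairs rotate faster than later ones.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (hypothetical values).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

The two scores agree up to floating-point error because each rotation preserves vector length and the query-key dot product depends only on the difference between the two positions, which is the property that makes RoPE a relative position encoding.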

Model Details

Model Type: Causal Language Model (Transformer Decoder)
Architecture: Custom GPT-2 style with RoPE, pre-norm, and tied embeddings
Parameters: 85M total (~80M trainable non-embedding)
Context Length: 1024 tokens
Vocabulary Size: 32,000 (Custom Code BPE)
Training Data: 50 GB Python subset of bigcode/starcoderdata
Language: Python 3.x

Uses

Direct Inference (Pre-trained)

The base model can complete Python snippets and generate basic functions. However, its primary purpose is to act as a foundation for LoRA fine-tuning.

Personal Style Adaptation (LoRA)

Idiolect is designed to be fine-tuned on a single developer's GitHub repositories. Using Low-Rank Adaptation (LoRA) with rank 8, the model can be adapted to write code in your own style (docstring formatting, variable naming conventions, list comprehensions versus loops, and so on) by training only ~2% of the parameters.
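The mechanism behind LoRA can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the adapter code in this repository: the class name `LoRALinear`, the `alpha` scaling value, the initialization, and the layer widths are all hypothetical, and which layers Idiolect actually adapts is not stated here.

```python
import random


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, shape (d_out, d_in)
        # A starts small and random, B starts at zero, so the low-rank
        # update is exactly zero before any fine-tuning step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


def lora_trainable_fraction(d_in, d_out, rank):
    """Share of one adapted layer's parameters that belong to the adapter."""
    adapter = rank * (d_in + d_out)
    return adapter / (adapter + d_in * d_out)


rng = random.Random(1)
d_model = 16  # placeholder width, far smaller than a real model
weight = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_model)]
layer = LoRALinear(weight, rank=8)
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

# Before training, the adapted layer reproduces the frozen layer.
print(layer.forward(x) == matvec(weight, x))
# Adapter share for one square projection of an assumed width of 768.
print(lora_trainable_fraction(768, 768, rank=8))
```

Only `A` and `B` would receive gradients during fine-tuning, which is why the trainable share stays small. The share depends on the layer width and on how many layers carry adapters, so the number printed for the assumed width is illustrative and is not a derivation of the ~2% figure above.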

Code Example

    import torch

    from codeforge.model import CodeForgeConfig, CodeForgeModel
    from codeforge.data.tokenizer import load_tokenizer

    # 1. Load Custom Tokenizer
    tokenizer = load_tokenizer("artifacts/tokenizer")

    # 2. Load Model
    config = CodeForgeConfig(vocab_size=tokenizer.get_vocab_size())
    model = CodeForgeModel(config)
    checkpoint = torch.load("model.pt", map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    model.eval()

    # 3. Generate
    prompt = "def calculate_fibonacci(n):"
    input_ids = torch.tensor([tokenizer.encode(prompt).ids])
    output = model.generate(input_ids, max_new_tokens=100)
    print(tokenizer.decode(output[0].tolist()))

Training Setup

Hardware: 1x NVIDIA A100-SXM4-40GB
Optimizer: AdamW (LR=3e-4, Cosine Decay)
Batch Size: 128 (16 per micro-batch x 8 gradient accumulation steps)
Precision: Mixed Precision (AMP FP16)
Time: ~40 hours for 50,000 steps

Evaluation / Fingerprinting

CodeForge includes a proprietary Style Fingerprint Engine that analyzes the abstract syntax tree (AST) and neural embeddings of code to match structural patterns rather than just text overlap.
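As a rough illustration of matching structure rather than text, the sketch below compares code by counting AST node types with Python's standard `ast` module. It is not the Style Fingerprint Engine itself and covers only the AST side, not neural embeddings; the names `structural_fingerprint` and `similarity` and the scoring rule are hypothetical.

```python
import ast
from collections import Counter


def structural_fingerprint(source):
    """Count AST node types, ignoring identifiers, literals, and formatting."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def similarity(fp_a, fp_b):
    """Overlap of two node-type counters, from 0.0 (disjoint) to 1.0 (identical)."""
    shared = sum((fp_a & fp_b).values())  # per-type minimum counts
    total = sum((fp_a | fp_b).values())   # per-type maximum counts
    return shared / total if total else 1.0


# Two loops with different names but the same structure.
loop_a = (
    "def f(xs):\n"
    "    out = []\n"
    "    for x in xs:\n"
    "        out.append(x * 2)\n"
    "    return out\n"
)
loop_b = (
    "def double_all(values):\n"
    "    result = []\n"
    "    for v in values:\n"
    "        result.append(v * 2)\n"
    "    return result\n"
)
# The same behavior written as a list comprehension.
comp = "def f(xs):\n    return [x * 2 for x in xs]\n"

fp_loop_a = structural_fingerprint(loop_a)
fp_loop_b = structural_fingerprint(loop_b)
fp_comp = structural_fingerprint(comp)

print(similarity(fp_loop_a, fp_loop_b))
print(similarity(fp_loop_a, fp_comp))
```

The two loops share almost no identifiers yet produce identical fingerprints, while the comprehension version scores lower against them even though it reuses the names `f`, `xs`, and `x`. A text-overlap metric would rank these pairs the other way around.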

License

MIT License

