License: mit
Language: en, code
Tags: code-generation, coding-style, pytorch, transformer, custom-architecture
Datasets: bigcode/starcoderdata

Idiolect (85M) 🧠

Idiolect is an 85-million-parameter causal language model trained from scratch (custom GPT-2 architecture) specifically for Python code generation and coding style adaptation.
Unlike standard wrapper models, this project includes a custom BPE tokenizer (trained on Python AST features) and a from-scratch PyTorch implementation featuring Rotary Position Embeddings (RoPE), pre-layer normalization, and native LoRA adapter support for efficient personal style fine-tuning.
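To make the RoPE idea concrete, here is a minimal, framework-free sketch in plain Python. It is not Idiolect's implementation: the function names, the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for illustration. It shows the property that motivates RoPE: after rotation, the query-key dot product depends only on the relative offset between positions.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to one even-length vector.

    Hypothetical helper: rotates each consecutive pair (i, i + 1)
    by an angle that grows with `position` and shrinks with `i`.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even vector dimension")
    out = []
    for i in range(0, dim, 2):
        # Pair i // 2 uses frequency base ** (-i / dim).
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

Because each pair is only rotated, vector norms are unchanged, and no learned position table is needed.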
Model Details

Model Type: Causal Language Model (Transformer Decoder)
Architecture: Custom GPT-2 style with RoPE, pre-norm, and tied embeddings
Parameters: 85M total (~80M trainable non-embedding)
Context Length: 1024 tokens
Vocabulary Size: 32,000 (Custom Code BPE)
Training Data: 50GB Python subset of bigcode/starcoderdata
Language: Python 3.x

Uses

Direct Inference (Pre-trained)

The base model can complete Python snippets and generate basic functions. However, its primary purpose is to act as a foundation for LoRA fine-tuning.
Personal Style Adaptation (LoRA)

Idiolect is designed to be fine-tuned on a single developer's GitHub repositories. Using Low-Rank Adaptation (LoRA) with rank 8, the model can be adapted to write code in your style (docstring formatting, variable naming conventions, list comprehensions vs. loops, etc.) while training only ~2% of the parameters.
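The mechanics of a LoRA adapter can be sketched without any framework. The class below is a hypothetical illustration in plain Python, not the adapter code shipped with the model: the names `LoRALinear` and `matvec`, the `alpha` scaling value, and the toy 64x64 layer are all assumptions. It shows the frozen weight plus a trainable low-rank update, and why the number of trainable values stays small.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class LoRALinear:
    """Toy linear layer: y = W x + (alpha / rank) * B (A x)."""

    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, never updated
        # A starts small and random, B starts at zero, so the adapter
        # initially leaves the base model's output unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A (rank x d_in) and B (d_out x rank) would be trained.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


rng = random.Random(1)
weight = [[rng.uniform(-1.0, 1.0) for _ in range(64)] for _ in range(64)]
layer = LoRALinear(weight, rank=8)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]

frozen = len(weight) * len(weight[0])
print(layer.trainable_params(), "trainable vs", frozen, "frozen")
print(layer.forward(x) == matvec(weight, x))
```

The trainable fraction for the full model depends on which layers receive adapters and on their shapes, which this toy layer does not attempt to reproduce.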
Code Example

    import torch

    from codeforge.model import CodeForgeConfig, CodeForgeModel
    from codeforge.data.tokenizer import load_tokenizer

    # 1. Load Custom Tokenizer
    tokenizer = load_tokenizer("artifacts/tokenizer")

    # 2. Load Model
    config = CodeForgeConfig(vocab_size=tokenizer.get_vocab_size())
    model = CodeForgeModel(config)
    checkpoint = torch.load("model.pt", map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    model.eval()

    # 3. Generate
    prompt = "def calculate_fibonacci(n):"
    input_ids = torch.tensor([tokenizer.encode(prompt).ids])
    output = model.generate(input_ids, max_new_tokens=100)
    print(tokenizer.decode(output[0].tolist()))

Training Setup

Hardware: 1x NVIDIA A100-SXM4-40GB
Optimizer: AdamW (LR=3e-4, cosine decay)
Batch Size: 128 (16 × 8 gradient accumulation steps)
Precision: Mixed precision (AMP FP16)
Time: ~40 hours for 50,000 steps

Evaluation / Fingerprinting

CodeForge includes a proprietary Style Fingerprint Engine that analyzes the AST (abstract syntax tree) and neural embeddings of code to match structural patterns rather than just text overlap.
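The Style Fingerprint Engine itself is not shown in this card. As a rough illustration of what comparing structure rather than text can mean, the sketch below uses only Python's standard `ast` module to count a few style signals and compare two snippets. The feature set, the function names, and the cosine comparison are assumptions for illustration, and the neural-embedding side is left out entirely.

```python
import ast
import math
from collections import Counter

COMPREHENSIONS = (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)


def style_features(source):
    """Count a few structural style signals in Python source code."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.For, ast.While)):
            counts["loops"] += 1
        elif isinstance(node, COMPREHENSIONS):
            counts["comprehensions"] += 1
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            counts["functions"] += 1
            if ast.get_docstring(node) is not None:
                counts["documented_functions"] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two feature Counters."""
    keys = set(a) | set(b)
    dot = sum(a[key] * b[key] for key in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


sample_a = '''
def squares(values):
    """Return the square of each value."""
    return [v * v for v in values]
'''

sample_b = '''
def squares(values):
    result = []
    for v in values:
        result.append(v * v)
    return result
'''

features_a = style_features(sample_a)
features_b = style_features(sample_b)
print(dict(features_a))
print(dict(features_b))
print(cosine_similarity(features_a, features_b))
```

The two samples compute the same result, yet their feature counts differ (docstring and comprehension versus explicit loop), which is the kind of signal a text-overlap metric would miss.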
License

MIT License