🚀 Nebulixlabs/Nanocoder-Base

Nanocoder-Base is a custom-built, ultra-lightweight autoregressive language model trained from scratch. With approximately 19.2 million parameters, it is experimental by design and small enough to run on severely resource-constrained hardware, including edge devices and a single standard GPU.

It was built specifically to model basic English sentence structure and the foundational syntax of programming languages such as Python and JavaScript.

📊 Model Details

  • Developer: Nebulixlabs
  • Model Type: Custom Autoregressive Decoder-Only Transformer
  • Parameter Count: 19,231,488 (~19.2M)
  • Language(s): English, Python, JavaScript
  • License: MIT

Architecture Specifications

| Component | Specification |
| --- | --- |
| Layers (Transformer Blocks) | 8 |
| Hidden Dimension (d_model) | 256 |
| Attention Heads | 8 (32 dimensions per head) |
| Context Window (MAX_SEQ_LEN) | 256 tokens |
| Vocabulary Size | 50,257 (Standard GPT-2 Tokenizer) |
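The parameter count follows directly from these specifications. The arithmetic below is a sketch that assumes bias-free linear layers, a 4x feed-forward expansion, learned positional embeddings, and a tied output head, which is how the loading script later in this card defines the model.

```python
# Parameter count derived from the architecture specifications.
# Assumptions: no biases in linear layers, 4x MLP expansion, learned
# positional embeddings, and lm_head sharing its matrix with token_emb.
VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, NUM_LAYERS = 50257, 256, 256, 8

token_emb = VOCAB_SIZE * D_MODEL                    # counted once (tied with lm_head)
pos_emb = MAX_SEQ_LEN * D_MODEL
attn = D_MODEL * 3 * D_MODEL + D_MODEL * D_MODEL    # fused QKV + output projection
mlp = 2 * D_MODEL * 4 * D_MODEL                     # up- and down-projection
layer_norms = 2 * 2 * D_MODEL                       # two LayerNorms, weight + bias each
per_block = attn + mlp + layer_norms
final_ln = 2 * D_MODEL

total = token_emb + pos_emb + NUM_LAYERS * per_block + final_ln
print(f"{total:,}")  # 19,231,488
```

Roughly two thirds of the budget (12,865,792 parameters) sits in the token embedding matrix, which is why sharing it with the output layer matters so much at this scale.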

⚙️ How It Works (Under the Hood)

Nanocoder is not a standard Hugging Face transformers class; it is a raw, custom PyTorch implementation optimized for speed and memory efficiency.

  1. Flash Attention Integration: Instead of computing multi-head attention step by step, Nanocoder calls PyTorch 2.0's native F.scaled_dot_product_attention, which can dispatch to FlashAttention or memory-efficient kernels when the hardware and data type support them. Those kernels avoid materializing the full attention matrix, which reduces VRAM usage and speeds up both training and inference.
  2. Weight Tying: The embedding layer (token_emb) and the final output layer (lm_head) share one weight matrix. This saves 50,257 × 256 ≈ 12.9 million parameters, a large share of the budget for a model this small, and lets the input and output token representations be learned jointly.
  3. Pre-Layer Normalization: To keep gradients stable during training, LayerNorm is applied before the self-attention and feed-forward sublayers rather than after them.
  4. Compute-Optimal Scaling: The model was trained on ~292.5 million tokens, roughly 15 tokens per parameter, a ratio intended to make full use of the small parameter budget without overfitting.
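To make the first point concrete, here is a small check, assuming PyTorch 2.0 or newer, that the fused call produces the same result as explicit causal attention. The tensor shapes are illustrative (8 heads of 32 dimensions, matching the specifications above); none of this is taken from the Nanocoder training code.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, D = 2, 8, 16, 32  # batch, heads, sequence length, head dimension
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Reference: explicit causal attention that materializes the T x T score matrix.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(D)
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))  # hide future positions
manual = torch.softmax(scores, dim=-1) @ v

# Fused call: PyTorch selects a backend (FlashAttention, memory-efficient, or math).
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

max_diff = (manual - fused).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The two paths should agree up to floating-point rounding. Which backend is actually selected depends on the device, data type, and PyTorch build, so the memory savings are not guaranteed on every setup.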

🎯 Capabilities & Limitations

What Nanocoder is good at:

  • Syntax Recognition: It understands the basic visual structure of code (e.g., Python indentation, function definitions def ... :, and basic loops).
  • Pattern Completion: Generating short sequences of text or continuing a simple coding prompt.
  • Educational Prototyping: It is an excellent foundational model for students and researchers who want to learn how LLMs work, how to write custom PyTorch architectures, and how to execute fine-tuning pipelines locally without massive GPU clusters.

What Nanocoder is NOT good at:

  • Because it has only about 19.2M parameters (compared with billions in Llama or GPT models), it hits a strict "Capacity Wall."
  • It cannot execute complex mathematical logic, remember long conversational contexts, or write production-ready software.
  • It will hallucinate if asked complex reasoning questions.
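Pattern completion and the 256-token limit can both be seen in a simple decoding loop. The sketch below is a minimal greedy decoder, assuming a model whose forward pass returns (logits, loss) as in the loading script later in this card. `generate_greedy` and `NextTokenStub` are hypothetical names introduced here for illustration; the stub only stands in for a real model so the loop can run on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_SEQ_LEN = 256  # context window from the architecture specifications


@torch.no_grad()
def generate_greedy(model, idx, max_new_tokens, max_seq_len=MAX_SEQ_LEN):
    """Append max_new_tokens greedily chosen tokens to idx (shape [B, T])."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -max_seq_len:]       # tokens beyond the window are dropped
        logits, _ = model(idx_cond)            # logits: [B, T, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)
    return idx


class NextTokenStub(nn.Module):
    """Hypothetical stand-in for a trained model (not Nanocoder itself).
    It always predicts (previous token + 1) mod vocab_size."""

    def __init__(self, vocab_size=10):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, idx, targets=None):
        logits = F.one_hot((idx + 1) % self.vocab_size, self.vocab_size).float()
        return logits, None


out = generate_greedy(NextTokenStub(), torch.tensor([[3]]), max_new_tokens=4)
print(out.tolist())  # [[3, 4, 5, 6, 7]]
```

Because the context is cropped to the last 256 tokens on every step, anything earlier in a long prompt is invisible to the model, which is one reason it cannot track long conversations.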

📚 Recommended Fine-Tuning Data

To make Nanocoder highly effective for your specific use case, you must fine-tune it on high-quality, narrowly focused datasets. Do not feed it broad knowledge; feed it specific formats.

  • For a Chatbot: Use datasets like OpenAssistant/oasst_top1_2023-08-25. This will teach the model the <|im_start|>user and <|im_start|>assistant conversational tags.
  • For a Coding Assistant: Use sahil2801/CodeAlpaca-20k. This teaches the model to read an Instruction: and generate the corresponding Output: code.
  • Format is Everything: Ensure your fine-tuning data strictly follows a uniform template. Small models learn formats much faster than they learn raw facts.
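As an illustration of a uniform template, the sketch below renders every sample through one function. The "Instruction:" and "Output:" labels follow the description above; the exact spacing, the trailing end-of-text marker, and the `format_example` helper are assumptions made for this example, not the format used to train Nanocoder.

```python
# Hypothetical template for instruction-style fine-tuning data.
# Assumption: each sample ends with GPT-2's end-of-text marker so the model
# can learn where an answer stops.
EOS = "<|endoftext|>"


def format_example(instruction: str, output: str) -> str:
    """Render one training sample in a single fixed layout."""
    return f"Instruction: {instruction.strip()}\nOutput: {output.strip()}{EOS}"


# Illustrative samples; real ones would come from your dataset.
samples = [
    {"instruction": "Write a function that adds two numbers. ",
     "output": "def add(a, b):\n    return a + b\n"},
    {"instruction": "Print the numbers 0 to 2.",
     "output": "for i in range(3):\n    print(i)"},
]

texts = [format_example(s["instruction"], s["output"]) for s in samples]
print(texts[0])
```

Routing every sample through one function removes stray whitespace and label variations, so the model sees a single consistent pattern.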

💻 Demo: How to Load and Fine-Tune Nanocoder

Because Nanocoder uses a custom architecture, you cannot load it using AutoModelForCausalLM.from_pretrained(). You must define the architecture in your script and load the state dictionary.

Here is a complete, self-contained PyTorch script to load the model and start a fine-tuning loop:

import torch
import torch.nn as nn
import torch.nn.functional as F

# ==========================================
# 1. DEFINE THE EXACT ARCHITECTURE
# ==========================================
VOCAB_SIZE = 50257 
MAX_SEQ_LEN = 256  
EMBED_DIM = 256    
NUM_LAYERS = 8
NUM_HEADS = 8

class SelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.c_attn = nn.Linear(EMBED_DIM, 3 * EMBED_DIM, bias=False)
        self.c_proj = nn.Linear(EMBED_DIM, EMBED_DIM, bias=False)
        self.n_head = NUM_HEADS
        self.head_dim = EMBED_DIM // NUM_HEADS
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(EMBED_DIM, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True, dropout_p=0.1 if self.training else 0)
        return self.dropout(self.c_proj(y.transpose(1, 2).contiguous().view(B, T, C)))

class TransformerBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln_1 = nn.LayerNorm(EMBED_DIM)
        self.attn = SelfAttention()
        self.ln_2 = nn.LayerNorm(EMBED_DIM)
        self.mlp = nn.Sequential(
            nn.Linear(EMBED_DIM, 4 * EMBED_DIM, bias=False),
            nn.GELU(),
            nn.Linear(4 * EMBED_DIM, EMBED_DIM, bias=False),
            nn.Dropout(0.1),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

class NanoCoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos_emb = nn.Embedding(MAX_SEQ_LEN, EMBED_DIM)
        self.blocks = nn.ModuleList([TransformerBlock() for _ in range(NUM_LAYERS)])
        self.ln_f = nn.LayerNorm(EMBED_DIM)
        self.lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE, bias=False)
        self.token_emb.weight = self.lm_head.weight # Weight Tying

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

# ==========================================
# 2. LOAD WEIGHTS SAFELY
# ==========================================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = NanoCoder().to(device)

# Replace "nanocoder_base.pth" with your downloaded model path
state_dict = torch.load("nanocoder_base.pth", map_location=device, weights_only=True)

# Strip a leading DataParallel 'module.' prefix from each key, if present
clean_state_dict = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in state_dict.items()
}
model.load_state_dict(clean_state_dict)

print("✅ Nebulixlabs/Nanocoder loaded successfully!")

# ==========================================
# 3. QUICK FINE-TUNING LOOP EXAMPLE
# ==========================================
# Setup Optimizer (Use a lower learning rate for fine-tuning)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

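# Dummy batch (replace with batches from your tokenized DataLoader).
# For next-token prediction, the targets are the inputs shifted left by one token.
# Shape of both tensors: [Batch Size, Sequence Length]
dummy_tokens = torch.randint(0, VOCAB_SIZE, (4, MAX_SEQ_LEN + 1), device=device)
dummy_input = dummy_tokens[:, :-1].contiguous()
dummy_target = dummy_tokens[:, 1:].contiguous()  # contiguous so .view(-1) in the loss works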

model.train()
optimizer.zero_grad()

# Forward pass
logits, loss = model(dummy_input, targets=dummy_target)

# Backward pass
loss.backward()
optimizer.step()

print(f"📉 Sample Training Step Complete. Loss: {loss.item():.4f}")
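The dummy tensors in the script stand in for real data. As a minimal sketch of how a tokenized corpus could be turned into training batches, the helper below cuts a flat stream of token ids into fixed-length windows and pairs each input with the same sequence shifted by one position. `make_lm_batches` is a hypothetical helper written for this example, and the toy integer stream replaces real GPT-2 token ids.

```python
import torch


def make_lm_batches(token_ids, seq_len, batch_size):
    """Yield (inputs, targets) pairs; targets are inputs shifted left by one."""
    ids = torch.as_tensor(token_ids, dtype=torch.long)
    window = seq_len + 1                    # one extra token for the shifted target
    n_windows = ids.numel() // window       # leftover tokens at the end are dropped
    chunks = ids[: n_windows * window].view(n_windows, window)
    for start in range(0, n_windows, batch_size):
        batch = chunks[start:start + batch_size]
        # .contiguous() so that a later .view(-1) on the targets is valid
        yield batch[:, :-1].contiguous(), batch[:, 1:].contiguous()


# Toy stream of 20 "token ids"; in practice these come from the GPT-2 tokenizer.
stream = list(range(20))
batches = list(make_lm_batches(stream, seq_len=4, batch_size=2))
x, y = batches[0]
print(x.tolist())  # [[0, 1, 2, 3], [5, 6, 7, 8]]
print(y.tolist())  # [[1, 2, 3, 4], [6, 7, 8, 9]]
```

For Nanocoder, seq_len would be at most 256 to match the context window. Each (x, y) pair can then be passed to the model as model(x, targets=y).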