BacLM 350M Causal

macwiatrak/baclm-350m-causal is a 350M-parameter causal/autoregressive language model for bacterial genomics. It is designed to model both protein sequences and intergenic DNA with a single shared character-level transformer.

BacLM is mixed-modality in the sense that a single set of weights is trained on both modalities: each input is either a protein sequence or an intergenic DNA sequence. The model processes one modality per input and does not fuse protein and DNA tokens within the same sequence.

Model Description

BacLM is a mixed-modality genomic language model trained on bacterial protein and intergenic DNA sequences using an autoregressive next-token prediction objective.

Key properties:

  • Model type: causal/autoregressive language model
  • Parameters: ~350M
  • Architecture: 32-layer transformer
  • Hidden size: 960
  • Attention heads: 16
  • Maximum context length: 2048 tokens
  • Tokenization: character-level
  • Modalities: proteins and DNA/intergenic sequences
  • Modality handling: shared model weights across protein and DNA inputs
  • Objective: next-token prediction
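
As a rough sanity check, the layer count and hidden size listed above are consistent with the stated ~350M parameters. The sketch below assumes a standard transformer block (four hidden-by-hidden attention projections plus a two-matrix feed-forward layer with 4x expansion) and ignores embeddings, biases, and normalization weights. None of these architectural details are confirmed by the model card, and the function name is illustrative.

```python
def approx_block_params(n_layers: int, hidden: int, ffn_mult: int = 4) -> int:
    """Rough weight count for a stack of standard transformer blocks.

    Assumption (not from the model card): each block has Q, K, V and output
    projections (4 * hidden^2) and a two-matrix feed-forward layer with
    ffn_mult expansion (2 * ffn_mult * hidden^2).
    """
    attention = 4 * hidden * hidden
    feed_forward = 2 * ffn_mult * hidden * hidden
    return n_layers * (attention + feed_forward)


n_layers, hidden, n_heads = 32, 960, 16  # values listed under Key properties
head_dim = hidden // n_heads             # per-head dimension
total = approx_block_params(n_layers, hidden)

print(head_dim)               # 60
print(f"{total / 1e6:.1f}M")  # 353.9M
```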

The tokenizer uses a shared vocabulary over protein and nucleotide characters and also produces token_type_ids, which let the model distinguish modalities internally. Protein and DNA examples can be batched together, but each example should correspond to a single sequence modality.

Input Format

BacLM is case-sensitive:

  • Protein sequences should be passed in uppercase
  • DNA/intergenic sequences should be passed in lowercase

Examples:

  • Protein: MKTAYIAKQRQISFVKSHFSRQ
  • DNA: atgcttagctagcttacg
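
Because casing signals the modality, it can help to normalize sequences before tokenization. The helper below is a minimal sketch: format_for_baclm and its modality argument are hypothetical names, not part of the BacLM API, and it only handles casing and whitespace (it does not validate the alphabet).

```python
def format_for_baclm(seq: str, modality: str) -> str:
    """Strip whitespace and apply the casing convention for the given modality.

    Hypothetical helper: modality must be "protein" (uppercase output)
    or "dna" (lowercase output).
    """
    cleaned = "".join(seq.split())  # drop spaces and line breaks, e.g. from FASTA bodies
    if modality == "protein":
        return cleaned.upper()
    if modality == "dna":
        return cleaned.lower()
    raise ValueError(f"unknown modality: {modality!r}")


print(format_for_baclm("mktayiak qrq", "protein"))  # MKTAYIAKQRQ
print(format_for_baclm("ATGCTT\nAGC", "dna"))       # atgcttagc
```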

Intended Uses

This model is intended for:

  • autoregressive sequence modelling of bacterial proteins and intergenic DNA
  • computing sequence likelihoods or perplexity
  • extracting causal contextual sequence embeddings
  • pretraining and transfer learning for bacterial genomics
  • downstream evaluation on bacterial sequence tasks
  • next-token prediction in bacterial protein or DNA sequences
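
For the likelihood and perplexity use case, per-sequence perplexity can be derived from the model's logits by shifting them one position against the input tokens. The function below is a sketch rather than the authors' evaluation code: sequence_perplexity is a hypothetical helper, and it assumes logits of shape (batch, length, vocab) together with the input_ids and attention_mask returned by the tokenizer. Any special tokens the tokenizer adds are scored like ordinary tokens.

```python
import torch
import torch.nn.functional as F


def sequence_perplexity(logits: torch.Tensor,
                        input_ids: torch.Tensor,
                        attention_mask: torch.Tensor) -> torch.Tensor:
    """Per-sequence perplexity from causal LM logits, ignoring padding."""
    # Logits at position t predict the token at position t + 1.
    shift_logits = logits[:, :-1, :].float()
    shift_labels = input_ids[:, 1:]
    # Score a position only if both the predicting token and the target are real tokens.
    shift_mask = (attention_mask[:, :-1] * attention_mask[:, 1:]).float()

    # cross_entropy expects the class dimension second: (batch, vocab, length - 1).
    token_nll = F.cross_entropy(
        shift_logits.transpose(1, 2), shift_labels, reduction="none"
    )
    mean_nll = (token_nll * shift_mask).sum(dim=1) / shift_mask.sum(dim=1).clamp_min(1)
    return torch.exp(mean_nll)
```

With the tensors from the Usage section, this would be called as sequence_perplexity(outputs.logits, batch["input_ids"], batch["attention_mask"]). Lower values mean the model finds the sequence more predictable.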

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "macwiatrak/baclm-350m-causal"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, dtype=torch.bfloat16)
model.eval().cuda()

seqs = [
    "MKTAYIAKQRQISFVKSHFSRQ",   # protein: uppercase
    "atgcttagctagcttacg",       # DNA: lowercase
]

batch = tokenizer.batch_encode_plus(
    seqs,
    padding=True,
    truncation=True,
    max_length=2048,
    return_tensors="pt",
)
batch = {k: v.cuda() for k, v in batch.items()}

with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        token_type_ids=batch.get("token_type_ids"),
        attention_mask=batch.get("attention_mask"),
        output_hidden_states=True,
    )

# Next-token prediction logits
logits = outputs.logits

# Token-level causal embeddings from the final hidden layer
token_embeddings = outputs.hidden_states[-1]

# Mean-pooled embeddings, averaged over non-padding tokens only
attention_mask = batch["attention_mask"].unsqueeze(-1)
mean_embeddings = (token_embeddings * attention_mask).sum(dim=1) / attention_mask.sum(dim=1).clamp_min(1)

print(logits.shape)           # (batch_size, seq_len, vocab_size)
print(mean_embeddings.shape)  # (batch_size, hidden_size)
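
The call above truncates inputs at max_length=2048, which discards the tail of longer sequences. One way to cover a long protein or intergenic region is to split it into overlapping windows before tokenization. This is a sketch under assumptions: window_sequence, window, and stride are hypothetical names and defaults, and the window is kept below 2048 characters to leave room for any special tokens the tokenizer may add.

```python
def window_sequence(seq: str, window: int = 2000, stride: int = 1000) -> list[str]:
    """Split seq into overlapping windows of at most `window` characters."""
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("require window > 0 and 0 < stride <= window")
    if len(seq) <= window:
        return [seq]
    windows = []
    start = 0
    while True:
        windows.append(seq[start:start + window])
        if start + window >= len(seq):
            break  # this window reached the end of the sequence
        start += stride
    return windows


chunks = window_sequence("acgt" * 1125)  # 4500 nucleotides
print([len(c) for c in chunks])          # [2000, 2000, 2000, 1500]
```

Each window can then be passed to the tokenizer as its own example. How to combine per-window scores or embeddings is a separate design choice that the model card does not prescribe.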

Training Data

BacLM was trained on large-scale bacterial sequence data comprising protein sequences derived from coding regions and intergenic DNA sequences.

Limitations

  • The model is intended for bacterial sequences, not general eukaryotic genomics.
  • It operates at the character level, so each prediction is over a single amino acid or nucleotide rather than a higher-level biological unit such as a codon, motif, or domain.
  • As a causal/autoregressive model, representations are directional: each token can only condition on previous tokens rather than the full bidirectional sequence context.
  • The model processes either a protein sequence or a DNA sequence as input; it does not jointly attend over fused protein-DNA genomic loci in a single sequence.
  • Protein and DNA inputs should follow the expected casing convention for reliable modality handling.
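
Because representations are directional, only the final non-padding token has seen the whole sequence, so last-token pooling is a common alternative to mean pooling for causal models. The sketch below assumes hidden states of shape (batch, length, hidden) and a 0/1 attention mask. The helper last_token_embeddings is hypothetical, and the model card does not state whether this pooling works better than mean pooling for BacLM.

```python
import torch


def last_token_embeddings(hidden_states: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the hidden state of the last non-padding token in each sequence."""
    seq_len = attention_mask.shape[1]
    positions = torch.arange(seq_len, device=attention_mask.device)
    # Index of the last position where the mask is 1; works for left or right padding.
    last_index = (attention_mask.long() * positions).argmax(dim=1)
    last_index = last_index.to(hidden_states.device)
    batch_index = torch.arange(hidden_states.shape[0], device=hidden_states.device)
    return hidden_states[batch_index, last_index]
```

With the tensors from the Usage section, this would be called as last_token_embeddings(outputs.hidden_states[-1], batch["attention_mask"]).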

Citation

TBD