Libraries

How to use macwiatrak/baclm-350m-causal with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="macwiatrak/baclm-350m-causal", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("macwiatrak/baclm-350m-causal", trust_remote_code=True, dtype="auto")


vLLM

How to use macwiatrak/baclm-350m-causal with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "macwiatrak/baclm-350m-causal"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "macwiatrak/baclm-350m-causal",
		"prompt": "MKTAYIAKQRQISFVKSHFSRQ",
		"max_tokens": 512,
		"temperature": 0.5
	}'
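
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the server started above is running locally and actually serves this model; the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM or any client library.

```python
import json
import urllib.request

MODEL_NAME = "macwiatrak/baclm-350m-causal"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=64, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated continuation text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible completion responses carry the text in choices[0].text
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Requires a running server; the prompt is an uppercase protein prefix.
    print(complete("MKTAYIAKQRQISFVKSHFSRQ"))
```

For the SGLang server the same sketch should apply with `base_url="http://localhost:30000"`, since both expose the same endpoint path.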


SGLang

How to use macwiatrak/baclm-350m-causal with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "macwiatrak/baclm-350m-causal" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "macwiatrak/baclm-350m-causal",
		"prompt": "MKTAYIAKQRQISFVKSHFSRQ",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "macwiatrak/baclm-350m-causal" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "macwiatrak/baclm-350m-causal",
		"prompt": "MKTAYIAKQRQISFVKSHFSRQ",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use macwiatrak/baclm-350m-causal with Docker Model Runner:
```
docker model run hf.co/macwiatrak/baclm-350m-causal
```

BacLM 350M Causal

macwiatrak/baclm-350m-causal is a 350M-parameter causal/autoregressive language model for bacterial genomics. It is designed to model both protein sequences and intergenic DNA with a single shared character-level transformer.

BacLM is mixed-modality in the sense that a single set of weights is trained on both modalities. Each input is either a protein sequence or an intergenic DNA sequence; the model processes one modality per input and does not fuse protein and DNA tokens within the same sequence.

Model Description

BacLM is a mixed-modality genomic language model trained on bacterial protein and intergenic DNA sequences using an autoregressive next-token prediction objective.

Key properties:

Model type: causal/autoregressive language model
Parameters: ~350M
Architecture: 32-layer transformer
Hidden size: 960
Attention heads: 16
Maximum context length: 2048 tokens
Tokenization: character-level
Modalities: proteins and DNA/intergenic sequences
Modality handling: shared model weights across protein and DNA inputs
Objective: next-token prediction
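
As a rough sanity check on the ~350M figure, the listed depth and width can be plugged into the common 12·L·d² approximation for a transformer's non-embedding parameters. This is a back-of-the-envelope sketch: it assumes a standard block with a 4x feed-forward expansion, which the card does not state, and it ignores biases, normalization layers, and the small character-level embedding table.

```python
# Values taken from the Key properties list above.
num_layers = 32
hidden_size = 960
num_heads = 16

# Assumption (not stated in the card): each layer has about 4*d^2 attention
# weights (Q, K, V, output projection) and 8*d^2 feed-forward weights
# (4x expansion), i.e. roughly 12*d^2 parameters per layer.
head_dim = hidden_size // num_heads
params_per_layer = 12 * hidden_size ** 2
approx_params = num_layers * params_per_layer

print(f"head_dim = {head_dim}")                            # head_dim = 60
print(f"approx. parameters = {approx_params / 1e6:.1f}M")  # approx. parameters = 353.9M
```

Under these assumptions the estimate lands close to the stated ~350M; the true count depends on architectural details not listed here.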

The tokenizer uses a shared vocabulary over protein and nucleotide characters and also produces token_type_ids, which let the model distinguish modalities internally. Protein and DNA examples can be batched together, but each example should correspond to a single sequence modality.
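
Because tokenization is character-level, a sequence longer than the 2048-token context has to be truncated or split before it reaches the model. The sketch below shows one way to split a long sequence into windows that fit. The function `split_into_windows` and its `reserved_tokens` parameter are hypothetical: the card does not say how many special tokens the tokenizer adds, so the default of 2 is only a placeholder to verify against the real tokenizer.

```python
from typing import List

MAX_CONTEXT = 2048  # maximum context length from the Key properties list


def split_into_windows(seq: str, reserved_tokens: int = 2, overlap: int = 0) -> List[str]:
    """Split one sequence into consecutive windows that fit the context length.

    reserved_tokens: assumed allowance for special tokens (hypothetical value).
    overlap: number of characters shared between neighbouring windows.
    """
    window = MAX_CONTEXT - reserved_tokens
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("invalid reserved_tokens or overlap")
    step = window - overlap

    windows = []
    start = 0
    while start < len(seq):
        windows.append(seq[start:start + window])
        if start + window >= len(seq):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


if __name__ == "__main__":
    long_dna = "acgt" * 1250  # 5000 nucleotides
    print([len(w) for w in split_into_windows(long_dna)])  # [2046, 2046, 908]
```

Each window keeps a single modality, so the batching rule above still holds when windows from protein and DNA sequences are mixed in one batch.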

Input Format

BacLM is case-sensitive:

Protein sequences should be passed in uppercase
DNA/intergenic sequences should be passed in lowercase

Examples:

Protein: MKTAYIAKQRQISFVKSHFSRQ
DNA: atgcttagctagcttacg
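
A small pre-processing helper can enforce this casing convention before tokenization. This is a sketch rather than part of the model's API: `prepare_sequence` is a hypothetical name, and the two alphabets are assumptions (20 standard amino acids, four unambiguous bases). The tokenizer's actual vocabulary may accept additional characters.

```python
# Assumed alphabets; adjust them to the tokenizer's real vocabulary.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")
DNA_ALPHABET = set("acgt")


def prepare_sequence(seq: str, modality: str) -> str:
    """Return seq in the expected case: uppercase protein, lowercase DNA."""
    seq = seq.strip()
    if modality == "protein":
        seq, alphabet = seq.upper(), PROTEIN_ALPHABET
    elif modality == "dna":
        seq, alphabet = seq.lower(), DNA_ALPHABET
    else:
        raise ValueError(f"unknown modality: {modality!r}")

    # Reject characters outside the assumed alphabet instead of passing
    # them silently to the tokenizer.
    unexpected = sorted(set(seq) - alphabet)
    if unexpected:
        raise ValueError(f"unexpected characters for {modality}: {unexpected}")
    return seq


if __name__ == "__main__":
    print(prepare_sequence("mktayiakqrqisfvkshfsrq", "protein"))  # MKTAYIAKQRQISFVKSHFSRQ
    print(prepare_sequence("ATGCTTAGCTAGCTTACG", "dna"))          # atgcttagctagcttacg
```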

Intended Uses

This model is intended for:

autoregressive sequence modelling of bacterial proteins and intergenic DNA
computing sequence likelihoods or perplexity
extracting causal contextual sequence embeddings
pretraining and transfer learning for bacterial genomics
downstream evaluation on bacterial sequence tasks
next-token prediction in bacterial protein or DNA sequences
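
For the likelihood and perplexity use case, the key detail is the one-position shift: the logits at position t score the token at position t+1. The sketch below spells this out in plain Python on nested lists so that it runs without a GPU or model download. With real outputs one would more likely apply `torch.nn.functional.cross_entropy` to the shifted `outputs.logits`. The function names are illustrative, and whether special tokens should count towards the average is left as an assumption to check against the tokenizer.

```python
import math
from typing import List, Optional, Sequence


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def sequence_perplexity(logits: Sequence[Sequence[float]],
                        input_ids: Sequence[int],
                        attention_mask: Optional[Sequence[int]] = None) -> float:
    """Perplexity of one sequence; logits has shape [T][V], input_ids has length T."""
    if attention_mask is None:
        attention_mask = [1] * len(input_ids)

    nll, count = 0.0, 0
    for t in range(len(input_ids) - 1):
        # Skip pairs where either the context token or the target is padding.
        if not (attention_mask[t] and attention_mask[t + 1]):
            continue
        # logits[t] is the prediction for the token at position t + 1.
        nll -= log_softmax(logits[t])[input_ids[t + 1]]
        count += 1

    if count == 0:
        raise ValueError("need at least two adjacent unmasked tokens")
    return math.exp(nll / count)


if __name__ == "__main__":
    # Uniform logits over a 4-symbol vocabulary give a perplexity of about 4.
    uniform = [[0.0] * 4 for _ in range(5)]
    print(sequence_perplexity(uniform, [0, 1, 2, 3, 0]))
```

Lower perplexity means the model assigns higher probability to the observed sequence; comparisons are most meaningful between sequences of the same modality.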

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "macwiatrak/baclm-350m-causal"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, dtype=torch.bfloat16)
model.eval().cuda()

seqs = [
    "MKTAYIAKQRQISFVKSHFSRQ",   # protein: uppercase
    "atgcttagctagcttacg",       # DNA: lowercase
]

batch = tokenizer(
    seqs,
    padding=True,
    truncation=True,
    max_length=2048,
    return_tensors="pt",
)
batch = {k: v.cuda() for k, v in batch.items()}

with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        token_type_ids=batch.get("token_type_ids"),
        attention_mask=batch.get("attention_mask"),
        output_hidden_states=True,
    )

# Next-token prediction logits
logits = outputs.logits

# Token-level causal embeddings from the final hidden layer
token_embeddings = outputs.hidden_states[-1]

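# Mean-pooled sequence embeddings: padding positions are masked out and the
# sum is accumulated in float32 to limit bfloat16 rounding error
attention_mask = batch["attention_mask"].unsqueeze(-1).float()
mean_embeddings = (token_embeddings.float() * attention_mask).sum(dim=1) / attention_mask.sum(dim=1).clamp_min(1)

print(logits.shape)           # (batch_size, sequence_length, vocab_size)
print(mean_embeddings.shape)  # (batch_size, hidden_size)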

Training Data

BacLM was trained on large-scale bacterial sequence data comprising protein sequences derived from coding regions and intergenic DNA sequences. Specifically:

Limitations

The model is intended for bacterial sequences, not general eukaryotic genomics.
It operates at the character level, so each prediction is over a single amino acid or nucleotide rather than a higher-level biological unit such as a codon, domain, or gene.
As a causal/autoregressive model, representations are directional: each token can only condition on previous tokens rather than the full bidirectional sequence context.
The model processes either a protein sequence or a DNA sequence as input; it does not jointly attend over fused protein-DNA genomic loci in a single sequence.
Protein and DNA inputs should follow the expected casing convention for reliable modality handling.
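
One practical consequence of the directional representations is that only the final non-padding position has seen the whole sequence. Last-token pooling is therefore a common alternative to mean pooling for causal models. The sketch below shows how to pick that position from an attention mask, using plain nested lists for illustration. The card does not recommend a pooling strategy, the function names are hypothetical, and if the tokenizer appends an end-of-sequence token the selected position will be that token.

```python
from typing import List, Sequence


def last_token_indices(attention_mask: Sequence[Sequence[int]]) -> List[int]:
    """Index of the last non-padding position in each row of a 0/1 mask."""
    indices = []
    for row in attention_mask:
        real_positions = [i for i, m in enumerate(row) if m]
        if not real_positions:
            raise ValueError("row has no unmasked tokens")
        indices.append(real_positions[-1])
    return indices


def last_token_pool(hidden, attention_mask):
    """Select one hidden vector per sequence; hidden has shape [B][T][H]."""
    return [seq[i] for seq, i in zip(hidden, last_token_indices(attention_mask))]


if __name__ == "__main__":
    mask = [[1, 1, 1, 0], [1, 1, 1, 1]]
    print(last_token_indices(mask))  # [2, 3]
```

Scanning for the last unmasked position rather than summing the mask keeps the helper correct for both left- and right-padded batches.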

Citation

TBD
