BacLM 350M Causal
macwiatrak/baclm-350m-causal is a 350M-parameter causal/autoregressive language model for bacterial genomics. It is designed to model both protein sequences and intergenic DNA with a single shared character-level transformer.
BacLM is mixed-modality in the sense that a single set of weights is trained on both modalities. Each input is either a protein sequence or an intergenic DNA sequence; the model processes one modality at a time and does not fuse protein and DNA tokens within the same input sequence.
Model Description
BacLM is a mixed-modality genomic language model trained on bacterial protein and intergenic DNA sequences using an autoregressive next-token prediction objective.
Key properties:
- Model type: causal/autoregressive language model
- Parameters: ~350M
- Architecture: 32-layer transformer
- Hidden size: 960
- Attention heads: 16
- Maximum context length: 2048 tokens
- Tokenization: character-level
- Modalities: proteins and DNA/intergenic sequences
- Modality handling: shared model weights across protein and DNA inputs
- Objective: next-token prediction
The tokenizer uses a shared vocabulary over protein and nucleotide characters and also produces token_type_ids, which let the model distinguish modalities internally. Protein and DNA examples can be batched together, but each example should correspond to a single sequence modality.
Input Format
BacLM is case-sensitive:
- Protein sequences should be passed in uppercase
- DNA/intergenic sequences should be passed in lowercase
Examples:
- Protein: MKTAYIAKQRQISFVKSHFSRQ
- DNA: atgcttagctagcttacg
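Because modality is signalled by casing, it can help to check inputs before tokenizing them. The sketch below is a minimal illustration that looks at casing only; `infer_modality` is a hypothetical helper name, not part of BacLM or its tokenizer, and it does not check whether each character is actually in the model's vocabulary.

```python
def infer_modality(seq: str) -> str:
    """Classify a sequence by casing alone (hypothetical helper, not part of BacLM)."""
    if not seq or not seq.isalpha():
        raise ValueError(f"expected a non-empty alphabetic sequence, got {seq!r}")
    if seq.isupper():
        return "protein"  # uppercase input is treated as protein
    if seq.islower():
        return "dna"  # lowercase input is treated as DNA/intergenic
    raise ValueError(f"mixed-case sequence is ambiguous: {seq!r}")


print(infer_modality("MKTAYIAKQRQISFVKSHFSRQ"))  # protein
print(infer_modality("atgcttagctagcttacg"))  # dna
```

Rejecting mixed-case input, rather than silently upper- or lower-casing it, is a design choice made here for safety: a DNA sequence converted to uppercase would otherwise be read as a protein.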
Intended Uses
This model is intended for:
- autoregressive sequence modelling of bacterial proteins and intergenic DNA
- computing sequence likelihoods or perplexity
- extracting causal contextual sequence embeddings
- pretraining and transfer learning for bacterial genomics
- downstream evaluation on bacterial sequence tasks
- next-token prediction in bacterial protein or DNA sequences
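As an illustration of the likelihood and perplexity use case, the following is a minimal sketch of per-sequence perplexity computed from causal LM logits. `sequence_perplexity` is a hypothetical helper, not part of the model's API. It assumes right-padded batches and treats every non-padding token after the first as a prediction target; whether the tokenizer adds special tokens that should be excluded is not stated in this card.

```python
import torch
import torch.nn.functional as F


def sequence_perplexity(logits, input_ids, attention_mask):
    """Per-sequence perplexity from causal LM logits (hypothetical helper)."""
    # Position t predicts token t+1, so drop the last logit and the first label.
    shift_logits = logits[:, :-1, :].float()
    shift_labels = input_ids[:, 1:]
    shift_mask = attention_mask[:, 1:].to(shift_logits.dtype)

    # Per-token negative log-likelihood, shape (batch, seq_len - 1).
    nll = F.cross_entropy(
        shift_logits.transpose(1, 2),  # cross_entropy expects (batch, vocab, seq)
        shift_labels,
        reduction="none",
    )
    # Average over real (non-padding) target tokens only, then exponentiate.
    mean_nll = (nll * shift_mask).sum(dim=1) / shift_mask.sum(dim=1).clamp_min(1)
    return torch.exp(mean_nll)
```

The function takes the `logits`, `input_ids`, and 2-D `attention_mask` tensors produced by a forward pass like the one in the Usage section and returns one perplexity value per sequence. Lower values mean the model finds the sequence more predictable.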
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "macwiatrak/baclm-350m-causal"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, dtype=torch.bfloat16)
model.eval().cuda()

seqs = [
    "MKTAYIAKQRQISFVKSHFSRQ",  # protein: uppercase
    "atgcttagctagcttacg",  # DNA: lowercase
]

batch = tokenizer.batch_encode_plus(
    seqs,
    padding=True,
    truncation=True,
    max_length=2048,
    return_tensors="pt",
)
batch = {k: v.cuda() for k, v in batch.items()}

with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        token_type_ids=batch.get("token_type_ids"),
        attention_mask=batch.get("attention_mask"),
        output_hidden_states=True,
    )

# Next-token prediction logits
logits = outputs.logits

# Token-level causal embeddings from the final hidden layer
token_embeddings = outputs.hidden_states[-1]

# Mean pooled embeddings (padding positions are masked out)
attention_mask = batch["attention_mask"].unsqueeze(-1)
mean_embeddings = (token_embeddings * attention_mask).sum(dim=1) / attention_mask.sum(dim=1).clamp_min(1)

print(logits.shape)
print(mean_embeddings.shape)
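In a causal model, only the final non-padding position has attended to the whole sequence, so last-token pooling is a common alternative to mean pooling. The sketch below is an illustration under that assumption, not a method recommended by the model authors; `last_token_embeddings` is a hypothetical helper. If the tokenizer appends an end-of-sequence token (not stated in this card), that token is the one selected.

```python
import torch


def last_token_embeddings(token_embeddings, attention_mask):
    """Select the hidden state of the last non-padding token in each sequence.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    seq_len = attention_mask.shape[1]
    positions = torch.arange(seq_len, device=attention_mask.device)
    # Padding positions score 0; the largest score marks the last real token.
    # Works for both left- and right-padded batches.
    last_idx = (attention_mask.long() * (positions + 1)).argmax(dim=1)
    batch_idx = torch.arange(token_embeddings.shape[0], device=token_embeddings.device)
    return token_embeddings[batch_idx, last_idx]
```

Pass the 2-D mask, `batch["attention_mask"]`, rather than the unsqueezed `attention_mask` used for mean pooling above.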
Training Data
BacLM was trained on large-scale bacterial sequence data comprising protein sequences derived from coding regions and intergenic DNA sequences. Specifically:
- https://huggingface.co/datasets/AllTheBacteria/BacCorpus-intergenic-dna-90
- https://huggingface.co/datasets/AllTheBacteria/BacCorpus-prot-90
Limitations
- The model is intended for bacterial sequences, not general eukaryotic genomics.
- It operates at the character level, so each prediction is a single amino acid or nucleotide rather than a higher-level biological unit.
- As a causal/autoregressive model, representations are directional: each token can only condition on previous tokens rather than the full bidirectional sequence context.
- The model processes either a protein sequence or a DNA sequence as input; it does not jointly attend over fused protein-DNA genomic loci in a single sequence.
- Protein and DNA inputs should follow the expected casing convention for reliable modality handling.
Citation
TBD