---
license: cc-by-4.0
language:
- en
library_name: transformers
tags:
- video-classification
- self-supervised-learning
- v-jepa
- laboratory-procedures
- domain-adaptation
pipeline_tag: video-classification
base_model:
- facebook/vjepa2-vitl-fpc64-384
---
# Tacit
Tacit is a domain-adapted V-JEPA-2.1 video encoder for laboratory procedure understanding. It is the trained companion to the **LabProc** benchmark, released as part of our NeurIPS 2026 Datasets and Benchmarks Track submission.
- **Benchmark dataset**: [`Labproc/labproc`](https://huggingface.co/datasets/Labproc/labproc)
- **Code & evaluation harness**: [`tacit-anon/labproc`](https://github.com/tacit-anon/labproc)
- **License**: CC BY 4.0
## Model Details
- **Architecture**: Vision transformer (ViT-L) with 24 layers, 16 attention heads, hidden size 1024, MLP ratio 4. Patch size 16, image size 384×384. Uses rotary position embeddings (RoPE) with interpolation, supporting variable frame counts at inference (we use 16 frames per clip at evaluation time).
- **Total parameters**: ~300M (full encoder); 37.8M trainable during adaptation (12.4%, last 3 of 24 transformer blocks).
- **Output**: 1024-dimensional clip-level features after mean-pooling across spatiotemporal patch tokens.
- **Base model**: V-JEPA-2.1 ViT-L distilled from ViT-G at 384×384, released by Meta FAIR.
- **Adaptation**: EMA target encoder (τ=0.996) + motion-conditioned masking (ratio 0.75).
- **Released checkpoint**: Epoch 4 of a 7-epoch adaptation run; training loss 0.70.
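As a quick sanity check on the numbers above, the token count behind the mean-pooled output can be computed back-of-envelope. The 2-frame tubelet is an assumption carried over from the V-JEPA family; verify it against the base model config before relying on it.

```python
# Token count for a 16-frame, 384x384 clip with patch size 16.
# Assumes the usual V-JEPA 2-frame tubelet (check the base config).
frames, tubelet = 16, 2
image_size, patch_size = 384, 16

spatial_tokens = (image_size // patch_size) ** 2  # 24 * 24 = 576
temporal_tokens = frames // tubelet               # 8
total_tokens = spatial_tokens * temporal_tokens   # 4608 patch tokens

# Mean-pooling these 1024-d tokens yields the (B, 1024) clip embedding.
print(total_tokens)
```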
## Intended Use
**Primary intended uses**: Tacit is intended as a calibration target for laboratory video understanding research: it produces frozen visual features for laboratory procedure clips, to be used as input to downstream linear or shallow probes (a minimal probing sketch follows below). It is also intended for comparison against future video encoders, larger adaptation runs, parameter-matched open-weight VLMs, and v2 PCR/Western-blot benchmark instantiations.
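To make the probing workflow concrete, here is a minimal sketch that fits a linear probe on frozen Tacit features. The random `feats`/`y` tensors are stand-ins for real clip embeddings and labels (feature extraction is shown under "How to Use" below); the 10-way head mirrors a PSC-10-style state probe.

```python
import torch
import torch.nn as nn

# Stand-ins for real data: (N, 1024) frozen Tacit embeddings and
# integer state labels in [0, 10) for a PSC-10-style probe.
feats = torch.randn(512, 1024)
y = torch.randint(0, 10, (512,))

probe = nn.Linear(1024, 10)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Full-batch training of the probe; the encoder itself stays frozen.
probe.train()
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(feats), y)
    loss.backward()
    opt.step()
```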
**Out-of-scope uses**:
- Production deployment in laboratory safety, quality assurance, or regulatory compliance settings without substantial additional validation. The Strict Hard accuracy of 66.7% is a research-grade signal, not a deployment-grade reliability target.
- Use on non-laboratory video content. Adaptation on laboratory video specifically reshapes the representation toward this domain.
- Use as a foundation for behavioral or biometric inference from laboratory operator footage.
- Same-State CCR evaluation (within-state temporal ordering). The released checkpoint's adaptation pipeline attenuates within-state temporal coherence.
## Training Details
- **Training data**: Laboratory procedure videos collected via the three-stage filtering pipeline described in the LabProc paper, spanning organic purification, polymerase chain reaction (PCR), and Western blot procedures. The v1 LabProc benchmark evaluates only the organic purification subset; the adaptation set spans all three branches.
- **Optimizer**: AdamW
- **Learning rate**: 5×10⁻⁶, cosine schedule
- **Weight decay**: 0.01
- **Batch size**: 4
- **Frames per clip (training)**: 64
- **Frames per clip (inference)**: 16
- **Mask ratio**: 0.75
- **EMA momentum**: 0.996
- **Epochs**: 7 (released: epoch 4)
- **Precision**: FP16 mixed-precision
- **Trainable parameters**: 37.8M (last 3 of 24 transformer blocks)
- **Compute**: 28 minutes wall-clock on a single H100 80GB; ~$1.30 in rented compute.
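For orientation, here is a minimal sketch of the adaptation setup implied by the hyperparameters above, using the `build_encoder` helper shown under "How to Use" below. It is not the repo's training script: `encoder.blocks`, `motion_masked_loss`, and `loader` are placeholders, and the real V-JEPA objective involves a predictor over masked tokens that is elided here.

```python
import copy
import torch
from labproc_tacit.encoder import build_encoder

encoder = build_encoder(model_name="vit_large", patch_size=16, image_size=384)

# Freeze everything except the last 3 of 24 transformer blocks
# (assumes the encoder exposes its blocks as `encoder.blocks`).
for p in encoder.parameters():
    p.requires_grad = False
for block in encoder.blocks[-3:]:
    for p in block.parameters():
        p.requires_grad = True

target = copy.deepcopy(encoder)  # EMA target encoder, never updated by SGD
for p in target.parameters():
    p.requires_grad = False

opt = torch.optim.AdamW(
    (p for p in encoder.parameters() if p.requires_grad),
    lr=5e-6, weight_decay=0.01,
)

tau = 0.996
for clips in loader:  # batches of (B=4, T=64, 3, 384, 384) clips
    loss = motion_masked_loss(encoder, target, clips, mask_ratio=0.75)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():  # EMA update of the target encoder
        for p_t, p_s in zip(target.parameters(), encoder.parameters()):
            p_t.mul_(tau).add_(p_s, alpha=1.0 - tau)
```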
## Evaluation
Headline LabProc v1 benchmark results (accuracy, %; full table in the paper):
| Task | Random | Base V-JEPA-2.1 | **Tacit (ep4)** | Claude Opus |
|---|---|---|---|---|
| PSC-10 (10-class state) | 10.0 | 16.2 | **31.2** | 72.2 |
| TED visual+text (4-MCQ) | 25.0 | 75.3 | **76.1** | 82.4 |
| CCR pairwise | 50.0 | 43.9 | **58.7** | 67.0 |
| VSD aggregate | 50.0 | 50.2 | **57.8** | 73.9 |
| TED-V Hard | 50.0 | 60.9 | **69.6** | 67.4 |
| **TED-V Strict Hard** | 50.0 | 60.6 | **66.7** | 57.6 |
Tacit leads Claude Opus on the two motion-discrimination subsets (TED-V Hard and TED-V Strict Hard), where vision-language models are structurally insufficient.
## How to Use
Install the evaluation harness and use the encoder:
```bash
git clone https://github.com/tacit-anon/labproc
cd labproc
pip install -e .
```
```python
import torch
from labproc_tacit.encoder import build_encoder, load_checkpoint

# Load the Tacit checkpoint (epoch 4) into a ViT-L encoder
encoder = build_encoder(model_name="vit_large", patch_size=16, image_size=384)
load_checkpoint(encoder, "tacit_ep4.pt")  # downloaded from this HF repo
encoder.eval().cuda()

# Encode a clip of shape (B, T=16, C=3, H=384, W=384)
with torch.no_grad():
    features = encoder(clip)       # (B, N, 1024) patch-token features
    pooled = features.mean(dim=1)  # (B, 1024) clip-level embedding
```
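The evaluation harness handles video decoding internally, but if you are wiring Tacit into your own pipeline, a sketch like the following produces a correctly shaped `clip` tensor. The uniform 16-frame sampling and ImageNet normalization constants are assumptions; check the repo's transforms before relying on them.

```python
import torch
import torchvision.io as io
import torchvision.transforms.functional as F

def load_clip(path, num_frames=16, size=384):
    """Read a video file and return a (1, T, 3, H, W) float tensor."""
    video, _, _ = io.read_video(path, pts_unit="sec", output_format="TCHW")
    # Uniformly sample num_frames frames across the whole video
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    frames = video[idx].float() / 255.0
    frames = F.resize(frames, [size, size], antialias=True)
    # ImageNet normalization (assumption; verify against the repo)
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    frames = (frames - mean) / std
    return frames.unsqueeze(0)  # (1, 16, 3, 384, 384)

clip = load_clip("example_procedure.mp4").cuda()
```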
See [the GitHub repo](https://github.com/tacit-anon/labproc) for full evaluation scripts and benchmark reproduction.
## Limitations
- **Domain bias**: the adaptation data is heavily skewed toward English-language YouTube laboratory content, so learned features likely encode systematic biases that may not transfer to industrial laboratories, non-English workflows, or atypical equipment.
- **Operator bias**: although adaptation downweights operator-specific signal via motion masking, operators are visible in nearly every frame, so residual operator-specific features may remain.
- **Adaptation-induced trade-off**: Tacit's adaptation attenuates within-state temporal coherence by ~0.14 in τ relative to the V-JEPA-2.1 base. Users requiring both within-state ordering and cross-state recognition will need a different adaptation strategy.
- **Single-annotator ground truth** for PSC, CCR, and VSD-aggregate evaluation labels.
- **Modest adaptation scale** relative to general video pretraining (1M+ hours for V-JEPA-2.1).
See Section 8 ("Limitations") of the paper for the complete discussion.
## Citation
```bibtex
@inproceedings{labproc2026,
  title     = {LabProc and Tacit: A Benchmark and Domain-Adapted
               Video Encoder for Laboratory Procedure Understanding},
  author    = {Anonymous},
  booktitle = {NeurIPS 2026 Track on Datasets and Benchmarks},
  year      = {2026}
}
```