Video-Text-to-Text
Transformers
Safetensors
qwen3_vl
image-text-to-text
llama-factory
full
Generated from Trainer
video-language-model
video-captioning
Instructions to use chancharikm/CHAI_SFT_model_8b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use chancharikm/CHAI_SFT_model_8b with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("chancharikm/CHAI_SFT_model_8b") model = AutoModelForImageTextToText.from_pretrained("chancharikm/CHAI_SFT_model_8b") - Notebooks
- Google Colab
- Kaggle
| base_model: Qwen/Qwen3-VL-8B-Instruct | |
| library_name: transformers | |
| license: apache-2.0 | |
| pipeline_tag: video-text-to-text | |
| tags: | |
| - llama-factory | |
| - full | |
| - generated_from_trainer | |
| - video-language-model | |
| - video-captioning | |
| model-index: | |
| - name: all_sft_formats_unbalanced_20251122_ep3_lr3e5_qwen3-vl-8b | |
| results: [] | |
| # all_sft_formats_unbalanced_20251122_ep3_lr3e5_qwen3-vl-8b | |
| This model is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the all_sft_formats_unbalanced_20251122_part_1 dataset. | |
| It was developed as part of the research paper: **"Building a Precise Video Language with Human-AI Oversight"** (CVPR 2026 Highlight). | |
| - **Paper:** [Building a Precise Video Language with Human-AI Oversight](https://huggingface.co/papers/2604.21718) | |
| - **Project Page:** [https://linzhiqiu.github.io/papers/chai/](https://linzhiqiu.github.io/papers/chai/) | |
| - **GitHub Repository:** [https://github.com/chancharikmitra/CHAI](https://github.com/chancharikmitra/CHAI) | |
| ## Model description | |
| This model belongs to a family of video-language models (VLMs) optimized for precise video captioning using the **CHAI (Critique-based Human–AI Oversight)** framework. CHAI pairs trained human experts with model-generated pre-captions: experts provide correctional critiques that guide revisions into improved post-captions. | |
| ## Intended uses & limitations | |
| This model is intended for research in: | |
| - Precise video captioning and cinematography-aware description. | |
| - Multimodal reward modeling and binary alignment scoring. | |
| - Critique generation for video-language tasks. | |
| ## Training and evaluation data | |
| More information needed | |
| ## Training procedure | |
| ### Training hyperparameters | |
| The following hyperparameters were used during training: | |
| - learning_rate: 3e-05 | |
| - train_batch_size: 10 | |
| - eval_batch_size: 8 | |
| - seed: 42 | |
| - distributed_type: multi-GPU | |
| - num_devices: 64 | |
| - gradient_accumulation_steps: 2 | |
| - total_train_batch_size: 1280 | |
| - total_eval_batch_size: 512 | |
| - optimizer: Use adamw_torch_fused with betas=(0.9,0.999) and epsilon=1e-08 | |
| - lr_scheduler_type: cosine | |
| - lr_scheduler_warmup_ratio: 0.05 | |
| - num_epochs: 3.0 | |
| ### Framework versions | |
| - Transformers 4.57.1 | |
| - Pytorch 2.9.1+cu128 | |
| - Datasets 4.0.0 | |
| - Tokenizers 0.22.1 | |
| ## Citation | |
| If you find this work useful, please cite: | |
| ```bibtex | |
| @inproceedings{chai2026, | |
| title = {Building a Precise Video Language with Human--AI Oversight}, | |
| author = {Zhiqiu Lin and Chancharik Mitra and Siyuan Cen and Isaac Li and Yuhan Huang and Yu Tong Tiffany Ling and Hewei Wang and Irene Pi and Shihang Zhu and Ryan Rao and George Liu and Jiaxi Li and Ruojin Li and Yili Han and Yilun Du and Deva Ramanan}, | |
| booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, | |
| year = {2026} | |
| } | |
| ``` | |