Ashutosh committed 1d1c5d4 · 1 Parent(s): cdd7bbb

add readme

Files changed (1): README.md (+97 −0)
# Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning

[![arXiv](https://img.shields.io/badge/arXiv-2504.07198-b31b1b.svg)](https://arxiv.org/abs/2504.07198)
[![Model Weights](https://img.shields.io/badge/Download%20Weights-USC%20GDrive-green)](https://drive.google.com/file/d/1TAZE70WlqY1rQJIzdJ9x7P7IopyYSlfk/view?usp=sharing)
[![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://www.python.org/)

These are the officially released weights of the **WACV 2026 Round 1** early-accept paper (6.4% acceptance rate), *Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning*. Please refer to the [official GitHub repository](https://github.com/ihp-lab/face-llava) for instructions to run inference.

---

## 🧾 Abstract

The human face plays a central role in social communication, necessitating the use of performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Additionally, Face-LLaVA is able to generate natural language descriptions that can be used for reasoning. Leveraging existing visual databases, we first developed FaceInstruct-1M, a face-centered database for instruction tuning MLLMs for face processing. We then developed a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention that integrates face geometry with local visual features. We evaluated the proposed method across nine different datasets and five different face processing tasks, including facial expression recognition, action unit detection, facial attribute detection, age estimation and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance compared to commercial solutions. Our model output also receives a higher reasoning rating by GPT under a zero-shot setting across all the tasks. Both our dataset and model will be released at this https URL to support future advancements in social AI and foundational vision-language research.

---

## 📦 Repository Structure

```bash
├── cache_dir/    # created automatically; LanguageBind image and video models are downloaded here from Hugging Face
├── checkpoints/  # create this folder yourself; model weights go here
├── facellava/    # main source code
├── scripts/      # training scripts for Face-LLaVA
```

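The `checkpoints/` folder above is not created by the repo itself; a minimal shell sketch of the expected setup (the `facellava-7b-wolm` subfolder comes from unzipping the downloaded weights described under Inference):

```shell
# Create the checkpoints folder at the repo root; the downloaded
# weights are later unzipped into it as checkpoints/facellava-7b-wolm
mkdir -p checkpoints
ls -d checkpoints
```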

---

## 🔧 Installation

1. **Clone the repository**
```bash
git clone https://github.com/ac-alpha/face-llava.git
cd face-llava
```

2. **Create a virtual environment** (recommended)
```bash
conda create -n facellava python=3.10 -y
conda activate facellava
```

3. **Install torch**
```bash
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

<details>
<summary>Potential issues</summary>

- The command above targets CUDA 12.1, but we have also tested on a machine with CUDA 12.2. You may need a different `--index-url` for your machine's CUDA version.
- Depending on your CUDA version, you may also have to upgrade or downgrade torch itself.

</details>

4. **Install in editable mode for development**
```bash
pip install -e .
pip install -e ".[train]"  # only if you want to train your own model
```

5. **Install other libraries**
```bash
pip install flash-attn --no-build-isolation  # recommended but not required
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
```

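After installation, a quick sanity check is to confirm that the core dependencies resolve. This is a minimal sketch of our own, not a script shipped with the repo; the module names correspond to the packages installed above (`cv2` is the import name for `opencv-python`):

```python
import importlib.util

def missing_packages(required=("torch", "torchvision", "decord", "cv2")):
    """Return the subset of `required` module names that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All core dependencies found.")
```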

---

## 🎯 Inference

1. Download the model weights from [here (use your USC email)](https://drive.google.com/file/d/1TAZE70WlqY1rQJIzdJ9x7P7IopyYSlfk/view?usp=sharing) and unzip them inside a `checkpoints/` folder so that the structure becomes `./checkpoints/facellava-7b-wolm`.

2. ***Make sure that the input video or image is already face-cropped; the current version does not support automatic cropping.***

3. Run the following command for inference:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py --model_path="./checkpoints/facellava-7b-wolm" \
    --file_path="./assets/demo_inputs/face_attr_example_1.png" --prompt="What are the facial attributes in the given image?"
```

4. The following face perception tasks are currently supported, each listed with the modality best suited to it: Emotion (video), Age (image), Facial Attributes (image), Facial Action Units (image).

5. A list of prompts that work well for different tasks is available in `./assets/good_prompts`.

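Since inference expects pre-cropped faces (step 2 above), here is a hypothetical sketch of the cropping arithmetic: take a face bounding box from any detector (e.g. an OpenCV Haar cascade), expand it by a margin, and clamp it to the image bounds. The function name and margin value are our own illustration, not part of this repo:

```python
def expand_face_box(box, img_w, img_h, margin=0.2):
    """Expand an (x, y, w, h) face box by `margin` on every side,
    clamped to the image bounds. Returns crop corners (x0, y0, x1, y1)."""
    x, y, w, h = box
    dx, dy = int(w * margin), int(h * margin)
    x0 = max(0, x - dx)
    y0 = max(0, y - dy)
    x1 = min(img_w, x + w + dx)
    y1 = min(img_h, y + h + dy)
    return x0, y0, x1, y1

# Example: a 100x100 detection at (50, 50) in a 400x400 image
print(expand_face_box((50, 50, 100, 100), 400, 400))  # (30, 30, 170, 170)
```

The resulting crop (e.g. `img[y0:y1, x0:x1]` with OpenCV arrays) can then be saved and passed to `inference.py` via `--file_path`.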
### ✅ Repository Progress

- [ ] Training Script
- [ ] Evaluation Metrics
- [ ] Dataset Release & Preprocessing Code
- [ ] Inference Code (with Landmarks & Auto Face Cropping)
- [x] Inference Code (Basic)
- [x] Model Weights (w/o Landmarks)