Add model card for TD3Net
#2
by nielsr (HF Staff)

Files changed (1): README.md (+170 -0)
---
license: unknown
library_name: pytorch
pipeline_tag: automatic-speech-recognition
---

# TD3Net: Temporal Densely Connected Multi-Dilated Convolutional Network for Word-Level Lipreading

This repository contains the official implementation of our paper **TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading**.

## Paper
[**TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading**](https://huggingface.co/papers/2506.16073)

## Abstract

The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems.
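
The blind-spot argument above can be made concrete with a short back-of-envelope calculation (an illustrative sketch, not code from this repository): a dilated 1-D convolution with kernel size `k` and dilation `d` sees input offsets `{0, d, ..., (k-1)d}`, and stacking layers composes those sets.

```python
def conv_offsets(kernel_size, dilation):
    """Input offsets seen by one dilated 1-D convolution."""
    return {i * dilation for i in range(kernel_size)}

def stack_offsets(dilations, kernel_size=3):
    """Offsets reachable through a stack of dilated convs (Minkowski sum)."""
    reach = {0}
    for d in dilations:
        reach = {r + o for r in reach for o in conv_offsets(kernel_size, d)}
    return reach

# A dense skip connection that feeds an early feature map straight into a
# dilation-4 layer samples only every 4th frame -- these are the blind spots:
print(sorted(stack_offsets([4])))        # [0, 4, 8]

# Matching each skip-connected feature with a dilation suited to its depth
# (the multi-dilated idea) keeps every composed path gap-free:
dense = set().union(*(stack_offsets(ds) for ds in ([1], [1, 2], [1, 2, 4])))
print(sorted(dense) == list(range(15)))  # True: 15 consecutive offsets, no gaps
```

The kernel size and dilation factors here are only examples; TD3Net's actual block configuration is defined in the repository's config files.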

## Code
The official code for TD3Net can be found at: [https://github.com/lbh-kor/TD3Net-weights](https://github.com/lbh-kor/TD3Net-weights)

## Main Results

### LRW Test Dataset Performance
The experiments were conducted in the following environment: Ubuntu 20.04, Python 3.8.13, PyTorch 1.8.0, CUDA 11.1, and an NVIDIA RTX 3090 GPU.

Params and FLOPs are measured for the TD3Net backend only, as this work focuses on backend efficiency. FLOPs were calculated using [fvcore](https://github.com/facebookresearch/fvcore).
To check the parameter count and FLOPs of any model configuration, you can run `test_model.sh` (which executes `lipreading/model.py`).

| Method                          | # Params (M) | FLOPs (G) | Inference time (s) | Accuracy (%) |
|---------------------------------|--------------|-----------|--------------------|--------------|
| TD3Net-Base                     | 18.69        | 1.56      | 45                 | [89.36](https://huggingface.co/lbh-kor/TD3Net-weights/blob/main/td3net_base/ckpt.best.pth.tar) |
| TD3Net-Best                     | 31.39        | 1.92      | 49                 | [89.54](https://huggingface.co/lbh-kor/TD3Net-weights/blob/main/td3net_best/ckpt.best.pth.tar) |
| TD3Net-Best (w/ word boundary)  | 31.39        | 1.92      | 49                 | [91.41](https://huggingface.co/lbh-kor/TD3Net-weights/blob/main/wb_td3net_best/ckpt.best.pth.tar) |

> Click an accuracy value to download the corresponding model weights.
> For inference with these pretrained weights, please refer to the [Inference Only](#inference-only) section below.

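For reading the FLOPs column, a rough mental model (with hypothetical layer shapes, not TD3Net's real configuration) is that fvcore counts one fused multiply-add as a single flop, so a stride-1 1-D convolution contributes about C_in · C_out · k · T MACs:

```python
# Back-of-envelope MAC count for one dilated 1-D convolution layer.
# Shapes below are hypothetical, not TD3Net's actual configuration.
def conv1d_macs(c_in, c_out, kernel_size, seq_len):
    """MACs of a stride-1, 'same'-padded 1-D conv over seq_len frames."""
    return c_in * c_out * kernel_size * seq_len

# e.g. a 512-channel conv with kernel 3 over a 29-frame LRW clip:
print(conv1d_macs(512, 512, 3, 29) / 1e9)  # ~0.023 GMACs for this single layer
```

Note that dilation widens the receptive field without changing the MAC count at equal padding, which is why the multi-dilated backend stays cheap.
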
## Installation
### 1. Clone the Repository
```bash
git clone https://github.com/lbh-kor/TD3Net-weights.git
cd TD3Net-weights
```

### 2. Set Up Environment
Create and activate a Python 3.8 virtual environment using `uv`:
```bash
uv venv .venv --python 3.8
source .venv/bin/activate
```

If `uv` is not installed, you can install it first:
```bash
# Install uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with pip
pip install uv
```
Then, install the required packages:
```bash
# Using pip
pip install -r requirements.txt

# Or using uv (recommended)
uv pip install -r requirements.txt
```

### 3. (Optional) Configure `.env` File
Create a `.env` file in the project root directory with the following content:
```bash
# For Neptune logging
NEPTUNE_PROJECT="your_project_name"
NEPTUNE_API_TOKEN="your_neptune_api_token"

# Add any other environment variables as needed
```

## Data Preparation
To train TD3Net, you need to prepare the LRW dataset as follows:
### Download the Dataset
- Download the [LRW dataset](http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html)

### Preprocessing
- For preprocessing logic, including frame extraction, cropping, and alignment, please refer to the implementation in [Lipreading using Temporal Convolutional Networks](https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks/blob/master/preprocessing/transform.py)

### Dataset Path Configuration
After preprocessing the dataset, specify the paths to the processed files in `config.py` using the following arguments:
- `data_dir`: path to the directory containing image sequences extracted from lip region videos
- `label-path`: path to the file mapping each image sequence to its target word class
- `annotation-direc`: path to the annotation directory containing metadata such as utterance duration (note: not required for our experiments)

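As a quick sanity check that these paths line up, a small script like the following can help (a hypothetical helper, not part of this repository; it assumes the label file lists one word class per line and `data_dir` contains one sub-directory per class, as in common LRW setups):

```python
import os

def load_labels(label_path):
    """Read word classes, one per line, skipping blank lines."""
    with open(label_path) as f:
        return [line.strip() for line in f if line.strip()]

def missing_classes(data_dir, labels):
    """Word classes with no matching sub-directory under data_dir."""
    return [w for w in labels if not os.path.isdir(os.path.join(data_dir, w))]

label_path = "labels/words.txt"    # hypothetical paths -- substitute your own
data_dir = "/path/to/lrw_cropped"
if os.path.exists(label_path):
    labels = load_labels(label_path)
    print(f"{len(missing_classes(data_dir, labels))} of {len(labels)} classes missing")
```
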
## Training and Inference
For detailed experiment settings and execution options, including how to resume training from checkpoints, please refer to the `run_train.sh` script.

### Training Examples

#### 1. Train TD3Net-Base with ResNet Backbone
```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
    --config-path td3net_configs/td3net_config_base.yaml \
    --backbone-type resnet \
    --ex-name td3net_base \
    --epochs 120
    # --neptune_logging true # (Optional) Enable Neptune logging
```

#### 2. Train TD3Net-Base with EfficientNetV2 Backbone
```bash
CUDA_VISIBLE_DEVICES=1 python main.py \
    --config-path td3net_configs/td3net_config_base.yaml \
    --backbone-type tf_efficientnetv2_s \
    --ex-name td3net_efficient
    # --use-pretrained true # (Optional) Use pretrained backbone weights
```
> Checkpoints are automatically saved to the directory specified by the `logging-dir` argument in `config.py`.

### Inference Only
💡 While training includes inference by default, you can also run inference separately using pretrained or custom-trained models.

+
124
+ #### 1. Using Pretrained Weights
125
+ ⚠️ Make sure the config file matches the corresponding pretrained model.
126
+ ```bash
127
+ # td3net_base
128
+ CUDA_VISIBLE_DEVICES=0 python main.py \
129
+ --action test \
130
+ --config-path td3net_configs/td3net_config_base.yaml \
131
+ --model-path ./train_log/td3net_base/ckpt.best.pth.tar
132
+
133
+ # td3net_best
134
+ CUDA_VISIBLE_DEVICES=0 python main.py \
135
+ --action test \
136
+ --config-path td3net_configs/td3net_config_best.yaml \
137
+ --model-path ./train_log/td3net_best/ckpt.best.pth.tar
138
+
139
+ # wb_td3net_best
140
+ CUDA_VISIBLE_DEVICES=0 python main.py \
141
+ --action test \
142
+ --config-path td3net_configs/td3net_config_best.yaml \
143
+ --model-path ./train_log/wb_td3net_best/ckpt.best.pth.tar
144
+
145
+ ```
146
+ > Note: To use pretrained weights, download the model from the links provided in the Main Results section and specify the path using `--model-path`.

#### 2. Using a Custom-Trained Model
If you have trained your own model, you can run inference with the corresponding config and model path.
```bash
# --backbone-type options: resnet, tf_efficientnetv2_s/m/l
CUDA_VISIBLE_DEVICES=0 python main.py \
    --action test \
    --backbone-type resnet \
    --config-path <YOUR_CONFIG_PATH> \
    --model-path <YOUR_MODEL_PATH>
```

## Citation
If you find our work useful in your research, please consider citing our paper (arXiv submission in preparation):

```
@article{lee2025td3net,
  title   = {TD3Net: Temporal Densely Connected Multidilated Convolutional Network for Word-Level Lipreading},
  author  = {Lee, Byung Hoon and Others},
  journal = {Journal of Visual Communication and Image Representation},
  year    = {2025},
  note    = {arXiv submission in preparation},
  url     = {https://arxiv.org/abs/xxxx.xxxxx}
}
```