Add model card for TD3Net (#2) by nielsr

README.md ADDED
@@ -0,0 +1,170 @@
---
license: unknown
library_name: pytorch
pipeline_tag: automatic-speech-recognition
---

# TD3Net: Temporal Densely Connected Multi-Dilated Convolutional Network for Word-Level Lipreading

This repository contains the official implementation of our paper **TD3Net: Temporal Densely Connected Multi-Dilated Convolutional Network for Word-Level Lipreading**.

## Paper
[**TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading**](https://huggingface.co/papers/2506.16073)

## Abstract

The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems.
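
To build intuition for the backend, here is a minimal, illustrative PyTorch sketch of a multi-dilated temporal convolution that applies a different dilation factor to each group of channels; this is a simplification for exposition, not the paper's exact TD3Net block:

```python
import torch
import torch.nn as nn

class MultiDilatedTemporalBlock(nn.Module):
    """Toy multi-dilated temporal convolution: each channel group gets its
    own dilation factor, so the combined receptive field has no blind spots.
    Illustrative only; not the paper's exact TD3Net block."""

    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        group = channels // len(dilations)
        self.branches = nn.ModuleList(
            nn.Conv1d(group, group, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); one dilation per channel group.
        chunks = x.chunk(len(self.branches), dim=1)
        return torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)

block = MultiDilatedTemporalBlock(96)
print(block(torch.randn(2, 96, 29)).shape)  # torch.Size([2, 96, 29])
```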

## Code
The official code for TD3Net can be found at: [https://github.com/lbh-kor/TD3Net-weights](https://github.com/lbh-kor/TD3Net-weights)

## Main Results

### LRW Test Dataset Performance
The experiments were conducted in the following environment: Ubuntu 20.04, Python 3.8.13, PyTorch 1.8.0, CUDA 11.1, and an NVIDIA RTX 3090 GPU.

Params and FLOPs are measured for the TD3Net backend only, as this work focuses on backend efficiency. FLOPs were calculated using [fvcore](https://github.com/facebookresearch/fvcore).
To check the parameter count and FLOPs of any model configuration, you can run `test_model.sh` (which executes `lipreading/model.py`); a minimal illustrative sketch of the same measurement is shown below.
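
As a rough illustration of how such measurements are produced with fvcore, here is a minimal sketch; the stand-in `nn.Conv1d` module and the input shape are placeholders, not the repository's actual backend:

```python
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis

# Stand-in backend: a single temporal convolution, NOT the real TD3Net.
# Swap in the model constructed by lipreading/model.py to reproduce the table.
backend = nn.Conv1d(in_channels=512, out_channels=512, kernel_size=3, padding=1)
backend.eval()

# Assumed input shape (batch, channels, frames); LRW clips are 29 frames at 25 fps.
dummy = torch.randn(1, 512, 29)

params_m = sum(p.numel() for p in backend.parameters()) / 1e6
flops_g = FlopCountAnalysis(backend, dummy).total() / 1e9
print(f"Params: {params_m:.2f} M | FLOPs: {flops_g:.2f} G")
```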

| Method | # Params (M) | FLOPs (G) | Inference time (s) | Accuracy (%) |
|---------------------------------|--------------|-----------|--------------------|--------------|
| TD3Net-Base | 18.69 | 1.56 | 45 | [89.36](https://huggingface.co/lbh-kor/TD3Net-weights/blob/main/td3net_base/ckpt.best.pth.tar) |
| TD3Net-Best | 31.39 | 1.92 | 49 | [89.54](https://huggingface.co/lbh-kor/TD3Net-weights/blob/main/td3net_best/ckpt.best.pth.tar) |
| TD3Net-Best (w/ word boundary) | 31.39 | 1.92 | 49 | [91.41](https://huggingface.co/lbh-kor/TD3Net-weights/blob/main/wb_td3net_best/ckpt.best.pth.tar) |

> Click an accuracy value to download the corresponding model weights.
> For inference with these pretrained weights, please refer to the [Inference Only](#inference-only) section below.
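
If you prefer fetching the checkpoints programmatically rather than clicking the links, a minimal sketch using `huggingface_hub` follows; the assumption that the `.pth.tar` file is a standard torch-serialized dictionary should be verified against the repository's loading code:

```python
import torch
from huggingface_hub import hf_hub_download

# Fetch the TD3Net-Base checkpoint from this model repository.
ckpt_path = hf_hub_download(
    repo_id="lbh-kor/TD3Net-weights",
    filename="td3net_base/ckpt.best.pth.tar",
)

# Assumption: the .pth.tar file is a torch-serialized dict; inspect its keys first.
checkpoint = torch.load(ckpt_path, map_location="cpu")
print(list(checkpoint.keys()))
```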

## Installation
### 1. Clone the Repository
```bash
git clone https://github.com/lbh-kor/TD3Net-weights.git
cd TD3Net-weights
```

### 2. Set Up Environment
Create and activate a Python 3.8 virtual environment using uv:
```bash
uv venv .venv --python 3.8
source .venv/bin/activate
```

If uv is not installed, you can install it first:
```bash
# Install uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with pip
pip install uv
```
Then, install the required packages:
```bash
# Using pip
pip install -r requirements.txt

# Or using uv (recommended)
uv pip install -r requirements.txt
```

### 3. (Optional) Configure .env File
Create a `.env` file in the project root directory with the following content:
```bash
# For Neptune logging
NEPTUNE_PROJECT="your_project_name"
NEPTUNE_API_TOKEN="your_neptune_api_token"

# Add any other environment variables as needed
```
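
Assuming the scripts read these variables through `python-dotenv` (an assumption; check how `main.py` actually loads them), the pattern would look like:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

project = os.getenv("NEPTUNE_PROJECT")
token = os.getenv("NEPTUNE_API_TOKEN")
print(f"Neptune configured: {project is not None and token is not None}")
```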

## Data Preparation
To train TD3Net, you need to prepare the LRW dataset as follows:
### Download the Dataset
- Download the [LRW dataset](http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html)

### Preprocessing
- For the preprocessing logic, including frame extraction, cropping, and alignment, please refer to the implementation in [Lipreading using Temporal Convolutional Networks](https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks/blob/master/preprocessing/transform.py); a rough sketch of the crop step appears below.
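
For orientation only, the heart of that preprocessing is a fixed-size crop around the mouth region; the sketch below assumes the linked repository's 96×96 crop convention and omits the landmark-based alignment step:

```python
import numpy as np

def crop_mouth_roi(frame: np.ndarray, center: tuple, size: int = 96) -> np.ndarray:
    """Crop a size x size patch around the (x, y) mouth center of a grayscale frame."""
    x, y = center
    half = size // 2
    # Clamp so the crop stays inside the frame.
    x0 = max(0, min(frame.shape[1] - size, x - half))
    y0 = max(0, min(frame.shape[0] - size, y - half))
    return frame[y0:y0 + size, x0:x0 + size]

# Example on a dummy 256x256 grayscale frame with an assumed mouth center.
frame = np.zeros((256, 256), dtype=np.uint8)
roi = crop_mouth_roi(frame, center=(128, 180))
print(roi.shape)  # (96, 96)
```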

### Dataset Path Configuration
After preprocessing the dataset, specify the paths to the processed files in `config.py` using the following arguments (a sanity-check sketch follows this list):
- `data_dir`: path to the directory containing image sequences extracted from lip-region videos
- `label-path`: path to the file mapping each image sequence to its target word class
- `annotation-direc`: path to the annotation directory containing metadata such as utterance duration (note: not required for our experiments)
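
Before training, it can help to sanity-check the preprocessed layout; the sketch below assumes the standard LRW structure of 500 word classes, each with `train`/`val`/`test` splits, which may differ from your preprocessing output:

```python
from pathlib import Path

data_dir = Path("/path/to/lrw_preprocessed")  # the value you set in config.py

# LRW provides 500 word classes, each split into train/val/test.
classes = sorted(p.name for p in data_dir.iterdir() if p.is_dir())
print(f"{len(classes)} word classes found (expected 500)")

for split in ("train", "val", "test"):
    count = sum(1 for _ in data_dir.glob(f"*/{split}/*"))
    print(f"{split}: {count} sequences")
```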

## Training and Inference
For detailed experiment settings and execution options, including how to resume training from checkpoints, please refer to the `run_train.sh` script.

### Training Examples

#### 1. Train TD3Net-Base with ResNet Backbone
```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --config-path td3net_configs/td3net_config_base.yaml \
  --backbone-type resnet \
  --ex-name td3net_base \
  --epochs 120
# --neptune_logging true # (Optional) Enable Neptune logging
```

#### 2. Train TD3Net-Base with EfficientNetV2 Backbone
```bash
CUDA_VISIBLE_DEVICES=1 python main.py \
  --config-path td3net_configs/td3net_config_base.yaml \
  --backbone-type tf_efficientnetv2_s \
  --ex-name td3net_efficient
# --use-pretrained true # (Optional) Use pretrained backbone weights
```
> Checkpoints are automatically saved to the directory specified by the `logging-dir` argument in `config.py`.

### Inference Only
💡 While training includes inference by default, you can also run inference separately using pretrained or custom-trained models.

#### 1. Using Pretrained Weights
⚠️ Make sure the config file matches the corresponding pretrained model.
```bash
# td3net_base
CUDA_VISIBLE_DEVICES=0 python main.py \
  --action test \
  --config-path td3net_configs/td3net_config_base.yaml \
  --model-path ./train_log/td3net_base/ckpt.best.pth.tar

# td3net_best
CUDA_VISIBLE_DEVICES=0 python main.py \
  --action test \
  --config-path td3net_configs/td3net_config_best.yaml \
  --model-path ./train_log/td3net_best/ckpt.best.pth.tar

# wb_td3net_best
CUDA_VISIBLE_DEVICES=0 python main.py \
  --action test \
  --config-path td3net_configs/td3net_config_best.yaml \
  --model-path ./train_log/wb_td3net_best/ckpt.best.pth.tar
```
> Note: To use pretrained weights, download the model from the links provided in the [Main Results](#main-results) section and specify its path using `--model-path`.

#### 2. Using a Custom-Trained Model
If you have trained your own model, you can run inference with the corresponding config and model path.
```bash
# --backbone-type options: resnet, tf_efficientnetv2_s/m/l
CUDA_VISIBLE_DEVICES=0 python main.py \
  --action test \
  --backbone-type resnet \
  --config-path <YOUR_CONFIG_PATH> \
  --model-path <YOUR_MODEL_PATH>
```

## Citation
If you find our work useful in your research, please consider citing our paper:

```bibtex
@article{lee2025td3net,
  title   = {TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading},
  author  = {Lee, Byung Hoon and Others},
  journal = {Journal of Visual Communication and Image Representation},
  year    = {2025},
  url     = {https://arxiv.org/abs/2506.16073}
}
```