license: unknown
library_name: pytorch
pipeline_tag: automatic-speech-recognition

TD3Net: Temporal Densely Connected Multidilated Convolutional Network for Word-Level Lipreading

This repository contains the official implementation of our paper TD3Net: Temporal Densely Connected Multi-Dilated Convolutional Network for Word-Level Lipreading.

Paper

TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading

Abstract

The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems.
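The blind-spot argument can be illustrated with a small receptive-field calculation. The sketch below is not the TD3Net implementation. It assumes a toy dense block with kernel size 3 and dilations 1, 2, and 4 (all illustrative choices), and the helper names `conv_offsets` and `blind_spots` are hypothetical. For each skip-connected feature, it tracks which input time steps remain visible when one dilation is applied to all features versus when each feature gets its own dilation.

```python
def conv_offsets(input_offsets, dilation, kernel_size=3):
    """Time offsets visible after a dilated 1D conv over a feature that sees `input_offsets`."""
    half = kernel_size // 2
    taps = [k * dilation for k in range(-half, half + 1)]
    return {o + t for o in input_offsets for t in taps}


def blind_spots(offsets):
    """Offsets inside [min, max] that the feature never sees."""
    return sorted(set(range(min(offsets), max(offsets) + 1)) - set(offsets))


# Receptive fields (relative to the output time step) of the skip-connected features.
x0 = {0}                                         # raw input
x1 = conv_offsets(x0, dilation=1)                # {-1, 0, 1}
x2 = conv_offsets(x0, 1) | conv_offsets(x1, 2)   # {-3, ..., 3}

# Third layer with a single dilation of 4 applied to every skip-connected feature.
single = {name: conv_offsets(f, 4) for name, f in [("x0", x0), ("x1", x1), ("x2", x2)]}

# Third layer, multi-dilated: the dilation grows with each feature's receptive field.
multi = {
    "x0": conv_offsets(x0, 1),
    "x1": conv_offsets(x1, 2),
    "x2": conv_offsets(x2, 4),
}

for name in ("x0", "x1", "x2"):
    print(name, "single:", blind_spots(single[name]), "multi:", blind_spots(multi[name]))
```

In this toy setting the single-dilation layer skips offsets on the paths from `x0` and `x1`, while the multi-dilated variant leaves no gaps on any path.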

Code

The official code for TD3Net can be found at: https://github.com/lbh-kor/TD3Net-weights

Main Results

LRW Test Dataset Performance

The experiments were conducted in the following environment: Ubuntu 20.04, Python 3.8.13, PyTorch 1.8.0, CUDA 11.1, and NVIDIA RTX 3090.

Params and FLOPs are measured for the TD3Net backend only, as this work focuses on backend efficiency. FLOPs were calculated using fvcore. To check the parameter count and FLOPs of any model configuration, you can run test_model.sh (which executes lipreading/model.py).
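As a rough sanity check on how such figures arise, the sketch below computes parameters and FLOPs for a single 1D convolution by hand. The layer sizes are made up and do not correspond to any TD3Net configuration, and the function names are hypothetical. It assumes the convention of counting one multiply-accumulate as one FLOP and ignoring the bias term, which is how fvcore's convolution counter is generally understood to work; treat that as an assumption and confirm against the output of test_model.sh.

```python
def conv1d_params(in_ch, out_ch, kernel_size, bias=True):
    """Learnable parameters of a standard (non-grouped) 1D convolution."""
    return out_ch * in_ch * kernel_size + (out_ch if bias else 0)


def conv1d_flops(in_ch, out_ch, kernel_size, out_len, batch=1):
    """Multiply-accumulates of the same layer, with one MAC counted as one FLOP."""
    return batch * out_ch * in_ch * kernel_size * out_len


# Illustrative sizes only: 512 -> 512 channels, kernel size 3, 29 output frames.
params = conv1d_params(512, 512, 3)
flops = conv1d_flops(512, 512, 3, out_len=29)
print(f"{params / 1e6:.2f} M params, {flops / 1e9:.3f} GFLOPs")
```

Summing these per-layer counts over a whole backend gives totals in the same units as the table below.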

Method	# Params (M)	FLOPs (G)	Inference time (s)	Accuracy (%)
TD3Net-Base	18.69	1.56	45	89.36
TD3Net-Best	31.39	1.92	49	89.54
TD3Net-Best (w/ word boundary)	31.39	1.92	49	91.41

Click the accuracy value to download model weights. For inference with these pretrained weights, please refer to the Inference Only section below.

Installation

1. Clone the Repository

git clone https://github.com/lbh-kor/TD3Net-weights.git
cd TD3Net-weights

2. Set Up Environment

Create and activate a Python 3.8 virtual environment using uv:

uv venv .venv --python 3.8
source .venv/bin/activate

If uv is not installed, you can install it using:

# Install uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with pip
pip install uv

Then, install the required packages:

# Using pip
pip install -r requirements.txt

# Or using uv (recommended)
uv pip install -r requirements.txt

3. (Optional) Configure .env File

Create a .env file in the project root directory with the following content:

# For Neptune logging
NEPTUNE_PROJECT="your_project_name"
NEPTUNE_API_TOKEN="your_neptune_api_token"

# Add any other environment variables as needed
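How the repository reads this file is not stated here; a common choice is the python-dotenv package. As a dependency-free sketch of the idea, the hypothetical helper `load_env_file` below parses `KEY="value"` lines, skips comments and blank lines, and exports the values into `os.environ` without overriding variables that are already set.

```python
import os


def load_env_file(path=".env"):
    """Parse simple KEY=value lines and export them; returns the parsed dict."""
    parsed = {}
    if not os.path.exists(path):
        return parsed
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, value = line.split("=", 1)
            key, value = key.strip(), value.strip().strip("\"'")
            parsed[key] = value
            os.environ.setdefault(key, value)  # already-set variables win
    return parsed
```

Calling `load_env_file()` once at startup would then make `os.environ.get("NEPTUNE_PROJECT")` available to the rest of the program.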

Data Preparation

To train TD3Net, prepare the LRW dataset as follows:

Download the Dataset

Download the LRW dataset

Preprocessing

For the preprocessing logic, including frame extraction, cropping, and alignment, please refer to the implementation in Lipreading using Temporal Convolutional Networks.

Dataset Path Configuration

After preprocessing the dataset, you need to specify the paths to the processed files in config.py using the following arguments:

data_dir: path to the directory containing image sequences extracted from lip region videos
label-path: path to the file mapping each image sequence to its target word class
annotation-direc: path to the annotation directory containing metadata like utterance duration (Note: Not required for our experiments)

Training and Inference

For detailed experiment settings and execution options, including how to resume training from checkpoints, please refer to the run_train.sh script.

Training Examples

1. Train TD3Net-base with ResNet Backbone

CUDA_VISIBLE_DEVICES=0 python main.py \
    --config-path td3net_configs/td3net_config_base.yaml \
    --backbone-type resnet \
    --ex-name td3net_base \
    --epochs 120
    # --neptune_logging true  # (Optional) Enable Neptune logging

2. Train TD3Net-base with EfficientNetV2 Backbone

CUDA_VISIBLE_DEVICES=1 python main.py \
    --config-path td3net_configs/td3net_config_base.yaml \
    --backbone-type tf_efficientnetv2_s \
    --ex-name td3net_efficient
    # --use-pretrained true  # (Optional) Use pretrained backbone weights

Checkpoints are automatically saved to the directory specified by the logging-dir argument in config.py.

Inference Only

💡 While training includes inference by default, you can also run inference separately using pretrained or custom-trained models.

1. Using Pretrained Weights

⚠️ Make sure the config file matches the corresponding pretrained model.

# td3net_base
CUDA_VISIBLE_DEVICES=0 python main.py \
    --action test \
    --config-path td3net_configs/td3net_config_base.yaml \
    --model-path ./train_log/td3net_base/ckpt.best.pth.tar

# td3net_best
CUDA_VISIBLE_DEVICES=0 python main.py \
    --action test \
    --config-path td3net_configs/td3net_config_best.yaml \
    --model-path ./train_log/td3net_best/ckpt.best.pth.tar

# wb_td3net_best
CUDA_VISIBLE_DEVICES=0 python main.py \
    --action test \
    --config-path td3net_configs/td3net_config_best.yaml \
    --model-path ./train_log/wb_td3net_best/ckpt.best.pth.tar

Note: To use pretrained weights, download the model from the links provided in the Main Results section and specify the path using --model-path.

2. Using a Custom-Trained Model

If you have trained your own model, you can run inference with the corresponding config and model path.

# --backbone-type options: resnet, tf_efficientnetv2_s/m/l
CUDA_VISIBLE_DEVICES=0 python main.py \
    --action test \
    --backbone-type resnet \
    --config-path <YOUR_CONFIG_PATH> \
    --model-path <YOUR_MODEL_PATH>

Citation

If you find our work useful in your research, please consider citing our paper (arXiv submission in preparation):

@article{lee2025td3net,
  title     = {TD3Net: Temporal Densely Connected Multidilated Convolutional Network for Word-Level Lipreading},
  author    = {Lee, Byung Hoon and others},
  journal   = {Journal of Visual Communication and Image Representation},
  year      = {2025},
  note      = {arXiv submission in preparation},
  url       = {https://arxiv.org/abs/xxxx.xxxxx}
}