Improve model card for TD3Net with abstract, results, and updated paper link

#3 by nielsr (HF Staff) · opened
Files changed (1): README.md (+49 -6)
@@ -2,15 +2,58 @@
  library_name: pytorch
  pipeline_tag: automatic-speech-recognition
  tags:
- - Lipreading
- - TD3Net
  ---

- # TD3Net Weights

- This repository provides pretrained weights for the paper:

- 📄 [TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading](https://www.sciencedirect.com/science/article/abs/pii/S1047320325001543)
  🔗 Official code: [GitHub Repository](https://github.com/Leebh-kor/TD3Net)

- > For implementation details and usage, please refer to the GitHub repository.
  library_name: pytorch
  pipeline_tag: automatic-speech-recognition
  tags:
+ - Lipreading
+ - TD3Net
  ---

+ # TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading

+ This repository provides pretrained weights for the model presented in the paper:

+ 📄 [TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading](https://huggingface.co/papers/2506.16073)
  🔗 Official code: [GitHub Repository](https://github.com/Leebh-kor/TD3Net)

+ ## Abstract
+ The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems.
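A reviewer's aside on the abstract's central claim: the "blind spot" issue is easy to see in a few lines of pure Python. Stacking kernel-size-3 dilated convolutions whose dilation factors differ (e.g., 1, 2, 4) covers every temporal offset in the receptive field, whereas a single large dilation skips intermediate frames. This is an illustrative sketch of the receptive-field idea only, not the authors' implementation; `covered_offsets` is a name invented here.

```python
# Illustrative sketch (not the authors' code): why combining several
# dilation factors yields a dense receptive field, while a single
# large dilation leaves "blind spots" (uncovered temporal offsets).

def covered_offsets(dilations, kernel_size=3):
    """Temporal offsets reachable by stacking dilated 1-D convolutions."""
    covered = {0}
    for d in dilations:
        taps = [k * d for k in range(kernel_size)]
        covered = {c + t for c in covered for t in taps}
    return covered

# Mixed dilation factors (1, 2, 4): every offset up to the receptive
# field's span is hit, i.e., the field is dense with no blind spots.
dense = covered_offsets([1, 2, 4])
print(sorted(dense))                          # contiguous 0..14
print(set(range(max(dense) + 1)) <= dense)    # True -> no blind spots

# A single convolution with dilation 4 skips intermediate frames.
sparse = covered_offsets([4])
print(sorted(sparse))                         # [0, 4, 8] -> gaps at 1-3, 5-7
```

The gaps in the single-dilation case are exactly the kind of discontinuity the abstract argues loses information about continuous lip movements.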
+
+ ## Main Results
+
+ ### LRW Test Dataset Performance
+ The experiments were conducted in the following environment: Ubuntu 20.04, Python 3.8.13, PyTorch 1.8.0, CUDA 11.1, and NVIDIA RTX 3090.
+
+ Params and FLOPs are measured for the TD3Net backend only, as this work focuses on backend efficiency. FLOPs were calculated using [fvcore](https://github.com/facebookresearch/fvcore).
+
+ | Method | # Params (M) | FLOPs (G) | Inference time (s) | Accuracy (%) |
+ |--------|--------------|-----------|--------------------|--------------|
+ | TD3Net-Base | 18.69 | 1.56 | 45 | [89.36](https://huggingface.co/lbh-kor/TD3Net-weights/blob/main/td3net_base/ckpt.best.pth.tar) |
+ | TD3Net-Best | 31.39 | 1.92 | 49 | [89.54](https://huggingface.co/lbh-kor/TD3Net-weights/blob/main/td3net_best/ckpt.best.pth.tar) |
+ | TD3Net-Best (w/ word boundary) | 31.39 | 1.92 | 49 | [91.41](https://huggingface.co/lbh-kor/TD3Net-weights/blob/main/wb_td3net_best/ckpt.best.pth.tar) |
+
+ _Click the accuracy value to download model weights._
+
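The table supports a quick cost/benefit reading, sketched below with the numbers copied from it (the variable names are mine, not the authors'): the Best configuration buys a small accuracy gain with a larger backend, while word-boundary information adds the biggest gain at no parameter or FLOPs cost.

```python
# Illustrative arithmetic from the results table above (backend-only figures).
base = {"params_m": 18.69, "flops_g": 1.56, "acc": 89.36}       # TD3Net-Base
best = {"params_m": 31.39, "flops_g": 1.92, "acc": 89.54}       # TD3Net-Best
best_wb = {"params_m": 31.39, "flops_g": 1.92, "acc": 91.41}    # + word boundary

extra_params = round(best["params_m"] - base["params_m"], 2)    # 12.7 M more params
extra_flops = round(best["flops_g"] - base["flops_g"], 2)       # 0.36 G more FLOPs
acc_gain = round(best["acc"] - base["acc"], 2)                  # +0.18 points

# Word-boundary information: accuracy gain with an unchanged backend.
wb_gain = round(best_wb["acc"] - best["acc"], 2)                # +1.87 points

print(extra_params, extra_flops, acc_gain, wb_gain)
```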
+ ## Usage
+ To use the pre-trained models for inference, download the model weights (e.g., `ckpt.best.pth.tar`) from the links in the Main Results table above or from the [Main Results section of the GitHub repository](https://github.com/Leebh-kor/TD3Net#lrw-test-dataset-performance). Then run inference with the `main.py` script and the corresponding configuration:
+
+ ```bash
+ # Example for TD3Net-Best
+ # Replace ./path/to/downloaded/td3net_best/ckpt.best.pth.tar with the actual path to your downloaded weights
+ CUDA_VISIBLE_DEVICES=0 python main.py \
+     --action test \
+     --config-path td3net_configs/td3net_config_best.yaml \
+     --model-path ./path/to/downloaded/td3net_best/ckpt.best.pth.tar
+ ```
+ For detailed instructions on installation, data preparation, training, and other inference options, please refer to the [official GitHub repository](https://github.com/Leebh-kor/TD3Net).
+
+ ## Citation
+ If you find our work useful in your research, please consider citing our paper:
+ ```bibtex
+ @article{lee2025td3net,
+   title={TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading},
+   author={Lee, Byung Hoon and Shin, Wooseok and Han, Sung Won},
+   journal={Journal of Visual Communication and Image Representation},
+   volume={111},
+   pages={104540},
+   year={2025},
+   doi={10.1016/j.jvcir.2025.104540}
+ }
+ ```