Upload README.md with huggingface_hub
---
{}
---
<div align="center">

# CoNeTTE model source

<a href="https://www.python.org/"><img alt="Python" src="https://img.shields.io/badge/-Python 3.10+-blue?style=for-the-badge&logo=python&logoColor=white"></a>
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/-PyTorch 1.10.1+-ee4c2c?style=for-the-badge&logo=pytorch&logoColor=white"></a>
<a href="https://black.readthedocs.io/en/stable/"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge&labelColor=gray"></a>
<a href="https://github.com/Labbeti/conette-audio-captioning/actions">
<img alt="Build" src="https://img.shields.io/github/actions/workflow/status/Labbeti/conette-audio-captioning/python-package-pip.yaml?branch=main&style=for-the-badge&logo=github">
</a>
<!-- <a href='https://aac-metrics.readthedocs.io/en/stable/?badge=stable'>
<img src='https://readthedocs.org/projects/aac-metrics/badge/?version=stable&style=for-the-badge' alt='Documentation Status' />
</a> -->

CoNeTTE is an audio captioning system which generates a short textual description of the sound events in any audio file.

</div>

CoNeTTE has been developed by me ([Étienne Labbé](https://labbeti.github.io/)) during my PhD. CoNeTTE stands for ConvNeXt-Transformer with Task Embedding, and its architecture and training are explained in the corresponding [paper](https://arxiv.org/pdf/2309.00454.pdf).

## Installation
```bash
python -m pip install conette
python -m spacy download en_core_web_sm
```

## Usage with Python

```py
from conette import CoNeTTEConfig, CoNeTTEModel

config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)

path = "/your/path/to/audio.wav"
outputs = model(path)
candidate = outputs["cands"][0]
print(candidate)
```

The model can also accept several audio files at the same time (`list[str]`), or a list of pre-loaded audio files (`list[Tensor]`). In the second case, you also need to provide the sampling rate of these files:

```py
import torchaudio

path_1 = "/your/path/to/audio_1.wav"
path_2 = "/your/path/to/audio_2.wav"

audio_1, sr_1 = torchaudio.load(path_1)
audio_2, sr_2 = torchaudio.load(path_2)
# ...
print(candidate)
```
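The batched model call itself is elided in the diff above. As a rough sketch of the call shape only (a stand-in function is used here, and the list-plus-`sr` signature is an assumption about `CoNeTTEModel`, not its confirmed API):

```python
# Stand-in for the CoNeTTE model, used only to illustrate the batched call
# shape; the `sr` keyword and the "cands" output key mirror the single-file
# example, but this signature is an assumption.
def demo_model(audios, sr):
    assert len(audios) == len(sr)  # one sampling rate per pre-loaded waveform
    return {"cands": [f"caption {i}" for i in range(len(audios))]}

audio_1, sr_1 = [0.0] * 4, 16000  # stand-ins for loaded waveforms
audio_2, sr_2 = [0.0] * 4, 16000
outputs = demo_model([audio_1, audio_2], sr=[sr_1, sr_2])
print(outputs["cands"])  # one candidate caption per input
```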

## Usage with command line

Use the `conette-predict` command with the `--audio PATH1 PATH2 ...` option. You can also export results to a CSV file using `--csv_export PATH`.

```bash
conette-predict --audio "/your/path/to/audio.wav"
```

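A file produced with `--csv_export` can be post-processed with the standard `csv` module. Note that the column names (`fname`, `cands`) in this sketch are assumptions for illustration, not the documented output format:

```python
import csv
import io

# In-memory stand-in for a --csv_export file; the real file would be opened
# with open("results.csv"), and its column names may differ.
sample = "fname,cands\naudio.wav,a dog barks while birds chirp\n"
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["fname"], "->", row["cands"])
```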
## Performance
|
| 79 |
+
|
| 80 |
+
| Test data | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) | Vocab | Outputs | Scores |
|
| 81 |
+
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
|
| 82 |
+
| AC-test | 44.14 | 43.98 | 60.81 | 309 | [:clipboard:](results/conette/outputs_audiocaps_test.csv) | [:chart_with_upwards_trend:](results/conette/scores_audiocaps_test.yaml) |
|
| 83 |
+
| CL-eval | 30.97 | 30.87 | 51.72 | 636 | [:clipboard:](results/conette/outputs_clotho_eval.csv) | [:chart_with_upwards_trend:](results/conette/scores_clotho_eval.yaml) |
|
| 84 |
|
| 85 |
This model checkpoint has been trained for the Clotho dataset, but it can also reach a good performance on AudioCaps with the "audiocaps" task.
|
| 86 |
|
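A minimal sketch of how a task setting might switch the captioning style between datasets; the `task` keyword and the stand-in function are assumptions, not the confirmed `CoNeTTEModel` signature:

```python
# Stand-in illustrating a dataset-specific "task" selector; the real model
# would produce genuinely different captions per task embedding.
def demo_task_model(path, task="clotho"):
    styles = {
        "clotho": "a Clotho-style caption",
        "audiocaps": "an AudioCaps-style caption",
    }
    return {"cands": [styles[task]]}

print(demo_task_model("audio.wav", task="audiocaps")["cands"][0])
```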
[...]

## Additional information

- Model weights are available on HuggingFace: https://huggingface.co/Labbeti/conette
- The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT. More precisely, the encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth", available on Zenodo: https://zenodo.org/record/8020843.

## Contact

Maintainer:
- Etienne Labbé "Labbeti": labbeti.pub@gmail.com