Milad Alizadeh committed
Commit c3156f6 · 0 Parent(s)

hf app (#95)

- Migrated the NatureLM-audio demo app from HuggingFace Spaces into this
  repository (projects/NatureLM-audio-hf-app): the app previously lived as
  a standalone HF Spaces repo with a copy of the NatureLM code. It now
  properly depends on esp-data, esp-research, and naturelm_audio as
  packages.
- Switched from the Gradio SDK to the Docker SDK: this lets us pull our private
  codebases into the HF Space without exposing them. Once open-sourced,
  this can be simplified. Downside: we lose HF's free ZeroGPU tier, so
  we're using paid GPUs (CPU for now while other pieces come together).
- Deploy via `make push-naturelm-app-to-hf`: uses git subtree to push
  just the app directory to HF Spaces.
- NatureLM-audio-v1.5 packaged: moved into src/naturelm_audio/ with a
  build system, necessary because this is the first cross-project
  dependency in the workspace.
- CI: added a deptry check for the HF app and updated the ruff config with
  workspace-aware src paths.
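
The subtree deploy described above can be sketched as a Makefile target. This is a hypothetical fragment: the target body, Space URL, and branch name here are assumptions for illustration, not the repository's actual Makefile contents.

```make
# Hypothetical shape of the deploy target; the Space URL is illustrative.
push-naturelm-app-to-hf:
	git subtree push \
		--prefix=projects/NatureLM-audio-hf-app \
		https://huggingface.co/spaces/EarthSpeciesProject/naturelm-audio-demo \
		main
```

`git subtree push` splits the history of the prefix directory into its own commit stream and pushes only that, which is what lets a single app directory live inside the larger repo while HF Spaces sees a self-contained repository.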

Known gaps

- The app needs a .generate()-like method on the model — currently uses
a mock placeholder.
- Cross-project dependencies highlight the need for a monorepo vs
multi-repo discussion.
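
The mock placeholder mirrors the call interface the app expects from the model: called with a batch of audio paths and queries, returning one list of prediction dicts per audio. A minimal sketch of that contract (the class name here is illustrative; it follows `_MockModel` in app.py, and the real `.generate()`-like method is the missing piece):

```python
class MockNatureLM:
    """Stand-in for the missing .generate()-like API on NatureLM.

    app.py calls the model as model(audios, queries, **kwargs) and expects
    a list of prediction-dict lists, one entry per input audio file.
    """

    def __call__(
        self,
        audios: list[str],
        queries: list[str],
        **kwargs: object,
    ) -> list[list[dict]]:
        # One [{"prediction": ...}] entry per audio, matching app.py's mock.
        return [[{"prediction": "(mock) I don't know yet!"}] for _ in audios]


outputs = MockNatureLM()(["clip.wav"], ["What species is this?"])
print(outputs[0][0]["prediction"])  # → (mock) I don't know yet!
```

Any real implementation only needs to preserve this shape for the app's `get_response` path to work unchanged.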

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

.gitattributes ADDED
@@ -0,0 +1,39 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/*.mp3 filter=lfs diff=lfs merge=lfs -text
+ assets/*.m4a filter=lfs diff=lfs merge=lfs -text
+ assets/*.wav filter=lfs diff=lfs merge=lfs -text
+ assets/*.png filter=lfs diff=lfs merge=lfs -text
Dockerfile ADDED
@@ -0,0 +1,33 @@
+ FROM nvidia/cuda:12.6.3-cudnn-runtime-ubuntu22.04
+
+ ENV DEBIAN_FRONTEND=noninteractive
+ ENV PYTHONUNBUFFERED=1
+ ENV UV_NO_DEV=1
+ ENV UV_NO_CACHE=1
+ ENV GRADIO_ANALYTICS_ENABLED="False"
+
+ COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+
+ RUN apt-get update && apt-get install -y \
+     git \
+     git-lfs \
+     && apt-get clean \
+     && rm -rf /var/lib/apt/lists/* \
+     && git lfs install
+
+ # TODO: Pin esp-research and esp-data revisions
+ # TODO: remove hf-app branch once merged
+ RUN --mount=type=secret,id=GH_TOKEN,mode=0444,required=true \
+     git clone -b hf-app --single-branch --depth 1 https://$(cat /run/secrets/GH_TOKEN)@github.com/earthspecies/esp-research.git /app/esp-research && \
+     git clone --single-branch --depth 1 https://$(cat /run/secrets/GH_TOKEN)@github.com/earthspecies/esp-data.git /app/esp-data
+
+ # esp-research installs esp-data from the gcloud artifact registry, which is not
+ # what we want. Instead we modify esp-research to install esp-data directly from
+ # the clone and then do a sync.
+ WORKDIR /app/esp-research
+ RUN uv add /app/esp-data
+ RUN uv sync --frozen
+
+ WORKDIR /app/esp-research/projects/NatureLM-audio-hf-app
+ EXPOSE 7860
+ CMD ["uv", "run", "app.py"]
README.md ADDED
@@ -0,0 +1,29 @@
+ ---
+ title: NatureLM Audio Debug Private
+ emoji: 🔈
+ colorFrom: green
+ colorTo: green
+ sdk: docker
+ sdk_version: 6.9.0
+ app_port: 7860
+ pinned: false
+ license: apache-2.0
+ short_description: Analyze your bioacoustic data with NatureLM-audio
+ thumbnail: >-
+   https://cdn-uploads.huggingface.co/production/uploads/67e0630403121d657d96b0a4/VwZf6xhy8xz-AIr8rykvB.png
+
+ ---
+
+ # NatureLM-audio Demo
+
+ This is a demo of the NatureLM-audio model. Users can upload an audio file containing animal vocalizations and ask questions about them in a chat interface.
+
+ ## Usage
+
+ - **First Use**: The model will load automatically when you first use it (this may take a few minutes)
+ - **Subsequent Uses**: The model stays loaded for faster responses
+ - **Demo Mode**: If the model fails to load, the app will run in demo mode
+
+ ## Model Loading
+
+ The app uses lazy loading to start quickly. The model is only loaded when you first interact with it, not during app initialization. This prevents timeout issues on HuggingFace Spaces.
app.py ADDED
@@ -0,0 +1,571 @@
+ import os
+ import uuid
+ from pathlib import Path
+
+ import gradio as gr
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import soundfile as sf
+ import spaces
+ import torch
+ import torchaudio
+
+ from esp_research.logging import logger
+ from hub_logger import upload_data
+
+ # from NatureLM.infer import Pipeline
+ # from NatureLM.models.NatureLM import NatureLM
+ from naturelm_audio import NatureLM  # noqa: F401
+
+ APP_DIR = Path(__file__).resolve().parent
+ STATIC_DIR = APP_DIR / "static"
+ ASSETS_DIR = APP_DIR / "assets"
+
+ SAMPLE_RATE = 16000  # Default sample rate for NatureLM-audio
+ MIN_AUDIO_DURATION: float = 0.5  # seconds
+ MAX_HISTORY_TURNS = 3  # Maximum number of conversation turns to include in context (user + assistant pairs)
+
+ DEVICE: str = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # TODO: derive model version from model metadata or config instead of hardcoding
+ MODEL_VERSION = "1.5"
+
+
+ class _MockModel:
+     """Placeholder model that returns dummy predictions."""
+
+     def __call__(
+         self,
+         audios: list[str],
+         queries: list[str],
+         **kwargs: object,
+     ) -> list[list[dict]]:
+         return [[{"prediction": "(mock) I don't know yet!"}] for _ in audios]
+
+
+ # TODO: replace with real model loading
+
+ # model = NatureLM.from_pretrained("EarthSpeciesProject/NatureLM-audio")
+ # model = model.eval().to(DEVICE)
+ # model = Pipeline(model)
+ logger.info("Device: %s", DEVICE)
+ model = _MockModel()
+
+
+ def validate_audio_duration(audio_path: str) -> None:
+     """Validate that the audio file meets the minimum duration requirement.
+
+     Parameters
+     ----------
+     audio_path : str
+         Path to the audio file.
+
+     Raises
+     ------
+     gr.Error
+         If the audio duration is less than `MIN_AUDIO_DURATION`.
+     """
+     info = sf.info(audio_path)
+     duration = info.duration  # info.num_frames / info.sample_rate
+     if duration < MIN_AUDIO_DURATION:
+         raise gr.Error(f"Audio duration must be at least {MIN_AUDIO_DURATION} seconds.")
+
+
+ @spaces.GPU
+ def prompt_lm(
+     audios: list[str],
+     queries: list[str] | str,
+     window_length_seconds: float = 10.0,
+     hop_length_seconds: float = 10.0,
+ ) -> list[list[dict]] | str:
+     """Generate a response using the model.
+
+     Parameters
+     ----------
+     audios : list[str]
+         List of audio file paths.
+     queries : list[str] | str
+         Query or list of queries to process.
+     window_length_seconds : float
+         Length of the window for processing audio.
+     hop_length_seconds : float
+         Hop length for processing audio.
+
+     Returns
+     -------
+     list[list[dict]] | str
+         Nested list of prediction dictionaries for each audio-query pair,
+         or an error message if the model is not loaded.
+     """
+     if model is None:
+         return "❌ Model not loaded. Please check the model configuration."
+
+     with torch.amp.autocast(device_type="cuda", dtype=torch.float16):
+         results: list[list[dict]] = model(
+             audios,
+             queries,
+             window_length_seconds=window_length_seconds,
+             hop_length_seconds=hop_length_seconds,
+             input_sample_rate=None,
+         )
+     return results
+
+
+ def get_response(chatbot_history: list[dict], audio_input: str) -> list[dict]:
+     """Generate response from the model based on user input and audio file.
+
+     Parameters
+     ----------
+     chatbot_history : list[dict]
+         Current chat history with conversation context.
+     audio_input : str
+         Path to the audio file.
+
+     Returns
+     -------
+     list[dict]
+         Updated chat history with the model response appended.
+     """
+     try:
+         # Warn if conversation is getting long
+         num_turns = len(chatbot_history)
+         if num_turns > MAX_HISTORY_TURNS * 2:  # Each turn = user + assistant message
+             gr.Warning(
+                 "⚠️ Long conversations may affect response quality."
+                 " Consider starting a new conversation with the Clear button."
+             )
+
+         # Build conversation context from history
+         conversation_context = []
+         for message in chatbot_history:
+             if message["role"] == "user":
+                 conversation_context.append(f"User: {message['content']}")
+             elif message["role"] == "assistant":
+                 conversation_context.append(f"Assistant: {message['content']}")
+
+         # Get the last user message
+         last_user_message = ""
+         for message in reversed(chatbot_history):
+             if message["role"] == "user":
+                 last_user_message = message["content"]
+                 break
+
+         # Format the full prompt with conversation history
+         if len(conversation_context) > 2:  # More than just the current query
+             # Include previous turns (limit to last MAX_HISTORY_TURNS exchanges)
+             # recent_context = conversation_context[
+             #     -(MAX_HISTORY_TURNS + 1) : -1
+             # ]  # Exclude current message
+             recent_context = conversation_context
+
+             full_prompt = (
+                 "Previous conversation:\n" + "\n".join(recent_context) + "\n\nCurrent question: " + last_user_message
+             )
+         else:
+             full_prompt = last_user_message
+
+         logger.debug("Full prompt with history: %s", full_prompt)
+
+         response = prompt_lm(
+             audios=[audio_input],
+             queries=[full_prompt.strip()],
+             window_length_seconds=100_000,
+             hop_length_seconds=100_000,
+         )
+         # Get the first prediction
+         if isinstance(response, list) and len(response) > 0:
+             response = response[0][0]["prediction"]
+             logger.info("Model response: %s", response)
+         else:
+             response = "No response generated."
+     except Exception as e:
+         logger.exception("Error generating response: %s", e)
+         response = "Error generating response. Please try again."
+
+     # Add model response to chat history
+     chatbot_history.append({"role": "assistant", "content": response})
+
+     return chatbot_history
+
+
+ def plot_spectrogram(audio: torch.Tensor) -> plt.Figure:
+     """Generate a spectrogram from the audio tensor.
+
+     Parameters
+     ----------
+     audio : torch.Tensor
+         Audio tensor.
+
+     Returns
+     -------
+     plt.Figure
+         Matplotlib figure with the spectrogram.
+     """
+     spectrogram = torchaudio.transforms.Spectrogram(n_fft=1024)(audio)
+     spectrogram = spectrogram.numpy()[0].squeeze()
+
+     fig, ax = plt.subplots(figsize=(13, 5))
+
+     ax.imshow(np.log(spectrogram + 1e-4), aspect="auto", origin="lower", cmap="viridis")
+     ax.set_title("Spectrogram")
+
+     # Set x ticks to reflect 0 to audio duration in seconds
+     if audio.dim() > 1:
+         duration = audio.size(1) / SAMPLE_RATE
+     else:
+         duration = audio.size(0) / SAMPLE_RATE
+     ax.set_xlabel("Time")
+     ax.set_xticks([0, spectrogram.shape[1]])
+     ax.set_xticklabels(["0s", f"{duration:.2f}s"])
+
+     ax.set_ylabel("Frequency")
+     ax.set_yticks(
+         [
+             0,
+             spectrogram.shape[0] // 4,
+             spectrogram.shape[0] // 2,
+             3 * spectrogram.shape[0] // 4,
+             spectrogram.shape[0] - 1,
+         ]
+     )
+     # Set y ticks to reflect 0 to the Nyquist frequency (sample_rate / 2)
+     nyquist_freq = SAMPLE_RATE / 2
+     ax.set_yticklabels(
+         [
+             "0 Hz",
+             f"{nyquist_freq / 4:.0f} Hz",
+             f"{nyquist_freq / 2:.0f} Hz",
+             f"{3 * nyquist_freq / 4:.0f} Hz",
+             f"{nyquist_freq:.0f} Hz",
+         ]
+     )
+
+     fig.tight_layout()
+
+     return fig
+
+
+ def make_spectrogram_figure(audio_input: str) -> plt.Figure:
+     """Load the audio file (falling back to silence) and plot its spectrogram."""
+     audio = torch.zeros(1, SAMPLE_RATE)
+     if audio_input:
+         try:
+             audio, _ = torchaudio.load(audio_input)
+         except Exception:
+             logger.exception("Error loading audio file %s", audio_input)
+     return plot_spectrogram(audio)
+
+
+ def add_user_query(chatbot_history: list[dict], chat_input: str) -> list[dict]:
+     """Add user message to chat history.
+
+     Parameters
+     ----------
+     chatbot_history : list[dict]
+         Current chat history.
+     chat_input : str
+         User's input text.
+
+     Returns
+     -------
+     list[dict]
+         Updated chat history with the user message appended.
+     """
+     # Validate input
+     if not chat_input.strip():
+         return chatbot_history
+
+     chatbot_history.append({"role": "user", "content": chat_input.strip()})
+     return chatbot_history
+
+
+ def log_to_hub(chatbot_history: list[dict], audio: str, session_id: str) -> None:
+     """Upload the latest user/assistant exchange to the hub."""
+     if not chatbot_history or len(chatbot_history) < 2:
+         return
+     user_text = chatbot_history[-2]["content"]
+     model_response = chatbot_history[-1]["content"]
+     upload_data(audio, user_text, model_response, session_id, model_version=MODEL_VERSION)
+
+
+ def main() -> tuple[gr.Blocks, gr.themes.Base]:
+     laz_audio = ASSETS_DIR / "Lazuli_Bunting_yell-YELLLAZB20160625SM303143.mp3"
+     frog_audio = ASSETS_DIR / "nri-GreenTreeFrogEvergladesNP.mp3"
+     robin_audio = ASSETS_DIR / "yell-YELLAMRO20160506SM3.mp3"
+     whale_audio = ASSETS_DIR / "Humpback Whale - Megaptera novaeangliae.wav"
+     crow_audio = ASSETS_DIR / "American Crow - Corvus brachyrhynchos.mp3"
+
+     examples = {
+         "Identifying Focal Species (Lazuli Bunting)": [
+             str(laz_audio),
+             "What is the common name for the focal species in the audio?",
+         ],
+         "Caption the audio (Green Tree Frog)": [
+             str(frog_audio),
+             "Caption the audio, using the common name for any animal species.",
+         ],
+         "Caption the audio (American Robin)": [
+             str(robin_audio),
+             "Caption the audio, using the scientific name for any animal species.",
+         ],
+         "Identifying Focal Species (Megaptera novaeangliae)": [
+             str(whale_audio),
+             "What is the scientific name for the focal species in the audio?",
+         ],
+         "Speaker Count (American Crow)": [
+             str(crow_audio),
+             "How many individuals are vocalizing in this audio?",
+         ],
+         "Caption the audio (Humpback Whale)": [str(whale_audio), "Caption the audio."],
+     }
+
+     gr.set_static_paths(paths=[ASSETS_DIR])
+
+     theme = gr.themes.Base(primary_hue="blue", font=[gr.themes.GoogleFont("Noto Sans")])
+
+     with gr.Blocks(
+         title="NatureLM-audio",
+         theme=theme,
+     ) as app:
+         with gr.Row():
+             gr.HTML((STATIC_DIR / "header.html").read_text())
+
+         with gr.Tabs():
+             with gr.Tab("Analyze Audio"):
+                 session_id = gr.State(str(uuid.uuid4()))
+                 # uploaded_audio = gr.State()
+                 # Status indicator
+                 # status_text = gr.Textbox(
+                 #     value=model_manager.get_status(),
+                 #     label="Model Status",
+                 #     interactive=False,
+                 #     visible=True,
+                 # )
+
+                 with gr.Column(visible=True) as onboarding_message:
+                     gr.HTML(
+                         (STATIC_DIR / "onboarding.html").read_text(),
+                         padding=False,
+                     )
+
+                 with gr.Column(visible=True) as upload_section:
+                     audio_input = gr.Audio(
+                         container=True,
+                         interactive=True,
+                         sources=["upload"],
+                     )
+                     # Check that the audio duration exceeds MIN_AUDIO_DURATION,
+                     # raising gr.Error otherwise
+                     audio_input.change(
+                         fn=validate_audio_duration,
+                         inputs=[audio_input],
+                         outputs=[],
+                     )
+
+                 with gr.Accordion(label="Toggle Spectrogram", open=False, visible=False) as spectrogram:
+                     plotter = gr.Plot(
+                         plot_spectrogram(torch.zeros(1, SAMPLE_RATE)),
+                         label="Spectrogram",
+                         visible=False,
+                         elem_id="spectrogram-plot",
+                     )
+                 with gr.Column(visible=False) as tasks:
+                     task_dropdown = gr.Dropdown(
+                         [
+                             "What are the common names for the species in the audio, if any?",
+                             "Caption the audio, using the scientific name for any animal species.",
+                             "Caption the audio, using the common name for any animal species.",
+                             "What is the scientific name for the focal species in the audio?",
+                             "What is the common name for the focal species in the audio?",
+                             "What is the family of the focal species in the audio?",
+                             "What is the genus of the focal species in the audio?",
+                             "What is the taxonomic name of the focal species in the audio?",
+                             "What call types are heard from the focal species in the audio?",
+                             "What is the life stage of the focal species in the audio?",
+                         ],
+                         label="Pre-Loaded Tasks",
+                         info="Select a task, or write your own prompt below.",
+                         allow_custom_value=False,
+                         value=None,
+                     )
+                 with gr.Group(visible=False) as chat:
+                     chatbot = gr.Chatbot(
+                         elem_id="chatbot",
+                         height=250,
+                         label="Chat",
+                         render_markdown=False,
+                         group_consecutive_messages=False,
+                         feedback_options=[
+                             "like",
+                             "dislike",
+                             "wrong species",
+                             "incorrect response",
+                             "other",
+                         ],
+                         resizable=True,
+                     )
+                     with gr.Column():
+                         chat_input = gr.Textbox(
+                             placeholder="Type your message and press Enter to send",
+                             lines=1,
+                             show_label=False,
+                             submit_btn="Send",
+                             container=True,
+                             autofocus=False,
+                             elem_id="chat-input",
+                         )
+
+                 with gr.Column():
+                     gr.Examples(
+                         list(examples.values()),
+                         [audio_input, chat_input],
+                         [audio_input, chat_input],
+                         example_labels=list(examples.keys()),
+                         examples_per_page=20,
+                     )
+
+                 def validate_and_submit(chatbot_history: list[dict], chat_input: str) -> tuple[list[dict], str]:
+                     if not chat_input or not chat_input.strip():
+                         gr.Warning("Please enter a question or message before sending.")
+                         return chatbot_history, chat_input
+
+                     updated_history = add_user_query(chatbot_history, chat_input)
+                     return updated_history, ""
+
+                 clear_button = gr.ClearButton(
+                     components=[chatbot, chat_input, audio_input, plotter],
+                     visible=False,
+                 )
+
+                 # If a task is selected in the dropdown, set chat_input to that value
+                 def set_query(task: str | None) -> dict:
+                     if task:
+                         return gr.update(value=task)
+                     return gr.update(value="")
+
+                 task_dropdown.select(
+                     fn=set_query,
+                     inputs=[task_dropdown],
+                     outputs=[chat_input],
+                 )
+
+                 def start_chat_interface(audio_path: str) -> tuple:
+                     return (
+                         gr.update(visible=False),  # hide onboarding message
+                         gr.update(visible=True),  # show upload section
+                         gr.update(visible=True),  # show spectrogram
+                         gr.update(visible=True),  # show tasks
+                         gr.update(visible=True),  # show chat box
+                         gr.update(visible=True),  # show plotter
+                     )
+
+                 # When audio is added, reveal the interface and set the spectrogram
+                 audio_input.change(
+                     fn=start_chat_interface,
+                     inputs=[audio_input],
+                     outputs=[
+                         onboarding_message,
+                         upload_section,
+                         spectrogram,
+                         tasks,
+                         chat,
+                         plotter,
+                     ],
+                 ).then(
+                     fn=make_spectrogram_figure,
+                     inputs=[audio_input],
+                     outputs=[plotter],
+                 )
+
+                 # When submit is clicked:
+                 # 1. Validate and add the user query to the chat history
+                 # 2. Get a response from the model
+                 # 3. Show the clear button
+                 # 4. Log the exchange to the hub
+                 chat_input.submit(
+                     validate_and_submit,
+                     inputs=[chatbot, chat_input],
+                     outputs=[chatbot, chat_input],
+                 ).then(
+                     get_response,
+                     inputs=[chatbot, audio_input],
+                     outputs=[chatbot],
+                 ).then(
+                     lambda: gr.update(visible=True),  # Show clear button
+                     None,
+                     [clear_button],
+                 ).then(
+                     log_to_hub,
+                     [chatbot, audio_input, session_id],
+                     None,
+                 )
+
+                 clear_button.click(lambda: gr.ClearButton(visible=False), None, [clear_button])
+
+             with gr.Tab("Sample Library"):
+                 with gr.Row():
+                     with gr.Column():
+                         gr.Markdown("### Download Sample Audio")
+                         gr.Markdown(
+                             "Feel free to explore these sample audio files."
+                             " To download, click the button in the"
+                             " top-right corner of each audio file."
+                             " You can also find a large collection of"
+                             " publicly available animal sounds on"
+                             " [Xeno-canto](https://xeno-canto.org/explore/taxonomy)"
+                             " and the [Watkins Marine Mammal Sound Database]"
+                             "(https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm)."
+                         )
+                         samples = [
+                             (
+                                 str(ASSETS_DIR / "Lazuli_Bunting_yell-YELLLAZB20160625SM303143.m4a"),
+                                 "Lazuli Bunting",
+                             ),
+                             (
+                                 str(ASSETS_DIR / "nri-GreenTreeFrogEvergladesNP.mp3"),
+                                 "Green Tree Frog",
+                             ),
+                             (
+                                 str(ASSETS_DIR / "American Crow - Corvus brachyrhynchos.mp3"),
+                                 "American Crow",
+                             ),
+                             (
+                                 str(ASSETS_DIR / "Gray Wolf - Canis lupus italicus.m4a"),
+                                 "Gray Wolf",
+                             ),
+                             (
+                                 str(ASSETS_DIR / "Humpback Whale - Megaptera novaeangliae.wav"),
+                                 "Humpback Whale",
+                             ),
+                             (str(ASSETS_DIR / "Walrus - Odobenus rosmarus.wav"), "Walrus"),
+                         ]
+                         for row_i in range(0, len(samples), 3):
+                             with gr.Row():
+                                 for filepath, label in samples[row_i : row_i + 3]:
+                                     with gr.Column():
+                                         gr.Audio(
+                                             filepath,
+                                             label=label,
+                                         )
+
+             with gr.Tab("💡 Help"):
+                 gr.HTML((STATIC_DIR / "help.html").read_text())
+
+     app.css = (STATIC_DIR / "style.css").read_text()
+
+     return app, theme
+
+
+ # Create and launch the app
+ if __name__ == "__main__":
+     app, theme = main()
+
+     # Docker-based HF Spaces require root_path so Gradio generates correct
+     # URLs behind the reverse proxy (the Gradio SDK sets this automatically).
+     root_path = os.environ.get("GRADIO_ROOT_PATH", "")
+
+     app.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         root_path=root_path,
+         allowed_paths=[str(ASSETS_DIR)],
+     )
assets/484366__spacejoe__bird-3.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0d21ce7228fd9fc3277b89fad5d54ff039d45da93c426459212fccbba776a75e
+ size 272820
assets/American Crow - Corvus brachyrhynchos.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d0f76bff28d3e3021be495754b28ef3924bc32ff0c657b67bd4ee6bb177a1f8e
+ size 2164626
assets/ESP_logo_white.png ADDED
Git LFS Details
  • SHA256: 08477bf0160a9b9eedaed4e2898b0a708256bb6104e84e57d28c65a37c27a63d
  • Pointer size: 131 Bytes
  • Size of remote file: 150 kB
assets/Eastern Gray Squirrel - Sciurus carolinensis.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:65e0d72b3979b371e45852af73037c009c93a30c7d8ea64ab18616f1947d4101
+ size 1447652
assets/Gray Wolf - Canis lupus italicus.m4a ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1bc16a34573163e262741561278fe610144235ca95c9c4a6172b2b41feb5f52
+ size 125428
assets/Humpback Whale - Megaptera novaeangliae.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3d9afb0de912a926ebac971c9ca6923fa03fd64cd029b04a195b69d79c0b7dc7
+ size 272560
assets/Lazuli_Bunting_yell-YELLLAZB20160625SM303143.m4a ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:822d5b59ef6f6a6f1e84465da23e926a7ce0393ac7f0bdcf81cbabe1c52c1112
+ size 333009
assets/Lazuli_Bunting_yell-YELLLAZB20160625SM303143.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6a67960286021e58ffab2d3e4b67b7e20d08b530018c64c6afefe4aae5ff28be
+ size 316920
assets/Sample_Audio_Files_NatureLM_audio.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f4655b9f1354485fd71480c6030da53b334f8552bf9a9afff4b9320192eb7a7a
+ size 2002662
assets/Walrus - Odobenus rosmarus.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:14926a9ee914ba512009e116f1bbb4424cdcae52fafe52068d197e056f04c567
+ size 305710
assets/esp_favicon.png ADDED
Git LFS Details
  • SHA256: c584444dc70faaa19d002aeb7104cf13ef9c226f910a27a542774499256f3810
  • Pointer size: 129 Bytes
  • Size of remote file: 3.57 kB
assets/esp_logo.png ADDED
assets/naturelm-audio-overiew.png ADDED
Git LFS Details
  • SHA256: 0f2d1d4d68e34caf630f1a11859ab3a7d370ea8a64829ea906c8c7aa274a56c0
  • Pointer size: 131 Bytes
  • Size of remote file: 286 kB
assets/nri-GreenTreeFrogEvergladesNP.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3004b02bd1793db81f5e6ddfe2f805dbd587af3c0d03edbedec2ad23e92660dd
+ size 162234
assets/nri-SensationJazz.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4dfada309221e16f7c38b569b6c46e78ecd181b4d3bc7a7114bb2384e24b797f
+ size 134772
assets/nri-StreamMUWO.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4d55ce7e299b7d2d9ab50aee2d28233f05662886cbb57e792aa210d39dd73744
+ size 63536
assets/nri-battlesounds.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:86d491e1b41cfb9f75ce1a51aea3e06b558aef91fb9a88991de0d89cdffd72ae
+ size 87838
assets/yell-YELLAMRO20160506SM3.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a2700bbe2233505ccf592e9e06a4b196a0feb4d2d7a4773ed5f2f110696a001
+ size 598352
assets/yell-YELLFLBCSACR20075171.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23371e93ed2dc6c43cfe8ada4125a2f15bcff19946e9efe969c8ca03caa60df8
+ size 390212
assets/yell-YELLWolfvCar20160111T22ms2.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1520267dbb85294fbca670c02404d7e64248cec02b29167def519f2e35194a0d
+ size 638311
hub_logger.py ADDED
@@ -0,0 +1,66 @@
+ import json
+ import os
+ import uuid
+ from pathlib import Path
+
+ from huggingface_hub import HfApi, HfFileSystem
+
+ DATASET_REPO = "EarthSpeciesProject/naturelm-audio-space-logs"
+ SPLIT = "test"
+ TESTING = os.getenv("TESTING", "0") == "1"
+ api = HfApi(token=os.getenv("HF_TOKEN", None))
+ hf_fs = HfFileSystem(token=os.getenv("HF_TOKEN", None))
+
+
+ def upload_data(
+     audio: str | Path,
+     user_text: str,
+     model_response: str,
+     session_id: str = "",
+     model_version: str = "",
+ ) -> None:
+     data_id = str(uuid.uuid4())
+
+     if TESTING:
+         data_id = "test-" + data_id
+         session_id = "test-" + session_id
+
+     # Audio path in the repo
+     suffix = Path(audio).suffix
+     audio_p = f"{SPLIT}/audio/" + session_id + suffix
+
+     # Upload the audio only if it does not already exist
+     if not hf_fs.exists(f"datasets/{DATASET_REPO}/{audio_p}"):
+         api.upload_file(
+             path_or_fileobj=str(audio),
+             path_in_repo=audio_p,
+             repo_id=DATASET_REPO,
+             repo_type="dataset",
+         )
+
+     text = {
+         "user_message": user_text,
+         "model_response": model_response,
+         "file_name": "audio/" + session_id + suffix,  # has to be relative to metadata.jsonl
+         "original_fn": os.path.basename(audio),
+         "id": data_id,
+         "session_id": session_id,
+         "model_version": model_version,
+     }
+
+     # Append to a jsonl file in the repo
+     # APPEND DOESN'T WORK, have to read the existing contents first
+     if hf_fs.exists(f"datasets/{DATASET_REPO}/{SPLIT}/metadata.jsonl"):
+         with hf_fs.open(f"datasets/{DATASET_REPO}/{SPLIT}/metadata.jsonl", "r") as f:
+             lines = f.readlines()
+         lines.append(json.dumps(text) + "\n")
+         with hf_fs.open(f"datasets/{DATASET_REPO}/{SPLIT}/metadata.jsonl", "w") as f:
+             f.writelines(lines)
+     else:
+         with hf_fs.open(f"datasets/{DATASET_REPO}/{SPLIT}/metadata.jsonl", "w") as f:
+             f.write(json.dumps(text) + "\n")
+
+     # Write a separate file instead
+     # with hf_fs.open(f"datasets/{DATASET_REPO}/{data_id}.json", "w") as f:
+     #     json.dump(text, f)
infer.py ADDED
@@ -0,0 +1,347 @@
+ # """Run NatureLM-audio over a set of audio files paths or a directory with audio files."""
+
+ # import argparse
+ # from pathlib import Path
+
+ # import librosa
+ # import numpy as np
+ # import pandas as pd
+ # import torch
+
+ # from NatureLM.config import Config
+ # from NatureLM.models import NatureLM
+ # from NatureLM.processors import NatureLMAudioProcessor
+ # from NatureLM.utils import move_to_device
+
+ # _MAX_LENGTH_SECONDS = 10
+ # _MIN_CHUNK_LENGTH_SECONDS = 0.5
+ # _SAMPLE_RATE = 16000  # Assuming the model uses a sample rate of 16kHz
+ # _AUDIO_FILE_EXTENSIONS = [".wav", ".mp3", ".flac", ".ogg", ".mp4"]  # Add other audio file formats as needed
+ # _DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ # __root_dir = Path(__file__).parent.parent
+ # _DEFAULT_CONFIG_PATH = __root_dir / "configs" / "inference.yml"
+
+
+ # def load_model_and_config(
+ #     cfg_path: str | Path = _DEFAULT_CONFIG_PATH, device: str = _DEVICE
+ # ) -> tuple[NatureLM, Config]:
+ #     """Load the NatureLM model and configuration.
+ #     Returns:
+ #         tuple: The loaded model and configuration.
+ #     """
+ #     model = NatureLM.from_pretrained("EarthSpeciesProject/NatureLM-audio")
+ #     model = model.to(device).eval()
+ #     model.llama_tokenizer.pad_token_id = model.llama_tokenizer.eos_token_id
+ #     model.llama_model.generation_config.pad_token_id = model.llama_tokenizer.pad_token_id
+
+ #     cfg = Config.from_sources(cfg_path)
+ #     return model, cfg
+
+
+ # def output_template(model_output: str, start_time: float, end_time: float) -> str:
+ #     """Format the output of the model.
+
+ #     Returns
+ #     -------
+ #     str
+ #         Formatted string with timestamps and model output.
+ #     """
+ #     return f"#{start_time:.2f}s - {end_time:.2f}s#: {model_output}\n"
+
+
+ # def sliding_window_inference(
+ #     audio: str | Path | np.ndarray,
+ #     query: str,
+ #     processor: NatureLMAudioProcessor,
+ #     model: NatureLM,
+ #     cfg: Config,
+ #     window_length_seconds: float = 10.0,
+ #     hop_length_seconds: float = 10.0,
+ #     input_sr: int = _SAMPLE_RATE,
+ #     device: str = _DEVICE,
+ # ) -> list[dict[str, any]]:
+ #     """Run inference on a long audio file using sliding window approach.
+
+ #     Args:
+ #         audio (str | Path | np.ndarray): Path to the audio file.
+ #         query (str): Query for the model.
+ #         processor (NatureLMAudioProcessor): Audio processor.
+ #         model (NatureLM): NatureLM model.
+ #         cfg (Config): Model configuration.
+ #         window_length_seconds (float): Length of the sliding window in seconds.
+ #         hop_length_seconds (float): Hop length for the sliding window in seconds.
+ #         input_sr (int): Sample rate of the audio file.
+
+ #     Returns:
+ #         str: The output of the model.
+
+ #     Raises:
+ #         ValueError: If the audio file is too short or if the audio file path is invalid.
+ #     """
+ #     if isinstance(audio, str) or isinstance(audio, Path):
+ #         audio_array, input_sr = librosa.load(str(audio), sr=None, mono=False)
+ #     elif isinstance(audio, np.ndarray):
+ #         audio_array = audio
+ #         print(f"Using provided sample rate: {input_sr}")
+
+ #     audio_array = audio_array.squeeze()
+ #     if audio_array.ndim > 1:
+ #         axis_to_average = int(np.argmin(audio_array.shape))
+ #         audio_array = audio_array.mean(axis=axis_to_average)
+ #         audio_array = audio_array.squeeze()
+
+ #     # Do initial check that the audio is long enough
+ #     if audio_array.shape[-1] < int(_MIN_CHUNK_LENGTH_SECONDS * input_sr):
+ #         raise ValueError(f"Audio is too short. Minimum length is {_MIN_CHUNK_LENGTH_SECONDS} seconds.")
+
+ #     start = 0
+ #     stride = int(hop_length_seconds * input_sr)
+ #     window_length = int(window_length_seconds * input_sr)
+ #     window_id = 0
+
+ #     output = []  # Initialize output list
+ #     while True:
+ #         chunk = audio_array[start : start + window_length]
+ #         if chunk.shape[-1] < int(_MIN_CHUNK_LENGTH_SECONDS * input_sr):
+ #             break
+
+ #         # Resamples, pads, truncates and creates torch Tensor
+ #         audio_tensor, prompt_list = processor([chunk], [query], [input_sr])
+
+ #         input_to_model = {
+ #             "raw_wav": audio_tensor,
+ #             "prompt": prompt_list[0],
+ #             "audio_chunk_sizes": 1,
+ #             "padding_mask": torch.zeros_like(audio_tensor).to(torch.bool),
+ #         }
+ #         input_to_model = move_to_device(input_to_model, device)
+
+ #         # generate
+ #         prediction: str = model.generate(input_to_model, cfg.generate, prompt_list)[0]
+
+ #         # Post-process the prediction
+ #         # prediction = output_template(prediction, start / input_sr, (start + window_length) / input_sr)
+ #         # output += prediction
+ #         output.append(
+ #             {
+ #                 "start_time": start / input_sr,
+ #                 "end_time": (start + window_length) / input_sr,
+ #                 "prediction": prediction,
+ #                 "window_number": window_id,
+ #             }
+ #         )
+
+ #         # Move the window
+ #         start += stride
+
+ #         if start + window_length > audio_array.shape[-1]:
+ #             break
+
+ #     return output
+
+
+ # class Pipeline:
+ #     """Pipeline for running NatureLM-audio inference on a list of audio files or audio arrays"""
+
+ #     def __init__(self, model: NatureLM = None, cfg_path: str | Path = _DEFAULT_CONFIG_PATH) -> None:
+ #         self.cfg_path = cfg_path
+
+ #         # Load model and config
+ #         if model is not None:
+ #             self.cfg = Config.from_sources(cfg_path)
+ #             self.model = model
+ #         else:
+ #             # Download model from hub
+ #             self.model, self.cfg = load_model_and_config(cfg_path)
+
+ #         self.processor = NatureLMAudioProcessor(sample_rate=_SAMPLE_RATE, max_length_seconds=_MAX_LENGTH_SECONDS)
+
+ #     def __call__(
+ #         self,
+ #         audios: list[str | Path | np.ndarray],
+ #         queries: str | list[str],
+ #         window_length_seconds: float = 10.0,
+ #         hop_length_seconds: float = 10.0,
+ #         input_sample_rate: int = _SAMPLE_RATE,
+ #         verbose: bool = False,
+ #     ) -> list[str]:
+ #         """Run inference on a list of audio file paths or a single audio file with a
+ #         single query or a list of queries. If multiple queries are provided,
+ #         we assume that they are in the same order as the audio files. If a single query
+ #         is provided, it will be used for all audio files.
+
+ #         Args:
+ #             audios (list[str | Path | np.ndarray]): List of audio file paths or a single audio
+ #                 file path or audio array(s)
+ #             queries (str | list[str]): Queries for the model.
+ #             window_length_seconds (float): Length of the sliding window in seconds. Defaults to 10.0.
+ #             hop_length_seconds (float): Hop length for the sliding window in seconds. Defaults to 10.0.
+ #             input_sample_rate (int): Sample rate of the audio. Defaults to 16000, which is the model's sample rate.
+ #             verbose (bool): If True, print the output of the model for each audio file.
+ #                 Defaults to False.
+
+ #         Returns:
+ #             list[list[dict]]: List of model outputs for each audio file. Each output is a list of dictionaries
+ #                 containing the start time, end time, and prediction for each chunk of audio.
+
+ #         Raises:
+ #             ValueError: If the number of audio files and queries do not match.
+ #         """
+ #         if isinstance(audios, str) or isinstance(audios, Path):
+ #             audios = [audios]
+
+ #         if isinstance(queries, str):
+ #             queries = [queries] * len(audios)
+
+ #         if len(audios) != len(queries):
+ #             raise ValueError("Number of audio files and queries must match.")
+
+ #         # Run inference
+ #         results = []
+ #         for audio, query in zip(audios, queries, strict=False):
+ #             output = sliding_window_inference(
+ #                 audio,
+ #                 query,
+ #                 self.processor,
+ #                 self.model,
+ #                 self.cfg,
+ #                 window_length_seconds,
+ #                 hop_length_seconds,
+ #                 input_sr=input_sample_rate,
+ #             )
+ #             results.append(output)
+ #             if verbose:
+ #                 print(f"Processed {audio}, model output:\n=======\n{output}\n=======")
+ #         return results
+
+
+ # def parse_args() -> argparse.Namespace:
+ #     parser = argparse.ArgumentParser("Run NatureLM-audio inference")
+ #     parser.add_argument(
+ #         "-a",
+ #         "--audio",
+ #         type=str,
+ #         required=True,
+ #         help="Path to an audio file or a directory containing audio files",
+ #     )
+ #     parser.add_argument("-q", "--query", type=str, required=True, help="Query for the model")
+ #     parser.add_argument(
+ #         "--cfg-path",
+ #         type=str,
+ #         default="configs/inference.yml",
+ #         help="Path to the configuration file for the model",
+ #     )
+ #     parser.add_argument(
+ #         "--output_path",
+ #         type=str,
+ #         default="inference_output.jsonl",
+ #         help="Output path for the results",
+ #     )
+ #     parser.add_argument(
+ #         "--window_length_seconds",
+ #         type=float,
+ #         default=10.0,
+ #         help="Length of the sliding window in seconds",
+ #     )
+ #     parser.add_argument(
+ #         "--hop_length_seconds",
+ #         type=float,
+ #         default=10.0,
+ #         help="Hop length for the sliding window in seconds",
+ #     )
+ #     args = parser.parse_args()
+
+ #     return args
+
+
+ # def main(
+ #     cfg_path: str | Path,
+ #     audio_path: str | Path,
+ #     query: str,
+ #     output_path: str,
+ #     window_length_seconds: float,
+ #     hop_length_seconds: float,
+ # ) -> None:
+ #     """Main function to run the NatureLM-audio inference script.
+ #     It takes command line arguments for audio file path, query, output path,
+ #     window length, and hop length. It processes the audio files and saves the
+ #     results to a CSV file.
+
+ #     Args:
+ #         cfg_path (str | Path): Path to the configuration file.
+ #         audio_path (str | Path): Path to the audio file or directory.
+ #         query (str): Query for the model.
+ #         output_path (str): Path to save the output results.
+ #         window_length_seconds (float): Length of the sliding window in seconds.
+ #         hop_length_seconds (float): Hop length for the sliding window in seconds.
+
+ #     Raises:
+ #         ValueError: If the audio file path is invalid or if the query is empty.
+ #         ValueError: If no audio files are found.
+ #         ValueError: If the audio file extension is not supported.
+ #     """
+
+ #     # Prepare sample
+ #     audio_path = Path(audio_path)
+ #     if audio_path.is_dir():
+ #         audio_paths = []
+ #         print(f"Searching for audio files in {str(audio_path)} with extensions {', '.join(_AUDIO_FILE_EXTENSIONS)}")
+ #         for ext in _AUDIO_FILE_EXTENSIONS:
+ #             audio_paths.extend(list(audio_path.rglob(f"*{ext}")))
+
+ #         print(f"Found {len(audio_paths)} audio files in {str(audio_path)}")
+ #     else:
+ #         # check that the extension is valid
+ #         if not any(audio_path.suffix == ext for ext in _AUDIO_FILE_EXTENSIONS):
+ #             raise ValueError(
+ #                 f"Invalid audio file extension. Supported extensions are: {', '.join(_AUDIO_FILE_EXTENSIONS)}"
+ #             )
+ #         audio_paths = [audio_path]
+
+ #     # check that query is not empty
+ #     if not query:
+ #         raise ValueError("Query cannot be empty")
+ #     if not audio_paths:
+ #         raise ValueError("No audio files found. Please check the path or file extensions.")
+
+ #     # Load model and config
+ #     model, cfg = load_model_and_config(cfg_path)
+
+ #     # Load audio processor
+ #     processor = NatureLMAudioProcessor(sample_rate=_SAMPLE_RATE, max_length_seconds=_MAX_LENGTH_SECONDS)
+
+ #     # Run inference
+ #     results = {"audio_path": [], "output": []}
+ #     for path in audio_paths:
+ #         output = sliding_window_inference(
+ #             path,
+ #             query,
+ #             processor,
+ #             model,
+ #             cfg,
+ #             window_length_seconds,
+ #             hop_length_seconds,
+ #         )
+ #         results["audio_path"].append(str(path))
+ #         results["output"].append(output)
+ #         print(f"Processed {path}, model output:\n=======\n{output}\n=======\n")
+
+ #     # Save results as a csv
+ #     output_path = Path(output_path)
+ #     output_path.parent.mkdir(parents=True, exist_ok=True)
+
+ #     df = pd.DataFrame(results)
+ #     df.to_json(output_path, orient="records", lines=True)
+ #     print(f"Results saved to {output_path}")
+
+
+ # if __name__ == "__main__":
+ #     args = parse_args()
+ #     main(
+ #         cfg_path=args.cfg_path,
+ #         audio_path=args.audio,
+ #         query=args.query,
+ #         output_path=args.output_path,
+ #         window_length_seconds=args.window_length_seconds,
+ #         hop_length_seconds=args.hop_length_seconds,
+ #     )
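Since `infer.py` is committed fully commented out pending the model's `.generate()` hook, the window/hop arithmetic it relies on can still be checked in isolation. `window_bounds` below is a hypothetical helper (not part of the repo) that reproduces just the chunk boundaries the loop in `sliding_window_inference` visits, including the half-second minimum-chunk cutoff:

```python
def window_bounds(
    num_samples: int,
    sr: int,
    window_length_seconds: float = 10.0,
    hop_length_seconds: float = 10.0,
    min_chunk_seconds: float = 0.5,
) -> list[tuple[int, int]]:
    """Sample-index (start, end) pairs visited by the sliding-window loop."""
    window = int(window_length_seconds * sr)
    stride = int(hop_length_seconds * sr)
    min_len = int(min_chunk_seconds * sr)
    bounds: list[tuple[int, int]] = []
    start = 0
    while True:
        # Slicing clamps to the end of the array, like audio_array[start : start + window].
        end = min(start + window, num_samples)
        if end - start < min_len:
            break
        bounds.append((start, end))
        start += stride
        # Like the original loop, stop once the next full window would overrun
        # the signal -- so a trailing partial window is never processed.
        if start + window > num_samples:
            break
    return bounds

sr = 16000
print(window_bounds(25 * sr, sr))  # → [(0, 160000), (160000, 320000)]
```

Note the exit condition means the last 5 s of a 25 s clip never reach the model with the default 10 s window and hop, which may or may not be intended.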
pyproject.toml ADDED
@@ -0,0 +1,22 @@
+ [project]
+ name = "naturelm-audio-hf-app"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     "esp-research",
+     "naturelm-audio",
+     "gradio>=6.9.0",
+     "spaces>=0.47.0",
+     "huggingface-hub>=1.5.0",
+     "soundfile>=0.13.1",
+     "torch>=2.7.1",
+     "torchaudio>=2.7.1",
+     "matplotlib>=3.10.8",
+     "numpy>=2.3.5",
+ ]
+
+ [tool.uv.sources]
+ esp-research = { workspace = true }
+ naturelm-audio = { workspace = true }  #TODO
static/header.html ADDED
@@ -0,0 +1,16 @@
+ <div style="display: flex; align-items: center; gap: 12px;">
+     <picture>
+         <source srcset="/gradio_api/file=assets/ESP_logo_white.png"
+                 media="(prefers-color-scheme: dark)">
+         <source srcset="/gradio_api/file=assets/esp_logo.png"
+                 media="(prefers-color-scheme: light)">
+         <img src="/gradio_api/file=assets/esp_logo.png"
+              alt="ESP Logo"
+              style="height: 40px; width: auto;">
+     </picture>
+     <h2 style="margin: 0;">NatureLM-audio<span style="
+         font-size: 0.55em; color: #28a745; background: #e6f4ea;
+         padding: 2px 6px; border-radius: 4px; margin-left: 8px;
+         display: inline-block; vertical-align: top;"
+     >BETA</span></h2>
+ </div>
static/help.html ADDED
@@ -0,0 +1,119 @@
+ <div class="banner">
+     <div style="display: flex; padding: 0px; align-items: center; flex: 1;">
+         <div style="font-size: 20px; margin-right: 12px;"></div>
+         <div style="flex: 1;">
+             <div class="banner-header">Help us improve the model!</div>
+             <div class="banner-text">
+                 Found an issue or have suggestions?
+                 Join us on Discourse to share feedback and questions.
+             </div>
+         </div>
+     </div>
+     <a href="https://earthspeciesproject.discourse.group/t/feedback-for-naturelm-audio-ui-hugging-face-spaces-demo/17"
+        target="_blank" class="link-btn">Share Feedback</a>
+ </div>
+ <div class="guide-section">
+     <h3>Getting Started</h3>
+     <ol style="margin-top: 12px; padding-left: 20px;
+         color: #6b7280; font-size: 14px; line-height: 1.6;">
+         <li style="margin-bottom: 8px;">
+             <strong>Upload your audio</strong> or click on a pre-loaded example.
+             Drag and drop your audio file containing animal vocalizations,
+             or click on an example.
+         </li>
+         <li style="margin-bottom: 8px;">
+             <strong>Trim your audio (if needed)</strong> by clicking the scissors
+             icon on the bottom right of the audio panel. Try to keep your audio
+             to 10 seconds or less.
+         </li>
+         <li style="margin-bottom: 8px;">
+             <strong>View the Spectrogram (optional)</strong>. You can easily
+             view/hide the spectrogram of your audio for closer analysis.
+         </li>
+         <li style="margin-bottom: 8px;">
+             <strong>Select a task or write your own</strong>. Select an option
+             from pre-loaded tasks. This will auto-fill the text box with a prompt,
+             so all you have to do is hit Send. Or, type a custom prompt directly
+             into the chat.
+         </li>
+         <li style="margin-bottom: 0;">
+             <strong>Send and Analyze Audio</strong>. Press "Send" or type Enter
+             to begin processing your audio. Ask follow-up questions or press
+             "Clear" to start a new conversation.
+         </li>
+     </ol>
+ </div>
+ <div class="guide-section">
+     <h3>Tips</h3>
+     <b>Prompting Best Practices</b>
+     <ul style="margin-top: 12px; padding-left: 20px;
+         color: #6b7280; font-size: 14px; line-height: 1.6;">
+         <li>
+             When possible, use scientific or taxonomic names and mention
+             the context if known (geographic area/location, time of day
+             or year, habitat type)
+         </li>
+         <li>Ask one question at a time, and be specific about what
+             you want to know</li>
+         <ul>&#10060; Don't ask:
+             <i>"Analyze this audio and tell me all you know about it."</i>
+         </ul>
+         <ul>&#9989; Do ask:
+             <i>"What species made this sound?"</i>
+         </ul>
+         <li>Keep prompts more open-ended and avoid asking Yes/No
+             or very targeted questions</li>
+         <ul>&#10060; Don't ask:
+             <i>"Is there a bottlenose dolphin vocalizing in the audio?
+             Yes or No."</i>
+         </ul>
+         <ul>&#9989; Do ask:
+             <i>"What focal species, if any, are heard in the audio?"</i>
+         </ul>
+         <li>Giving the model options to choose works well for broader
+             categories (less so for specific species)</li>
+         <ul>&#10060; Don't ask:
+             <i>"Classify the audio into one of the following species:
+             Bottlenose Dolphin, Orca, Great Gray Owl"</i>
+         </ul>
+         <ul>&#9989; Do ask:
+             <i>"Classify the audio into one of the following categories:
+             Cetaceans, Aves, or None."</i>
+         </ul>
+     </ul>
+     <br>
+     <b>Audio Files</b>
+     <ul style="margin-top: 12px; padding-left: 20px;
+         color: #6b7280; font-size: 14px; line-height: 1.6;">
+         <li>Supported formats: .wav, .mp3, .aac, .flac, .ogg, .webm,
+             .midi, .aiff, .wma, .opus, .amr</li>
+         <li>If you are uploading an .mp4, please check that it is not
+             an MPEG-4 Movie file.</li>
+         <li>For best results, use high-quality recordings with minimal
+             background noise.</li>
+     </ul>
+ </div>
+ <div class="guide-section">
+     <h3>Learn More</h3>
+     <ul style="margin-top: 12px; padding-left: 20px;
+         color: #6b7280; font-size: 14px; line-height: 1.6;">
+         <li>Read our
+             <a href="https://huggingface.co/blog/EarthSpeciesProject/nature-lm-audio-ui-demo/"
+                target="_blank">recent blog post</a>
+             with a step-by-step tutorial</li>
+         <li>Check out the
+             <a href="https://arxiv.org/abs/2411.07186"
+                target="_blank">published paper</a>
+             for a deeper technical dive on NatureLM-audio.</li>
+         <li>Visit the
+             <a href="https://earthspecies.github.io/naturelm-audio-demo/"
+                target="_blank">NatureLM-audio Demo Page</a>
+             for additional context, a demo video, and more examples
+             of the model in action.</li>
+         <li>Sign up for our
+             <a href="https://forms.gle/WjrbmFhKkzmEgwvY7"
+                target="_blank">closed beta waitlist</a>,
+             if you're interested in testing upcoming features like
+             longer audio files and batch processing.</li>
+     </ul>
+ </div>
static/onboarding.html ADDED
@@ -0,0 +1,13 @@
+ <div class="banner">
+     <div style="display: flex; padding: 0px; align-items: center; flex: 1;">
+         <div style="font-size: 20px; margin-right: 12px;">&#128075;</div>
+         <div style="flex: 1;">
+             <div class="banner-header">Welcome to NatureLM-audio!</div>
+             <div class="banner-text">
+                 Upload your first audio file or select a pre-loaded example below.
+             </div>
+         </div>
+     </div>
+     <a href="https://huggingface.co/blog/EarthSpeciesProject/nature-lm-audio-ui-demo/"
+        target="_blank" class="link-btn">View Tutorial</a>
+ </div>
static/style.css ADDED
@@ -0,0 +1,87 @@
+ #chat-input textarea {
+     background: white;
+     flex: 1;
+ }
+ #chat-input .submit-button {
+     padding: 10px;
+     margin: 2px 6px;
+     align-self: center;
+ }
+ #spectrogram-plot {
+     padding: 12px;
+     margin: 12px;
+ }
+ .banner {
+     background: white;
+     border: 1px solid #e5e7eb;
+     border-radius: 8px;
+     padding: 16px 20px;
+     display: flex;
+     align-items: center;
+     justify-content: space-between;
+     margin-bottom: 16px;
+     margin-left: 0;
+     margin-right: 0;
+     box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1);
+ }
+ .banner .banner-header {
+     font-size: 16px;
+     font-weight: 600;
+     color: #374151;
+     margin-bottom: 4px;
+ }
+ .banner .banner-text {
+     font-size: 14px;
+     color: #6b7280;
+     line-height: 1.4;
+ }
+ .link-btn {
+     padding: 6px 12px;
+     border-radius: 6px;
+     font-size: 13px;
+     font-weight: 500;
+     cursor: pointer;
+     border: none;
+     background: #3b82f6;
+     color: white;
+     text-decoration: none;
+     display: inline-block;
+     transition: background 0.2s ease;
+ }
+ .link-btn:hover {
+     background: #2563eb;
+ }
+
+ .guide-section {
+     margin-bottom: 32px;
+     border-radius: 8px;
+     padding: 14px;
+     border: 1px solid #e5e7eb;
+ }
+
+ .guide-section h3 {
+     margin-top: 4px;
+     margin-bottom: 16px;
+     border-bottom: 1px solid #e5e7eb;
+     padding-bottom: 12px;
+ }
+ .guide-section h4 {
+     color: #1f2937;
+     margin-top: 4px;
+ }
+ @media (prefers-color-scheme: dark) {
+     #chat-input {
+         background: #1e1e1e;
+     }
+     #chat-input textarea {
+         background: #1e1e1e;
+         color: white;
+     }
+     .banner {
+         background: #1e1e1e;
+         color: white;
+     }
+     .banner .banner-header {
+         color: white;
+     }
+ }