base_model:
- meta-llama/Llama-3.1-8B
tags:
- code
---

# 📌 HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning

arXiv: https://arxiv.org/pdf/2508.06475

---

## 📖 Introduction
**HapticLLaMA** is a multimodal sensory language model that interprets vibration signals and describes them within a given sensory, emotional, or associative category. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF).

---

## 🧩 Tasks
- Given a vibration signal S and a target category c ∈ {sensory, emotional, associative}, the goal is to generate a caption corresponding to the specified category of haptic experience. Here, *sensory* refers to physical attributes (e.g., the intensity of tapping), *emotional* denotes affective impressions (e.g., the mood of a scene), and *associative* indicates familiar real-world experiences (e.g., the buzzing of a bee, a heartbeat).
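As a minimal sketch, the three target categories correspond to three caption prompts; the prompt strings below are the ones used in the inference example later in this card, while the dictionary and helper are purely illustrative:

```python
# Illustrative mapping from target category c to its caption prompt.
category_prompts = {
    "sensory": "its sensory description is",         # physical attributes
    "emotional": "its emotional description is",     # affective impressions
    "associative": "its associative description is", # familiar real-world experiences
}

def build_prompt(category):
    # Look up the caption prompt for one of the three categories.
    return category_prompts[category]
```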

---

## 📂 Training
HapticLLaMA training consists of (1) supervised fine-tuning with LoRA adaptation and (2) subsequent fine-tuning based on human feedback on the generated captions.

<img width="925" height="557" alt="image" src="https://github.com/user-attachments/assets/28a0aa75-d011-4870-b9ec-b9b3607eb8d8" />

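The LoRA idea in stage (1) can be sketched in a few lines of NumPy (sizes and rank here are illustrative, not the values used for HapticLLaMA): the frozen weight `W` receives a trainable low-rank update `B @ A`, so far fewer parameters are trained than in full fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8            # layer shape and LoRA rank (illustrative values)
W = rng.normal(size=(d, k))    # frozen pretrained weight, never updated
B = rng.normal(size=(d, r))    # trainable low-rank factors
A = rng.normal(size=(r, k))
delta = B @ A                  # d*r + r*k trainable parameters instead of d*k
W_adapted = W + delta          # effective weight at inference time
assert np.linalg.matrix_rank(delta) <= r
```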
## 📂 Models
- **Frequency-based Model**:
- **EnCodec-based Model**:

---

## 📂 Haptic Tokenizer
- **Frequency-based Tokenizer**:

<img width="361" height="211" alt="image" src="https://github.com/user-attachments/assets/ca848d0b-18d5-4ad5-89e4-268399aad801" />

The frequency-based tokenizer divides the frequency range into logarithmically spaced bins that correspond to just-noticeable differences in human frequency perception. Similarly, the amplitude range is segmented into normalized levels. The tokenizer then assigns a unique token (e.g., FREQ_3_AMP_2) to each frequency-amplitude pair, encoding the signal's spectral content into a form interpretable by LLMs.
```python
import numpy as np
import librosa

def steps_binning(frequencies, amplitudes, freq_bins=10, amp_levels=5):
    # Logarithmically spaced frequency bins with a fixed 1.2 ratio between
    # edges, anchored at the maximum observed frequency.
    freq_max = np.max(frequencies)
    freq_min = freq_max / (1.2 ** (freq_bins - 1))
    freq_edges = np.geomspace(freq_min, freq_max, num=freq_bins)
    freq_labels = [f"FREQ_{i+1}" for i in range(freq_bins)]

    amp_min, amp_max = np.min(amplitudes), np.max(amplitudes)
    if amp_min == amp_max:
        # Constant amplitude: fall back to uniform levels on [0, 1].
        amplitudes = np.zeros_like(frequencies)
        amp_edges = np.linspace(0, 1, amp_levels + 1)
    else:
        # Normalize amplitudes to [0, 1], then use log-spaced levels
        # over the normalized range.
        amplitudes = (amplitudes - amp_min) / (amp_max - amp_min)
        amp_edges = np.geomspace(1 / (1.2 ** (amp_levels - 1)), 1.0, num=amp_levels)
    amp_labels = [f"AMP_{i+1}" for i in range(amp_levels)]

    tokens = []
    for f, a in zip(frequencies, amplitudes):
        freq_bin = min(max(np.digitize(f, freq_edges) - 1, 0), freq_bins - 1)
        freq_token = freq_labels[freq_bin]

        amp_bin = min(max(np.digitize(a, amp_edges) - 1, 0), amp_levels - 1)
        amp_token = amp_labels[amp_bin]

        tokens.append(f"{freq_token}_{amp_token}")
    return tokens

### Load a .wav file and tokenize it (parameter values are illustrative).
wav_file = "./F211_loop.wav"
n_fft, hop_length = 2048, 512
freq_bins, amp_levels = 10, 5

y, sr = librosa.load(wav_file, sr=None)
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
frequencies = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
magnitudes = np.abs(D)
magnitudes = magnitudes / np.max(magnitudes)

# Amplitude spectrum of a single STFT frame.
frame_idx = 10
amplitudes = magnitudes[:, frame_idx]

# Keep only the haptically relevant frequencies (< 500 Hz).
mask = frequencies < 500
frequencies_filtered = frequencies[mask]
amplitudes_filtered = amplitudes[mask]

### Haptic tokens based on the frequency-based haptic tokenizer.
tokens = steps_binning(frequencies_filtered, amplitudes_filtered,
                       freq_bins=freq_bins, amp_levels=amp_levels)
```
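A quick way to see why geometric spacing matches a constant just-noticeable difference: consecutive bin edges from `np.geomspace` always differ by the same relative factor (1.2 here), unlike linearly spaced bins. A small check, with an illustrative maximum frequency:

```python
import numpy as np

freq_bins = 10
freq_max = 400.0                                 # illustrative upper edge
freq_min = freq_max / (1.2 ** (freq_bins - 1))
edges = np.geomspace(freq_min, freq_max, num=freq_bins)

# Every consecutive pair of edges differs by the same ~20% step, so each
# bin covers a constant relative (not absolute) change in frequency.
assert np.allclose(edges[1:] / edges[:-1], 1.2)
```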
---
- **EnCodec-based Tokenizer**:

<img width="317" height="172" alt="image" src="https://github.com/user-attachments/assets/35e50d2e-c21f-4fc1-8953-74305a752ee0" />

EnCodec is a neural audio codec that compresses audio using deep learning (Défossez et al., 2023). It consists of three main components: (1) an encoder that transforms raw audio into a lower-dimensional latent representation, (2) a quantizer that discretizes the latent features via residual vector quantization, and (3) a decoder that reconstructs the waveform from the quantized codes. The EnCodec-based tokenizer extracts the codes produced by the residual vector quantization step of this compression architecture.
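Residual vector quantization can be illustrated with a toy NumPy example (the codebooks here are tiny and random, purely for illustration): each stage quantizes the residual left by the previous stage, and the original vector equals the sum of the selected codes plus the final residual. The chosen code indices are the discrete tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]  # 3 stages, 16 codes each

def rvq_encode(x, codebooks):
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        codes.append(idx)
        residual = residual - cb[idx]   # the next stage quantizes what is left
    return codes, residual

x = rng.normal(size=4)
codes, residual = rvq_encode(x, codebooks)
# codes is one discrete index per stage, e.g. the token sequence for x.
```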

```python
import torch
from datasets import Audio, Dataset
from transformers import AutoProcessor, EncodecModel

encodec_model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

### EnCodec-based Tokenizer
def encodec_token(wav_file):
    data_dict = {"audio": [wav_file]}
    data_dataset = Dataset.from_dict(data_dict).cast_column("audio", Audio())
    audio_sample = data_dataset[-1]["audio"]["array"]
    inputs = processor(raw_audio=audio_sample, sampling_rate=24000, return_tensors="pt")
    with torch.no_grad():
        encoded_frames = encodec_model.encode(inputs["input_values"], inputs["padding_mask"])
    tokens = encoded_frames.audio_codes[0][0]                   # codes for the first batch item
    tokens_list = [str(token) for token in tokens[0].tolist()]  # first codebook
    return tokens_list
```
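For sizing the resulting token sequences: per the EnCodec paper, the 24 kHz model downsamples the waveform by a total stride of 320 samples, i.e. 75 code frames per second per codebook. A rough back-of-the-envelope estimate, assuming that stride:

```python
sampling_rate = 24000
stride = 320                                  # total encoder downsampling, EnCodec 24 kHz
frames_per_second = sampling_rate // stride   # 75 code frames per second

clip_seconds = 2.0
approx_tokens = int(frames_per_second * clip_seconds)  # tokens per codebook for a 2 s clip
```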
---
## 📂 Inference

Given a haptic signal, we prompt HapticLLaMA to generate captions from sensory, emotional, and associative perspectives.

<img width="448" height="329" alt="image" src="https://github.com/user-attachments/assets/2ea17083-5da3-47f2-9781-7f17912d08cc" />

```python
import os
import torch
from transformers import AutoTokenizer

# Load HapticLLaMA (Model and args come from the HapticLLaMA training code).
def load_model(stage, device, mode, model_file_url):
    if os.path.exists(model_file_url):
        model = Model(args, mode=mode)
        lora_state_dict = torch.load(model_file_url)
        missing_keys, unexpected_keys = model.load_state_dict(lora_state_dict, strict=False)
        model.to(device)
    else:
        print('invalid model url!')
        model = None
    return model

### Load the pretrained haptic tokenizers.
frequency_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_steps_binning.pt/")
encodec_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_encodec.pt/")

# Formalize the input for inference.
def tokenizer_haptic(haptic, prompt, mode):

    def formalize_input(haptic_tokens, tokenizer, prompt):
        tokenizer.pad_token = tokenizer.eos_token

        inputs = tokenizer(haptic_tokens, padding=True, truncation=True, return_tensors="pt")
        input_ids = inputs.input_ids
        input_atts = inputs.attention_mask

        prompt_enc = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt")

        # The prompt is appended after the haptic tokens.
        prompt_ids = torch.cat((input_ids, prompt_enc.input_ids), dim=1)
        prompt_atts = torch.cat((input_atts, prompt_enc.attention_mask), dim=1)

        return input_ids, input_atts, prompt_ids, prompt_atts

    if mode == 'frequency':
        ### Frequency-based token formalization
        freq_haptic_tokens = frequency_tokenizer(haptic, mode='frequency')
        freq_haptic_tokens = [' '.join(freq_haptic_tokens)]
        return formalize_input(freq_haptic_tokens, frequency_tokenizer, prompt=prompt)
    elif mode == 'encodec':
        ### EnCodec-based token formalization
        encodec_haptic_tokens = encodec_token(haptic)
        encodec_haptic_tokens = [' '.join(encodec_haptic_tokens)]
        return formalize_input(encodec_haptic_tokens, encodec_tokenizer, prompt=prompt)
```
Inference for a single sample:

```python
haptic_signal = r'./F211_loop.wav'
sensory_prompt = 'its sensory description is'
## For emotional and associative captions:
## emotional_prompt = 'its emotional description is'
## associative_prompt = 'its associative description is'
input_ids, input_atts, prompt_ids, prompt_atts = tokenizer_haptic(haptic_signal, sensory_prompt, mode='encodec')
hapticllama = load_model(stage=1, device='cuda', mode='encodec', model_file_url=encodec_model_file_url)
caption = hapticllama.generate(inputs=prompt_ids, input_atts=prompt_atts)
print(caption)
```
---

## 🚀 Citation
If you find this work useful for your research, please cite our paper:

```bibtex
@article{hu2025hapticllama,
  title={HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning},
  author={Hu, Guimin and Hershcovich, Daniel and Seifi, Hasti},
  journal={arXiv preprint arXiv:2508.06475},
  year={2025}
}
```