---
license: apache-2.0
datasets:
- GuiminHu/HapticCap
- GuiminHu/VibRate
language:
- en
metrics:
- bleu
- meteor
- rouge
base_model:
- meta-llama/Llama-3.1-8B
tags:
- code
---

# 📌 HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning

arXiv: https://arxiv.org/pdf/2508.06475

Code: https://github.com/LeMei/HapticLLaMA

---

## 📖 Introduction
**HapticLLaMA** is a multimodal sensory language model that interprets vibration signals and describes them in a given sensory, emotional, or associative category. 
HapticLLaMA is trained in two stages: (1) supervised fine-tuning of the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement
learning from human feedback (RLHF). 

---

## 🧩 Tasks
Given a vibration signal S and a target category c ∈ {sensory, emotional, associative}, where *sensory* refers to physical attributes (e.g., the intensity of tapping), *emotional* denotes affective
impressions (e.g., the mood of a scene), and *associative* indicates familiar real-world experiences (e.g., the buzzing of a bee, a heartbeat), the goal is to generate a caption for the specified category of haptic experience.
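
In practice, the target category is realized as a natural-language prompt appended to the haptic tokens; the prompt strings below follow the inference example later in this card:

```python
# Category-to-prompt mapping for haptic captioning; the prompt wording
# matches the inference example in this card.
PROMPTS = {
    "sensory": "its sensory description is",
    "emotional": "its emotional description is",
    "associative": "its associative description is",
}

def build_prompt(category: str) -> str:
    """Return the caption prompt for a target category c."""
    if category not in PROMPTS:
        raise ValueError(f"unknown category: {category}")
    return PROMPTS[category]

print(build_prompt("sensory"))  # its sensory description is
```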

---

## 📂 Training
HapticLLaMA training consists of (1) supervised fine-tuning with LoRA adaptation and (2) subsequent fine-tuning based on human feedback on generated captions.

<img width="925" height="557" alt="image" src="https://github.com/user-attachments/assets/28a0aa75-d011-4870-b9ec-b9b3607eb8d8" />

---

## 📂 Haptic Tokenizer
- **Frequency-based Tokenizer**:
  
  <img width="361" height="211" alt="image" src="https://github.com/user-attachments/assets/ca848d0b-18d5-4ad5-89e4-268399aad801" />

The frequency-based tokenizer divides the frequency range into logarithmically spaced bins that correspond to just-noticeable differences in human frequency perception. Similarly, the amplitude range is segmented into normalized levels. The tokenizer then assigns a unique
token (e.g., FREQ_3_AMP_2) to each frequency-amplitude pair, encoding the signal's spectral content into a form interpretable by LLMs.
```python
import numpy as np
import librosa

### Frequency-based haptic tokenizer
def steps_binning(frequencies, amplitudes, freq_bins=10, amp_levels=5):
    # Logarithmically spaced frequency bins (ratio 1.2 between edges)
    freq_max = np.max(frequencies)
    freq_min = freq_max / (1.2 ** (freq_bins - 1))
    freq_edges = np.geomspace(freq_min, freq_max, num=freq_bins)
    freq_labels = [f"FREQ_{i+1}" for i in range(freq_bins)]

    # Normalize amplitudes, then bin them into log-spaced levels
    amp_min, amp_max = np.min(amplitudes), np.max(amplitudes)
    if amp_min == amp_max:  # constant amplitude: fall back to uniform edges
        amplitudes = np.zeros_like(frequencies)
        amp_edges = np.linspace(0, 1, amp_levels + 1)
    else:
        amplitudes = (amplitudes - amp_min) / (amp_max - amp_min)
        # Log-spaced levels over the normalized range (0, 1]
        amp_edges = np.geomspace(1.0 / (1.2 ** (amp_levels - 1)), 1.0, num=amp_levels)
    amp_labels = [f"AMP_{i+1}" for i in range(amp_levels)]

    # One token per frequency-amplitude pair, e.g. FREQ_3_AMP_2
    tokens = []
    for f, a in zip(frequencies, amplitudes):
        freq_bin = min(max(np.digitize(f, freq_edges) - 1, 0), freq_bins - 1)
        amp_bin = min(max(np.digitize(a, amp_edges) - 1, 0), amp_levels - 1)
        tokens.append(f"{freq_labels[freq_bin]}_{amp_labels[amp_bin]}")
    return tokens

### Load a .wav file and tokenize it
n_fft, hop_length = 2048, 512  # STFT parameters
# wav_file: path to a vibration .wav file (e.g. './F211_loop.wav')
y, sr = librosa.load(wav_file, sr=None)

D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
frequencies = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
magnitudes = np.abs(D)
magnitudes = magnitudes / np.max(magnitudes)
frame_idx = 10  # pick one STFT frame
amplitudes = magnitudes[:, frame_idx]
mask = frequencies < 500  # keep the haptically relevant band (< 500 Hz)
frequencies_filtered = frequencies[mask]
amplitudes_filtered = amplitudes[mask]
### Haptic tokens from the frequency-based tokenizer
tokens = steps_binning(frequencies_filtered, amplitudes_filtered, freq_bins=10, amp_levels=5)
```
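
With the default bin counts (`freq_bins=10`, `amp_levels=5`), this tokenizer yields a fixed vocabulary of 10 × 5 = 50 haptic tokens, which are presumably the new tokens in the updated LLaMA tokenizer loaded in the inference section. A quick sketch of that token set:

```python
freq_bins, amp_levels = 10, 5

# Every frequency-amplitude pair gets one token, e.g. FREQ_3_AMP_2
haptic_vocab = [
    f"FREQ_{i+1}_AMP_{j+1}"
    for i in range(freq_bins)
    for j in range(amp_levels)
]

print(len(haptic_vocab))  # 50
print(haptic_vocab[0])    # FREQ_1_AMP_1
print(haptic_vocab[-1])   # FREQ_10_AMP_5
```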
---
- **EnCodec-based Tokenizer**:

<img width="317" height="172" alt="image" src="https://github.com/user-attachments/assets/35e50d2e-c21f-4fc1-8953-74305a752ee0" />

EnCodec is a neural audio codec that compresses audio using deep learning (Défossez et al., 2023). It consists of three
main components: (1) an encoder that transforms raw audio into a lower-dimensional latent representation, (2) a quantizer that discretizes the latent features via residual vector quantization (RVQ), and (3) a decoder that reconstructs the waveform from the quantized codes. The EnCodec-based tokenizer extracts the codes from the residual vector quantizer of this audio compression architecture. 
  
```python
import torch
from datasets import Dataset, Audio
from transformers import AutoProcessor, EncodecModel

encodec_model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

### EnCodec-based tokenizer
def encodec_token(wav_file):
    data_dict = {"audio": [wav_file]}
    data_dataset = Dataset.from_dict(data_dict).cast_column("audio", Audio())
    audio_sample = data_dataset[-1]["audio"]["array"]
    inputs = processor(raw_audio=audio_sample, sampling_rate=24000, return_tensors="pt")
    with torch.no_grad():
        encoded_frames = encodec_model.encode(inputs["input_values"], inputs["padding_mask"])
    # The discrete RVQ codes become the haptic token sequence
    tokens = encoded_frames.audio_codes[0][0]
    tokens_list = [str(token) for token in tokens[0].tolist()]
    return tokens_list
```
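
To see why residual vector quantization yields a sequence of discrete codes, here is a toy numpy sketch of RVQ (illustrative only; EnCodec's quantizer is learned and far larger): each stage picks the nearest codebook vector to the residual left by the previous stage, so one integer index per stage encodes the input.

```python
import numpy as np

rng = np.random.default_rng(0)
# 3 quantizer stages, each with a codebook of 8 vectors of dimension 4
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the previous residual."""
    codes, residual = [], x
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

x = rng.normal(size=4)
codes = rvq_encode(x, codebooks)
print(codes)  # one integer code per stage; these indices are what the LLM sees
```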
---
## 📂 Inference

Given a haptic signal, we prompt HapticLLaMA to generate captions from sensory, emotional, and associative perspectives.

<img width="448" height="329" alt="image" src="https://github.com/user-attachments/assets/2ea17083-5da3-47f2-9781-7f17912d08cc" />

```python
import os
import torch
from transformers import AutoTokenizer

# Load a trained HapticLLaMA checkpoint (Model and args come from the repo)
def load_model(stage, device, mode, model_file_url):
    if os.path.exists(model_file_url):
        model = Model(args, mode=mode)
        lora_state_dict = torch.load(model_file_url)
        missing_keys, unexpected_keys = model.load_state_dict(lora_state_dict, strict=False)
        model.to(device)
    else:
        print('invalid model url!')
        model = None
    return model

### Load the pretrained haptic tokenizers
frequency_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_steps_binning.pt/")
encodec_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_encodec.pt/")

# Format haptic tokens and the category prompt into model inputs
def tokenizer_haptic(haptic, prompt, mode):

    def formalize_input(haptic_tokens, tokenizer, prompt):
        tokenizer.pad_token = tokenizer.eos_token

        inputs = tokenizer(haptic_tokens, padding=True, truncation=True, return_tensors="pt")
        input_ids = inputs.input_ids
        input_atts = inputs.attention_mask

        prompt_enc = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt")
        # Append the prompt after the haptic tokens
        prompt_ids = torch.cat((input_ids, prompt_enc.input_ids), dim=1)
        prompt_atts = torch.cat((input_atts, prompt_enc.attention_mask), dim=1)

        return input_ids, input_atts, prompt_ids, prompt_atts

    if mode == 'frequency':
        ### Frequency-based token formalization
        freq_haptic_tokens = frequency_tokenizer(haptic, mode='frequency')
        freq_haptic_tokens = [' '.join(freq_haptic_tokens)]
        return formalize_input(freq_haptic_tokens, frequency_tokenizer, prompt=prompt)
    elif mode == 'encodec':
        ### EnCodec-based token formalization
        encodec_haptic_tokens = encodec_token(haptic)
        encodec_haptic_tokens = [' '.join(encodec_haptic_tokens)]
        return formalize_input(encodec_haptic_tokens, encodec_tokenizer, prompt=prompt)
```
Inference for one sample

```python
haptic_signal = r'./F211_loop.wav'
sensory_prompt = 'its sensory description is'
## For emotional and associative captions:
## emotional_prompt = 'its emotional description is'
## associative_prompt = 'its associative description is'
input_ids, input_atts, prompt_ids, prompt_atts = tokenizer_haptic(haptic_signal, sensory_prompt, mode='encodec')
# encodec_model_file_url: path to the trained HapticLLaMA checkpoint
hapticllama = load_model(stage=1, device='cuda', mode='encodec', model_file_url=encodec_model_file_url)
caption = hapticllama.generate(inputs=prompt_ids, input_atts=prompt_atts)
print(caption)
```
---

## 🚀 Citation
If you find this model useful for your research, please cite our paper:

```bibtex
@article{hu2025hapticllama,
  title={HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning},
  author={Hu, Guimin and Hershcovich, Daniel and Seifi, Hasti},
  journal={arXiv preprint arXiv:2508.06475},
  year={2025}
}
```