---
license: apache-2.0
datasets:
- GuiminHu/HapticCap
- GuiminHu/VibRate
language:
- en
metrics:
- bleu
- meteor
- rouge
base_model:
- meta-llama/Llama-3.1-8B
tags:
- code
---

# 📌 HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning

arXiv: https://arxiv.org/pdf/2508.06475

Code: https://github.com/LeMei/HapticLLaMA

---

## 📖 Introduction

**HapticLLaMA** is a multimodal sensory language model that interprets vibration signals and describes them in a given sensory, emotional, or associative category. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF).

---

## 🧩 Tasks

Given a vibration signal S and a target category c ∈ {sensory, emotional, associative}, where *sensory* refers to physical attributes (e.g., the intensity of tapping), *emotional* denotes affective impressions (e.g., the mood of a scene), and *associative* indicates familiar real-world experiences (e.g., the buzzing of a bee, a heartbeat), the goal is to generate a caption corresponding to the specified category of haptic experience.

---

## 📂 Training

HapticLLaMA training consists of (1) supervised fine-tuning with LoRA adaptation and (2) subsequent fine-tuning based on human feedback on the generated captions.

---

## 📂 Haptic Tokenizer

- **Frequency-based Tokenizer**:

The frequency-based tokenizer divides the frequency range into logarithmically spaced bins that correspond to just-noticeable differences in human frequency perception. Similarly, the amplitude range is segmented into normalized levels. The tokenizer then assigns a unique token (e.g., `FREQ_3_AMP_2`) to each frequency-amplitude pair, encoding the signal's spectral content into a form interpretable by LLMs.
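For the language model to consume these strings, they must exist in its tokenizer vocabulary. The repository ships ready-made tokenizer files (e.g., `updated_llama_tokenizer_steps_binning.pt`, used in the inference code below), so the following is only a minimal sketch of how such `FREQ_i_AMP_j` tokens could be registered with a LLaMA tokenizer; the base model name and bin counts are assumptions that match the `steps_binning` defaults shown below.

```python
from transformers import AutoTokenizer

# Sketch only: enumerate the FREQ_i_AMP_j vocabulary and add it to a LLaMA
# tokenizer as extra tokens. The bin counts (10 x 5) follow the steps_binning
# defaults in this card; the released tokenizer files were built separately.
freq_bins, amp_levels = 10, 5
haptic_vocab = [f"FREQ_{i + 1}_AMP_{j + 1}"
                for i in range(freq_bins) for j in range(amp_levels)]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
num_added = tokenizer.add_tokens(haptic_vocab)
print(f"registered {num_added} haptic tokens")  # 50 new single-unit tokens

# Before fine-tuning, the base model's embeddings would be resized to match:
# model.resize_token_embeddings(len(tokenizer))
```

The binning that produces these token strings is implemented below.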
```python
import librosa
import numpy as np

def steps_binning(frequencies, amplitudes, freq_bins=10, amp_levels=5):
    # Logarithmically spaced frequency bins (factor 1.2 between edges),
    # approximating just-noticeable differences in frequency perception.
    freq_min, freq_max = np.min(frequencies), np.max(frequencies)
    freq_min = freq_max / (1.2 ** (freq_bins - 1))
    freq_edges = np.geomspace(freq_min, freq_min * 1.2 ** (freq_bins - 1), num=freq_bins)
    freq_labels = [f"FREQ_{i+1}" for i in range(freq_bins)]

    # Normalize amplitudes and build amplitude levels.
    amp_min, amp_max = np.min(amplitudes), np.max(amplitudes)
    if amp_min == amp_max:
        amplitudes = np.zeros_like(frequencies)
        amp_edges = np.linspace(0, 1, amp_levels + 1)
    else:
        amplitudes = (amplitudes - amp_min) / (amp_max - amp_min)
        amp_min = amp_max / (1.2 ** (amp_levels - 1))
        amp_edges = np.geomspace(amp_min, amp_max, num=amp_levels)
    amp_labels = [f"AMP_{i+1}" for i in range(amp_levels)]

    # Map each (frequency, amplitude) pair to a FREQ_i_AMP_j token.
    tokens = []
    for f, a in zip(frequencies, amplitudes):
        freq_bin = min(np.digitize(f, freq_edges) - 1, freq_bins - 1)
        freq_token = freq_labels[freq_bin]
        amp_bin = min(np.digitize(a, amp_edges) - 1, amp_levels - 1)
        amp_token = amp_labels[amp_bin]
        tokens.append(f"{freq_token}_{amp_token}")
    return tokens


### Load a .wav file and tokenize
n_fft, hop_length = 2048, 512        # example STFT parameters
freq_bins, amp_levels = 10, 5
wav_file = r'./F211_loop.wav'

y, sr = librosa.load(wav_file, sr=None)
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
frequencies = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
magnitudes = np.abs(D)
magnitudes = magnitudes / np.max(magnitudes)

frame_idx = 10                        # pick one STFT frame to tokenize
amplitudes = magnitudes[:, frame_idx]

mask = frequencies < 500              # keep only frequencies below 500 Hz
frequencies_filtered = frequencies[mask]
amplitudes_filtered = amplitudes[mask]

### Haptic tokens from the frequency-based haptic tokenizer
tokens = steps_binning(frequencies_filtered, amplitudes_filtered,
                       freq_bins=freq_bins, amp_levels=amp_levels)
```

---

- **EnCodec-based Tokenizer**:

EnCodec is a neural audio codec that compresses audio using deep learning (Défossez et al., 2023). It consists of three main components: (1) an encoder that transforms raw audio into a lower-dimensional latent representation, (2) a quantizer that discretizes the latent features via residual vector quantization, and (3) a decoder that reconstructs the waveform from the quantized codes. The EnCodec-based tokenizer extracts the codes produced by residual vector quantization in this audio compression architecture.

```python
import torch
from datasets import Dataset, Audio
from transformers import AutoProcessor, EncodecModel

encodec_model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

### EnCodec-based tokenizer
def encodec_token(wav_file):
    # Load the .wav file through the datasets Audio feature.
    data_dict = {"audio": [wav_file]}
    data_dataset = Dataset.from_dict(data_dict).cast_column("audio", Audio())
    audio_sample = data_dataset[-1]["audio"]["array"]

    inputs = processor(raw_audio=audio_sample, sampling_rate=24000, return_tensors="pt")
    with torch.no_grad():
        encoded_frames = encodec_model.encode(inputs["input_values"], inputs["padding_mask"])

    # Use the discrete codes of the first codebook as haptic tokens.
    tokens = encoded_frames.audio_codes[0][0]
    tokens_list = [str(token) for token in tokens[0].tolist()]
    return tokens_list
```
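For reference, a short usage example mirroring the frequency-based snippet above; the `.wav` path reuses the sample file from the inference section, and the printed values in the comment are illustrative:

```python
# Tokenize one vibration file with the EnCodec-based tokenizer.
encodec_tokens = encodec_token(r'./F211_loop.wav')
print(len(encodec_tokens), encodec_tokens[:10])  # codebook indices as strings, e.g. ['512', '87', ...]
```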
---

## 📂 Inference

Given a haptic signal, we prompt HapticLLaMA to generate captions from sensory, emotional, and associative perspectives.

In the snippet below, `Model` and `args` come from the HapticLLaMA repository, and `frequency_token` is assumed to wrap the librosa + `steps_binning` pipeline shown above.

```python
import os
import torch
from transformers import AutoTokenizer

# Load HapticLLaMA (LoRA weights on top of the base model)
def load_model(stage, device, mode, model_file_url):
    if os.path.exists(model_file_url):
        model = Model(args, mode=mode)  # Model and args are defined in the HapticLLaMA repository
        lora_state_dict = torch.load(model_file_url)
        missing_keys, unexpected_keys = model.load_state_dict(lora_state_dict, strict=False)
        model.to(device)
    else:
        print('invalid model url!')
        model = None
    return model

### Load the pretrained haptic tokenizers
frequency_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_steps_binning.pt/")
encodec_tokenizer = AutoTokenizer.from_pretrained(r"./updated_llama_tokenizer_encodec.pt/")

# Formalize the input for inference
def tokenizer_haptic(haptic, prompt, mode):
    def formalize_input(haptic_tokens, tokenizer, prompt):
        tokenizer.pad_token = tokenizer.eos_token
        inputs = tokenizer(haptic_tokens, padding=True, truncation=True, return_tensors="pt")
        input_ids = inputs.input_ids
        input_atts = inputs.attention_mask

        prompt_enc = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt")
        prompt_ids = prompt_enc.input_ids
        prompt_atts = prompt_enc.attention_mask

        # Concatenate haptic tokens and prompt tokens
        prompt_ids = torch.cat((input_ids, prompt_ids), dim=1)
        prompt_atts = torch.cat((input_atts, prompt_atts), dim=1)
        return input_ids, input_atts, prompt_ids, prompt_atts

    if mode == 'frequency':
        ### Frequency-based token formalization
        # frequency_token is an assumed helper: .wav -> FREQ_i_AMP_j tokens via steps_binning
        freq_haptic_tokens = frequency_token(haptic)
        freq_haptic_tokens = [' '.join(freq_haptic_tokens)]
        freq_input_ids, freq_input_atts, freq_prompt_ids, freq_prompt_atts = formalize_input(
            freq_haptic_tokens, frequency_tokenizer, prompt=prompt)
        return freq_input_ids, freq_input_atts, freq_prompt_ids, freq_prompt_atts
    elif mode == 'encodec':
        ### EnCodec-based token formalization
        encodec_haptic_tokens = encodec_token(haptic)
        encodec_haptic_tokens = [' '.join(encodec_haptic_tokens)]
        encodec_input_ids, encodec_input_atts, encodec_prompt_ids, prompt_atts = formalize_input(
            encodec_haptic_tokens, encodec_tokenizer, prompt=prompt)
        return encodec_input_ids, encodec_input_atts, encodec_prompt_ids, prompt_atts
```

Inference for one sample:

```python
haptic_signal = r'./F211_loop.wav'
sensory_prompt = 'its sensory description is'
## For emotional and associative captions:
## emotional_prompt = 'its emotional description is'
## associative_prompt = 'its associative description is'

input_ids, input_atts, prompt_ids, prompt_atts = tokenizer_haptic(haptic_signal, sensory_prompt, mode='encodec')

# encodec_model_file_url: path to the downloaded HapticLLaMA checkpoint
hapticllama = load_model(stage=1, device='cuda', mode='encodec', model_file_url=encodec_model_file_url)
caption = hapticllama.generate(inputs=prompt_ids, input_atts=prompt_atts)
print(caption)
```

---

## 🚀 Citation

If you find this work useful for your research, please cite our paper:

```bibtex
@article{hu2025hapticllama,
  title={HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning},
  author={Hu, Guimin and Hershcovich, Daniel and Seifi, Hasti},
  journal={arXiv preprint arXiv:2508.06475},
  year={2025}
}
```