MODF-SIR: a Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

MODF-SIR is a lightweight MLLM-based, distillation-augmented, multi-agent collaborative framework for social intelligence reasoning.

🔖 Model Details

Model type: Omni-modal Large Language Model
License: BSD-3-Clause

👀 MODF-SIR Overview

We propose a multi-agent collaborative framework for social intelligence reasoning, built on a lightweight Multimodal Large Language Model (MLLM). Both the training and inference phases are augmented via knowledge distillation. Within this architecture, the multi-modal data pertinent to social intelligence is precisely localized. Relevant long-tail events are then identified, extracted, and rendered as formatted, explicit text, which prevents critical long-tail information from being overshadowed by head events and environmental noise during tokenization. We also integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, covering the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is likewise distillation-enhanced and uses Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Evaluations against open-source and proprietary models on multiple benchmarks demonstrate the effectiveness of the proposed framework.
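
To make the LoRA-based test-time adaptation described above more concrete, here is a minimal pure-Python sketch of how a low-rank adapter modifies a frozen weight matrix. It is an illustration under stated assumptions, not the repository's implementation: the tiny matrices, the helper names `matvec` and `lora_forward`, and the zero initialization of `B` (a common default for freshly created LoRA adapters) are all chosen for the example.

```python
# Illustrative LoRA forward pass on tiny matrices (hypothetical helper names).
# Output: W x + (alpha / r) * B (A x), with W frozen and only A, B trainable.

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha, r):
    """Frozen base projection plus the scaled low-rank update."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen 2x2 base weight
A = [[0.5, -0.5]]              # rank r = 1, shape 1x2
B_zero = [[0.0], [0.0]]        # fresh adapter: B starts at zero (assumed default)
B_trained = [[1.0], [0.0]]     # adapter after some hypothetical updates
x = [1.0, 0.0]

# A fresh adapter leaves the base model's output unchanged.
y_fresh = lora_forward(x, W, A, B_zero, alpha=2, r=1)
# Once B is non-zero, the output shifts by the scaled low-rank term.
y_adapted = lora_forward(x, W, A, B_trained, alpha=2, r=1)
print(y_fresh, y_adapted)
```

Because a freshly initialized adapter contributes nothing, a per-sample adapter can start from the base model's behavior and only drift as far as the test-time updates push it. Under the standard `alpha / r` scaling, settings such as `r=64` and `lora_alpha=128` give a scale factor of 2.0.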

🌟 Contributions in MODF-SIR

We propose MODF-SIR, a unified omni-modal reasoning framework that pioneers the application of multi-agent collaboration in the field of social intelligence reasoning. Our framework introduces dynamic strategy selection via a routing agent, enabling the model to adaptively determine whether to perform temporal grounding or direct reasoning based on input complexity.
We introduce the GRPO Grounder and the TTA Reviser. The video locator, implemented autoregressively, is trained with the GRPO algorithm, and the reasoning module is fine-tuned at test time using test-time adaptation and REINFORCE with a baseline. This gives the framework sample-level answering capabilities.
MODF-SIR achieves state-of-the-art results on three benchmarks: IntentBench, Daily-Omni, and WorldSense. Notably, it surpasses a range of commercial closed-source and open-source models, including GPT-4o and Gemini-2.5-Pro (think). Extensive ablations further confirm its effectiveness.
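
The REINFORCE-with-baseline idea mentioned in the contributions can be sketched on a toy problem. The example below is a hedged illustration, not the framework's training code: it uses a three-way softmax policy over candidate answers instead of a language model, and the names `reinforce_step` and `update_baseline`, the learning rate, and the score values are assumptions made for the example. The moving-average baseline update mirrors the form used in the demo script (`b = alpha * b + (1 - alpha) * score`).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, score, baseline, lr=0.5):
    """One REINFORCE-with-baseline update for a single sampled action."""
    probs = softmax(logits)
    advantage = score - baseline
    # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
    grad_logp = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
    # Gradient ascent on advantage * log pi(action).
    return [z + lr * advantage * g for z, g in zip(logits, grad_logp)]

def update_baseline(baseline, score, alpha=0.9):
    """Exponential moving average of past scores."""
    return alpha * baseline + (1.0 - alpha) * score

logits = [0.0, 0.0, 0.0]   # uniform policy over three candidate answers
baseline = 0.5
p_before = softmax(logits)[1]

# Answer 1 scored above the baseline: its probability should rise.
p_up = softmax(reinforce_step(logits, action=1, score=0.9, baseline=baseline))[1]
# Answer 1 scored below the baseline: its probability should fall.
p_down = softmax(reinforce_step(logits, action=1, score=0.1, baseline=baseline))[1]

new_baseline = update_baseline(baseline, 0.9)
print(p_before, p_up, p_down, new_baseline)
```

Subtracting the baseline means an answer is reinforced only when it scores better than recent answers did on average, which reduces the variance of the update compared with using the raw score. In the demo script the same idea appears as a negative log-likelihood loss multiplied by the advantage.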

💻 Code Repository

The code for MODF-SIR, including training and evaluation scripts, can be found on GitHub: https://github.com/eeee-sys/MODF-SIR

📈 Experimental Results

📍 Results

🚀 Quick Start

Install the environment

Clone the repository from GitHub.

git clone git@github.com:eeee-sys/MODF-SIR.git
cd MODF-SIR

Initialize the two conda environments: one for the GRPO grounder and one for the main pipeline.

conda create -n grpo_grounder python=3.11 -y
conda activate grpo_grounder
pip install -r src/requirements_grpo_grounder.txt

conda create -n modfsir_main python=3.10 -y
conda activate modfsir_main
pip install -r src/requirements_main.txt

Quick Inference Demo

The script below showcases how to perform inference with MODF-SIR's different roles. Please refer to our GitHub Repository for more details about this framework.

import gc
import json
import os
import subprocess
import sys
import tempfile

import torch

from transformers import (
    Qwen2_5OmniForConditionalGeneration,
    Qwen2_5OmniThinkerForConditionalGeneration,
    Qwen2_5OmniProcessor,
)
from peft import LoraConfig, get_peft_model, PeftModel

from qwen_omni_utils import process_mm_info

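# ============================================================
# Main Process (excerpt)
# NOTE: `args`, `SCRIPT_DIR`, `samples_to_process`, and the stage helpers
# (stage1_collector, stage2_planner, stage3_grounder, revise_answer, ...)
# are not shown in this excerpt; see the GitHub repository.
# ============================================================
def main():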
   
    # ---- Initialize Models ----
    print(f"\n[INIT] Loading Base Model ({args.base_model_path}) on {args.main_gpu}")
    base_model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        args.base_model_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
    ).to(args.main_gpu)
    base_processor = Qwen2_5OmniProcessor.from_pretrained(args.base_model_path)

    # Load Planner LoRA onto thinker submodule
    print("[INIT] Loading Planner LoRA onto base_model.thinker")
    base_model.thinker.load_adapter(args.planner_lora_path, adapter_name="planner")
    base_model.eval()

    print(f"[INIT] Loading HumanOmniV2 ({args.humanomni_path}) on {args.humanomni_gpu}")
    humanomni_model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
        args.humanomni_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
    ).to(args.humanomni_gpu)
    humanomni_processor = Qwen2_5OmniProcessor.from_pretrained(args.humanomni_path)

    lora_config = LoraConfig(
        r=64, lora_alpha=128,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    )
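    humanomni_model = get_peft_model(humanomni_model, lora_config, adapter_name="initial_dummy")

    # Let gradients flow through the inputs so gradient checkpointing works with frozen base weights.
    humanomni_model.enable_input_require_grads()
    humanomni_model.gradient_checkpointing_enable()

    # ---- Start the Grounder worker as a separate process ----
    print(f"[INIT] Starting Grounder process on {args.grounder_gpu}...")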
    grounder_script = os.path.join(SCRIPT_DIR, "grounder_worker_grpo.py")
    grounder_env = os.environ.copy()
    grounder_env["CUDA_VISIBLE_DEVICES"] = args.grounder_gpu.replace("cuda:", "")
    grounder_proc = subprocess.Popen([
        args.grounder_python, grounder_script,
        "--model_path", args.grounder_path,
        "--grpo_adapter_path", args.grpo_adapter_path,
        "--device", "cuda:0"
    ], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=None, text=True, bufsize=1,
       env=grounder_env)

    ready_line = grounder_proc.stdout.readline().strip()
    if not ready_line or json.loads(ready_line).get("status") != "ready":
        print("[ERROR] Grounder worker failed to start.")
        sys.exit(1)

    print("[INIT] All models ready!")
    os.makedirs(args.lora_save_dir, exist_ok=True)
    tmp_dir = tempfile.mkdtemp(prefix="idea3_reviser7b_")

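    # ---- Loop through dataset ----
    # NOTE: video_path, query, duration, and dataset_id are not defined in this excerpt;
    # they are expected to be derived from each `sample`.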
    for sample in samples_to_process:
        try:
            # ====== PLANNER STAGE ======
            # a) Collector Phase (LoRA disabled)
            base_model.thinker.set_adapter("planner")  # Ensure adapter is active before disabling
            base_model.thinker.disable_adapters()
            collector_text = stage1_collector(base_model.thinker, base_processor, video_path, query, args.main_gpu)
            print(f"[Collector output] {collector_text}")

            # b) Planner Phase (LoRA enabled)
            base_model.thinker.enable_adapters()
            (use_grounder, gnd_query), planner_raw = stage2_planner(base_model.thinker, base_processor, video_path, query, collector_text, args.main_gpu)
            print(f"[Planner output] {planner_raw}")
            print(f"[Planner] Use Grounder: {use_grounder} | query: {gnd_query}")

            # ====== GROUNDER STAGE ======
            generation_video = video_path
            grounded_span = None
            if use_grounder:
                pred_spans, success = stage3_grounder(grounder_proc, video_path, gnd_query or query, duration)
                print(f"[Grounder output] {pred_spans}")
                grounded_span = pred_spans[0]
                trim_path = os.path.join(tmp_dir, f"trim_{dataset_id}.mp4")
                trim_video_ffmpeg(video_path, grounded_span[0], grounded_span[1], trim_path)
                generation_video = trim_path
                print(f"[Grounder] Grounded to {grounded_span[0]:.1f}s - {grounded_span[1]:.1f}s")

            # ====== HUMANOMNI & REINFORCE STAGE ======
            humanomni_query = build_humanomni_query(sample)

            adapter_name = f"sample_{dataset_id}".replace(".", "_")
            humanomni_model.add_adapter(adapter_name, lora_config)
            humanomni_model.set_adapter(adapter_name)

            # Ensure adapter parameters require gradients
            for n, p in humanomni_model.named_parameters():
                if adapter_name in n:
                    p.requires_grad = True

            humanomni_model.train()

            trainable_params = [
                p for n, p in humanomni_model.named_parameters()
                if p.requires_grad and adapter_name in n
            ]
            optimizer = torch.optim.AdamW(trainable_params, lr=args.lr)

            b = args.b0
            best_score = -1
            best_answer = ""
            best_raw_resp = ""
            all_history = []
            early_stop = False

            for t in range(1, args.t_max + 1):
                gc.collect(); torch.cuda.empty_cache()

                humanomni_model.eval()
                inputs = get_humanomni_inputs(humanomni_processor, generation_video, humanomni_query, sample, args.humanomni_gpu)

                with torch.no_grad():
                    output_ids = humanomni_model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.85)

                generated_sequence = output_ids[0][inputs.input_ids.size(1):]
                y_t_text = humanomni_processor.decode(generated_sequence, skip_special_tokens=True)
                print(f"  [Iter {t}/{args.t_max}] Answer = {y_t_text}")

                base_model.thinker.disable_adapters()
                score_t, reviser_raw = revise_answer(base_model.thinker, base_processor, video_path, query, y_t_text, args.main_gpu)

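                all_history.append({"iter": t, "answer": y_t_text, "score": score_t, "reviser_raw": reviser_raw})

                # --- RL Update (REINFORCE with baseline) ---
                humanomni_model.train()
                optimizer.zero_grad()

                # Advantage = reviser score minus the running baseline b.
                advantage = float(score_t - b)
                advantage_tensor = torch.tensor([advantage], device=args.humanomni_gpu, dtype=torch.bfloat16)

                # NOTE: forward_kwargs is built elsewhere and is not shown in this excerpt.
                outputs = humanomni_model(**forward_kwargs)

                # Scale the NLL of the sampled answer by the advantage: a positive advantage
                # makes the answer more likely, a negative one makes it less likely.
                nll_loss = outputs.loss
                final_loss = nll_loss * advantage_tensor.detach()

                final_loss.backward()
                optimizer.step()

                # Update the baseline as an exponential moving average of the scores.
                b = args.alpha * b + (1.0 - args.alpha) * score_t

                # The rest of the loop, including the handler for the `try` above,
                # is not shown in this excerpt.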
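
The demo above talks to the grounder through a child process: it waits for a one-line JSON `{"status": "ready"}` message on the worker's stdout before sending work. The sketch below shows one way such a line-delimited JSON handshake can be wired up end to end. It is a hedged illustration only: the inline worker is a stand-in that performs no grounding, and the request and reply fields (`query`, `duration`, `spans`, `success`) as well as the helper names `start_worker` and `ask_worker` are assumptions for the example, not the repository's actual protocol.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the grounder worker: it announces readiness,
# then answers one JSON request per line until stdin is closed.
WORKER_SOURCE = r'''
import json, sys
print(json.dumps({"status": "ready"}), flush=True)
for line in sys.stdin:
    request = json.loads(line)
    # Placeholder "grounding": return the whole clip as a single span.
    reply = {"spans": [[0.0, float(request["duration"])]], "success": True}
    print(json.dumps(reply), flush=True)
'''

def start_worker():
    """Launch the worker and block until it reports that it is ready."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True, bufsize=1,
    )
    ready_line = proc.stdout.readline().strip()
    if not ready_line or json.loads(ready_line).get("status") != "ready":
        proc.kill()
        raise RuntimeError("worker failed to start")
    return proc

def ask_worker(proc, request):
    """Send one JSON request line and read back one JSON reply line."""
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

proc = start_worker()
reply = ask_worker(proc, {"query": "who speaks first?", "duration": 12.5})
proc.stdin.close()        # closing stdin ends the worker's read loop
proc.wait(timeout=10)
print(reply)
```

Running the worker in its own process lets it use a different Python interpreter and dependency set from the main pipeline, which fits the two separate conda environments created during installation.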

Video-Text-to-Text


Model tree for Harry-1234/MODF-SIR

Base model: Qwen/Qwen2-VL-7B
Finetuned: Qwen/Qwen2-VL-7B-Instruct