Using Multimodal Large Language Models for False Alarm Reduction in Image-based Fire Detection

https://doi.org/10.21203/rs.3.rs-8847038/v1

Existing vision-based methods suffer from high false-alarm rates in urban flame detection. Applying Multimodal Large Language Models (MLLMs) as a secondary filter shows great promise for reducing false alarms, yet MLLMs have high inference latency and are prone to reasoning collapse on negative samples without explicit Chain-of-Thought (CoT) guidance. To overcome these challenges, this study proposes Flash-Cascade, the first sub-second MLLM-based firewall that leverages CoT to efficiently filter false alarms. We deconstruct the flame-detection process into four logical stages (planning, observation, analysis, and judgment), which inform the design of three switchable reasoning modes (Detailed, Quick, and Rapid) that accelerate inference via CoT compression. We fine-tuned Qwen2-VL-7B-Instruct on a multi-grained instruction dataset via Low-Rank Adaptation (LoRA). This process internalizes explicit reasoning logic into implicit parameter representations, enabling the model to maintain robust reasoning capability even without explicit CoT guidance. On our newly constructed benchmark, which incorporates real-world hard negatives, Flash-Cascade achieves 97.79% accuracy and an F1-score of 0.9767 in Rapid mode, outperforming the baseline by 61.63 percentage points (pp) and 0.5152, respectively, and outperforming the state-of-the-art object detector DEIMv2 by 14.64 pp in accuracy. The method is highly sample-efficient, converging with only 600 samples and 2 epochs, and improves inference speed by 810% over standard CoT. This study opens the door to robust and efficient flame detection in high-interference scenarios.
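The two-stage cascade described above can be sketched in a few lines. This is a hypothetical illustration, not the released API: the function name `flash_cascade`, the `detector` / `mllm_answer` callables, and the keyword heuristics are all assumptions; only the three mode prompts come from the Quick Start example below.

```python
# Hypothetical sketch of the Flash-Cascade pipeline (names are assumptions):
# stage 1 is a cheap vision detector that raises candidate alarms;
# stage 2 is the fine-tuned MLLM acting as a firewall on those positives.

PROMPTS = {
    "detailed": "图像中是否存在火焰?详细分析。",  # "Is there a flame in the image? Analyze in detail."
    "quick": "图像中是否存在火焰?简单回答。",     # "Is there a flame in the image? Answer briefly."
    "rapid": "图像中是否存在火焰?快速回答。",     # "Is there a flame in the image? Answer quickly."
}

def flash_cascade(image, detector, mllm_answer, mode="rapid"):
    """Return True only when the detector fires AND the MLLM confirms a flame.

    `detector(image) -> bool` and `mllm_answer(image, prompt) -> str` are
    caller-supplied callables; the keyword checks below are placeholders.
    """
    if not detector(image):                      # stage 1: fast screening
        return False
    reply = mllm_answer(image, PROMPTS[mode]).lower()
    if "不存在" in reply or "没有" in reply or "no flame" in reply:
        return False                             # stage 2: MLLM vetoes the alarm
    return "存在" in reply or "yes" in reply     # alarm survives the firewall
```

Because the MLLM only runs on detector positives, its latency is paid rarely, which is what makes a sub-second firewall practical.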

1. Quick Start

  • Installation

For installation instructions, please refer to deepseek-ai/deepseek-vl-7b-chat.

  • Simple Inference Example

import torch
from transformers import AutoModelForCausalLM
from PIL import Image
from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images

# specify the path to the model
model_path = "" # */gaoqie/DeepSeekVL-7B-Chat-fire
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
vl_gpt = vl_gpt.eval()


def infer(img_path, mode="rapid"):
    # The three reasoning modes differ only in the prompt suffix:
    # Detailed ("详细分析" = "analyze in detail"),
    # Quick ("简单回答" = "answer briefly"),
    # Rapid ("快速回答" = "answer quickly").
    prompts = {
        "detailed": "图像中是否存在火焰?详细分析。",
        "quick": "图像中是否存在火焰?简单回答。",
        "rapid": "图像中是否存在火焰?快速回答。",
    }
    messages = [
        {
            "role": "User",
            "content": f"<image_placeholder>{prompts[mode]}",
            "images": [img_path]
        },
        {
            "role": "Assistant",
            "content": ""
        }
    ]

    # load images and prepare for inputs
    pil_images = load_pil_images(messages)

    prepare_inputs = vl_chat_processor(
        conversations=messages,
        images=pil_images,
        force_batchify=True
    ).to(vl_gpt.device)

    # run image encoder to get the image embeddings
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

    # run the model to get the response
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True
    )

    output_text = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)

    print(output_text)
    
image_path = ""  # path to the image to classify
infer(image_path)
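In a deployed cascade, the free-text reply printed above must be reduced to a binary verdict before it can suppress or confirm an alarm. A minimal parser is sketched below; the function name and keyword lists are assumptions for illustration, not something shipped with the model.

```python
def parse_flame_verdict(reply: str) -> bool:
    """Map a free-text model reply to a binary flame / no-flame decision.

    Keyword heuristic (an assumption for this sketch): negations are checked
    first, so "不存在火焰" ("no flame present") is not mistaken for the
    affirmative "存在火焰" ("flame present") it contains as a substring.
    """
    text = reply.strip().lower()
    negatives = ("不存在", "没有火焰", "无火焰", "no flame", "no fire")
    if any(neg in text for neg in negatives):
        return False
    positives = ("存在火焰", "有火焰", "yes", "flame", "fire")
    return any(pos in text for pos in positives)
```

Checking negations before affirmatives matters most in Rapid mode, where the reply is short and a single substring match decides the verdict.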

2. License

This code repository is licensed under the Apache license 2.0.

3. Citation

If you use this model, please cite the associated preprint: https://doi.org/10.21203/rs.3.rs-8847038/v1
