Using Multimodal Large Language Models for False Alarm Reduction in Image-based Fire Detection
Existing vision-based methods suffer from high false-alarm rates in urban flame detection. Applying Multimodal Large Language Models (MLLMs) as a secondary filter shows great potential for reducing false alarms, yet MLLMs have high inference latency and are prone to reasoning collapse on negative samples without explicit Chain-of-Thought (CoT) guidance. To overcome these challenges, this study proposes Flash-Cascade, the first sub-second MLLM-based firewall that leverages CoT to filter false alarms efficiently. We deconstruct the flame-detection process into four logical stages (planning, observation, analysis, and judgment), which inform the design of three switchable reasoning modes (Detailed, Quick, and Rapid) that accelerate inference via CoT compression. We fine-tuned Qwen2-VL-7B-Instruct on a multi-grained instruction dataset via Low-Rank Adaptation (LoRA). This process internalizes explicit reasoning logic into implicit parameter representations, enabling the model to maintain robust reasoning even without explicit CoT guidance. On our newly constructed benchmark incorporating real-world hard negatives, Flash-Cascade achieves 97.79% accuracy and an F1-score of 0.9767 in Rapid mode, outperforming the baseline by 61.63 percentage points (pp) and 0.5152, respectively. Furthermore, it outperforms the state-of-the-art object detector DEIMv2 by 14.64 pp in accuracy. The method is highly sample-efficient, converging with only 600 samples and 2 epochs, and improves inference speed by 810% over standard CoT. This work opens the door to robust and efficient flame detection in high-interference scenarios.
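The cascade described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: every name here is hypothetical, and `mllm_verify` is a stub standing in for a call to the fine-tuned Qwen2-VL model (e.g. the `infer` function in the Quick Start below).

```python
# Minimal sketch of a detector -> MLLM cascade for false-alarm filtering.
# All names are hypothetical; mllm_verify stands in for the fine-tuned
# Qwen2-VL model's yes/no judgment on a candidate region.

def detector_propose(image):
    # Stage 1: a fast object detector returns candidate flame boxes with scores.
    # Stubbed with fixed candidates for illustration.
    return [
        {"box": (10, 10, 50, 50), "score": 0.91},   # a true flame
        {"box": (80, 20, 120, 60), "score": 0.55},  # a red lantern (hard negative)
    ]

def mllm_verify(image, box):
    # Stage 2: ask the MLLM whether the candidate region really contains flame.
    # Stub: pretend only the first box is a real flame.
    return box == (10, 10, 50, 50)

def cascade(image, score_threshold=0.5):
    # Raise an alarm only for candidates that pass BOTH the detector's score
    # threshold and the MLLM secondary filter.
    alarms = []
    for cand in detector_propose(image):
        if cand["score"] >= score_threshold and mllm_verify(image, cand["box"]):
            alarms.append(cand)
    return alarms

print(len(cascade("frame.jpg")))  # 1 -- the hard negative is filtered out
```

The point of the cascade is that the expensive MLLM call only runs on the detector's (few) candidate alarms, which is what makes sub-second secondary filtering feasible.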
1. Quick Start
For installation instructions, please refer to Qwen/Qwen2-VL-2B-Instruct.
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_dir = ""  # */gaoqie/Qwen2VL-2B-Instruct-fire
device = "cuda:0"

# Default: load the model on the specified device
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="bfloat16", device_map=device
)

# Default processor
processor = AutoProcessor.from_pretrained(model_dir)
# The default range for the number of visual tokens per image is 4-16384.
# You can set min_pixels and max_pixels according to your needs, e.g. a token
# count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

# Prompts for the three reasoning modes. Keep the Chinese text as-is: the model
# was fine-tuned on these instructions. English glosses:
#   detailed (Mode 1): "Is there a flame in the image? Analyze in detail."
#   quick    (Mode 2): "Is there a flame in the image? Answer briefly."
#   rapid    (Mode 3): "Is there a flame in the image? Answer quickly."
PROMPTS = {
    "detailed": "图像中是否存在火焰?详细分析。",
    "quick": "图像中是否存在火焰?简单回答。",
    "rapid": "图像中是否存在火焰?快速回答。",
}

def infer(img_path, mode="rapid"):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": img_path},
                {"type": "text", "text": PROMPTS[mode]},
            ],
        }
    ]

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Inference: generate the answer, then strip the prompt tokens
    generated_ids = model.generate(**inputs, max_new_tokens=500)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0]

image_path = ""
print(infer(image_path))
```
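In Quick and Rapid modes the model returns a short free-text answer in Chinese. A downstream alarm system needs a boolean, so a small post-processing helper can be useful. The sketch below is hypothetical: the keyword lists are assumptions about likely phrasings, not a documented output format of this checkpoint, so adjust them to the answers your model actually emits.

```python
def parse_flame_answer(output_text):
    # Hypothetical post-processing: map the model's short Chinese answer to a
    # boolean. Keywords are assumptions, not a documented output contract:
    #   "存在火焰" = "flame present"; "没有"/"不存在"/"否" = negations.
    negatives = ("没有", "不存在", "无火焰", "否")
    if any(kw in output_text for kw in negatives):
        return False
    return "火焰" in output_text or "是" in output_text

print(parse_flame_answer("图像中存在火焰。"))  # True
print(parse_flame_answer("图像中没有火焰。"))  # False
```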
2. License
This code repository is licensed under the Apache License 2.0.
3. Citation