Using Multimodal Large Language Models for False Alarm Reduction in Image-based Fire Detection
Existing vision-based methods suffer from high false-alarm rates in urban flame detection. Applying Multimodal Large Language Models (MLLMs) as a secondary filter shows great potential for reducing false alarms, yet MLLMs have high inference latency and are prone to reasoning collapse on negative samples without explicit Chain-of-Thought (CoT) guidance. To overcome these challenges, this study proposes Flash-Cascade, the first sub-second MLLM-based firewall that leverages CoT to filter false alarms efficiently. We deconstruct the flame-detection process into four logical stages (planning, observation, analysis, and judgment), which inform the design of three switchable reasoning modes (Detailed, Quick, and Rapid) that accelerate inference via CoT compression. We fine-tuned Qwen2-VL-7B-Instruct on a multi-grained instruction dataset via Low-Rank Adaptation (LoRA). This process internalizes explicit reasoning logic into implicit parameter representations, enabling the model to maintain robust reasoning even without explicit CoT guidance. On our newly constructed benchmark incorporating real-world hard negatives, Flash-Cascade achieves 97.79% accuracy and an F1-score of 0.9767 in Rapid mode, outperforming the baseline by 61.63 percentage points (pp) and 0.5152, respectively. Furthermore, it outperforms the state-of-the-art object detector DEIMv2 by 14.64 pp in accuracy. The method is highly sample-efficient, converging with only 600 samples and 2 epochs, and improves inference speed by 810% over standard CoT. This work opens the door to robust and efficient flame detection in high-interference scenarios.
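The cascade described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: every name here is hypothetical, and `mllm_verify` is a stub standing in for a call to the fine-tuned Qwen2-VL model (e.g. the `infer` function in the Quick Start below).

```python
# Minimal sketch of a detector -> MLLM cascade for false-alarm filtering.
# All names are hypothetical; mllm_verify stands in for the fine-tuned
# Qwen2-VL model's yes/no judgment on a candidate region.

def detector_propose(image):
    # Stage 1: a fast object detector returns candidate flame boxes with scores.
    # Stubbed with fixed candidates for illustration.
    return [
        {"box": (10, 10, 50, 50), "score": 0.91},   # a true flame
        {"box": (80, 20, 120, 60), "score": 0.55},  # a red lantern (hard negative)
    ]

def mllm_verify(image, box):
    # Stage 2: ask the MLLM whether the candidate region really contains flame.
    # Stub: pretend only the first box is a real flame.
    return box == (10, 10, 50, 50)

def cascade(image, score_threshold=0.5):
    # Raise an alarm only for candidates that pass BOTH the detector's score
    # threshold and the MLLM secondary filter.
    alarms = []
    for cand in detector_propose(image):
        if cand["score"] >= score_threshold and mllm_verify(image, cand["box"]):
            alarms.append(cand)
    return alarms

print(len(cascade("frame.jpg")))  # 1 -- the hard negative is filtered out
```

The point of the cascade is that the expensive MLLM call only runs on the detector's (few) candidate alarms, which is what makes sub-second secondary filtering feasible.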
1. Quick Start
For installation instructions, please refer to Qwen/Qwen2-VL-2B-Instruct.
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_dir = ""  # */gaoqie/Qwen2VL-2B-Instruct-fire
device = "cuda:0"

# Default: load the model on the specified device
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="bfloat16", device_map=device
)

# Default processor
processor = AutoProcessor.from_pretrained(model_dir)
# The default range for the number of visual tokens per image is 4-16384.
# You can set min_pixels and max_pixels according to your needs, e.g. a token
# count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

# Prompts for the three reasoning modes. Keep the Chinese text as-is: the model
# was fine-tuned on these instructions. English glosses:
#   detailed (Mode 1): "Is there a flame in the image? Analyze in detail."
#   quick    (Mode 2): "Is there a flame in the image? Answer briefly."
#   rapid    (Mode 3): "Is there a flame in the image? Answer quickly."
PROMPTS = {
    "detailed": "图像中是否存在火焰?详细分析。",
    "quick": "图像中是否存在火焰?简单回答。",
    "rapid": "图像中是否存在火焰?快速回答。",
}

def infer(img_path, mode="rapid"):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": img_path},
                {"type": "text", "text": PROMPTS[mode]},
            ],
        }
    ]

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Inference: generate the answer, then strip the prompt tokens
    generated_ids = model.generate(**inputs, max_new_tokens=500)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0]

image_path = ""
print(infer(image_path))
```
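In Quick and Rapid modes the model returns a short free-text answer in Chinese. A downstream alarm system needs a boolean, so a small post-processing helper can be useful. The sketch below is hypothetical: the keyword lists are assumptions about likely phrasings, not a documented output format of this checkpoint, so adjust them to the answers your model actually emits.

```python
def parse_flame_answer(output_text):
    # Hypothetical post-processing: map the model's short Chinese answer to a
    # boolean. Keywords are assumptions, not a documented output contract:
    #   "存在火焰" = "flame present"; "没有"/"不存在"/"否" = negations.
    negatives = ("没有", "不存在", "无火焰", "否")
    if any(kw in output_text for kw in negatives):
        return False
    return "火焰" in output_text or "是" in output_text

print(parse_flame_answer("图像中存在火焰。"))  # True
print(parse_flame_answer("图像中没有火焰。"))  # False
```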
2. License
This code repository is licensed under the Apache License 2.0.
3. Citation