SpeechGuard Fusion: DCA + FiLM Module

This module is part of the SpeechGuard AI system submitted to the Samsung EnnovateX AX Hackathon 2026.

Model Description

A Deep Cross-Attention (DCA) + FiLM (Feature-wise Linear Modulation) fusion module that jointly processes keyword-spotting (KWS) features and speaker-verification (SV) embeddings.

It is the primary architectural contribution of SpeechGuard AI.

Architecture

  • Input Q: KWS features (B, T, 16) from BC-ResNet-8
  • Input K,V: SV embedding (B, 192) from ECAPA-TDNN
  • Cross-attention: 4 heads, 64-dim attention space
  • FiLM conditioning: speaker d-vector modulates KWS features
  • Output: fused score in [0, 1]
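The data flow listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SpeechGuard implementation: the class name `FusionSketch`, the `n_sv_tokens` parameter, the layer layout, the order of attention and FiLM, and the way `cosine_sim` enters the scoring head are all hypothetical. In particular, a single 192-dim key/value token would make the attention weights trivially 1, so the sketch splits the SV embedding into several tokens; how the real module handles this is not stated here. The parameter count of the sketch is not expected to match the figure reported for the actual module.

```python
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    """Illustrative cross-attention + FiLM fusion (hypothetical layout)."""

    def __init__(self, d_kws=16, d_sv=192, d_attn=64, n_heads=4, n_sv_tokens=8):
        super().__init__()
        assert d_sv % n_sv_tokens == 0, "d_sv must split evenly into tokens"
        self.n_sv_tokens = n_sv_tokens  # assumption: chunk the SV embedding into tokens
        self.q_proj = nn.Linear(d_kws, d_attn)                 # KWS features -> queries
        self.kv_proj = nn.Linear(d_sv // n_sv_tokens, d_attn)  # SV tokens -> keys/values
        self.attn = nn.MultiheadAttention(d_attn, n_heads, batch_first=True)
        self.film = nn.Linear(d_sv, 2 * d_attn)                # speaker -> (gamma, beta)
        self.head = nn.Linear(d_attn + 1, 1)                   # +1 for cosine_sim

    def forward(self, kws_features, sv_embedding, cosine_sim):
        batch = kws_features.shape[0]
        q = self.q_proj(kws_features)                                   # (B, T, d_attn)
        sv_tokens = sv_embedding.reshape(batch, self.n_sv_tokens, -1)   # (B, N, d_sv / N)
        kv = self.kv_proj(sv_tokens)                                    # (B, N, d_attn)
        attended, _ = self.attn(q, kv, kv)                              # (B, T, d_attn)

        # FiLM: per-channel scale and shift derived from the speaker embedding.
        gamma, beta = self.film(sv_embedding).chunk(2, dim=-1)          # (B, d_attn) each
        modulated = gamma.unsqueeze(1) * attended + beta.unsqueeze(1)   # (B, T, d_attn)

        pooled = modulated.mean(dim=1)                                  # average over time
        logit = self.head(torch.cat([pooled, cosine_sim.unsqueeze(-1)], dim=-1))
        return {"fused_score": torch.sigmoid(logit).squeeze(-1)}        # (B,), in [0, 1]


sketch = FusionSketch().eval()
with torch.no_grad():
    out = sketch(torch.randn(1, 20, 16), torch.randn(1, 192), torch.tensor([0.7]))
print(out["fused_score"].shape)  # torch.Size([1])
```

The sigmoid at the end is what keeps the score in [0, 1]; everything before it is one plausible arrangement of the listed components.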

Performance

Metric          Value
Parameters      56,706
Latency (CPU)   0.1 ms
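Figures like these can be checked with a few lines of PyTorch. The sketch below is one common way to count trainable parameters and time a CPU forward pass. The helper names `count_parameters` and `cpu_latency_ms` are hypothetical, the measurement protocol behind the numbers above is not stated, and a small `nn.Linear` stands in for the fusion module so the example runs on its own.

```python
import time

import torch
import torch.nn as nn


def count_parameters(module: nn.Module) -> int:
    """Number of trainable parameters in the module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


def cpu_latency_ms(module: nn.Module, inputs: tuple, warmup: int = 10, runs: int = 100) -> float:
    """Mean wall-clock time per forward pass on CPU, in milliseconds."""
    module.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up passes are not timed
            module(*inputs)
        start = time.perf_counter()
        for _ in range(runs):
            module(*inputs)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs


stand_in = nn.Linear(16, 64)             # placeholder for the real module
print(count_parameters(stand_in))        # 16 * 64 + 64 = 1088
print(cpu_latency_ms(stand_in, (torch.randn(1, 20, 16),)))
```

Latency depends heavily on hardware, thread count, and input length, so measured values are only comparable under the same conditions.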

Usage

import torch
from speechguard.fusion.dca import DCAFusionModule

module = DCAFusionModule(d_kws=16, d_sv=192, d_attn=64)
# Load weights from checkpoint if available

kws_features = torch.randn(1, 20, 16)   # (B, T, 16) KWS features
sv_embedding = torch.randn(1, 192)      # (B, 192) SV embedding
cosine_sim   = torch.tensor([0.7])      # (B,) cosine similarity score

result = module(kws_features, sv_embedding, cosine_sim)
print(result["fused_score"])   # tensor([0.XXXX])

Citation

Samsung EnnovateX AX Hackathon 2026, Problem #04
Team: Placecomm Prophets (IIT Kharagpur)
