CAM++ MLX Converter - Architecture Analysis & Findings
Date: 2026-01-16
Status: Critical Architecture Mismatch Identified
Executive Summary
This document summarizes the comprehensive evaluation, fixes, and critical findings from analyzing the CAM++ MLX Converter project. The most significant discovery is a fundamental architecture mismatch between the expected MLX model structure and the actual ModelScope CAM++ models.
Work Completed
Phase 1: Code Evaluation & Fixes
Performed deep code analysis and fixed 19 critical issues across the codebase:
- Fixed broken quantization - Implemented custom int8 symmetric quantization (sketched below)
- Comprehensive parameter mapping - Complete rewrite with 200+ lines of detailed mappings
- Shape-based layer detection - Automatic detection of Conv1d vs Linear based on tensor dimensions
- Conv1d kernel_size=1 handling - Special case for Conv1d used as Linear layers
- Security validation - Added repository safety checks
- Memory optimization - Removed redundant .copy() calls
- Error handling improvements - Standardized exception handling
- Type hints - Added comprehensive type annotations throughout
- Code organization - Extracted magic numbers to constants
- Documentation - Created FIXES_SUMMARY.md and CHANGES.md
See FIXES_SUMMARY.md for complete details of all fixes.
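For reference, a minimal sketch of the int8 symmetric quantization scheme the fix implements, written here against NumPy rather than the project's own helpers (function names are illustrative, not the converter's API):

```python
import numpy as np

def quantize_int8_symmetric(weights: np.ndarray):
    """Symmetric int8 quantization: the max |value| maps to 127."""
    scale = float(np.abs(weights).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for all-zero tensors
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8_symmetric(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale
```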
Phase 2: CLI Development
Created complete command-line interface for batch conversions:
convert_cli.py - Standalone CLI tool (8.7KB; example invocation at the end of this phase)
- Argument parsing with --input, --output, --token flags
- Quantization control: --q2, --q4, --q8, --no-quantization
- Dry-run mode for testing: --dry-run
- Verbose logging: --verbose
- Exit code handling for scripts
batch_convert.sh - Bash batch script (2.3KB)
- Preset model configurations
- Progress tracking
- Summary reports
batch_convert.py - Python batch script (5.9KB)
- Cross-platform compatibility
- Flexible model configuration
- Error handling per model
CLI_README.md - Comprehensive documentation (11KB)
- Usage examples
- Troubleshooting guide
- Best practices
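As an illustration of the convert_cli.py flags listed above, a hedged example of driving the tool from Python (the model ID and output directory are placeholders; consult CLI_README.md for authoritative usage):

```python
import subprocess

# Hypothetical invocation; flag names taken from the list above.
cmd = [
    "python", "convert_cli.py",
    "--input", "iic/speech_campplus_sv_zh-cn_16k-common",  # example model ID
    "--output", "./mlx_output",                            # placeholder output dir
    "--q8",        # int8 quantization
    "--verbose",
]
result = subprocess.run(cmd)
print(f"convert_cli.py exited with code {result.returncode}")
```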
Phase 3: Model Structure Investigation
Created inspect_model.py to analyze actual ModelScope models and discovered critical architecture differences.
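For context, a minimal sketch of the kind of inspection inspect_model.py performs, assuming the downloaded checkpoint is an ordinary PyTorch state dict (the file name below is a placeholder):

```python
import torch

# Load the checkpoint on CPU; the file name is a placeholder.
state_dict = torch.load("campplus_checkpoint.bin", map_location="cpu")

# Print every parameter name with its shape to reveal the real layer layout.
for name, tensor in state_dict.items():
    print(f"{name:60s} {tuple(tensor.shape)}")

print(f"Total parameter tensors: {len(state_dict)}")
```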
Critical Finding: Architecture Mismatch
Expected MLX Architecture
The mlx_campp.py implementation expects:
```
Input Layer (TDNN)
    ↓
Dense Block 0 (4 layers) → Transition 0
    ↓
Dense Block 1 (6 layers) → Transition 1
    ↓
Dense Block 2 (8 layers)
    ↓
CAM Layer (SEPARATE, SHARED)
├── context_conv1 (1x1 convolution)
├── context_conv3 (3x3 convolution)
├── context_conv5 (5x5 convolution)
├── mask_conv (masking convolution)
├── fusion (feature fusion)
└── bn (batch normalization)
    ↓
Channel Gating (3-layer FC)
├── layers.0 (reduction)
├── layers.1 (hidden)
└── layers.2 (expansion)
    ↓
Multi-Granularity Pooling
├── attention_weights
└── projection
    ↓
Final BatchNorm
```
Actual ModelScope Architecture
Model: iic/speech_campplus_sv_zh-cn_16k-common (937 parameter tensors)
```
Input Layer (TDNN)
    ↓
Block 1 (4 layers: tdnnd1-4)
├── tdnnd1
│   ├── linear1 (main convolution)
│   ├── nonlinear1.batchnorm
│   └── cam_layer (EMBEDDED)
│       ├── linear1 (1x1 conv + bias)
│       ├── linear2 (1x1 conv + bias)
│       └── linear_local (3x3 conv, no bias)
├── tdnnd2 (same structure with CAM)
├── tdnnd3 (same structure with CAM)
└── tdnnd4 (same structure with CAM)
    ↓
Transit1
    ↓
Block 2 (9 layers: tdnnd1-9, each with embedded CAM)
    ↓
Transit2
    ↓
Block 3 (16 layers: tdnnd1-16, each with embedded CAM)
    ↓
Dense Layer (single Conv1d with kernel_size=1)
    ↓
[END - No pooling, no attention, no output layers]
```
Key Differences
| Component | MLX Model Expects | ModelScope Has | Impact |
|---|---|---|---|
| CAM Layer | ONE shared layer after blocks | EMBEDDED in each layer | ⚠️ Critical |
| CAM Structure | 1x1, 3x3, 5x5 convs + fusion + BN | 1x1, 1x1, 3x3 convs only | ⚠️ Major |
| Block Sizes | 4, 6, 8 layers | 4, 9, 16 layers | ⚠️ Mismatch |
| Channel Gating | 3-layer FC network | Single Conv1d layer | ⚠️ Major |
| Pooling | Multi-granularity with attention | None (feature extractor) | ⚠️ Critical |
| Output Layer | Separate attention + projection | None | ⚠️ Critical |
Parameter Mapping Strategy
Given the architecture mismatch, the current mapping strategy is as follows (a code sketch of the name translation appears after item 6):
1. Input Layer
✅ WORKING - Maps correctly
xvector.tdnn.* → input_conv.* / input_bn.*
2. Dense Blocks
⚠️ PARTIAL - Maps the first N layers from each block
xvector.block1.tdnnd{1-4}.* → dense_blocks.0.layers.{0-3}.*
xvector.block2.tdnnd{1-6}.* → dense_blocks.1.layers.{0-5}.* (skips tdnnd7-9)
xvector.block3.tdnnd{1-8}.* → dense_blocks.2.layers.{0-7}.* (skips tdnnd9-16)
3. Transition Layers
✅ WORKING - Maps correctly
xvector.transit1.* → transitions.0.*
xvector.transit2.* → transitions.1.*
4. CAM Layers
⚠️ COMPROMISE - Uses first occurrence only
xvector.block1.tdnnd1.cam_layer.* → cam.*
(All other embedded CAM layers skipped)
Mapping:
- cam_layer.linear1 → cam.context_conv1 (1x1 → 1x1) ✅
- cam_layer.linear2 → cam.context_conv3 (1x1 → 3x3) ⚠️ Shape mismatch
- cam_layer.linear_local → cam.mask_conv (3x3 → 3x3) ✅
- No mapping for cam.context_conv5 ❌ (doesn't exist in source)
- No mapping for cam.fusion ❌ (doesn't exist in source)
- No mapping for cam.bn ❌ (doesn't exist in source)
5. Channel Gating
⚠️ INCOMPLETE - Only first layer mapped
xvector.dense.linear.* → channel_gating.fc.layers.0.*
(Conv1d kernel_size=1 automatically squeezed to 2D)
Missing in source model:
- channel_gating.fc.layers.1.* ❌
- channel_gating.fc.layers.2.* ❌
6. Pooling & Output
❌ MISSING - Not present in source model
- pooling.attention_weights.* ❌
- pooling.projection.* ❌
- final_bn.* ❌
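To make the rules above concrete, a hedged sketch of the name translation in plain Python (regex patterns and sub-module handling are illustrative; the authoritative logic lives in conversion_utils.py):

```python
import re
from typing import Optional

# Layers per dense block that the MLX model expects (see the diagram above).
MLX_BLOCK_SIZES = [4, 6, 8]

# Sub-module renaming for the single CAM layer that is kept.
CAM_SUBMAP = {
    "linear1": "context_conv1",
    "linear2": "context_conv3",   # 1x1 -> 3x3, shape mismatch noted above
    "linear_local": "mask_conv",
}

def map_parameter_name(name: str) -> Optional[str]:
    """Translate a ModelScope xvector parameter name to an MLX target name.
    Returns None for parameters that are intentionally skipped.
    (Sub-module names inside each layer are passed through for brevity.)"""
    m = re.match(r"xvector\.block(\d+)\.tdnnd(\d+)\.(.+)", name)
    if m:
        block, layer, rest = int(m.group(1)), int(m.group(2)), m.group(3)
        if "cam_layer." in rest:
            if block == 1 and layer == 1:
                sub, _, tail = rest.partition("cam_layer.")[2].partition(".")
                target = CAM_SUBMAP.get(sub)
                return f"cam.{target}.{tail}" if target else None
            return None  # every other embedded CAM layer is skipped
        if layer > MLX_BLOCK_SIZES[block - 1]:
            return None  # source block is deeper than the MLX block
        return f"dense_blocks.{block - 1}.layers.{layer - 1}.{rest}"
    m = re.match(r"xvector\.transit(\d+)\.(.+)", name)
    if m:
        return f"transitions.{int(m.group(1)) - 1}.{m.group(2)}"
    return None  # input layer and xvector.dense are handled by separate rules
```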
Technical Improvements Applied
1. Shape-Based Layer Type Detection
Problem: xvector.dense.linear.weight has shape (192, 1024, 1) (Conv1d) but named like Linear layer.
Solution: (conversion_utils.py:466-475)
```python
# Override layer type based on actual tensor shape
if numpy_tensor.ndim == 3:
    layer_type = 'conv1d'  # 3D must be Conv1d
elif numpy_tensor.ndim == 2 and layer_type == 'conv1d':
    layer_type = 'linear'  # 2D can't be Conv1d
```
2. Conv1d kernel_size=1 Handling
Problem: Conv1d with kernel_size=1 used as Linear but MLX expects 2D.
Solution: (conversion_utils.py:529-533)
```python
if kernel_size == 1:
    logger.debug(
        f"Converting Conv1d(kernel_size=1) to Linear: {weight.shape} -> {weight.shape[:2]}"
    )
    return weight.squeeze(-1)  # Remove the trailing kernel dimension
```
3. Embedded CAM Layer Handling
Problem: Real models have CAM in each layer, MLX model has one shared CAM.
Solution: (conversion_utils.py:321-325)
```python
is_first_cam = 'block1.tdnnd1.cam_layer' in xvector_name
if not is_first_cam:
    logger.debug(f"Skipping embedded CAM (only using first): {xvector_name}")
    return None  # Skip all except the first CAM layer
```
4. Comprehensive Warnings
Added detailed logging for:
- Skipped embedded CAM layers
- Missing pooling/attention layers
- Conv1d kernel_size=1 conversions
- Unexpected layer types
- Shape mismatches
Conversion Status
Based on iic/speech_campplus_sv_zh-cn_16k-common model:
Parameters Successfully Mapped
- ✅ Input layer (5 parameters)
- ✅ Dense blocks (partial - ~100 parameters)
- ✅ Transition layers (~10 parameters)
- ⚠️ First CAM layer only (~6 parameters)
- ⚠️ Channel gating first layer only (~2 parameters)
Parameters Missing in Source
- ❌ Additional CAM context paths (~10 parameters)
- ❌ CAM fusion layer (~2 parameters)
- ❌ CAM batch normalization (~4 parameters)
- ❌ Channel gating layers 1-2 (~4 parameters)
- ❌ Pooling attention weights (~2 parameters)
- ❌ Pooling projection (~2 parameters)
- ❌ Final batch normalization (~4 parameters)
Estimated mapping coverage: ~60-70%
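For reference, a coverage figure like this can be computed mechanically once a mapping function and the MLX model's expected parameter names are available (both are hypothetical placeholders here, e.g. the map_parameter_name sketch above):

```python
def estimate_coverage(source_names, expected_mlx_names, mapper):
    """Fraction of the MLX model's expected parameters that receive a source tensor."""
    mapped_targets = {mapper(n) for n in source_names}
    mapped_targets.discard(None)
    covered = [n for n in expected_mlx_names if n in mapped_targets]
    return len(covered) / len(expected_mlx_names)

# Example (hypothetical name lists):
# coverage = estimate_coverage(state_dict.keys(), mlx_param_names, map_parameter_name)
# print(f"Mapping coverage: {coverage:.0%}")
```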
Recommendations
Option 1: Rewrite MLX Model (RECOMMENDED)
Create new mlx_campp_modelscope.py that matches actual architecture:
```python
class CAMPPModelScope:
    def __init__(self, config):
        self.input_layer = TDNNLayer(...)
        # Dense blocks with embedded CAM
        self.block1 = [TDNNLayerWithCAM(...) for _ in range(4)]
        self.transit1 = TransitionLayer(...)
        self.block2 = [TDNNLayerWithCAM(...) for _ in range(9)]
        self.transit2 = TransitionLayer(...)
        self.block3 = [TDNNLayerWithCAM(...) for _ in range(16)]
        # Simple feature extraction (no pooling)
        self.dense = Conv1d(kernel_size=1)
```
Advantages:
- ✅ Perfect parameter mapping (100% coverage)
- ✅ Matches actual model architecture
- ✅ No missing parameters
- ✅ Can use all embedded CAM layers
Disadvantages:
- ⚠️ Requires significant development effort
- ⚠️ Different API than current model
Option 2: Find Compatible Models
Search for ModelScope models that match current MLX architecture:
- Single shared CAM layer after blocks
- Multi-granularity pooling with attention
- 3-layer channel gating
Advantages:
- ✅ No code changes needed
- ✅ Existing MLX model works as-is
Disadvantages:
- ⚠️ May not exist
- ⚠️ Limited model selection
Option 3: Hybrid Approach
Keep current MLX model for models that match, create new variant for ModelScope models:
mlx_campp.py → Original architecture (with pooling)
mlx_campp_simple.py → ModelScope architecture (feature extractor)
Update the converter to detect the architecture and choose the appropriate model (a detection sketch follows the pros and cons below).
Advantages:
- ✅ Supports both architectures
- ✅ Maximum flexibility
- ✅ Backward compatible
Disadvantages:
- ⚠️ More complex codebase
- ⚠️ Need architecture detection logic
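As noted above, Option 3 needs an architecture-detection step. A hedged sketch: the two layouts can be told apart from parameter names alone, since embedded per-layer CAM and the absence of pooling only occur in the ModelScope checkpoints (the class names in the usage comment are hypothetical):

```python
def detect_architecture(param_names) -> str:
    """Classify a checkpoint as 'modelscope' (embedded CAM per layer, no pooling)
    or 'original' (single shared CAM plus multi-granularity pooling)."""
    names = list(param_names)
    has_embedded_cam = any(".tdnnd" in n and ".cam_layer." in n for n in names)
    has_pooling = any(n.startswith("pooling.") for n in names)
    if has_embedded_cam and not has_pooling:
        return "modelscope"
    return "original"

# Example (class names hypothetical):
# arch = detect_architecture(state_dict.keys())
# model = CAMPPModelScope(config) if arch == "modelscope" else CAMPP(config)
```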
Files Modified/Created
Modified Files
- app.py (593 → 667 lines) - Added constants, security validation, context managers
- conversion_utils.py (795 → 883 lines) - Complete parameter mapping rewrite, shape detection
- mlx_campp.py (413 lines) - No changes (architecture mismatch identified)
Created Files
- FIXES_SUMMARY.md - Detailed breakdown of all 19 fixes
- CHANGES.md - Quick reference with before/after examples
- convert_cli.py - Standalone CLI tool (8.7KB)
- batch_convert.sh - Bash batch script (2.3KB)
- batch_convert.py - Python batch script (5.9KB)
- CLI_README.md - CLI documentation (11KB)
- inspect_model.py - Model structure analysis tool
- test_real_conversion.py - Real model conversion test
- ARCHITECTURE_ANALYSIS.md - This document
Next Steps
Immediate: Decide on architectural approach (Option 1, 2, or 3)
If Option 1 (Rewrite MLX Model):
- Design new MLX model matching ModelScope architecture
- Implement TDNNLayerWithCAM class
- Update forward pass logic
- Test with real models
- Update conversion mapping
If Option 2 (Find Compatible Models):
- Survey ModelScope CAM++ models
- Test with different model variants
- Document compatible models
If Option 3 (Hybrid):
- Implement architecture detection
- Create mlx_campp_simple.py
- Update converter routing logic
- Add model selection to CLI
Testing:
- End-to-end conversion test
- Inference accuracy validation
- Performance benchmarking
- Compare PyTorch vs MLX outputs
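For the last testing item, a minimal sketch of comparing PyTorch and MLX embeddings on the same input (model loading is omitted; the variable names are placeholders):

```python
import numpy as np

def compare_outputs(torch_out: np.ndarray, mlx_out: np.ndarray, atol: float = 1e-3):
    """Report max absolute difference and cosine similarity between two embeddings."""
    max_diff = float(np.max(np.abs(torch_out - mlx_out)))
    cos = float(np.dot(torch_out.ravel(), mlx_out.ravel()) /
                (np.linalg.norm(torch_out) * np.linalg.norm(mlx_out) + 1e-12))
    print(f"max |diff| = {max_diff:.6f}, cosine similarity = {cos:.6f}")
    return max_diff <= atol

# Example (placeholder arrays standing in for real model outputs):
# ok = compare_outputs(pytorch_embedding, np.array(mlx_embedding))
```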
Conclusion
The CAM++ MLX Converter project has received comprehensive improvements in code quality, error handling, and CLI capabilities. However, a critical architecture mismatch was discovered between the MLX model implementation and actual ModelScope CAM++ models.
The current parameter mapping achieves ~60-70% coverage through intelligent workarounds (using first CAM occurrence, squeezing Conv1d dimensions, etc.), but full conversion requires either rewriting the MLX model to match ModelScope architecture or finding models that match the current MLX architecture.
The project is not production-ready until this architectural discrepancy is resolved.
Prepared by: Claude Code Session Date: 2026-01-16