Upload folder using huggingface_hub
Browse files- README.md +80 -0
- config.json +49 -0
- model.safetensors +3 -0
- preprocessor_config.json +11 -0
README.md
ADDED
|
@@ -0,0 +1,80 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
language: en
|
| 4 |
+
tags:
|
| 5 |
+
- audio-classification
|
| 6 |
+
- carnatic-music
|
| 7 |
+
- raga-classification
|
| 8 |
+
- indian-classical-music
|
| 9 |
+
datasets:
|
| 10 |
+
- sarayusapa/carnatic-ragas
|
| 11 |
+
metrics:
|
| 12 |
+
- accuracy
|
| 13 |
+
- f1
|
| 14 |
+
pipeline_tag: audio-classification
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
# SAM-Audio: Carnatic Raga Classifier
|
| 18 |
+
|
| 19 |
+
A CNN + Segment Attention model for classifying Carnatic ragas from audio.
|
| 20 |
+
|
| 21 |
+
## Model Details
|
| 22 |
+
|
| 23 |
+
- **Architecture**: SAM-Audio (CNN mel-spectrogram encoder + latent segmentation tokens + masked segment prediction + contrastive learning)
|
| 24 |
+
- **Parameters**: 2.6M
|
| 25 |
+
- **Training data**: [sarayusapa/carnatic-ragas](https://huggingface.co/datasets/sarayusapa/carnatic-ragas) with 3x pitch-shift augmentation
|
| 26 |
+
- **Best validation accuracy**: 99.62%
|
| 27 |
+
- **Best epoch**: 17
|
| 28 |
+
|
| 29 |
+
## Supported Ragas
|
| 30 |
+
|
| 31 |
+
| ID | Raga |
|
| 32 |
+
|----|------|
|
| 33 |
+
| 0 | Amritavarshini |
|
| 34 |
+
| 1 | Hamsanaadam |
|
| 35 |
+
| 2 | Kalyani |
|
| 36 |
+
| 3 | Kharaharapriya |
|
| 37 |
+
| 4 | Mayamalavagoulai |
|
| 38 |
+
| 5 | Sindhubhairavi |
|
| 39 |
+
| 6 | Todi |
|
| 40 |
+
| 7 | Varali |
|
| 41 |
+
|
| 42 |
+
## Usage
|
| 43 |
+
|
| 44 |
+
```python
|
| 45 |
+
import json
import torch
|
| 46 |
+
import librosa
|
| 47 |
+
from safetensors.torch import load_file
|
| 48 |
+
|
| 49 |
+
# Load model
|
| 50 |
+
from train import SAMAudioModel
|
| 51 |
+
|
| 52 |
+
with open("config.json") as f:
    config = json.load(f)
|
| 53 |
+
model = SAMAudioModel(
|
| 54 |
+
encoder_config=config["encoder"],
|
| 55 |
+
num_classes=config["num_classes"],
|
| 56 |
+
num_segments=config["num_segments"],
|
| 57 |
+
)
|
| 58 |
+
state_dict = load_file("model.safetensors")
|
| 59 |
+
model.load_state_dict(state_dict)
|
| 60 |
+
model.eval()
|
| 61 |
+
|
| 62 |
+
# Load audio
|
| 63 |
+
y, sr = librosa.load("audio.mp3", sr=16000, mono=True)
|
| 64 |
+
waveform = torch.from_numpy(y[:320000]).float().unsqueeze(0)
|
| 65 |
+
|
| 66 |
+
# Predict
|
| 67 |
+
with torch.no_grad():
|
| 68 |
+
outputs = model(input_audio=waveform)
|
| 69 |
+
probs = torch.softmax(outputs["raga_logits"], dim=-1)
|
| 70 |
+
pred = probs.argmax(dim=-1).item()
|
| 71 |
+
print(f"Predicted: {config['id2label'][str(pred)]} ({probs[0][pred]:.1%})")
|
| 72 |
+
```
|
| 73 |
+
|
| 74 |
+
## Training
|
| 75 |
+
|
| 76 |
+
- 3x pitch-shift augmentation (original + random up [1-4 semitones] + random down [1-4 semitones])
|
| 77 |
+
- Tanpura reference pitch shifts with audio, forcing the model to learn relative intervals
|
| 78 |
+
- BFloat16 mixed precision on RTX 4090
|
| 79 |
+
- Cosine annealing LR with warmup
|
| 80 |
+
- Early stopping with patience=5
|
config.json
ADDED
|
@@ -0,0 +1,49 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"model_type": "sam-audio",
|
| 3 |
+
"architectures": [
|
| 4 |
+
"SAMAudioModel"
|
| 5 |
+
],
|
| 6 |
+
"num_classes": 8,
|
| 7 |
+
"id2label": {
|
| 8 |
+
"0": "Amritavarshini",
|
| 9 |
+
"1": "Hamsanaadam",
|
| 10 |
+
"2": "Kalyani",
|
| 11 |
+
"3": "Kharaharapriya",
|
| 12 |
+
"4": "Mayamalavagoulai",
|
| 13 |
+
"5": "Sindhubhairavi",
|
| 14 |
+
"6": "Todi",
|
| 15 |
+
"7": "Varali"
|
| 16 |
+
},
|
| 17 |
+
"label2id": {
|
| 18 |
+
"Amritavarshini": 0,
|
| 19 |
+
"Hamsanaadam": 1,
|
| 20 |
+
"Kalyani": 2,
|
| 21 |
+
"Kharaharapriya": 3,
|
| 22 |
+
"Mayamalavagoulai": 4,
|
| 23 |
+
"Sindhubhairavi": 5,
|
| 24 |
+
"Todi": 6,
|
| 25 |
+
"Varali": 7
|
| 26 |
+
},
|
| 27 |
+
"encoder": {
|
| 28 |
+
"input_dim": 1,
|
| 29 |
+
"hidden_dims": [
|
| 30 |
+
64,
|
| 31 |
+
128,
|
| 32 |
+
256,
|
| 33 |
+
512
|
| 34 |
+
],
|
| 35 |
+
"kernel_size": 3,
|
| 36 |
+
"stride": 2,
|
| 37 |
+
"dropout_rate": 0.25,
|
| 38 |
+
"use_layer_norm": true,
|
| 39 |
+
"use_mel": true,
|
| 40 |
+
"n_mels": 80,
|
| 41 |
+
"sample_rate": 16000
|
| 42 |
+
},
|
| 43 |
+
"num_segments": 64,
|
| 44 |
+
"mask_ratio": 0.0,
|
| 45 |
+
"contrastive_temperature": 0.07,
|
| 46 |
+
"hidden_size": 512,
|
| 47 |
+
"best_val_accuracy": 0.9961977186311787,
|
| 48 |
+
"best_epoch": 17
|
| 49 |
+
}
|
model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:c3b3928ed89212e8f272945fffa9d77e8977f7d9d667ab06f145693bdacc05af
|
| 3 |
+
size 10597768
|
preprocessor_config.json
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"processor_type": "AudioPreprocessor",
|
| 3 |
+
"sample_rate": 16000,
|
| 4 |
+
"max_length": 320000,
|
| 5 |
+
"chunk_duration_s": 20,
|
| 6 |
+
"feature_extractor_type": "MelSpectrogram",
|
| 7 |
+
"n_fft": 1024,
|
| 8 |
+
"hop_length": 256,
|
| 9 |
+
"n_mels": 80,
|
| 10 |
+
"normalize": true
|
| 11 |
+
}
|