Atotti/Google-USM

@@ -5,24 +5,29 @@ license: gemma
 # google-usm: Extracted Gemma-3n Audio Encoder (USM)
-## Model Overview (Model Description)
-This model contains **only the audio encoder (`audio_tower`)**, extracted from Google's multimodal model [`google/gemma-3n-e2b-it`](https://huggingface.co/google/gemma-3n-e2b-it).
-The architecture is **Gemma3nAudioEncoder**, based on the paper [Universal Speech Model](https://arxiv.org/abs/2303.01037).
 The encoder takes audio waveform data and converts it into a sequence of high-dimensional features (encodings) that represent its content.
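-## Primary Uses (Intended Use)
 This model does not perform speech recognition (transcription) on its own; it is intended to be used as a component of a larger model.
-* **As the audio input stage of a multimodal model**: Extracts features that supply audio information to a generative AI model.
-* **Audio classification**: Add a classification head on top of this model's output and fine-tune it on tasks that classify specific sounds (e.g., laughter, applause, specific words).
-* **Audio similarity search**: Treat the audio encodings as vectors and search for semantically similar audio.
-* **Speaker recognition**: Use it as a base model for tasks that identify the speaker from audio.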
-## Usage (How to Use)
 Use this model (the audio encoder) together with the original model's `Feature Extractor`.
@@ -74,3 +79,60 @@ print(audio_encodings[0, :5, :10])
 #          -0.0080, -0.0233]], device='cuda:0')
 ```

 # google-usm: Extracted Gemma-3n Audio Encoder (USM)
+> [!Note]
+> The exact nature of this model is unclear. [Introducing Gemma 3n: The developer guide](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#:~:text=Gemma%203n%20uses%20an%20advanced%20audio%20encoder%20based%20on%20the%20Universal%20Speech%20Model%20(USM).) states that Gemma 3n uses an encoder based on USM, but this model differs from the USM paper in several respects.
+> This model has 0.6B parameters, but its number of layers and hidden size differ from those of the 0.6B model in the USM paper.
+> This model is the AudioEncoder of Gemma 3n and may differ from the original USM.
+## Model Description
+This model contains only the audio encoder (`audio_tower`), extracted from Google's multimodal model [google/gemma-3n-e2b-it](https://huggingface.co/google/gemma-3n-e2b-it).
+The architecture is Gemma3nAudioEncoder, based on the paper [Universal Speech Model](https://arxiv.org/abs/2303.01037).
 The encoder takes audio waveform data and converts it into a sequence of high-dimensional features (encodings) that represent its content.
+## Intended Use
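 This model does not perform speech recognition (transcription) on its own; it is intended to be used as a component of a larger model.
+* As the audio input stage of a multimodal model: Extracts features that supply audio information to a generative AI model.
+* Audio classification: Add a classification head on top of this model's output and fine-tune it on tasks that classify specific sounds (e.g., laughter, applause, specific words).
+* Audio similarity search: Treat the audio encodings as vectors and search for semantically similar audio.
+* Speaker recognition: Use it as a base model for tasks that identify the speaker from audio.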
+## How to Use
 Use this model (the audio encoder) together with the original model's `Feature Extractor`.
 #          -0.0080, -0.0233]], device='cuda:0')
 ```
+## Model Architecture
+```
+Gemma3nAudioEncoder(
+  (subsample_conv_projection): Gemma3nAudioSubSampleConvProjection(
+    (conv_0): Gemma3nAudioSSCPConvBlock(
+      (conv): Conv2d(1, 128, kernel_size=(3, 3), stride=(2, 2), bias=False)
+      (norm): Gemma3nAudioCumulativeGroupNorm()
+      (activation): ReLU()
+    )
+    (conv_1): Gemma3nAudioSSCPConvBlock(
+      (conv): Conv2d(128, 32, kernel_size=(3, 3), stride=(2, 2), bias=False)
+      (norm): Gemma3nAudioCumulativeGroupNorm()
+      (activation): ReLU()
+    )
+    (input_proj_linear): Linear(in_features=1024, out_features=1536, bias=False)
+  )
+  (conformer): ModuleList(
+    (0-11): 12 x Gemma3nAudioConformerBlock(
+      (ffw_layer_start): Gemma3nAudioConformerFeedForward(
+        (pre_layer_norm): Gemma3nRMSNorm((1536,), eps=1e-06)
+        (ffw_layer_1): Linear(in_features=1536, out_features=6144, bias=False)
+        (ffw_layer_2): Linear(in_features=6144, out_features=1536, bias=False)
+        (post_layer_norm): Gemma3nRMSNorm((1536,), eps=1e-06)
+      )
+      (attention): Gemma3nAudioConformerAttention(
+        (pre_attn_norm): Gemma3nRMSNorm((1536,), eps=1e-06)
+        (attn): Gemma3nAudioAttention(
+          (relative_position_embedding): Gemma3nAudioRelativePositionEmbedding(
+            (pos_proj): Linear(in_features=1536, out_features=1536, bias=False)
+          )
+          (q_proj): Linear(in_features=1536, out_features=1536, bias=False)
+          (k_proj): Linear(in_features=1536, out_features=1536, bias=False)
+          (v_proj): Linear(in_features=1536, out_features=1536, bias=False)
+        )
+        (post): Linear(in_features=1536, out_features=1536, bias=False)
+        (post_norm): Gemma3nRMSNorm((1536,), eps=1e-06)
+      )
+      (lconv1d): Gemma3nAudioConformerLightConv1d(
+        (pre_layer_norm): Gemma3nRMSNorm((1536,), eps=1e-06)
+        (linear_start): Linear(in_features=1536, out_features=3072, bias=False)
+        (depthwise_conv1d): Conv1d(1536, 1536, kernel_size=(5,), stride=(1,), groups=1536, bias=False)
+        (conv_norm): Gemma3nRMSNorm((1536,), eps=1e-06)
+        (linear_end): Linear(in_features=1536, out_features=1536, bias=False)
+      )
+      (ffw_layer_end): Gemma3nAudioConformerFeedForward(
+        (pre_layer_norm): Gemma3nRMSNorm((1536,), eps=1e-06)
+        (ffw_layer_1): Linear(in_features=1536, out_features=6144, bias=False)
+        (ffw_layer_2): Linear(in_features=6144, out_features=1536, bias=False)
+        (post_layer_norm): Gemma3nRMSNorm((1536,), eps=1e-06)
+      )
+      (norm): Gemma3nRMSNorm((1536,), eps=1e-06)
+    )
+  )
+)
+```
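
As a sanity check on the 0.6B figure in the note, the parameter count can be estimated directly from the module printout above. The following is a minimal sketch, not part of the model card or the commit. It assumes that every `Linear` and `Conv` layer is bias-free as printed, that each `Gemma3nRMSNorm` holds one weight per channel, and that any parameters the printout does not show (for example inside `Gemma3nAudioCumulativeGroupNorm` or the attention module) can be ignored. The helper names `linear`, `conv`, and `count_parameters` are hypothetical.

```python
# Rough parameter count derived only from the shapes in the module printout.
# Assumptions: bias-free layers as printed, one weight per channel in each
# RMSNorm, and parameters not visible in the printout are ignored.

HIDDEN = 1536      # conformer width, from Linear(in_features=1536, ...)
FFW = 6144         # feed-forward inner width
NUM_BLOCKS = 12    # "(0-11): 12 x Gemma3nAudioConformerBlock"


def linear(in_features: int, out_features: int) -> int:
    """Weights of a bias-free Linear layer."""
    return in_features * out_features


def conv(in_ch: int, out_ch: int, kernel_elems: int, groups: int = 1) -> int:
    """Weights of a bias-free convolution with the given kernel size."""
    return (in_ch // groups) * out_ch * kernel_elems


def count_parameters() -> dict:
    # Sub-sampling front end: two 3x3 Conv2d layers plus the input projection.
    subsample = conv(1, 128, 9) + conv(128, 32, 9) + linear(1024, HIDDEN)

    # One feed-forward module: two Linear layers and two RMSNorms.
    ffw = linear(HIDDEN, FFW) + linear(FFW, HIDDEN) + 2 * HIDDEN

    # Attention: pos_proj, q_proj, k_proj, v_proj, post, plus two RMSNorms.
    attention = 5 * linear(HIDDEN, HIDDEN) + 2 * HIDDEN

    # Light convolution: linear_start, depthwise Conv1d (kernel 5),
    # linear_end, plus two RMSNorms.
    lconv = (
        linear(HIDDEN, 2 * HIDDEN)
        + conv(HIDDEN, HIDDEN, 5, groups=HIDDEN)
        + linear(HIDDEN, HIDDEN)
        + 2 * HIDDEN
    )

    # A conformer block has two feed-forward modules and a final RMSNorm.
    block = 2 * ffw + attention + lconv + HIDDEN
    total = subsample + NUM_BLOCKS * block
    return {"subsample": subsample, "block": block, "total": total}


if __name__ == "__main__":
    for name, value in count_parameters().items():
        print(f"{name}: {value:,}")
```

Under these assumptions the estimate comes to roughly 0.68B parameters, almost all of it in the 12 conformer blocks, which is in line with the 0.6B figure quoted in the note.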