RobbiePasquale committed on
Commit 8793340 · verified · 1 Parent(s): 155f547

Update README.md

Files changed (1)
  1. README.md +51 -130
README.md CHANGED
@@ -315,104 +315,6 @@ After each epoch, the model is evaluated on the validation set, computing the av
 ### Checkpoints
 At the end of each epoch, the model saves checkpoints of all components, enabling easy resumption or further fine-tuning as needed.
 
-
-## Language Model Architecture
-
-### Transformer Architecture
-
-The Transformer architecture is foundational to the LightBulb model, facilitating efficient sequence processing through self-attention mechanisms and feedforward networks enhanced by Mixture of Experts (MoE).
-
-#### TransformerBlock
-
-Each `TransformerBlock` consists of the following components:
-
-1. **Self-Attention (`self_attention`)**
-2. **Layer Normalization (`norm1`)**
-3. **Cross-Attention (`cross_attention`)**
-4. **Layer Normalization (`norm2`)**
-5. **Mixture of Experts (`moe`)**
-6. **Layer Normalization (`norm3`)**
-
-**Mathematical Operations:**
-
-1. **Self-Attention:**
-\[
-\text{Attn}_{\text{self}} = \text{SelfAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
-\]
-
-2. **Residual Connection and Layer Norm:**
-\[
-x = \text{LayerNorm}(x + \text{Attn}_{\text{self}})
-\]
-
-3. **Cross-Attention (if applicable):**
-\[
-\text{Attn}_{\text{cross}} = \text{CrossAttention}(Q, K_{\text{enc}}, V_{\text{enc}}) = \text{softmax}\left(\frac{QK_{\text{enc}}^\top}{\sqrt{d_k}}\right)V_{\text{enc}}
-\]
-\[
-x = \text{LayerNorm}(x + \text{Attn}_{\text{cross}})
-\]
-
-4. **Mixture of Experts:**
-\[
-\text{MoE}_{\text{output}} = \sum_{i=1}^k g_i(x) \cdot \text{Expert}_i(x)
-\]
-
-5. **Residual Connection and Layer Norm:**
-\[
-x = \text{LayerNorm}(x + \text{MoE}_{\text{output}})
-\]
-
-**Key Parameters:**
-
-- \( d_{\text{model}} \): Dimensionality of the model embeddings.
-- \( d_k \): Dimensionality of the key vectors in attention.
-- \( \text{num\_heads} \): Number of attention heads.
-- \( \text{num\_experts} \): Number of experts in the MoE layer.
-- \( \text{top\_k} \): Number of top experts to activate in MoE.
-- \( \text{dropout} \): Dropout rate for regularization.
-
-#### Transformer
-
-The `Transformer` class orchestrates multiple `TransformerBlock` instances within encoder and decoder stacks.
-
-**Components:**
-
-1. **Embedding Layer:**
-\[
-E = \text{Embedding}(input\_ids) \times \sqrt{d_{\text{model}}}
-\]
-
-2. **Rotary Positional Encoding (`rotary_positional_encoding`):**
-   - Injects positional information by rotating the embeddings based on token positions.
-
-3. **Encoder and Decoder Layers:**
-   - Multiple `TransformerBlock` instances processing the embedded inputs.
-
-4. **Output Layer:**
-\[
-\text{Output} = \text{Linear}(d_{\text{model}}, \text{output\_dim})(\text{Decoder Output})
-\]
-
-5. **Beam Search with Multi-Token Prediction (`generate_with_beam_search`):**
-   - Generates sequences by predicting multiple tokens at each step, maintaining a beam of top candidates.
-
-**Forward Pass:**
-
-\[
-\begin{align*}
-\text{Encoder:} & \quad X_{\text{enc}} = \text{Embedding}(src) \times \sqrt{d_{\text{model}}} \\
-& \quad X_{\text{enc}} = \text{RotaryPositionalEncoding}(X_{\text{enc}}) \\
-& \quad X_{\text{enc}} = \text{EncoderLayers}(X_{\text{enc}}) \\
-\\
-\text{Decoder:} & \quad X_{\text{dec}} = \text{Embedding}(tgt) \times \sqrt{d_{\text{model}}} \\
-& \quad X_{\text{dec}} = \text{RotaryPositionalEncoding}(X_{\text{dec}}) \\
-& \quad X_{\text{dec}} = \text{DecoderLayers}(X_{\text{dec}}, X_{\text{enc}}) \\
-\\
-\text{Output:} & \quad \text{output} = \text{Linear}(X_{\text{dec}})
-\end{align*}
-\]
-
 ---
 
 ### World Model Components
@@ -425,10 +327,11 @@ The World Model encapsulates components that model state representations, dynami
 Transforms the transformer's output embeddings into a compact state representation suitable for modeling and prediction tasks.
 
 **Mathematical Operation:**
 \[
 \text{State} = \text{LayerNorm}\left(\text{Linear}(d_{\text{model}} \rightarrow d_{\text{state}})\left(\text{Linear}(vocab\_dim \rightarrow d_{\text{model}})(\text{Transformer Output})\right)\right)
 \]
-
 **Explanation:**
 Sequential linear transformations project high-dimensional embeddings into a lower-dimensional state space, followed by layer normalization for stability.
 
@@ -438,9 +341,11 @@ Sequential linear transformations project high-dimensional embeddings into a low
 Models how the state evolves in response to actions (thoughts) taken by the model.
 
 **Mathematical Operation:**
 \[
 \text{Next State} = \text{DynamicsNetwork}(\text{Current State}, \text{Action Embedding})
 \]
 
 **Explanation:**
 Predicts the subsequent state by combining the current state representation with an encoded action, effectively simulating the consequences of actions within the Tree of Thought.
@@ -451,10 +356,11 @@ Predicts the subsequent state by combining the current state representation with
 Predicts policy logits (action probabilities) and value estimates (state evaluations) based on the current state.
 
 **Mathematical Operation:**
 \[
 (\text{Policy Logits}, \text{Value Estimate}) = \text{PredictionNetwork}(\text{State})
 \]
-
 **Explanation:**
 - **Policy Logits:** Used to derive action probabilities via softmax.
 - **Value Estimate:** Represents the expected reward or quality of the current state.
@@ -465,10 +371,11 @@ Predicts policy logits (action probabilities) and value estimates (state evaluat
 Encodes discrete actions (thoughts) into continuous embeddings compatible with the DynamicsNetwork.
 
 **Mathematical Operation:**
 \[
 \text{Action Embedding} = \text{ActionEncoder}(\text{Action Index})
 \]
-
 **Explanation:**
 Converts action indices into dense vector representations, facilitating their integration into state transition modeling.
 
@@ -492,10 +399,11 @@ Represents a node in the Tree of Thought, corresponding to a specific action or
 **Mathematical Representation:**
 
 Each `ThoughtNode` can be represented as a tree node in a directed graph:
 \[
 \text{ThoughtNode} = (\text{name}, \{\text{children}\})
 \]
-
 #### State
 
 **Function:**
@@ -509,39 +417,47 @@ Represents the current state within the MCTS and Tree of Thought framework.
 - `thought_node`: Reference to the current `ThoughtNode` in the Tree of Thought.
 
 **Action Application (`apply_action`):**
-
 \[
 \text{Next State} = \text{DynamicsNetwork}(\text{Current State}, \text{Action Embedding})
 \]
 \[
 \text{New Representation} = \text{Concat}(\text{Current Representation}, \text{Next State} \rightarrow \text{unsqueeze}(1))
 \]
-
 **Procedure:**
 
 1. **Action Encoding:**
 \[
 \text{Action Index} = \text{Index of Action}
 \]
 \[
 \text{Action Embedding} = \text{ActionEncoder}(\text{Action Index})
 \]
-
 2. **State Extraction:**
 \[
 \text{Current State} = \text{representation}[:, -1, :]
 \]
-
 3. **State Transition:**
 \[
 \text{Next State Representation} = \text{DynamicsNetwork}(\text{Current State}, \text{Action Embedding})
 \]
-
 4. **Representation Update:**
 \[
 \text{New Representation} = \text{Concat}(\text{representation}, \text{Next State Representation} \times \text{unsqueeze}(1))
 \]
-
 5. **Thought Node Update:**
   - Navigate to the child `ThoughtNode` corresponding to the applied action.
 
@@ -571,10 +487,11 @@ Represents a node in the MCTS search tree, encapsulating a specific state in the
 **Mathematical Representation:**
 
 Each `MCTSNode` can be considered as:
 \[
 \text{MCTSNode} = (\text{state}, \text{parent}, \text{action}, \{\text{children}\}, \text{visit\_count}, \text{value\_sum}, \text{prior}, \text{entropy}, \text{variance})
 \]
-
 #### MCTS Algorithm
 
 The `MCTS` class implements the Monte Carlo Tree Search algorithm tailored to the LightBulb model's architecture.
@@ -604,9 +521,11 @@ The `MCTS` class implements the Monte Carlo Tree Search algorithm tailored to th
   - Add the candidate sequence to `all_candidates`.
 - **Beam Pruning:**
   - Sort all candidates based on a combined score:
 \[
 \text{Combined Score} = \text{Score} - 0.1 \times \text{Entropy} + 0.05 \times \text{Variance}
 \]
   - Retain the top `beam_size` candidates for the next iteration.
 4. **Result Extraction:**
   - After completing iterations, select the best action sequence from the final beam.
@@ -620,6 +539,7 @@ The `MCTS` class implements the Monte Carlo Tree Search algorithm tailored to th
 - Calculate entropy and variance of the policy distribution.
 - Expand the node by creating child nodes based on the Tree of Thought and assign priors from policy probabilities.
 - **Mathematical Operations:**
 \[
 (\text{Policy Logits}, \text{Value Estimate}) = \text{PredictionNetwork}(\text{State})
 \]
@@ -632,19 +552,22 @@ The `MCTS` class implements the Monte Carlo Tree Search algorithm tailored to th
 \[
 \text{Variance} = \text{Var}(P)
 \]
-
 4. **Backpropagation (`backpropagate`):**
   - **Function:** Updates the `visit_count` and `value_sum` for nodes along the path from the evaluated node back to the root.
   - **Procedure:**
 \[
 \text{For each node in the path:} \\
 \quad \text{node.visit\_count} \mathrel{+}= 1 \\
 \quad \text{node.value\_sum} \mathrel{+}= \text{Value Estimate}
 \]
-
 5. **Upper Confidence Bound (UCB) Score (`ucb_score`):**
   - **Function:** Balances exploration of less-visited nodes and exploitation of high-value nodes.
   - **Mathematical Operation:**
 \[
 \text{UCB Score} = \text{Average Value} + \text{Exploration Term} + \text{Entropy Term} + \text{Variance Term}
 \]
@@ -661,7 +584,7 @@ The `MCTS` class implements the Monte Carlo Tree Search algorithm tailored to th
 \[
 \text{Variance Term} = 0.05 \times \text{variance}
 \]
-
 6. **Best Action Sequence Extraction (`best_action_sequence`):**
   - **Function:** Extracts the most promising action sequence from the MCTS tree after all iterations.
   - **Procedure:**
@@ -671,21 +594,6 @@ The `MCTS` class implements the Monte Carlo Tree Search algorithm tailored to th
 
 ---
 
-
-### Mixture of Experts (MoE)
-
-\[
-\text{MoE}(x) = \sum_{i=1}^k g_i(x) \cdot \text{Expert}_i(x)
-\]
-Where:
-- \( g_i(x) \): Gating weights ensuring sparsity (only top-k experts are active).
-- \( \text{Expert}_i(x) \): Outputs from the expert networks.
-- \( k \): Number of top experts to activate.
-
-**Explanation:**
-- For each input, only the top-k experts (based on gating scores) process the data.
-- Reduces computational load while maintaining high capacity.
-
 ### Beam Search with Multi-Token Prediction
 
 **Purpose:** Efficiently explores multiple possible token sequences to generate coherent and diverse outputs by predicting multiple tokens at each step.
@@ -693,15 +601,26 @@ Where:
 **Procedure:**
 
 1. **Beam Initialization:**
 - Start with a beam containing the start-of-sequence (BOS) token.
 \[
 \text{beam} = \left\{ \left( \text{seq} = [\text{BOS}], \text{score} = 0, \text{cum\_entropy} = 0, \text{cum\_variance} = 0 \right) \right\}
 \]
-
 2. **Iterative Expansion:**
-   - For each iteration up to \( \frac{\text{max\_length}}{n\_tokens\_predict} \):
   - For each sequence in the beam:
-     - Predict the next \( n\_tokens\_predict \) tokens.
     - Calculate their probabilities.
     - Select top-k token sequences based on cumulative scores.
 
@@ -712,6 +631,7 @@ Where:
 - Continue until the maximum length is reached or all sequences end with the end-of-sequence (EOS) token.
 
 **Mathematical Operations:**
 \[
 \text{Score} = \sum_{t=1}^{n} \log P(\text{token}_t | \text{tokens}_{<t})
 \]
@@ -721,6 +641,7 @@ Where:
 \[
 \text{Variance} = \text{Var}(P)
 \]
 Where \( P \) is the probability distribution over the vocabulary.
 
 ### Upper Confidence Bound (UCB) in MCTS
 
315
  ### Checkpoints
316
  At the end of each epoch, the model saves checkpoints of all components, enabling easy resumption or further fine-tuning as needed.
317
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
318
  ---
319
 
320
  ### World Model Components
 
 Transforms the transformer's output embeddings into a compact state representation suitable for modeling and prediction tasks.
 
 **Mathematical Operation:**
+```
 \[
 \text{State} = \text{LayerNorm}\left(\text{Linear}(d_{\text{model}} \rightarrow d_{\text{state}})\left(\text{Linear}(vocab\_dim \rightarrow d_{\text{model}})(\text{Transformer Output})\right)\right)
 \]
+```
 **Explanation:**
 Sequential linear transformations project high-dimensional embeddings into a lower-dimensional state space, followed by layer normalization for stability.
 
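For illustration, the operation above can be sketched in PyTorch. The class name, layer names, and dimensions here are assumptions for the sketch, not taken from the repository:

```python
import torch
import torch.nn as nn

class RepresentationNetwork(nn.Module):
    """Sketch of the formula above: Linear(vocab_dim -> d_model),
    then Linear(d_model -> d_state), then LayerNorm."""
    def __init__(self, vocab_dim: int, d_model: int, d_state: int):
        super().__init__()
        self.to_model = nn.Linear(vocab_dim, d_model)
        self.to_state = nn.Linear(d_model, d_state)
        self.norm = nn.LayerNorm(d_state)

    def forward(self, transformer_output: torch.Tensor) -> torch.Tensor:
        return self.norm(self.to_state(self.to_model(transformer_output)))

net = RepresentationNetwork(vocab_dim=100, d_model=32, d_state=16)
# (batch, seq, vocab_dim) -> (batch, seq, d_state)
state = net(torch.randn(2, 10, 100))
```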
 Models how the state evolves in response to actions (thoughts) taken by the model.
 
 **Mathematical Operation:**
+```
 \[
 \text{Next State} = \text{DynamicsNetwork}(\text{Current State}, \text{Action Embedding})
 \]
+```
 
 **Explanation:**
 Predicts the subsequent state by combining the current state representation with an encoded action, effectively simulating the consequences of actions within the Tree of Thought.
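A minimal PyTorch sketch of this transition, assuming the network consumes the concatenated state and action embedding (the hidden size and layer layout are assumptions):

```python
import torch
import torch.nn as nn

class DynamicsNetwork(nn.Module):
    """Sketch: predict the next state from the current state plus
    an action embedding, as in the formula above."""
    def __init__(self, d_state: int, d_action: int, d_hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_state + d_action, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_state),
            nn.LayerNorm(d_state),
        )

    def forward(self, state: torch.Tensor, action_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action_embedding], dim=-1))

dynamics = DynamicsNetwork(d_state=16, d_action=8)
next_state = dynamics(torch.randn(4, 16), torch.randn(4, 8))
```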
 Predicts policy logits (action probabilities) and value estimates (state evaluations) based on the current state.
 
 **Mathematical Operation:**
+```
 \[
 (\text{Policy Logits}, \text{Value Estimate}) = \text{PredictionNetwork}(\text{State})
 \]
+```
 **Explanation:**
 - **Policy Logits:** Used to derive action probabilities via softmax.
 - **Value Estimate:** Represents the expected reward or quality of the current state.
 
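A hedged sketch of the two-headed network implied above; trunk width and head shapes are assumptions:

```python
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """Sketch: a shared trunk feeding a policy head (logits over
    actions) and a value head (scalar state evaluation)."""
    def __init__(self, d_state: int, num_actions: int, d_hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_state, d_hidden), nn.ReLU())
        self.policy_head = nn.Linear(d_hidden, num_actions)
        self.value_head = nn.Linear(d_hidden, 1)

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        return self.policy_head(h), self.value_head(h)

net = PredictionNetwork(d_state=16, num_actions=5)
policy_logits, value = net(torch.randn(4, 16))
# Action probabilities via softmax, as noted in the explanation
probs = torch.softmax(policy_logits, dim=-1)
```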
 Encodes discrete actions (thoughts) into continuous embeddings compatible with the DynamicsNetwork.
 
 **Mathematical Operation:**
+```
 \[
 \text{Action Embedding} = \text{ActionEncoder}(\text{Action Index})
 \]
+```
 **Explanation:**
 Converts action indices into dense vector representations, facilitating their integration into state transition modeling.
 
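This is essentially an embedding lookup; a minimal sketch (sizes and names assumed):

```python
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Sketch: an embedding table mapping discrete action indices
    to dense vectors the DynamicsNetwork can consume."""
    def __init__(self, num_actions: int, d_action: int):
        super().__init__()
        self.embedding = nn.Embedding(num_actions, d_action)

    def forward(self, action_index: torch.Tensor) -> torch.Tensor:
        return self.embedding(action_index)

encoder = ActionEncoder(num_actions=10, d_action=8)
action_embedding = encoder(torch.tensor([3]))
```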
 **Mathematical Representation:**
 
 Each `ThoughtNode` can be represented as a tree node in a directed graph:
+```
 \[
 \text{ThoughtNode} = (\text{name}, \{\text{children}\})
 \]
+```
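The (name, {children}) structure above can be sketched as a small Python class; the `add_child` helper is illustrative:

```python
class ThoughtNode:
    """Minimal sketch of the (name, {children}) tuple above."""
    def __init__(self, name: str, children=None):
        self.name = name
        self.children = list(children) if children else []

    def add_child(self, child: "ThoughtNode") -> "ThoughtNode":
        self.children.append(child)
        return child

# Build a tiny tree: root -> plan -> refine
root = ThoughtNode("root")
plan = root.add_child(ThoughtNode("plan"))
plan.add_child(ThoughtNode("refine"))
```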
 #### State
 
 **Function:**
 
 - `thought_node`: Reference to the current `ThoughtNode` in the Tree of Thought.
 
 **Action Application (`apply_action`):**
+```
 \[
 \text{Next State} = \text{DynamicsNetwork}(\text{Current State}, \text{Action Embedding})
 \]
 \[
 \text{New Representation} = \text{Concat}(\text{Current Representation}, \text{Next State} \rightarrow \text{unsqueeze}(1))
 \]
+```
 **Procedure:**
 
 1. **Action Encoding:**
+
+```
 \[
 \text{Action Index} = \text{Index of Action}
 \]
 \[
 \text{Action Embedding} = \text{ActionEncoder}(\text{Action Index})
 \]
+```
 2. **State Extraction:**
+
+```
 \[
 \text{Current State} = \text{representation}[:, -1, :]
 \]
+```
 3. **State Transition:**
+
+```
 \[
 \text{Next State Representation} = \text{DynamicsNetwork}(\text{Current State}, \text{Action Embedding})
 \]
+```
 4. **Representation Update:**
+
+```
 \[
 \text{New Representation} = \text{Concat}(\text{representation}, \text{Next State Representation} \times \text{unsqueeze}(1))
 \]
+```
 5. **Thought Node Update:**
   - Navigate to the child `ThoughtNode` corresponding to the applied action.
 
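The five steps above can be traced end to end with small stand-ins; the `Node` class, the plain `nn.Linear` in place of the real DynamicsNetwork, and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Node:  # minimal Tree of Thought node stand-in
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

actions = ["expand", "refine"]
d_state, d_action = 16, 8
action_encoder = nn.Embedding(len(actions), d_action)  # stand-in ActionEncoder
dynamics = nn.Linear(d_state + d_action, d_state)      # stand-in DynamicsNetwork

representation = torch.randn(1, 3, d_state)            # (batch, steps, d_state)
thought_node = Node("root", [Node("expand"), Node("refine")])

# 1. Action encoding
action_index = torch.tensor([actions.index("expand")])
action_embedding = action_encoder(action_index)
# 2. State extraction: take the last position of the running representation
current_state = representation[:, -1, :]
# 3. State transition
next_state = dynamics(torch.cat([current_state, action_embedding], dim=-1))
# 4. Representation update: append the new state along the step dimension
representation = torch.cat([representation, next_state.unsqueeze(1)], dim=1)
# 5. Thought node update: descend to the child matching the applied action
thought_node = next(c for c in thought_node.children if c.name == "expand")
```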
 **Mathematical Representation:**
 
 Each `MCTSNode` can be considered as:
+```
 \[
 \text{MCTSNode} = (\text{state}, \text{parent}, \text{action}, \{\text{children}\}, \text{visit\_count}, \text{value\_sum}, \text{prior}, \text{entropy}, \text{variance})
 \]
+```
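The tuple above maps naturally onto a Python dataclass; the field types and the `value()` helper are assumptions for the sketch:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MCTSNode:
    """Sketch of the MCTSNode tuple above."""
    state: object
    parent: Optional["MCTSNode"] = None
    action: Optional[str] = None
    children: dict = field(default_factory=dict)
    visit_count: int = 0
    value_sum: float = 0.0
    prior: float = 0.0
    entropy: float = 0.0
    variance: float = 0.0

    def value(self) -> float:
        """Average value; 0 for unvisited nodes."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0

root = MCTSNode(state="s0")
root.children["expand"] = MCTSNode(state="s1", parent=root,
                                   action="expand", prior=0.5)
```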
 #### MCTS Algorithm
 
 The `MCTS` class implements the Monte Carlo Tree Search algorithm tailored to the LightBulb model's architecture.
 
   - Add the candidate sequence to `all_candidates`.
 - **Beam Pruning:**
   - Sort all candidates based on a combined score:
+```
 \[
 \text{Combined Score} = \text{Score} - 0.1 \times \text{Entropy} + 0.05 \times \text{Variance}
 \]
+```
   - Retain the top `beam_size` candidates for the next iteration.
 4. **Result Extraction:**
   - After completing iterations, select the best action sequence from the final beam.
 
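The pruning step above can be sketched directly from the formula; the candidate dictionary fields mirror the beam tuple and are assumptions:

```python
def combined_score(candidate: dict) -> float:
    """Score from the formula above: log-prob score,
    entropy penalty (0.1), variance bonus (0.05)."""
    return (candidate["score"]
            - 0.1 * candidate["entropy"]
            + 0.05 * candidate["variance"])

all_candidates = [
    {"seq": ["a"], "score": -1.2, "entropy": 2.0, "variance": 0.4},
    {"seq": ["b"], "score": -1.3, "entropy": 0.5, "variance": 0.1},
]
beam_size = 1
# Sort by combined score and retain the top beam_size candidates
beam = sorted(all_candidates, key=combined_score, reverse=True)[:beam_size]
```

Here the lower-entropy candidate `"b"` wins despite its lower raw score, which is the point of the entropy penalty.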
 - Calculate entropy and variance of the policy distribution.
 - Expand the node by creating child nodes based on the Tree of Thought and assign priors from policy probabilities.
 - **Mathematical Operations:**
+```
 \[
 (\text{Policy Logits}, \text{Value Estimate}) = \text{PredictionNetwork}(\text{State})
 \]
 
 \[
 \text{Variance} = \text{Var}(P)
 \]
+```
 4. **Backpropagation (`backpropagate`):**
   - **Function:** Updates the `visit_count` and `value_sum` for nodes along the path from the evaluated node back to the root.
   - **Procedure:**
+```
 \[
 \text{For each node in the path:} \\
 \quad \text{node.visit\_count} \mathrel{+}= 1 \\
 \quad \text{node.value\_sum} \mathrel{+}= \text{Value Estimate}
 \]
+```
 5. **Upper Confidence Bound (UCB) Score (`ucb_score`):**
   - **Function:** Balances exploration of less-visited nodes and exploitation of high-value nodes.
   - **Mathematical Operation:**
+
+```
 \[
 \text{UCB Score} = \text{Average Value} + \text{Exploration Term} + \text{Entropy Term} + \text{Variance Term}
 \]
 
 \[
 \text{Variance Term} = 0.05 \times \text{variance}
 \]
+```
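A sketch of this score. The 0.05 variance weight is from the text; the exploration term's exact form and the 0.1 entropy weight are not shown here and are assumptions (the exploration term below uses a common prior-weighted visit-count form):

```python
import math

def ucb_score(parent_visit_count: int, child: dict, c_puct: float = 1.25,
              entropy_weight: float = 0.1, variance_weight: float = 0.05) -> float:
    """UCB = average value + exploration + entropy term + variance term."""
    average_value = (child["value_sum"] / child["visit_count"]
                     if child["visit_count"] else 0.0)
    # Assumed prior-weighted exploration term (not specified in the text)
    exploration = (c_puct * child["prior"]
                   * math.sqrt(parent_visit_count) / (1 + child["visit_count"]))
    return (average_value + exploration
            + entropy_weight * child["entropy"]
            + variance_weight * child["variance"])

child = {"value_sum": 3.0, "visit_count": 4, "prior": 0.5,
         "entropy": 0.2, "variance": 0.1}
score = ucb_score(parent_visit_count=16, child=child)
```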
 6. **Best Action Sequence Extraction (`best_action_sequence`):**
   - **Function:** Extracts the most promising action sequence from the MCTS tree after all iterations.
   - **Procedure:**
 
 ---
 
 ### Beam Search with Multi-Token Prediction
 
 **Purpose:** Efficiently explores multiple possible token sequences to generate coherent and diverse outputs by predicting multiple tokens at each step.
 
 **Procedure:**
 
 1. **Beam Initialization:**
+
+```
 - Start with a beam containing the start-of-sequence (BOS) token.
 \[
 \text{beam} = \left\{ \left( \text{seq} = [\text{BOS}], \text{score} = 0, \text{cum\_entropy} = 0, \text{cum\_variance} = 0 \right) \right\}
 \]
+```
 2. **Iterative Expansion:**
+   - For each iteration up to
+```
+\( \frac{\text{max\_length}}{n\_tokens\_predict} \)
+```
+:
   - For each sequence in the beam:
+     - Predict the next:
+```
+\( n\_tokens\_predict \)
+
+```
+tokens.
     - Calculate their probabilities.
     - Select top-k token sequences based on cumulative scores.
 
 - Continue until the maximum length is reached or all sequences end with the end-of-sequence (EOS) token.
 
 **Mathematical Operations:**
+```
 \[
 \text{Score} = \sum_{t=1}^{n} \log P(\text{token}_t | \text{tokens}_{<t})
 \]
 
 \[
 \text{Variance} = \text{Var}(P)
 \]
+```
 Where \( P \) is the probability distribution over the vocabulary.
 
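The per-step quantities above can be computed as follows for a toy vocabulary distribution; the function name and the greedy token choice are illustrative:

```python
import math

def step_stats(probs, chosen_index):
    """For a distribution P over the vocabulary: the chosen token's
    log-probability, the entropy of P, and the variance of P."""
    log_p = math.log(probs[chosen_index])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    mean = sum(probs) / len(probs)
    variance = sum((p - mean) ** 2 for p in probs) / len(probs)
    return log_p, entropy, variance

log_p, entropy, variance = step_stats([0.7, 0.2, 0.1], chosen_index=0)
# A sequence's Score is the sum of log_p over its predicted tokens.
```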
 ### Upper Confidence Bound (UCB) in MCTS