RobbiePasquale committed on
Commit
155f547
verified
1 Parent(s): 9bfcbb7

Update README.md

Files changed (1)
  1. README.md +84 -8
README.md CHANGED
@@ -224,14 +224,90 @@ The model is trained with the following components and techniques:
  - **Optimization**: Training uses an **AdamW** optimizer with **CosineAnnealingLR** scheduler for learning rate adjustments. The **Gradient Scaler** helps prevent overflow when training with mixed precision.
  - **Gradient Accumulation**: Since the model can be computationally heavy, gradients are accumulated over several steps to reduce memory usage.
  - **Loss Functions**: The training process leverages a comprehensive set of custom loss functions:
- - **InfoNCE Loss**: A contrastive loss to encourage representation similarity between related pairs.
- - **Covariance Regularization**: Encourages diverse state representations by minimizing co-linearity in embeddings.
- - **Dynamics Performance Loss**: Combines MSE and variance losses to penalize incorrect state predictions.
- - **Thought Consistency Loss**: Encourages the model to output consistent states for similar actions.
- - **Policy Value Joint Loss**: A weighted combination of policy and value loss for the PPO agent.
- - **Action Diversity Reward**: Rewards diverse action embeddings to avoid mode collapse.
- - **Exploration Regularization**: Encourages exploration by penalizing high visitation counts.
- - **KL Divergence Loss**: Keeps the policy update close to the previous policy to stabilize training.

  ### Evaluation
  After each epoch, the model is evaluated on the validation set, computing the average loss over the dataset. The evaluation function utilizes the same loss functions as training but does not backpropagate, allowing it to be run in inference mode.
 
  - **Optimization**: Training uses an **AdamW** optimizer with **CosineAnnealingLR** scheduler for learning rate adjustments. The **Gradient Scaler** helps prevent overflow when training with mixed precision.
  - **Gradient Accumulation**: Since the model can be computationally heavy, gradients are accumulated over several steps to reduce memory usage.
  - **Loss Functions**: The training process leverages a comprehensive set of custom loss functions:
+
+ **1. InfoNCE Loss (Information Noise-Contrastive Estimation):**
+ Definition: This loss function is used for contrastive learning, encouraging similar samples to be close in the embedding space while pushing dissimilar samples apart.
+
+ Formula:
+ ```
+ L_InfoNCE = -log[ exp(sim(z_i, z_j) / τ) / Σ_k exp(sim(z_i, z_k) / τ) ]
+ ```
+ where sim() is the cosine similarity, τ is the temperature parameter, z_i and z_j are paired samples, and the sum in the denominator is over all other samples in the batch.
+
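As a concrete illustration, the batched form of this loss can be sketched in PyTorch. This is a minimal sketch, not the repository's implementation; the function name and the temperature default are assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_i, z_j, temperature=0.1):
    """InfoNCE over a batch: row k of z_i and z_j form a positive pair;
    every other row of z_j serves as a negative for z_i[k]."""
    z_i = F.normalize(z_i, dim=-1)           # unit vectors, so dot product = cosine similarity
    z_j = F.normalize(z_j, dim=-1)
    logits = z_i @ z_j.t() / temperature     # sim(z_i, z_k) / τ for all pairs
    targets = torch.arange(z_i.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, targets)  # -log softmax at the positive pair
```

Using cross-entropy over the similarity matrix is the standard way to evaluate the InfoNCE ratio for every sample in the batch at once.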
+ **2. Covariance Regularization:**
+ Definition: This regularization term encourages the learned representations to have uncorrelated dimensions, promoting more diverse and informative embeddings.
+
+ Formula:
+ ```
+ L_cov = λ * (Σ_i,j Cov(i,j)^2 - Σ_i Cov(i,i)^2)
+ ```
+ where Cov is the covariance matrix of the embeddings and λ is a regularization coefficient; subtracting the squared diagonal leaves the sum of squared off-diagonal covariances.
+
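Assuming the penalty targets the off-diagonal covariances as described, one way to compute it in PyTorch (the helper name and default λ are hypothetical):

```python
import torch

def covariance_regularization(z, lam=1.0):
    """Sum of squared off-diagonal entries of the batch covariance of z."""
    z = z - z.mean(dim=0)                        # center each embedding dimension
    cov = (z.t() @ z) / (z.size(0) - 1)          # (d, d) sample covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov)) # zero out the diagonal
    return lam * off_diag.pow(2).sum()
```

Embeddings with independent dimensions incur (near) zero penalty, while correlated dimensions are pushed apart.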
+ **3. Dynamics Performance Loss:**
+ Definition: This loss measures the accuracy of predicted next states while also encouraging diverse predictions.
+
+ Formula:
+ ```
+ L_dynamics = MSE(true_next_state, predicted_next_state) + λ * Var(predicted_next_state)
+ ```
+ where MSE is the mean squared error, Var is the variance, and λ is a weighting factor.
+
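A direct reading of the formula in PyTorch (a sketch only; the function name, the value of λ, and how the variance term enters in the actual code are assumptions):

```python
import torch
import torch.nn.functional as F

def dynamics_performance_loss(true_next_state, predicted_next_state, lam=0.1):
    """MSE between predicted and true next states plus a λ-weighted variance term."""
    mse = F.mse_loss(predicted_next_state, true_next_state)
    var = predicted_next_state.var(dim=0).mean()  # spread of predictions across the batch
    return mse + lam * var
```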
+ **4. Thought Consistency Loss:**
+ Definition: This loss encourages consistency between true next states and perturbed next states.
+
+ Formula:
+ ```
+ L_consistency = MSE(true_next_state, perturbed_next_state)
+ ```
+
+ **5. Policy Value Joint Loss:**
+ Definition: This loss combines policy and value losses for reinforcement learning tasks.
+
+ Formula:
+ ```
+ L_joint = CrossEntropy(policy_logits, true_policy) + λ * MSE(value_pred, true_value)
+ ```
+ where λ is a weighting factor balancing policy and value losses.
+
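The combination above can be sketched in PyTorch as follows (a minimal sketch; the signature and default λ are assumptions, not the repository's API):

```python
import torch
import torch.nn.functional as F

def policy_value_joint_loss(policy_logits, true_actions, value_pred, true_value, lam=0.5):
    """Cross-entropy policy loss plus a λ-weighted MSE value loss."""
    policy_loss = F.cross_entropy(policy_logits, true_actions)
    value_loss = F.mse_loss(value_pred, true_value)
    return policy_loss + lam * value_loss
```

A single joint objective lets one backward pass update the shared trunk of an actor-critic network.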
+ **6. Action Diversity Reward:**
+ Definition: This reward encourages diversity in action embeddings.
+
+ Formula:
+ ```
+ R_diversity = λ * Σ_i,j (cos_sim(a_i, a_j)^2)
+ ```
+ where cos_sim is the cosine similarity between action embeddings, and λ is a scaling factor.
+
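Note that the sum grows with pairwise similarity, so diversity is promoted by minimizing this term (or negating it as a reward); that sign convention, along with the helper name below, is an assumption in this sketch:

```python
import torch
import torch.nn.functional as F

def action_similarity_penalty(action_embeddings, lam=0.01):
    """Sum of squared pairwise cosine similarities between distinct actions;
    minimizing it pushes action embeddings apart."""
    a = F.normalize(action_embeddings, dim=-1)
    sims = a @ a.t()                     # pairwise cosine similarities
    sims = sims - torch.eye(a.size(0))   # drop self-similarity on the diagonal
    return lam * sims.pow(2).sum()
```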
+ **7. Expected Thought Value Loss:**
+ Definition: This loss aims to maximize the expected value from Monte Carlo Tree Search.
+
+ Formula:
+ ```
+ L_ETV = -mean(mcts_best_values)
+ ```
+
+ **8. Exploration Regularization:**
+ Definition: This regularization encourages exploration by rewarding less-visited actions.
+
+ Formula:
+ ```
+ R_exploration = λ * mean(Σ_a (1 / (visit_count(a) + 1)))
+ ```
+ where λ is a scaling factor.
+
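Reading the mean as an average over actions (an assumption, since the formula nests mean and sum), the bonus can be sketched in plain Python:

```python
def exploration_bonus(visit_counts, lam=0.1):
    """Inverse-visitation bonus: rarely visited actions contribute more."""
    inv = [1.0 / (count + 1) for count in visit_counts]  # +1 avoids division by zero
    return lam * sum(inv) / len(inv)
```

Freshly initialized actions (count 0) contribute the maximum bonus, which decays as they are visited.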
+ **9. KL Divergence Loss:**
+ Definition: This loss measures the difference between old and new policies in policy optimization, keeping each update close to the previous policy.
+
+ Formula:
+ ```
+ L_KL = KL(old_policy || new_policy) = Σ_{i=1..n} old_policy_i * log(old_policy_i / new_policy_i)
+ ```
+ where:
+ - KL is the Kullback-Leibler divergence
+ - old_policy and new_policy are probability distributions over actions
+ - i indexes each possible outcome or action, and n is the total number of possible outcomes or actions
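In PyTorch, this divergence (in the old-policy-first direction, averaged over a batch) can be sketched as follows; the function name is hypothetical:

```python
import torch

def kl_divergence_loss(old_policy, new_policy):
    """KL(old || new) = Σ_i old_i * log(old_i / new_i), averaged over the batch.
    Both inputs are probability distributions along the last dimension."""
    kl = (old_policy * (old_policy.log() - new_policy.log())).sum(dim=-1)
    return kl.mean()
```

The divergence is zero only when the two distributions match, so it acts as a soft trust region on policy updates.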

  ### Evaluation
  After each epoch, the model is evaluated on the validation set, computing the average loss over the dataset. The evaluation function utilizes the same loss functions as training but does not backpropagate, allowing it to be run in inference mode.