liberal committed
Commit c4c48e3 · verified · 1 parent: 06dbf3b

Update README.md

Files changed (1)
  1. README.md +6 -0
README.md CHANGED
@@ -334,3 +334,9 @@ This approach enables self-corrective, explainable, and meta-aware learning, pus
  <p align="center">
  <img src="https://huggingface.co/liberalusa/liberalmind_bin/resolve/main/lora_training_diagramab.png" width="600"/>
  </p>
+
+ We use a reinforcement learning method based on a GMPo reasoning loop (Generate–Match–Plan–Optimize), in which each step structures the model's decision process. A separate Critic module evaluates the output, providing a scalar reward along with an analysis of reasoning quality, KL divergence, and a novel intuition metric that measures how close the model's confidence was to actual correctness. Only the LoRA adapters are updated, using KL-regularized policy optimization to ensure stable learning. The same setup is applied to long, 1000-line prompt traces, where the model learns to reflect on structured hints and task sequences during training.
+
+ <p align="center">
+ <img src="https://huggingface.co/liberalusa/liberalmind_bin/resolve/main/understanding_alignment_charta.png" width="600"/>
+ </p>
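The shaped reward described in the added paragraph can be sketched as follows. This is a minimal illustrative sketch, not the repository's actual implementation: the function names, the `beta`/`gamma` weights, and the exact form of the intuition bonus are assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def intuition_metric(confidence, correct):
    """1.0 when the model's confidence exactly matched actual correctness
    (hypothetical formula; the README does not specify the exact definition)."""
    return 1.0 - abs(confidence - (1.0 if correct else 0.0))

def shaped_reward(critic_score, policy_dist, ref_dist, confidence, correct,
                  beta=0.1, gamma=0.5):
    """Scalar reward in the spirit of the described setup: the Critic's score,
    KL-penalized against a frozen reference policy, plus an intuition bonus.
    beta and gamma are illustrative weights, not values from the repository."""
    kl = kl_divergence(policy_dist, ref_dist)
    return critic_score - beta * kl + gamma * intuition_metric(confidence, correct)

# Example: a confident, correct answer whose policy drifted slightly
# from the reference distribution.
r = shaped_reward(critic_score=0.8,
                  policy_dist=[0.7, 0.2, 0.1],
                  ref_dist=[0.6, 0.3, 0.1],
                  confidence=0.9, correct=True)
```

In an actual training loop this scalar would drive a policy-gradient update applied only to the LoRA adapter weights, with the KL term keeping the adapted policy close to the frozen base model.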