augustocsc commited on
Commit
5faf2eb
·
verified ·
1 Parent(s): 90d11a7

GPT-2 Base trained on prefix dataset (682K)

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. .gitignore +107 -0
  2. ANALYSIS_REPORT.md +283 -0
  3. EXPERIMENT_PLAN.md +195 -0
  4. README.md +106 -0
  5. classes/__init__.py +0 -0
  6. classes/dataset.py +48 -0
  7. classes/expression.py +403 -0
  8. configs/eval_dataset_download.sh +6 -0
  9. configs/model_config.json +1 -0
  10. configs/peft_config.json +1 -0
  11. configs/training.sh +82 -0
  12. configs/training_args.json +29 -0
  13. configs/training_large.json +65 -0
  14. configs/training_medium.json +65 -0
  15. configs/training_small.json +65 -0
  16. configs/training_v3.json +78 -0
  17. create_structure.sh +171 -0
  18. notebooks/.gitkeep +0 -0
  19. notebooks/01_data_exploration.ipynb +0 -0
  20. notebooks/02_finetuning_avaliation.ipynb +568 -0
  21. notebooks/03_RL.ipynb +338 -0
  22. notebooks/04_merging_model.ipynb +206 -0
  23. out.txt +7 -0
  24. out2.txt +0 -0
  25. requirements.txt +30 -0
  26. scripts/aws/analyze_model.sh +203 -0
  27. scripts/aws/evaluate_models.sh +62 -0
  28. scripts/aws/launch_evaluation_instance.sh +299 -0
  29. scripts/aws/launch_instance.sh +196 -0
  30. scripts/aws/launch_instance_fixed.sh +371 -0
  31. scripts/aws/monitor_evaluation.sh +116 -0
  32. scripts/aws/monitor_training_auto.sh +179 -0
  33. scripts/aws/run_all_training.sh +365 -0
  34. scripts/aws/setup_and_train_exp_a.sh +83 -0
  35. scripts/aws/setup_and_train_exp_b.sh +83 -0
  36. scripts/aws/setup_aws.sh +87 -0
  37. scripts/aws/train_exp_a.sh +57 -0
  38. scripts/aws/train_exp_b.sh +58 -0
  39. scripts/aws/train_fixed_model.sh +144 -0
  40. scripts/aws/train_v3_model.sh +144 -0
  41. scripts/aws/validate_setup.sh +285 -0
  42. scripts/compare_models.py +271 -0
  43. scripts/compare_v1_v2_simple.py +240 -0
  44. scripts/data/data_augmentation.py +63 -0
  45. scripts/data/data_cleaning.py +90 -0
  46. scripts/data/data_processing.py +108 -0
  47. scripts/data/parallel_utils.py +31 -0
  48. scripts/data/prepare_experiment_data.py +513 -0
  49. scripts/data/prepare_training_data_fixed.py +408 -0
  50. scripts/evaluate.py +432 -0
.gitignore ADDED
@@ -0,0 +1,107 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.class
5
+
6
+ # C extensions
7
+ *.so
8
+
9
+ # Distribution / packaging
10
+ .Python
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ pip-wheel-metadata/
24
+ share/python-wheels/
25
+ *.egg-info/
26
+ .installed.cfg
27
+ *.egg
28
+ MANIFEST
29
+
30
+ # PyInstaller
31
+ # Usually these files are written by a python script from a template
32
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
33
+ *.manifest
34
+ *.spec
35
+
36
+ # Installer logs
37
+ pip-log.txt
38
+ pip-delete-this-directory.txt
39
+
40
+ # Unit test / coverage reports
41
+ htmlcov/
42
+ .tox/
43
+ .nox/
44
+ .coverage
45
+ .coverage.*
46
+ .cache
47
+ nosetests.xml
48
+ coverage.xml
49
+ *.cover
50
+ *.py,cover
51
+ .hypothesis/
52
+ .pytest_cache/
53
+
54
+ # Environments
55
+ .env
56
+ .venv
57
+ .seriguela
58
+ venv/
59
+ ENV/
60
+ env/
61
+ env.bak/
62
+ venv.bak/
63
+
64
+ # IDEs / Editors
65
+ .idea/
66
+ .vscode/
67
+ *.suo
68
+ *.ntvs*
69
+ *.njsproj
70
+ *.sln
71
+ *.sw?
72
+
73
+ # Jupyter Notebook
74
+ .ipynb_checkpoints
75
+
76
+ # Output folder (geralmente grande demais para Git)
77
+ output/*
78
+ !output/.gitkeep # Não ignore um .gitkeep se precisar manter a pasta
79
+
80
+ # Dados (podem ser grandes, usar Git LFS ou armazenar fora se necessário)
81
+ # Note: CSV files in data/processed/ can be 100MB+ and are excluded from git
82
+ # Run scripts/data/prepare_training_data_fixed.py on target system to generate them
83
+ data/*
84
+ data/raw/*
85
+ data/processed/*
86
+ !data/raw/.gitkeep
87
+ !data/processed/.gitkeep
88
+
89
+ # OS generated files
90
+ .DS_Store
91
+ .DS_Store?
92
+ ._*
93
+ .Spotlight-V100
94
+ .Trashes
95
+ ehthumbs.db
96
+ Thumbs.db
97
+ .env
98
+
99
+ wandb
100
+
101
+ # AWS credentials and keys
102
+ aws/keys/*.pem
103
+ aws/keys/*.key
104
+ aws/.env
105
+ aws/credentials
106
+ *.pem
107
+ *.key
ANALYSIS_REPORT.md ADDED
@@ -0,0 +1,283 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Seriguela - Relatório Consolidado de Análise
2
+
3
+ **Data:** 2026-02-01
4
+ **Status:** ⚠️ BLOCK 2 PRECISA RETREINO
5
+
6
+ ---
7
+
8
+ ## Resumo Executivo
9
+
10
+ Projeto Seriguela tem 3 blocos:
11
+ 1. **Block 1 - Dados:** Preparação e análise ⚠️ **CAUSA RAIZ AQUI**
12
+ 2. **Block 2 - Treino Supervisionado:** Treinar LLM para gerar expressões ❌ PROBLEMA
13
+ 3. **Block 3 - PPO Finetuning:** Otimizar para symbolic regression ⛔ BLOQUEADO
14
+
15
+ **Causa raiz identificada:** Dados de treino **NÃO TÊM `<|endofex|>` markers**. 0% dos 758,255 exemplos têm o marker. Modelo nunca aprendeu a parar.
16
+
17
+ ---
18
+
19
+ ## Investigação da Causa Raiz (2026-02-01)
20
+
21
+ ### Descoberta 1: Validação Original Era Frouxa
22
+
23
+ Script `test_inference_configs.py` reporta **95% válidas**, mas aceita:
24
+ ```
25
+ ✅ VALID: C*x_1 + C*x_6 - tan(x_9) - Cainers: C9999(x
26
+ ✅ VALID: C*x_1 + C*x_2 + C*x_1 + C Pressure, sin, sqrt, tan
27
+ ```
28
+
29
+ Validação original só verifica:
30
+ - Tem operador? ✓
31
+ - Tem variável? ✓
32
+ - Não tem "Buyable"? ✓
33
+
34
+ **NÃO verifica:**
35
+ - Se usa variáveis do prompt
36
+ - Se pode ser parseada
37
+ - Se tem outros garbage tokens
38
+
39
+ ### Descoberta 2: Dados de Treino SEM Markers
40
+
41
+ ```python
42
+ # Dataset: augustocsc/sintetico_natural (700K)
43
+ Total de exemplos: 758,255
44
+ Exemplos com <|endofex|>: 0 (0.0%)
45
+ Exemplos com <|startofex|>: 0 (0.0%)
46
+ ```
47
+
48
+ **O modelo NUNCA viu `<|endofex|>` durante treino!**
49
+
50
+ ### Descoberta 3: Origem do Garbage
51
+
52
+ Garbage tokens (Stockholm, Pressure, XP, etc.) vêm do **vocabulário GPT-2 base**.
53
+ Como modelo não sabe parar, eventualmente gera tokens aleatórios.
54
+
55
+ ### Conclusão da Investigação
56
+
57
+ | Problema | Causa |
58
+ |----------|-------|
59
+ | Modelo não para | Dados sem `<|endofex|>` |
60
+ | Garbage tokens | GPT-2 base vaza sem stopping |
61
+ | Variáveis erradas | Dados têm x_1-x_10, modelo não aprende restrição |
62
+ | 95% vs 0% válidas | Validação original era frouxa |
63
+
64
+ ### Solução Necessária
65
+
66
+ 1. **Preparar dados** com `<|endofex|>` em 100% dos exemplos
67
+ 2. **Retreinar modelo** com dados corrigidos
68
+ 3. **Validação rigorosa** durante treino
69
+
70
+ ---
71
+
72
+ ## Modelos Testados
73
+
74
+ | Modelo | HuggingFace Hub | Esperado | Real | Status |
75
+ |--------|-----------------|----------|------|--------|
76
+ | V1 | augustocsc/Se124M_700K_infix | 83.3% válidas | **0%** | ❌ Falha |
77
+ | V2 | augustocsc/Se124M_700K_infix_v2 | 90% válidas | **0%** | ❌ Falha |
78
+
79
+ ---
80
+
81
+ ## Testes Realizados
82
+
83
+ ### Teste 1: Comparação V1 vs V2 (mesmo prompt)
84
+
85
+ **Prompt:**
86
+ ```
87
+ vars: x_1, x_2
88
+ oper: *, +, -, sin, cos
89
+ cons: C
90
+ expr:
91
+ ```
92
+
93
+ **Configurações ótimas usadas:**
94
+ - V1: temp=0.5, top_k=40, top_p=0.9, rep_penalty=1.15
95
+ - V2: temp=0.7, top_k=0, top_p=0.8, rep_penalty=1.0
96
+
97
+ **Resultados (20 gerações cada):**
98
+
99
+ | Métrica | V1 | V2 |
100
+ |---------|----|----|
101
+ | Expressões Válidas | 0% | 0% |
102
+ | Símbolos Corretos | 0% | 45% |
103
+
104
+ ### Teste 2: PPO Evaluation
105
+
106
+ **Objetivo:** Verificar se modelo pode ser usado para PPO (symbolic regression)
107
+
108
+ **Resultados:**
109
+ - Valid Rate: 6.7% (muito baixo)
110
+ - Best R²: N/A (não conseguiu computar)
111
+ - **Conclusão:** PPO inviável com modelo atual
112
+
113
+ ---
114
+
115
+ ## Problemas Identificados
116
+
117
+ ### 1. Modelos Não Param Corretamente
118
+
119
+ **Sintoma:** Expressões continuam além do esperado
120
+ ```
121
+ Esperado: C*x_1 + sin(x_2)<|endofex|>
122
+ Gerado: C*x_1 + sin(x_2) + C Stockholmvars: x_1, x_2, x_3...
123
+ ```
124
+
125
+ **Causa:** Modelo não aprendeu a gerar `<|endofex|>`
126
+
127
+ ### 2. Garbage Tokens na Saída
128
+
129
+ **Exemplos de lixo gerado:**
130
+ - "BuyableInstoreAndOnline"
131
+ - "Stockholm", "GREEN", "Muslims"
132
+ - "intuition", "records", "crash"
133
+ - "xstatics", "xid", "sinmod"
134
+
135
+ **Causa:** Dados de treino contaminados OU modelo não convergiu
136
+
137
+ ### 3. Variáveis Erradas
138
+
139
+ **Sintoma:** Usa variáveis não permitidas
140
+ ```
141
+ Prompt pede: x_1, x_2
142
+ Modelo gera: x_9, x_10, x_3, x_4
143
+ ```
144
+
145
+ **Causa:** Modelo não aprendeu a respeitar o prompt
146
+
147
+ ### 4. Discrepância com Documentação
148
+
149
+ **Documentação dizia:**
150
+ - V1: 83.3% válidas com config otimizada
151
+ - V2: 90% válidas com nucleus sampling
152
+
153
+ **Realidade:**
154
+ - V1: 0% válidas
155
+ - V2: 0% válidas
156
+
157
+ **Possíveis causas:**
158
+ 1. Modelos no Hub não são os mesmos testados
159
+ 2. Testes anteriores tinham bug
160
+ 3. Forma de carregar modelo está errada
161
+
162
+ ---
163
+
164
+ ## Configurações de Inferência Testadas
165
+
166
+ ### V1 Config Ótima (segundo docs)
167
+ ```python
168
+ {
169
+ "temperature": 0.5,
170
+ "top_k": 40,
171
+ "top_p": 0.9,
172
+ "repetition_penalty": 1.15,
173
+ "max_new_tokens": 100,
174
+ "do_sample": True,
175
+ }
176
+ ```
177
+
178
+ ### V2 Config Ótima (segundo docs)
179
+ ```python
180
+ {
181
+ "temperature": 0.7,
182
+ "top_k": 0,
183
+ "top_p": 0.8,
184
+ "repetition_penalty": 1.0,
185
+ "max_new_tokens": 128,
186
+ "do_sample": True,
187
+ }
188
+ ```
189
+
190
+ **Resultado:** Mesmo com configs ótimas, 0% válidas.
191
+
192
+ ---
193
+
194
+ ## Forma de Carregar Modelos
195
+
196
+ ```python
197
+ # 1. Carregar base GPT-2
198
+ model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)
199
+
200
+ # 2. Configurar tokenizer com tokens especiais
201
+ tokenizer = AutoTokenizer.from_pretrained("gpt2")
202
+ tokenizer.add_special_tokens({
203
+ "additional_special_tokens": ["<|startofex|>", "<|endofex|>"]
204
+ })
205
+
206
+ # 3. Redimensionar embeddings
207
+ model.resize_token_embeddings(len(tokenizer))
208
+
209
+ # 4. Carregar adapter LoRA
210
+ model = PeftModel.from_pretrained(model, "augustocsc/Se124M_700K_infix_v2")
211
+
212
+ # 5. Merge adapter no modelo base
213
+ model = model.merge_and_unload()
214
+ model.eval()
215
+ ```
216
+
217
+ ---
218
+
219
+ ## Conclusões
220
+
221
+ ### Block 2 (Treino) - PRECISA RETREINO
222
+
223
+ **Problemas no treino:**
224
+ 1. Modelo não aprendeu `<|endofex|>` marker
225
+ 2. Dados podem estar contaminados com garbage
226
+ 3. Modelo não respeita variáveis do prompt
227
+
228
+ **Ações necessárias:**
229
+ 1. Validar dados de treino (100% devem ter `<|endofex|>`)
230
+ 2. Limpar garbage tokens dos dados
231
+ 3. Monitorar valid rate durante treino
232
+ 4. Só considerar treino bem-sucedido se valid rate > 80%
233
+
234
+ ### Block 3 (PPO) - BLOQUEADO
235
+
236
+ **Pré-requisitos para PPO:**
237
+ - ✅ Base model gera >80% expressões válidas
238
+ - ✅ Expressões podem ser avaliadas (R² computável)
239
+ - ✅ Modelo para corretamente em boundaries
240
+
241
+ **Status atual:** ❌ Nenhum pré-requisito atendido
242
+
243
+ ---
244
+
245
+ ## Próximos Passos
246
+
247
+ 1. **Investigar dados de treino**
248
+ - Verificar se `<|endofex|>` está presente
249
+ - Identificar fonte de garbage tokens
250
+
251
+ 2. **Retreinar modelo (V3)**
252
+ - Usar dados validados
253
+ - Monitorar valid rate durante treino
254
+ - Validar antes de fazer push pro Hub
255
+
256
+ 3. **Só então testar PPO**
257
+ - Após valid rate > 80%
258
+ - Com modelo que para corretamente
259
+
260
+ ---
261
+
262
+ ## Arquivos de Código Relevantes
263
+
264
+ - `scripts/train.py` - Script de treino
265
+ - `scripts/generate.py` - Geração com stopping criteria
266
+ - `scripts/evaluate.py` - Avaliação de modelo
267
+ - `scripts/compare_v1_v2_simple.py` - Comparação V1 vs V2
268
+ - `scripts/evaluate_ppo.py` - Avaliação para PPO
269
+ - `scripts/data/prepare_training_data_fixed.py` - Preparação de dados
270
+ - `classes/expression.py` - Parsing e validação de expressões
271
+
272
+ ---
273
+
274
+ ## Infraestrutura AWS
275
+
276
+ - **Instance:** g5.xlarge (NVIDIA A10G, 24GB)
277
+ - **Instance ID:** i-0377b6c8de3660a82
278
+ - **Custo:** ~$1/hora
279
+ - **Status atual:** Stopped (para economizar)
280
+
281
+ ---
282
+
283
+ **Última atualização:** 2026-02-01
EXPERIMENT_PLAN.md ADDED
@@ -0,0 +1,195 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Plano de Experimentos: Formatos de Treino
2
+
3
+ **Data:** 2026-02-01
4
+ **Objetivo:** Testar duas abordagens para resolver o problema de stopping
5
+
6
+ ---
7
+
8
+ ## Contexto
9
+
10
+ ### Problema Identificado
11
+ - Dados de treino não têm marcador de fim (0% com qualquer marker)
12
+ - Modelo não aprende quando parar
13
+ - Gera garbage tokens do vocabulário GPT-2
14
+
15
+ ### Experimentos Propostos
16
+ 1. **EXP-A:** Formato estruturado (JSON-like)
17
+ 2. **EXP-B:** Token EOS do GPT-2 (`<|endoftext|>`)
18
+
19
+ ---
20
+
21
+ ## EXP-A: Formato Estruturado
22
+
23
+ ### Formato dos Dados
24
+ ```json
25
+ {"vars": ["x_1", "x_2"], "ops": ["*", "+", "sin"], "expr": "C*sin(x_1) + x_2"}
26
+ ```
27
+
28
+ ### Vantagens
29
+ - Estrutura clara e parseável
30
+ - Fácil validação (JSON válido = formato correto)
31
+ - Modelo aprende estrutura rígida
32
+
33
+ ### Desvantagens
34
+ - Mais tokens por exemplo
35
+ - Pode ser mais difícil de aprender
36
+
37
+ ### Preparação de Dados
38
+ ```python
39
+ # Transformar de:
40
+ "vars: x_1, x_2\noper: *, +, sin\ncons: C\nexpr: C*sin(x_1) + x_2"
41
+
42
+ # Para:
43
+ '{"vars": ["x_1", "x_2"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "C*sin(x_1) + x_2"}'
44
+ ```
45
+
46
+ ### Inferência
47
+ ```python
48
+ prompt = '{"vars": ["x_1", "x_2"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "'
49
+ # Modelo completa com: C*sin(x_1) + x_2"}
50
+ # Extrair: tudo entre 'expr": "' e '"}'
51
+ ```
52
+
53
+ ### Critério de Sucesso
54
+ - JSON parseável em >90% dos casos
55
+ - Expressão extraída válida em >80% dos casos
56
+
57
+ ---
58
+
59
+ ## EXP-B: Token EOS do GPT-2
60
+
61
+ ### Formato dos Dados
62
+ ```
63
+ vars: x_1, x_2
64
+ oper: *, +, sin
65
+ cons: C
66
+ expr: C*sin(x_1) + x_2<|endoftext|>
67
+ ```
68
+
69
+ ### Vantagens
70
+ - Token já existe no modelo (ID 50256)
71
+ - GPT-2 já entende como "fim de sequência"
72
+ - Não precisa resize de embeddings
73
+ - Formato similar ao atual
74
+
75
+ ### Desvantagens
76
+ - Pode conflitar com outros usos do EOS
77
+ - Menos explícito que marker dedicado
78
+
79
+ ### Preparação de Dados
80
+ ```python
81
+ # Adicionar <|endoftext|> no final de cada expressão
82
+ text = original_text + "<|endoftext|>"
83
+ ```
84
+
85
+ ### Inferência
86
+ ```python
87
+ # Usar eos_token_id como stopping criteria
88
+ output = model.generate(
89
+ **inputs,
90
+ eos_token_id=tokenizer.eos_token_id, # 50256
91
+ max_new_tokens=128
92
+ )
93
+ ```
94
+
95
+ ### Critério de Sucesso
96
+ - Modelo gera `<|endoftext|>` em >90% dos casos
97
+ - Expressão antes do EOS válida em >80% dos casos
98
+
99
+ ---
100
+
101
+ ## Plano de Execução
102
+
103
+ ### Fase 1: Preparação de Dados (Local)
104
+
105
+ #### 1.1 Criar script de preparação
106
+ ```
107
+ scripts/data/prepare_experiment_data.py
108
+ ```
109
+ - Entrada: dataset augustocsc/sintetico_natural (700K)
110
+ - Saída A: data/exp_a_json/train.csv, validation.csv
111
+ - Saída B: data/exp_b_eos/train.csv, validation.csv
112
+
113
+ #### 1.2 Validar dados preparados
114
+ - Verificar formato correto em 100% dos exemplos
115
+ - Amostrar e inspecionar manualmente
116
+
117
+ ### Fase 2: Treino (AWS)
118
+
119
+ #### 2.1 Treinar EXP-A (JSON)
120
+ ```bash
121
+ python scripts/train.py \
122
+ --use_local_csvs \
123
+ --train_file ./data/exp_a_json/train.csv \
124
+ --output_dir ./output/exp_a_json \
125
+ --num_train_epochs 3
126
+ ```
127
+
128
+ #### 2.2 Treinar EXP-B (EOS)
129
+ ```bash
130
+ python scripts/train.py \
131
+ --use_local_csvs \
132
+ --train_file ./data/exp_b_eos/train.csv \
133
+ --output_dir ./output/exp_b_eos \
134
+ --num_train_epochs 3
135
+ ```
136
+
137
+ ### Fase 3: Avaliação
138
+
139
+ #### 3.1 Métricas
140
+ - **Valid Rate:** % expressões parseáveis
141
+ - **Stopping Rate:** % que param corretamente (JSON fechado ou EOS)
142
+ - **Symbol Accuracy:** % que usam apenas símbolos do prompt
143
+ - **Garbage Rate:** % com tokens não-matemáticos
144
+
145
+ #### 3.2 Comparação
146
+ | Métrica | EXP-A (JSON) | EXP-B (EOS) |
147
+ |---------|--------------|-------------|
148
+ | Valid Rate | ? | ? |
149
+ | Stopping Rate | ? | ? |
150
+ | Symbol Accuracy | ? | ? |
151
+ | Garbage Rate | ? | ? |
152
+
153
+ ### Fase 4: Decisão
154
+
155
+ - Se EXP-A melhor → usar formato JSON
156
+ - Se EXP-B melhor → usar EOS token
157
+ - Se ambos ruins → investigar outras opções
158
+
159
+ ---
160
+
161
+ ## Estimativas
162
+
163
+ | Fase | Tempo | Custo AWS |
164
+ |------|-------|-----------|
165
+ | Preparação dados | 30 min | $0 |
166
+ | Treino EXP-A | 2-3h | ~$3 |
167
+ | Treino EXP-B | 2-3h | ~$3 |
168
+ | Avaliação | 30 min | ~$0.50 |
169
+ | **Total** | **6-7h** | **~$6.50** |
170
+
171
+ ---
172
+
173
+ ## Arquivos a Criar
174
+
175
+ ```
176
+ scripts/data/prepare_experiment_data.py # Preparação
177
+ data/exp_a_json/train.csv # Dados JSON
178
+ data/exp_a_json/validation.csv
179
+ data/exp_b_eos/train.csv # Dados EOS
180
+ data/exp_b_eos/validation.csv
181
+ scripts/evaluate_experiments.py # Avaliação
182
+ ```
183
+
184
+ ---
185
+
186
+ ## Critério de Sucesso Final
187
+
188
+ **Experimento bem-sucedido se:**
189
+ - Valid Rate > 80%
190
+ - Stopping Rate > 90%
191
+ - Garbage Rate < 5%
192
+
193
+ **Próximo passo após sucesso:**
194
+ - Usar formato vencedor para treinar modelo final
195
+ - Prosseguir para Block 3 (PPO)
README.md ADDED
@@ -0,0 +1,106 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Nome do Seu Projeto de Fine-Tuning
2
+
3
+ (Breve descrição do objetivo do projeto)
4
+
5
+ ## Estrutura de Pastas
6
+
7
+ Aqui está a organização das pastas e seus propósitos:
8
+
9
+ ```
10
+ seu_projeto_finetuning/
11
+
12
+ ├── data/ # Todos os dados relacionados ao projeto
13
+ │ ├── raw/ # Dados originais, não processados
14
+ │ └── processed/ # Dados limpos, formatados e divididos (train/val/test)
15
+
16
+ ├── scripts/ # Scripts Python principais
17
+ │ ├── preprocess_data.py # (Opcional) Script para limpar e formatar dados
18
+ │ ├── train.py # Script principal para rodar o Trainer do HF
19
+ │ ├── evaluate.py # (Opcional) Script para avaliação customizada
20
+ │ └── generate.py # (Opcional) Script para gerar texto com modelo treinado
21
+
22
+ ├── configs/ # Arquivos de configuração (JSON, YAML, etc.)
23
+ │ ├── training_args.json # Argumentos de treino (passados para TrainingArguments)
24
+ │ ├── peft_config.json # (Se usar PEFT) Configuração LoRA, Adapter, etc.
25
+ │ └── model_config.json # (Opcional) Nome do modelo base, caminhos, etc.
26
+
27
+ ├── output/ # Todos os outputs gerados (modelos, logs, resultados)
28
+ │ └── {nome_experimento}/ # Subpasta para cada execução/experimento
29
+ │ ├── checkpoints/ # Checkpoints salvos pelo Trainer
30
+ │ ├── final_model/ # Modelo final treinado
31
+ │ ├── logs/ # Logs do TensorBoard ou outros
32
+ │ └── ... # Outros resultados (métricas, amostras)
33
+
34
+ ├── notebooks/ # (Opcional) Jupyter notebooks para exploração e testes
35
+
36
+ ├── .gitignore # Especifica arquivos/pastas a serem ignorados pelo Git
37
+ ├── requirements.txt # Dependências Python do projeto
38
+ └── README.md # Documentação do projeto (este arquivo)
39
+ ```
40
+
41
+ * **`data/`**: Contém todos os dados.
42
+ * `raw/`: Armazena os dados originais, sem modificações.
43
+ * `processed/`: Guarda os dados após limpeza, formatação e divisão (treino, validação, teste), prontos para serem usados pelo script de treinamento.
44
+ * **`scripts/`**: Onde fica o código Python.
45
+ * `train.py`: O coração do projeto, responsável por carregar dados, modelo, configurações e executar o fine-tuning com o `Trainer`.
46
+ * Scripts auxiliares para pré-processamento, avaliação ou geração podem ser incluídos aqui.
47
+ * **`configs/`**: Centraliza as configurações do projeto, como hiperparâmetros de treinamento (`training_args.json`), configurações PEFT (`peft_config.json`) ou detalhes do modelo base. Isso facilita a alteração de parâmetros sem modificar o código principal.
48
+ * **`output/`**: Diretório para todos os artefatos gerados durante o treinamento. É **altamente recomendado** criar uma subpasta para cada experimento (identificada por nome ou timestamp) para manter os resultados organizados (checkpoints, modelo final, logs, métricas). O `output_dir` do `TrainingArguments` deve apontar para essa subpasta específica do experimento.
49
+ * **`notebooks/`**: Espaço para prototipagem, análise exploratória de dados e testes rápidos usando Jupyter Notebooks.
50
+ * **`.gitignore`**: Configura o Git para ignorar arquivos e pastas desnecessários (ambientes virtuais, caches, outputs grandes, dados brutos grandes, etc.).
51
+ * **`requirements.txt`**: Lista as bibliotecas Python necessárias para que o projeto funcione, permitindo recriar o ambiente facilmente (`pip install -r requirements.txt`).
52
+ * **`README.md`**: Documentação essencial explicando o projeto, como configurá-lo e executá-lo.
53
+
54
+ ## Como Usar
55
+
56
+ 1. **Setup:** Crie um ambiente virtual e instale as dependências:
57
+ ```bash
58
+ python -m venv venv
59
+ source venv/bin/activate # Linux/macOS
60
+ # venv\Scripts\activate # Windows
61
+ pip install -r requirements.txt
62
+ ```
63
+ 2. **Dados:** Coloque seus dados brutos em `data/raw/` e execute (ou crie) o script `scripts/preprocess_data.py` para gerar os arquivos em `data/processed/`.
64
+ 3. **Configuração:** Ajuste os arquivos em `configs/` (argumentos de treino, modelo base, PEFT se aplicável).
65
+ 4. **Treinamento:** Execute o script principal:
66
+ ```bash
67
+ python scripts/train.py --args_config configs/training_args.json --model_config configs/model_config.json
68
+ ```
69
+ *(Adapte os argumentos conforme necessário)*
70
+
71
+ ## Dependências
72
+
73
+ As dependências Python estão listadas no arquivo `requirements.txt`.
74
75
+
76
+ A seção abaixo explica como configurar o ambiente com `venv`, instalar as dependências e habilitar o uso de GPU e Weights & Biases (W&B):
77
+
78
+ ---
79
+
80
+ ### 🚀 Setup do Ambiente (com suporte a GPU e W&B)
81
+
82
+ Siga os passos abaixo para configurar o ambiente de desenvolvimento com `venv`, `pip`, suporte a GPU (CUDA 11.8) e monitoramento com Weights & Biases:
83
+
84
+ ```bash
85
+ # 1. Crie o ambiente virtual
86
+ python -m venv .seriguela
87
+
88
+ # 2. Ative o ambiente virtual
89
+ # No Linux/macOS:
90
+ source .seriguela/bin/activate
91
+ # No Windows:
92
+ .seriguela\Scripts\activate
93
+
94
+ # 3. Instale as dependências principais
95
+ pip install -r requirements.txt
96
+
97
+ # 4. Instale PyTorch com suporte a CUDA 11.8 (para uso com GPU)
98
+ pip install torch==2.2.1+cu118 torchvision==0.17.1+cu118 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118
99
+
100
+ # 5. (Opcional) Faça login no Weights & Biases para monitorar seus experimentos
101
+ wandb login
102
+ ```
103
+
104
+ > ⚠️ Certifique-se de que sua GPU e drivers estão atualizados e compatíveis com CUDA 11.8.
105
+ > 💡 Para ambientes 100% reprodutíveis, use sempre o mesmo `requirements.txt` e registre os experimentos com `wandb`.
106
+
classes/__init__.py ADDED
File without changes
classes/dataset.py ADDED
@@ -0,0 +1,48 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import pandas as pd
import torch


class RegressionDataset:
    """Tabular regression data loaded from a CSV file.

    The target is one column (by default the last one); every other column
    becomes a feature. Both are coerced to numeric values — cells that cannot
    be parsed become NaN rather than raising.
    """

    def __init__(self, path: str, file_name: str = 'train.csv', delimiter: str = ',', header: int = 0,
                 encoding: str = 'utf-8', target_col: str = None):
        """Load ``{path}/{file_name}`` and split it into features X and target y.

        Args:
            path (str): Directory containing the CSV file.
            file_name (str): CSV file name. Defaults to 'train.csv'.
            delimiter (str): Field delimiter. Defaults to ','.
            header (int): Row number to use as column names. Defaults to 0.
            encoding (str): File encoding. Defaults to 'utf-8'.
            target_col (str): Target column name. If None, the last column is used.

        Raises:
            ValueError: If the CSV is empty or lacks the requested target column.
        """
        frame = pd.read_csv(f"{path}/{file_name}", delimiter=delimiter, header=header, encoding=encoding)

        if frame.empty:
            raise ValueError("CSV file is empty.")

        # Default target: the right-most column of the file.
        if target_col is None:
            target_col = frame.columns[-1]

        if target_col not in frame.columns:
            raise ValueError(f"CSV must contain a column named '{target_col}'.")

        self.data = frame
        # errors='coerce' turns unparseable cells into NaN instead of raising.
        self.X = frame.drop(columns=[target_col]).apply(pd.to_numeric, errors='coerce').values
        self.y = pd.to_numeric(frame[target_col], errors='coerce').values

    def get_data(self):
        """Return (X, y) as float32 PyTorch tensors."""
        X_tensor = torch.tensor(self.X, dtype=torch.float32)
        y_tensor = torch.tensor(self.y, dtype=torch.float32)
        return X_tensor, y_tensor

    def get_numpy(self):
        """Return (X, y) as NumPy arrays (useful for sympy and R² calculations)."""
        return self.X, self.y
classes/expression.py ADDED
@@ -0,0 +1,403 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import sympy
2
+ import numpy as np
3
+ from sklearn.metrics import r2_score, mean_squared_error
4
+ from sklearn.metrics import mean_absolute_error
5
+ from scipy.optimize import minimize
6
+ import math
7
+ import re
8
+
9
+
10
class Expression:
    """Symbolic expression for regression: parsing, validation and
    constant-fitting helpers built on sympy/numpy.

    Only the class header and its operator lookup tables are defined here;
    the methods follow below.
    """

    # numpy callables exposed when eval-ing the computable expression string.
    SAFE_FUNCTIONS = {
        'sqrt': np.sqrt,
        'log': np.log,
        'exp': np.exp,
        'sin': np.sin,
        'cos': np.cos,
        'tan': np.tan,
        'asin': np.arcsin,  # numpy spells arcsine 'arcsin'
        'abs': np.abs,
        'pow': np.power,  # np.power vectorizes and propagates NaN
        # '**' is handled by Python's eval; numpy array operands dispatch to np.power.
    }

    # How many operands each operator token consumes.
    OPERATOR_ARITY = {
        '+': 2,
        '-': 2,
        '*': 2,
        '/': 2,
        '**': 2,  # '**' rather than '^'
        'sin': 1,
        'cos': 1,
        'tan': 1,
        'log': 1,
        'sqrt': 1,
        'exp': 1,
        # Added for consistency with SAFE_FUNCTIONS and parse_prefix's UNARY_OPS,
        # which already accept these unary functions.
        'abs': 1,
        'asin': 1,
    }

    # sympy constructor for each operator token.
    OPERATOR_FUNCS = {
        '+': sympy.Add,
        '-': lambda x, y: x - y,
        '*': sympy.Mul,
        '/': lambda x, y: x / y,
        '**': sympy.Pow,  # sympy.Pow handles both '**' and '^'
        'sin': sympy.sin,
        'cos': sympy.cos,
        'tan': sympy.tan,
        'log': sympy.log,
        'sqrt': sympy.sqrt,
        'exp': sympy.exp,
        # Added for consistency with SAFE_FUNCTIONS and parse_prefix's UNARY_OPS.
        'abs': sympy.Abs,
        'asin': sympy.asin,
    }
51
+
52
+ def parse_prefix(self, tokens):
53
+ """Parse prefix notation expression to SymPy.
54
+
55
+ Example: ['*', 'x_1', '+', 'x_2', 'C'] -> x_1*(x_2 + C)
56
+ """
57
+ if not tokens:
58
+ raise ValueError("Empty token list")
59
+
60
+ # Define unary and binary operators
61
+ UNARY_OPS = {'sin', 'cos', 'tan', 'exp', 'log', 'sqrt', 'abs', 'asin'}
62
+ BINARY_OPS = {'+', '-', '*', '/', '**', '^'}
63
+
64
+ stack = []
65
+
66
+ # Process tokens in reverse order
67
+ for token in reversed(tokens):
68
+ if token in BINARY_OPS or token in UNARY_OPS:
69
+ # Operator: pop operands from stack
70
+ if token in UNARY_OPS:
71
+ if len(stack) < 1:
72
+ raise ValueError(f"Not enough operands for {token}")
73
+ arg = stack.pop()
74
+ if token in ['sin', 'cos', 'tan', 'exp', 'log', 'sqrt', 'abs', 'asin']:
75
+ stack.append(f"{token}({arg})")
76
+ else:
77
+ raise ValueError(f"Unknown unary operator: {token}")
78
+ else: # Binary operator
79
+ if len(stack) < 2:
80
+ raise ValueError(f"Not enough operands for {token}")
81
+ right = stack.pop()
82
+ left = stack.pop()
83
+
84
+ # Handle operator mapping
85
+ op_map = {'+': '+', '-': '-', '*': '*', '/': '/', '**': '**', '^': '**'}
86
+ op = op_map.get(token, token)
87
+
88
+ if op in ['**', '^']:
89
+ stack.append(f"({left})**({right})")
90
+ elif op == '/':
91
+ stack.append(f"({left})/({right})")
92
+ else:
93
+ stack.append(f"({left}){op}({right})")
94
+ else:
95
+ # Operand: push to stack
96
+ stack.append(token)
97
+
98
+ if len(stack) != 1:
99
+ raise ValueError(f"Invalid prefix expression, {len(stack)} elements remaining")
100
+
101
+ return sympy.sympify(stack[0], evaluate=False)
102
+
103
+ def __init__(self, expression, is_prefix=False):
104
+ try:
105
+ self.original_expression = expression # Save original
106
+
107
+ if is_prefix:
108
+ # Ensure input prefix uses '**' if converting from external source
109
+ tokens = expression.replace('^', '**').split()
110
+ self.sympy_expression = self.parse_prefix(tokens)
111
+ else:
112
+ # Load the expression as a sympy expression without simplification
113
+ self.sympy_expression = sympy.sympify(expression, evaluate=False)
114
+ except Exception as e:
115
+ raise ValueError(f"Failed to parse expression: {e}")
116
+
117
+ self.max_var = 0
118
+ for symbol in self.sympy_expression.free_symbols:
119
+ if symbol.name.startswith('x_'):
120
+ try:
121
+ index = int(symbol.name.split('_')[1])
122
+ self.max_var = max(self.max_var, index)
123
+ except ValueError:
124
+ # Handle symbols that look like x_ but aren't x_number
125
+ pass # Or raise ValueError(f"Invalid variable name: {symbol.name}") if strict
126
+
127
+ computable_expression = str(self.sympy_expression)
128
+
129
+ for i in range(1, self.max_var + 1):
130
+ # Use regex to match whole words to avoid issues with x_1 followed by x_11
131
+ computable_expression = re.sub(rf'\bx_{i}\b', f'x[{i-1}]', computable_expression)
132
+
133
+
134
+ self.computable_expression = computable_expression.replace('**C', '**2')
135
+
136
+ self.constant_count = self.computable_expression.count('C')
137
+ self.best_constants = [1.0] * self.constant_count
138
+
139
+
140
+ if self.constant_count > 0:
141
+ # Replace 'C' with indexable constants
142
+ split_expr = self.computable_expression.split('C')
143
+ new_expr = split_expr[0] # Start with first part
144
+
145
+ for i in range(1, len(split_expr)):
146
+ # Add constant reference
147
+ new_expr += f'constants[{i-1}]'
148
+ # Add next part
149
+ new_expr += split_expr[i]
150
+
151
+ self.computable_expression = new_expr
152
+
153
+
154
+
155
+
156
+
157
+ def __str__(self):
158
+ return f"Expression: {self.original_expression}, Best constants: {self.best_constants}"
159
+ def sympy_str(self):
160
+ """
161
+ Returns the string representation of the sympy expression.
162
+ """
163
+ return str(self.sympy_expression)
164
+
165
+ def is_valid_on_dataset(self, X, test_constants_list=None):
166
+ """
167
+ Checks if the expression evaluates to valid (finite) values for all rows in X,
168
+ across one or more sets of test constants.
169
+
170
+ Args:
171
+ X (np.ndarray): Input data, shape (n_samples, n_features)
172
+ test_constants_list (list of lists): Optional. Defaults to [[1.0]*count].
173
+ Example: [[1.0]*n, [0.5]*n, [2.0]*n] to test more thoroughly.
174
+
175
+ Returns:
176
+ bool: True if no evaluation returns nan/inf or crashes. False otherwise.
177
+ """
178
+ if test_constants_list is None:
179
+ test_constants_list = [[1.0] * self.constant_count]
180
+
181
+ try:
182
+ for constants in test_constants_list:
183
+ results = self.evaluate(X, constants)
184
+
185
+ if not np.all(np.isfinite(results)):
186
+ return False
187
+
188
+ return True
189
+ except Exception:
190
+ return False
191
+
192
+ # Inside the Expression class
193
+ def evaluate(self, X, constants=None):
194
+ # with warnings.catch_warnings():
195
+ # warnings.simplefilter("ignore", category=RuntimeWarning) # Hide power/tan warnings
196
+ # np.seterr(invalid='ignore', divide='ignore')
197
+
198
+
199
+
200
+ if constants is None:
201
+ # print("No constants provided, using best constants.") # Optional: uncomment for debugging
202
+ constants = self.best_constants
203
+
204
+ try:
205
+ local_env = {
206
+ "constants": np.array(constants), # Ensure constants is a numpy array for broadcasting
207
+ **self.SAFE_FUNCTIONS,
208
+ "__builtins__": None
209
+ }
210
+
211
+ if not isinstance(X, np.ndarray):
212
+ X = np.array(X) # Ensure X is a numpy array
213
+
214
+ # Ensure X is 2D, even if it has only one sample
215
+ if X.ndim == 1:
216
+ X = X.reshape(1, -1)
217
+
218
+ # x becomes a list of columns (1D arrays of shape (n_samples,))
219
+ x_cols = [X[:, i] for i in range(X.shape[1])]
220
+ local_env["x"] = x_cols
221
+
222
+ # The result will be a numpy array of shape (n_samples,)
223
+
224
+ try:
225
+ y_pred_array = eval(self.computable_expression, local_env)
226
+
227
+ except FloatingPointError as e:
228
+ # print(f"FloatingPointError during eval: {e}")
229
+ # print(f"Expression: {self.computable_expression}")
230
+ # print(f"Constants: {constants}")
231
+ return np.full(X.shape[0], np.nan) # Return NaNs to be caught by loss
232
+
233
+ except Exception as e:
234
+ # print(f"General exception during eval: {e}")
235
+ return np.full(X.shape[0], np.nan)
236
+
237
+ finally:
238
+ np.seterr(all='warn') # 🔁 Reset to default behavior
239
+
240
+ # Ensure output is float to avoid issues with mixed types if some results are int
241
+ return np.asarray(y_pred_array, dtype=float)
242
+
243
+ except Exception as e:
244
+ # Return an array of NaNs of the expected shape to ensure loss calculation doesn't break
245
+ num_samples = X.shape[0] if X.ndim > 0 else 1
246
+ return np.full(num_samples, np.nan) # Return NaNs on error
247
+
248
+ def fit_constants(self, X, y):
249
+ X = np.array(X)
250
+ y = np.array(y)
251
+
252
+ if self.constant_count == 0:
253
+ try:
254
+ y_pred = self.evaluate(X) # Vectorized call
255
+ if not np.all(np.isfinite(y_pred)): # Check for NaNs/Infs
256
+ return -np.inf
257
+ if np.all(y_pred == y_pred[0]) and len(np.unique(y)) > 1: # Avoid R2 issues with constant prediction for non-constant y
258
+ return 0.0 # Or handle as per specific requirements
259
+ return r2_score(y, y_pred)
260
+ except Exception as e: # Broader catch for any eval issue
261
+ return -np.inf
262
+
263
+ def loss(current_constants):
264
+
265
+ try:
266
+ y_pred = self.evaluate(X, current_constants)
267
+
268
+ except Exception as e:
269
+ print(f"Exception during evaluation: {e}")
270
+ return np.inf
271
+
272
+ if not np.all(np.isfinite(y_pred)):
273
+ return np.inf
274
+
275
+ # MSE calculation
276
+ mse = np.mean((y - y_pred) ** 2)
277
+
278
+ return mse
279
+
280
+ bounds = [(-2., 2.)] * self.constant_count
281
+
282
+ initial_guess = (
283
+ self.best_constants
284
+ if self.best_constants and len(self.best_constants) == self.constant_count
285
+ else [.0] * self.constant_count # Default to 1.0
286
+ )
287
+
288
+ # Ensure initial_guess is a flat numpy array
289
+ initial_guess = np.array(initial_guess, dtype=float).flatten()
290
+
291
+
292
+ # from scipy.optimize import differential_evolution
293
+ # # Step 1: Use Differential Evolution for global exploration
294
+ # print("\n--- Starting Differential Evolution ---")
295
+ # result_de = differential_evolution(loss, bounds,
296
+ # popsize=70, # Aumente para 50, 70, ou mais
297
+ # maxiter=10000, # Aumente para 5000, 10000, ou mais
298
+ # strategy='rand1bin', # Tente 'rand1exp' se rand1bin não funcionar
299
+ # tol=1e-7, # Tolerância mais apertada
300
+ # mutation=(0.8, 1.2), # Experimente valores mais altos
301
+ # recombination=0.5, # Experimente valores mais baixos
302
+ # seed=42, # Mantém a reproducibilidade
303
+ # disp=True, # Exibe o progresso
304
+ # polish=False)
305
+
306
+ # if result_de.success:
307
+ # print(f"\nDifferential Evolution finished successfully. Best raw constants: {result_de.x}, Best MSE: {result_de.fun}")
308
+ # # Use the result from DE as initial guess for local optimizer
309
+ # initial_guess_for_minimize = result_de.x
310
+
311
+ # # Step 2: (Optional but recommended) Refine with L-BFGS-B
312
+ # # L-BFGS-B will be applied to the "raw" (non-rounded) values,
313
+ # # but the loss function internally rounds for discrete ones.
314
+ # # It might still struggle if the function is too "stepped" from rounding.
315
+ # print("\n--- Starting L-BFGS-B refinement ---")
316
+ # result_min = minimize(loss,
317
+ # x0=initial_guess_for_minimize,
318
+ # method='L-BFGS-B',
319
+ # bounds=bounds,
320
+ # options={'maxiter': 500, 'ftol': 1e-9, 'disp': True} # More iterations, tighter tolerance
321
+ # )
322
+
323
+ # if result_min.success:
324
+ # print(f"\nL-BFGS-B refinement successful. Final raw constants: {result_min.x}, Final MSE: {result_min.fun}")
325
+ # self.best_constants = list(result_min.x)
326
+ # else:
327
+ # print(f"\nL-BFGS-B refinement failed: {result_min.message}. Using Differential Evolution's result.")
328
+ # self.best_constants = list(result_de.x)
329
+ # else:
330
+ # print(f"\nDifferential Evolution did not converge successfully: {result_de.message}. Cannot proceed with optimization.")
331
+ # return -np.inf # Indicate failure
332
+
333
+ # try:
334
+ # y_pred = self.evaluate(X)
335
+ # if not np.all(np.isfinite(y_pred)):
336
+ # print("Final evaluation produced non-finite values for R2 score.")
337
+ # return -np.inf
338
+ # if len(np.unique(y)) == 1:
339
+ # if np.allclose(y_pred, y[0]):
340
+ # return 1.0
341
+ # else:
342
+ # return 0.0
343
+ # return r2_score(y, y_pred)
344
+ # except Exception as e:
345
+ # print(f"Error calculating final R2: {e}")
346
+ # return -np.inf
347
+
348
+ result = minimize(loss,
349
+ x0=initial_guess,
350
+ method='L-BFGS-B',
351
+ bounds=bounds,
352
+ #options={'maxiter': 10, 'maxfun': 10, 'disp': True}
353
+ )
354
+
355
+ if result.success:
356
+ self.best_constants = result.x.tolist()
357
+ # print(f"Optimization successful. Final loss: {result.fun}") # Optional
358
+ try:
359
+ y_pred = self.evaluate(X) # Uses self.best_constants (vectorized)
360
+ if not np.all(np.isfinite(y_pred)):
361
+ return -np.inf
362
+ # Refined R2 calculation for edge cases
363
+ if len(np.unique(y)) == 1: # If y is constant
364
+ if np.allclose(y_pred, y[0]):
365
+ return 1.0 # Perfect prediction of a constant
366
+ else:
367
+ return 0.0 # Or some other metric for imperfect constant prediction
368
+ #return mean_squared_error(y, y_pred) # Use MSE for optimization
369
+ #return mean_absolute_error(y, y_pred) # Use MAE for robustness
370
+ return r2_score(y, y_pred)
371
+ except Exception as e:
372
+ return -np.inf
373
+ else:
374
+ return -np.inf
375
+
376
+ # from dataset import RegressionDataset
377
+
378
+ # import numpy as np
379
+ # import warnings
380
+
381
+ # with warnings.catch_warnings():
382
+ # warnings.simplefilter("ignore", category=RuntimeWarning)
383
+ # np.seterr(invalid='ignore')
384
+
385
+ # #reg = RegressionDataset('../data/evaluate/srsd-feynman_hard/train', 'feynman-bonus.12.txt', delimiter=' ')
386
+ # reg = RegressionDataset('./data/evaluate/srsd-feynman_easy/train', 'feynman-i.18.16.txt', delimiter=' ')
387
+ # X, y = reg.get_numpy()
388
+
389
+ # #x = np.array(X).T
390
+ # expression = "x_1*x_2*sin(x_4)"
391
+ # #expr = "0.5*x[0]*x[1]**2"
392
+
393
+
394
+ # expr = Expression(expression)
395
+ # print("Expression:", expr)
396
+
397
+ # if expr.is_valid_on_dataset(X):
398
+ # print("Expression is valid on dataset.")
399
+ # score = expr.fit_constants(X, y)
400
+ # print("Fitted constants:", expr.best_constants)
401
+ # print("R2 score:", score)
402
+ # else:
403
+ # print("Expression is not valid on dataset.")
configs/eval_dataset_download.sh ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ git clone https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_easy_dummy
2
+ git clone https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_medium_dummy
3
+ git clone https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_hard_dummy
4
+ git clone https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_easy
5
+ git clone https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_medium
6
+ git clone https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_hard
configs/model_config.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {}
configs/peft_config.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {}
configs/training.sh ADDED
@@ -0,0 +1,82 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ CUDA_VISIBLE_DEVICES=0 python /home/augusto/symbo_repos/seringuela/scripts/train_test.py \
2
+ --dataset_repo_id augustocsc/sintetico_natural \
3
+ --data_dir 500k \
4
+ --output_dir ./output \
5
+ --push_to_hub \
6
+ --hub_model_id augustocsc/Se124M500KInfPrompt_EOS \
7
+ --source_data_column i_prompt \
8
+ --report_to wandb \
9
+ --run_name Se124M500KInfPrompt_EOS \
10
+ --model_name_or_path gpt2 \
11
+ --bf16 \
12
+ --eval_strategy steps \
13
+ --num_train_epochs 3 \
14
+ --per_device_train_batch_size 16 \
15
+ --per_device_eval_batch_size 16 \
16
+ --gradient_accumulation_steps 4 \
17
+ --dataloader_num_workers 8 \
18
+ --learning_rate 5e-5 \
19
+ --warmup_ratio 0.03 \
20
+ --weight_decay 0.01 \
21
+ --max_grad_norm 1.0 \
22
+ --lr_scheduler_type cosine \
23
+ --optim adamw_torch_fused \
24
+ --logging_steps 20 \
25
+ --eval_steps 500 \
26
+ --save_steps 1000 \
27
+ --save_total_limit 3 \
28
+
29
+
30
+ # CUDA_VISIBLE_DEVICES=1 python /home/augusto/symbo_repos/seringuela/scripts/train_test.py \
31
+ # --dataset_repo_id augustocsc/sintetico_final \
32
+ # --data_dir 100k \
33
+ # --output_dir ./output \
34
+ # --push_to_hub \
35
+ # --hub_model_id augustocsc/Se124M100KInfPrompt_NT \
36
+ # --source_data_column i_prompt \
37
+ # --report_to wandb \
38
+ # --run_name Se124M100KInfPrompt_NT \
39
+ # --bf16 \
40
+ # --eval_strategy steps \
41
+ # --num_train_epochs 3 \
42
+ # --per_device_train_batch_size 16 \
43
+ # --per_device_eval_batch_size 16 \
44
+ # --gradient_accumulation_steps 2 \
45
+ # --dataloader_num_workers 8 \
46
+ # --learning_rate 2e-5 \
47
+ # --warmup_ratio 0.03 \
48
+ # --weight_decay 0.01 \
49
+ # --max_grad_norm 1.0 \
50
+ # --lr_scheduler_type cosine \
51
+ # --optim adamw_torch_fused \
52
+ # --logging_steps 20 \
53
+ # --eval_steps 500 \
54
+ # --save_steps 1000 \
55
+ # --save_total_limit 3
56
+
57
+ # CUDA_VISIBLE_DEVICES=0 python /home/augusto/symbo_repos/seringuela/scripts/train_test.py \
58
+ # --dataset_repo_id augustocsc/sintetico_final \
59
+ # --data_dir 100k \
60
+ # --output_dir ./output \
61
+ # --push_to_hub \
62
+ # --hub_model_id augustocsc/Se124M100KInfPrompt_WT \
63
+ # --source_data_column i_prompt \
64
+ # --report_to wandb \
65
+ # --run_name Se124M100KInfPrompt_WT \
66
+ # --bf16 \
67
+ # --eval_strategy steps \
68
+ # --num_train_epochs 3 \
69
+ # --per_device_train_batch_size 16 \
70
+ # --per_device_eval_batch_size 16 \
71
+ # --gradient_accumulation_steps 2 \
72
+ # --dataloader_num_workers 8 \
73
+ # --learning_rate 2e-5 \
74
+ # --warmup_ratio 0.03 \
75
+ # --weight_decay 0.01 \
76
+ # --max_grad_norm 1.0 \
77
+ # --lr_scheduler_type cosine \
78
+ # --optim adamw_torch_fused \
79
+ # --logging_steps 20 \
80
+ # --eval_steps 500 \
81
+ # --save_steps 1000 \
82
+ # --save_total_limit 3
configs/training_args.json ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "output_dir": "./output",
3
+ "overwrite_output_dir": true,
4
+ "num_train_epochs": 50,
5
+ "per_device_train_batch_size": 8,
6
+ "gradient_accumulation_steps": 1,
7
+ "learning_rate": 5e-5,
8
+ "weight_decay": 0.01,
9
+ "warmup_steps": 0,
10
+ "fp16": true,
11
+ "seed": 42,
12
+ "per_device_eval_batch_size": 8,
13
+ "eval_strategy": "epoch",
14
+ "metric_for_best_model": "eval_loss",
15
+ "greater_is_better": false,
16
+ "eval_steps": null,
17
+ "load_best_model_at_end": true,
18
+ "save_strategy": "epoch",
19
+ "save_steps": null,
20
+ "save_total_limit": 2,
21
+ "logging_dir": "./output/logs",
22
+ "logging_steps": 100,
23
+ "report_to": "wandb",
24
+ "run_name": "Se124M100K",
25
+ "push_to_hub": true,
26
+ "hub_model_id": "augustocsc/Se124M100K",
27
+ "hub_token": null
28
+
29
+ }
configs/training_large.json ADDED
@@ -0,0 +1,65 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_config": {
3
+ "model_name_or_path": "gpt2-large",
4
+ "model_size": "774M",
5
+ "description": "GPT-2 Large - 774M parameters"
6
+ },
7
+ "training_args": {
8
+ "num_train_epochs": 2,
9
+ "per_device_train_batch_size": 4,
10
+ "per_device_eval_batch_size": 4,
11
+ "gradient_accumulation_steps": 16,
12
+ "effective_batch_size": 64,
13
+ "learning_rate": 2e-5,
14
+ "weight_decay": 0.01,
15
+ "warmup_steps": 100,
16
+ "max_grad_norm": 1.0,
17
+ "lr_scheduler_type": "cosine",
18
+ "fp16": true,
19
+ "seed": 42,
20
+ "block_size": 128
21
+ },
22
+ "evaluation_args": {
23
+ "eval_strategy": "epoch",
24
+ "eval_steps": null,
25
+ "metric_for_best_model": "eval_loss",
26
+ "greater_is_better": false,
27
+ "load_best_model_at_end": true
28
+ },
29
+ "save_args": {
30
+ "save_strategy": "epoch",
31
+ "save_steps": null,
32
+ "save_total_limit": 2
33
+ },
34
+ "logging_args": {
35
+ "logging_dir": "./output/logs",
36
+ "logging_steps": 50,
37
+ "report_to": "wandb"
38
+ },
39
+ "lora_config": {
40
+ "r": 8,
41
+ "lora_alpha": 32,
42
+ "target_modules": ["c_attn", "c_proj"],
43
+ "lora_dropout": 0.05,
44
+ "bias": "none",
45
+ "task_type": "CAUSAL_LM"
46
+ },
47
+ "dataset_config": {
48
+ "dataset_repo_id": "augustocsc/sintetico_natural",
49
+ "data_dir": "700K",
50
+ "data_columns": {
51
+ "infix": "i_prompt_n",
52
+ "prefix": "p_prompt_n"
53
+ }
54
+ },
55
+ "hub_config": {
56
+ "push_to_hub": true,
57
+ "hub_model_id_template": "augustocsc/Se774M_700K_{format}",
58
+ "formats": ["infix", "prefix"]
59
+ },
60
+ "estimated_time": {
61
+ "per_epoch_minutes": 180,
62
+ "total_hours": 6,
63
+ "notes": "Estimated for AWS g5.xlarge with A10G GPU. May need gradient checkpointing for memory optimization."
64
+ }
65
+ }
configs/training_medium.json ADDED
@@ -0,0 +1,65 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_config": {
3
+ "model_name_or_path": "gpt2-medium",
4
+ "model_size": "355M",
5
+ "description": "GPT-2 Medium - 355M parameters"
6
+ },
7
+ "training_args": {
8
+ "num_train_epochs": 2,
9
+ "per_device_train_batch_size": 8,
10
+ "per_device_eval_batch_size": 8,
11
+ "gradient_accumulation_steps": 8,
12
+ "effective_batch_size": 64,
13
+ "learning_rate": 3e-5,
14
+ "weight_decay": 0.01,
15
+ "warmup_steps": 100,
16
+ "max_grad_norm": 1.0,
17
+ "lr_scheduler_type": "cosine",
18
+ "fp16": true,
19
+ "seed": 42,
20
+ "block_size": 128
21
+ },
22
+ "evaluation_args": {
23
+ "eval_strategy": "epoch",
24
+ "eval_steps": null,
25
+ "metric_for_best_model": "eval_loss",
26
+ "greater_is_better": false,
27
+ "load_best_model_at_end": true
28
+ },
29
+ "save_args": {
30
+ "save_strategy": "epoch",
31
+ "save_steps": null,
32
+ "save_total_limit": 2
33
+ },
34
+ "logging_args": {
35
+ "logging_dir": "./output/logs",
36
+ "logging_steps": 50,
37
+ "report_to": "wandb"
38
+ },
39
+ "lora_config": {
40
+ "r": 8,
41
+ "lora_alpha": 32,
42
+ "target_modules": ["c_attn", "c_proj"],
43
+ "lora_dropout": 0.05,
44
+ "bias": "none",
45
+ "task_type": "CAUSAL_LM"
46
+ },
47
+ "dataset_config": {
48
+ "dataset_repo_id": "augustocsc/sintetico_natural",
49
+ "data_dir": "700K",
50
+ "data_columns": {
51
+ "infix": "i_prompt_n",
52
+ "prefix": "p_prompt_n"
53
+ }
54
+ },
55
+ "hub_config": {
56
+ "push_to_hub": true,
57
+ "hub_model_id_template": "augustocsc/Se355M_700K_{format}",
58
+ "formats": ["infix", "prefix"]
59
+ },
60
+ "estimated_time": {
61
+ "per_epoch_minutes": 90,
62
+ "total_hours": 3,
63
+ "notes": "Estimated for AWS g5.xlarge with A10G GPU"
64
+ }
65
+ }
configs/training_small.json ADDED
@@ -0,0 +1,65 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_config": {
3
+ "model_name_or_path": "gpt2",
4
+ "model_size": "124M",
5
+ "description": "GPT-2 Small - 124M parameters"
6
+ },
7
+ "training_args": {
8
+ "num_train_epochs": 3,
9
+ "per_device_train_batch_size": 16,
10
+ "per_device_eval_batch_size": 16,
11
+ "gradient_accumulation_steps": 4,
12
+ "effective_batch_size": 64,
13
+ "learning_rate": 5e-5,
14
+ "weight_decay": 0.01,
15
+ "warmup_steps": 100,
16
+ "max_grad_norm": 1.0,
17
+ "lr_scheduler_type": "cosine",
18
+ "fp16": true,
19
+ "seed": 42,
20
+ "block_size": 128
21
+ },
22
+ "evaluation_args": {
23
+ "eval_strategy": "epoch",
24
+ "eval_steps": null,
25
+ "metric_for_best_model": "eval_loss",
26
+ "greater_is_better": false,
27
+ "load_best_model_at_end": true
28
+ },
29
+ "save_args": {
30
+ "save_strategy": "epoch",
31
+ "save_steps": null,
32
+ "save_total_limit": 2
33
+ },
34
+ "logging_args": {
35
+ "logging_dir": "./output/logs",
36
+ "logging_steps": 50,
37
+ "report_to": "wandb"
38
+ },
39
+ "lora_config": {
40
+ "r": 8,
41
+ "lora_alpha": 32,
42
+ "target_modules": ["c_attn", "c_proj"],
43
+ "lora_dropout": 0.05,
44
+ "bias": "none",
45
+ "task_type": "CAUSAL_LM"
46
+ },
47
+ "dataset_config": {
48
+ "dataset_repo_id": "augustocsc/sintetico_natural",
49
+ "data_dir": "700K",
50
+ "data_columns": {
51
+ "infix": "i_prompt_n",
52
+ "prefix": "p_prompt_n"
53
+ }
54
+ },
55
+ "hub_config": {
56
+ "push_to_hub": true,
57
+ "hub_model_id_template": "augustocsc/Se124M_700K_{format}",
58
+ "formats": ["infix", "prefix"]
59
+ },
60
+ "estimated_time": {
61
+ "per_epoch_minutes": 40,
62
+ "total_hours": 2,
63
+ "notes": "Estimated for AWS g5.xlarge with A10G GPU"
64
+ }
65
+ }
configs/training_v3.json ADDED
@@ -0,0 +1,78 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_config": {
3
+ "model_name_or_path": "gpt2",
4
+ "model_size": "124M",
5
+ "description": "GPT-2 Small (124M) - v3 with proper end markers"
6
+ },
7
+ "training_args": {
8
+ "num_train_epochs": 3,
9
+ "per_device_train_batch_size": 8,
10
+ "per_device_eval_batch_size": 8,
11
+ "gradient_accumulation_steps": 4,
12
+ "effective_batch_size": 32,
13
+ "learning_rate": 5e-5,
14
+ "weight_decay": 0.01,
15
+ "warmup_steps": 100,
16
+ "max_grad_norm": 1.0,
17
+ "lr_scheduler_type": "cosine",
18
+ "fp16": true,
19
+ "seed": 42,
20
+ "block_size": 128
21
+ },
22
+ "evaluation_args": {
23
+ "eval_strategy": "epoch",
24
+ "eval_steps": null,
25
+ "metric_for_best_model": "eval_loss",
26
+ "greater_is_better": false,
27
+ "load_best_model_at_end": true
28
+ },
29
+ "save_args": {
30
+ "save_strategy": "epoch",
31
+ "save_steps": null,
32
+ "save_total_limit": 2
33
+ },
34
+ "logging_args": {
35
+ "logging_dir": "./output/logs",
36
+ "logging_steps": 50,
37
+ "report_to": "wandb"
38
+ },
39
+ "lora_config": {
40
+ "r": 8,
41
+ "lora_alpha": 32,
42
+ "target_modules": ["c_attn"],
43
+ "lora_dropout": 0.05,
44
+ "bias": "none",
45
+ "task_type": "CAUSAL_LM"
46
+ },
47
+ "dataset_config": {
48
+ "use_local_csvs": true,
49
+ "train_file": "./data/processed/700K_fixed/train_700K.csv",
50
+ "validation_file": "./data/processed/700K_fixed/validation_700K.csv",
51
+ "test_file": "./data/processed/700K_fixed/test_700K.csv",
52
+ "data_column": "text"
53
+ },
54
+ "hub_config": {
55
+ "push_to_hub": true,
56
+ "hub_model_id": "augustocsc/Se124M_700K_infix_v3"
57
+ },
58
+ "special_tokens": {
59
+ "start_token": "<|startofex|>",
60
+ "end_token": "<|endofex|>",
61
+ "notes": "End token configured as EOS token for proper stopping"
62
+ },
63
+ "estimated_time": {
64
+ "per_epoch_minutes": 45,
65
+ "total_hours": 2.25,
66
+ "notes": "Estimated for AWS g5.xlarge with A10G GPU, GPT-2 Small, 3 epochs"
67
+ },
68
+ "version_info": {
69
+ "model_version": "v3",
70
+ "improvements": [
71
+ "Training data includes proper <|endofex|> markers",
72
+ "100% validation rate on prepared dataset",
73
+ "Addresses v1 non-stopping issue and v2 garbage generation",
74
+ "Uses local CSVs with validated end markers"
75
+ ],
76
+ "training_date": "2026-02-01"
77
+ }
78
+ }
create_structure.sh ADDED
@@ -0,0 +1,171 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/bash
2
+
3
+ echo "Criando estrutura de pastas para o projeto de fine-tuning..."
4
+
5
+ # Diretórios Principais
6
+ mkdir -p data/raw
7
+ mkdir -p data/processed
8
+ mkdir -p scripts
9
+ mkdir -p configs
10
+ mkdir -p output
11
+ mkdir -p notebooks
12
+
13
+ echo "Diretórios criados."
14
+
15
+ # Arquivos Placeholder e de Configuração Inicial
16
+ touch data/raw/.gitkeep # Mantém a pasta no Git mesmo vazia
17
+ touch data/processed/.gitkeep # Mantém a pasta no Git mesmo vazia
18
+
19
+ echo "# Script para pré-processar dados (raw -> processed)" > scripts/preprocess_data.py
20
+ echo "# Script principal de treinamento (usa Trainer)" > scripts/train.py
21
+ echo "# Script para avaliação customizada" > scripts/evaluate.py
22
+ echo "# Script para geração de texto com modelo treinado" > scripts/generate.py
23
+
24
+ echo "{}" > configs/training_args.json # Placeholder para argumentos do Trainer
25
+ echo "{}" > configs/peft_config.json # Placeholder para config PEFT (se usar)
26
+ echo "{}" > configs/model_config.json # Placeholder para config do modelo base
27
+
28
+ touch notebooks/01_data_exploration.ipynb
29
+ touch notebooks/.gitkeep # Mantém a pasta no Git mesmo vazia
30
+
31
+ touch requirements.txt
32
+
33
+ echo "Arquivos placeholder criados."
34
+
35
+ # Conteúdo Inicial para .gitignore
36
+ echo "Gerando .gitignore..."
37
+ cat << EOF > .gitignore
38
+ # Byte-compiled / optimized / DLL files
39
+ __pycache__/
40
+ *.py[cod]
41
+ *$py.class
42
+
43
+ # C extensions
44
+ *.so
45
+
46
+ # Distribution / packaging
47
+ .Python
48
+ build/
49
+ develop-eggs/
50
+ dist/
51
+ downloads/
52
+ eggs/
53
+ .eggs/
54
+ lib/
55
+ lib64/
56
+ parts/
57
+ sdist/
58
+ var/
59
+ wheels/
60
+ pip-wheel-metadata/
61
+ share/python-wheels/
62
+ *.egg-info/
63
+ .installed.cfg
64
+ *.egg
65
+ MANIFEST
66
+
67
+ # PyInstaller
68
+ # Usually these files are written by a python script from a template
69
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
70
+ *.manifest
71
+ *.spec
72
+
73
+ # Installer logs
74
+ pip-log.txt
75
+ pip-delete-this-directory.txt
76
+
77
+ # Unit test / coverage reports
78
+ htmlcov/
79
+ .tox/
80
+ .nox/
81
+ .coverage
82
+ .coverage.*
83
+ .cache
84
+ nosetests.xml
85
+ coverage.xml
86
+ *.cover
87
+ *.py,cover
88
+ .hypothesis/
89
+ .pytest_cache/
90
+
91
+ # Environments
92
+ .env
93
+ .venv
94
+ venv/
95
+ ENV/
96
+ env/
97
+ env.bak/
98
+ venv.bak/
99
+
100
+ # IDEs / Editors
101
+ .idea/
102
+ .vscode/
103
+ *.suo
104
+ *.ntvs*
105
+ *.njsproj
106
+ *.sln
107
+ *.sw?
108
+
109
+ # Jupyter Notebook
110
+ .ipynb_checkpoints
111
+
112
+ # Output folder (geralmente grande demais para Git)
113
+ output/*
114
+ !output/.gitkeep # Não ignore um .gitkeep se precisar manter a pasta
115
+
116
+ # Dados (podem ser grandes, usar Git LFS ou armazenar fora se necessário)
117
+ data/raw/*
118
+ data/processed/*
119
+ !data/raw/.gitkeep
120
+ !data/processed/.gitkeep
121
+
122
+ # OS generated files
123
+ .DS_Store
124
+ .DS_Store?
125
+ ._*
126
+ .Spotlight-V100
127
+ .Trashes
128
+ ehthumbs.db
129
+ Thumbs.db
130
+ EOF
131
+
132
+ # Conteúdo Inicial para README.md (será preenchido com o texto gerado abaixo)
133
+ echo "Gerando README.md inicial..."
134
+ echo "# Nome do Seu Projeto de Fine-Tuning" > README.md
135
+ echo "" >> README.md
136
+ echo "(Breve descrição do objetivo do projeto)" >> README.md
137
+ echo "" >> README.md
138
+ echo "## Estrutura de Pastas" >> README.md
139
+ echo "" >> README.md
140
+ echo "**(COPIE E COLE A EXPLICAÇÃO DA ESTRUTURA GERADA NA PRÓXIMA SEÇÃO AQUI)**" >> README.md
141
+ echo "" >> README.md
142
+ echo "## Como Usar" >> README.md
143
+ echo "" >> README.md
144
+ echo "1. **Setup:** Crie um ambiente virtual e instale as dependências:" >> README.md
145
+ echo " \`\`\`bash" >> README.md
146
+ echo " python -m venv venv" >> README.md
147
+ echo " source venv/bin/activate # Linux/macOS" >> README.md
148
+ echo " # venv\\Scripts\\activate # Windows" >> README.md
149
+ echo " pip install -r requirements.txt" >> README.md
150
+ echo " \`\`\`" >> README.md
151
+ echo "2. **Dados:** Coloque seus dados brutos em \`data/raw/\` e execute (ou crie) o script \`scripts/preprocess_data.py\` para gerar os arquivos em \`data/processed/\`." >> README.md
152
+ echo "3. **Configuração:** Ajuste os arquivos em \`configs/\` (argumentos de treino, modelo base, PEFT se aplicável)." >> README.md
153
+ echo "4. **Treinamento:** Execute o script principal:" >> README.md
154
+ echo " \`\`\`bash" >> README.md
155
+ echo " python scripts/train.py --args_config configs/training_args.json --model_config configs/model_config.json" >> README.md
156
+ echo " \`\`\`" >> README.md
157
+ echo " *(Adapte os argumentos conforme necessário)*" >> README.md
158
+ echo "" >> README.md
159
+ echo "## Dependências" >> README.md
160
+ echo "" >> README.md
161
+ echo "As dependências Python estão listadas no arquivo \`requirements.txt\`." >> README.md
162
+
163
+ chmod +x create_structure.sh
164
+
165
+ echo "--------------------------------------------------"
166
+ echo "Estrutura criada com sucesso!"
167
+ echo "Para usar:"
168
+ echo "1. Torne o script executável: chmod +x create_structure.sh"
169
+ echo "2. Execute o script: ./create_structure.sh"
170
+ echo "3. Copie a explicação da estrutura (gerada na resposta anterior) para dentro do README.md onde indicado."
171
+ echo "--------------------------------------------------"
notebooks/.gitkeep ADDED
File without changes
notebooks/01_data_exploration.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebooks/02_finetuning_avaliation.ipynb ADDED
@@ -0,0 +1,568 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "id": "5c6de955",
7
+ "metadata": {},
8
+ "outputs": [],
9
+ "source": [
10
+ "import re\n",
11
+ "import json\n",
12
+ "from collections import Counter, defaultdict\n",
13
+ "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
14
+ "from peft import PeftModel\n",
15
+ "import sympy as sp\n",
16
+ "\n",
17
+ "# Configuration\n",
18
+ "TOKENIZER_REPO = \"augustocsc/Se124M500KInfPrompt_EOS\"\n",
19
+ "LORA_REPO = \"augustocsc/Se124M500KInfPrompt_EOS\"\n",
20
+ "BASE_MODEL = \"gpt2\"\n",
21
+ "PROMPT = \"\"\"\n",
22
+ "vars: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10\n",
23
+ "oper: *, **, +, -, /\n",
24
+ "cons: C\n",
25
+ "expr:\"\"\"\n",
26
+ "GENERATE_BATCH = 10\n",
27
+ "REPEAT_TIMES = 1\n",
28
+ "OUTPUT_EXPR_FILE = \"generated_expressions.json\"\n",
29
+ "OUTPUT_ANALYSIS_FILE = \"analysis_results.json\"\n"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "code",
34
+ "execution_count": 2,
35
+ "id": "e0b08244",
36
+ "metadata": {},
37
+ "outputs": [
38
+ {
39
+ "name": "stdout",
40
+ "output_type": "stream",
41
+ "text": [
42
+ "Loading tokenizer and model...\n"
43
+ ]
44
+ },
45
+ {
46
+ "data": {
47
+ "application/vnd.jupyter.widget-view+json": {
48
+ "model_id": "8db99632228d4e599ab477a436f16c3e",
49
+ "version_major": 2,
50
+ "version_minor": 0
51
+ },
52
+ "text/plain": [
53
+ "tokenizer_config.json: 0%| | 0.00/1.09k [00:00<?, ?B/s]"
54
+ ]
55
+ },
56
+ "metadata": {},
57
+ "output_type": "display_data"
58
+ },
59
+ {
60
+ "data": {
61
+ "application/vnd.jupyter.widget-view+json": {
62
+ "model_id": "b94dc7909bb74e1fbf6a3a3922c88a8b",
63
+ "version_major": 2,
64
+ "version_minor": 0
65
+ },
66
+ "text/plain": [
67
+ "vocab.json: 0%| | 0.00/798k [00:00<?, ?B/s]"
68
+ ]
69
+ },
70
+ "metadata": {},
71
+ "output_type": "display_data"
72
+ },
73
+ {
74
+ "data": {
75
+ "application/vnd.jupyter.widget-view+json": {
76
+ "model_id": "a20a9a7738224785b2693a2239aa079d",
77
+ "version_major": 2,
78
+ "version_minor": 0
79
+ },
80
+ "text/plain": [
81
+ "merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]"
82
+ ]
83
+ },
84
+ "metadata": {},
85
+ "output_type": "display_data"
86
+ },
87
+ {
88
+ "data": {
89
+ "application/vnd.jupyter.widget-view+json": {
90
+ "model_id": "00c1fb75426b49c295e79ff1f7b92d4f",
91
+ "version_major": 2,
92
+ "version_minor": 0
93
+ },
94
+ "text/plain": [
95
+ "tokenizer.json: 0%| | 0.00/3.56M [00:00<?, ?B/s]"
96
+ ]
97
+ },
98
+ "metadata": {},
99
+ "output_type": "display_data"
100
+ },
101
+ {
102
+ "data": {
103
+ "application/vnd.jupyter.widget-view+json": {
104
+ "model_id": "1af7b45de60248b192b9c47b63a1c4c4",
105
+ "version_major": 2,
106
+ "version_minor": 0
107
+ },
108
+ "text/plain": [
109
+ "added_tokens.json: 0%| | 0.00/67.0 [00:00<?, ?B/s]"
110
+ ]
111
+ },
112
+ "metadata": {},
113
+ "output_type": "display_data"
114
+ },
115
+ {
116
+ "data": {
117
+ "application/vnd.jupyter.widget-view+json": {
118
+ "model_id": "b9cb7f8429d0486198aa7870aa45b381",
119
+ "version_major": 2,
120
+ "version_minor": 0
121
+ },
122
+ "text/plain": [
123
+ "special_tokens_map.json: 0%| | 0.00/562 [00:00<?, ?B/s]"
124
+ ]
125
+ },
126
+ "metadata": {},
127
+ "output_type": "display_data"
128
+ },
129
+ {
130
+ "name": "stderr",
131
+ "output_type": "stream",
132
+ "text": [
133
+ "The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`\n"
134
+ ]
135
+ },
136
+ {
137
+ "data": {
138
+ "application/vnd.jupyter.widget-view+json": {
139
+ "model_id": "12fa3ee725a04c57a19737d18f70f340",
140
+ "version_major": 2,
141
+ "version_minor": 0
142
+ },
143
+ "text/plain": [
144
+ "adapter_config.json: 0%| | 0.00/744 [00:00<?, ?B/s]"
145
+ ]
146
+ },
147
+ "metadata": {},
148
+ "output_type": "display_data"
149
+ },
150
+ {
151
+ "data": {
152
+ "application/vnd.jupyter.widget-view+json": {
153
+ "model_id": "3b8be3854aee41d8a28bf0810a860b42",
154
+ "version_major": 2,
155
+ "version_minor": 0
156
+ },
157
+ "text/plain": [
158
+ "adapter_model.safetensors: 0%| | 0.00/310M [00:00<?, ?B/s]"
159
+ ]
160
+ },
161
+ "metadata": {},
162
+ "output_type": "display_data"
163
+ }
164
+ ],
165
+ "source": [
166
+ "# Load tokenizer and model with LoRA adapter\n",
167
+ "print(\"Loading tokenizer and model...\")\n",
168
+ "tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_REPO)\n",
169
+ "model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)\n",
170
+ "model.resize_token_embeddings(len(tokenizer))\n",
171
+ "model = PeftModel.from_pretrained(model, LORA_REPO)\n",
172
+ "\n",
173
+ "\n",
174
+ "model.eval()\n",
175
+ "\n",
176
+ "# Regex to extract expressions between tokens\n",
177
+ "pattern = re.compile(r\"<startofex>(.*?)<endofex>\", re.DOTALL)"
178
+ ]
179
+ },
180
+ {
181
+ "cell_type": "code",
182
+ "execution_count": null,
183
+ "id": "c76ee26f",
184
+ "metadata": {},
185
+ "outputs": [
186
+ {
187
+ "name": "stderr",
188
+ "output_type": "stream",
189
+ "text": [
190
+ "Some weights of the model checkpoint at augustocsc/Se124M100KInfPrompt_EOS_Merged were not used when initializing GPT2LMHeadModel: ['v_head.summary.bias', 'v_head.summary.weight']\n",
191
+ "- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
192
+ "- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
193
+ "Some weights of the model checkpoint at augustocsc/Se124M100KInfPrompt_EOS_Merged were not used when initializing GPT2LMHeadModel: ['v_head.summary.bias', 'v_head.summary.weight']\n",
194
+ "- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
195
+ "- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
196
+ ]
197
+ }
198
+ ],
199
+ "source": [
200
+ "from transformers import AutoModelForCausalLM\n",
201
+ "from peft import PeftModel\n",
202
+ "from trl import AutoModelForCausalLMWithValueHead\n",
203
+ "\n",
204
+ "# Load the base model\n",
205
+ "base_model = AutoModelForCausalLM.from_pretrained(\"gpt2\")\n",
206
+ "\n",
207
+ "# Load the LoRA weights (trained checkpoint)\n",
208
+ "peft_model = PeftModel.from_pretrained(base_model, \"augustocsc/Se124M100KInfPrompt_EOS\")\n"
209
+ ]
210
+ },
211
+ {
212
+ "cell_type": "code",
213
+ "execution_count": 3,
214
+ "id": "ffc6e072",
215
+ "metadata": {},
216
+ "outputs": [
217
+ {
218
+ "name": "stderr",
219
+ "output_type": "stream",
220
+ "text": [
221
+ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
222
+ ]
223
+ },
224
+ {
225
+ "name": "stdout",
226
+ "output_type": "stream",
227
+ "text": [
228
+ "Run 1/1: Generating 10 samples...\n"
229
+ ]
230
+ }
231
+ ],
232
+ "source": [
233
+ "all_expressions = []\n",
234
+ "\n",
235
+ "# Generation loop\n",
236
+ "for run in range(REPEAT_TIMES):\n",
237
+ " print(f\"Run {run+1}/{REPEAT_TIMES}: Generating {GENERATE_BATCH} samples...\")\n",
238
+ " inputs = tokenizer([PROMPT] * GENERATE_BATCH, return_tensors=\"pt\", padding=True)\n",
239
+ " outputs = model.generate(\n",
240
+ " **inputs,\n",
241
+ " max_new_tokens=75,\n",
242
+ " do_sample=True,\n",
243
+ " top_p=0.9,\n",
244
+ " top_k=50,\n",
245
+ " temperature=0.7,\n",
246
+ " )\n"
247
+ ]
248
+ },
249
+ {
250
+ "cell_type": "code",
251
+ "execution_count": 71,
252
+ "id": "be3b4bcb",
253
+ "metadata": {},
254
+ "outputs": [
255
+ {
256
+ "name": "stdout",
257
+ "output_type": "stream",
258
+ "text": [
259
+ "Generated expressions:\n",
260
+ " a_1, b_2, c_1, c_2, c_3, c_4, c_5, c_6, c_7, c_8, c_9, c_10, c_\n",
261
+ "\n",
262
+ "\n",
263
+ "A function that evaluates to a string, and returns a string.\n",
264
+ "\n",
265
+ "A string can be any character, and can be either a double, a string, a double, a singleton, a string with multiple elements, a string with\n",
266
+ "\n",
267
+ "\n",
268
+ "vars: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10\n",
269
+ "\n",
270
+ "op: *,\n",
271
+ " *, +, +, -, /\n",
272
+ "cons: C\n",
273
+ "\n",
274
+ "expr: *, +, +, -, /\n",
275
+ "\n",
276
+ "cons: C\n",
277
+ "\n",
278
+ "expr: *, +, +, -, /\n",
279
+ "\n",
280
+ "cons: C\n",
281
+ "\n",
282
+ " *\n",
283
+ "\n",
284
+ "cons: c\n",
285
+ "\n",
286
+ "type: Int\n",
287
+ "\n",
288
+ "value: *\n",
289
+ "\n",
290
+ "value: *\n",
291
+ "\n",
292
+ "value: *\n",
293
+ "\n",
294
+ "value: *\n",
295
+ "\n",
296
+ "value: *\n",
297
+ "\n",
298
+ "value: *\n",
299
+ "\n",
300
+ "value: *\n",
301
+ "\n",
302
+ "value:\n",
303
+ " *, **, -, /\n",
304
+ "oper: *, **, +, -, /\n",
305
+ "oper: *, **, +, -, /\n",
306
+ "\n",
307
+ "op: [\n",
308
+ "\n",
309
+ "op: [\n",
310
+ "\n",
311
+ "op: [\n",
312
+ "\n",
313
+ "op:\n",
314
+ "\n",
315
+ "\n",
316
+ "*, **, +, -, /\n",
317
+ "\n",
318
+ "vars: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_\n",
319
+ " *, **, *, **, *, **, *, *, **, *, **, *, **, *, **, *, **, *, *, **, *, **, *, *\n",
320
+ "\n",
321
+ "oper\n",
322
+ " *, *, *, *, *, *, *\n",
323
+ "cons: C\n",
324
+ "\n",
325
+ "expr: *, *, *, *, *, *, *\n",
326
+ "\n",
327
+ "cons: C\n",
328
+ "\n",
329
+ "expr: *, *, *, *\n",
330
+ "\n",
331
+ "\n",
332
+ "vars: *, +, *, *, *, *, *, *, *, *, *, *, *, *, *, *\n",
333
+ "\n",
334
+ "oper: *, **, +, *, *,\n"
335
+ ]
336
+ }
337
+ ],
338
+ "source": [
339
+ "# remove the prompt from the generated text and print the decoded text\n",
340
+ "generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)\n",
341
+ "generated_text = [text.replace(PROMPT, \"\") for text in generated_text]\n",
342
+ "all_expressions.extend(generated_text)\n",
343
+ "print(\"Generated expressions:\")\n",
344
+ "for text in generated_text:\n",
345
+ " print(text)\n",
346
+ " "
347
+ ]
348
+ },
349
+ {
350
+ "cell_type": "code",
351
+ "execution_count": 4,
352
+ "id": "5d8e569f",
353
+ "metadata": {},
354
+ "outputs": [
355
+ {
356
+ "name": "stdout",
357
+ "output_type": "stream",
358
+ "text": [
359
+ "Valid Expressions:\n",
360
+ "x_6 - x_3 + C*x_6 + x_7 + C\n",
361
+ "x_2*(x_9 + x_2)**C\n",
362
+ "x_9 + x_2 + C*x_7**C\n",
363
+ "x_9**C + x_1**C + x_2\n",
364
+ "C*x_1 + x_8 + x_1 + C\n",
365
+ "x_1**C*(x_9 + x_4**C + C)\n",
366
+ "x_2*(x_9 - C)**C/x_7\n",
367
+ "x_1*(x_8 - C)/(x_1 + x_2)\n",
368
+ "x_8**C*(x_2 + x_7)\n",
369
+ "x_1**C + x_2**C + x_9\n",
370
+ "\n",
371
+ "Invalid Expressions:\n"
372
+ ]
373
+ }
374
+ ],
375
+ "source": [
376
+ "valid_expressions = []\n",
377
+ "invalid_expressions = []\n",
378
+ "\n",
379
+ "for out in outputs:\n",
380
+ " text = tokenizer.decode(out)\n",
381
+ " expr = text.split(\"expr: \")[1].split(\"<|endoftext|>\")[0].strip() # Extract the expression between \"expr: \" and <|endoftext|>\n",
382
+ " try:\n",
383
+ " sympy_expr = sp.sympify(expr, evaluate=False) # Try to parse the expression with sympy\n",
384
+ " valid_expressions.append(expr)\n",
385
+ " except Exception as e:\n",
386
+ " invalid_expressions.append(expr)\n",
387
+ "\n",
388
+ "# Print valid expressions\n",
389
+ "print(\"Valid Expressions:\")\n",
390
+ "for expr in valid_expressions:\n",
391
+ " print(expr)\n",
392
+ "\n",
393
+ "# Print invalid expressions\n",
394
+ "print(\"\\nInvalid Expressions:\")\n",
395
+ "for expr in invalid_expressions:\n",
396
+ " print(expr)"
397
+ ]
398
+ },
399
+ {
400
+ "cell_type": "code",
401
+ "execution_count": null,
402
+ "id": "d05f1edd",
403
+ "metadata": {},
404
+ "outputs": [
405
+ {
406
+ "ename": "AttributeError",
407
+ "evalue": "'AutoModelForCausalLMWithValueHead' object has no attribute 'generation_config'",
408
+ "output_type": "error",
409
+ "traceback": [
410
+ "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
411
+ "\u001b[31mAttributeError\u001b[39m Traceback (most recent call last)",
412
+ "\u001b[36mFile \u001b[39m\u001b[32m~/symbo_repos/seringuela/.seriguela/lib/python3.11/site-packages/peft/peft_model.py:793\u001b[39m, in \u001b[36mPeftModel.__getattr__\u001b[39m\u001b[34m(self, name)\u001b[39m\n\u001b[32m 792\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m793\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[34;43m__getattr__\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mname\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;66;03m# defer to nn.Module's logic\u001b[39;00m\n\u001b[32m 794\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mAttributeError\u001b[39;00m:\n",
413
+ "\u001b[36mFile \u001b[39m\u001b[32m~/symbo_repos/seringuela/.seriguela/lib/python3.11/site-packages/torch/nn/modules/module.py:1928\u001b[39m, in \u001b[36mModule.__getattr__\u001b[39m\u001b[34m(self, name)\u001b[39m\n\u001b[32m 1927\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m modules[name]\n\u001b[32m-> \u001b[39m\u001b[32m1928\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mAttributeError\u001b[39;00m(\n\u001b[32m 1929\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mtype\u001b[39m(\u001b[38;5;28mself\u001b[39m).\u001b[34m__name__\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m'\u001b[39m\u001b[33m object has no attribute \u001b[39m\u001b[33m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m'\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1930\u001b[39m )\n",
414
+ "\u001b[31mAttributeError\u001b[39m: 'PeftModelForCausalLM' object has no attribute 'generation_config'",
415
+ "\nDuring handling of the above exception, another exception occurred:\n",
416
+ "\u001b[31mAttributeError\u001b[39m Traceback (most recent call last)",
417
+ "\u001b[36mFile \u001b[39m\u001b[32m~/symbo_repos/seringuela/.seriguela/lib/python3.11/site-packages/peft/tuners/lora/model.py:359\u001b[39m, in \u001b[36mLoraModel.__getattr__\u001b[39m\u001b[34m(self, name)\u001b[39m\n\u001b[32m 358\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m359\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[34;43m__getattr__\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mname\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;66;03m# defer to nn.Module's logic\u001b[39;00m\n\u001b[32m 360\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mAttributeError\u001b[39;00m:\n",
418
+ "\u001b[36mFile \u001b[39m\u001b[32m~/symbo_repos/seringuela/.seriguela/lib/python3.11/site-packages/torch/nn/modules/module.py:1928\u001b[39m, in \u001b[36mModule.__getattr__\u001b[39m\u001b[34m(self, name)\u001b[39m\n\u001b[32m 1927\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m modules[name]\n\u001b[32m-> \u001b[39m\u001b[32m1928\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mAttributeError\u001b[39;00m(\n\u001b[32m 1929\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mtype\u001b[39m(\u001b[38;5;28mself\u001b[39m).\u001b[34m__name__\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m'\u001b[39m\u001b[33m object has no attribute \u001b[39m\u001b[33m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m'\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1930\u001b[39m )\n",
419
+ "\u001b[31mAttributeError\u001b[39m: 'LoraModel' object has no attribute 'generation_config'",
420
+ "\nDuring handling of the above exception, another exception occurred:\n",
421
+ "\u001b[31mAttributeError\u001b[39m Traceback (most recent call last)",
422
+ "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[41]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;66;03m# Generate with beam search and early stopping\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m output = \u001b[43mmodel\u001b[49m\u001b[43m.\u001b[49m\u001b[43mgenerate\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 3\u001b[39m \u001b[43m \u001b[49m\u001b[43minputs\u001b[49m\u001b[43m.\u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m=\u001b[49m\u001b[43minputs\u001b[49m\u001b[43m.\u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 5\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;66;43;03m#max_length=100,\u001b[39;49;00m\n\u001b[32m 6\u001b[39m \u001b[43m \u001b[49m\u001b[43mnum_beams\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m5\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# Enable beam search\u001b[39;49;00m\n\u001b[32m 7\u001b[39m \u001b[43m \u001b[49m\u001b[43mearly_stopping\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# Stop when all beams hit EOS\u001b[39;49;00m\n\u001b[32m 8\u001b[39m \n\u001b[32m 9\u001b[39m \u001b[43m)\u001b[49m\n\u001b[32m 11\u001b[39m decoded_output = tokenizer.decode(output[\u001b[32m0\u001b[39m], skip_special_tokens=\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[32m 12\u001b[39m \u001b[38;5;28mprint\u001b[39m(decoded_output)\n",
423
+ "\u001b[36mFile \u001b[39m\u001b[32m~/symbo_repos/seringuela/.seriguela/lib/python3.11/site-packages/peft/peft_model.py:1867\u001b[39m, in \u001b[36mPeftModelForCausalLM.generate\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1865\u001b[39m \u001b[38;5;28mself\u001b[39m.base_model.prepare_inputs_for_generation = \u001b[38;5;28mself\u001b[39m.prepare_inputs_for_generation\n\u001b[32m 1866\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(\u001b[38;5;28mself\u001b[39m.base_model, \u001b[33m\"\u001b[39m\u001b[33mmodel\u001b[39m\u001b[33m\"\u001b[39m):\n\u001b[32m-> \u001b[39m\u001b[32m1867\u001b[39m \u001b[38;5;28mself\u001b[39m.base_model.model.generation_config = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mgeneration_config\u001b[49m\n\u001b[32m 1868\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 1869\u001b[39m \u001b[38;5;28mself\u001b[39m.base_model.generation_config = \u001b[38;5;28mself\u001b[39m.generation_config\n",
424
+ "\u001b[36mFile \u001b[39m\u001b[32m~/symbo_repos/seringuela/.seriguela/lib/python3.11/site-packages/peft/peft_model.py:797\u001b[39m, in \u001b[36mPeftModel.__getattr__\u001b[39m\u001b[34m(self, name)\u001b[39m\n\u001b[32m 795\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m name == \u001b[33m\"\u001b[39m\u001b[33mbase_model\u001b[39m\u001b[33m\"\u001b[39m: \u001b[38;5;66;03m# see #1892: prevent infinite recursion if class is not initialized\u001b[39;00m\n\u001b[32m 796\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m797\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mgetattr\u001b[39m(\u001b[38;5;28mself\u001b[39m.base_model, name)\n",
425
+ "\u001b[36mFile \u001b[39m\u001b[32m~/symbo_repos/seringuela/.seriguela/lib/python3.11/site-packages/peft/tuners/lora/model.py:363\u001b[39m, in \u001b[36mLoraModel.__getattr__\u001b[39m\u001b[34m(self, name)\u001b[39m\n\u001b[32m 361\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m name == \u001b[33m\"\u001b[39m\u001b[33mmodel\u001b[39m\u001b[33m\"\u001b[39m: \u001b[38;5;66;03m# see #1892: prevent infinite recursion if class is not initialized\u001b[39;00m\n\u001b[32m 362\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m363\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mgetattr\u001b[39m(\u001b[38;5;28mself\u001b[39m.model, name)\n",
426
+ "\u001b[36mFile \u001b[39m\u001b[32m~/symbo_repos/seringuela/.seriguela/lib/python3.11/site-packages/torch/nn/modules/module.py:1928\u001b[39m, in \u001b[36mModule.__getattr__\u001b[39m\u001b[34m(self, name)\u001b[39m\n\u001b[32m 1926\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m name \u001b[38;5;129;01min\u001b[39;00m modules:\n\u001b[32m 1927\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m modules[name]\n\u001b[32m-> \u001b[39m\u001b[32m1928\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mAttributeError\u001b[39;00m(\n\u001b[32m 1929\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mtype\u001b[39m(\u001b[38;5;28mself\u001b[39m).\u001b[34m__name__\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m'\u001b[39m\u001b[33m object has no attribute \u001b[39m\u001b[33m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mname\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m'\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1930\u001b[39m )\n",
427
+ "\u001b[31mAttributeError\u001b[39m: 'AutoModelForCausalLMWithValueHead' object has no attribute 'generation_config'"
428
+ ]
429
+ }
430
+ ],
431
+ "source": [
432
+ "# Generate with beam search and early stopping\n",
433
+ "output = model.generate(\n",
434
+ " inputs.input_ids,\n",
435
+ " attention_mask=inputs.attention_mask,\n",
436
+ " #max_length=100,\n",
437
+ " num_beams=5, # Enable beam search\n",
438
+ " early_stopping=True, # Stop when all beams hit EOS\n",
439
+ "\n",
440
+ ")\n",
441
+ "\n",
442
+ "decoded_output = tokenizer.decode(output[0], skip_special_tokens=False)\n",
443
+ "print(decoded_output)"
444
+ ]
445
+ },
446
+ {
447
+ "cell_type": "code",
448
+ "execution_count": null,
449
+ "id": "7a9ade5c",
450
+ "metadata": {},
451
+ "outputs": [],
452
+ "source": [
453
+ "\n",
454
+ "# Save raw expressions\n",
455
+ "with open(OUTPUT_EXPR_FILE, 'w') as f:\n",
456
+ " json.dump(all_expressions, f, indent=2)\n",
457
+ "print(f\"Saved {len(all_expressions)} expressions to {OUTPUT_EXPR_FILE}\")\n",
458
+ "\n",
459
+ "# Analysis\n",
460
+ "analysis = {\n",
461
+ " 'total_expressions': len(all_expressions),\n",
462
+ " 'syntactic_semantic': {\n",
463
+ " 'valid_equations': 0,\n",
464
+ " 'parse_errors': defaultdict(int),\n",
465
+ " },\n",
466
+ " 'diversity_redundancy': {},\n",
467
+ " 'statistical_distributions': {\n",
468
+ " 'variable_freq': Counter(),\n",
469
+ " 'operator_freq': Counter(),\n",
470
+ " 'avg_operators_per_eq': 0.0,\n",
471
+ " 'avg_variables_per_eq': 0.0,\n",
472
+ " }\n",
473
+ "}\n",
474
+ "\n",
475
+ "# Helper to compute tree depth\n",
476
+ "def tree_depth(expr):\n",
477
+ " if not expr.args:\n",
478
+ " return 1\n",
479
+ " return 1 + max(tree_depth(arg) for arg in expr.args)\n",
480
+ "\n",
481
+ "# Operators list\n",
482
+ "operators = ['+', '-', '*', '/', '^', 'log', 'exp', 'cos', 'sqrt', 'asin', 'sin', 'pow', 'tan', 'abs']\n",
483
+ "\n",
484
+ "depths = []\n",
485
+ "operator_counts = []\n",
486
+ "variable_counts = []\n",
487
+ "unique_set = set()\n",
488
+ "\n",
489
+ "for expr in all_expressions:\n",
490
+ " # Parse with sympy\n",
491
+ " try:\n",
492
+ " sympy_expr = sp.sympify(expr, evaluate=False)\n",
493
+ " analysis['syntactic_semantic']['valid_equations'] += 1\n",
494
+ " depths.append(tree_depth(sympy_expr))\n",
495
+ " except Exception as e:\n",
496
+ " err_msg = str(e)\n",
497
+ " if 'could not parse' in err_msg:\n",
498
+ " analysis['syntactic_semantic']['parse_errors']['parse_failure'] += 1\n",
499
+ " else:\n",
500
+ " analysis['syntactic_semantic']['parse_errors'][err_msg] += 1\n",
501
+ " continue\n",
502
+ "\n",
503
+ " # Variables\n",
504
+ " vars_in_expr = [str(v) for v in sympy_expr.free_symbols]\n",
505
+ " for v in vars_in_expr:\n",
506
+ " analysis['statistical_distributions']['variable_freq'][v] += 1\n",
507
+ " variable_counts.append(len(vars_in_expr))\n",
508
+ "\n",
509
+ " # Operators\n",
510
+ " op_count = sum(expr.count(op) for op in operators)\n",
511
+ " analysis['statistical_distributions']['operator_freq'].update({op: expr.count(op) for op in operators})\n",
512
+ " operator_counts.append(op_count)\n",
513
+ "\n",
514
+ " # Diversity\n",
515
+ " unique_set.add(expr)\n",
516
+ "\n",
517
+ "# Populate diversity metrics\n",
518
+ "total = analysis['total_expressions']\n",
519
+ "unique_count = len(unique_set)\n",
520
+ "analysis['diversity_redundancy'] = {\n",
521
+ " 'unique_expressions': unique_count,\n",
522
+ " 'unique_proportion': unique_count / total if total else 0,\n",
523
+ " 'duplicate_counts': {expr: cnt for expr, cnt in Counter(all_expressions).items() if cnt > 1},\n",
524
+ " 'structural_diversity': {\n",
525
+ " 'avg_tree_depth': sum(depths) / len(depths) if depths else 0,\n",
526
+ " 'min_tree_depth': min(depths) if depths else 0,\n",
527
+ " 'max_tree_depth': max(depths) if depths else 0,\n",
528
+ " }\n",
529
+ "}\n",
530
+ "\n",
531
+ "# Statistical distributions averages\n",
532
+ "analysis['statistical_distributions']['avg_operators_per_eq'] = sum(operator_counts) / len(operator_counts) if operator_counts else 0\n",
533
+ "analysis['statistical_distributions']['avg_variables_per_eq'] = sum(variable_counts) / len(variable_counts) if variable_counts else 0\n",
534
+ "\n",
535
+ "# Convert Counters to dicts for JSON serialization\n",
536
+ "analysis['statistical_distributions']['variable_freq'] = dict(analysis['statistical_distributions']['variable_freq'])\n",
537
+ "analysis['statistical_distributions']['operator_freq'] = dict(analysis['statistical_distributions']['operator_freq'])\n",
538
+ "analysis['syntactic_semantic']['parse_errors'] = dict(analysis['syntactic_semantic']['parse_errors'])\n",
539
+ "\n",
540
+ "# Save analysis results\n",
541
+ "with open(OUTPUT_ANALYSIS_FILE, 'w') as f:\n",
542
+ " json.dump(analysis, f, indent=2)\n",
543
+ "print(f\"Saved analysis results to {OUTPUT_ANALYSIS_FILE}\")\n"
544
+ ]
545
+ }
546
+ ],
547
+ "metadata": {
548
+ "kernelspec": {
549
+ "display_name": ".seriguela",
550
+ "language": "python",
551
+ "name": "python3"
552
+ },
553
+ "language_info": {
554
+ "codemirror_mode": {
555
+ "name": "ipython",
556
+ "version": 3
557
+ },
558
+ "file_extension": ".py",
559
+ "mimetype": "text/x-python",
560
+ "name": "python",
561
+ "nbconvert_exporter": "python",
562
+ "pygments_lexer": "ipython3",
563
+ "version": "3.11.4"
564
+ }
565
+ },
566
+ "nbformat": 4,
567
+ "nbformat_minor": 5
568
+ }
notebooks/03_RL.ipynb ADDED
@@ -0,0 +1,338 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "id": "59d6d70b",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stderr",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "Some weights of the model checkpoint at augustocsc/Se124M100KInfPrompt_EOS_Merged were not used when initializing GPT2LMHeadModel: ['v_head.summary.bias', 'v_head.summary.weight']\n",
14
+ "- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
15
+ "- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
16
+ "WARNING:root:A <class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'> model is loaded from 'augustocsc/Se124M100KInfPrompt_EOS_Merged', and no v_head weight is found. This IS expected if you are not resuming PPO training.\n",
17
+ "Some weights of the model checkpoint at augustocsc/Se124M100KInfPrompt_EOS_Merged were not used when initializing GPT2LMHeadModel: ['v_head.summary.bias', 'v_head.summary.weight']\n",
18
+ "- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
19
+ "- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
20
+ "WARNING:root:A <class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'> model is loaded from 'augustocsc/Se124M100KInfPrompt_EOS_Merged', and no v_head weight is found. This IS expected if you are not resuming PPO training.\n"
21
+ ]
22
+ }
23
+ ],
24
+ "source": [
25
+ "import os\n",
26
+ "import torch\n",
27
+ "import numpy as np\n",
28
+ "from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead\n",
29
+ "from transformers import AutoTokenizer\n",
30
+ "from datasets import Dataset\n",
31
+ "from peft import PeftModel, AutoPeftModelForCausalLM\n",
32
+ "import sys\n",
33
+ "from transformers import AutoModelForCausalLM\n",
34
+ "\n",
35
+ "# Add path for Expression class\n",
36
+ "sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../classes')))\n",
37
+ "from expression import Expression\n",
38
+ "from dataset import RegressionDataset\n",
39
+ "\n",
40
+ "# === Reward function ===\n",
41
+ "def compute_reward(expression_str: str) -> float:\n",
42
+ " try:\n",
43
+ " expr = Expression(expression_str)\n",
44
+ " \n",
45
+ " # Check if the expression is valid and can be evaluated\n",
46
+ " if expr.is_valid_on_dataset(X):\n",
47
+ " score = expr.fit_constants(X, y)\n",
48
+ " return max(0.1 , (float(score) if np.isfinite(score) else -1.0))\n",
49
+ " else:\n",
50
+ " #print(f\"Expressão inválida: {expression_str}\")\n",
51
+ " return -1.0\n",
52
+ " except Exception as e:\n",
53
+ " #print(f\"Erro ao avaliar expressão: {expression_str} - {e}\")\n",
54
+ " return -1.0\n",
55
+ "\n",
56
+ "# === Helper to extract expression ===\n",
57
+ "def extract_expression(response: str) -> str:\n",
58
+ " return response.split(\"expr: \")[1].split(\"<|endoftext|>\")[0].strip()\n",
59
+ "\n",
60
+ "# === Load Data ===\n",
61
+ "#reg = RegressionDataset('../data/evaluate/srsd-feynman_hard/train', 'feynman-bonus.12.txt', delimiter=' ')\n",
62
+ "reg = RegressionDataset('../data/evaluate/srsd-feynman_easy/train', 'feynman-i.18.16.txt', delimiter=' ')\n",
63
+ "X, y = reg.get_numpy()\n",
64
+ "\n",
65
+ "# === Configs ===\n",
66
+ "BASE_MODEL = \"augustocsc/Se124M100KInfPrompt_EOS_Merged\"\n",
67
+ "LORA_REPO = \"augustocsc/Se124M100KInfPrompt_EOS_Merged\"\n",
68
+ "TOKENIZER_REPO = LORA_REPO\n",
69
+ "\n",
70
+ "# ppo_config = PPOConfig(\n",
71
+ "# #model_name=BASE_MODEL,\n",
72
+ "# learning_rate=1e-5,\n",
73
+ "# batch_size=32,\n",
74
+ "# mini_batch_size=8,\n",
75
+ "# gradient_accumulation_steps=1,\n",
76
+ "# )\n",
77
+ "\n",
78
+ "\n",
79
+ "model = AutoModelForCausalLMWithValueHead.from_pretrained(BASE_MODEL)\n",
80
+ "ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(BASE_MODEL)\n",
81
+ "tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_REPO)\n",
82
+ "\n",
83
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
84
+ "model = model.to(device)\n",
85
+ "ref_model = ref_model.to(device)\n",
86
+ "\n",
87
+ "\n",
88
+ "import os\n",
89
+ "os.environ[\"CUDA_LAUNCH_BLOCKING\"] = \"1\"\n",
90
+ "\n",
91
+ "\n",
92
+ "import numpy as np\n",
93
+ "\n",
94
+ "def get_safe_functions(X, functions=['log', 'sqrt', 'asin', 'tan', 'abs', 'exp', 'sin', 'cos']):\n",
95
+ " \"\"\"\n",
96
+ " Returns a list of functions from `functions` that are safe to use on all columns of X.\n",
97
+ "\n",
98
+ " Parameters:\n",
99
+ " X: np.ndarray of shape (n_samples, n_features)\n",
100
+ " functions: list of function names to check\n",
101
+ "\n",
102
+ " Returns:\n",
103
+ " List of function names that are safe to use given the data\n",
104
+ " \"\"\"\n",
105
+ " safe_functions = []\n",
106
+ "\n",
107
+ " for fn in functions:\n",
108
+ " if fn in {'sin', 'cos', 'exp', 'abs'}:\n",
109
+ " # These are defined for all real values\n",
110
+ " safe_functions.append(fn)\n",
111
+ "\n",
112
+ " elif fn == 'log':\n",
113
+ " if np.all(X > 0):\n",
114
+ " safe_functions.append(fn)\n",
115
+ "\n",
116
+ " elif fn == 'sqrt':\n",
117
+ " if np.all(X >= 0):\n",
118
+ " safe_functions.append(fn)\n",
119
+ "\n",
120
+ " elif fn == 'asin':\n",
121
+ " if np.all((X >= -1) & (X <= 1)):\n",
122
+ " safe_functions.append(fn)\n",
123
+ "\n",
124
+ " elif fn == 'tan':\n",
125
+ " # Check if cos(x) ≈ 0 anywhere → tan(x) will explode\n",
126
+ " # We use np.cos to simulate tan issues (e.g., near π/2, 3π/2, etc.)\n",
127
+ " cos_vals = np.cos(X)\n",
128
+ " if np.all(np.abs(cos_vals) > 1e-6): # adjustable tolerance\n",
129
+ " safe_functions.append(fn)\n",
130
+ "\n",
131
+ " # else skip unknown functions\n",
132
+ "\n",
133
+ " return safe_functions\n",
134
+ "\n",
135
+ "\n",
136
+ "safe_functions = get_safe_functions(X)\n"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "code",
141
+ "execution_count": 2,
142
+ "id": "9e2f618a",
143
+ "metadata": {},
144
+ "outputs": [
145
+ {
146
+ "name": "stdout",
147
+ "output_type": "stream",
148
+ "text": [
149
+ "log, sqrt, tan, abs, exp, sin, cos\n"
150
+ ]
151
+ }
152
+ ],
153
+ "source": [
154
+ "print(', '.join(safe_functions))"
155
+ ]
156
+ },
157
+ {
158
+ "cell_type": "code",
159
+ "execution_count": 1,
160
+ "id": "dd922d70",
161
+ "metadata": {},
162
+ "outputs": [
163
+ {
164
+ "ename": "NameError",
165
+ "evalue": "name 'PPOConfig' is not defined",
166
+ "output_type": "error",
167
+ "traceback": [
168
+ "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
169
+ "\u001b[31mNameError\u001b[39m Traceback (most recent call last)",
170
+ "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 3\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mtqdm\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m tqdm\n\u001b[32m----> \u001b[39m\u001b[32m3\u001b[39m ppo_config = \u001b[43mPPOConfig\u001b[49m(\n\u001b[32m 4\u001b[39m model_name=\u001b[38;5;28;01mNone\u001b[39;00m, \u001b[38;5;66;03m# definimos o modelo manualmente\u001b[39;00m\n\u001b[32m 5\u001b[39m learning_rate=\u001b[32m1e-5\u001b[39m,\n\u001b[32m 6\u001b[39m batch_size=\u001b[32m5\u001b[39m, \u001b[38;5;66;03m# total prompts/responses por step\u001b[39;00m\n\u001b[32m 7\u001b[39m mini_batch_size=\u001b[32m32\u001b[39m, \u001b[38;5;66;03m# 4 minibatches por batch\u001b[39;00m\n\u001b[32m 8\u001b[39m gradient_accumulation_steps=\u001b[32m1\u001b[39m,\n\u001b[32m 9\u001b[39m ppo_epochs=\u001b[32m4\u001b[39m, \u001b[38;5;66;03m# 4 passes por minibatch\u001b[39;00m\n\u001b[32m 10\u001b[39m log_with=\u001b[38;5;28;01mNone\u001b[39;00m, \u001b[38;5;66;03m# ou \"wandb\"\u001b[39;00m\n\u001b[32m 11\u001b[39m optimize_cuda_cache=\u001b[38;5;28;01mTrue\u001b[39;00m, \u001b[38;5;66;03m# 👍 melhora uso da A100\u001b[39;00m\n\u001b[32m 12\u001b[39m )\n\u001b[32m 14\u001b[39m \u001b[38;5;66;03m# === PPO Trainer ===\u001b[39;00m\n\u001b[32m 15\u001b[39m ppo_trainer = PPOTrainer(\n\u001b[32m 16\u001b[39m config=ppo_config,\n\u001b[32m 17\u001b[39m tokenizer=tokenizer,\n\u001b[32m (...)\u001b[39m\u001b[32m 20\u001b[39m \n\u001b[32m 21\u001b[39m )\n",
171
+ "\u001b[31mNameError\u001b[39m: name 'PPOConfig' is not defined"
172
+ ]
173
+ }
174
+ ],
175
+ "source": [
176
+ "from tqdm import tqdm\n",
177
+ "\n",
178
+ "ppo_config = PPOConfig(\n",
179
+ " model_name=None, # definimos o modelo manualmente\n",
180
+ " learning_rate=1e-5,\n",
181
+ " batch_size=5, # total prompts/responses por step\n",
182
+ " mini_batch_size=32, # 4 minibatches por batch\n",
183
+ " gradient_accumulation_steps=1,\n",
184
+ " ppo_epochs=4, # 4 passes por minibatch\n",
185
+ " log_with=None, # ou \"wandb\"\n",
186
+ " optimize_cuda_cache=True, # 👍 melhora uso da A100\n",
187
+ ")\n",
188
+ "\n",
189
+ "# === PPO Trainer ===\n",
190
+ "ppo_trainer = PPOTrainer(\n",
191
+ " config=ppo_config,\n",
192
+ " tokenizer=tokenizer,\n",
193
+ " model=model,\n",
194
+ " ref_model=ref_model,\n",
195
+ " \n",
196
+ ")\n",
197
+ "\n",
198
+ "# Define the prompt with the safe functions\n",
199
+ "PROMPT = f\"\"\"\n",
200
+ "vars: x_1, x_2, x_3\n",
201
+ "oper: * +, /, **, {', '.join(safe_functions)}\n",
202
+ "cons: C\n",
203
+ "expr:\"\"\"\n",
204
+ "\n",
205
+ "# === Dummy dataset ===\n",
206
+ "dummy_dataset = Dataset.from_dict({\n",
207
+ " \"prompt\": [PROMPT] * 5\n",
208
+ "})\n",
209
+ "\n",
210
+ "\n",
211
+ "# Get the device of the model\n",
212
+ "device = next(model.parameters()).device\n",
213
+ "\n",
214
+ "# === PPO Training Loop ===\n",
215
+ "# Tokenize the prompt and convert it to tensors\n",
216
+ "inputs = tokenizer([PROMPT] * ppo_config.batch_size, return_tensors=\"pt\", padding=True)\n",
217
+ "\n",
218
+ "# Move inputs to the same device as the model\n",
219
+ "inputs = {key: value.to(device) for key, value in inputs.items()}\n",
220
+ "\n",
221
+ "# Convert the batch tensor into a list of individual tensors\n",
222
+ "queries = [inputs[\"input_ids\"][i] for i in range(inputs[\"input_ids\"].size(0))]\n",
223
+ "all_rewards = []\n",
224
+ "all_responses = []\n",
225
+ "for epoch in tqdm(range(10), desc=\"Training Epochs\"): # adjust as needed\n",
226
+ " responses = []\n",
227
+ " constants = []\n",
228
+ " rewards = []\n",
229
+ " for i in tqdm(range(ppo_config.batch_size), desc=\"Batch Progress\", leave=False): # Nested progress bar\n",
230
+ " try:\n",
231
+ " input_ids = inputs[\"input_ids\"][i].unsqueeze(0)\n",
232
+ " attention_mask = inputs[\"attention_mask\"][i].unsqueeze(0)\n",
233
+ "\n",
234
+ " # === VALIDATION PATCH ===\n",
235
+ " assert torch.all((input_ids >= 0) & (input_ids < model.config.vocab_size)), \\\n",
236
+ " f\"Token inválido detectado: max={input_ids.max().item()}, vocab_size={model.config.vocab_size}\"\n",
237
+ "\n",
238
+ " # (opcional)\n",
239
+ " model.config.pad_token_id = tokenizer.pad_token_id\n",
240
+ " reward = -1\n",
241
+ " while reward < 0:\n",
242
+ " output = model.generate(\n",
243
+ " input_ids=input_ids,\n",
244
+ " attention_mask=attention_mask,\n",
245
+ " max_new_tokens=50,\n",
246
+ " do_sample=True,\n",
247
+ " top_k=50,\n",
248
+ " top_p=0.95,\n",
249
+ " temperature=0.7,\n",
250
+ " eos_token_id=tokenizer.eos_token_id,\n",
251
+ " pad_token_id=tokenizer.pad_token_id,\n",
252
+ " return_dict_in_generate=True,\n",
253
+ " output_scores=False\n",
254
+ " )\n",
255
+ " response_ids = output.sequences[0][input_ids.shape[1]:]\n",
256
+ " response = tokenizer.decode(response_ids, skip_special_tokens=True)\n",
257
+ "\n",
258
+ " reward = compute_reward(response)\n",
259
+ "\n",
260
+ "\n",
261
+ " except Exception as e:\n",
262
+ " print(f\"Error at index {i}: {e}\")\n",
263
+ " print(f\"Input IDs: {input_ids}\")\n",
264
+ " print(f\"Token range: min={input_ids.min()}, max={input_ids.max()}, vocab_size={model.config.vocab_size}\")\n",
265
+ " raise e\n",
266
+ "\n",
267
+ " responses.append(response)\n",
268
+ " rewards.append(reward)\n",
269
+ " all_responses.extend(responses)\n",
270
+ " all_rewards.extend(rewards)\n",
271
+ "\n",
272
+ " #if one reward is >= .9 break\n",
273
+ " if any(r >= 0.9 for r in rewards):\n",
274
+ " print(\"Reward >= 0.9 found, stopping training.\")\n",
275
+ " break\n",
276
+ " # Compute rewards with a progress bar\n",
277
+ " \n",
278
+ " import concurrent.futures\n",
279
+ "\n",
280
+ " # # Use process-based parallelism\n",
281
+ " # with concurrent.futures.ProcessPoolExecutor() as executor:\n",
282
+ " # rewards = list(tqdm(executor.map(compute_reward, responses), total=len(responses), desc=\"Computing Rewards\", leave=False))\n",
283
+ " \n",
284
+ " #rewards = [ compute_reward(response) for response in tqdm(responses, desc=\"Computing Rewards\", leave=False)]\n",
285
+ " \n",
286
+ "\n",
287
+ " # Convert rewards to a list of PyTorch tensors\n",
288
+ " rewards = [torch.tensor(reward, dtype=torch.float32, device=device) for reward in rewards]\n",
289
+ " \n",
290
+ " # Ensure responses are also tokenized and converted to tensors\n",
291
+ " responses = [tokenizer(response, return_tensors=\"pt\", padding=True)[\"input_ids\"].squeeze(0).to(device) for response in responses]\n",
292
+ "\n",
293
+ " # Pass the tokenized tensors to ppo_trainer.step()\n",
294
+ " ppo_trainer.step(queries, responses, rewards)\n",
295
+ "\n",
296
+ " # Log top expressions\n",
297
+ " top_k = 3\n",
298
+ " sorted_responses = sorted(zip(responses, rewards), key=lambda x: -x[1])\n",
299
+ " print(f\"\\nEpoch {epoch + 1} melhores expressões:\")\n",
300
+ " for i, (expr, score) in enumerate(sorted_responses[:top_k]):\n",
301
+ " print(f\"{i+1}. {tokenizer.decode(expr, skip_special_tokens=True)} -> R² = {score:.4f}\")\n",
302
+ " # Print average, median, and std of rewards\n",
303
+ " avg_reward = torch.mean(torch.stack(rewards)).item()\n",
304
+ " median_reward = torch.median(torch.stack(rewards)).item()\n",
305
+ " count_invalid = sum(1 for r in rewards if r == -1.0)\n",
306
+ " print(f\"Average Reward: {avg_reward:.4f}, Median Reward: {median_reward:.4f}, Invalid Count: {count_invalid}\")\n",
307
+ "\n"
308
+ ]
309
+ },
310
+ {
311
+ "cell_type": "markdown",
312
+ "id": "70a60613",
313
+ "metadata": {},
314
+ "source": []
315
+ }
316
+ ],
317
+ "metadata": {
318
+ "kernelspec": {
319
+ "display_name": ".seriguela",
320
+ "language": "python",
321
+ "name": "python3"
322
+ },
323
+ "language_info": {
324
+ "codemirror_mode": {
325
+ "name": "ipython",
326
+ "version": 3
327
+ },
328
+ "file_extension": ".py",
329
+ "mimetype": "text/x-python",
330
+ "name": "python",
331
+ "nbconvert_exporter": "python",
332
+ "pygments_lexer": "ipython3",
333
+ "version": "3.11.4"
334
+ }
335
+ },
336
+ "nbformat": 4,
337
+ "nbformat_minor": 5
338
+ }
notebooks/04_merging_model.ipynb ADDED
@@ -0,0 +1,206 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 4,
6
+ "id": "86149941",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "data": {
11
+ "text/plain": [
12
+ "('./modelo_final_para_ppo/tokenizer_config.json',\n",
13
+ " './modelo_final_para_ppo/special_tokens_map.json',\n",
14
+ " './modelo_final_para_ppo/vocab.json',\n",
15
+ " './modelo_final_para_ppo/merges.txt',\n",
16
+ " './modelo_final_para_ppo/added_tokens.json',\n",
17
+ " './modelo_final_para_ppo/tokenizer.json')"
18
+ ]
19
+ },
20
+ "execution_count": 4,
21
+ "metadata": {},
22
+ "output_type": "execute_result"
23
+ }
24
+ ],
25
+ "source": [
26
+ "# ===============================\n",
27
+ "# 🚀 LoRA Merge + ValueHead + Test\n",
28
+ "# ===============================\n",
29
+ "\n",
30
+ "\n",
31
+ "# ✅ Imports\n",
32
+ "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
33
+ "from peft import PeftModel\n",
34
+ "from trl import AutoModelForCausalLMWithValueHead\n",
35
+ "\n",
36
+ "# === Configurações ===\n",
37
+ "LORA_REPO = \"augustocsc/Se124M500KInfPrompt_EOS\"\n",
38
+ "BASE_MODEL = \"gpt2\"\n",
39
+ "OUTPUT_DIR = \"./modelo_final_para_ppo\"\n",
40
+ "MODEL_HUB = \"augustocsc/Se124M500KInfPrompt_EOS_Merged\"\n",
41
+ "# === Carregar o tokenizer correto ===\n",
42
+ "tokenizer = AutoTokenizer.from_pretrained(LORA_REPO)\n",
43
+ "tokenizer.pad_token = tokenizer.eos_token\n",
44
+ "\n",
45
+ "# === Carregar modelo base e ajustar os embeddings ===\n",
46
+ "base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)\n",
47
+ "base_model.resize_token_embeddings(len(tokenizer)) # Corrige shape para 50258\n",
48
+ "\n",
49
+ "# Load the PEFT model\n",
50
+ "peft_model = PeftModel.from_pretrained(base_model, LORA_REPO)\n",
51
+ "\n",
52
+ "# === Merge das LoRA weights (corretamente) ===\n",
53
+ "merged_model = peft_model.merge_and_unload()\n",
54
+ "\n",
55
+ "# === Adicionar Value Head ao modelo mergeado ===\n",
56
+ "model = AutoModelForCausalLMWithValueHead.from_pretrained(merged_model)\n",
57
+ "\n",
58
+ "# === Salvar modelo final para PPO ===\n",
59
+ "model.save_pretrained(OUTPUT_DIR)\n",
60
+ "tokenizer.save_pretrained(OUTPUT_DIR)\n"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": 5,
66
+ "id": "e921394e",
67
+ "metadata": {},
68
+ "outputs": [
69
+ {
70
+ "data": {
71
+ "application/vnd.jupyter.widget-view+json": {
72
+ "model_id": "0d38506bf99e418eb92d977159c9550b",
73
+ "version_major": 2,
74
+ "version_minor": 0
75
+ },
76
+ "text/plain": [
77
+ "model.safetensors: 0%| | 0.00/498M [00:00<?, ?B/s]"
78
+ ]
79
+ },
80
+ "metadata": {},
81
+ "output_type": "display_data"
82
+ },
83
+ {
84
+ "data": {
85
+ "application/vnd.jupyter.widget-view+json": {
86
+ "model_id": "45094d0c70c344acbb4f800968a9eb55",
87
+ "version_major": 2,
88
+ "version_minor": 0
89
+ },
90
+ "text/plain": [
91
+ "README.md: 0%| | 0.00/5.17k [00:00<?, ?B/s]"
92
+ ]
93
+ },
94
+ "metadata": {},
95
+ "output_type": "display_data"
96
+ },
97
+ {
98
+ "data": {
99
+ "text/plain": [
100
+ "CommitInfo(commit_url='https://huggingface.co/augustocsc/Se124M500KInfPrompt_EOS_Merged/commit/175b8a2750f170839ce04cb3dab9b1740fc83e92', commit_message='Upload tokenizer', commit_description='', oid='175b8a2750f170839ce04cb3dab9b1740fc83e92', pr_url=None, repo_url=RepoUrl('https://huggingface.co/augustocsc/Se124M500KInfPrompt_EOS_Merged', endpoint='https://huggingface.co', repo_type='model', repo_id='augustocsc/Se124M500KInfPrompt_EOS_Merged'), pr_revision=None, pr_num=None)"
101
+ ]
102
+ },
103
+ "execution_count": 5,
104
+ "metadata": {},
105
+ "output_type": "execute_result"
106
+ }
107
+ ],
108
+ "source": [
109
+ "model.push_to_hub(MODEL_HUB)\n",
110
+ "tokenizer.push_to_hub(MODEL_HUB)"
111
+ ]
112
+ },
113
+ {
114
+ "cell_type": "code",
115
+ "execution_count": 6,
116
+ "id": "34b6777d",
117
+ "metadata": {},
118
+ "outputs": [
119
+ {
120
+ "name": "stderr",
121
+ "output_type": "stream",
122
+ "text": [
123
+ "Some weights of the model checkpoint at augustocsc/Se124M100KInfPrompt_EOS_Merged were not used when initializing GPT2LMHeadModel: ['v_head.summary.bias', 'v_head.summary.weight']\n",
124
+ "- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
125
+ "- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
126
+ "WARNING:root:A <class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'> model is loaded from 'augustocsc/Se124M100KInfPrompt_EOS_Merged', and no v_head weight is found. This IS expected if you are not resuming PPO training.\n",
127
+ "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
128
+ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n",
129
+ "The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
130
+ ]
131
+ },
132
+ {
133
+ "name": "stdout",
134
+ "output_type": "stream",
135
+ "text": [
136
+ "🧪 Resposta do modelo:\n",
137
+ "\n",
138
+ "\n",
139
+ "vars: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10\n",
140
+ "oper: *, **, +, -, /\n",
141
+ "cons: C\n",
142
+ "expr: x_1 + x_2 + C*x_8 + C*x_5**C<|endoftext|>\n"
143
+ ]
144
+ }
145
+ ],
146
+ "source": [
147
+ "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
148
+ "from peft import PeftModel\n",
149
+ "from trl import AutoModelForCausalLMWithValueHead\n",
150
+ "# 🔁 Recarregar o modelo já mergeado + value head\n",
151
+ "from trl import AutoModelForCausalLMWithValueHead\n",
152
+ "MODEL_HUB = \"augustocsc/Se124M100KInfPrompt_EOS_Merged\"\n",
153
+ "#load model\n",
154
+ "model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL_HUB)\n",
155
+ "tokenizer = AutoTokenizer.from_pretrained(MODEL_HUB)\n",
156
+ "\n",
157
+ "# 🔁 Prompt de teste\n",
158
+ "PROMPT = \"\"\"\n",
159
+ "vars: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10\n",
160
+ "oper: *, **, +, -, /\n",
161
+ "cons: C\n",
162
+ "expr:\"\"\"\n",
163
+ "\n",
164
+ "device = model.pretrained_model.device # 👈 modelo base dentro do wrapper\n",
165
+ "input_ids = tokenizer(PROMPT, return_tensors=\"pt\").input_ids.to(device)\n",
166
+ "\n",
167
+ "# 🔮 Geração\n",
168
+ "gen_tokens = output = model.generate(\n",
169
+ " input_ids=input_ids,\n",
170
+ " max_new_tokens=50,\n",
171
+ " do_sample=True,\n",
172
+ " top_k=50,\n",
173
+ " top_p=0.95,\n",
174
+ " temperature=0.7,\n",
175
+ " \n",
176
+ " )\n",
177
+ "\n",
178
+ "# Mostrar resposta\n",
179
+ "response = tokenizer.decode(gen_tokens[0], skip_special_tokens=False)\n",
180
+ "print(\"🧪 Resposta do modelo:\\n\")\n",
181
+ "print(response)\n"
182
+ ]
183
+ }
184
+ ],
185
+ "metadata": {
186
+ "kernelspec": {
187
+ "display_name": ".seriguela",
188
+ "language": "python",
189
+ "name": "python3"
190
+ },
191
+ "language_info": {
192
+ "codemirror_mode": {
193
+ "name": "ipython",
194
+ "version": 3
195
+ },
196
+ "file_extension": ".py",
197
+ "mimetype": "text/x-python",
198
+ "name": "python",
199
+ "nbconvert_exporter": "python",
200
+ "pygments_lexer": "ipython3",
201
+ "version": "3.11.4"
202
+ }
203
+ },
204
+ "nbformat": 4,
205
+ "nbformat_minor": 5
206
+ }
out.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ Special constants found: [1]
2
+ Found 1 constants in the expression: tan(x_1**C + cos(x_1))
3
+ Testing expression validity with constants: [[1.0]]
4
+ Expression is valid on dataset.
5
+ Bounds for optimization: [(1, 3)]
6
+ Fitted constants: [1.0]
7
+ R2 score: -1.8028651105117532e-05
out2.txt ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ --extra-index-url https://download.pytorch.org/whl/cu121
2
+ # Core Hugging Face e Deep Learning
3
+ transformers==4.51.3
4
+ torch==2.5.1
5
+ torchvision==0.20.1
6
+ torchaudio==2.5.1
7
+
8
+ accelerate==1.6.0
9
+ python-dotenv==1.0.1
10
+ datasets==3.5.0
11
+ evaluate==0.4.1
12
+ huggingface-hub==0.30.2
13
+
14
+ # Parameter-Efficient Fine-Tuning (PEFT)
15
+ peft==0.15.1
16
+
17
+ # Avaliação e utilitários
18
+ scikit-learn==1.6.1
19
+ numpy==1.26.4
20
+ pandas==2.2.1
21
+ tqdm==4.67.1
22
+ sympy==1.13.1
23
+ regex==2024.11.6
24
+
25
+ # Logging e visualização
26
+ tensorboard==2.16.2
27
+ wandb>=0.24.1 # Versão atualizada para suportar novo formato de API key (wandb_v1_...)
28
+
29
+ # Fine-tuning avançado (SFT, DPO, etc.)
30
+ trl==0.16.1
scripts/aws/analyze_model.sh ADDED
@@ -0,0 +1,203 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/bash
2
+ # Automatic Model Analysis Script
3
+ # Runs evaluation and generation analysis after training
4
+
5
+ set -e
6
+
7
+ # Colors
8
+ GREEN='\033[0;32m'
9
+ YELLOW='\033[1;33m'
10
+ BLUE='\033[0;34m'
11
+ NC='\033[0m'
12
+
13
+ print_status() { echo -e "${GREEN}[INFO]${NC} $1"; }
14
+ print_header() { echo -e "\n${BLUE}========================================\n$1\n========================================${NC}\n"; }
15
+
16
+ # Parameters
17
+ MODEL_PATH="${1:-./output/Se124M_700K_infix}"
18
+ DATA_COLUMN="${2:-i_prompt_n}"
19
+ DATASET_REPO="augustocsc/sintetico_natural"
20
+ DATA_DIR="700K"
21
+ NUM_SAMPLES=500
22
+ NUM_GENERATIONS=100
23
+
24
+ # Directories
25
+ PROJECT_DIR="/home/ubuntu/seriguela"
26
+ OUTPUT_DIR="$HOME/analysis_results_$(date +%Y%m%d_%H%M%S)"
27
+ mkdir -p "$OUTPUT_DIR"
28
+
29
+ cd "$PROJECT_DIR"
30
+ source venv/bin/activate
31
+
32
+ print_header "Automatic Model Analysis"
33
+ print_status "Model: $MODEL_PATH"
34
+ print_status "Output: $OUTPUT_DIR"
35
+ echo ""
36
+
37
+ # =============================================================================
38
+ # 1. EVALUATE MODEL
39
+ # =============================================================================
40
+ print_header "Step 1: Model Evaluation"
41
+ print_status "Running evaluation on $NUM_SAMPLES samples..."
42
+
43
+ python scripts/evaluate.py \
44
+ --model_path "$MODEL_PATH" \
45
+ --dataset_repo_id "$DATASET_REPO" \
46
+ --data_dir "$DATA_DIR" \
47
+ --data_column "$DATA_COLUMN" \
48
+ --num_samples "$NUM_SAMPLES" \
49
+ --output_dir "$OUTPUT_DIR/evaluation" \
50
+ --temperature 0.7 \
51
+ --seed 42 \
52
+ 2>&1 | tee "$OUTPUT_DIR/evaluation.log"
53
+
54
+ if [ $? -eq 0 ]; then
55
+ print_status "✅ Evaluation completed"
56
+ else
57
+ print_status "⚠️ Evaluation had issues"
58
+ fi
59
+
60
+ # =============================================================================
61
+ # 2. GENERATE SAMPLES
62
+ # =============================================================================
63
+ print_header "Step 2: Sample Generation & Validation"
64
+ print_status "Generating $NUM_GENERATIONS samples with validation..."
65
+
66
+ python scripts/generate.py \
67
+ --model_path "$MODEL_PATH" \
68
+ --num_generations "$NUM_GENERATIONS" \
69
+ --validate \
70
+ --output_file "$OUTPUT_DIR/generations.txt" \
71
+ --temperature 0.8 \
72
+ --top_p 0.95 \
73
+ --seed 42 \
74
+ 2>&1 | tee "$OUTPUT_DIR/generation.log"
75
+
76
+ if [ $? -eq 0 ]; then
77
+ print_status "✅ Generation completed"
78
+ else
79
+ print_status "⚠️ Generation had issues"
80
+ fi
81
+
82
+ # =============================================================================
83
+ # 3. ANALYZE TRAINING LOGS
84
+ # =============================================================================
85
+ print_header "Step 3: Training Log Analysis"
86
+ print_status "Extracting training metrics..."
87
+
88
+ TRAINING_LOG="$HOME/training_success.log"
89
+
90
+ if [ -f "$TRAINING_LOG" ]; then
91
+ # Extract loss values
92
+ grep -E "'loss':|train_loss|eval_loss" "$TRAINING_LOG" > "$OUTPUT_DIR/training_metrics.txt" 2>/dev/null || true
93
+
94
+ # Extract epoch summaries
95
+ grep -E "epoch.*loss" "$TRAINING_LOG" | tail -20 > "$OUTPUT_DIR/epoch_summary.txt" 2>/dev/null || true
96
+
97
+ # Count total steps
98
+ TOTAL_STEPS=$(grep -E "[0-9]+/21882" "$TRAINING_LOG" | tail -1 | sed 's/.*\([0-9]\+\)\/21882.*/\1/' || echo "0")
99
+
100
+ print_status "Total training steps: $TOTAL_STEPS"
101
+ fi
102
+
103
+ # =============================================================================
104
+ # 4. CREATE SUMMARY REPORT
105
+ # =============================================================================
106
+ print_header "Step 4: Creating Analysis Report"
107
+
108
+ cat > "$OUTPUT_DIR/ANALYSIS_REPORT.md" << 'EOFREPORT'
109
+ # Training Analysis Report
110
+ **Generated:** $(date)
111
+
112
+ ## 📊 Model Information
113
+ - **Architecture:** GPT-2 Small (124M parameters)
114
+ - **Training Method:** LoRA (294K trainable parameters, 0.24%)
115
+ - **Dataset:** 700K samples (infix notation)
116
+ - **Training Duration:** $(grep "Training Duration:" $HOME/training_notification.txt 2>/dev/null | head -1 || echo "N/A")
117
+
118
+ ## 📈 Training Metrics
119
+
120
+ ### Loss Progression
121
+ ```
122
+ $(tail -20 $OUTPUT_DIR/training_metrics.txt 2>/dev/null || echo "No metrics available")
123
+ ```
124
+
125
+ ### Epoch Summary
126
+ ```
127
+ $(cat $OUTPUT_DIR/epoch_summary.txt 2>/dev/null || echo "No epoch data available")
128
+ ```
129
+
130
+ ## 🎯 Evaluation Results
131
+
132
+ ### Performance Metrics
133
+ ```
134
+ $(grep -E "Accuracy|Loss|Perplexity" $OUTPUT_DIR/evaluation.log 2>/dev/null || echo "Check evaluation.log for details")
135
+ ```
136
+
137
+ ### Sample Predictions
138
+ ```
139
+ $(head -50 $OUTPUT_DIR/evaluation/*.txt 2>/dev/null | head -20 || echo "No evaluation samples found")
140
+ ```
141
+
142
+ ## 🔮 Generation Quality
143
+
144
+ ### Validation Results
145
+ ```
146
+ $(grep -E "Valid:|Success|Failed" $OUTPUT_DIR/generation.log | head -20 || echo "Check generation.log")
147
+ ```
148
+
149
+ ### Sample Generations
150
+ ```
151
+ $(head -30 $OUTPUT_DIR/generations.txt 2>/dev/null || echo "No generations file found")
152
+ ```
153
+
154
+ ## 📁 Output Files
155
+ - Evaluation results: `evaluation/`
156
+ - Generated samples: `generations.txt`
157
+ - Full logs: `evaluation.log`, `generation.log`
158
+ - Training metrics: `training_metrics.txt`
159
+
160
+ ## 🔗 Resources
161
+ - **Wandb Dashboard:** https://wandb.ai/symbolic-gression/seriguela_700K_test
162
+ - **HuggingFace Model:** https://huggingface.co/augustocsc/Se124M_700K_infix
163
+ - **Analysis Directory:** $OUTPUT_DIR
164
+
165
+ ---
166
+ *Generated automatically by analyze_model.sh*
167
+ EOFREPORT
168
+
169
+ # Evaluate the report with actual values
170
+ eval "cat > \"$OUTPUT_DIR/ANALYSIS_REPORT.md\" << 'EOFREPORT'
171
+ $(cat "$OUTPUT_DIR/ANALYSIS_REPORT.md")
172
+ EOFREPORT"
173
+
174
+ print_status "Report created: $OUTPUT_DIR/ANALYSIS_REPORT.md"
175
+
176
+ # =============================================================================
177
+ # 5. FINAL SUMMARY
178
+ # =============================================================================
179
+ print_header "Analysis Complete!"
180
+ echo ""
181
+ print_status "All results saved to: $OUTPUT_DIR"
182
+ print_status "Main report: $OUTPUT_DIR/ANALYSIS_REPORT.md"
183
+ echo ""
184
+ print_status "Key files:"
185
+ echo " - Evaluation: $OUTPUT_DIR/evaluation.log"
186
+ echo " - Generation: $OUTPUT_DIR/generation.log"
187
+ echo " - Metrics: $OUTPUT_DIR/training_metrics.txt"
188
+ echo " - Report: $OUTPUT_DIR/ANALYSIS_REPORT.md"
189
+ echo ""
190
+ print_status "View the full report with:"
191
+ echo " cat $OUTPUT_DIR/ANALYSIS_REPORT.md"
192
+ echo ""
193
+
194
+ # Create a quick summary
195
+ EVAL_SUCCESS=$(grep -c "✅" "$OUTPUT_DIR/evaluation.log" 2>/dev/null || echo "0")
196
+ GEN_SUCCESS=$(grep -c "Valid" "$OUTPUT_DIR/generation.log" 2>/dev/null || echo "0")
197
+
198
+ print_header "Quick Summary"
199
+ echo "Evaluation samples processed: $NUM_SAMPLES"
200
+ echo "Generations created: $NUM_GENERATIONS"
201
+ echo "Check logs for detailed metrics and quality assessment"
202
+ echo ""
203
+ print_status "Done!"
scripts/aws/evaluate_models.sh ADDED
@@ -0,0 +1,62 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/bash
2
+ # Script to evaluate two models on AWS and compare results
3
+ # This script compares the original model (without end token) with the v2 model (with end token)
4
+ # Usage: bash scripts/aws/evaluate_models.sh
5
+
6
+ set -e
7
+
8
+ echo "=========================================="
9
+ echo "Model Comparison: v1 vs v2"
10
+ echo "=========================================="
11
+ echo "Model 1: augustocsc/Se124M_700K_infix (original)"
12
+ echo "Model 2: augustocsc/Se124M_700K_infix_v2 (with <|endofex|> token)"
13
+ echo "=========================================="
14
+ echo ""
15
+
16
+ # Activate virtual environment
17
+ source ~/seriguela/venv/bin/activate
18
+ cd ~/seriguela
19
+
20
+ # Set up logging
21
+ LOG_FILE="evaluation_$(date +%Y%m%d_%H%M%S).log"
22
+ exec > >(tee -a "$LOG_FILE") 2>&1
23
+
24
+ echo "[$(date)] Starting evaluation..."
25
+ echo ""
26
+
27
+ # Check GPU availability
28
+ echo "Checking GPU..."
29
+ if nvidia-smi &> /dev/null; then
30
+ nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader
31
+ echo ""
32
+ else
33
+ echo "WARNING: No GPU detected. Evaluation will be slow."
34
+ echo ""
35
+ fi
36
+
37
+ # Run comparison
38
+ echo "Running model comparison..."
39
+ echo "This will evaluate both models on 500 samples from the test set."
40
+ echo ""
41
+
42
+ python scripts/compare_models.py \
43
+ --model1 augustocsc/Se124M_700K_infix \
44
+ --model2 augustocsc/Se124M_700K_infix_v2 \
45
+ --model1_name "Original (no end token)" \
46
+ --model2_name "V2 (with <|endofex|>)" \
47
+ --num_samples 500 \
48
+ --dataset_repo_id augustocsc/sintetico_natural \
49
+ --data_dir 700K \
50
+ --data_column i_prompt_n \
51
+ --output_dir ./evaluation_results/comparison
52
+
53
+ echo ""
54
+ echo "=========================================="
55
+ echo "Evaluation Complete!"
56
+ echo "=========================================="
57
+ echo "Results saved to: ./evaluation_results/comparison"
58
+ echo "Log file: $LOG_FILE"
59
+ echo ""
60
+ echo "To view results:"
61
+ echo " cat ./evaluation_results/comparison/comparison_*.json | jq"
62
+ echo ""
scripts/aws/launch_evaluation_instance.sh ADDED
@@ -0,0 +1,299 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/bash
2
+ # Script to launch AWS instance for model evaluation
3
+ # Evaluates two models: original (Se124M_700K_infix) vs v2 (with end token)
4
+ # Usage: ./launch_evaluation_instance.sh [--hf-token TOKEN]
5
+
6
+ set -e
7
+
8
+ # Colors
9
+ GREEN='\033[0;32m'
10
+ YELLOW='\033[1;33m'
11
+ RED='\033[0;31m'
12
+ BLUE='\033[0;34m'
13
+ NC='\033[0m'
14
+
15
+ print_status() { echo -e "${GREEN}[INFO]${NC} $1"; }
16
+ print_warning() { echo -e "${YELLOW}[WARN]${NC} $1"; }
17
+ print_error() { echo -e "${RED}[ERROR]${NC} $1"; }
18
+
19
+ # Default configuration
20
+ INSTANCE_TYPE="g5.xlarge"
21
+ AMI_ID=""
22
+ KEY_NAME=""
23
+ SECURITY_GROUP=""
24
+ REGION=$(aws configure get region 2>/dev/null || echo "us-east-1")
25
+ VOLUME_SIZE=80
26
+ INSTANCE_NAME="seriguela-evaluation"
27
+ HF_TOKEN=""
28
+
29
+ # Parse arguments
30
+ while [[ $# -gt 0 ]]; do
31
+ case $1 in
32
+ --hf-token) HF_TOKEN="$2"; shift 2;;
33
+ --instance-type) INSTANCE_TYPE="$2"; shift 2;;
34
+ --key-name) KEY_NAME="$2"; shift 2;;
35
+ --help)
36
+ echo "Usage: $0 [OPTIONS]"
37
+ echo "Options:"
38
+ echo " --hf-token TOKEN HuggingFace token (optional, for accessing models)"
39
+ echo " --instance-type TYPE Instance type (default: g5.xlarge)"
40
+ echo " --key-name NAME SSH key pair name"
41
+ echo ""
42
+ echo "Example:"
43
+ echo " $0 --hf-token hf_xxx"
44
+ exit 0;;
45
+ *) echo "Unknown option: $1"; exit 1;;
46
+ esac
47
+ done
48
+
49
# Warn (but do not abort) when no HF token was supplied: the models under
# evaluation are public, so anonymous access still works.
if [ -z "$HF_TOKEN" ]; then
    print_warning "HuggingFace token not provided. Public models will still work."
    print_warning "Get your token from: https://huggingface.co/settings/tokens"
fi

print_status "Launching Seriguela evaluation instance..."

# --- Locate the newest Deep Learning AMI --------------------------------
print_status "Finding Deep Learning AMI..."
dl_ami_filter="Name=name,Values=*Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04)*"
AMI_ID=$(aws ec2 describe-images \
    --owners amazon \
    --filters "$dl_ami_filter" \
    --query "Images | sort_by(@, &CreationDate) | [-1].ImageId" \
    --output text)

case "$AMI_ID" in
    ""|None)
        print_error "Could not find Deep Learning AMI"
        exit 1
        ;;
esac
print_status "Using AMI: $AMI_ID"

# --- Pick an SSH key pair ------------------------------------------------
# Default to the account's first key pair when --key-name was not given.
if [ -z "$KEY_NAME" ]; then
    KEY_NAME=$(aws ec2 describe-key-pairs --query "KeyPairs[0].KeyName" --output text 2>/dev/null)
fi
case "$KEY_NAME" in
    ""|None)
        print_error "No SSH key pair found. Create one first or specify with --key-name"
        exit 1
        ;;
esac
print_status "Using key pair: $KEY_NAME"

# --- Find or create the project security group ---------------------------
SECURITY_GROUP=$(aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=seriguela-sg" \
    --query "SecurityGroups[0].GroupId" \
    --output text 2>/dev/null)

MY_IP=$(curl -s ifconfig.me)
if [ -z "$SECURITY_GROUP" ] || [ "$SECURITY_GROUP" = "None" ]; then
    print_status "Creating security group..."
    SECURITY_GROUP=$(aws ec2 create-security-group \
        --group-name seriguela-sg \
        --description "Security group for Seriguela" \
        --query "GroupId" --output text)
    # Allow SSH only from the caller's current public IP.
    aws ec2 authorize-security-group-ingress \
        --group-id "$SECURITY_GROUP" \
        --protocol tcp --port 22 \
        --cidr "${MY_IP}/32"
    print_status "Created security group with SSH access from $MY_IP"
else
    # Group already exists: (re-)authorize the caller's current IP and
    # ignore the "rule already exists" error.
    aws ec2 authorize-security-group-ingress \
        --group-id "$SECURITY_GROUP" \
        --protocol tcp --port 22 \
        --cidr "${MY_IP}/32" 2>/dev/null || true
fi
print_status "Using security group: $SECURITY_GROUP"
109
+
110
# Build the EC2 user-data script that provisions the evaluation box on first
# boot.  The 'USERDATA' delimiter is quoted, so nothing inside is expanded
# on the client; the literal $HF_TOKEN placeholder is substituted after the
# heredoc is captured (see the replacement at the bottom of this block).
USER_DATA=$(cat << 'USERDATA'
#!/bin/bash
exec > /var/log/user-data.log 2>&1
set -x

echo "=========================================="
echo "Seriguela Evaluation Instance Setup"
echo "Started: $(date)"
echo "=========================================="

# Wait for cloud-init to complete
cloud-init status --wait

# Setup as ubuntu user
sudo -u ubuntu bash << 'UBUNTUSETUP'
cd /home/ubuntu

echo "[1/7] Installing system dependencies..."
sudo apt-get update -qq
sudo apt-get install -y -qq python3-venv python3-pip git jq

echo "[2/7] Cloning repository..."
git clone https://github.com/augustocsc/seriguela.git
cd seriguela

echo "[3/7] Creating virtual environment..."
python3 -m venv venv
source venv/bin/activate

echo "[4/7] Upgrading pip..."
pip install --upgrade pip -q

echo "[5/7] Installing requirements..."
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121 -q

echo "[6/7] Testing setup..."
python3 << 'PYCHECK'
import sys
print("Testing imports...")
try:
    import transformers
    print(f"✅ transformers {transformers.__version__}")
    import torch
    print(f"✅ torch {torch.__version__}")
    print(f"✅ CUDA available: {torch.cuda.is_available()}")
    import peft
    print(f"✅ peft {peft.__version__}")
    import datasets
    print(f"✅ datasets {datasets.__version__}")
except ImportError as e:
    print(f"❌ Import failed: {e}")
    sys.exit(1)
PYCHECK

if [ $? -ne 0 ]; then
    echo "❌ Package validation failed"
    exit 1
fi

echo "[7/7] Checking GPU..."
if nvidia-smi &> /dev/null; then
    echo "✅ GPU detected:"
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
    echo "⚠️ No GPU detected (will be slower)"
fi

# Configure HuggingFace token if provided
if [ -n "$HF_TOKEN" ]; then
    echo "Configuring HuggingFace authentication..."
    mkdir -p ~/.cache/huggingface
    echo "$HF_TOKEN" > ~/.cache/huggingface/token
    echo "✅ HuggingFace token configured"
fi

# Make evaluation script executable
chmod +x ~/seriguela/scripts/aws/evaluate_models.sh

# Create completion marker
touch /home/ubuntu/.setup_complete

# Create info file
cat > /home/ubuntu/setup_info.txt << 'INFOFILE'
Seriguela Evaluation Instance - Ready!

Setup completed successfully:
- Python packages installed
- GPU available (if supported)
- Repository cloned and configured

To run the evaluation:
cd ~/seriguela
source venv/bin/activate
bash scripts/aws/evaluate_models.sh

This will compare:
- Model 1: augustocsc/Se124M_700K_infix (original)
- Model 2: augustocsc/Se124M_700K_infix_v2 (with <|endofex|> token)

On 500 test samples to evaluate if the ending token improves generation stopping.
INFOFILE

echo ""
echo "=========================================="
echo "✅ Setup Complete!"
echo "Finished: $(date)"
echo "=========================================="
cat ~/setup_info.txt

UBUNTUSETUP

echo "User-data script completed"
USERDATA
)

# Inject the real token on the client side: because 'USERDATA' was quoted
# above, every occurrence of the literal string $HF_TOKEN survives into
# USER_DATA and is replaced here before the instance is launched.
USER_DATA="${USER_DATA//\$HF_TOKEN/$HF_TOKEN}"
228
+
229
+ # Launch instance
230
+ print_status "Launching instance..."
231
+ INSTANCE_ID=$(aws ec2 run-instances \
232
+ --image-id "$AMI_ID" \
233
+ --instance-type "$INSTANCE_TYPE" \
234
+ --key-name "$KEY_NAME" \
235
+ --security-group-ids "$SECURITY_GROUP" \
236
+ --block-device-mappings "[{\"DeviceName\":\"/dev/sda1\",\"Ebs\":{\"VolumeSize\":$VOLUME_SIZE,\"VolumeType\":\"gp3\"}}]" \
237
+ --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$INSTANCE_NAME},{Key=Project,Value=seriguela},{Key=Purpose,Value=evaluation}]" \
238
+ --user-data "$USER_DATA" \
239
+ --query "Instances[0].InstanceId" \
240
+ --output text)
241
+
242
+ print_status "Instance launched: $INSTANCE_ID"
243
+
244
+ # Wait for instance to be running
245
+ print_status "Waiting for instance to start..."
246
+ aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"
247
+
248
+ # Get public IP
249
+ PUBLIC_IP=$(aws ec2 describe-instances \
250
+ --instance-ids "$INSTANCE_ID" \
251
+ --query "Reservations[0].Instances[0].PublicIpAddress" \
252
+ --output text)
253
+
254
+ echo ""
255
+ echo "=========================================="
256
+ echo -e "${GREEN}Instance Ready!${NC}"
257
+ echo "=========================================="
258
+ echo "Instance ID: $INSTANCE_ID"
259
+ echo "Public IP: $PUBLIC_IP"
260
+ echo "Key Pair: $KEY_NAME"
261
+ echo ""
262
+ echo -e "${BLUE}Connect with:${NC}"
263
+ echo " ssh -i ~/.ssh/${KEY_NAME}.pem ubuntu@${PUBLIC_IP}"
264
+ echo ""
265
+ echo -e "${BLUE}Check setup progress:${NC}"
266
+ echo " ssh -i ~/.ssh/${KEY_NAME}.pem ubuntu@${PUBLIC_IP} 'tail -f /var/log/user-data.log'"
267
+ echo ""
268
+ echo -e "${BLUE}Wait for setup to complete (takes ~5-10 minutes):${NC}"
269
+ echo " ssh -i ~/.ssh/${KEY_NAME}.pem ubuntu@${PUBLIC_IP} 'while [ ! -f ~/.setup_complete ]; do sleep 10; echo \"Setup in progress...\"; done; echo \"✅ Setup complete!\"; cat ~/setup_info.txt'"
270
+ echo ""
271
+ echo -e "${BLUE}Then run evaluation:${NC}"
272
+ echo " ssh -i ~/.ssh/${KEY_NAME}.pem ubuntu@${PUBLIC_IP} 'cd seriguela && source venv/bin/activate && bash scripts/aws/evaluate_models.sh'"
273
+ echo ""
274
+ echo -e "${BLUE}Or run in one command:${NC}"
275
+ echo " ssh -i ~/.ssh/${KEY_NAME}.pem ubuntu@${PUBLIC_IP} 'cd seriguela && source venv/bin/activate && nohup bash scripts/aws/evaluate_models.sh > evaluation.log 2>&1 &'"
276
+ echo ""
277
+ echo -e "${YELLOW}IMPORTANT:${NC} Remember to stop the instance when done:"
278
+ echo " aws ec2 stop-instances --instance-ids $INSTANCE_ID"
279
+ echo ""
280
+
281
+ # Save instance info
282
+ INFO_DIR="${HOME}/.seriguela"
283
+ mkdir -p "$INFO_DIR"
284
+ echo "$INSTANCE_ID" > "$INFO_DIR/last_evaluation_instance_id.txt"
285
+ echo "$PUBLIC_IP" > "$INFO_DIR/last_evaluation_instance_ip.txt"
286
+ echo "$KEY_NAME" > "$INFO_DIR/last_evaluation_key_name.txt"
287
+
288
+ cat > "$INFO_DIR/last_evaluation_instance_info.txt" << INFOEND
289
+ Instance ID: $INSTANCE_ID
290
+ Public IP: $PUBLIC_IP
291
+ Key Name: $KEY_NAME
292
+ Instance Type: $INSTANCE_TYPE
293
+ Region: $REGION
294
+ Launched: $(date)
295
+ Purpose: Model Evaluation (v1 vs v2)
296
+ INFOEND
297
+
298
+ print_status "Instance info saved to: $INFO_DIR/"
299
+ echo ""
scripts/aws/launch_instance.sh ADDED
@@ -0,0 +1,196 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Script to launch and configure AWS g5.xlarge instance for Seriguela training
# Usage: ./launch_instance.sh [--hf-token TOKEN] [--wandb-key KEY]

set -e

# ANSI colors used by the logging helpers below.
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'

# Tiny level-tagged, colorized logging helpers.
print_status()  { echo -e "${GREEN}[INFO]${NC} $1"; }
print_warning() { echo -e "${YELLOW}[WARN]${NC} $1"; }
print_error()   { echo -e "${RED}[ERROR]${NC} $1"; }

# Defaults; empty values are auto-detected (or created) further down.
INSTANCE_TYPE="g5.xlarge"
AMI_ID=""
KEY_NAME=""
SECURITY_GROUP=""
REGION=$(aws configure get region 2>/dev/null || echo "us-east-1")
VOLUME_SIZE=100
INSTANCE_NAME="seriguela-training"
HF_TOKEN=""
WANDB_KEY=""

# Command-line parsing.
while [ "$#" -gt 0 ]; do
    option="$1"
    case "$option" in
        --hf-token)
            HF_TOKEN="$2"
            shift 2
            ;;
        --wandb-key)
            WANDB_KEY="$2"
            shift 2
            ;;
        --instance-type)
            INSTANCE_TYPE="$2"
            shift 2
            ;;
        --key-name)
            KEY_NAME="$2"
            shift 2
            ;;
        --help)
            echo "Usage: $0 [OPTIONS]"
            echo "Options:"
            echo "  --hf-token TOKEN      HuggingFace token"
            echo "  --wandb-key KEY       Wandb API key"
            echo "  --instance-type TYPE  Instance type (default: g5.xlarge)"
            echo "  --key-name NAME       SSH key pair name"
            exit 0
            ;;
        *)
            echo "Unknown option: $option"
            exit 1
            ;;
    esac
done
46
+
47
print_status "Launching Seriguela training instance..."

# --- Locate the newest Deep Learning AMI --------------------------------
print_status "Finding Deep Learning AMI..."
dl_ami_filter="Name=name,Values=*Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04)*"
AMI_ID=$(aws ec2 describe-images \
    --owners amazon \
    --filters "$dl_ami_filter" \
    --query "Images | sort_by(@, &CreationDate) | [-1].ImageId" \
    --output text)

case "$AMI_ID" in
    ""|None)
        print_error "Could not find Deep Learning AMI"
        exit 1
        ;;
esac
print_status "Using AMI: $AMI_ID"

# --- Pick an SSH key pair ------------------------------------------------
# Default to the account's first key pair when --key-name was not given.
if [ -z "$KEY_NAME" ]; then
    KEY_NAME=$(aws ec2 describe-key-pairs --query "KeyPairs[0].KeyName" --output text 2>/dev/null)
fi
case "$KEY_NAME" in
    ""|None)
        print_error "No SSH key pair found. Create one first or specify with --key-name"
        exit 1
        ;;
esac
print_status "Using key pair: $KEY_NAME"

# --- Find or create the project security group ---------------------------
SECURITY_GROUP=$(aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=seriguela-sg" \
    --query "SecurityGroups[0].GroupId" \
    --output text 2>/dev/null)

MY_IP=$(curl -s ifconfig.me)
if [ -z "$SECURITY_GROUP" ] || [ "$SECURITY_GROUP" = "None" ]; then
    print_status "Creating security group..."
    SECURITY_GROUP=$(aws ec2 create-security-group \
        --group-name seriguela-sg \
        --description "Security group for Seriguela training" \
        --query "GroupId" --output text)
    # Allow SSH only from the caller's current public IP.
    aws ec2 authorize-security-group-ingress \
        --group-id "$SECURITY_GROUP" \
        --protocol tcp --port 22 \
        --cidr "${MY_IP}/32"
    print_status "Created security group with SSH access from $MY_IP"
else
    # Group already exists: (re-)authorize the caller's current IP and
    # ignore the "rule already exists" error.
    aws ec2 authorize-security-group-ingress \
        --group-id "$SECURITY_GROUP" \
        --protocol tcp --port 22 \
        --cidr "${MY_IP}/32" 2>/dev/null || true
fi
print_status "Using security group: $SECURITY_GROUP"
102
+
103
# Build the EC2 user-data script executed on first boot.  The 'USERDATA'
# delimiter is quoted, so nothing in the body expands on the client; the
# whole provisioning happens on the instance as the ubuntu user.
USER_DATA=$(cat << 'USERDATA'
#!/bin/bash
exec > /var/log/user-data.log 2>&1
set -x

# Wait for cloud-init to complete
cloud-init status --wait

# Setup as ubuntu user
sudo -u ubuntu bash << 'UBUNTUSETUP'
cd /home/ubuntu

# Install dependencies
sudo apt-get update -qq
sudo apt-get install -y -qq python3-venv python3-pip git

# Clone repository
git clone https://github.com/augustocsc/seriguela.git
cd seriguela

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install requirements
pip install --upgrade pip -q
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121 -q

# Create marker file to indicate setup complete
touch /home/ubuntu/.setup_complete
UBUNTUSETUP
USERDATA
)
137
+
138
# Append token configuration to the user-data script when tokens were given.
# TOKEN_SETUP is double-quoted, so $HF_TOKEN/$WANDB_KEY expand HERE on the
# client; the inner single quotes keep the values literal on the instance.
# BUGFIX: these appended lines execute on the instance as root, AFTER the
# `sudo -u ubuntu` block has finished — previously .env ended up root-owned
# inside the ubuntu user's checkout (and readable by any user).  Hand the
# file to ubuntu and restrict permissions, since it holds secrets.
if [ -n "$HF_TOKEN" ] || [ -n "$WANDB_KEY" ]; then
    TOKEN_SETUP="
# Configure tokens
cd /home/ubuntu/seriguela
echo 'HF_TOKEN=$HF_TOKEN' > .env
echo 'WANDB_API_KEY=$WANDB_KEY' >> .env
chown ubuntu:ubuntu .env
chmod 600 .env
"
    USER_DATA="${USER_DATA}${TOKEN_SETUP}"
fi
148
+
149
+ # Launch instance
150
+ print_status "Launching instance..."
151
+ INSTANCE_ID=$(aws ec2 run-instances \
152
+ --image-id "$AMI_ID" \
153
+ --instance-type "$INSTANCE_TYPE" \
154
+ --key-name "$KEY_NAME" \
155
+ --security-group-ids "$SECURITY_GROUP" \
156
+ --block-device-mappings "[{\"DeviceName\":\"/dev/sda1\",\"Ebs\":{\"VolumeSize\":$VOLUME_SIZE,\"VolumeType\":\"gp3\"}}]" \
157
+ --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$INSTANCE_NAME}]" \
158
+ --user-data "$USER_DATA" \
159
+ --query "Instances[0].InstanceId" \
160
+ --output text)
161
+
162
+ print_status "Instance launched: $INSTANCE_ID"
163
+
164
+ # Wait for instance to be running
165
+ print_status "Waiting for instance to start..."
166
+ aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"
167
+
168
+ # Get public IP
169
+ PUBLIC_IP=$(aws ec2 describe-instances \
170
+ --instance-ids "$INSTANCE_ID" \
171
+ --query "Reservations[0].Instances[0].PublicIpAddress" \
172
+ --output text)
173
+
174
+ echo ""
175
+ echo "=========================================="
176
+ echo -e "${GREEN}Instance Ready!${NC}"
177
+ echo "=========================================="
178
+ echo "Instance ID: $INSTANCE_ID"
179
+ echo "Public IP: $PUBLIC_IP"
180
+ echo ""
181
+ echo "Connect with:"
182
+ echo " ssh -i ~/.ssh/${KEY_NAME}.pem ubuntu@${PUBLIC_IP}"
183
+ echo ""
184
+ echo "Check setup progress:"
185
+ echo " ssh ubuntu@${PUBLIC_IP} 'tail -f /var/log/user-data.log'"
186
+ echo ""
187
+ echo "Wait for setup to complete (check for .setup_complete):"
188
+ echo " ssh ubuntu@${PUBLIC_IP} 'while [ ! -f ~/.setup_complete ]; do sleep 10; done; echo Done!'"
189
+ echo ""
190
+ echo "Then run training:"
191
+ echo " ssh ubuntu@${PUBLIC_IP} 'cd seriguela && source venv/bin/activate && bash scripts/aws/run_all_training.sh'"
192
+ echo ""
193
+
194
+ # Save instance info
195
+ echo "$INSTANCE_ID" > /tmp/seriguela_instance_id.txt
196
+ echo "$PUBLIC_IP" > /tmp/seriguela_instance_ip.txt
scripts/aws/launch_instance_fixed.sh ADDED
@@ -0,0 +1,371 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Script to launch and configure AWS g5.xlarge instance for Seriguela training
# FIXED VERSION - Includes Wandb validation and proper setup
# Usage: ./launch_instance_fixed.sh [--hf-token TOKEN] [--wandb-key KEY]

set -e

# ANSI colors used by the logging helpers below.
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
BLUE='\033[0;34m'
NC='\033[0m'

# Tiny level-tagged, colorized logging helpers.
print_status()  { echo -e "${GREEN}[INFO]${NC} $1"; }
print_warning() { echo -e "${YELLOW}[WARN]${NC} $1"; }
print_error()   { echo -e "${RED}[ERROR]${NC} $1"; }

# Defaults; empty values are auto-detected (or created) further down.
INSTANCE_TYPE="g5.xlarge"
AMI_ID=""
KEY_NAME=""
SECURITY_GROUP=""
REGION=$(aws configure get region 2>/dev/null || echo "us-east-1")
VOLUME_SIZE=100
INSTANCE_NAME="seriguela-training"
HF_TOKEN=""
WANDB_KEY=""

# Command-line parsing.
while [ "$#" -gt 0 ]; do
    option="$1"
    case "$option" in
        --hf-token)
            HF_TOKEN="$2"
            shift 2
            ;;
        --wandb-key)
            WANDB_KEY="$2"
            shift 2
            ;;
        --instance-type)
            INSTANCE_TYPE="$2"
            shift 2
            ;;
        --key-name)
            KEY_NAME="$2"
            shift 2
            ;;
        --help)
            echo "Usage: $0 [OPTIONS]"
            echo "Options:"
            echo "  --hf-token TOKEN      HuggingFace token (required for push to hub)"
            echo "  --wandb-key KEY       Wandb API key (required for logging)"
            echo "  --instance-type TYPE  Instance type (default: g5.xlarge)"
            echo "  --key-name NAME       SSH key pair name"
            echo ""
            echo "Example:"
            echo "  $0 --hf-token hf_xxx --wandb-key wandb_v1_xxx"
            exit 0
            ;;
        *)
            echo "Unknown option: $option"
            exit 1
            ;;
    esac
done

# A Wandb key is mandatory for this variant; abort early without one.
if [ -z "$WANDB_KEY" ]; then
    print_error "Wandb API key is required! Use --wandb-key"
    print_warning "Get your key from: https://wandb.ai/authorize"
    exit 1
fi

# The HF token is optional: without it the model is not pushed to the Hub.
if [ -z "$HF_TOKEN" ]; then
    print_warning "HuggingFace token not provided. Model won't be pushed to Hub."
    print_warning "Get your token from: https://huggingface.co/settings/tokens"
fi
63
+
64
print_status "Launching Seriguela training instance with validated setup..."

# --- Locate the newest Deep Learning AMI --------------------------------
print_status "Finding Deep Learning AMI..."
dl_ami_filter="Name=name,Values=*Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04)*"
AMI_ID=$(aws ec2 describe-images \
    --owners amazon \
    --filters "$dl_ami_filter" \
    --query "Images | sort_by(@, &CreationDate) | [-1].ImageId" \
    --output text)

case "$AMI_ID" in
    ""|None)
        print_error "Could not find Deep Learning AMI"
        exit 1
        ;;
esac
print_status "Using AMI: $AMI_ID"

# --- Pick an SSH key pair ------------------------------------------------
# Default to the account's first key pair when --key-name was not given.
if [ -z "$KEY_NAME" ]; then
    KEY_NAME=$(aws ec2 describe-key-pairs --query "KeyPairs[0].KeyName" --output text 2>/dev/null)
fi
case "$KEY_NAME" in
    ""|None)
        print_error "No SSH key pair found. Create one first or specify with --key-name"
        exit 1
        ;;
esac
print_status "Using key pair: $KEY_NAME"

# --- Find or create the project security group ---------------------------
SECURITY_GROUP=$(aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=seriguela-sg" \
    --query "SecurityGroups[0].GroupId" \
    --output text 2>/dev/null)

MY_IP=$(curl -s ifconfig.me)
if [ -z "$SECURITY_GROUP" ] || [ "$SECURITY_GROUP" = "None" ]; then
    print_status "Creating security group..."
    SECURITY_GROUP=$(aws ec2 create-security-group \
        --group-name seriguela-sg \
        --description "Security group for Seriguela training" \
        --query "GroupId" --output text)
    # Allow SSH only from the caller's current public IP.
    aws ec2 authorize-security-group-ingress \
        --group-id "$SECURITY_GROUP" \
        --protocol tcp --port 22 \
        --cidr "${MY_IP}/32"
    print_status "Created security group with SSH access from $MY_IP"
else
    # Group already exists: (re-)authorize the caller's current IP and
    # ignore the "rule already exists" error.
    aws ec2 authorize-security-group-ingress \
        --group-id "$SECURITY_GROUP" \
        --protocol tcp --port 22 \
        --cidr "${MY_IP}/32" 2>/dev/null || true
fi
print_status "Using security group: $SECURITY_GROUP"
119
+
120
# Build the EC2 user-data script with setup validation.
# NOTE(review): unlike the other launch scripts, the USERDATA delimiter is
# UNQUOTED here, so $HF_TOKEN / $WANDB_KEY expand locally while the heredoc
# is captured (escaped \$(date) and \$? are deliberately deferred to the
# instance).  That makes the "${USER_DATA//...}" substitutions after the
# heredoc almost certainly no-ops — harmless, but confirm whether the
# heredoc was meant to be quoted like the other variants.
USER_DATA=$(cat << USERDATA
#!/bin/bash
exec > /var/log/user-data.log 2>&1
set -x

echo "=========================================="
echo "Seriguela Instance Setup - VALIDATED"
echo "Started: \$(date)"
echo "=========================================="

# Wait for cloud-init to complete
cloud-init status --wait

# Setup as ubuntu user
sudo -u ubuntu bash << 'UBUNTUSETUP'
cd /home/ubuntu

echo "[1/8] Installing system dependencies..."
sudo apt-get update -qq
sudo apt-get install -y -qq python3-venv python3-pip git dos2unix

echo "[2/8] Cloning repository..."
git clone https://github.com/augustocsc/seriguela.git
cd seriguela

echo "[3/8] Creating virtual environment..."
python3 -m venv venv
source venv/bin/activate

echo "[4/8] Upgrading pip..."
pip install --upgrade pip -q

echo "[5/8] Installing requirements..."
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121 -q

echo "[6/8] Upgrading Wandb to latest version..."
pip install --upgrade 'wandb>=0.24.1' -q

echo "[7/8] Configuring environment..."
# Create .env file
cat > .env << 'ENVFILE'
HF_TOKEN=$HF_TOKEN
WANDB_API_KEY=$WANDB_KEY
ENVFILE

echo "[8/8] Validating setup..."

# Validate Python packages
python3 << 'PYCHECK'
import sys
print("Testing imports...")
try:
    import transformers
    print(f"✅ transformers {transformers.__version__}")
    import torch
    print(f"✅ torch {torch.__version__}")
    import wandb
    print(f"✅ wandb {wandb.__version__}")
    import peft
    print(f"✅ peft {peft.__version__}")
except ImportError as e:
    print(f"❌ Import failed: {e}")
    sys.exit(1)
PYCHECK

if [ \$? -ne 0 ]; then
    echo "❌ Package validation failed"
    exit 1
fi

# Validate GPU
echo "Checking GPU..."
if nvidia-smi &> /dev/null; then
    echo "✅ GPU detected:"
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
    echo "❌ No GPU detected"
    exit 1
fi

# Validate Wandb authentication
if [ -n "$WANDB_KEY" ]; then
    echo "Validating Wandb authentication..."
    python3 << PYVALIDATE
import wandb
import os
try:
    result = wandb.login(key='$WANDB_KEY')
    if result:
        print("✅ Wandb authentication successful")
        # Get user info
        import requests
        response = requests.get('https://api.wandb.ai/graphql',
            headers={'Authorization': f'Bearer $WANDB_KEY'},
            json={'query': '{viewer{entity}}'})
        if response.status_code == 200:
            print(f" Logged in to Wandb")
    else:
        print("❌ Wandb authentication failed")
        exit(1)
except Exception as e:
    print(f"❌ Wandb validation error: {e}")
    exit(1)
PYVALIDATE

    if [ \$? -ne 0 ]; then
        echo "❌ Wandb authentication failed"
        exit 1
    fi
else
    echo "⚠️ No Wandb key provided - skipping validation"
fi

# Validate HuggingFace token
if [ -n "$HF_TOKEN" ]; then
    echo "Validating HuggingFace authentication..."
    python3 << PYVALIDATE
from huggingface_hub import HfApi
try:
    api = HfApi(token='$HF_TOKEN')
    user = api.whoami()
    print(f"✅ HuggingFace authentication successful")
    print(f" Logged in as: {user.get('name', 'unknown')}")
except Exception as e:
    print(f"❌ HuggingFace validation error: {e}")
    exit(1)
PYVALIDATE

    if [ \$? -ne 0 ]; then
        echo "❌ HuggingFace authentication failed"
        exit 1
    fi
else
    echo "⚠️ No HuggingFace token provided - model won't be pushed to Hub"
fi

# All validations passed
echo ""
echo "=========================================="
echo "✅ Setup Complete and Validated!"
echo "Finished: \$(date)"
echo "=========================================="

# Create completion markers
touch /home/ubuntu/.setup_complete
touch /home/ubuntu/.setup_validated

# Create info file
cat > /home/ubuntu/setup_info.txt << 'INFOFILE'
Setup completed successfully!

Validated:
- Python packages installed
- GPU detected
- Wandb authenticated
- HuggingFace authenticated (if token provided)

Ready to train!

Quick commands:
cd ~/seriguela
source venv/bin/activate
python scripts/train.py --help

Monitor scripts:
bash scripts/aws/monitor_training_auto.sh
INFOFILE

echo "Setup info saved to ~/setup_info.txt"
UBUNTUSETUP

# End of setup
echo "User-data script completed"
USERDATA
)

# Replace placeholder tokens in user-data.
# NOTE(review): with the unquoted heredoc above these literals have already
# been expanded during capture, so both substitutions should find nothing.
USER_DATA="${USER_DATA//\$HF_TOKEN/$HF_TOKEN}"
USER_DATA="${USER_DATA//\$WANDB_KEY/$WANDB_KEY}"
300
+
301
+ # Launch instance
302
+ print_status "Launching instance..."
303
+ INSTANCE_ID=$(aws ec2 run-instances \
304
+ --image-id "$AMI_ID" \
305
+ --instance-type "$INSTANCE_TYPE" \
306
+ --key-name "$KEY_NAME" \
307
+ --security-group-ids "$SECURITY_GROUP" \
308
+ --block-device-mappings "[{\"DeviceName\":\"/dev/sda1\",\"Ebs\":{\"VolumeSize\":$VOLUME_SIZE,\"VolumeType\":\"gp3\"}}]" \
309
+ --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$INSTANCE_NAME},{Key=Project,Value=seriguela},{Key=AutoSetup,Value=validated}]" \
310
+ --user-data "$USER_DATA" \
311
+ --query "Instances[0].InstanceId" \
312
+ --output text)
313
+
314
+ print_status "Instance launched: $INSTANCE_ID"
315
+
316
+ # Wait for instance to be running
317
+ print_status "Waiting for instance to start..."
318
+ aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"
319
+
320
+ # Get public IP
321
+ PUBLIC_IP=$(aws ec2 describe-instances \
322
+ --instance-ids "$INSTANCE_ID" \
323
+ --query "Reservations[0].Instances[0].PublicIpAddress" \
324
+ --output text)
325
+
326
+ echo ""
327
+ echo "=========================================="
328
+ echo -e "${GREEN}Instance Ready!${NC}"
329
+ echo "=========================================="
330
+ echo "Instance ID: $INSTANCE_ID"
331
+ echo "Public IP: $PUBLIC_IP"
332
+ echo "Key Pair: $KEY_NAME"
333
+ echo ""
334
+ echo -e "${BLUE}Connect with:${NC}"
335
+ echo " ssh -i ~/.ssh/${KEY_NAME}.pem ubuntu@${PUBLIC_IP}"
336
+ echo ""
337
+ echo -e "${BLUE}Check setup progress:${NC}"
338
+ echo " ssh ubuntu@${PUBLIC_IP} 'tail -f /var/log/user-data.log'"
339
+ echo ""
340
+ echo -e "${BLUE}Wait for VALIDATED setup to complete:${NC}"
341
+ echo " ssh ubuntu@${PUBLIC_IP} 'while [ ! -f ~/.setup_validated ]; do sleep 10; echo \"Setup in progress...\"; done; echo \"✅ Setup validated!\"; cat ~/setup_info.txt'"
342
+ echo ""
343
+ echo -e "${BLUE}Then run training:${NC}"
344
+ echo " ssh ubuntu@${PUBLIC_IP} 'cd seriguela && source venv/bin/activate && bash scripts/aws/run_all_training.sh'"
345
+ echo ""
346
+ echo -e "${YELLOW}Setup includes:${NC}"
347
+ echo " ✅ Wandb 0.24.1+ with authentication test"
348
+ echo " ✅ HuggingFace authentication test"
349
+ echo " ✅ GPU validation"
350
+ echo " ✅ All packages validated"
351
+ echo ""
352
+
353
+ # Save instance info
354
+ INFO_DIR="${HOME}/.seriguela"
355
+ mkdir -p "$INFO_DIR"
356
+ echo "$INSTANCE_ID" > "$INFO_DIR/last_instance_id.txt"
357
+ echo "$PUBLIC_IP" > "$INFO_DIR/last_instance_ip.txt"
358
+ echo "$KEY_NAME" > "$INFO_DIR/last_key_name.txt"
359
+
360
+ cat > "$INFO_DIR/last_instance_info.txt" << INFOEND
361
+ Instance ID: $INSTANCE_ID
362
+ Public IP: $PUBLIC_IP
363
+ Key Name: $KEY_NAME
364
+ Instance Type: $INSTANCE_TYPE
365
+ Region: $REGION
366
+ Launched: $(date)
367
+ Setup: Validated (Wandb + HF + GPU)
368
+ INFOEND
369
+
370
+ print_status "Instance info saved to: $INFO_DIR/"
371
+ echo ""
scripts/aws/monitor_evaluation.sh ADDED
@@ -0,0 +1,116 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Script to monitor evaluation progress and download results
# Usage: bash scripts/aws/monitor_evaluation.sh [PUBLIC_IP]

set -e

# ANSI colors for the logging helpers.
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

print_status()  { echo -e "${GREEN}[INFO]${NC} $1"; }
print_warning() { echo -e "${YELLOW}[WARN]${NC} $1"; }

INFO_DIR="${HOME}/.seriguela"

# Resolve the instance IP: an explicit argument wins, otherwise fall back
# to the address saved by launch_evaluation_instance.sh.
PUBLIC_IP="${1:-}"
if [ -z "$PUBLIC_IP" ]; then
    if [ -f "$INFO_DIR/last_evaluation_instance_ip.txt" ]; then
        PUBLIC_IP=$(cat "$INFO_DIR/last_evaluation_instance_ip.txt")
        print_status "Using saved IP: $PUBLIC_IP"
    else
        echo "Error: No IP provided and no saved IP found."
        echo "Usage: $0 <PUBLIC_IP>"
        exit 1
    fi
fi

# Resolve the SSH key: saved key from launch, else the account's first pair.
if [ -f "$INFO_DIR/last_evaluation_key_name.txt" ]; then
    KEY_NAME=$(cat "$INFO_DIR/last_evaluation_key_name.txt")
else
    KEY_NAME=$(aws ec2 describe-key-pairs --query "KeyPairs[0].KeyName" --output text 2>/dev/null)
fi

SSH_CMD="ssh -i ~/.ssh/${KEY_NAME}.pem -o StrictHostKeyChecking=no ubuntu@${PUBLIC_IP}"

echo "=========================================="
echo "Monitoring Evaluation"
echo "=========================================="
echo "Instance: $PUBLIC_IP"
echo "Key: $KEY_NAME"
echo ""
47
# --- Wait for first-boot setup to finish ---------------------------------
print_status "Checking setup status..."
if $SSH_CMD 'test -f ~/.setup_complete'; then
    print_status "✅ Setup complete"
else
    print_warning "Setup still in progress. Waiting..."
    $SSH_CMD 'while [ ! -f ~/.setup_complete ]; do sleep 5; done; echo "Setup complete!"'
fi

echo ""
echo "=========================================="
echo "Evaluation Progress"
echo "=========================================="
echo "Press Ctrl+C to stop monitoring (evaluation will continue)"
echo ""

# --- Follow evaluation logs ----------------------------------------------
# BUGFIX: `test -f evaluation_*.log` breaks once MORE than one log exists
# ("test: too many arguments" -> false), wrongly reporting that the
# evaluation has not started.  `ls` succeeds when at least one file
# matches the glob, regardless of how many.
if $SSH_CMD 'ls ~/seriguela/evaluation_*.log >/dev/null 2>&1'; then
    print_status "Evaluation in progress. Showing logs..."
    echo ""
    $SSH_CMD 'tail -f ~/seriguela/evaluation_*.log' || true
else
    print_warning "Evaluation hasn't started yet."
    echo ""
    echo "To start evaluation, run:"
    echo "  $SSH_CMD 'cd seriguela && source venv/bin/activate && bash scripts/aws/evaluate_models.sh'"
    echo ""
    echo "Or run in background:"
    echo "  $SSH_CMD 'cd seriguela && source venv/bin/activate && nohup bash scripts/aws/evaluate_models.sh > evaluation.log 2>&1 &'"
fi

echo ""
echo "=========================================="
echo "Download Results"
echo "=========================================="
echo ""

# --- Download results if available ---------------------------------------
if $SSH_CMD 'test -d ~/seriguela/evaluation_results/comparison'; then
    print_status "Downloading results..."

    # Create local directory
    mkdir -p ./evaluation_results/comparison

    # Download results (best effort; ignore transfer errors)
    scp -i ~/.ssh/${KEY_NAME}.pem -o StrictHostKeyChecking=no -r \
        ubuntu@${PUBLIC_IP}:~/seriguela/evaluation_results/comparison/* \
        ./evaluation_results/comparison/ 2>/dev/null || true

    # Download log files
    scp -i ~/.ssh/${KEY_NAME}.pem -o StrictHostKeyChecking=no \
        ubuntu@${PUBLIC_IP}:~/seriguela/evaluation_*.log \
        ./evaluation_results/ 2>/dev/null || true

    print_status "Results downloaded to: ./evaluation_results/"
    echo ""

    # Show latest comparison; fall back to the raw file when jq is
    # missing or the JSON cannot be parsed (cat|jq was a useless cat).
    LATEST_COMPARISON=$(ls -t ./evaluation_results/comparison/comparison_*.json 2>/dev/null | head -1)
    if [ -n "$LATEST_COMPARISON" ]; then
        echo "Latest comparison results:"
        echo ""
        jq '.comparison' "$LATEST_COMPARISON" 2>/dev/null || cat "$LATEST_COMPARISON"
    fi
else
    print_warning "No results available yet."
fi

echo ""
scripts/aws/monitor_training_auto.sh ADDED
@@ -0,0 +1,179 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Automatic Training Monitor and Notifier
# Watches the scripts/train.py process, prints periodic progress/GPU status,
# and when the process exits decides success vs. failure from the training
# log, writes a notification file, and (on success) runs the analysis script.

set -e

# Colors
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
BLUE='\033[0;34m'
NC='\033[0m'

print_status() { echo -e "${GREEN}[$(date '+%H:%M:%S')]${NC} $1"; }
print_warning() { echo -e "${YELLOW}[$(date '+%H:%M:%S')]${NC} $1"; }
print_error() { echo -e "${RED}[$(date '+%H:%M:%S')]${NC} $1"; }
print_header() { echo -e "\n${BLUE}========================================\n$1\n========================================${NC}\n"; }

# Configuration
PROJECT_DIR="/home/ubuntu/seriguela"
LOG_FILE="$HOME/training_success.log"   # log written by the training run
MONITOR_LOG="$HOME/monitor_output.log"  # NOTE(review): defined but not used below
TRAINING_PID=""
CHECK_INTERVAL=60 # Check every 60 seconds
MODEL_PATH="./output/Se124M_700K_infix"
DATASET_REPO="augustocsc/sintetico_natural"
DATA_DIR="700K"
DATA_COLUMN="i_prompt_n"

cd "$PROJECT_DIR"
source venv/bin/activate

# Get training PID (empty string when no train.py process exists).
get_training_pid() {
    TRAINING_PID=$(ps aux | grep "python scripts/train.py" | grep -v grep | awk '{print $2}')
}

# Check if training is running (exit status 0 = running).
is_training_running() {
    get_training_pid
    if [ -z "$TRAINING_PID" ]; then
        return 1
    else
        return 0
    fi
}

# Get training progress (percentage) from the log.
# FIX: the previous `sed 's/.*\([0-9]\+\)%|.*/\1/'` had a greedy ".*" before
# the capture group, so multi-digit percentages were truncated to their last
# digit (e.g. "45%" was reported as "5"). Extract the whole number instead,
# and never let an empty match abort the monitor under `set -e`.
get_progress() {
    local pct=""
    if [ -f "$LOG_FILE" ]; then
        pct=$(tail -100 "$LOG_FILE" | grep -oE "[0-9]+%\|" | tail -1 | tr -d '%|' || true)
    fi
    echo "${pct:-0}"
}

# Get the most recent step line from the log.
# NOTE(review): 21882 is the expected total step count for this specific run —
# confirm if the dataset or batch configuration changes.
get_training_stats() {
    if [ -f "$LOG_FILE" ]; then
        local last_line=$(tail -100 "$LOG_FILE" | grep -E "[0-9]+/21882" | tail -1 || true)
        echo "$last_line"
    fi
}

# Send notification: print it and persist it for later inspection.
send_notification() {
    local title="$1"
    local message="$2"

    print_header "$title"
    echo "$message"

    # Save to notification file
    cat > "$HOME/training_notification.txt" << EOF
================================================================================
$title
$(date '+%Y-%m-%d %H:%M:%S')
================================================================================

$message

================================================================================
EOF

    print_status "Notification saved to: $HOME/training_notification.txt"
}

# Monitor training
print_header "Training Monitor Started"
print_status "Monitoring training process..."
print_status "Log file: $LOG_FILE"
print_status "Check interval: ${CHECK_INTERVAL}s"

START_TIME=$(date +%s)
LAST_PROGRESS=0

while true; do
    if is_training_running; then
        CURRENT_PROGRESS=$(get_progress)
        TRAINING_STATS=$(get_training_stats)

        # Show progress every check
        print_status "Training running (PID: $TRAINING_PID) - Progress: ${CURRENT_PROGRESS}%"

        if [ ! -z "$TRAINING_STATS" ]; then
            echo "  $TRAINING_STATS"
        fi

        # Check GPU. FIX: "|| true" so a transient nvidia-smi failure does not
        # kill the monitor via `set -e`.
        GPU_INFO=$(nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits 2>/dev/null || true)
        echo "  GPU: $GPU_INFO"

        LAST_PROGRESS=$CURRENT_PROGRESS
        sleep $CHECK_INTERVAL
    else
        # Training finished or crashed
        END_TIME=$(date +%s)
        DURATION=$((END_TIME - START_TIME))
        HOURS=$((DURATION / 3600))
        MINUTES=$(((DURATION % 3600) / 60))

        print_header "Training Process Ended"

        # Success heuristic: the log mentions a finished run or a 100% bar.
        if grep -q "Training finished" "$LOG_FILE" 2>/dev/null || \
           grep -q "100%|" "$LOG_FILE" 2>/dev/null; then

            # SUCCESS - Training completed
            print_status "Training completed successfully!"
            print_status "Total time: ${HOURS}h ${MINUTES}m"

            # Extract final metrics (|| true: no match must not abort under set -e)
            FINAL_METRICS=$(tail -200 "$LOG_FILE" | grep -E "(train_loss|eval_loss)" | tail -5 || true)

            send_notification "✅ Training Completed Successfully" \
"Training Duration: ${HOURS}h ${MINUTES}m
Model: GPT-2 (124M) with LoRA
Dataset: 700K infix
Output: $MODEL_PATH

Final Metrics:
$FINAL_METRICS

Wandb Dashboard:
https://wandb.ai/symbolic-gression/seriguela_700K_test

Starting automatic analysis...
"

            # Run automatic analysis
            print_header "Starting Automatic Analysis"
            bash "$PROJECT_DIR/scripts/aws/analyze_model.sh" "$MODEL_PATH" "$DATA_COLUMN" 2>&1 | tee "$HOME/analysis_output.log"

            print_status "Analysis complete! Check: $HOME/analysis_output.log"

        else
            # FAILED - Training crashed or was killed
            print_error "Training ended unexpectedly!"

            # Get last errors (|| true: no match must not abort under set -e)
            ERRORS=$(tail -50 "$LOG_FILE" | grep -E "(Error|Exception|Traceback)" | head -10 || true)

            send_notification "❌ Training Failed or Interrupted" \
"Training Duration: ${HOURS}h ${MINUTES}m
Last Progress: ${LAST_PROGRESS}%

Possible Errors:
$ERRORS

Check full log: $LOG_FILE
"
        fi

        break
    fi
done

print_status "Monitor finished. Check notification file: $HOME/training_notification.txt"
scripts/aws/run_all_training.sh ADDED
@@ -0,0 +1,365 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Workflow completo de treinamento para AWS g5.xlarge
# Projeto Seriguela - Treinar 6 modelos GPT-2 (3 tamanhos x 2 formatos)
#
# FIX: the original checked `$?` after `python scripts/train.py` while running
# under `set -e`, so a training failure aborted the whole script before the
# check and FAILED_MODELS tracking could never trigger. Commands whose failure
# must be tolerated are now run inside `if` conditions.

set -e  # Exit on error

echo "=========================================="
echo "Seriguela - Full Training Workflow"
echo "=========================================="

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

print_status() {
    echo -e "${GREEN}[INFO]${NC} $1"
}

print_warning() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}

print_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}

print_header() {
    echo ""
    echo -e "${BLUE}=========================================="
    echo "$1"
    echo -e "==========================================${NC}"
    echo ""
}

# Configuration: resolve the project root relative to this script.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$(dirname "$SCRIPT_DIR")")"
cd "$PROJECT_DIR"

# Check if virtual environment is activated
if [ -z "$VIRTUAL_ENV" ]; then
    print_warning "Virtual environment not activated. Activating..."
    source venv/bin/activate 2>/dev/null || {
        print_error "Could not activate virtual environment. Please run setup_aws.sh first."
        exit 1
    }
fi

# Check environment variables
if [ -z "$HF_TOKEN" ]; then
    print_error "HF_TOKEN not set. Please export HF_TOKEN='your_token'"
    exit 1
fi

# Check GPU
print_status "Checking GPU..."
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv || {
    print_error "GPU not available!"
    exit 1
}

# Dataset configuration
DATASET_REPO="augustocsc/sintetico_natural"
DATA_DIR="700K"
HF_USER="augustocsc"

# Common training parameters
WANDB_PROJECT="seriguela_700K"
SEED=42
BLOCK_SIZE=128

# Output directories
OUTPUT_BASE="./output"
EVAL_OUTPUT="./evaluation_results"
mkdir -p "$OUTPUT_BASE" "$EVAL_OUTPUT"

# Training configurations
# Format: "model_name|epochs|batch_size|grad_accum|learning_rate|run_suffix"
declare -a CONFIGS=(
    # GPT-2 Small (124M)
    "gpt2|3|16|4|5e-5|Se124M"
    # GPT-2 Medium (355M)
    "gpt2-medium|2|8|8|3e-5|Se355M"
    # GPT-2 Large (774M)
    "gpt2-large|2|4|16|2e-5|Se774M"
)

# Data columns for formats
declare -a DATA_COLUMNS=(
    "i_prompt_n|infix"
    "p_prompt_n|prefix"
)

# Run one training job. Returns 0 on success, 1 on failure so the caller can
# record the outcome instead of having `set -e` kill the whole workflow.
run_training() {
    local model_name=$1
    local epochs=$2
    local batch_size=$3
    local grad_accum=$4
    local lr=$5
    local run_suffix=$6
    local data_column=$7
    local format=$8

    local run_name="${run_suffix}_${DATA_DIR}_${format}"
    local output_dir="${OUTPUT_BASE}/${run_name}"
    local hub_model_id="${HF_USER}/${run_name}"

    print_header "Training: $run_name"
    echo "Model: $model_name"
    echo "Epochs: $epochs"
    echo "Batch size: $batch_size"
    echo "Gradient accumulation: $grad_accum"
    echo "Effective batch size: $((batch_size * grad_accum))"
    echo "Learning rate: $lr"
    echo "Data column: $data_column"
    echo "Output: $output_dir"
    echo "Hub ID: $hub_model_id"
    echo ""

    # Run training (inside `if` so a failure is reported, not fatal)
    if python scripts/train.py \
        --model_name_or_path "$model_name" \
        --dataset_repo_id "$DATASET_REPO" \
        --data_dir "$DATA_DIR" \
        --data_column "$data_column" \
        --approach "$format" \
        --output_dir "$output_dir" \
        --num_train_epochs "$epochs" \
        --per_device_train_batch_size "$batch_size" \
        --per_device_eval_batch_size "$batch_size" \
        --gradient_accumulation_steps "$grad_accum" \
        --learning_rate "$lr" \
        --weight_decay 0.01 \
        --warmup_steps 100 \
        --block_size "$BLOCK_SIZE" \
        --logging_steps 50 \
        --eval_strategy epoch \
        --save_strategy epoch \
        --save_total_limit 2 \
        --load_best_model_at_end \
        --fp16 \
        --seed "$SEED" \
        --wandb_project "$WANDB_PROJECT" \
        --wandb_run_name "$run_name" \
        --push_to_hub \
        --hub_model_id "$hub_model_id"; then
        print_status "Training completed successfully: $run_name"
        return 0
    else
        print_error "Training failed: $run_name"
        return 1
    fi
}

# Run one evaluation job; evaluation problems are warnings, not fatal.
run_evaluation() {
    local model_path=$1
    local data_column=$2
    local num_samples=${3:-500}

    print_status "Evaluating model: $model_path"

    if python scripts/evaluate.py \
        --model_path "$model_path" \
        --dataset_repo_id "$DATASET_REPO" \
        --data_dir "$DATA_DIR" \
        --data_column "$data_column" \
        --num_samples "$num_samples" \
        --output_dir "$EVAL_OUTPUT" \
        --temperature 0.7 \
        --seed "$SEED"; then
        print_status "Evaluation completed: $model_path"
    else
        print_warning "Evaluation had issues: $model_path"
    fi
}

# Parse command line arguments
RUN_TEST=false
RUN_TRAINING=true
RUN_EVAL=true
SPECIFIC_MODEL=""

while [[ $# -gt 0 ]]; do
    case $1 in
        --test-only)
            RUN_TEST=true
            RUN_TRAINING=false
            RUN_EVAL=false
            shift
            ;;
        --no-eval)
            RUN_EVAL=false
            shift
            ;;
        --eval-only)
            RUN_TRAINING=false
            RUN_EVAL=true
            shift
            ;;
        --model)
            SPECIFIC_MODEL="$2"
            shift 2
            ;;
        --help)
            echo "Usage: $0 [OPTIONS]"
            echo ""
            echo "Options:"
            echo "  --test-only    Run only the test training (1 epoch)"
            echo "  --no-eval      Skip evaluation after training"
            echo "  --eval-only    Run only evaluation (skip training)"
            echo "  --model NAME   Train only specific model (gpt2, gpt2-medium, gpt2-large)"
            echo "  --help         Show this help message"
            exit 0
            ;;
        *)
            print_error "Unknown option: $1"
            exit 1
            ;;
    esac
done

# Test run (a failure here should abort, so plain invocation under set -e)
if [ "$RUN_TEST" = true ]; then
    print_header "Running Test Training (1 epoch with gpt2)"

    python scripts/train.py \
        --model_name_or_path gpt2 \
        --dataset_repo_id "$DATASET_REPO" \
        --data_dir "$DATA_DIR" \
        --data_column "i_prompt_n" \
        --approach "infix" \
        --output_dir "${OUTPUT_BASE}/test_run" \
        --num_train_epochs 1 \
        --per_device_train_batch_size 16 \
        --gradient_accumulation_steps 4 \
        --learning_rate 5e-5 \
        --block_size "$BLOCK_SIZE" \
        --logging_steps 20 \
        --eval_strategy epoch \
        --save_strategy epoch \
        --fp16 \
        --seed "$SEED" \
        --wandb_project "${WANDB_PROJECT}_test"

    print_status "Test training completed!"
    print_status "Checklist:"
    echo "  [ ] GPU detected and functioning"
    echo "  [ ] Dataset loaded correctly"
    echo "  [ ] Training completed without errors"
    echo "  [ ] Wandb received metrics"
    echo "  [ ] Model saved locally"
    echo ""
    echo "Now test evaluate.py and generate.py:"
    echo "  python scripts/evaluate.py --model_path ./output/test_run --num_samples 50"
    echo "  python scripts/generate.py --model_path ./output/test_run --num_generations 5 --validate"
    exit 0
fi

# Track completed trainings
declare -a COMPLETED_MODELS=()
declare -a FAILED_MODELS=()

# Main training loop
if [ "$RUN_TRAINING" = true ]; then
    print_header "Starting Full Training Workflow"

    START_TIME=$(date +%s)

    for config in "${CONFIGS[@]}"; do
        IFS='|' read -r model_name epochs batch_size grad_accum lr run_suffix <<< "$config"

        # Skip if specific model requested and this is not it
        if [ -n "$SPECIFIC_MODEL" ] && [ "$model_name" != "$SPECIFIC_MODEL" ]; then
            continue
        fi

        for data_config in "${DATA_COLUMNS[@]}"; do
            IFS='|' read -r data_column format <<< "$data_config"

            run_name="${run_suffix}_${DATA_DIR}_${format}"

            print_status "Starting training: $run_name"

            if run_training "$model_name" "$epochs" "$batch_size" "$grad_accum" "$lr" "$run_suffix" "$data_column" "$format"; then
                COMPLETED_MODELS+=("${HF_USER}/${run_name}|${data_column}")
            else
                FAILED_MODELS+=("$run_name")
            fi

            # Small delay between trainings
            sleep 10
        done
    done

    END_TIME=$(date +%s)
    DURATION=$((END_TIME - START_TIME))
    HOURS=$((DURATION / 3600))
    MINUTES=$(((DURATION % 3600) / 60))

    print_header "Training Summary"
    echo "Total time: ${HOURS}h ${MINUTES}m"
    echo ""
    echo "Completed models (${#COMPLETED_MODELS[@]}):"
    for model in "${COMPLETED_MODELS[@]}"; do
        echo "  - ${model%|*}"
    done
    echo ""
    if [ ${#FAILED_MODELS[@]} -gt 0 ]; then
        echo "Failed models (${#FAILED_MODELS[@]}):"
        for model in "${FAILED_MODELS[@]}"; do
            echo "  - $model"
        done
    fi
fi

# Evaluation
if [ "$RUN_EVAL" = true ]; then
    print_header "Running Evaluations"

    # If we just trained, use those models
    if [ ${#COMPLETED_MODELS[@]} -gt 0 ]; then
        for model_info in "${COMPLETED_MODELS[@]}"; do
            IFS='|' read -r model_path data_column <<< "$model_info"
            run_evaluation "$model_path" "$data_column" 500
        done
    else
        # Otherwise, evaluate all expected models
        for config in "${CONFIGS[@]}"; do
            IFS='|' read -r model_name epochs batch_size grad_accum lr run_suffix <<< "$config"

            for data_config in "${DATA_COLUMNS[@]}"; do
                IFS='|' read -r data_column format <<< "$data_config"

                run_name="${run_suffix}_${DATA_DIR}_${format}"
                model_path="${HF_USER}/${run_name}"

                run_evaluation "$model_path" "$data_column" 500
            done
        done
    fi

    print_header "Evaluation Complete"
    echo "Results saved to: $EVAL_OUTPUT"
fi

print_header "Workflow Complete!"
echo ""
echo "Next steps:"
echo "1. Check training results on wandb: https://wandb.ai/${WANDB_PROJECT}"
echo "2. Check models on HuggingFace Hub: https://huggingface.co/${HF_USER}"
echo "3. Review evaluation results in: $EVAL_OUTPUT"
echo ""
echo "To test a model interactively:"
echo "  python scripts/generate.py --model_path ${HF_USER}/Se124M_700K_infix --interactive --validate"
echo ""
scripts/aws/setup_and_train_exp_a.sh ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Complete setup and training script for EXP-A (JSON format)
# Run this on a fresh AWS instance
# Pipeline: prepare data -> train -> evaluate, then drop a completion marker.

set -e

echo "=============================================="
echo "EXP-A: Complete Setup and Training"
echo "JSON Format with <|endofex|> marker"
echo "=============================================="
echo "Started: $(date)"
echo ""

cd /home/ubuntu/seriguela

# Activate environment
source venv/bin/activate

# Robustness: warn early about missing credentials. Training logs to wandb and
# data preparation downloads from the HuggingFace Hub, and a missing token
# would otherwise only surface mid-run.
if [ -z "${HF_TOKEN:-}" ]; then
    echo "WARNING: HF_TOKEN is not set; HuggingFace Hub access may fail."
fi
if [ -z "${WANDB_API_KEY:-}" ]; then
    echo "WARNING: WANDB_API_KEY is not set; wandb logging may fail."
fi

# Step 1: Prepare data
echo "[1/3] Preparing training data..."
echo "This will download from HuggingFace Hub and convert to JSON format"
echo ""

mkdir -p data/experiments

python scripts/data/prepare_experiment_data.py \
    --dataset_repo_id augustocsc/sintetico_natural \
    --data_dir 700K \
    --data_column i_prompt_n \
    --output_base_dir ./data/experiments

# Verify data
if [ ! -f "./data/experiments/exp_a_json/train.csv" ]; then
    echo "ERROR: Data preparation failed!"
    exit 1
fi

# NOTE(review): `wc -l` counts a CSV header row too, so this over-reports by
# one if train.csv has a header — confirm against prepare_experiment_data.py.
TRAIN_COUNT=$(wc -l < ./data/experiments/exp_a_json/train.csv)
echo "Training samples: $TRAIN_COUNT"

# Step 2: Run training
echo ""
echo "[2/3] Starting training..."
echo "Output: ./output/exp_a_json"
echo ""

# NOTE(review): the banner above (and the companion train_exp_a.sh) says the
# end marker is <|endofex|>, but this invocation passes '"}'. Confirm which
# marker EXP-A is supposed to train with.
python scripts/train_experiment.py \
    --experiment_name "exp_a_json" \
    --train_file ./data/experiments/exp_a_json/train.csv \
    --validation_file ./data/experiments/exp_a_json/validation.csv \
    --output_dir ./output/exp_a_json \
    --json_format \
    --end_marker '"}' \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --block_size 256 \
    --fp16 \
    --wandb_project seriguela_experiments \
    --wandb_run_name "exp_a_json_$(date +%Y%m%d_%H%M%S)"

# Step 3: Evaluate
echo ""
echo "[3/3] Evaluating model..."
echo ""

python scripts/evaluate_experiments.py \
    --model_path ./output/exp_a_json \
    --experiment_type json \
    --num_samples 200 \
    --output_file ./output/exp_a_json/evaluation_results.json

echo ""
echo "=============================================="
echo "EXP-A Complete!"
echo "=============================================="
echo "Finished: $(date)"
echo "Model: ./output/exp_a_json"
echo "Results: ./output/exp_a_json/evaluation_results.json"

# Create completion marker (polled by the launch/monitor scripts)
touch /home/ubuntu/.exp_a_complete
scripts/aws/setup_and_train_exp_b.sh ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Complete setup and training script for EXP-B (EOS format)
# Run this on a fresh AWS instance
# Pipeline: prepare data -> train -> evaluate, then drop a completion marker.

set -e

echo "=============================================="
echo "EXP-B: Complete Setup and Training"
echo "EOS Format with <|endoftext|> marker"
echo "=============================================="
echo "Started: $(date)"
echo ""

cd /home/ubuntu/seriguela

# Activate environment
source venv/bin/activate

# Robustness: warn early about missing credentials. Training logs to wandb and
# data preparation downloads from the HuggingFace Hub, and a missing token
# would otherwise only surface mid-run.
if [ -z "${HF_TOKEN:-}" ]; then
    echo "WARNING: HF_TOKEN is not set; HuggingFace Hub access may fail."
fi
if [ -z "${WANDB_API_KEY:-}" ]; then
    echo "WARNING: WANDB_API_KEY is not set; wandb logging may fail."
fi

# Step 1: Prepare data
echo "[1/3] Preparing training data..."
echo "This will download from HuggingFace Hub and convert to EOS format"
echo ""

mkdir -p data/experiments

python scripts/data/prepare_experiment_data.py \
    --dataset_repo_id augustocsc/sintetico_natural \
    --data_dir 700K \
    --data_column i_prompt_n \
    --output_base_dir ./data/experiments

# Verify data
if [ ! -f "./data/experiments/exp_b_eos/train.csv" ]; then
    echo "ERROR: Data preparation failed!"
    exit 1
fi

# NOTE(review): `wc -l` counts a CSV header row too, so this over-reports by
# one if train.csv has a header — confirm against prepare_experiment_data.py.
TRAIN_COUNT=$(wc -l < ./data/experiments/exp_b_eos/train.csv)
echo "Training samples: $TRAIN_COUNT"

# Step 2: Run training
echo ""
echo "[2/3] Starting training..."
echo "Output: ./output/exp_b_eos"
echo ""

python scripts/train_experiment.py \
    --experiment_name "exp_b_eos" \
    --train_file ./data/experiments/exp_b_eos/train.csv \
    --validation_file ./data/experiments/exp_b_eos/validation.csv \
    --output_dir ./output/exp_b_eos \
    --end_marker "<|endoftext|>" \
    --use_native_eos \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --block_size 128 \
    --fp16 \
    --wandb_project seriguela_experiments \
    --wandb_run_name "exp_b_eos_$(date +%Y%m%d_%H%M%S)"

# Step 3: Evaluate
echo ""
echo "[3/3] Evaluating model..."
echo ""

python scripts/evaluate_experiments.py \
    --model_path ./output/exp_b_eos \
    --experiment_type eos \
    --num_samples 200 \
    --output_file ./output/exp_b_eos/evaluation_results.json

echo ""
echo "=============================================="
echo "EXP-B Complete!"
echo "=============================================="
echo "Finished: $(date)"
echo "Model: ./output/exp_b_eos"
echo "Results: ./output/exp_b_eos/evaluation_results.json"

# Create completion marker (polled by the launch/monitor scripts)
touch /home/ubuntu/.exp_b_complete
scripts/aws/setup_aws.sh ADDED
@@ -0,0 +1,87 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Setup script for AWS g5.xlarge instance (Deep Learning AMI Ubuntu)
# Project: Seriguela - GPT-2 Fine-tuning for Symbolic Regression
# Optimized for faster setup
#
# Steps: verify GPU -> install minimal system packages -> clone/update the
# repo -> create venv -> install Python deps -> sanity-check the stack.

set -e

echo "=========================================="
echo "Seriguela AWS Setup Script (Optimized)"
echo "=========================================="

# Terminal colors
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'

log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
log_warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
log_fail() { echo -e "${RED}[ERROR]${NC} $1"; }

# Configuration
GIT_REPO="https://github.com/augustocsc/seriguela.git"
TARGET_DIR="$HOME/seriguela"
PY_BIN="python3"

# A GPU is mandatory for this workflow; bail out immediately without one.
log_info "Checking GPU..."
if ! nvidia-smi &>/dev/null; then
    log_fail "GPU not detected!"
    exit 1
fi
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# Minimal system packages only (the Deep Learning AMI already has the rest).
log_info "Installing system dependencies..."
sudo apt-get update -qq
sudo apt-get install -y -qq python3-venv python3-pip git htop

# Fresh instance: clone; otherwise just pull the latest changes.
if [ ! -d "$TARGET_DIR" ]; then
    log_info "Cloning repository..."
    git clone "$GIT_REPO" "$TARGET_DIR"
else
    log_info "Updating repository..."
    cd "$TARGET_DIR" && git pull
fi
cd "$TARGET_DIR"

# Dedicated virtual environment for the project.
log_info "Setting up virtual environment..."
$PY_BIN -m venv venv
source venv/bin/activate

# Install everything in one pass (CUDA 12.1 wheels for torch).
log_info "Installing all dependencies (this may take a few minutes)..."
pip install --upgrade pip -q
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121 -q

# Sanity-check the ML stack: versions, CUDA visibility, GPU name/memory.
log_info "Verifying installation..."
python -c "
import torch
import transformers
import peft
print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
print(f'Transformers: {transformers.__version__}')
print(f'PEFT: {peft.__version__}')
"

echo ""
echo "=========================================="
echo -e "${GREEN}Setup Complete!${NC}"
echo "=========================================="
echo ""
echo "Next: Configure tokens in .env file:"
echo "  echo 'HF_TOKEN=your_token' > .env"
echo "  echo 'WANDB_API_KEY=your_key' >> .env"
echo ""
echo "Then run training:"
echo "  source venv/bin/activate"
echo "  bash scripts/aws/run_all_training.sh --test-only"
echo ""
scripts/aws/train_exp_a.sh ADDED
@@ -0,0 +1,57 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# EXP-A: Training with JSON structured format
# Uses <|endofex|> as end marker
# Assumes data was already prepared under ./data/experiments/exp_a_json/.

set -e

echo "=============================================="
echo "EXP-A: JSON Format Training"
echo "=============================================="

cd ~/seriguela

# Activate virtual environment
source venv/bin/activate

# Check data exists
if [ ! -f "./data/experiments/exp_a_json/train.csv" ]; then
    echo "ERROR: Training data not found!"
    echo "Expected: ./data/experiments/exp_a_json/train.csv"
    exit 1
fi

# Count samples
# NOTE(review): `wc -l` counts a CSV header row too, if present.
TRAIN_COUNT=$(wc -l < ./data/experiments/exp_a_json/train.csv)
echo "Training samples: $TRAIN_COUNT"

# Training configuration
export WANDB_PROJECT="seriguela_experiments"
export HF_TOKEN="${HF_TOKEN:-}"
export WANDB_API_KEY="${WANDB_API_KEY:-}"

# FIX: the exports above silently export empty strings when the tokens are
# unset, which masks the problem until wandb/Hub access fails mid-run.
# Surface it up front instead.
if [ -z "$HF_TOKEN" ]; then
    echo "WARNING: HF_TOKEN is not set; HuggingFace Hub access may fail."
fi
if [ -z "$WANDB_API_KEY" ]; then
    echo "WARNING: WANDB_API_KEY is not set; wandb logging may fail."
fi

# Run training
echo ""
echo "Starting training..."
echo "Output: ./output/exp_a_json"
echo ""

python scripts/train_experiment.py \
    --experiment_name "exp_a_json" \
    --train_file ./data/experiments/exp_a_json/train.csv \
    --validation_file ./data/experiments/exp_a_json/validation.csv \
    --output_dir ./output/exp_a_json \
    --end_marker "<|endofex|>" \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --block_size 256 \
    --fp16 \
    --wandb_project seriguela_experiments \
    --wandb_run_name "exp_a_json_$(date +%Y%m%d_%H%M%S)"

echo ""
echo "=============================================="
echo "EXP-A Training Complete!"
echo "=============================================="
echo "Model saved to: ./output/exp_a_json"
scripts/aws/train_exp_b.sh ADDED
@@ -0,0 +1,58 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# EXP-B: Training with GPT-2 EOS token (<|endoftext|>)
# Uses native GPT-2 EOS token (ID 50256)
# Assumes data was already prepared under ./data/experiments/exp_b_eos/.

set -e

echo "=============================================="
echo "EXP-B: EOS Token Format Training"
echo "=============================================="

cd ~/seriguela

# Activate virtual environment
source venv/bin/activate

# Check data exists
if [ ! -f "./data/experiments/exp_b_eos/train.csv" ]; then
    echo "ERROR: Training data not found!"
    echo "Expected: ./data/experiments/exp_b_eos/train.csv"
    exit 1
fi

# Count samples
# NOTE(review): `wc -l` counts a CSV header row too, if present.
TRAIN_COUNT=$(wc -l < ./data/experiments/exp_b_eos/train.csv)
echo "Training samples: $TRAIN_COUNT"

# Training configuration
export WANDB_PROJECT="seriguela_experiments"
export HF_TOKEN="${HF_TOKEN:-}"
export WANDB_API_KEY="${WANDB_API_KEY:-}"

# FIX: the exports above silently export empty strings when the tokens are
# unset, which masks the problem until wandb/Hub access fails mid-run.
# Surface it up front instead.
if [ -z "$HF_TOKEN" ]; then
    echo "WARNING: HF_TOKEN is not set; HuggingFace Hub access may fail."
fi
if [ -z "$WANDB_API_KEY" ]; then
    echo "WARNING: WANDB_API_KEY is not set; wandb logging may fail."
fi

# Run training
echo ""
echo "Starting training..."
echo "Output: ./output/exp_b_eos"
echo ""

python scripts/train_experiment.py \
    --experiment_name "exp_b_eos" \
    --train_file ./data/experiments/exp_b_eos/train.csv \
    --validation_file ./data/experiments/exp_b_eos/validation.csv \
    --output_dir ./output/exp_b_eos \
    --end_marker "<|endoftext|>" \
    --use_native_eos \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --block_size 128 \
    --fp16 \
    --wandb_project seriguela_experiments \
    --wandb_run_name "exp_b_eos_$(date +%Y%m%d_%H%M%S)"

echo ""
echo "=============================================="
echo "EXP-B Training Complete!"
echo "=============================================="
echo "Model saved to: ./output/exp_b_eos"
scripts/aws/train_fixed_model.sh ADDED
@@ -0,0 +1,144 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Train model with proper end-of-expression markers
# This script retrains the Seriguela model with <|endofex|> markers in the training data
# so the model learns to stop generation correctly.
#
# FIX: the original checked `$?` after each python step while running under
# `set -e`, so every one of those checks was dead code — on failure the script
# aborted before reaching them, and the generation step (which was meant to be
# non-fatal) would abort the script instead. Steps now run inside `if !`.

set -e  # Exit on error

echo "================================================================"
echo "SERIGUELA - Training Model with Proper End Markers"
echo "================================================================"

# Configuration
MODEL_NAME="gpt2"
DATASET_REPO="augustocsc/sintetico_natural"
DATA_DIR="700K"
DATA_COLUMN="i_prompt_n"  # or p_prompt_n for prefix
OUTPUT_DIR="./output/Se124M_700K_infix_v2"
HUB_MODEL_ID="augustocsc/Se124M_700K_infix_v2"  # NEW REPO NAME

# Hyperparameters
EPOCHS=3
BATCH_SIZE=8
LEARNING_RATE=5e-5
BLOCK_SIZE=128
LORA_R=8
LORA_ALPHA=32
LORA_DROPOUT=0.05

echo ""
echo "Configuration:"
echo "  Model: $MODEL_NAME"
echo "  Dataset: $DATASET_REPO/$DATA_DIR"
echo "  Data Column: $DATA_COLUMN"
echo "  Output: $OUTPUT_DIR"
echo "  Hub Model: $HUB_MODEL_ID"
echo ""
echo "Hyperparameters:"
echo "  Epochs: $EPOCHS"
echo "  Batch Size: $BATCH_SIZE"
echo "  Learning Rate: $LEARNING_RATE"
echo "  Block Size: $BLOCK_SIZE"
echo "  LoRA r: $LORA_R"
echo "  LoRA alpha: $LORA_ALPHA"
echo "  LoRA dropout: $LORA_DROPOUT"
echo "================================================================"

# Step 1: prepare the data only if it is not already on disk.
echo ""
echo "[Step 1/3] Checking data preparation..."
if [ ! -f "./data/processed/700K_fixed/train_700K.csv" ]; then
    echo "Training data not found. Preparing data with end markers..."

    if ! python scripts/data/prepare_training_data_fixed.py \
        --dataset_repo_id $DATASET_REPO \
        --data_dir $DATA_DIR \
        --data_column $DATA_COLUMN \
        --output_dir ./data/processed/700K_fixed \
        --validate; then
        echo "❌ Data preparation failed!"
        exit 1
    fi

    echo "✅ Data preparation complete!"
else
    echo "✅ Training data already prepared (./data/processed/700K_fixed/)"
fi

# Optional: Show sample of prepared data
echo ""
echo "Sample of prepared data:"
head -n 2 ./data/processed/700K_fixed/train_700K.csv
echo ""

# Step 2: training — a failure here is fatal.
echo ""
echo "[Step 2/3] Starting training..."
echo "================================================================"
echo ""

if ! python scripts/train.py \
    --model_name_or_path $MODEL_NAME \
    --dataset_repo_id $DATASET_REPO \
    --data_dir $DATA_DIR \
    --data_column $DATA_COLUMN \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs $EPOCHS \
    --per_device_train_batch_size $BATCH_SIZE \
    --learning_rate $LEARNING_RATE \
    --block_size $BLOCK_SIZE \
    --eval_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 2 \
    --load_best_model_at_end \
    --lora_r $LORA_R \
    --lora_alpha $LORA_ALPHA \
    --lora_dropout $LORA_DROPOUT \
    --push_to_hub \
    --hub_model_id $HUB_MODEL_ID \
    --logging_steps 100 \
    --seed 42; then
    echo ""
    echo "❌ Training failed!"
    exit 1
fi

echo ""
echo "✅ Training complete!"

# Step 3: smoke-test generation — failure here is a warning, not fatal,
# because the trained model is already saved and pushed.
echo ""
echo "[Step 3/3] Testing model generation..."
echo "================================================================"
echo ""

if ! python scripts/generate.py \
    --model_path $OUTPUT_DIR \
    --num_generations 5 \
    --validate; then
    echo ""
    echo "⚠️ Generation test failed, but model was trained successfully"
else
    echo ""
    echo "✅ Generation test passed!"
fi

# Summary
echo ""
echo "================================================================"
echo "TRAINING COMPLETE"
echo "================================================================"
echo "Model saved to: $OUTPUT_DIR"
echo "Model pushed to: $HUB_MODEL_ID"
echo ""
echo "Next steps:"
echo "  1. Evaluate the model: python scripts/evaluate.py --model_path $OUTPUT_DIR"
echo "  2. Compare with old model: python scripts/compare_models.py --model1 ./output/Se124M_700K_infix --model2 $OUTPUT_DIR"
echo "  3. Generate more samples: python scripts/generate.py --model_path $OUTPUT_DIR --num_generations 20"
echo "================================================================"
scripts/aws/train_v3_model.sh ADDED
@@ -0,0 +1,144 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
#!/bin/bash
# Training script for v3 model with proper end markers
# This script is designed to be run on AWS EC2 instances with GPU

set -e  # Exit on error (suspended around the training run below so its exit code can be reported)

echo "=================================================="
echo "Seriguela v3 Model Training"
echo "=================================================="
echo "Start time: $(date)"
echo ""

# Configuration
PROJECT_DIR="${HOME}/seriguela"
OUTPUT_DIR="${PROJECT_DIR}/output/Se124M_700K_infix_v3"
CONFIG_FILE="${PROJECT_DIR}/configs/training_v3.json"
DATA_DIR="${PROJECT_DIR}/data/processed/700K_fixed"

# Check if running in project directory
if [ ! -d "$PROJECT_DIR" ]; then
    echo "ERROR: Project directory not found: $PROJECT_DIR"
    exit 1
fi

cd "$PROJECT_DIR"

# Activate virtual environment
echo "Activating virtual environment..."
if [ -d "venv" ]; then
    source venv/bin/activate
elif [ -d ".seriguela" ]; then
    source .seriguela/bin/activate
else
    echo "ERROR: Virtual environment not found!"
    exit 1
fi

# Verify GPU availability
echo ""
echo "Checking GPU availability..."
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}'); print(f'GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"

if ! python -c "import torch; exit(0 if torch.cuda.is_available() else 1)"; then
    echo "WARNING: GPU not detected! Training will be slow on CPU."
    read -p "Continue anyway? (y/n) " -n 1 -r
    echo
    if [[ ! $REPLY =~ ^[Yy]$ ]]; then
        exit 1
    fi
fi

# Verify data files exist
echo ""
echo "Verifying training data..."
if [ ! -f "$DATA_DIR/train_700K.csv" ]; then
    echo "ERROR: Training data not found: $DATA_DIR/train_700K.csv"
    echo "Please ensure data preparation step was completed."
    exit 1
fi

if [ ! -f "$DATA_DIR/validation_700K.csv" ]; then
    echo "ERROR: Validation data not found: $DATA_DIR/validation_700K.csv"
    exit 1
fi

# Check for end markers in data (only the first 100 rows are sampled for speed;
# `|| true` keeps grep's non-zero "no match" status from killing the script under set -e)
echo "Checking for end markers in training data..."
MARKER_COUNT=$(head -100 "$DATA_DIR/train_700K.csv" | grep -c "<|endofex|>" || true)
if [ "$MARKER_COUNT" -eq 0 ]; then
    echo "ERROR: No <|endofex|> markers found in training data!"
    echo "Please run data preparation script first."
    exit 1
else
    echo "✓ End markers detected in training data"
fi

# Verify config file exists
if [ ! -f "$CONFIG_FILE" ]; then
    echo "ERROR: Config file not found: $CONFIG_FILE"
    exit 1
fi

echo ""
echo "Configuration:"
echo " Config file: $CONFIG_FILE"
echo " Output directory: $OUTPUT_DIR"
echo " Training data: $DATA_DIR/train_700K.csv"
echo " Validation data: $DATA_DIR/validation_700K.csv"
echo ""

# Create output directory
mkdir -p "$OUTPUT_DIR"

# Set environment variables
export WANDB_PROJECT="seriguela_v3"
export WANDB_RUN_NAME="v3_proper_markers_$(date +%Y%m%d_%H%M%S)"

# Check if wandb is configured
# NOTE(review): this relies on `wandb.api.api_key` failing when no key is
# configured — confirm against the installed wandb version.
if ! python -c "import wandb; wandb.api.api_key" 2>/dev/null; then
    echo "WARNING: Weights & Biases not configured. Training will proceed without W&B logging."
    echo "To enable W&B: wandb login"
fi

# Start training
echo ""
echo "=================================================="
echo "Starting training..."
echo "=================================================="
echo ""

# Run training with config file.
# BUGFIX: with `set -e` active, a failing training run aborted the script on
# this command, so TRAIN_EXIT_CODE was never captured and the failure branch
# below was unreachable. Suspend -e only for this command.
set +e
python scripts/train.py \
    --config "$CONFIG_FILE" \
    --output_dir "$OUTPUT_DIR" \
    --use_local_csvs \
    --train_file "$DATA_DIR/train_700K.csv" \
    --validation_file "$DATA_DIR/validation_700K.csv" \
    --wandb_project seriguela_v3 \
    --wandb_run_name "$WANDB_RUN_NAME"

TRAIN_EXIT_CODE=$?
set -e

echo ""
echo "=================================================="
echo "Training completed"
echo "=================================================="
echo "End time: $(date)"
echo "Exit code: $TRAIN_EXIT_CODE"
echo ""

if [ $TRAIN_EXIT_CODE -eq 0 ]; then
    echo "✓ Training completed successfully!"
    echo ""
    echo "Model saved to: $OUTPUT_DIR"
    echo ""
    echo "Next steps:"
    echo "1. Run evaluation: python scripts/evaluate.py --model_path $OUTPUT_DIR"
    echo "2. Test generation: python scripts/generate.py --model_path $OUTPUT_DIR --num_generations 50 --validate"
    echo "3. Push to Hub (if configured): huggingface-cli upload augustocsc/Se124M_700K_infix_v3 $OUTPUT_DIR"
else
    echo "✗ Training failed with exit code $TRAIN_EXIT_CODE"
    echo "Check logs above for error details."
    exit $TRAIN_EXIT_CODE
fi
scripts/aws/validate_setup.sh ADDED
@@ -0,0 +1,285 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
#!/bin/bash
# Validate Seriguela Training Setup
# This script validates that everything is configured correctly before training
# Usage: ./validate_setup.sh

# BUGFIX: do NOT use `set -e` here. This script is designed to run every
# check, count failures in $ERRORS and print a summary at the end; with
# `set -e` the first unguarded failing command (e.g. a broken
# `source venv/bin/activate`) aborts the script before the final report.

# ANSI colors for the status helpers below
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

# Small helpers for consistent, colorized status lines
print_success() { echo -e "${GREEN}✅${NC} $1"; }
print_error() { echo -e "${RED}❌${NC} $1"; }
print_warning() { echo -e "${YELLOW}⚠️${NC} $1"; }
print_header() { echo -e "\n${BLUE}========== $1 ==========${NC}"; }

# Running count of fatal problems; reported in the final summary
ERRORS=0

print_header "Seriguela Setup Validation"

# Change to project directory (fall back to the current directory when
# neither expected location exists)
if [ -d "/home/ubuntu/seriguela" ]; then
    cd /home/ubuntu/seriguela
elif [ -d "$(pwd)/seriguela" ]; then
    cd seriguela
else
    cd .
fi
31
+
print_header "1. Python Environment"

# Python interpreter present?
if python3 --version &> /dev/null; then
    PYTHON_VERSION=$(python3 --version)
    print_success "Python installed: $PYTHON_VERSION"
else
    print_error "Python not found"
    ERRORS=$((ERRORS + 1))
fi

# Project virtual environment present? Activate it so the package checks
# below run against the project's interpreter.
if [ -d "venv" ]; then
    print_success "Virtual environment exists"
    source venv/bin/activate
else
    print_error "Virtual environment not found"
    ERRORS=$((ERRORS + 1))
fi

# pip present?
if pip --version &> /dev/null; then
    PIP_VERSION=$(pip --version | cut -d' ' -f2)
    print_success "pip version: $PIP_VERSION"
else
    print_error "pip not found"
    ERRORS=$((ERRORS + 1))
fi

print_header "2. Python Packages"

# Each entry is "<import name>:<human readable description>"
PACKAGES=(
    "transformers:Hugging Face Transformers"
    "torch:PyTorch"
    "wandb:Weights & Biases"
    "peft:Parameter-Efficient Fine-Tuning"
    "datasets:Hugging Face Datasets"
)

for pkg_info in "${PACKAGES[@]}"; do
    pkg_name="${pkg_info%%:*}"
    pkg_desc="${pkg_info#*:}"

    if python3 -c "import $pkg_name" &> /dev/null; then
        VERSION=$(python3 -c "import $pkg_name; print($pkg_name.__version__)" 2>/dev/null || echo "unknown")
        print_success "$pkg_desc ($pkg_name) - version $VERSION"
    else
        print_error "$pkg_desc ($pkg_name) not installed"
        ERRORS=$((ERRORS + 1))
    fi
done

# Wandb needs a recent release to understand the new API-key format
WANDB_VERSION=$(python3 -c "import wandb; print(wandb.__version__)" 2>/dev/null || echo "0.0.0")
REQUIRED_VERSION="0.24.0"

if python3 -c "import sys; from packaging import version; sys.exit(0 if version.parse('$WANDB_VERSION') >= version.parse('$REQUIRED_VERSION') else 1)"; then
    print_success "Wandb version $WANDB_VERSION (>= $REQUIRED_VERSION required)"
else
    print_warning "Wandb version $WANDB_VERSION is older than recommended $REQUIRED_VERSION"
    print_warning "New API key format (wandb_v1_...) requires Wandb >= 0.24.0"
fi

print_header "3. Environment Variables"

# Load .env if it exists.
# BUGFIX: the previous `source <(grep -v '^#' .env | sed 's/^/export /')`
# pipeline turned blank lines into a bare `export` (which dumps the whole
# export list to stdout) and corrupted lines that already start with
# `export`. `set -a` (allexport) makes the shell export every assignment
# made while sourcing, with native handling of comments and quoting.
if [ -f ".env" ]; then
    set -a
    source .env
    set +a
    print_success ".env file loaded"
else
    print_warning ".env file not found"
fi

# Check HF_TOKEN (optional: only needed to push to the Hub)
if [ -n "$HF_TOKEN" ]; then
    TOKEN_LEN=${#HF_TOKEN}
    print_success "HF_TOKEN set ($TOKEN_LEN characters)"
else
    print_warning "HF_TOKEN not set (model won't be pushed to Hub)"
fi

# Check WANDB_API_KEY (required)
if [ -n "$WANDB_API_KEY" ]; then
    KEY_LEN=${#WANDB_API_KEY}
    print_success "WANDB_API_KEY set ($KEY_LEN characters)"
else
    print_error "WANDB_API_KEY not set"
    ERRORS=$((ERRORS + 1))
fi

print_header "4. GPU / CUDA"

# Driver-level check via nvidia-smi
if nvidia-smi &> /dev/null; then
    GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)
    GPU_MEMORY=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader | head -1)
    print_success "GPU detected: $GPU_NAME ($GPU_MEMORY)"
else
    print_error "GPU not detected (nvidia-smi failed)"
    ERRORS=$((ERRORS + 1))
fi

# Framework-level check: is CUDA visible from PyTorch?
if python3 -c "import torch; assert torch.cuda.is_available()" &> /dev/null; then
    CUDA_VERSION=$(python3 -c "import torch; print(torch.version.cuda)")
    GPU_COUNT=$(python3 -c "import torch; print(torch.cuda.device_count())")
    print_success "CUDA available: version $CUDA_VERSION ($GPU_COUNT GPU(s))"
else
    print_error "CUDA not available in PyTorch"
    ERRORS=$((ERRORS + 1))
fi

print_header "5. Wandb Authentication"

if [ -n "$WANDB_API_KEY" ]; then
    # Heredoc is intentionally unquoted so $WANDB_API_KEY is interpolated.
    if python3 << WANDBCHECK
import wandb
import sys
try:
    result = wandb.login(key="$WANDB_API_KEY", relogin=True)
    if result:
        print("Login successful")
        sys.exit(0)
    else:
        print("Login failed")
        sys.exit(1)
except Exception as e:
    print(f"Error: {e}")
    sys.exit(1)
WANDBCHECK
    then
        print_success "Wandb authentication successful"

        # Get user info (best effort; falls back to "unknown")
        WANDB_USER=$(python3 << 'GETUSER'
import wandb
try:
    api = wandb.Api()
    print(api.viewer.get("username", "unknown"))
except:
    print("unknown")
GETUSER
        )
        print_success "Logged in as: $WANDB_USER"
    else
        print_error "Wandb authentication failed"
        ERRORS=$((ERRORS + 1))
    fi
else
    print_warning "Skipping Wandb auth (no API key)"
fi

print_header "6. HuggingFace Authentication"

if [ -n "$HF_TOKEN" ]; then
    # Heredoc is intentionally unquoted so $HF_TOKEN is interpolated.
    if python3 << HFCHECK
from huggingface_hub import HfApi
import sys
try:
    api = HfApi(token="$HF_TOKEN")
    user = api.whoami()
    print(f"Login successful: {user.get('name', 'unknown')}")
    sys.exit(0)
except Exception as e:
    print(f"Error: {e}")
    sys.exit(1)
HFCHECK
    then
        print_success "HuggingFace authentication successful"
    else
        print_error "HuggingFace authentication failed"
        ERRORS=$((ERRORS + 1))
    fi
else
    print_warning "Skipping HF auth (no token)"
fi

print_header "7. Dataset Access"

# Test dataset loading (streaming mode avoids downloading the data)
if python3 << DATASETCHECK
from datasets import load_dataset
import sys
try:
    # Quick test load (just get info, don't download)
    ds = load_dataset("augustocsc/sintetico_natural", split="train", streaming=True)
    print("Dataset accessible")
    sys.exit(0)
except Exception as e:
    print(f"Error: {e}")
    sys.exit(1)
DATASETCHECK
then
    print_success "Dataset accessible: augustocsc/sintetico_natural"
else
    print_warning "Could not verify dataset access (may require authentication)"
fi

print_header "8. Scripts"

SCRIPTS=(
    "scripts/train.py"
    "scripts/evaluate.py"
    "scripts/generate.py"
    "scripts/aws/monitor_training_auto.sh"
    "scripts/aws/analyze_model.sh"
)

for script in "${SCRIPTS[@]}"; do
    if [ -f "$script" ]; then
        print_success "$script exists"
    else
        print_warning "$script not found"
    fi
done

# Final summary
print_header "Validation Summary"
echo ""

if [ $ERRORS -eq 0 ]; then
    echo -e "${GREEN}╔══════════════════════════════════════╗${NC}"
    echo -e "${GREEN}║ ║${NC}"
    echo -e "${GREEN}║ ✅ ALL VALIDATIONS PASSED ✅ ║${NC}"
    echo -e "${GREEN}║ ║${NC}"
    echo -e "${GREEN}║ Ready for training! 🚀 ║${NC}"
    echo -e "${GREEN}║ ║${NC}"
    echo -e "${GREEN}╚══════════════════════════════════════╝${NC}"
    echo ""
    echo "You can now run:"
    echo " python scripts/train.py --help"
    echo " bash scripts/aws/run_all_training.sh"
    echo ""
    exit 0
else
    echo -e "${RED}╔══════════════════════════════════════╗${NC}"
    echo -e "${RED}║ ║${NC}"
    echo -e "${RED}║ ❌ VALIDATION FAILED ❌ ║${NC}"
    echo -e "${RED}║ ║${NC}"
    echo -e "${RED}║ $ERRORS error(s) found ║${NC}"
    echo -e "${RED}║ ║${NC}"
    echo -e "${RED}╚══════════════════════════════════════╝${NC}"
    echo ""
    echo "Please fix the errors above before training."
    echo ""
    exit 1
fi
scripts/compare_models.py ADDED
@@ -0,0 +1,271 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Compare two models: band-aided vs properly trained.
3
+ Evaluates both on same test set and reports metrics.
4
+
5
+ Usage:
6
+ python scripts/compare_models.py \
7
+ --model1 ./output/Se124M_700K_infix \
8
+ --model2 ./output/Se124M_700K_infix_v2 \
9
+ --num_samples 500
10
+ """
11
+
12
+ import argparse
13
+ import json
14
+ import os
15
+ import sys
16
+ from datetime import datetime
17
+
18
+ # Import evaluate_model from evaluate.py
19
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
20
+ from evaluate import evaluate_model
21
+
22
+
def format_metric(value, metric_type):
    """Render *value* as a fixed-width string for the comparison table.

    metric_type selects the layout: "rate" (percentage with one decimal),
    "float" (7.2f), "int" (width-7 integer), anything else (plain width-7).
    """
    if metric_type == "rate":
        return f"{value * 100:5.1f}%"
    if metric_type == "float":
        return f"{value:7.2f}"
    if metric_type == "int":
        return f"{int(value):7d}"
    return f"{value:7}"
33
+
34
+
def print_comparison_table(metrics1, metrics2, model1_name, model2_name):
    """Print a side-by-side metrics table plus an improvement summary.

    metrics1/metrics2 are metric dicts (as produced by evaluate_model);
    absent keys are treated as 0. Prints to stdout, returns None.
    """
    print("\n" + "=" * 80)
    print("COMPARISON RESULTS")
    print("=" * 80)

    # Header row
    print(f"{'Metric':<35} {model1_name:>20} {model2_name:>20}")
    print("-" * 80)

    # (dict key, display label, formatting style)
    metric_specs = [
        ("valid_rate", "Valid Rate", "rate"),
        ("parseable_rate", "Parseable Rate", "rate"),
        ("constraints_met_rate", "Constraints Met", "rate"),
        ("diversity_rate", "Diversity", "rate"),
        ("avg_expression_length", "Avg Expression Length", "float"),
        ("total_samples", "Total Samples", "int"),
        ("total_valid", "Total Valid", "int"),
    ]

    rate_deltas = []
    for key, label, style in metric_specs:
        lhs = metrics1.get(key, 0)
        rhs = metrics2.get(key, 0)
        print(f"{label:<35} {format_metric(lhs, style):>20} {format_metric(rhs, style):>20}")
        # Relative improvement only makes sense for non-zero rate metrics.
        if style == "rate" and lhs > 0:
            rate_deltas.append((label, ((rhs - lhs) / lhs) * 100, rhs - lhs))

    print("=" * 80)

    # Show improvements
    print("\nIMPROVEMENTS (Model 2 vs Model 1):")
    print("-" * 80)

    for label, rel_change, abs_diff in rate_deltas:
        sign = "+" if rel_change > 0 else ""
        abs_sign = "+" if abs_diff > 0 else ""
        print(f"{label:<35} {sign}{rel_change:>6.1f}% ({abs_sign}{abs_diff * 100:>5.1f} pp)")

    print("-" * 80)

    # Verdict based on the absolute change in valid rate
    delta_valid = metrics2.get("valid_rate", 0) - metrics1.get("valid_rate", 0)

    print("\n" + "=" * 80)
    if delta_valid > 0.20:  # >20% improvement
        print(f"🎯 SIGNIFICANT IMPROVEMENT: Model 2 wins by {delta_valid * 100:.1f} percentage points")
        print(" The properly trained model significantly outperforms the band-aided version!")
    elif delta_valid > 0.05:  # >5% improvement
        print(f"✅ IMPROVEMENT: Model 2 wins by {delta_valid * 100:.1f} percentage points")
        print(" The properly trained model shows clear improvement.")
    elif delta_valid > 0:  # Any improvement
        print(f"📈 SLIGHT IMPROVEMENT: Model 2 wins by {delta_valid * 100:.1f} percentage points")
        print(" The properly trained model shows modest improvement.")
    elif delta_valid == 0:
        print("⚖️ TIE: Both models perform equally")
        print(" No significant difference between models.")
    else:
        print(f"⚠️ REGRESSION: Model 1 wins by {-delta_valid * 100:.1f} percentage points")
        print(" The band-aided model performs better - retraining may need adjustment.")

    print("=" * 80)
106
+
107
+
def save_comparison_report(metrics1, metrics2, model1_name, model2_name, output_dir):
    """Write a timestamped JSON report comparing the two metric dicts.

    Creates *output_dir* if needed and returns the path of the written file.
    """
    os.makedirs(output_dir, exist_ok=True)

    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_path = os.path.join(output_dir, f"comparison_{stamp}.json")

    def _delta(key):
        # Difference model2 - model1, treating absent keys as 0.
        return metrics2.get(key, 0) - metrics1.get(key, 0)

    payload = {
        "timestamp": stamp,
        "model1": {"name": model1_name, "metrics": metrics1},
        "model2": {"name": model2_name, "metrics": metrics2},
        "comparison": {
            "valid_rate_diff": _delta("valid_rate"),
            "parseable_rate_diff": _delta("parseable_rate"),
            "constraints_met_diff": _delta("constraints_met_rate"),
            "diversity_diff": _delta("diversity_rate"),
        },
    }

    with open(report_path, "w") as fh:
        json.dump(payload, fh, indent=2)

    print(f"\n📄 Detailed comparison report saved to: {report_path}")
    return report_path
138
+
139
+
def compare_models(model1_path, model2_path, model1_name, model2_name,
                   num_samples=500, dataset_repo_id="augustocsc/sintetico_natural",
                   data_dir="700K", data_column="i_prompt_n", output_dir="./evaluation_results/comparison"):
    """Evaluate both models on the same test set and print/save a comparison.

    Returns the tuple (metrics1, metrics2) as produced by evaluate_model.
    Exits the process with status 1 if either evaluation raises.
    """

    print("=" * 80)
    print("MODEL COMPARISON")
    print("=" * 80)
    print(f"Model 1 ({model1_name}): {model1_path}")
    print(f"Model 2 ({model2_name}): {model2_path}")
    print(f"Samples: {num_samples}")
    print(f"Dataset: {dataset_repo_id}/{data_dir}")
    print("=" * 80)

    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    def _evaluate(index, model_path, name, subdir):
        # Build the same argument namespace evaluate.py's CLI would produce,
        # then run the shared evaluation routine on it.
        print(f"\n[{index}/2] Evaluating Model {index}: {name}")
        print("-" * 80)

        eval_args = argparse.Namespace(
            model_path=model_path,
            base_model=None,
            dataset_repo_id=dataset_repo_id,
            data_dir=data_dir,
            data_column=data_column,
            num_samples=num_samples,
            num_generations=1,
            max_new_tokens=128,
            temperature=0.7,
            top_p=0.9,
            output_dir=os.path.join(output_dir, subdir),
            seed=42,
            device="auto",
        )

        try:
            return evaluate_model(eval_args)
        except Exception as exc:
            print(f"\n❌ Error evaluating Model {index}: {exc}")
            import traceback
            traceback.print_exc()
            sys.exit(1)

    metrics1 = _evaluate(1, model1_path, model1_name, "model1")
    metrics2 = _evaluate(2, model2_path, model2_name, "model2")

    # Print comparison
    print_comparison_table(metrics1, metrics2, model1_name, model2_name)

    # Save report
    save_comparison_report(metrics1, metrics2, model1_name, model2_name, output_dir)

    return metrics1, metrics2
220
+
221
+
def main():
    """CLI entry point: parse arguments, run the comparison, exit 1 on failure."""
    parser = argparse.ArgumentParser(
        description="Compare two models on the same test set"
    )
    parser.add_argument("--model1", type=str, required=True,
                        help="Path to first model (band-aided)")
    parser.add_argument("--model2", type=str, required=True,
                        help="Path to second model (properly trained)")
    parser.add_argument("--model1_name", type=str, default="Band-Aided",
                        help="Display name for model 1")
    parser.add_argument("--model2_name", type=str, default="Proper",
                        help="Display name for model 2")
    parser.add_argument("--num_samples", type=int, default=500,
                        help="Number of samples to evaluate")
    parser.add_argument("--dataset_repo_id", type=str, default="augustocsc/sintetico_natural",
                        help="HuggingFace dataset repository")
    parser.add_argument("--data_dir", type=str, default="700K",
                        help="Data directory within dataset")
    parser.add_argument("--data_column", type=str, default="i_prompt_n",
                        help="Column name for prompts")
    parser.add_argument("--output_dir", type=str, default="./evaluation_results/comparison",
                        help="Directory to save comparison results")

    cli = parser.parse_args()

    # Run comparison; any unexpected error is reported and mapped to exit 1.
    try:
        compare_models(
            model1_path=cli.model1,
            model2_path=cli.model2,
            model1_name=cli.model1_name,
            model2_name=cli.model2_name,
            num_samples=cli.num_samples,
            dataset_repo_id=cli.dataset_repo_id,
            data_dir=cli.data_dir,
            data_column=cli.data_column,
            output_dir=cli.output_dir,
        )
        print("\n✅ Comparison complete!")
    except Exception as exc:
        print(f"\n❌ Error during comparison: {exc}")
        import traceback
        traceback.print_exc()
        sys.exit(1)


if __name__ == "__main__":
    main()
scripts/compare_v1_v2_simple.py ADDED
@@ -0,0 +1,240 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Simple comparison of V1 vs V2 model generation quality
4
+ """
5
+
6
+ import sys
7
+ import torch
8
+ from pathlib import Path
9
+ from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria, StoppingCriteriaList
10
+ from peft import PeftModel
11
+
12
+ sys.path.insert(0, str(Path(__file__).parent.parent))
13
+ from classes.expression import Expression
14
+
15
+
class ExpressionStoppingCriteria(StoppingCriteria):
    """Stop generation once any configured stop sequence has just been produced."""

    def __init__(self, tokenizer, stop_sequences):
        self.tokenizer = tokenizer
        # Pre-encode each stop sequence once; compared against the tail of
        # the generated ids on every step.
        self.stop_ids = [
            tokenizer.encode(seq, add_special_tokens=False)
            for seq in stop_sequences
        ]

    def __call__(self, input_ids, scores, **kwargs):
        generated = input_ids[0]
        return any(
            len(ids) > 0
            and len(generated) >= len(ids)
            and generated[-len(ids):].tolist() == ids
            for ids in self.stop_ids
        )
28
+
29
+
def load_model(model_name, model_label):
    """Load base GPT-2, extend its vocab with the expression markers,
    apply the LoRA adapter from *model_name*, merge it, and return
    (model, tokenizer) with the model in eval mode."""
    banner = "=" * 60
    print(f"\n{banner}")
    print(f"Loading {model_label}: {model_name}")
    print(banner)

    # Base GPT-2 in fp16, placed automatically on the available device(s)
    print("Loading base GPT-2...")
    base = AutoModelForCausalLM.from_pretrained(
        "gpt2",
        torch_dtype=torch.float16,
        device_map="auto"
    )

    # Tokenizer with the two expression-boundary special tokens
    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.add_special_tokens(
        {"additional_special_tokens": ["<|startofex|>", "<|endofex|>"]}
    )

    # Vocab grew by two tokens; embedding matrix must match
    base.resize_token_embeddings(len(tok))

    # Apply the adapter and fold it into the base weights
    print(f"Loading adapter from {model_name}...")
    merged = PeftModel.from_pretrained(base, model_name)
    print("Merging adapter...")
    merged = merged.merge_and_unload()
    merged.eval()

    print(f"✓ {model_label} loaded successfully")
    return merged, tok
61
+
62
+
def test_model(model, tokenizer, model_label, n_samples=20):
    """Generate *n_samples* expressions with *model* and score them.

    Two metrics are collected:
      - valid_count: the expression parses and evaluates to finite values
      - correct_symbols_count: output contains no stray letters/garbage words

    Returns a dict with both counts plus per-sample details.
    """
    print(f"\n{'='*60}")
    print(f"Testing {model_label} - {n_samples} generations")
    print('='*60)

    # Same prompt for both models
    prompt = """vars: x_1, x_2
oper: *, +, -, sin, cos
cons: C
expr:"""

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Stop as soon as the end marker (or the start of a new prompt) appears
    stopping_criteria = StoppingCriteriaList([
        ExpressionStoppingCriteria(tokenizer, ["<|endofex|>", "\n\nvars:"])
    ])

    # Use OPTIMAL config for each model (from FINAL_RESULTS_V1_VS_V2.md)
    if model_label == "V1":
        # V1 optimal: 83.3% valid rate
        gen_config = {
            "temperature": 0.5,
            "top_k": 40,
            "top_p": 0.9,
            "repetition_penalty": 1.15,
            "max_new_tokens": 100,
            "do_sample": True,
            "pad_token_id": tokenizer.eos_token_id,
        }
        print("Using V1 optimal config: temp=0.5, top_k=40, rep_penalty=1.15")
    else:  # V2
        # V2 optimal: 90% valid rate
        gen_config = {
            "temperature": 0.7,
            "top_k": 0,
            "top_p": 0.8,
            "repetition_penalty": 1.0,
            "max_new_tokens": 128,
            "do_sample": True,
            "pad_token_id": tokenizer.eos_token_id,
        }
        print("Using V2 optimal config: temp=0.7, top_p=0.8 (nucleus sampling)")

    results = {
        "valid_count": 0,
        "correct_symbols_count": 0,
        "expressions": []
    }

    # (Removed two unused locals `allowed_vars`/`allowed_ops` — dead code;
    # the symbol check below works character-wise.)

    print(f"\nGenerating {n_samples} expressions...\n")

    for i in range(n_samples):
        output = model.generate(
            **inputs,
            **gen_config,
            stopping_criteria=stopping_criteria
        )
        text = tokenizer.decode(output[0], skip_special_tokens=False)

        # Extract the expression between "expr:" and the end marker
        if "expr:" in text:
            expr_str = text.split("expr:")[-1].strip()
            expr_str = expr_str.split("<|endofex|>")[0].strip()
        else:
            expr_str = text

        # Valid = parseable AND evaluates to finite (non-NaN) values.
        is_valid = False
        try:
            expr = Expression(expr_str, is_prefix=False)
            X_test = [[1.0, 2.0]]  # Simple test
            result = expr.evaluate(X_test)
            # `x == x` filters NaN (NaN != NaN)
            if len(result) > 0 and all(x != float('inf') and x != float('-inf') and x == x for x in result):
                is_valid = True
                results["valid_count"] += 1
        except Exception:
            # BUGFIX: was a bare `except:` which also swallowed
            # KeyboardInterrupt/SystemExit.
            pass

        # Check if uses only correct symbols
        has_correct_symbols = True
        # Remove spaces and check tokens
        expr_clean = expr_str.replace(" ", "")
        # Any letter outside the expected alphabet (x, C, sin, cos) is garbage
        for char in expr_clean:
            if char.isalpha() and char not in "xCsinco_":
                has_correct_symbols = False
                break

        # Check for garbage words
        garbage_words = ["Buyable", "Instore", "Online", "Muslims", "crash", "Berman",
                         "vars:", "oper:", "expressed", "fluent", "Avenger", "repositories"]
        for word in garbage_words:
            if word in expr_str:
                has_correct_symbols = False
                break

        if has_correct_symbols:
            results["correct_symbols_count"] += 1

        results["expressions"].append({
            "index": i + 1,
            "expression": expr_str[:80],  # Limit display length
            "valid": is_valid,
            "correct_symbols": has_correct_symbols
        })

        # Show first 5 samples
        if i < 5:
            status = "✓ Valid" if is_valid else "✗ Invalid"
            symbols = "✓ Clean" if has_correct_symbols else "✗ Garbage"
            print(f" [{i+1:2d}] {status:10s} {symbols:10s} | {expr_str[:60]}")

    print(f"\n{'-'*60}")
    print(f"RESULTS FOR {model_label}:")
    print(f" Valid expressions: {results['valid_count']:2d}/{n_samples} ({results['valid_count']/n_samples*100:.1f}%)")
    print(f" Correct symbols only: {results['correct_symbols_count']:2d}/{n_samples} ({results['correct_symbols_count']/n_samples*100:.1f}%)")
    print(f"{'-'*60}")

    return results
186
+
187
+
def main():
    """Run the V1-vs-V2 head-to-head comparison and print a summary table."""
    banner = "=" * 60
    print("\n" + banner)
    print("V1 vs V2 MODEL COMPARISON")
    print(banner)
    print("Testing same prompt on both models")
    print("Measuring: valid expressions + symbol correctness\n")

    # Evaluate V1 first, then free its GPU memory before loading V2
    model_a, tok_a = load_model("augustocsc/Se124M_700K_infix", "V1")
    stats_a = test_model(model_a, tok_a, "V1", n_samples=20)

    del model_a
    torch.cuda.empty_cache()

    model_b, tok_b = load_model("augustocsc/Se124M_700K_infix_v2", "V2")
    stats_b = test_model(model_b, tok_b, "V2", n_samples=20)

    # Final comparison
    print("\n" + banner)
    print("FINAL COMPARISON")
    print(banner)
    print(f"\n{'Metric':<30s} {'V1':>10s} {'V2':>10s} {'Winner':>10s}")
    print("-"*60)

    def _winner(a, b):
        # Tie-aware winner label for one metric
        if a > b:
            return "V1"
        if b > a:
            return "V2"
        return "TIE"

    v1_valid = stats_a["valid_count"]
    v2_valid = stats_b["valid_count"]
    print(f"{'Valid Expressions':<30s} {v1_valid:>10d} {v2_valid:>10d} {_winner(v1_valid, v2_valid):>10s}")

    v1_clean = stats_a["correct_symbols_count"]
    v2_clean = stats_b["correct_symbols_count"]
    print(f"{'Correct Symbols Only':<30s} {v1_clean:>10d} {v2_clean:>10d} {_winner(v1_clean, v2_clean):>10s}")

    print("-"*60)
    print(f"{'Valid Rate':<30s} {v1_valid/20*100:>9.1f}% {v2_valid/20*100:>9.1f}%")
    print(f"{'Clean Symbol Rate':<30s} {v1_clean/20*100:>9.1f}% {v2_clean/20*100:>9.1f}%")
    print(banner)

    # Conclusion
    print("\nConclusion:")
    if v1_valid > v2_valid and v1_clean > v2_clean:
        print(" → V1 is better on both metrics")
    elif v2_valid > v1_valid and v2_clean > v1_clean:
        print(" → V2 is better on both metrics")
    else:
        print(" → Mixed results - models have different strengths")


if __name__ == "__main__":
    main()
scripts/data/data_augmentation.py ADDED
@@ -0,0 +1,63 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # augmentor.py
2
+
3
+ import random
4
+ import re
5
+
6
+ ALL_OPERANDS = ['+', '-', '*', '/', 'log', 'exp', 'cos', 'sqrt', 'asin', 'sin', '**', 'tan', 'abs']
7
+
8
def extract_operators(expr_str):
    """Return the list of operators/functions that actually occur in *expr_str*.

    Function names are matched as whole words so that e.g. 'asin' does not
    also register 'sin', and '**' is stripped before testing for '*' so that
    exponentiation is not mistaken for multiplication.
    """
    ops = set()

    # Function-style operators: whole-word match avoids substring false
    # positives ('asin' contains 'sin', etc.).
    for fn in ('exp', 'log', 'cos', 'sin', 'asin', 'sqrt', 'tan', 'abs'):
        if re.search(r'\b' + fn + r'\b', expr_str):
            ops.add(fn)

    # Detect '**' first, then blank it out so the plain '*' test below only
    # fires for genuine multiplication.
    remainder = expr_str
    if '**' in remainder:
        ops.add('**')
        remainder = remainder.replace('**', ' ')

    for op in ('+', '-', '*', '/'):
        if op in remainder:
            ops.add(op)

    return list(ops)
23
+
24
def infer_max_var(expr_str):
    """Return the highest index i of any variable x_i in *expr_str* (1 if none)."""
    indices = (int(idx) for idx in re.findall(r'x_(\d+)', expr_str))
    return max(indices, default=1)
27
+
28
def generate_expression_instructions(expr_str):
    """Build four prompt-template variants for one expression.

    The variable list and operator list are randomly padded with extra
    entries so a model cannot assume every listed symbol is actually used.
    """
    highest = infer_max_var(expr_str)
    upper = highest + random.randint(1, highest + 1)
    variables = [f"x_{i}" for i in range(1, upper)]

    used_ops = extract_operators(expr_str)
    extra_ops = list(set(ALL_OPERANDS) - set(used_ops))
    if extra_ops:
        added_ops = random.sample(extra_ops, random.randint(1, len(extra_ops)))
    else:
        added_ops = []
    all_ops = sorted(set(used_ops + added_ops))
    constants = ['C']
    wrapped_expr = f"{expr_str}"

    return {
        "Simple_Instruct": f"Instruction: Generate a mathematical expression using variables {variables} and operands {all_ops} and {constants} as constant.\nExpression: {wrapped_expr}",
        "Key_Value": f"Variables: {variables}\nOperands: {all_ops}\nConstant: {constants}\nExpression: {wrapped_expr}",
        "Delimiter_Based": f"Input: Variables={variables}, Operands={all_ops}, Constant={constants}\nOutput: {wrapped_expr}",
        "Minimalist": f"{variables} | {all_ops} | {constants} => {wrapped_expr}"
    }
46
+
47
def generate_expression_instruction(expr_str):
    """Build the single 'vars/oper/cons/expr' training prompt for one expression."""
    highest = infer_max_var(expr_str)
    upper = highest + random.randint(1, highest + 1)
    variables = [f"x_{i}" for i in range(1, upper)]

    used_ops = extract_operators(expr_str)
    extra_ops = list(set(ALL_OPERANDS) - set(used_ops))
    if extra_ops:
        added_ops = random.sample(extra_ops, random.randint(1, len(extra_ops)))
    else:
        added_ops = []
    all_ops = sorted(set(used_ops + added_ops))
    constants = ['C']
    wrapped_expr = f"{expr_str}"

    # NOTE: the key is intentionally kept as 'instriction' (sic) — consumers
    # such as parallel_utils.augment_dataframe_parallel index on this exact key.
    return {
        "instriction": f"vars: {', '.join(variables)}\noper: {', '.join(all_ops)}\ncons: {', '.join(constants)}\nexpr: {wrapped_expr}"
    }
63
+ #print(generate_expression_instruction("x_1 - (x_4 - C)*(x_3 + exp(C*x_2) + C)"))
scripts/data/data_cleaning.py ADDED
@@ -0,0 +1,90 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import re
2
+ import pandas as pd
3
+ import numpy as np
4
+ from sympy import sympify, Eq
5
+ from sympy.parsing.sympy_parser import parse_expr
6
+ from sympy.core.sympify import SympifyError
7
+ from concurrent.futures import ProcessPoolExecutor
8
+ import multiprocessing as mp
9
+ from sympy import simplify, sympify
10
+ from sympy.core.sympify import SympifyError
11
+ import swifter
12
+ import random
13
+
14
+ from joblib import Parallel, delayed
15
+
16
+
17
+ from tqdm.auto import tqdm
18
+
19
def apply_chunk(chunk, func):
    """Map *func* element-wise over one pandas Series chunk and return the result."""
    mapped = chunk.apply(func)
    return mapped
22
+
23
def parallel_apply(series, func, n_jobs=None):
    """Split *series* into n_jobs chunks, map *func* over each chunk in a
    worker process via apply_chunk, and return the concatenated Series."""
    if n_jobs is None:
        n_jobs = mp.cpu_count()
    pieces = np.array_split(series, n_jobs)
    tasks = [(piece, func) for piece in pieces]
    with mp.Pool(n_jobs) as pool:
        mapped = pool.starmap(apply_chunk, tasks)
    return pd.concat(mapped)
32
+
33
def canonicalize_expr(expr, canonicalizer=simplify):
    """Canonicalize *expr* and return (hash(canonical), canonical, original)."""
    canonical = canonicalizer(expr)
    return (hash(canonical), canonical, expr)
36
+
37
def replace_constants(equation):
    """Replace every standalone numeric literal in *equation* with the symbol 'C'.

    Handles integers, floats and scientific notation (e.g. 2e-3, 1.5E+4) —
    the same literal forms augment_expression recognizes — while leaving
    digits that are part of identifiers such as 'x_1' untouched.
    """
    # Lookaround guards keep digits embedded in names (x_1) or dotted
    # sequences intact; only free-standing numbers are rewritten.
    pattern = r'(?<![\w.])(?:[-+]?\d*\.\d+(?:[eE][-+]?\d+)?|\d+(?:[eE][-+]?\d+)?)(?![\w.])'
    return re.sub(pattern, 'C', equation)
41
+
42
+
43
def augment_expression(equation, var_prefix='x', max_index=10, p=0.5):
    """Anonymize constants and randomly remap variable indices.

    Step 1 replaces every standalone numeric literal (including scientific
    notation) with 'C'.  Step 2 rewrites each variable occurrence
    (e.g. x_1) with probability *p* to a random x_1..x_{max_index}.
    """
    numeric_literal = (
        r'(?<![\w.])'
        r'(?:[-+]?\d*\.\d+(?:[eE][-+]?\d+)?|\d+(?:[eE][-+]?\d+)?)'
        r'(?![\w.])'
    )
    anonymized = re.sub(numeric_literal, 'C', equation)

    def remap(match):
        # Keep the original variable unless the coin flip says to swap it.
        if random.random() >= p:
            return match.group(0)
        return f"{var_prefix}_{random.randint(1, max_index)}"

    return re.sub(rf'\b{var_prefix}_\d+\b', remap, anonymized)
62
+
63
+
64
+
65
def is_valid_equation(equation_str):
    """Return True when *equation_str* parses as a valid SymPy expression."""
    if not isinstance(equation_str, str):
        return False
    stripped = equation_str.strip()
    if pd.isna(equation_str) or stripped == '':
        return False

    try:
        # Attempt to parse; we only care whether parsing succeeds.
        parse_expr(stripped)
    except (SympifyError, SyntaxError, ValueError, TypeError, AttributeError):
        print(f"Erro ao analisar a equação: {equation_str}")
        return False
    return True
80
+
81
def canonical_form(expr_str):
    """Return the canonical (simplified then expanded) form of an expression.

    On parse failure the error is reported as a string return value rather
    than an exception, matching how callers consume the result as text.
    """
    try:
        expanded = simplify(expr_str).expand()
    except SympifyError as e:
        return f"Erro ao interpretar a expressão: {expr_str}"
    return str(expanded)
scripts/data/data_processing.py ADDED
@@ -0,0 +1,108 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import sys
3
+ import argparse
4
+ import pandas as pd
5
+ import numpy as np
6
+ import multiprocessing as mp
7
+ from tqdm.auto import tqdm
8
+ from sklearn.model_selection import train_test_split
9
+
10
+ # Adjust import paths for custom modules
11
def setup_paths():
    """Make the project's ../scripts and ../classes directories importable."""
    candidate_dirs = ("../scripts", "../classes")
    for folder in candidate_dirs:
        absolute = os.path.abspath(os.path.join(folder))
        if absolute not in sys.path:
            sys.path.append(absolute)
16
+
17
+ setup_paths()
18
+
19
+ # Local imports after path setup
20
+ import scripts.data.data_cleaning as dc
21
+ from expression import Expression
22
+ from data.parallel_utils import augment_dataframe_parallel
23
+
24
def _apply_chunk(chunk, func):
    """Apply *func* to one Series chunk (module level so it can be pickled)."""
    return chunk.apply(func)

def parallel_apply(series, func, n_jobs=None):
    """Apply a function to a pandas Series in parallel.

    The worker helper must live at module level: multiprocessing pickles the
    callable sent through Pool.starmap, and a function nested inside
    parallel_apply cannot be pickled (the original nested version raised at
    runtime for every call).
    """
    n_jobs = mp.cpu_count() if n_jobs is None else n_jobs
    chunks = np.array_split(series, n_jobs)
    with mp.Pool(n_jobs) as pool:
        results = pool.starmap(_apply_chunk, [(chunk, func) for chunk in chunks])
    return pd.concat(results)
34
+
35
def process_chunk(chunk):
    """Clean and transform a single data chunk.

    Drops rows whose equation failed upstream simplification, anonymizes
    constants/variables, and derives the prefix-notation column.
    """
    chunk = chunk[['eq']]
    # .copy() detaches the filtered slice so the column assignments below do
    # not write through a view (avoids pandas' SettingWithCopyWarning).
    chunk = chunk[~chunk['eq'].str.contains('ERROR_simplify')].copy()
    chunk['eq'] = parallel_apply(chunk['eq'], dc.augment_expression)
    chunk.rename(columns={'eq': 'infix_expr'}, inplace=True)
    chunk['prefix_expr'] = parallel_apply(chunk['infix_expr'], Expression.infix_to_prefix)
    return chunk
43
+
44
def process_file(file_path, chunk_size=100000):
    """Process the CSV file in chunks and return one concatenated DataFrame."""
    processed_chunks = []
    # Count data rows up front (minus header) so tqdm can report progress;
    # use a context manager so the file handle is not leaked.
    with open(file_path) as handle:
        total_rows = sum(1 for _ in handle) - 1
    total_chunks = (total_rows // chunk_size) + 1

    with tqdm(total=total_chunks, desc="Processing chunks") as pbar:
        for chunk in pd.read_csv(file_path, chunksize=chunk_size):
            processed_chunks.append(process_chunk(chunk))
            pbar.update(1)

    return pd.concat(processed_chunks, ignore_index=True)
57
+
58
def augment_df(df):
    """Apply prompt augmentation to both the infix and prefix expression columns."""
    infix_renames = {
        'simple': 'i_simple',
        'key_value': 'i_key_value',
        'delimiter': 'i_delimiter',
        'minimalist': 'i_minimalist',
    }
    prefix_renames = {
        'simple': 'p_simple',
        'key_value': 'p_key_value',
        'delimiter': 'p_delimiter',
        'minimalist': 'p_minimalist',
    }

    df = augment_dataframe_parallel(df, expression_col="infix_expr", n_jobs=4)
    df.rename(columns=infix_renames, inplace=True)

    df = augment_dataframe_parallel(df, expression_col="prefix_expr", n_jobs=4)
    df.rename(columns=prefix_renames, inplace=True)

    return df
77
+
78
def split_and_save(df, base_file_path):
    """Split *df* 70/15/15 into train/val/test and write all four CSVs."""
    train_df, holdout_df = train_test_split(df, test_size=0.3, random_state=42)
    val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)

    file = os.path.basename(base_file_path)
    base_dir = f'../data/processed/{file.replace(".csv", "")}'
    os.makedirs(base_dir, exist_ok=True)

    for prefix, frame in (("train_", train_df), ("val_", val_df), ("test_", test_df)):
        frame.to_csv(os.path.join(base_dir, f"{prefix}{file}"), index=False)
    df.to_csv(os.path.join(base_dir, file), index=False)
91
+
92
def main():
    """CLI entry point: process, dedupe, augment and split a raw equation CSV."""
    parser = argparse.ArgumentParser(description="Process a raw equation CSV file.")
    # nargs='?' makes the positional optional so the default is actually used;
    # argparse silently ignores `default` on a required positional argument.
    parser.add_argument(
        "file_path",
        type=str,
        nargs="?",
        help="Path to the raw CSV file to process.",
        default="../data/raw/13k.csv",
    )
    args = parser.parse_args()

    file_path = args.file_path
    if not os.path.exists(file_path):
        print(f"Error: File not found at {file_path}")
        sys.exit(1)

    df_processed = process_file(file_path)
    df_processed.drop_duplicates(subset=['infix_expr'], inplace=True)
    df_augmented = augment_df(df_processed)
    split_and_save(df_augmented, file_path)
106
+
107
+ if __name__ == '__main__':
108
+ main()
scripts/data/parallel_utils.py ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # parallel_utils.py
2
+
3
+ from joblib import Parallel, delayed
4
+ import pandas as pd
5
+ from .data_augmentation import generate_expression_instructions, generate_expression_instruction
6
+
7
def augment_dataframe_parallel(df, expression_col="expression", n_jobs=-1):
    """
    Parallelized augmentation of a DataFrame of math expressions.

    Args:
        df (pd.DataFrame): DataFrame with a column of expressions.
        expression_col (str): Name of the column with expressions.
        n_jobs (int): Number of parallel workers (-1 = all cores).

    Returns:
        pd.DataFrame: Copy of *df* with a new 'instruction' column.
    """
    worker = delayed(generate_expression_instruction)
    records = Parallel(n_jobs=n_jobs)(
        worker(expr) for expr in df[expression_col].tolist()
    )

    augmented = df.copy()
    # 'instriction' (sic) is the key emitted by generate_expression_instruction.
    augmented["instruction"] = [record["instriction"] for record in records]
    return augmented
scripts/data/prepare_experiment_data.py ADDED
@@ -0,0 +1,513 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Data preparation script for training experiments.
4
+
5
+ Prepares data in two formats:
6
+ - EXP-A: JSON structured format
7
+ - EXP-B: EOS token format (GPT-2's <|endoftext|>)
8
+
9
+ Usage:
10
+ python scripts/data/prepare_experiment_data.py \
11
+ --dataset_repo_id augustocsc/sintetico_natural \
12
+ --data_dir 700K \
13
+ --data_column i_prompt_n \
14
+ --output_base_dir ./data/experiments
15
+ """
16
+
17
+ import argparse
18
+ import json
19
+ import logging
20
+ import re
21
+ import sys
22
+ from pathlib import Path
23
+ from typing import Dict, List, Optional, Tuple
24
+
25
+ from datasets import load_dataset, Dataset, DatasetDict
26
+ import pandas as pd
27
+
28
+ logging.basicConfig(
29
+ level=logging.INFO,
30
+ format='%(asctime)s - %(levelname)s - %(message)s'
31
+ )
32
+ logger = logging.getLogger(__name__)
33
+
34
+
35
def parse_original_format(text: str) -> Optional[Dict]:
    """
    Parse the prompt format (vars/oper/cons/expr lines) into its components.

    Expected input:
        vars: x_1, x_2
        oper: *, +, sin
        cons: C
        expr: C*sin(x_1) + x_2

    Returns:
        Dict with vars, ops, cons, expr and raw_text, or None when no
        expression line could be extracted.
    """
    parsed = {
        'vars': [],
        'ops': [],
        'cons': None,
        'expr': None,
        'raw_text': text
    }

    for raw_line in text.strip().split('\n'):
        line = raw_line.strip()
        if not line:
            continue

        if line.startswith(('vars:', 'Variables:')):
            payload = line.split(':', 1)[1]
            parsed['vars'] = [tok.strip() for tok in payload.split(',') if tok.strip()]
        elif line.startswith(('oper:', 'Operators:')):
            payload = line.split(':', 1)[1]
            parsed['ops'] = [tok.strip() for tok in payload.split(',') if tok.strip()]
        elif line.startswith(('cons:', 'Constants:')):
            payload = line.split(':', 1)[1].strip()
            parsed['cons'] = payload if payload else None
        elif line.startswith('expr:'):
            payload = line.split(':', 1)[1].strip()
            # Drop any trailing special-token markers that leaked into the line.
            payload = payload.split('<|')[0].strip()
            parsed['expr'] = payload

    # An expression is the one mandatory component.
    return parsed if parsed['expr'] else None
93
+
94
+
95
def convert_to_json_format(parsed: Dict) -> str:
    """
    Serialize a parsed record as a single-line JSON string (EXP-A).

    Example output:
        {"vars": ["x_1"], "ops": ["+"], "cons": "C", "expr": "x_1 + C"}

    The 'cons' key is omitted when no constant was parsed.
    """
    payload = {
        'vars': parsed['vars'],
        'ops': parsed['ops'],
    }
    if parsed['cons']:
        payload['cons'] = parsed['cons']
    payload['expr'] = parsed['expr']
    return json.dumps(payload, ensure_ascii=False)
113
+
114
+
115
def convert_to_eos_format(parsed: Dict) -> str:
    """
    Render a parsed record in plain-text form terminated by GPT-2's EOS token (EXP-B).

    Example output:
        vars: x_1, x_2
        oper: *, +
        cons: C
        expr: C*x_1 + x_2<|endoftext|>

    Empty vars/ops lists and a missing constant produce no line at all.
    """
    pieces = []
    if parsed['vars']:
        pieces.append("vars: " + ", ".join(parsed['vars']))
    if parsed['ops']:
        pieces.append("oper: " + ", ".join(parsed['ops']))
    if parsed['cons']:
        pieces.append("cons: " + parsed['cons'])
    # The expression line always closes the sample with the EOS token.
    pieces.append("expr: " + parsed['expr'] + "<|endoftext|>")
    return '\n'.join(pieces)
140
+
141
+
142
def process_example_json(example: Dict) -> Dict:
    """Convert one raw sample to the JSON (EXP-A) format, flagging parse failures."""
    raw = example['text']
    parsed = parse_original_format(raw)

    if parsed is None:
        logger.warning(f"Failed to parse: {raw[:100]}...")
        return {'text': '', 'valid': False}

    return {'text': convert_to_json_format(parsed), 'valid': True}
153
+
154
+
155
def process_example_eos(example: Dict) -> Dict:
    """Convert one raw sample to the EOS (EXP-B) format, flagging parse failures."""
    raw = example['text']
    parsed = parse_original_format(raw)

    if parsed is None:
        logger.warning(f"Failed to parse: {raw[:100]}...")
        return {'text': '', 'valid': False}

    return {'text': convert_to_eos_format(parsed), 'valid': True}
166
+
167
+
168
def validate_json_format(text: str) -> bool:
    """Return True if *text* is a JSON object carrying expr/vars/ops keys.

    Uses targeted exception types instead of a bare ``except:`` (which also
    swallowed KeyboardInterrupt/SystemExit), and requires the decoded value
    to be an object — a JSON *array* containing these strings no longer
    passes.
    """
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError, ValueError):
        return False
    return isinstance(obj, dict) and {'expr', 'vars', 'ops'} <= obj.keys()
175
+
176
+
177
def validate_eos_format(text: str) -> bool:
    """Return True if *text* carries both an 'expr:' field and the EOS token."""
    has_eos = '<|endoftext|>' in text
    has_expr = 'expr:' in text
    return has_eos and has_expr
180
+
181
+
182
def process_dataset(
    dataset_repo_id: str,
    data_dir: str,
    data_column: str,
    output_base_dir: Path,
    max_samples: Optional[int] = None
) -> Dict:
    """
    Process the dataset into both formats.

    Loads the HuggingFace dataset, converts every split to the EXP-A (JSON)
    and EXP-B (EOS) representations, writes one CSV per split per format
    under output_base_dir, and accumulates per-split validity counts.

    Args:
        dataset_repo_id: HuggingFace dataset repository ID
        data_dir: Subdirectory within the dataset
        data_column: Column containing the text data
        output_base_dir: Base directory for output
        max_samples: Optional limit on number of samples (for testing)

    Returns:
        Dictionary with processing statistics
    """
    logger.info(f"Loading dataset from {dataset_repo_id}/{data_dir}...")

    # Load dataset
    dataset = load_dataset(
        dataset_repo_id,
        data_dir=data_dir,
        split=None
    )

    # NOTE(review): DatasetDict subclasses dict, so this wraps only a bare
    # single-split Dataset — presumably the case when the repo has no named
    # splits; confirm against the datasets library version in use.
    if not isinstance(dataset, dict):
        dataset = {'train': dataset}

    logger.info(f"Loaded {len(dataset)} split(s): {list(dataset.keys())}")

    # Show sample
    if 'train' in dataset:
        sample = dataset['train'][0][data_column]
        logger.info(f"\nSample ORIGINAL format:\n{sample}\n")

    # Create output directories
    output_json = output_base_dir / 'exp_a_json'
    output_eos = output_base_dir / 'exp_b_eos'
    output_json.mkdir(parents=True, exist_ok=True)
    output_eos.mkdir(parents=True, exist_ok=True)

    # Running totals across all splits; per-split detail goes in 'splits'.
    statistics = {
        'total': 0,
        'json_valid': 0,
        'eos_valid': 0,
        'json_invalid': 0,
        'eos_invalid': 0,
        'splits': {}
    }

    for split_name, split_data in dataset.items():
        logger.info(f"\n{'='*60}")
        logger.info(f"Processing {split_name} split ({len(split_data)} examples)")
        logger.info('='*60)

        # Rename column if needed — downstream map functions expect 'text'.
        if data_column != 'text':
            split_data = split_data.rename_column(data_column, 'text')

        # Limit samples if specified
        if max_samples and len(split_data) > max_samples:
            logger.info(f"Limiting to {max_samples} samples for testing")
            split_data = split_data.select(range(max_samples))

        statistics['total'] += len(split_data)

        # Process to JSON format
        logger.info("\nConverting to JSON format (EXP-A)...")
        json_data = split_data.map(
            process_example_json,
            desc=f"JSON format ({split_name})"
        )

        # Filter valid examples (process_example_json marks parse failures).
        json_valid = json_data.filter(lambda x: x['valid'])
        json_invalid_count = len(json_data) - len(json_valid)

        logger.info(f"JSON format: {len(json_valid)}/{len(json_data)} valid")

        if len(json_valid) > 0:
            logger.info(f"\nSample JSON format:\n{json_valid[0]['text']}\n")

        # Process to EOS format
        logger.info("\nConverting to EOS format (EXP-B)...")
        eos_data = split_data.map(
            process_example_eos,
            desc=f"EOS format ({split_name})"
        )

        # Filter valid examples
        eos_valid = eos_data.filter(lambda x: x['valid'])
        eos_invalid_count = len(eos_data) - len(eos_valid)

        logger.info(f"EOS format: {len(eos_valid)}/{len(eos_data)} valid")

        if len(eos_valid) > 0:
            logger.info(f"\nSample EOS format:\n{eos_valid[0]['text']}\n")

        # Update statistics
        statistics['json_valid'] += len(json_valid)
        statistics['json_invalid'] += json_invalid_count
        statistics['eos_valid'] += len(eos_valid)
        statistics['eos_invalid'] += eos_invalid_count
        statistics['splits'][split_name] = {
            'total': len(split_data),
            'json_valid': len(json_valid),
            'eos_valid': len(eos_valid)
        }

        # Save JSON format — one CSV per split with a single 'text' column.
        json_df = pd.DataFrame({'text': [ex['text'] for ex in json_valid]})
        json_file = output_json / f'{split_name}.csv'
        json_df.to_csv(json_file, index=False)
        logger.info(f"Saved JSON: {json_file} ({len(json_df)} examples)")

        # Save EOS format
        eos_df = pd.DataFrame({'text': [ex['text'] for ex in eos_valid]})
        eos_file = output_eos / f'{split_name}.csv'
        eos_df.to_csv(eos_file, index=False)
        logger.info(f"Saved EOS: {eos_file} ({len(eos_df)} examples)")

    return statistics
308
+
309
+
310
def validate_output_files(output_base_dir: Path) -> Dict:
    """
    Validate the generated output files.

    Re-reads every CSV under exp_a_json/ and exp_b_eos/, checks each row
    against the matching format validator, logs per-file pass rates, and
    keeps up to three offending samples per file as evidence.

    Returns:
        Validation results dictionary
    """
    logger.info("\n" + "="*60)
    logger.info("VALIDATION OF OUTPUT FILES")
    logger.info("="*60)

    results = {
        'exp_a_json': {'valid': True, 'issues': []},
        'exp_b_eos': {'valid': True, 'issues': []}
    }

    # Validate JSON format (EXP-A)
    json_dir = output_base_dir / 'exp_a_json'
    for csv_file in json_dir.glob('*.csv'):
        logger.info(f"\nValidating {csv_file.name}...")
        df = pd.read_csv(csv_file)

        valid_count = 0
        invalid_samples = []

        for idx, row in df.iterrows():
            text = row['text']
            if validate_json_format(text):
                valid_count += 1
            else:
                # Capture only the first few bad rows to keep the report short.
                if len(invalid_samples) < 3:
                    invalid_samples.append(text[:100])

        rate = valid_count / len(df) * 100 if len(df) > 0 else 0
        logger.info(f" Valid: {valid_count}/{len(df)} ({rate:.1f}%)")

        if invalid_samples:
            results['exp_a_json']['valid'] = False
            results['exp_a_json']['issues'].extend(invalid_samples)

    # Validate EOS format (EXP-B)
    eos_dir = output_base_dir / 'exp_b_eos'
    for csv_file in eos_dir.glob('*.csv'):
        logger.info(f"\nValidating {csv_file.name}...")
        df = pd.read_csv(csv_file)

        valid_count = 0
        invalid_samples = []

        for idx, row in df.iterrows():
            text = row['text']
            if validate_eos_format(text):
                valid_count += 1
            else:
                if len(invalid_samples) < 3:
                    invalid_samples.append(text[:100])

        rate = valid_count / len(df) * 100 if len(df) > 0 else 0
        logger.info(f" Valid: {valid_count}/{len(df)} ({rate:.1f}%)")

        if invalid_samples:
            results['exp_b_eos']['valid'] = False
            results['exp_b_eos']['issues'].extend(invalid_samples)

    return results
375
+
376
+
377
def print_final_report(statistics: Dict, validation: Dict) -> bool:
    """Print final processing report.

    Logs overall and per-split counts plus PASS/FAIL validation status and
    returns True only when both output formats validated cleanly.
    """
    logger.info("\n" + "="*60)
    logger.info("FINAL REPORT")
    logger.info("="*60)

    logger.info(f"\nTotal examples processed: {statistics['total']}")

    logger.info("\nEXP-A (JSON Format):")
    logger.info(f" Valid: {statistics['json_valid']}")
    logger.info(f" Invalid: {statistics['json_invalid']}")
    # Guard against division by zero when the dataset was empty.
    json_rate = statistics['json_valid'] / statistics['total'] * 100 if statistics['total'] > 0 else 0
    logger.info(f" Success rate: {json_rate:.1f}%")
    logger.info(f" Validation: {'PASS' if validation['exp_a_json']['valid'] else 'FAIL'}")

    logger.info("\nEXP-B (EOS Format):")
    logger.info(f" Valid: {statistics['eos_valid']}")
    logger.info(f" Invalid: {statistics['eos_invalid']}")
    eos_rate = statistics['eos_valid'] / statistics['total'] * 100 if statistics['total'] > 0 else 0
    logger.info(f" Success rate: {eos_rate:.1f}%")
    logger.info(f" Validation: {'PASS' if validation['exp_b_eos']['valid'] else 'FAIL'}")

    logger.info("\nPer-split breakdown:")
    for split_name, split_stats in statistics['splits'].items():
        logger.info(f"\n {split_name.upper()}:")
        logger.info(f" Total: {split_stats['total']}")
        logger.info(f" JSON valid: {split_stats['json_valid']}")
        logger.info(f" EOS valid: {split_stats['eos_valid']}")

    logger.info("\n" + "="*60)

    all_valid = validation['exp_a_json']['valid'] and validation['exp_b_eos']['valid']
    if all_valid:
        logger.info("STATUS: ALL VALIDATIONS PASSED")
    else:
        logger.info("STATUS: SOME VALIDATIONS FAILED")

    logger.info("="*60)

    return all_valid
417
+
418
+
419
def main():
    """CLI entry point.

    Converts the source dataset to the EXP-A (JSON) and EXP-B (EOS) formats,
    optionally validates the written files, prints a summary report, and
    exits 0 on full success / 1 on any failure.
    """
    parser = argparse.ArgumentParser(
        description="Prepare experiment data in JSON and EOS formats"
    )
    parser.add_argument(
        "--dataset_repo_id",
        type=str,
        default="augustocsc/sintetico_natural",
        help="HuggingFace dataset repository ID"
    )
    parser.add_argument(
        "--data_dir",
        type=str,
        default="700K",
        help="Subdirectory within the dataset"
    )
    parser.add_argument(
        "--data_column",
        type=str,
        default="i_prompt_n",
        help="Column containing text data"
    )
    parser.add_argument(
        "--output_base_dir",
        type=str,
        default="./data/experiments",
        help="Base directory for output"
    )
    parser.add_argument(
        "--max_samples",
        type=int,
        default=None,
        help="Maximum samples per split (for testing)"
    )
    parser.add_argument(
        "--skip_validation",
        action="store_true",
        help="Skip output file validation"
    )

    args = parser.parse_args()

    output_base_dir = Path(args.output_base_dir)

    logger.info("="*60)
    logger.info("EXPERIMENT DATA PREPARATION")
    logger.info("="*60)
    logger.info(f"Dataset: {args.dataset_repo_id}/{args.data_dir}")
    logger.info(f"Column: {args.data_column}")
    logger.info(f"Output: {output_base_dir}")
    if args.max_samples:
        logger.info(f"Max samples: {args.max_samples}")
    logger.info("="*60)

    try:
        # Process dataset
        statistics = process_dataset(
            dataset_repo_id=args.dataset_repo_id,
            data_dir=args.data_dir,
            data_column=args.data_column,
            output_base_dir=output_base_dir,
            max_samples=args.max_samples
        )

        # Validate output (skippable; a skipped validation counts as PASS).
        if not args.skip_validation:
            validation = validate_output_files(output_base_dir)
        else:
            validation = {
                'exp_a_json': {'valid': True, 'issues': []},
                'exp_b_eos': {'valid': True, 'issues': []}
            }

        # Print report
        all_valid = print_final_report(statistics, validation)

        if all_valid:
            logger.info("\nData preparation completed successfully!")
            logger.info(f"\nOutput directories:")
            logger.info(f" EXP-A (JSON): {output_base_dir / 'exp_a_json'}")
            logger.info(f" EXP-B (EOS): {output_base_dir / 'exp_b_eos'}")
            sys.exit(0)
        else:
            logger.error("\nData preparation completed with validation errors!")
            sys.exit(1)

    except Exception as e:
        # Top-level boundary: log, dump the traceback, and exit non-zero.
        logger.error(f"\nFailed to prepare data: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)
510
+
511
+
512
+ if __name__ == "__main__":
513
+ main()
scripts/data/prepare_training_data_fixed.py ADDED
@@ -0,0 +1,408 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Data preparation script that adds proper <|endofex|> markers to training data.
3
+
4
+ This script processes the existing dataset and wraps expressions with end-of-expression
5
+ markers so the model learns to stop generation correctly.
6
+
7
+ Usage:
8
+ python scripts/data/prepare_training_data_fixed.py \
9
+ --dataset_repo_id augustocsc/sintetico_natural \
10
+ --data_dir 700K \
11
+ --data_column i_prompt_n \
12
+ --output_dir ./data/processed/700K_fixed \
13
+ --validate
14
+ """
15
+
16
+ import argparse
17
+ import logging
18
+ import os
19
+ import sys
20
+ from pathlib import Path
21
+ from typing import Dict, Tuple
22
+
23
+ from datasets import load_dataset, Dataset, DatasetDict
24
+ import pandas as pd
25
+
26
+ logging.basicConfig(
27
+ level=logging.INFO,
28
+ format='%(asctime)s - %(levelname)s - %(message)s'
29
+ )
30
+ logger = logging.getLogger(__name__)
31
+
32
+
33
def add_end_markers(example: Dict) -> Dict:
    """
    Append an <|endofex|> marker right after the expression in a sample.

    The marker is placed at the natural end of the 'expr:' payload (before a
    following 'vars:' line, blank line, etc.) so the model learns where to
    stop generating.  Samples without an expression, or already carrying a
    marker, pass through unchanged.
    """
    text = example['text']

    if 'expr:' not in text:
        logger.warning(f"No 'expr:' found in text: {text[:100]}...")
        return {'text': text}

    parts = text.split('expr:', 1)
    if len(parts) != 2:
        logger.warning(f"Unexpected format in text: {text[:100]}...")
        return {'text': text}

    head, tail = parts

    if '<|endofex|>' in tail:
        logger.debug("Marker already present, skipping")
        return {'text': text}

    # The expression ends where the next structured field (or blank line)
    # begins; otherwise it runs to the end of the sample.
    boundaries = ['\nvars:', '\nVariables:', '\n\n', '\nvar:', '\nVariable:']
    hits = [tail.find(boundary) for boundary in boundaries]
    cut = min([pos for pos in hits if pos != -1], default=len(tail))

    expression = tail[:cut].strip()
    trailing = tail[cut:]

    return {'text': head + 'expr: ' + expression + '<|endofex|>' + trailing}
87
+
88
+
89
def validate_markers(example: Dict) -> Dict:
    """Report how many start/end markers a training text contains.

    A text counts as valid when it has at least one end marker; the start
    marker is optional depending on the prompt format.

    Args:
        example: Row dict containing a 'text' field.

    Returns:
        Dict with 'valid', per-marker counts, and the original text.
    """
    text = example['text']
    starts = text.count('<|startofex|>')
    ends = text.count('<|endofex|>')

    return {
        'valid': ends > 0,
        'start_count': starts,
        'end_count': ends,
        'text': text,
    }
113
+
114
+
115
def process_dataset(
    dataset_repo_id: str,
    data_dir: str,
    data_column: str,
    output_dir: Path,
    validate: bool = True
) -> Tuple[DatasetDict, Dict]:
    """
    Process the dataset by adding end markers to all splits.

    Loads the dataset from the HuggingFace Hub, renames the chosen column to
    'text', maps add_end_markers over every split, and (optionally) re-scans
    the processed splits with validate_markers.

    Args:
        dataset_repo_id: HuggingFace dataset repository ID
        data_dir: Subdirectory within the dataset (e.g., '700K')
        data_column: Column to use for training data
        output_dir: Directory to save processed dataset
            (NOTE(review): this parameter is not used inside this function —
            saving is done by the caller via save_dataset; confirm intent)
        validate: Whether to run validation after processing

    Returns:
        Tuple of (processed_dataset, statistics)

    Raises:
        Re-raises any exception from load_dataset after logging it.
    """
    logger.info(f"Loading dataset from {dataset_repo_id}/{data_dir}...")

    try:
        # Load dataset from HuggingFace Hub (split=None loads every split).
        dataset = load_dataset(
            dataset_repo_id,
            data_dir=data_dir,
            split=None  # Load all splits
        )

        if not isinstance(dataset, dict):
            # If single split, convert to dict so the loop below is uniform.
            dataset = {'train': dataset}

        logger.info(f"Loaded {len(dataset)} split(s): {list(dataset.keys())}")

        # Show sample before processing
        if 'train' in dataset and len(dataset['train']) > 0:
            logger.info(f"\nSample BEFORE processing:")
            logger.info(f"{dataset['train'][0][data_column][:200]}...")

    except Exception as e:
        logger.error(f"Failed to load dataset: {e}")
        raise

    # Process each split, accumulating counters into `statistics`.
    processed_dataset = {}
    statistics = {
        'total_examples': 0,
        'processed_examples': 0,
        'already_marked': 0,
        'splits': {}
    }

    for split_name, split_data in dataset.items():
        logger.info(f"\nProcessing {split_name} split ({len(split_data)} examples)...")

        # Rename column to 'text' if needed, since add_end_markers reads 'text'.
        if data_column != 'text':
            split_data = split_data.rename_column(data_column, 'text')

        # Count examples that already have markers (full pass over the split).
        already_marked = sum(1 for ex in split_data if '<|endofex|>' in ex['text'])
        statistics['already_marked'] += already_marked

        if already_marked > 0:
            logger.info(f"Found {already_marked} examples already with markers")

        # Apply marker addition (add_end_markers is idempotent for marked rows).
        processed_split = split_data.map(
            add_end_markers,
            desc=f"Adding markers to {split_name}"
        )

        processed_dataset[split_name] = processed_split

        # Update statistics
        split_stats = {
            'total': len(split_data),
            'processed': len(processed_split),
            'already_marked': already_marked
        }
        statistics['splits'][split_name] = split_stats
        statistics['total_examples'] += len(split_data)
        statistics['processed_examples'] += len(processed_split)

        # Show sample after processing
        if len(processed_split) > 0:
            logger.info(f"\nSample AFTER processing:")
            logger.info(f"{processed_split[0]['text'][:200]}...")

    # Validate if requested: every processed row must contain an end marker.
    if validate:
        logger.info("\n" + "="*60)
        logger.info("VALIDATION")
        logger.info("="*60)

        for split_name, split_data in processed_dataset.items():
            logger.info(f"\nValidating {split_name} split...")

            # Apply validation (adds 'valid'/'start_count'/'end_count' columns).
            validated = split_data.map(validate_markers)

            # Count valid examples
            valid_count = sum(validated['valid'])
            invalid_count = len(validated) - valid_count

            # NOTE(review): raises ZeroDivisionError on an empty split — confirm
            # splits are always non-empty upstream.
            valid_rate = valid_count / len(validated) * 100

            logger.info(f"Valid examples: {valid_count}/{len(validated)} ({valid_rate:.1f}%)")

            if invalid_count > 0:
                logger.warning(f"Found {invalid_count} invalid examples!")

                # Show first few invalid examples
                invalid_examples = [
                    ex for ex in validated if not ex['valid']
                ][:3]

                for i, ex in enumerate(invalid_examples):
                    logger.warning(f"\nInvalid example {i+1}:")
                    logger.warning(f"Start markers: {ex['start_count']}")
                    logger.warning(f"End markers: {ex['end_count']}")
                    logger.warning(f"Text: {ex['text'][:200]}...")

            # Update statistics (these keys are read by main()'s exit check).
            statistics['splits'][split_name]['valid'] = valid_count
            statistics['splits'][split_name]['invalid'] = invalid_count
            statistics['splits'][split_name]['valid_rate'] = valid_rate

    # Convert back to DatasetDict so callers can push_to_hub / save uniformly.
    processed_dataset = DatasetDict(processed_dataset)

    return processed_dataset, statistics
249
+
250
+
251
def save_dataset(dataset: DatasetDict, output_dir: Path, data_dir: str):
    """Persist every split of the processed dataset as a CSV file.

    Args:
        dataset: Processed dataset whose splits are written out.
        output_dir: Destination directory (created if missing).
        data_dir: Original data directory name, used in the file names.
    """
    output_dir.mkdir(parents=True, exist_ok=True)

    logger.info(f"\nSaving processed dataset to {output_dir}...")

    for name, split in dataset.items():
        # One CSV per split, e.g. train_700K.csv.
        target = output_dir / f"{name}_{data_dir}.csv"
        frame = split.to_pandas()
        frame.to_csv(target, index=False)
        logger.info(f"Saved {name} split: {target} ({len(frame)} examples)")

    logger.info("Dataset saved successfully!")
275
+
276
+
277
def print_statistics(statistics: Dict):
    """Log a formatted summary of the processing statistics.

    Args:
        statistics: Aggregated counters produced by process_dataset.
    """
    rule = "=" * 60
    logger.info("\n" + rule)
    logger.info("PROCESSING STATISTICS")
    logger.info(rule)

    logger.info(f"\nTotal examples: {statistics['total_examples']}")
    logger.info(f"Processed examples: {statistics['processed_examples']}")
    logger.info(f"Already marked: {statistics['already_marked']}")

    logger.info("\nPer-split statistics:")
    logger.info("-" * 60)

    for name, stats in statistics['splits'].items():
        logger.info(f"\n{name.upper()}:")
        logger.info(f" Total: {stats['total']}")
        logger.info(f" Processed: {stats['processed']}")
        logger.info(f" Already marked: {stats.get('already_marked', 0)}")

        # Validation keys only exist when process_dataset ran with validate=True.
        if 'valid' in stats:
            logger.info(f" Valid: {stats['valid']}")
            logger.info(f" Invalid: {stats['invalid']}")
            logger.info(f" Valid rate: {stats['valid_rate']:.1f}%")

    logger.info(rule)
307
+
308
+
309
def main():
    """CLI entry point: process, validate, save, and optionally publish the dataset.

    Bug fix vs. previous revision: the validation-failure check used to run
    AFTER saving and pushing to the Hub, so an invalid dataset was persisted
    and published before the script exited with an error. Validation results
    are now checked immediately after processing, before anything is written.
    """
    parser = argparse.ArgumentParser(
        description="Prepare training data with proper end-of-expression markers"
    )
    parser.add_argument(
        "--dataset_repo_id",
        type=str,
        required=True,
        help="HuggingFace dataset repository ID"
    )
    parser.add_argument(
        "--data_dir",
        type=str,
        required=True,
        help="Subdirectory within the dataset (e.g., '700K')"
    )
    parser.add_argument(
        "--data_column",
        type=str,
        required=True,
        help="Column to use for training data (e.g., 'i_prompt_n')"
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        required=True,
        help="Directory to save processed dataset"
    )
    parser.add_argument(
        "--validate",
        action="store_true",
        help="Run validation after processing"
    )
    parser.add_argument(
        "--push_to_hub",
        action="store_true",
        help="Push processed dataset to HuggingFace Hub"
    )
    parser.add_argument(
        "--hub_repo_id",
        type=str,
        default=None,
        help="HuggingFace repository ID for pushing (if --push_to_hub)"
    )

    args = parser.parse_args()

    # Convert output_dir to Path
    output_dir = Path(args.output_dir)

    try:
        processed_dataset, statistics = process_dataset(
            dataset_repo_id=args.dataset_repo_id,
            data_dir=args.data_dir,
            data_column=args.data_column,
            output_dir=output_dir,
            validate=args.validate
        )

        # Print statistics
        print_statistics(statistics)

        # Fail fast on validation errors BEFORE persisting or publishing,
        # so an invalid dataset is never saved locally or pushed to the Hub.
        if args.validate:
            all_valid = all(
                split_stats.get('invalid', 0) == 0
                for split_stats in statistics['splits'].values()
            )

            if not all_valid:
                logger.error("\n⚠️ Some examples failed validation!")
                sys.exit(1)

            logger.info("\n✅ All examples validated successfully!")

        # Save to local directory
        save_dataset(processed_dataset, output_dir, args.data_dir)

        # Push to Hub if requested
        if args.push_to_hub:
            if not args.hub_repo_id:
                logger.error("--hub_repo_id required when using --push_to_hub")
                sys.exit(1)

            logger.info(f"\nPushing to HuggingFace Hub: {args.hub_repo_id}")
            processed_dataset.push_to_hub(args.hub_repo_id)
            logger.info("Successfully pushed to Hub!")

        logger.info("\n✅ Data preparation complete!")

    except Exception as e:
        logger.error(f"\n❌ Error during processing: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)
405
+
406
+
407
# Run the CLI only when executed as a script (not on import).
if __name__ == "__main__":
    main()
scripts/evaluate.py ADDED
@@ -0,0 +1,432 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Script para avaliacao customizada de modelos treinados
2
+ # Projeto Seriguela - Avaliacao de expressoes simbolicas geradas
3
+
4
+ import argparse
5
+ import json
6
+ import os
7
+ import sys
8
+ import re
9
+ from collections import Counter
10
+ from datetime import datetime
11
+
12
+ import numpy as np
13
+ import torch
14
+ from datasets import load_dataset
15
+ from transformers import AutoModelForCausalLM, AutoTokenizer
16
+ from peft import PeftModel
17
+ from tqdm import tqdm
18
+
19
+ # Add parent directory to path for imports
20
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
21
+ from classes.expression import Expression
22
+
23
+
24
def parse_args():
    """Build and parse the command-line arguments for an evaluation run."""
    p = argparse.ArgumentParser(
        description="Evaluate a trained model on expression generation")
    p.add_argument("--model_path", type=str, required=True,
                   help="Path to model (local or HuggingFace Hub)")
    p.add_argument("--base_model", type=str, default=None,
                   help="Base model for PEFT (if model_path is adapter)")
    p.add_argument("--dataset_repo_id", type=str, default="augustocsc/sintetico_natural",
                   help="HuggingFace dataset repository")
    p.add_argument("--data_dir", type=str, default="700K",
                   help="Data directory within dataset")
    p.add_argument("--data_column", type=str, default="i_prompt_n",
                   help="Column name for prompts (i_prompt_n for infix, p_prompt_n for prefix)")
    p.add_argument("--num_samples", type=int, default=500,
                   help="Number of samples to evaluate")
    p.add_argument("--num_generations", type=int, default=1,
                   help="Number of generations per prompt")
    p.add_argument("--max_new_tokens", type=int, default=128,
                   help="Maximum new tokens to generate")
    p.add_argument("--temperature", type=float, default=0.7,
                   help="Sampling temperature")
    p.add_argument("--top_p", type=float, default=0.9,
                   help="Top-p sampling parameter")
    p.add_argument("--output_dir", type=str, default="./evaluation_results",
                   help="Directory to save evaluation results")
    p.add_argument("--seed", type=int, default=42,
                   help="Random seed")
    p.add_argument("--device", type=str, default="auto",
                   help="Device to use (auto, cuda, cpu)")
    return p.parse_args()
53
+
54
+
55
def extract_expression_from_output(output: str, is_prefix: bool = False) -> str:
    """Pull the generated expression out of raw model output.

    Strategies, in order of preference:
      1. Text enclosed between <|startofex|> and <|endofex|> markers.
      2. Text after <|startofex|>, truncated at known section boundaries,
         first line only, capped at 150 characters.
      3. A regex match on an 'expr:'/'Expression:' label.
      4. The first line of the output, capped at 100 characters.

    Args:
        output: Decoded model output (special tokens included).
        is_prefix: Whether the expression is prefix notation (currently unused).

    Returns:
        The extracted expression string (possibly empty).
    """
    start_marker = "<|startofex|>"
    end_marker = "<|endofex|>"

    has_start = start_marker in output

    # Strategy 1: fully delimited expression.
    if has_start and end_marker in output:
        begin = output.find(start_marker) + len(start_marker)
        finish = output.find(end_marker)
        if begin < finish:
            return output[begin:finish].strip()

    # Strategy 2: only the start marker is present.
    if has_start:
        candidate = output[output.find(start_marker) + len(start_marker):].strip()

        # Cut at the first known boundary that appears.
        for stop in ("\nvars:", "\nVariables:", "\nOperators:", "\n\n", "<|endoftext|>"):
            if stop in candidate:
                candidate = candidate.split(stop)[0].strip()
                break

        # Keep only the first line and cap runaway generations.
        candidate = candidate.split("\n")[0].strip()
        return candidate[:150] if len(candidate) > 150 else candidate

    # Strategy 3: labeled expression somewhere in the output.
    label = re.search(r'(?:expr|Expression):\s*(.+?)(?:\n|$)', output, re.IGNORECASE)
    if label:
        return label.group(1).strip()

    # Strategy 4: give up and return the (length-limited) first line.
    head = output.strip().split("\n")[0]
    return head[:100] if len(head) > 100 else head
95
+
96
+
97
def validate_expression(expr_str: str, is_prefix: bool = False) -> dict:
    """Check whether a generated expression string parses via Expression.

    Args:
        expr_str: Candidate expression text.
        is_prefix: Whether the string is in prefix notation.

    Returns:
        Dict with 'valid', 'parseable', 'error' (message or None), and
        'expression_obj' (the parsed Expression, or None on failure).
    """
    outcome = {
        "valid": False,
        "parseable": False,
        "error": None,
        "expression_obj": None,
    }

    # Reject empty / whitespace-only candidates before parsing.
    if not expr_str or not expr_str.strip():
        outcome["error"] = "Empty expression"
        return outcome

    try:
        parsed = Expression(expr_str, is_prefix=is_prefix)
    except Exception as exc:
        # Any parser failure is recorded as the error message.
        outcome["error"] = str(exc)
    else:
        outcome["parseable"] = True
        outcome["valid"] = True
        outcome["expression_obj"] = parsed

    return outcome
119
+
120
+
121
def check_prompt_adherence(expr_str: str, prompt: str, is_prefix: bool = False) -> dict:
    """Check whether an expression respects the prompt's declared constraints.

    The prompt is expected to contain lines like 'Variables: x_1, x_2' and
    'Operators: +, -, sin'. The operator check is intentionally shallow:
    only named functions (sin, cos, tan, log, sqrt, exp) are verified, and
    a missing constraint line means the corresponding check passes.

    Args:
        expr_str: Candidate expression text.
        prompt: The prompt text that declared the constraints.
        is_prefix: Whether the expression is prefix notation (currently unused).

    Returns:
        Dict with 'uses_allowed_vars', 'uses_allowed_ops', 'all_constraints_met'.
    """
    verdict = {
        "uses_allowed_vars": False,
        "uses_allowed_ops": False,
        "all_constraints_met": False
    }

    # Variables declared in the prompt (x_<n> tokens).
    allowed_vars = set()
    var_line = re.search(r"Variables?:\s*([^\n]+)", prompt, re.IGNORECASE)
    if var_line:
        allowed_vars = set(re.findall(r"x_\d+", var_line.group(1)))

    # Operators declared in the prompt, restricted to a known vocabulary.
    allowed_ops = set()
    op_line = re.search(r"Operators?:\s*([^\n]+)", prompt, re.IGNORECASE)
    if op_line:
        declared = op_line.group(1)
        allowed_ops = {
            op for op in ('+', '-', '*', '/', '**', 'sin', 'cos', 'tan', 'log', 'sqrt', 'exp')
            if op in declared
        }

    # Variable adherence: every x_<n> in the expression must be declared.
    used_vars = set(re.findall(r"x_\d+", expr_str))
    verdict["uses_allowed_vars"] = used_vars.issubset(allowed_vars) if allowed_vars else True

    # Operator adherence (shallow): flag named functions outside the allowed set.
    verdict["uses_allowed_ops"] = True
    if allowed_ops:
        verdict["uses_allowed_ops"] = not any(
            fn in expr_str and fn not in allowed_ops
            for fn in ('sin', 'cos', 'tan', 'log', 'sqrt', 'exp')
        )

    verdict["all_constraints_met"] = verdict["uses_allowed_vars"] and verdict["uses_allowed_ops"]

    return verdict
170
+
171
+
172
def load_model_and_tokenizer(model_path: str, base_model: str = None, device: str = "auto"):
    """Load model and tokenizer.

    Supports both full checkpoints and PEFT adapters: when model_path
    contains an adapter_config.json (or base_model is given), the base
    model is loaded first and the adapter is merged into it for faster
    inference.

    Args:
        model_path: Local directory or Hub ID of the model/adapter.
        base_model: Base model for PEFT; defaults to "gpt2" when an
            adapter is detected without an explicit base.
        device: "auto" (pick cuda if available), "cuda", or "cpu".

    Returns:
        Tuple (model, tokenizer, device) with the model in eval mode
        on the resolved device.
    """
    print(f"Loading model from: {model_path}")

    # Determine device
    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load tokenizer; GPT-2-style tokenizers ship without a pad token.
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Check if this is a PEFT model (adapter checkpoints carry adapter_config.json).
    # NOTE(review): Hub-hosted adapters (non-local paths) are only detected when
    # base_model is passed explicitly — confirm callers do so.
    is_peft = os.path.exists(os.path.join(model_path, "adapter_config.json")) if os.path.isdir(model_path) else False

    if is_peft or base_model:
        # Load base model first
        base = base_model or "gpt2"
        print(f"Loading base model: {base}")
        model = AutoModelForCausalLM.from_pretrained(base)
        # Resize embeddings to match the tokenizer (it may carry added special tokens).
        model.resize_token_embeddings(len(tokenizer))

        # Load PEFT adapter
        print("Loading PEFT adapter...")
        model = PeftModel.from_pretrained(model, model_path)
        model = model.merge_and_unload()  # Merge for faster inference
    else:
        # Load full model
        model = AutoModelForCausalLM.from_pretrained(model_path)
        model.resize_token_embeddings(len(tokenizer))

    model = model.to(device)
    model.eval()

    return model, tokenizer, device
208
+
209
+
210
def generate_expression(model, tokenizer, prompt: str, device: str,
                        max_new_tokens: int = 128, temperature: float = 0.7,
                        top_p: float = 0.9, num_return_sequences: int = 1):
    """Sample one or more continuations of `prompt` and decode them.

    Special tokens are kept in the decoded output so callers can locate
    the <|startofex|>/<|endofex|> markers.

    Args:
        model: Causal LM with a .generate() method.
        tokenizer: Matching tokenizer (callable, with batch_decode).
        prompt: Input text; truncated to 512 tokens.
        device: Device the input tensors are moved to.
        max_new_tokens: Generation length cap.
        temperature: Sampling temperature.
        top_p: Nucleus-sampling threshold.
        num_return_sequences: Number of samples per prompt.

    Returns:
        List of decoded output strings.
    """
    encoded = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    encoded = {key: tensor.to(device) for key, tensor in encoded.items()}

    # Sampling only; no gradients needed at inference time.
    with torch.no_grad():
        sequences = model.generate(
            **encoded,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            num_return_sequences=num_return_sequences,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    return tokenizer.batch_decode(sequences, skip_special_tokens=False)
231
+
232
+
233
def evaluate_model(args):
    """Run the full evaluation loop and persist metrics/results to disk.

    Steps:
      1. Seed torch/numpy for reproducible sampling and subsampling.
      2. Load model + tokenizer (optionally merging a PEFT adapter).
      3. Load the test split (falling back to validation) and subsample
         down to args.num_samples.
      4. For each prompt: generate expressions, extract and validate them,
         and check adherence to the prompt's variable/operator constraints.
      5. Aggregate rates, print a report, and save two JSON files
         (summary metrics and per-generation details) under args.output_dir.

    Args:
        args: Parsed CLI namespace from parse_args().

    Returns:
        Dict of final aggregate metrics (also written to disk).
    """
    # Set seed
    torch.manual_seed(args.seed)
    np.random.seed(args.seed)

    # Load model
    model, tokenizer, device = load_model_and_tokenizer(
        args.model_path, args.base_model, args.device
    )

    # Load dataset
    # NOTE(review): assumes CSVs named test_<dir>.csv / val_<dir>.csv inside
    # the repo directory — confirm the dataset layout on the Hub.
    print(f"Loading dataset: {args.dataset_repo_id}/{args.data_dir}")
    try:
        dataset = load_dataset(
            args.dataset_repo_id,
            data_files={
                "test": f"{args.data_dir}/test_{args.data_dir}.csv"
            }
        )["test"]
    except Exception as e:
        # Fall back to the validation split when no test split exists.
        print(f"Error loading test set, trying validation: {e}")
        dataset = load_dataset(
            args.dataset_repo_id,
            data_files={
                "validation": f"{args.data_dir}/val_{args.data_dir}.csv"
            }
        )["validation"]

    # Sample if needed (without replacement, seeded above).
    if len(dataset) > args.num_samples:
        indices = np.random.choice(len(dataset), args.num_samples, replace=False)
        dataset = dataset.select(indices)

    print(f"Evaluating on {len(dataset)} samples...")

    # Determine if prefix or infix from the column naming convention
    # (p_* columns hold prefix-notation prompts).
    is_prefix = args.data_column.startswith("p_")

    # Evaluation metrics accumulated across all generations.
    # 'unique_expressions' is a set; only its size is serialized later.
    metrics = {
        "total_samples": 0,
        "total_generations": 0,
        "valid_expressions": 0,
        "parseable_expressions": 0,
        "uses_allowed_vars": 0,
        "uses_allowed_ops": 0,
        "all_constraints_met": 0,
        "unique_expressions": set(),
        "expression_lengths": [],
        "errors": Counter(),
    }

    results = []

    # Generate and evaluate
    for idx, sample in enumerate(tqdm(dataset, desc="Evaluating")):
        prompt = sample[args.data_column]

        # Extract just the prompt part (before the expression)
        # Typically the prompt ends before <|startofex|>; keep the marker so
        # generation starts exactly where the expression should begin.
        if "<|startofex|>" in prompt:
            prompt_only = prompt.split("<|startofex|>")[0] + "<|startofex|>"
        else:
            prompt_only = prompt

        generations = generate_expression(
            model, tokenizer, prompt_only, device,
            max_new_tokens=args.max_new_tokens,
            temperature=args.temperature,
            top_p=args.top_p,
            num_return_sequences=args.num_generations
        )

        metrics["total_samples"] += 1

        for gen_output in generations:
            metrics["total_generations"] += 1

            # Extract expression
            expr_str = extract_expression_from_output(gen_output, is_prefix)

            # Validate (syntactic parse via Expression)
            validation = validate_expression(expr_str, is_prefix)

            # Check adherence to the prompt's declared variables/operators
            adherence = check_prompt_adherence(expr_str, prompt_only, is_prefix)

            # Update metrics
            if validation["valid"]:
                metrics["valid_expressions"] += 1
            if validation["parseable"]:
                metrics["parseable_expressions"] += 1
            metrics["unique_expressions"].add(expr_str)
            metrics["expression_lengths"].append(len(expr_str))
            if validation["error"]:
                # Truncate error messages so the Counter buckets stay coarse.
                metrics["errors"][validation["error"][:50]] += 1

            if adherence["uses_allowed_vars"]:
                metrics["uses_allowed_vars"] += 1
            if adherence["uses_allowed_ops"]:
                metrics["uses_allowed_ops"] += 1
            if adherence["all_constraints_met"]:
                metrics["all_constraints_met"] += 1

            results.append({
                "sample_idx": idx,
                "prompt": prompt_only[:200],  # Truncate for storage
                "generated_output": gen_output[:500],
                "extracted_expression": expr_str,
                "valid": validation["valid"],
                "parseable": validation["parseable"],
                "error": validation["error"],
                "uses_allowed_vars": adherence["uses_allowed_vars"],
                "uses_allowed_ops": adherence["uses_allowed_ops"],
            })

    # Calculate final metrics (all rates guarded against zero generations).
    total_gen = metrics["total_generations"]
    final_metrics = {
        "model_path": args.model_path,
        "dataset": f"{args.dataset_repo_id}/{args.data_dir}",
        "data_column": args.data_column,
        "is_prefix": is_prefix,
        "num_samples": metrics["total_samples"],
        "num_generations": total_gen,
        "temperature": args.temperature,
        "top_p": args.top_p,

        # Validity metrics
        "valid_rate": metrics["valid_expressions"] / total_gen if total_gen > 0 else 0,
        "parseable_rate": metrics["parseable_expressions"] / total_gen if total_gen > 0 else 0,

        # Adherence metrics
        "uses_allowed_vars_rate": metrics["uses_allowed_vars"] / total_gen if total_gen > 0 else 0,
        "uses_allowed_ops_rate": metrics["uses_allowed_ops"] / total_gen if total_gen > 0 else 0,
        "constraints_met_rate": metrics["all_constraints_met"] / total_gen if total_gen > 0 else 0,

        # Diversity metrics
        "unique_expressions": len(metrics["unique_expressions"]),
        "diversity_rate": len(metrics["unique_expressions"]) / total_gen if total_gen > 0 else 0,
        "avg_expression_length": np.mean(metrics["expression_lengths"]) if metrics["expression_lengths"] else 0,

        # Error distribution (top 10)
        "top_errors": dict(metrics["errors"].most_common(10)),

        "timestamp": datetime.now().isoformat(),
    }

    # Print results
    print("\n" + "="*60)
    print("EVALUATION RESULTS")
    print("="*60)
    print(f"Model: {args.model_path}")
    print(f"Dataset: {args.dataset_repo_id}/{args.data_dir}")
    print(f"Format: {'Prefix' if is_prefix else 'Infix'}")
    print("-"*60)
    print(f"Total samples: {metrics['total_samples']}")
    print(f"Total generations: {total_gen}")
    print("-"*60)
    print("VALIDITY METRICS:")
    print(f" Valid rate: {final_metrics['valid_rate']:.2%}")
    print(f" Parseable rate: {final_metrics['parseable_rate']:.2%}")
    print("-"*60)
    print("ADHERENCE METRICS:")
    print(f" Uses allowed vars: {final_metrics['uses_allowed_vars_rate']:.2%}")
    print(f" Uses allowed ops: {final_metrics['uses_allowed_ops_rate']:.2%}")
    print(f" All constraints met: {final_metrics['constraints_met_rate']:.2%}")
    print("-"*60)
    print("DIVERSITY METRICS:")
    print(f" Unique expressions: {final_metrics['unique_expressions']}")
    print(f" Diversity rate: {final_metrics['diversity_rate']:.2%}")
    print(f" Avg expression length: {final_metrics['avg_expression_length']:.1f}")
    print("="*60)

    # Save results
    os.makedirs(args.output_dir, exist_ok=True)

    # Create filename from model path (slashes would break the path).
    model_name = args.model_path.replace("/", "_").replace("\\", "_")
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Save metrics
    metrics_file = os.path.join(args.output_dir, f"metrics_{model_name}_{timestamp}.json")
    with open(metrics_file, "w") as f:
        json.dump(final_metrics, f, indent=2)
    print(f"\nMetrics saved to: {metrics_file}")

    # Save detailed results
    results_file = os.path.join(args.output_dir, f"results_{model_name}_{timestamp}.json")
    with open(results_file, "w") as f:
        json.dump(results, f, indent=2)
    print(f"Detailed results saved to: {results_file}")

    return final_metrics
428
+
429
+
430
# Parse CLI arguments and run the evaluation when invoked as a script.
if __name__ == "__main__":
    args = parse_args()
    evaluate_model(args)