rogeliorichman committed on
Commit 92c68e3 · verified · 1 Parent(s): 4d3449c

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+data/sample2.pdf filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,62 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Virtual Environment
+venv/
+ENV/
+env/
+.env
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+.DS_Store
+
+# Testing
+.coverage
+htmlcov/
+.tox/
+.nox/
+.pytest_cache/
+
+# Logs
+*.log
+logs/
+
+# Local development
+.env.local
+.env.development.local
+.env.test.local
+.env.production.local
+
+# API Keys
+.env
+*.pem
+*.key
+
+# Gradio
+.gradio/
+
+# private file
+/data/sample3.pdf
CONTRIBUTING.md ADDED
@@ -0,0 +1,99 @@
+# Contributing to AI LectureForge
+
+First off, thank you for considering contributing to AI LectureForge! It's people like you that make AI LectureForge such a great tool.
+
+## Code of Conduct
+
+By participating in this project, you are expected to uphold our Code of Conduct:
+
+- Use welcoming and inclusive language
+- Be respectful of differing viewpoints and experiences
+- Gracefully accept constructive criticism
+- Focus on what is best for the community
+- Show empathy towards other community members
+
+## How Can I Contribute?
+
+### Reporting Bugs
+
+Before creating a bug report, please check the issue list, as an existing issue may already cover it. When you create a bug report, please include as many details as possible:
+
+* Use a clear and descriptive title
+* Describe the exact steps which reproduce the problem
+* Provide specific examples to demonstrate the steps
+* Describe the behavior you observed after following the steps
+* Explain which behavior you expected to see instead and why
+* Include screenshots if possible
+
+### Suggesting Enhancements
+
+If you have a suggestion for the project, we'd love to hear it. Enhancement suggestions are tracked as GitHub issues. When creating an enhancement suggestion, please include:
+
+* A clear and descriptive title
+* A detailed description of the proposed enhancement
+* Examples of how the enhancement would be used
+* Any potential drawbacks or challenges
+
+### Pull Requests
+
+1. Fork the repo and create your branch from `main`
+2. If you've added code that should be tested, add tests
+3. If you've changed APIs, update the documentation
+4. Ensure the test suite passes
+5. Make sure your code follows the existing style
+6. Issue that pull request!
+
+## Development Process
+
+1. Create a new branch:
+   ```bash
+   git checkout -b feature/my-feature
+   # or
+   git checkout -b bugfix/my-bugfix
+   ```
+
+2. Make your changes and commit:
+   ```bash
+   git add .
+   git commit -m "Description of changes"
+   ```
+
+3. Push to your fork:
+   ```bash
+   git push origin feature/my-feature
+   ```
+
+### Style Guidelines
+
+- Follow the PEP 8 style guide for Python code
+- Use descriptive variable names
+- Comment your code when necessary
+- Keep functions focused and modular
+- Use type hints where possible
+
+### Testing
+
+- Write unit tests for new features
+- Ensure all tests pass before submitting a PR
+- Include both positive and negative test cases
+
+## Project Structure
+
+```
+transcript_transformer/
+├── src/
+│   ├── core/            # Core transformation logic
+│   ├── utils/           # Utility functions
+│   └── app.py           # Main application
+├── tests/               # Test files
+└── requirements.txt     # Project dependencies
+```
+
+## Getting Help
+
+If you need help, you can:
+- Open an issue with your question
+- Reach out to the maintainers
+- Check the documentation
+
+Thank you for contributing to AI LectureForge! 🎓✨
README.md CHANGED
@@ -1,12 +1,265 @@
 ---
-title: AI Agent Script Builder
-emoji: 😻
-colorFrom: blue
-colorTo: indigo
 sdk: gradio
-sdk_version: 5.23.3
-app_file: app.py
-pinned: false
 ---

-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 ---
+title: AI_Agent_Script_Builder
+app_file: src/app.py
 sdk: gradio
+sdk_version: 5.13.1
 ---
+# 🎓 AI Agent Script Builder
+
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](http://makeapullrequest.com)
+[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/rogeliorichman/AI_Script_Generator)
+
+> Transform transcripts and PDFs into timed, structured teaching scripts using an autonomous AI agent
+
+AI Agent Script Builder is an advanced autonomous agent that converts PDF transcripts, raw text, and conversational content into well-structured teaching scripts. It processes inputs end to end, extracting and analyzing the content to create organized, pedagogically sound scripts with time markers. It is designed for educators, students, content creators, and anyone looking to transform information into clear explanations.
+
+## 🤖 AI Agent Architecture
+
+AI Agent Script Builder functions as a **specialized AI agent** that autonomously processes and transforms content with minimal human intervention:
+
+### Agent Capabilities
+- **Autonomous Processing**: Independently analyzes content, determines structure, and generates complete scripts
+- **Decision Making**: Intelligently allocates time, prioritizes topics, and structures content based on input analysis
+- **Contextual Adaptation**: Adjusts to different languages, styles, and requirements through guiding prompts
+- **Obstacle Management**: Implements progressive retry strategies when facing API quota limitations
+- **Goal-Oriented Operation**: Consistently works toward transforming unstructured information into coherent educational scripts
+
+### Agent Limitations
+- **Domain Specificity**: Specialized for educational script generation rather than general-purpose tasks
+- **External API Dependency**: Relies on third-party language models (Gemini/OpenAI) for core reasoning
+- **No Continuous Learning**: Does not improve through experience or previous interactions
+
+This architecture enables the system to function autonomously within its specialized domain while maintaining high-quality output and resilience to common obstacles.
+
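The "progressive retry strategies" above can be sketched as exponential backoff around the model call. This is an illustrative sketch, not the repository's actual implementation; the function name, delays, and broad `Exception` catch are assumptions:

```python
import time

def call_with_backoff(fn, max_retries=4, base_delay=1.0):
    """Retry fn with exponentially growing delays, a common way to
    ride out transient API quota errors. Illustrative sketch only."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # in practice, catch the provider's quota error type
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

A real implementation would narrow the exception to the provider's rate-limit error and cap the total wait time.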
+## 🔗 Live Demo
+
+Try it out: [AI Agent Script Builder on Hugging Face Spaces](https://huggingface.co/spaces/rogeliorichman/AI_Script_Generator)
+
+## ✨ Features
+
+- 🤖 PDF transcript and raw text processing
+- 🤖 AI-powered content transformation
+- 📚 Structured teaching script generation
+- 🔄 Coherent topic organization
+- 🔌 Support for multiple AI providers (Gemini/OpenAI)
+- ⏱️ Time-marked sections for pacing
+- 🌐 Multilingual interface (English/Spanish) with flag selector
+- 🌍 Generation in ANY language through the guiding prompt (not limited to the UI languages)
+- 🧠 Autonomous decision-making for content organization and pacing
+- 🛡️ Self-healing capabilities with progressive retry strategies for API limitations
+
+## Output Format
+
+The generated scripts follow a structured format:
+
+### Time Markers
+- Each section includes time markers (e.g., `[11:45]`) to help pace delivery
+- Customizable duration: from as short as 2 minutes up to 60 minutes, with timing adjusted accordingly
+
+### Structure
+- Introduction with learning objectives
+- Time-marked content sections
+- Examples and practical applications
+- Interactive elements (questions, exercises)
+- Recap and key takeaways
+
+For example:
+```
+[00:00] Introduction to Topic
+- Learning objectives
+- Key concepts overview
+
+[11:45] Main Concept Explanation
+- Detailed explanation
+- Practical example
+- Student interaction point
+
+[23:30] Advanced Applications
+...
+```
+
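The timestamp toggle in the UI strips these markers with a regular expression; the helper below mirrors the `remove_timestamps` method in `src/app.py`, which matches `[MM:SS]` or `[HH:MM:SS]`:

```python
import re

def remove_timestamps(text: str) -> str:
    """Strip [MM:SS] or [HH:MM:SS] markers, as the app's timestamp toggle does."""
    return re.sub(r'\[\d{1,2}:\d{2}(:\d{2})?\]', '', text)
```

Note that only the bracketed marker is removed; any surrounding whitespace is left in place.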
+## 🚀 Quick Start
+
+### Prerequisites
+
+- Python 3.8 or higher
+- Virtual environment (recommended)
+- Gemini API key (or OpenAI API key)
+
+### Installation
+
+```bash
+# Clone the repository
+git clone https://github.com/RogelioRichmanAstronaut/AI-Script-Generator.git
+cd AI-Script-Generator
+
+# Create and activate a virtual environment
+python -m venv venv
+source venv/bin/activate  # On Windows: .\venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Set up environment variables (choose one API key based on your preference)
+export GEMINI_API_KEY='your-gemini-api-key'  # Primary option
+# OR
+export OPENAI_API_KEY='your-openai-api-key'  # Alternative option
+
+# On Windows use:
+# set GEMINI_API_KEY=your-gemini-api-key
+# set OPENAI_API_KEY=your-openai-api-key
+```
+
+### Usage
+
+```bash
+# Run with the Python path set
+PYTHONPATH=$PYTHONPATH:. python src/app.py
+
+# Access the web interface
+# Open http://localhost:7860 in your browser
+```
+
+## 🛠️ Technical Approach
+
+### Prompt Engineering Strategy
+
+Our system uses a sophisticated multi-stage prompting approach:
+
+1. **Content Analysis & Chunking**
+   - Smart text segmentation for handling large documents (9000+ words)
+   - Contextual overlap between chunks to maintain coherence
+   - Key topic and concept extraction from each segment
+
+2. **Structure Generation**
+   - Time-based sectioning (customizable from 2-60 minutes)
+   - Educational flow design with clear progression
+   - Integration of pedagogical elements (examples, exercises, questions)
+
+3. **Educational Enhancement**
+   - Transformation of casual content into a formal teaching script
+   - Addition of practical examples and case studies
+   - Integration of interaction points and reflection questions
+   - Time markers for pacing guidance
+
+4. **Coherence Validation**
+   - Cross-reference checking between sections
+   - Verification of topic flow and progression
+   - Consistency check for terminology and concepts
+   - Quality assessment of educational elements
+
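The segmentation step in stage 1 can be sketched as a sliding window with overlap. This is a minimal illustration; the chunk sizes and function name are assumptions, not taken from the repository:

```python
def chunk_with_overlap(words, chunk_size=1500, overlap=200):
    """Split a word list into windows that share `overlap` words with their
    neighbor, so each chunk carries context from the previous one.
    Illustrative sketch; real sizes depend on the model's context limit."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(words):
        chunks.append(words[start:start + chunk_size])
        start += step
    return chunks
```

Because consecutive windows share the `overlap` region, a topic split across a chunk boundary still appears whole in at least one window.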
+### Challenges & Solutions
+
+1. **Context Length Management**
+   - Challenge: Handling documents beyond model context limits
+   - Solution: Implemented sliding-window chunking with overlap
+   - Result: Successfully processes documents of 9000+ words, with room to extend further
+
+2. **Educational Structure**
+   - Challenge: Converting conversational text to a teaching format
+   - Solution:
+     - Structured templating system for different time formats (2-60 min)
+     - Integration of pedagogical elements (examples, exercises)
+     - Time-based sectioning with clear progression
+   - Result: Coherent, time-marked teaching scripts with interactive elements
+
+3. **Content Coherence**
+   - Challenge: Maintaining narrative flow across chunked content
+   - Solution:
+     - Contextual overlap between chunks
+     - Topic tracking across sections
+     - Cross-reference validation system
+   - Result: Seamless content flow with consistent terminology
+
+4. **Educational Quality**
+   - Challenge: Ensuring high pedagogical value
+   - Solution:
+     - Integration of learning objectives
+     - Strategic placement of examples and exercises
+     - Addition of reflection questions
+     - Time-appropriate pacing markers
+   - Result: Engaging, structured learning materials
+
+### Core Components
+
+1. **PDF Processing**: Extracts and cleans text from PDF transcripts
+2. **Text Processing**: Handles direct text input and cleans/structures it
+3. **Content Analysis**: Uses AI to understand and structure the content
+4. **Script Generation**: Transforms content into an educational format
+
+### Implementation Details
+
+1. **PDF/Text Handling**
+   - Robust PDF text extraction
+   - Raw text input processing
+   - Clean-up of extracted content
+
+2. **AI Processing**
+   - Integration with the Gemini API (primary)
+   - OpenAI API support (alternative)
+   - Structured prompt system for consistent output
+
+3. **Output Generation**
+   - Organized teaching scripts
+   - Clear section structure
+   - Learning points and key concepts
+
+### Architecture
+
+The system follows a modular agent-based design:
+
+- 📄 PDF/text processing module (Perception)
+- 🔍 Text analysis component (Cognition)
+- 🤖 AI integration layer (Decision-making)
+- 📝 Output formatting system (Action)
+- 🔄 Error handling system (Self-correction)
+
+This agent architecture enables autonomous processing from raw input to final output, with built-in adaptation to errors and limitations.
+
+## 🤝 Contributing
+
+Contributions are what make the open source community amazing! Any contributions you make are **greatly appreciated**.
+
+1. Fork the Project
+2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
+3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
+4. Push to the Branch (`git push origin feature/AmazingFeature`)
+5. Open a Pull Request
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines.
+
+## 📝 License
+
+Distributed under the MIT License. See `LICENSE` for more information.
+
+## 🌟 Acknowledgments
+
+- Special thanks to the Gemini and OpenAI teams for their amazing APIs
+- Inspired by educators and communicators worldwide who make learning engaging
+
+## 📧 Contact
+
+Project Link: [https://github.com/RogelioRichmanAstronaut/AI-Script-Generator](https://github.com/RogelioRichmanAstronaut/AI-Script-Generator)
+
+## 🔮 Roadmap
+
+- [ ] Support for multiple output formats (PDF, PPTX)
+- [ ] Interactive elements generation
+- [ ] Custom templating system
+- [ ] Copy-to-clipboard button for generated content
+- [x] Multilingual capabilities
+  - [x] Content generation in any language via the guiding prompt
+  - [x] UI language support
+    - [x] English
+    - [x] Spanish
+    - [ ] French
+    - [ ] German
+- [ ] Integration with LMS platforms
+- [x] Timestamp toggle - ability to show/hide time markers in the output text
+
+---
+
+<p align="center">Made with ❤️ for educators, students, and communicators everywhere</p>
data/sample2.pdf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5bf0997942205ed54293dd5b2a480b6d5efcc7d4146548dd68c20a9d7e3f7318
+size 155966
requirements.txt ADDED
@@ -0,0 +1,9 @@
+gradio>=4.0.0
+transformers>=4.30.0
+torch>=2.0.0
+pypdf2>=3.0.0
+python-dotenv>=0.19.0
+numpy>=1.21.0
+tqdm>=4.65.0
+openai>=1.0.0
+tiktoken>=0.5.0
setup.py ADDED
@@ -0,0 +1,13 @@
+from setuptools import setup, find_packages
+
+setup(
+    name="transcript_transformer",
+    version="0.1.0",
+    packages=find_packages(),
+    install_requires=[
+        line.strip()
+        for line in open("requirements.txt")
+        if line.strip() and not line.startswith("#")
+    ],
+    python_requires=">=3.8",
+)
src/__init__.py ADDED
File without changes
src/app.py ADDED
@@ -0,0 +1,383 @@
1
+ import os
2
+ import gradio as gr
3
+ import re
4
+ from dotenv import load_dotenv
5
+ from src.core.transformer import TranscriptTransformer
6
+ from src.utils.pdf_processor import PDFProcessor
7
+ from src.utils.text_processor import TextProcessor
8
+
9
+ load_dotenv()
10
+
11
+ # Translations dictionary for UI elements
12
+ TRANSLATIONS = {
13
+ "en": {
14
+ "title": "AI Script Generator",
15
+ "subtitle": "Transform transcripts and PDFs into timed, structured teaching scripts using AI",
16
+ "input_type_label": "Input Type",
17
+ "input_type_options": ["PDF", "Raw Text"],
18
+ "upload_pdf_label": "Upload Transcript (PDF)",
19
+ "paste_text_label": "Paste Transcript Text",
20
+ "paste_text_placeholder": "Paste your transcript text here...",
21
+ "guiding_prompt_label": "Guiding Prompt (Optional)",
22
+ "guiding_prompt_placeholder": "Additional instructions to customize the output. Examples: 'Use a more informal tone', 'Focus only on section X', 'Generate the content in Spanish', 'Include more practical programming examples', etc.",
23
+ "guiding_prompt_info": "The Guiding Prompt allows you to provide specific instructions to modify the generated content, like output/desired LANGUAGE. You can use it to change the tone, style, focus ONLY on specific sections of the text, specify the output language (e.g., 'Generate in Spanish/French/German'), or give any other instruction that helps personalize the final result.",
24
+ "duration_label": "Target Lecture Duration (minutes)",
25
+ "examples_label": "Include Practical Examples",
26
+ "thinking_model_label": "Use Experimental Thinking Model (Gemini Only)",
27
+ "submit_button": "Transform Transcript",
28
+ "output_label": "Generated Teaching Transcript",
29
+ "error_no_pdf": "Error: No PDF file uploaded",
30
+ "error_no_text": "Error: No text provided",
31
+ "error_prefix": "Error processing transcript: ",
32
+ "language_selector": "Language / Idioma",
33
+ "show_timestamps": "Show Timestamps",
34
+ "hide_timestamps": "Hide Timestamps"
35
+ },
36
+ "es": {
37
+ "title": "Generador de Guiones IA",
38
+ "subtitle": "Transforma transcripciones y PDFs en guiones de enseñanza estructurados y cronometrados usando IA",
39
+ "input_type_label": "Tipo de Entrada",
40
+ "input_type_options": ["PDF", "Texto"],
41
+ "upload_pdf_label": "Subir Transcripción (PDF)",
42
+ "paste_text_label": "Pegar Texto de Transcripción",
43
+ "paste_text_placeholder": "Pega tu texto de transcripción aquí...",
44
+ "guiding_prompt_label": "Instrucciones Guía (Opcional)",
45
+ "guiding_prompt_placeholder": "Instrucciones adicionales para personalizar el resultado. Ejemplos: 'Usa un tono más informal', 'Enfócate solo en la sección X', 'Genera el contenido en inglés', 'Incluye más ejemplos prácticos de programación', etc.",
46
+ "guiding_prompt_info": "Las Instrucciones Guía te permiten proporcionar indicaciones específicas para modificar el contenido generado, como el IDIOMA deseado. Puedes usarlas para cambiar el tono, estilo, enfocarte SOLO en secciones específicas del texto, especificar el idioma de salida (ej., 'Generar en inglés/francés/alemán'), o dar cualquier otra instrucción que ayude a personalizar el resultado final.",
47
+ "duration_label": "Duración Objetivo de la Clase (minutos)",
48
+ "examples_label": "Incluir Ejemplos Prácticos",
49
+ "thinking_model_label": "Usar Modelo de Pensamiento Experimental (Solo Gemini)",
50
+ "submit_button": "Transformar Transcripción",
51
+ "output_label": "Guión de Enseñanza Generado",
52
+ "error_no_pdf": "Error: No se ha subido ningún archivo PDF",
53
+ "error_no_text": "Error: No se ha proporcionado texto",
54
+ "error_prefix": "Error al procesar la transcripción: ",
55
+ "language_selector": "Language / Idioma",
56
+ "show_timestamps": "Mostrar Marcas de Tiempo",
57
+ "hide_timestamps": "Ocultar Marcas de Tiempo"
58
+ }
59
+ }
60
+
61
+ # Language-specific prompt suffixes to append automatically
62
+ LANGUAGE_PROMPTS = {
63
+ "en": "", # Default language doesn't need special instructions
64
+ "es": "Generate the content in Spanish. Genera todo el contenido en español."
65
+ }
66
+
67
+ class TranscriptTransformerApp:
68
+ def __init__(self):
69
+ self.pdf_processor = PDFProcessor()
70
+ self.text_processor = TextProcessor()
71
+ self.current_language = "en" # Default language
72
+ self.last_generated_content = "" # Store the last generated content
73
+ self.content_with_timestamps = "" # Store content with timestamps
74
+ self.content_without_timestamps = "" # Store content without timestamps
75
+
76
+ def process_transcript(self,
77
+ language: str,
78
+ input_type: str,
79
+ file_obj: gr.File = None,
80
+ raw_text_input: str = "",
81
+ initial_prompt: str = "",
82
+ target_duration: int = 30,
83
+ include_examples: bool = True,
84
+ use_gemini: bool = True,
85
+ use_thinking_model: bool = False) -> str:
86
+ """
87
+ Process uploaded transcript and transform it into a teaching transcript
88
+
89
+ Args:
90
+ language: Selected UI language
91
+ input_type: Type of input (PDF or Raw Text)
92
+ file_obj: Uploaded PDF file (if input_type is PDF)
93
+ raw_text_input: Raw text input (if input_type is Raw Text)
94
+ initial_prompt: Additional guiding instructions for the content generation
95
+ target_duration: Target lecture duration in minutes
96
+ include_examples: Whether to include practical examples
97
+ use_gemini: Whether to use Gemini API instead of OpenAI
98
+ use_thinking_model: Requires use_gemini=True
99
+
100
+ Returns:
101
+ str: Generated teaching transcript
102
+ """
103
+ try:
104
+ # Force enable Gemini if thinking model is selected
105
+ if use_thinking_model:
106
+ use_gemini = True
107
+
108
+ self.transformer = TranscriptTransformer(
109
+ use_gemini=use_gemini,
110
+ use_thinking_model=use_thinking_model
111
+ )
112
+
113
+ # Get text based on input type
114
+ if input_type == TRANSLATIONS[language]["input_type_options"][0]: # PDF
115
+ if file_obj is None:
116
+ return TRANSLATIONS[language]["error_no_pdf"]
117
+ raw_text = self.pdf_processor.extract_text(file_obj.name)
118
+ else: # Raw Text
119
+ if not raw_text_input.strip():
120
+ return TRANSLATIONS[language]["error_no_text"]
121
+ raw_text = raw_text_input
122
+
123
+ # Modify initial prompt based on language if no explicit language instruction is given
124
+ modified_prompt = initial_prompt
125
+
126
+ # Check if user has specified a language in the prompt
127
+ language_keywords = ["spanish", "español", "english", "inglés", "french", "francés", "german", "alemán"]
128
+ user_specified_language = any(keyword in initial_prompt.lower() for keyword in language_keywords)
129
+
130
+ # Only append language instruction if user hasn't specified one and we have a non-default language
131
+ if not user_specified_language and language in LANGUAGE_PROMPTS and LANGUAGE_PROMPTS[language]:
132
+ if modified_prompt:
133
+ modified_prompt += " " + LANGUAGE_PROMPTS[language]
134
+ else:
135
+ modified_prompt = LANGUAGE_PROMPTS[language]
136
+
137
+ # Transform to teaching transcript with user guidance
138
+ lecture_transcript = self.transformer.transform_to_lecture(
139
+ text=raw_text,
140
+ target_duration=target_duration,
141
+ include_examples=include_examples,
142
+ initial_prompt=modified_prompt
143
+ )
144
+
145
+ # Store the generated content
146
+ self.content_with_timestamps = lecture_transcript
147
+
148
+ # Create a version without timestamps
149
+ self.content_without_timestamps = self.remove_timestamps(lecture_transcript)
150
+
151
+ # Default: show content with timestamps
152
+ self.last_generated_content = lecture_transcript
153
+
154
+ return lecture_transcript
155
+
156
+ except Exception as e:
157
+ return f"{TRANSLATIONS[language]['error_prefix']}{str(e)}"
158
+
159
+ def remove_timestamps(self, text):
160
+ """Remove all timestamps (e.g., [00:00]) from the text"""
161
+ # Regex to match the timestamp pattern [MM:SS] or [HH:MM:SS]
162
+ return re.sub(r'\[\d{1,2}:\d{2}(:\d{2})?\]', '', text)
163
+
164
+ def toggle_timestamps(self, show_timestamps):
165
+ """Toggle visibility of timestamps in output"""
166
+ if show_timestamps:
167
+ return self.content_with_timestamps
168
+ else:
169
+ return self.content_without_timestamps
170
+
171
+ def update_ui_language(self, language):
172
+ """Update UI elements based on selected language"""
173
+ self.current_language = language
174
+
175
+ translations = TRANSLATIONS[language]
176
+
177
+ return [
178
+ translations["title"],
179
+ translations["subtitle"],
180
+ translations["input_type_label"],
181
+ gr.update(choices=translations["input_type_options"], value=translations["input_type_options"][0]),
182
+ translations["upload_pdf_label"],
183
+ translations["paste_text_label"],
184
+ translations["paste_text_placeholder"],
185
+ translations["guiding_prompt_label"],
186
+ translations["guiding_prompt_placeholder"],
187
+ translations["guiding_prompt_info"],
188
+ translations["duration_label"],
189
+ translations["examples_label"],
190
+ translations["thinking_model_label"],
191
+ translations["submit_button"],
192
+ translations["output_label"]
193
+ ]
194
+
195
+ def launch(self):
196
+ """Launch the Gradio interface"""
197
+ # Get the path to the example PDF
198
+ example_pdf = os.path.join(os.path.dirname(os.path.dirname(__file__)), "data", "sample2.pdf")
199
+
200
+ with gr.Blocks(title=TRANSLATIONS["en"]["title"]) as interface:
201
+ # Header with title and language selector side by side
202
+ with gr.Row():
203
+ with gr.Column(scale=4):
204
+ title_md = gr.Markdown("# " + TRANSLATIONS["en"]["title"])
205
+ with gr.Column(scale=1):
206
+ language_selector = gr.Dropdown(
207
+ choices=["🇺🇸 English", "🇪🇸 Español"],
208
+ value="🇺🇸 English",
209
+ label=TRANSLATIONS["en"]["language_selector"],
210
+ elem_id="language-selector",
211
+ interactive=True
212
+ )
213
+
214
+ # Subtitle
215
+ subtitle_md = gr.Markdown(TRANSLATIONS["en"]["subtitle"])
216
+
217
+ # Input type row
218
+ with gr.Row():
219
+ input_type = gr.Radio(
220
+ choices=TRANSLATIONS["en"]["input_type_options"],
221
+ label=TRANSLATIONS["en"]["input_type_label"],
222
+ value=TRANSLATIONS["en"]["input_type_options"][0]
223
+ )
224
+
225
+ # File/text input columns
226
+ with gr.Row():
227
+ with gr.Column(visible=True) as pdf_column:
228
+ file_input = gr.File(
229
+ label=TRANSLATIONS["en"]["upload_pdf_label"],
230
+ file_types=[".pdf"]
231
+ )
232
+
233
+ with gr.Column(visible=False) as text_column:
234
+ text_input = gr.Textbox(
235
+ label=TRANSLATIONS["en"]["paste_text_label"],
236
+ lines=10,
237
+ placeholder=TRANSLATIONS["en"]["paste_text_placeholder"]
238
+ )
239
+
240
+ # Guiding prompt
241
+ with gr.Row():
242
+ initial_prompt = gr.Textbox(
243
+ label=TRANSLATIONS["en"]["guiding_prompt_label"],
244
+ lines=3,
245
+ value="",
246
+ placeholder=TRANSLATIONS["en"]["guiding_prompt_placeholder"],
247
+ info=TRANSLATIONS["en"]["guiding_prompt_info"]
248
+ )
249
+
250
+ # Settings row
251
+ with gr.Row():
252
+ target_duration = gr.Number(
253
+ label=TRANSLATIONS["en"]["duration_label"],
254
+ value=30,
255
+ minimum=2,
256
+ maximum=60,
257
+ step=1
258
+ )
259
+
260
+ include_examples = gr.Checkbox(
261
+ label=TRANSLATIONS["en"]["examples_label"],
262
+ value=True
263
+ )
264
+
265
+ use_thinking_model = gr.Checkbox(
266
+ label=TRANSLATIONS["en"]["thinking_model_label"],
267
+ value=True
268
+ )
269
+
270
+ # Submit button
271
+ with gr.Row():
272
+ submit_btn = gr.Button(TRANSLATIONS["en"]["submit_button"])
273
+
274
+ # Output area
275
+ output = gr.Textbox(
276
+ label=TRANSLATIONS["en"]["output_label"],
277
+ lines=25
278
+ )
279
+
280
+ # Toggle timestamps button and Copy button
281
+ with gr.Row():
282
+ timestamps_checkbox = gr.Checkbox(
283
+ label=TRANSLATIONS["en"]["show_timestamps"],
284
+ value=True,
285
+ interactive=True
286
+ )
287
+
288
+ # Map language dropdown values to language codes
289
+ lang_map = {
290
+ "🇺🇸 English": "en",
291
+ "🇪🇸 Español": "es"
292
+ }
293
+
294
+ # Handle visibility of input columns based on selection
295
+ def update_input_visibility(language_display, choice):
296
+ language = lang_map.get(language_display, "en")
297
+ return [
298
+ gr.update(visible=(choice == TRANSLATIONS[language]["input_type_options"][0])), # pdf_column
299
+ gr.update(visible=(choice == TRANSLATIONS[language]["input_type_options"][1])) # text_column
300
+ ]
301
+
302
+ # Get language code from display value
303
+ def get_language_code(language_display):
304
+ return lang_map.get(language_display, "en")
305
+
306
+ # Update UI elements when language changes
307
+ def update_ui_with_display(language_display):
308
+ language = get_language_code(language_display)
309
+ self.current_language = language
310
+
311
+ translations = TRANSLATIONS[language]
312
+
313
+ return [
314
+ "# " + translations["title"], # Title with markdown formatting
315
+ translations["subtitle"],
316
+ translations["input_type_label"],
317
+ gr.update(choices=translations["input_type_options"], value=translations["input_type_options"][0], label=translations["input_type_label"]),
318
+ gr.update(label=translations["upload_pdf_label"]),
319
+ gr.update(label=translations["paste_text_label"], placeholder=translations["paste_text_placeholder"]),
320
+ gr.update(label=translations["guiding_prompt_label"], placeholder=translations["guiding_prompt_placeholder"], info=translations["guiding_prompt_info"]),
321
+ gr.update(label=translations["duration_label"]),
322
+ gr.update(label=translations["examples_label"]),
323
+ gr.update(label=translations["thinking_model_label"]),
324
+ translations["submit_button"],
325
+ gr.update(label=translations["output_label"]),
326
+ gr.update(label=translations["show_timestamps"])
327
+ ]
328
+
329
+ input_type.change(
330
+ fn=update_input_visibility,
331
+ inputs=[language_selector, input_type],
332
+ outputs=[pdf_column, text_column]
333
+ )
334
+
335
+ # Language change event
336
+ language_selector.change(
337
+ fn=update_ui_with_display,
338
+ inputs=language_selector,
339
+ outputs=[
340
+ title_md, subtitle_md,
341
+ input_type, input_type,
342
+ file_input, text_input,
343
+ initial_prompt,
344
+ target_duration, include_examples, use_thinking_model,
345
+ submit_btn, output,
346
+ timestamps_checkbox
347
+ ]
348
+ )
349
+
350
+ # Toggle timestamps event
351
+ timestamps_checkbox.change(
352
+ fn=self.toggle_timestamps,
353
+ inputs=[timestamps_checkbox],
354
+ outputs=[output]
355
+ )
356
+
357
+ # Set up submission logic with language code conversion
358
+ submit_btn.click(
359
+ fn=lambda lang_display, *args: self.process_transcript(get_language_code(lang_display), *args),
360
+ inputs=[
361
+ language_selector,
362
+ input_type,
363
+ file_input,
364
+ text_input,
365
+ initial_prompt,
366
+ target_duration,
367
+ include_examples,
368
+ use_thinking_model
369
+ ],
370
+ outputs=output
371
+ )
372
+
373
+ # Example for PDF input
374
+ gr.Examples(
375
+ examples=[[example_pdf, "", "", 30, True, True]],
376
+ inputs=[file_input, text_input, initial_prompt, target_duration, include_examples, use_thinking_model]
377
+ )
378
+
379
+ interface.launch(share=True)
380
+
381
+ if __name__ == "__main__":
382
+ app = TranscriptTransformerApp()
383
+ app.launch()
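The display-name-to-language-code mapping wired into the UI above can be exercised in isolation. A minimal sketch — the `TRANSLATIONS` entries here are trimmed-down illustrative stand-ins, not the app's full table:

```python
# Hypothetical, trimmed-down stand-ins for the app's TRANSLATIONS and lang_map.
TRANSLATIONS = {
    "en": {"submit_button": "Generate"},
    "es": {"submit_button": "Generar"},
}
LANG_MAP = {"🇺🇸 English": "en", "🇪🇸 Español": "es"}

def get_language_code(language_display: str) -> str:
    # Unknown display values fall back to English, mirroring the app.
    return LANG_MAP.get(language_display, "en")

def submit_label(language_display: str) -> str:
    # Look up a translated UI label for the selected language.
    return TRANSLATIONS[get_language_code(language_display)]["submit_button"]
```

The same lookup pattern backs every `gr.update(...)` call in `update_ui_with_display`: resolve the language code once, then index into the translation table.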
src/core/__init__.py ADDED
File without changes
src/core/transformer.py ADDED
@@ -0,0 +1,698 @@
1
+ import os
2
+ import logging
3
+ import json
4
+ import time
5
+ from typing import List, Dict, Optional, Callable, Any
6
+ import openai
7
+ from src.utils.text_processor import TextProcessor
8
+
9
+ # Configure logging
10
+ logging.basicConfig(level=logging.INFO)
11
+ logger = logging.getLogger(__name__)
12
+
13
+ class WordCountError(Exception):
14
+ """Raised when word count requirements are not met"""
15
+ pass
16
+
17
+ class TranscriptTransformer:
18
+ """Transforms conversational transcripts into teaching material using LLM"""
19
+
20
+ MAX_RETRIES = 3 # Initial retries for content generation
21
+ EXTENDED_RETRIES = 3 # Additional retries with longer waits
22
+ EXTENDED_RETRY_DELAYS = [5, 10, 15] # Wait times in seconds for extended retries
23
+ CHUNK_SIZE = 6000 # Target words per chunk
24
+ LARGE_DEVIATION_THRESHOLD = 0.20 # 20% maximum deviation
25
+ MAX_TOKENS = 64000 # Absolute cap based on the 64k output-token limit
26
+
27
+ def __init__(self, use_gemini: bool = True, use_thinking_model: bool = False):
28
+ """Initialize the transformer with selected LLM client"""
29
+ self.text_processor = TextProcessor()
30
+ self.use_gemini = use_gemini
31
+ self.use_thinking_model = use_thinking_model
32
+
33
+ if use_thinking_model:
34
+ if not use_gemini:
35
+ raise ValueError("Thinking model requires use_gemini=True")
36
+
37
+ logger.info("Initializing with Gemini Flash Thinking API")
38
+ self.openai_client = openai.OpenAI(
39
+ api_key=os.getenv('GEMINI_API_KEY'),
40
+ base_url="https://generativelanguage.googleapis.com/v1alpha"
41
+ )
42
+ self.model_name = "gemini-2.0-flash-thinking-exp-01-21"
43
+ elif use_gemini:
44
+ logger.info("Initializing with Gemini API")
45
+ self.openai_client = openai.OpenAI(
46
+ api_key=os.getenv('GEMINI_API_KEY'),
47
+ base_url="https://generativelanguage.googleapis.com/v1beta"
48
+ )
49
+ self.model_name = "gemini-2.0-flash-exp"
50
+ else:
51
+ logger.info("Initializing with OpenAI API")
52
+ self.openai_client = openai.OpenAI(
53
+ api_key=os.getenv('OPENAI_API_KEY')
54
+ )
55
+ self.model_name = "gpt-3.5-turbo"
56
+
57
+ # Target word counts
58
+ self.words_per_minute = 130 # Average speaking rate
59
+
60
+ def _api_call_with_enhanced_retries(self, call_func: Callable[[], Any]) -> Any:
61
+ """
62
+ Wrapper function for API calls with enhanced retry logic
63
+
64
+ Args:
65
+ call_func: Function that makes the actual API call
66
+
67
+ Returns:
68
+ The result of the successful API call
69
+
70
+ Raises:
71
+ Exception: If all retries fail
72
+ """
73
+ # Initial retries (already handled by openai client)
74
+ try:
75
+ return call_func()
76
+ except Exception as e:
77
+ error_str = str(e)
78
+
79
+ # Check if it's a quota error (429)
80
+ if "429" in error_str or "Too Many Requests" in error_str or "RESOURCE_EXHAUSTED" in error_str:
81
+ logger.warning(f"Quota error detected: {error_str}")
82
+ logger.info("Starting extended retries with longer waits...")
83
+
84
+ # Extended retries with longer waits
85
+ for i in range(self.EXTENDED_RETRIES):
86
+ wait_time = self.EXTENDED_RETRY_DELAYS[i]
87
+ logger.info(f"Extended retry {i+1}/{self.EXTENDED_RETRIES}: Waiting {wait_time} seconds before retry")
88
+ time.sleep(wait_time)
89
+
90
+ try:
91
+ return call_func()
92
+ except Exception as retry_error:
93
+ # If last retry, re-raise
94
+ if i == self.EXTENDED_RETRIES - 1:
95
+ logger.error(f"All extended retries failed: {str(retry_error)}")
96
+ raise
97
+ # Otherwise log and continue to next retry
98
+ logger.warning(f"Extended retry {i+1} failed: {str(retry_error)}")
99
+ else:
100
+ # Not a quota error, re-raise
101
+ raise
102
+
103
+ def _validate_word_count(self, total_words: int, target_words: int, min_words: int, max_words: int) -> None:
104
+ """Validate word count with flexible thresholds and log warnings/errors"""
105
+ deviation = abs(total_words - target_words) / target_words
106
+
107
+ if deviation > self.LARGE_DEVIATION_THRESHOLD:
108
+ logger.error(
109
+ f"Word count {total_words} significantly outside target range "
110
+ f"({min_words}-{max_words}). Deviation: {deviation:.2%}"
111
+ )
112
+ elif total_words < min_words or total_words > max_words:
113
+ logger.warning(
114
+ f"Word count {total_words} slightly outside target range "
115
+ f"({min_words}-{max_words}). Deviation: {deviation:.2%}"
116
+ )
117
+
118
+ def transform_to_lecture(self,
119
+ text: str,
120
+ target_duration: int = 30,
121
+ include_examples: bool = True,
122
+ initial_prompt: Optional[str] = None) -> str:
123
+ """
124
+ Transform input text into a structured teaching transcript
125
+
126
+ Args:
127
+ text: Input transcript text
128
+ target_duration: Target lecture duration in minutes
129
+ include_examples: Whether to include practical examples
130
+ initial_prompt: Additional user instructions to guide the generation
131
+
132
+ Returns:
133
+ str: Generated teaching transcript, regardless of word count validation
134
+ """
135
+ logger.info(f"Starting transformation for {target_duration} minute lecture")
136
+
137
+ # Clean and preprocess text
138
+ cleaned_text = self.text_processor.clean_text(text)
139
+ input_words = self.text_processor.count_words(cleaned_text)
140
+ logger.info(f"Input text cleaned. Word count: {input_words}")
141
+
142
+ # Calculate target word count
143
+ target_words = self.words_per_minute * target_duration
144
+ min_words = int(target_words * 0.95) # Minimum 95% of target
145
+ max_words = int(target_words * 1.05) # Maximum 105% of target
146
+
147
+ logger.info(f"Target word count: {target_words} (min: {min_words}, max: {max_words})")
148
+
149
+ # Generate detailed lecture structure with topics
150
+ structure_data = self._generate_detailed_structure(
151
+ text=cleaned_text,
152
+ target_duration=target_duration,
153
+ initial_prompt=initial_prompt
154
+ )
155
+ logger.info("Detailed lecture structure generated")
156
+ logger.info(f"Topics identified: {[t['title'] for t in structure_data['topics']]}")
157
+
158
+ # Calculate section word counts
159
+ section_words = {
160
+ 'intro': int(target_words * 0.1),
161
+ 'main': int(target_words * 0.7),
162
+ 'practical': int(target_words * 0.15),
163
+ 'summary': int(target_words * 0.05)
164
+ }
165
+
166
+ try:
167
+ logger.info("Generating content by sections with topic tracking")
168
+
169
+ # Introduction with learning objectives and topic preview
170
+ intro = self._generate_section(
171
+ 'introduction',
172
+ structure_data,
173
+ cleaned_text,
174
+ section_words['intro'],
175
+ include_examples,
176
+ is_first=True,
177
+ initial_prompt=initial_prompt
178
+ )
179
+ intro_words = self.text_processor.count_words(intro)
180
+ logger.info(f"Introduction generated: {intro_words} words")
181
+
182
+ # Track context for coherence
183
+ context = {
184
+ 'current_section': 'introduction',
185
+ 'covered_topics': [],
186
+ 'pending_topics': [t['title'] for t in structure_data['topics']],
187
+ 'key_terms': set(),
188
+ 'current_narrative': intro[-1000:], # Last 1000 characters for context
189
+ 'learning_objectives': structure_data['learning_objectives']
190
+ }
191
+
192
+ # Main content with topic progression
193
+ main_content = self._generate_main_content(
194
+ structure_data,
195
+ cleaned_text,
196
+ section_words['main'],
197
+ include_examples,
198
+ context,
199
+ initial_prompt=initial_prompt
200
+ )
201
+ main_words = self.text_processor.count_words(main_content)
202
+ logger.info(f"Main content generated: {main_words} words")
203
+
204
+ # Update context after main content
205
+ context['current_section'] = 'main'
206
+ context['current_narrative'] = main_content[-1000:]
207
+
208
+ # Practical applications tied to main topics
209
+ practical = self._generate_section(
210
+ 'practical',
211
+ structure_data,
212
+ cleaned_text,
213
+ section_words['practical'],
214
+ include_examples,
215
+ context=context,
216
+ initial_prompt=initial_prompt
217
+ )
218
+ practical_words = self.text_processor.count_words(practical)
219
+ logger.info(f"Practical section generated: {practical_words} words")
220
+
221
+ # Update context for summary
222
+ context['current_section'] = 'practical'
223
+ context['current_narrative'] = practical[-500:]
224
+
225
+ # Summary with topic reinforcement
226
+ summary = self._generate_section(
227
+ 'summary',
228
+ structure_data,
229
+ cleaned_text,
230
+ section_words['summary'],
231
+ include_examples,
232
+ is_last=True,
233
+ context=context,
234
+ initial_prompt=initial_prompt
235
+ )
236
+ summary_words = self.text_processor.count_words(summary)
237
+ logger.info(f"Summary generated: {summary_words} words")
238
+
239
+ # Combine all sections
240
+ full_content = f"{intro}\n\n{main_content}\n\n{practical}\n\n{summary}"
241
+ total_words = self.text_processor.count_words(full_content)
242
+ logger.info(f"Total content generated: {total_words} words")
243
+
244
+ # Log warnings/errors but don't raise exceptions
245
+ self._validate_word_count(total_words, target_words, min_words, max_words)
246
+
247
+ # Validate coherence
248
+ self._validate_coherence(full_content, structure_data)
249
+ logger.info("Content coherence validated")
250
+
251
+ return full_content
252
+
253
+ except Exception as e:
254
+ logger.error(f"Error during content generation: {str(e)}")
255
+ # If we have partial content, return it
256
+ if 'full_content' in locals():
257
+ logger.warning("Returning partial content despite errors")
258
+ return full_content
259
+ raise # Re-raise only if we have no content at all
260
+
261
+ def _generate_detailed_structure(self,
262
+ text: str,
263
+ target_duration: int,
264
+ initial_prompt: Optional[str] = None) -> Dict:
265
+ """Generate detailed lecture structure with topics and objectives"""
266
+ logger.info("Generating detailed lecture structure")
267
+
268
+ user_instructions = f"\nAdditional user instructions:\n{initial_prompt}\n" if initial_prompt else ""
269
+
270
+ prompt = f"""
271
+ You are an expert educator creating a detailed lecture outline.
272
+ {user_instructions}
273
+ Analyze this transcript and create a structured JSON output with the following:
274
+
275
+ 1. Title of the lecture
276
+ 2. 3-5 clear learning objectives
277
+ 3. 3-4 main topics, each with:
278
+ - Title
279
+ - Key concepts
280
+ - Subtopics
281
+ - Time allocation (in minutes)
282
+ - Connection to learning objectives
283
+ 4. Practical application ideas
284
+ 5. Key terms to track
285
+
286
+ IMPORTANT: Response MUST be valid JSON. Format exactly like this, with no additional text:
287
+ {{
288
+ "title": "string",
289
+ "learning_objectives": ["string"],
290
+ "topics": [
291
+ {{
292
+ "title": "string",
293
+ "key_concepts": ["string"],
294
+ "subtopics": ["string"],
295
+ "duration_minutes": number,
296
+ "objective_links": [number]
297
+ }}
298
+ ],
299
+ "practical_applications": ["string"],
300
+ "key_terms": ["string"]
301
+ }}
302
+
303
+ Target duration: {target_duration} minutes
304
+
305
+ Transcript excerpt:
306
+ {text[:2000]}
307
+ """
308
+
309
+ try:
310
+ # Common parameters
311
+ params = {
312
+ "model": self.model_name,
313
+ "messages": [
314
+ {"role": "system", "content": "You are an expert educator. Output ONLY valid JSON, no other text."},
315
+ {"role": "user", "content": prompt}
316
+ ],
317
+ "temperature": 0.7,
318
+ "max_tokens": self.MAX_TOKENS if self.use_thinking_model else 4000
319
+ }
320
+
321
+ # Add thinking config if using experimental model
322
+ if self.use_thinking_model:
323
+ params["extra_body"] = {
324
+ "thinking_config": {
325
+ "include_thoughts": True
326
+ }
327
+ }
328
+
329
+ # Use the enhanced retry wrapper for API call
330
+ def api_call():
331
+ return self.openai_client.chat.completions.create(**params)
332
+
333
+ response = self._api_call_with_enhanced_retries(api_call)
334
+ content = response.choices[0].message.content.strip()
335
+ logger.debug(f"Raw structure response: {content}")
336
+
337
+ try:
338
+ structure_data = json.loads(content)
339
+ logger.info("Structure data parsed successfully")
340
+ return structure_data
341
+ except json.JSONDecodeError as e:
342
+ logger.warning(f"Failed to parse JSON directly: {str(e)}")
343
+
344
+ # Try to extract JSON if it's wrapped in other text
345
+ import re
346
+ json_match = re.search(r'({[\s\S]*})', content)
347
+ if json_match:
348
+ try:
349
+ structure_data = json.loads(json_match.group(1))
350
+ logger.info("Structure data extracted and parsed successfully")
351
+ return structure_data
352
+ except json.JSONDecodeError:
353
+ logger.warning("Failed to parse extracted JSON")
354
+
355
+ # If both attempts fail, use fallback structure
356
+ logger.warning("Using fallback structure")
357
+ return self._generate_fallback_structure(text, target_duration)
358
+
359
+ except Exception as e:
360
+ logger.error(f"Error generating structure: {str(e)}")
361
+ # Fallback in case of any error
362
+ return self._generate_fallback_structure(text, target_duration)
363
+
364
+ def _generate_fallback_structure(self, text: str, target_duration: int) -> Dict:
365
+ """Generate a simplified fallback structure in case of parsing failures"""
366
+ logger.info("Generating fallback structure")
367
+
368
+ params = {
369
+ "model": self.model_name,
370
+ "messages": [
371
+ {"role": "system", "content": "You are an expert educator. Output ONLY valid JSON, no other text."},
372
+ {"role": "user", "content": f"""
373
+ Create a simplified lecture outline based on this transcript.
374
+ Format as JSON with:
375
+ - title
376
+ - 3 learning objectives
377
+ - 2 main topics with title, key concepts, subtopics
378
+ - 2 practical applications
379
+ - 3 key terms
380
+
381
+ Target duration: {target_duration} minutes
382
+
383
+ Transcript excerpt:
384
+ {text[:2000]}
385
+ """}
386
+ ],
387
+ "temperature": 0.5,
388
+ "max_tokens": 2000
389
+ }
390
+
391
+ try:
392
+ # Use the enhanced retry wrapper for API call
393
+ def api_call():
394
+ return self.openai_client.chat.completions.create(**params)
395
+
396
+ response = self._api_call_with_enhanced_retries(api_call)
397
+ content = response.choices[0].message.content.strip()
398
+
399
+ try:
400
+ return json.loads(content)
401
+ except json.JSONDecodeError:
402
+ # Last resort fallback if everything fails
403
+ return {
404
+ "title": "Lecture on Transcript Topic",
405
+ "learning_objectives": ["Understand key concepts", "Apply knowledge", "Evaluate outcomes"],
406
+ "topics": [
407
+ {
408
+ "title": "Main Topic 1",
409
+ "key_concepts": ["Concept 1", "Concept 2"],
410
+ "subtopics": ["Subtopic 1", "Subtopic 2"],
411
+ "duration_minutes": target_duration // 2,
412
+ "objective_links": [1, 2]
413
+ },
414
+ {
415
+ "title": "Main Topic 2",
416
+ "key_concepts": ["Concept 3", "Concept 4"],
417
+ "subtopics": ["Subtopic 3", "Subtopic 4"],
418
+ "duration_minutes": target_duration // 2,
419
+ "objective_links": [2, 3]
420
+ }
421
+ ],
422
+ "practical_applications": ["Application 1", "Application 2"],
423
+ "key_terms": ["Term 1", "Term 2", "Term 3"]
424
+ }
425
+ except Exception as e:
426
+ logger.error(f"Error generating fallback structure: {str(e)}")
427
+ # Hardcoded last resort fallback
428
+ return {
429
+ "title": "Lecture on Transcript Topic",
430
+ "learning_objectives": ["Understand key concepts", "Apply knowledge", "Evaluate outcomes"],
431
+ "topics": [
432
+ {
433
+ "title": "Main Topic 1",
434
+ "key_concepts": ["Concept 1", "Concept 2"],
435
+ "subtopics": ["Subtopic 1", "Subtopic 2"],
436
+ "duration_minutes": target_duration // 2,
437
+ "objective_links": [1, 2]
438
+ },
439
+ {
440
+ "title": "Main Topic 2",
441
+ "key_concepts": ["Concept 3", "Concept 4"],
442
+ "subtopics": ["Subtopic 3", "Subtopic 4"],
443
+ "duration_minutes": target_duration // 2,
444
+ "objective_links": [2, 3]
445
+ }
446
+ ],
447
+ "practical_applications": ["Application 1", "Application 2"],
448
+ "key_terms": ["Term 1", "Term 2", "Term 3"]
449
+ }
450
+
451
+ def _generate_section(self,
452
+ section_type: str,
453
+ structure_data: Dict,
454
+ original_text: str,
455
+ target_words: int,
456
+ include_examples: bool,
457
+ context: Dict = None,
458
+ is_first: bool = False,
459
+ is_last: bool = False,
460
+ initial_prompt: Optional[str] = None) -> str:
461
+ """Generate a specific section of the lecture"""
462
+ logger.info(f"Generating {section_type} section (target: {target_words} words)")
463
+
464
+ # Calculate timing markers
465
+ if section_type == 'introduction':
466
+ time_marker = '[00:00]'
467
+ elif section_type == 'summary':
468
+ duration_mins = sum(topic.get('duration_minutes', 5) for topic in structure_data['topics'])
469
+ # Ensure the marker minute value is an integer and never below 5
470
+ adjusted_mins = max(5, int(duration_mins - 5))
471
+ time_marker = f'[{adjusted_mins:02d}:00]'
472
+ else:
473
+ # For other sections, use appropriate time markers
474
+ time_marker = '[XX:XX]' # Will be replaced within the prompt
475
+
476
+ user_instructions = f"\nAdditional user instructions:\n{initial_prompt}\n" if initial_prompt else ""
477
+
478
+ # Base prompt with context-specific formatting
479
+ prompt = f"""
480
+ You are creating a {section_type} section for a {time_marker} teaching lecture on "{structure_data['title']}".
481
+ {user_instructions}
482
+ Target word count: {target_words} words (very important)
483
+
484
+ Learning objectives:
485
+ {', '.join(structure_data['learning_objectives'])}
486
+
487
+ Key terms:
488
+ {', '.join(structure_data['key_terms'])}
489
+
490
+ Original source:
491
+ {original_text[:500]}...
492
+ """
493
+
494
+ # Section-specific instructions
495
+ if section_type == 'introduction':
496
+ prompt += """
497
+ - Start with an engaging hook
498
+ - Present clear learning objectives
499
+ - Preview main topics
500
+ - Set expectations for the lecture
501
+ """
502
+ elif section_type == 'main':
503
+ prompt += f"""
504
+ Discuss one main topic in depth.
505
+
506
+ Topic: {context['current_topic']['title']}
507
+ Key concepts: {', '.join(context['current_topic']['key_concepts'])}
508
+ Subtopics: {', '.join(context['current_topic']['subtopics'])}
509
+
510
+ - Start with appropriate time marker
511
+ - Explain key concepts clearly
512
+ - Include real-world examples
513
+ - Connect to learning objectives
514
+ - Use appropriate time markers within the section
515
+ """
516
+ elif section_type == 'practical':
517
+ prompt += f"""
518
+ Create a practical applications section with:
519
+
520
+ - Start with appropriate time marker
521
+ - 2-3 practical examples or case studies
522
+ - Clear connections to the main topics
523
+ - Interactive elements (questions, exercises)
524
+
525
+ Practical applications to cover:
526
+ {', '.join(structure_data['practical_applications'])}
527
+ """
528
+ elif section_type == 'summary':
529
+ prompt += """
530
+ Create a concise summary:
531
+
532
+ - Start with appropriate time marker
533
+ - Reinforce key learning points
534
+ - Brief recap of main topics
535
+ - Call to action or follow-up suggestions
536
+ """
537
+
538
+ # Context-specific content
539
+ if context:
540
+ prompt += f"""
541
+
542
+ Previously covered topics:
543
+ {', '.join(context['covered_topics'])}
544
+
545
+ Pending topics:
546
+ {', '.join(context['pending_topics'])}
547
+
548
+ Recent narrative context:
549
+ {context['current_narrative']}
550
+ """
551
+
552
+ # First/last section specific instructions
553
+ if is_first:
554
+ prompt += """
555
+
556
+ This is the FIRST section of the lecture. Make it engaging and set the tone.
557
+ """
558
+ elif is_last:
559
+ prompt += """
560
+
561
+ This is the FINAL section of the lecture. Ensure proper closure and reinforcement.
562
+ """
563
+
564
+ # Add section-specific time markers for formatted output
565
+ if section_type != 'introduction':
566
+ prompt += """
567
+
568
+ IMPORTANT: Include appropriate time markers [MM:SS] throughout the section.
569
+ """
570
+
571
+ try:
572
+ # Prepare API call parameters
573
+ params = {
574
+ "model": self.model_name,
575
+ "messages": [
576
+ {"role": "system", "content": "You are an expert educator creating a teaching script."},
577
+ {"role": "user", "content": prompt}
578
+ ],
579
+ "temperature": 0.7,
580
+ "max_tokens": self._calculate_max_tokens(section_type, target_words)
581
+ }
582
+
583
+ # Add thinking config if using experimental model
584
+ if self.use_thinking_model:
585
+ params["extra_body"] = {
586
+ "thinking_config": {
587
+ "include_thoughts": True
588
+ }
589
+ }
590
+
591
+ # Use the enhanced retry wrapper for API call
592
+ def api_call():
593
+ return self.openai_client.chat.completions.create(**params)
594
+
595
+ response = self._api_call_with_enhanced_retries(api_call)
596
+ content = response.choices[0].message.content.strip()
597
+
598
+ # Validate output length
599
+ content_words = self.text_processor.count_words(content)
600
+ logger.info(f"Section generated: {content_words} words")
601
+
602
+ return content
603
+
604
+ except Exception as e:
605
+ logger.error(f"Error during content generation: {str(e)}")
606
+ # Provide a minimal fallback content to avoid complete failure
607
+ return f"{time_marker} {section_type.capitalize()} (Error during generation)\n\nWe apologize, but there was an error generating this section."
608
+
609
+ def _calculate_max_tokens(self, section_type: str, target_words: int) -> int:
610
+ """Calculate appropriate max_tokens based on section and model"""
611
+ # 1 token ≈ 4 characters (1 word ≈ 1.33 tokens)
612
+ base_tokens = int(target_words * 1.5) # Headroom for formatting
613
+
614
+ if self.use_thinking_model:
615
+ # Allow up to 64k tokens, but cap each section
616
+ section_limits = {
617
+ 'introduction': 8000,
618
+ 'main': 32000,
619
+ 'practical': 16000,
620
+ 'summary': 8000
621
+ }
622
+ return min(base_tokens * 2, section_limits.get(section_type, 16000))
623
+
624
+ # Limits for the other models
625
+ return min(base_tokens + 1000, self.MAX_TOKENS)
626
+
627
+ def _generate_main_content(self,
628
+ structure_data: Dict,
629
+ original_text: str,
630
+ target_words: int,
631
+ include_examples: bool,
632
+ context: Dict,
633
+ initial_prompt: Optional[str] = None) -> str:
634
+ """Generate main content with topic progression"""
635
+ logger.info(f"Generating main content (target: {target_words} words)")
636
+
637
+ # Calculate words per topic based on their duration ratios
638
+ total_duration = sum(t['duration_minutes'] for t in structure_data['topics'])
639
+ # Avoid division by zero
640
+ total_duration = total_duration if total_duration > 0 else 1
641
+
642
+ topic_words = {}
643
+
644
+ for topic in structure_data['topics']:
645
+ ratio = topic['duration_minutes'] / total_duration
646
+ topic_words[topic['title']] = int(target_words * ratio)
647
+
648
+ logger.info(f"Topic word allocations: {topic_words}")
649
+
650
+ # Generate content for each topic
651
+ topic_contents = []
652
+
653
+ for topic in structure_data['topics']:
654
+ topic_target = topic_words[topic['title']]
655
+
656
+ # Update context for topic
657
+ context['current_topic'] = topic
658
+ if topic['title'] in context['pending_topics']:
659
+ context['covered_topics'].append(topic['title'])
660
+ context['pending_topics'].remove(topic['title'])
661
+ context['key_terms'].update(topic['key_concepts'])
662
+
663
+ # Generate topic content
664
+ topic_content = self._generate_section(
665
+ f"main_topic_{topic['title']}",
666
+ structure_data,
667
+ original_text,
668
+ topic_target,
669
+ include_examples,
670
+ context=context,
671
+ initial_prompt=initial_prompt
672
+ )
673
+
674
+ topic_contents.append(topic_content)
675
+ context['current_narrative'] = topic_content[-1000:]
676
+
677
+ return "\n\n".join(topic_contents)
678
+
679
+ def _validate_coherence(self, content: str, structure_data: Dict):
680
+ """Validate content coherence against structure"""
681
+ logger.info("Validating content coherence")
682
+
683
+ # Check for learning objectives
684
+ for objective in structure_data['learning_objectives']:
685
+ if not any(term.lower() in content.lower() for term in objective.split()):
686
+ logger.warning(f"Learning objective not well covered: {objective}")
687
+
688
+ # Check for key terms
689
+ for term in structure_data['key_terms']:
690
+ if content.lower().count(term.lower()) < 2:
691
+ logger.warning(f"Key term underutilized: {term}")
692
+
693
+ # Check topic coverage
694
+ for topic in structure_data['topics']:
695
+ if not any(concept.lower() in content.lower() for concept in topic['key_concepts']):
696
+ logger.warning(f"Topic concepts not well covered: {topic['title']}")
697
+
698
+ logger.info("Coherence validation complete")
src/utils/__init__.py ADDED
File without changes
src/utils/pdf_processor.py ADDED
@@ -0,0 +1,59 @@
1
+ import PyPDF2
2
+ from typing import Optional
3
+
4
+ class PDFProcessor:
5
+ """Handles PDF file processing and text extraction"""
6
+
7
+ def __init__(self):
8
+ """Initialize PDF processor"""
9
+ pass
10
+
11
+ def extract_text(self, pdf_path: str) -> str:
12
+ """
13
+ Extract text content from a PDF file
14
+
15
+ Args:
16
+ pdf_path: Path to the PDF file
17
+
18
+ Returns:
19
+ str: Extracted text content
20
+
21
+ Raises:
22
+ FileNotFoundError: If PDF file doesn't exist
23
+ PyPDF2.errors.PdfReadError: If PDF file is invalid or corrupted
24
+ """
25
+ try:
26
+ with open(pdf_path, 'rb') as file:
27
+ # Create PDF reader object
28
+ reader = PyPDF2.PdfReader(file)
29
+
30
+ # Extract text from all pages
31
+ text = ""
32
+ for page in reader.pages:
33
+ text += page.extract_text() + "\n"
34
+
35
+ return text.strip()
36
+
37
+ except FileNotFoundError:
38
+ raise FileNotFoundError(f"PDF file not found: {pdf_path}")
39
+ except PyPDF2.errors.PdfReadError as e:
40
+ raise PyPDF2.errors.PdfReadError(f"Error reading PDF file: {str(e)}")
41
+ except Exception as e:
42
+ raise Exception(f"Unexpected error processing PDF: {str(e)}")
43
+
44
+ def get_metadata(self, pdf_path: str) -> dict:
45
+ """
46
+ Extract metadata from PDF file
47
+
48
+ Args:
49
+ pdf_path: Path to the PDF file
50
+
51
+ Returns:
52
+ dict: PDF metadata
53
+ """
54
+ try:
55
+ with open(pdf_path, 'rb') as file:
56
+ reader = PyPDF2.PdfReader(file)
57
+ return reader.metadata
58
+ except Exception as e:
59
+ return {"error": str(e)}
src/utils/text_processor.py ADDED
@@ -0,0 +1,84 @@
1
+ import re
2
+ from typing import List, Optional
3
+
4
+ class TextProcessor:
5
+ """Handles text preprocessing and cleaning"""
6
+
7
+ def __init__(self):
8
+ """Initialize text processor"""
9
+ self.sentence_endings = r'[.!?]'
10
+ self.word_pattern = r'\b\w+\b'
11
+
12
+ def clean_text(self, text: str) -> str:
13
+ """
14
+ Clean and normalize text
15
+
16
+ Args:
17
+ text: Input text to clean
18
+
19
+ Returns:
20
+ str: Cleaned text
21
+ """
22
+ # Remove extra whitespace
23
+ text = ' '.join(text.split())
24
+
25
+ # Fix common OCR errors
26
+ text = self._fix_ocr_errors(text)
27
+
28
+ # Normalize punctuation
29
+ text = self._normalize_punctuation(text)
30
+
31
+ return text.strip()
32
+
33
+ def split_into_sections(self, text: str) -> List[str]:
34
+ """
35
+ Split text into logical sections based on content
36
+
37
+ Args:
38
+ text: Input text to split
39
+
40
+ Returns:
41
+ List[str]: List of text sections
42
+ """
43
+ # Split on double newlines or section markers
44
+ sections = re.split(r'\n\s*\n|\n(?=[A-Z][^a-z]*:)', text)
45
+ return [s.strip() for s in sections if s.strip()]
46
+
47
+ def count_words(self, text: str) -> int:
48
+ """
49
+ Count words in text
50
+
51
+ Args:
52
+ text: Input text
53
+
54
+ Returns:
55
+ int: Word count
56
+ """
57
+ words = re.findall(self.word_pattern, text)
58
+ return len(words)
59
+
60
+ def _fix_ocr_errors(self, text: str) -> str:
61
+ """Fix common OCR errors"""
62
+ replacements = {
63
+ r'[|]': 'I', # Vertical bar to I
64
+ r'0': 'O', # Zero to O where appropriate
65
+ r'1': 'l', # One to l where appropriate
66
+ r'\s+': ' ' # Multiple spaces to single space
67
+ }
68
+
69
+ for pattern, replacement in replacements.items():
70
+ text = re.sub(pattern, replacement, text)
71
+ return text
72
+
73
+ def _normalize_punctuation(self, text: str) -> str:
74
+ """Normalize punctuation marks"""
75
+ # Replace multiple periods with single period
76
+ text = re.sub(r'\.{2,}', '.', text)
77
+
78
+ # Add space after punctuation if missing
79
+ text = re.sub(r'([.!?])([A-Z])', r'\1 \2', text)
80
+
81
+ # Fix spacing around punctuation
82
+ text = re.sub(r'\s+([.!?,])', r'\1', text)
83
+
84
+ return text
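As a sanity check, the three normalization substitutions above compose like this — a standalone sketch of the same regexes, alongside the word-count pattern `TextProcessor` uses:

```python
import re

def normalize_punctuation(text: str) -> str:
    # Collapse runs of two or more periods into one.
    text = re.sub(r'\.{2,}', '.', text)
    # Insert a missing space after sentence-ending punctuation before a capital.
    text = re.sub(r'([.!?])([A-Z])', r'\1 \2', text)
    # Remove stray whitespace before punctuation.
    text = re.sub(r'\s+([.!?,])', r'\1', text)
    return text

def count_words(text: str) -> int:
    # Same \b\w+\b pattern TextProcessor uses for word counting.
    return len(re.findall(r'\b\w+\b', text))
```

Order matters: collapsing dot runs first means the space-insertion rule only ever sees single periods.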