rogeliorichman committed on
Commit b2b4dfa · verified · 1 Parent(s): 2eea992

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+data/sample2.pdf filter=lfs diff=lfs merge=lfs -text
.github/workflows/update_space.yml ADDED
@@ -0,0 +1,28 @@
+name: Run Python script
+
+on:
+  push:
+    branches:
+      - main
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v2
+
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: '3.9'
+
+      - name: Install Gradio
+        run: python -m pip install gradio
+
+      - name: Log in to Hugging Face
+        run: python -c 'import huggingface_hub; huggingface_hub.login(token="${{ secrets.hf_token }}")'
+
+      - name: Deploy to Spaces
+        run: gradio deploy
.gitignore ADDED
@@ -0,0 +1,62 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Virtual Environment
+venv/
+ENV/
+env/
+.env
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+.DS_Store
+
+# Testing
+.coverage
+htmlcov/
+.tox/
+.nox/
+.pytest_cache/
+
+# Logs
+*.log
+logs/
+
+# Local development
+.env.local
+.env.development.local
+.env.test.local
+.env.production.local
+
+# API Keys
+.env
+*.pem
+*.key
+
+# Gradio
+.gradio/
+
+# private file
+/data/sample3.pdf
CONTRIBUTING.md ADDED
@@ -0,0 +1,99 @@
+# Contributing to AI LectureForge
+
+First off, thank you for considering contributing to AI LectureForge! It's people like you that make AI LectureForge such a great tool.
+
+## Code of Conduct
+
+By participating in this project, you are expected to uphold our Code of Conduct:
+
+- Use welcoming and inclusive language
+- Be respectful of differing viewpoints and experiences
+- Gracefully accept constructive criticism
+- Focus on what is best for the community
+- Show empathy towards other community members
+
+## How Can I Contribute?
+
+### Reporting Bugs
+
+Before creating a bug report, please check the existing issues; the problem may already be reported. When you do create a bug report, please include as many details as possible:
+
+* Use a clear and descriptive title
+* Describe the exact steps which reproduce the problem
+* Provide specific examples to demonstrate the steps
+* Describe the behavior you observed after following the steps
+* Explain which behavior you expected to see instead and why
+* Include screenshots if possible
+
+### Suggesting Enhancements
+
+If you have a suggestion for the project, we'd love to hear it. Enhancement suggestions are tracked as GitHub issues. When creating an enhancement suggestion, please include:
+
+* A clear and descriptive title
+* A detailed description of the proposed enhancement
+* Examples of how the enhancement would be used
+* Any potential drawbacks or challenges
+
+### Pull Requests
+
+1. Fork the repo and create your branch from `main`
+2. If you've added code that should be tested, add tests
+3. If you've changed APIs, update the documentation
+4. Ensure the test suite passes
+5. Make sure your code follows the existing style
+6. Issue that pull request!
+
+## Development Process
+
+1. Create a new branch:
+   ```bash
+   git checkout -b feature/my-feature
+   # or
+   git checkout -b bugfix/my-bugfix
+   ```
+
+2. Make your changes and commit:
+   ```bash
+   git add .
+   git commit -m "Description of changes"
+   ```
+
+3. Push to your fork:
+   ```bash
+   git push origin feature/my-feature
+   ```
+
+### Style Guidelines
+
+- Follow the PEP 8 style guide for Python code
+- Use descriptive variable names
+- Comment your code when necessary
+- Keep functions focused and modular
+- Use type hints where possible
+
+### Testing
+
+- Write unit tests for new features
+- Ensure all tests pass before submitting a PR
+- Include both positive and negative test cases
+
+## Project Structure
+
+```
+transcript_transformer/
+├── src/
+│   ├── core/          # Core transformation logic
+│   ├── utils/         # Utility functions
+│   └── app.py         # Main application
+├── tests/             # Test files
+└── requirements.txt   # Project dependencies
+```
+
+## Getting Help
+
+If you need help, you can:
+- Open an issue with your question
+- Reach out to the maintainers
+- Check the documentation
+
+Thank you for contributing to AI LectureForge! 🎓✨
README.md CHANGED
@@ -1,12 +1,228 @@
 ---
-title: AI Script Generator
-emoji: 🐨
-colorFrom: yellow
-colorTo: blue
+title: AI_Script_Generator
+app_file: src/app.py
 sdk: gradio
-sdk_version: 5.18.0
-app_file: app.py
-pinned: false
+sdk_version: 5.13.1
 ---
+# 🎓 AI Script Generator
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](http://makeapullrequest.com)
+
+> Transform transcripts and PDFs into timed, structured teaching scripts using AI
+
+AI Script Generator is an AI system that converts PDF transcripts, raw text, and conversational content into well-structured teaching scripts. It extracts and analyzes the input content to create organized, pedagogically sound scripts with time markers. It is designed for educators, students, content creators, and anyone looking to turn information into clear explanations.
+
+## ✨ Features
+
+- 📄 PDF transcript and raw text processing
+- 🤖 AI-powered content transformation
+- 📚 Structured teaching script generation
+- 🔄 Coherent topic organization
+- 🔌 Support for multiple AI providers (Gemini/OpenAI)
+- ⏱️ Time-marked sections for pacing
+
+## Output Format
+
+The generated scripts follow a structured format:
+
+### Time Markers
+- Each section includes time markers (e.g., `[11:45]`) to help pace delivery
+- Customizable duration: from 2 to 60 minutes, with timing adjusted accordingly
+
+### Structure
+- Introduction with learning objectives
+- Time-marked content sections
+- Examples and practical applications
+- Interactive elements (questions, exercises)
+- Recap and key takeaways
+
+For example:
+```
+[00:00] Introduction to Topic
+- Learning objectives
+- Key concepts overview
+
+[11:45] Main Concept Explanation
+- Detailed explanation
+- Practical example
+- Student interaction point
+
+[23:30] Advanced Applications
+...
+```
+
+## 🚀 Quick Start
+
+### Prerequisites
+
+- Python 3.8 or higher
+- Virtual environment (recommended)
+- Gemini API key (or OpenAI API key)
+
+### Installation
+
+```bash
+# Clone the repository
+git clone https://github.com/RogelioRichmanAstronaut/AI-Script-Generator.git
+cd AI-Script-Generator
+
+# Create and activate virtual environment
+python -m venv venv
+source venv/bin/activate  # On Windows: .\venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Set up environment variables (choose one API key based on your preference)
+export GEMINI_API_KEY='your-gemini-api-key'  # Primary option
+# OR
+export OPENAI_API_KEY='your-openai-api-key'  # Alternative option
+
+# On Windows use:
+# set GEMINI_API_KEY=your-gemini-api-key
+# set OPENAI_API_KEY=your-openai-api-key
+```
+
+### Usage
+
+```bash
+# Run with the Python path set
+PYTHONPATH=$PYTHONPATH:. python src/app.py
+
+# Access the web interface
+# Open http://localhost:7860 in your browser
+```
+
+## 🛠️ Technical Approach
+
+### Prompt Engineering Strategy
+
+Our system uses a multi-stage prompting approach:
+
+1. **Content Analysis & Chunking**
+   - Smart text segmentation for handling large documents (9000+ words)
+   - Contextual overlap between chunks to maintain coherence
+   - Key topic and concept extraction from each segment
+
+2. **Structure Generation**
+   - Time-based sectioning (customizable from 2-60 minutes)
+   - Educational flow design with clear progression
+   - Integration of pedagogical elements (examples, exercises, questions)
+
+3. **Educational Enhancement**
+   - Transformation of casual content into a formal teaching script
+   - Addition of practical examples and case studies
+   - Integration of interaction points and reflection questions
+   - Time markers for pacing guidance
+
+4. **Coherence Validation**
+   - Cross-reference checking between sections
+   - Verification of topic flow and progression
+   - Consistency check for terminology and concepts
+   - Quality assessment of educational elements
+
+### Challenges & Solutions
+
+1. **Context Length Management**
+   - Challenge: Handling documents beyond model context limits
+   - Solution: Implemented sliding-window chunking with overlap
+   - Result: Successfully processes documents of 9000+ words, with room to extend further
+
+2. **Educational Structure**
+   - Challenge: Converting conversational text to a teaching format
+   - Solution:
+     - Structured templating system for different time formats (2-60 min)
+     - Integration of pedagogical elements (examples, exercises)
+     - Time-based sectioning with clear progression
+   - Result: Coherent, time-marked teaching scripts with interactive elements
+
+3. **Content Coherence**
+   - Challenge: Maintaining narrative flow across chunked content
+   - Solution:
+     - Contextual overlap between chunks
+     - Topic tracking across sections
+     - Cross-reference validation system
+   - Result: Seamless content flow with consistent terminology
+
+4. **Educational Quality**
+   - Challenge: Ensuring high pedagogical value
+   - Solution:
+     - Integration of learning objectives
+     - Strategic placement of examples and exercises
+     - Addition of reflection questions
+     - Time-appropriate pacing markers
+   - Result: Engaging, structured learning materials
+
+### Core Components
+
+1. **PDF Processing**: Extracts and cleans text from PDF transcripts
+2. **Text Processing**: Handles direct text input and cleans/structures it
+3. **Content Analysis**: Uses AI to understand and structure the content
+4. **Script Generation**: Transforms content into an educational format
+
+### Implementation Details
+
+1. **PDF/Text Handling**
+   - Robust PDF text extraction
+   - Raw text input processing
+   - Clean-up of extracted content
+
+2. **AI Processing**
+   - Integration with the Gemini API (primary)
+   - OpenAI API support (alternative)
+   - Structured prompt system for consistent output
+
+3. **Output Generation**
+   - Organized teaching scripts
+   - Clear section structure
+   - Learning points and key concepts
+
+### Architecture
+
+The system follows a modular design:
+
+- 📄 PDF/text processing module
+- 🔍 Text analysis component
+- 🤖 AI integration layer
+- 📝 Output formatting system
+
+## 🤝 Contributing
+
+Contributions are what make the open source community amazing! Any contributions you make are **greatly appreciated**.
+
+1. Fork the Project
+2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
+3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
+4. Push to the Branch (`git push origin feature/AmazingFeature`)
+5. Open a Pull Request
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines.
+
+## 📝 License
+
+Distributed under the MIT License. See `LICENSE` for more information.
+
+## 🌟 Acknowledgments
+
+- Thanks to all contributors who have helped shape AI Script Generator
+- Special thanks to the Gemini and OpenAI teams for their APIs
+- Inspired by educators and communicators worldwide who make learning engaging
+
+## 📧 Contact
+
+Project Link: [https://github.com/RogelioRichmanAstronaut/AI-Script-Generator](https://github.com/RogelioRichmanAstronaut/AI-Script-Generator)
+
+## 🔮 Roadmap
+
+- [ ] Support for multiple output formats (PDF, PPTX)
+- [ ] Interactive elements generation
+- [ ] Custom templating system
+- [ ] Multi-language support
+- [ ] Integration with LMS platforms
+
+---
+
+<p align="center">Made with ❤️ for educators, students, and communicators everywhere</p>
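The "sliding-window chunking with overlap" the README describes can be sketched roughly as follows. The 6000-word chunk size matches `CHUNK_SIZE` in `src/core/transformer.py`; the overlap size and the function name are illustrative assumptions, not part of the repository.

```python
from typing import List

def chunk_words(text: str, chunk_size: int = 6000, overlap: int = 200) -> List[str]:
    """Split text into word-based chunks, re-reading `overlap` words of
    context from the end of each chunk at the start of the next one."""
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    chunks = []
    step = chunk_size - overlap  # advance by less than a full chunk
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks
```

The overlap is what lets the structure-generation stage keep terminology consistent across chunk boundaries: each prompt sees the tail of the previous segment.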
data/sample2.pdf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5bf0997942205ed54293dd5b2a480b6d5efcc7d4146548dd68c20a9d7e3f7318
+size 155966
requirements.txt ADDED
@@ -0,0 +1,9 @@
+gradio>=4.0.0
+transformers>=4.30.0
+torch>=2.0.0
+pypdf2>=3.0.0
+python-dotenv>=0.19.0
+numpy>=1.21.0
+tqdm>=4.65.0
+openai>=1.0.0
+tiktoken>=0.5.0
setup.py ADDED
@@ -0,0 +1,13 @@
+from setuptools import setup, find_packages
+
+setup(
+    name="transcript_transformer",
+    version="0.1.0",
+    packages=find_packages(),
+    install_requires=[
+        line.strip()
+        for line in open("requirements.txt")
+        if line.strip() and not line.startswith("#")
+    ],
+    python_requires=">=3.8",
+)
src/__init__.py ADDED
File without changes
src/app.py ADDED
@@ -0,0 +1,177 @@
+import os
+import gradio as gr
+from dotenv import load_dotenv
+from src.core.transformer import TranscriptTransformer
+from src.utils.pdf_processor import PDFProcessor
+from src.utils.text_processor import TextProcessor
+
+load_dotenv()
+
+class TranscriptTransformerApp:
+    def __init__(self):
+        self.pdf_processor = PDFProcessor()
+        self.text_processor = TextProcessor()
+
+    def process_transcript(self,
+                           input_type: str,
+                           file_obj: gr.File = None,
+                           raw_text_input: str = "",
+                           initial_prompt: str = "",
+                           target_duration: int = 30,
+                           include_examples: bool = True,
+                           use_thinking_model: bool = False,
+                           use_gemini: bool = True) -> str:
+        """
+        Process an uploaded transcript and transform it into a teaching transcript
+
+        Args:
+            input_type: Type of input (PDF or Raw Text)
+            file_obj: Uploaded PDF file (if input_type is PDF)
+            raw_text_input: Raw text input (if input_type is Raw Text)
+            initial_prompt: Additional guiding instructions for the content generation
+            target_duration: Target lecture duration in minutes
+            include_examples: Whether to include practical examples
+            use_thinking_model: Use the experimental thinking model (requires Gemini)
+            use_gemini: Whether to use the Gemini API instead of OpenAI
+
+        Note: the Gradio click handler passes its seven inputs positionally, so
+        use_thinking_model must come before use_gemini in this signature.
+
+        Returns:
+            str: Generated teaching transcript
+        """
+        try:
+            # Force-enable Gemini if the thinking model is selected
+            if use_thinking_model:
+                use_gemini = True
+
+            self.transformer = TranscriptTransformer(
+                use_gemini=use_gemini,
+                use_thinking_model=use_thinking_model
+            )
+
+            # Get text based on input type
+            if input_type == "PDF":
+                if file_obj is None:
+                    return "Error: No PDF file uploaded"
+                raw_text = self.pdf_processor.extract_text(file_obj.name)
+            else:  # Raw Text
+                if not raw_text_input.strip():
+                    return "Error: No text provided"
+                raw_text = raw_text_input
+
+            # Transform to teaching transcript with user guidance
+            lecture_transcript = self.transformer.transform_to_lecture(
+                text=raw_text,
+                target_duration=target_duration,
+                include_examples=include_examples,
+                initial_prompt=initial_prompt
+            )
+
+            return lecture_transcript
+
+        except Exception as e:
+            return f"Error processing transcript: {str(e)}"
+
+    def launch(self):
+        """Launch the Gradio interface"""
+        # Get the path to the example PDF
+        example_pdf = os.path.join(os.path.dirname(os.path.dirname(__file__)), "data", "sample2.pdf")
+
+        with gr.Blocks(title="AI Script Generator") as interface:
+            gr.Markdown("# AI Script Generator")
+            gr.Markdown("Transform transcripts and PDFs into timed, structured teaching scripts using AI")
+
+            with gr.Row():
+                input_type = gr.Radio(
+                    choices=["PDF", "Raw Text"],
+                    label="Input Type",
+                    value="PDF"
+                )
+
+            with gr.Row():
+                with gr.Column(visible=True) as pdf_column:
+                    file_input = gr.File(
+                        label="Upload Transcript (PDF)",
+                        file_types=[".pdf"]
+                    )
+
+                with gr.Column(visible=False) as text_column:
+                    text_input = gr.Textbox(
+                        label="Paste Transcript Text",
+                        lines=10,
+                        placeholder="Paste your transcript text here..."
+                    )
+
+            with gr.Row():
+                initial_prompt = gr.Textbox(
+                    label="Guiding Prompt (Optional)",
+                    lines=3,
+                    value="",
+                    placeholder="Additional instructions to customize the output. Examples: 'Use a more informal tone', 'Focus only on section X', 'Generate the content in Spanish', 'Include more practical programming examples', etc.",
+                    info="The Guiding Prompt lets you provide specific instructions to modify the generated content, including the output language. Use it to change the tone or style, focus only on specific sections of the text, specify the output language (e.g., 'Generate in Spanish/French/German'), or give any other instruction that personalizes the final result."
+                )
+
+            with gr.Row():
+                target_duration = gr.Number(
+                    label="Target Lecture Duration (minutes)",
+                    value=30,
+                    minimum=2,
+                    maximum=60,
+                    step=1
+                )
+
+                include_examples = gr.Checkbox(
+                    label="Include Practical Examples",
+                    value=True
+                )
+
+                use_thinking_model = gr.Checkbox(
+                    label="Use Experimental Thinking Model (Gemini Only)",
+                    value=True
+                )
+
+            with gr.Row():
+                submit_btn = gr.Button("Transform Transcript")
+
+            output = gr.Textbox(
+                label="Generated Teaching Transcript",
+                lines=25
+            )
+
+            # Handle visibility of input columns based on selection
+            def update_input_visibility(choice):
+                return [
+                    gr.update(visible=(choice == "PDF")),      # pdf_column
+                    gr.update(visible=(choice == "Raw Text"))  # text_column
+                ]
+
+            input_type.change(
+                fn=update_input_visibility,
+                inputs=input_type,
+                outputs=[pdf_column, text_column]
+            )
+
+            # Set up submission logic (inputs bind positionally to process_transcript)
+            submit_btn.click(
+                fn=self.process_transcript,
+                inputs=[
+                    input_type,
+                    file_input,
+                    text_input,
+                    initial_prompt,
+                    target_duration,
+                    include_examples,
+                    use_thinking_model
+                ],
+                outputs=output
+            )
+
+            # Example for PDF input
+            gr.Examples(
+                examples=[[example_pdf, "", "", 30, True, True]],
+                inputs=[file_input, text_input, initial_prompt, target_duration, include_examples, use_thinking_model]
+            )
+
+        interface.launch(share=True)
+
+if __name__ == "__main__":
+    app = TranscriptTransformerApp()
+    app.launch()
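The `target_duration` the UI collects above becomes a word budget downstream: `src/core/transformer.py` multiplies it by a 130 words-per-minute speaking rate, allows a 95-105% band, and splits the total 10/70/15/5 across intro, main, practical, and summary sections. A minimal sketch of that arithmetic (using integer math for determinism, where the source uses `int(target * 0.95)`-style float truncation; the function name is ours):

```python
WORDS_PER_MINUTE = 130  # average speaking rate used by TranscriptTransformer

def word_budget(target_duration: int) -> dict:
    """Derive the word targets the transformer computes from a duration in minutes."""
    target = WORDS_PER_MINUTE * target_duration
    return {
        "target": target,
        "min": target * 95 // 100,    # 95% lower bound
        "max": target * 105 // 100,   # 105% upper bound
        "sections": {                 # intro/main/practical/summary split
            "intro": target * 10 // 100,
            "main": target * 70 // 100,
            "practical": target * 15 // 100,
            "summary": target * 5 // 100,
        },
    }
```

So the default 30-minute lecture targets 3,900 words, with the main body allotted 2,730 of them.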
src/core/__init__.py ADDED
File without changes
src/core/transformer.py ADDED
@@ -0,0 +1,580 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import logging
3
+ import json
4
+ from typing import List, Dict, Optional
5
+ import openai
6
+ from src.utils.text_processor import TextProcessor
7
+
8
+ # Configure logging
9
+ logging.basicConfig(level=logging.INFO)
10
+ logger = logging.getLogger(__name__)
11
+
12
+ class WordCountError(Exception):
13
+ """Raised when word count requirements are not met"""
14
+ pass
15
+
16
+ class TranscriptTransformer:
17
+ """Transforms conversational transcripts into teaching material using LLM"""
18
+
19
+ MAX_RETRIES = 3 # Maximum retries for content generation
20
+ CHUNK_SIZE = 6000 # Target words per chunk
21
+ LARGE_DEVIATION_THRESHOLD = 0.20 # 20% maximum deviation
22
+ MAX_TOKENS = 64000 # Nuevo límite absoluto basado en 64k tokens de salida
23
+
24
+ def __init__(self, use_gemini: bool = True, use_thinking_model: bool = False):
25
+ """Initialize the transformer with selected LLM client"""
26
+ self.text_processor = TextProcessor()
27
+ self.use_gemini = use_gemini
28
+ self.use_thinking_model = use_thinking_model
29
+
30
+ if use_thinking_model:
31
+ if not use_gemini:
32
+ raise ValueError("Thinking model requires use_gemini=True")
33
+
34
+ logger.info("Initializing with Gemini Flash Thinking API")
35
+ self.openai_client = openai.OpenAI(
36
+ api_key=os.getenv('GEMINI_API_KEY'),
37
+ base_url="https://generativelanguage.googleapis.com/v1alpha"
38
+ )
39
+ self.model_name = "gemini-2.0-flash-thinking-exp-01-21"
40
+ elif use_gemini:
41
+ logger.info("Initializing with Gemini API")
42
+ self.openai_client = openai.OpenAI(
43
+ api_key=os.getenv('GEMINI_API_KEY'),
44
+ base_url="https://generativelanguage.googleapis.com/v1beta"
45
+ )
46
+ self.model_name = "gemini-2.0-flash-exp"
47
+ else:
48
+ logger.info("Initializing with OpenAI API")
49
+ self.openai_client = openai.OpenAI(
50
+ api_key=os.getenv('OPENAI_API_KEY')
51
+ )
52
+ self.model_name = "gpt-3.5-turbo"
53
+
54
+ # Target word counts
55
+ self.words_per_minute = 130 # Average speaking rate
56
+
57
+ def _validate_word_count(self, total_words: int, target_words: int, min_words: int, max_words: int) -> None:
58
+ """Validate word count with flexible thresholds and log warnings/errors"""
59
+ deviation = abs(total_words - target_words) / target_words
60
+
61
+ if deviation > self.LARGE_DEVIATION_THRESHOLD:
62
+ logger.error(
63
+ f"Word count {total_words} significantly outside target range "
64
+ f"({min_words}-{max_words}). Deviation: {deviation:.2%}"
65
+ )
66
+ elif total_words < min_words or total_words > max_words:
67
+ logger.warning(
68
+ f"Word count {total_words} slightly outside target range "
69
+ f"({min_words}-{max_words}). Deviation: {deviation:.2%}"
70
+ )
71
+
72
+ def transform_to_lecture(self,
73
+ text: str,
74
+ target_duration: int = 30,
75
+ include_examples: bool = True,
76
+ initial_prompt: Optional[str] = None) -> str:
77
+ """
78
+ Transform input text into a structured teaching transcript
79
+
80
+ Args:
81
+ text: Input transcript text
82
+ target_duration: Target lecture duration in minutes
83
+ include_examples: Whether to include practical examples
84
+ initial_prompt: Additional user instructions to guide the generation
85
+
86
+ Returns:
87
+ str: Generated teaching transcript, regardless of word count validation
88
+ """
89
+ logger.info(f"Starting transformation for {target_duration} minute lecture")
90
+
91
+ # Clean and preprocess text
92
+ cleaned_text = self.text_processor.clean_text(text)
93
+ input_words = self.text_processor.count_words(cleaned_text)
94
+ logger.info(f"Input text cleaned. Word count: {input_words}")
95
+
96
+ # Calculate target word count
97
+ target_words = self.words_per_minute * target_duration
98
+ min_words = int(target_words * 0.95) # Minimum 95% of target
99
+ max_words = int(target_words * 1.05) # Maximum 105% of target
100
+
101
+ logger.info(f"Target word count: {target_words} (min: {min_words}, max: {max_words})")
102
+
103
+ # Generate detailed lecture structure with topics
104
+ structure_data = self._generate_detailed_structure(
105
+ text=cleaned_text,
106
+ target_duration=target_duration,
107
+ initial_prompt=initial_prompt
108
+ )
109
+ logger.info("Detailed lecture structure generated")
110
+ logger.info(f"Topics identified: {[t['title'] for t in structure_data['topics']]}")
111
+
112
+ # Calculate section word counts
113
+ section_words = {
114
+ 'intro': int(target_words * 0.1),
115
+ 'main': int(target_words * 0.7),
116
+ 'practical': int(target_words * 0.15),
117
+ 'summary': int(target_words * 0.05)
118
+ }
119
+
120
+ try:
121
+ logger.info("Generating content by sections with topic tracking")
122
+
123
+ # Introduction with learning objectives and topic preview
124
+ intro = self._generate_section(
125
+ 'introduction',
126
+ structure_data,
127
+ cleaned_text,
128
+ section_words['intro'],
129
+ include_examples,
130
+ is_first=True,
131
+ initial_prompt=initial_prompt
132
+ )
133
+ intro_words = self.text_processor.count_words(intro)
134
+ logger.info(f"Introduction generated: {intro_words} words")
135
+
136
+ # Track context for coherence
137
+ context = {
138
+ 'current_section': 'introduction',
139
+ 'covered_topics': [],
140
+ 'pending_topics': [t['title'] for t in structure_data['topics']],
141
+ 'key_terms': set(),
142
+ 'current_narrative': intro[-1000:], # Last 1000 words for context
143
+ 'learning_objectives': structure_data['learning_objectives']
144
+ }
145
+
146
+ # Main content with topic progression
147
+ main_content = self._generate_main_content(
148
+ structure_data,
149
+ cleaned_text,
150
+ section_words['main'],
151
+ include_examples,
152
+ context,
153
+ initial_prompt=initial_prompt
154
+ )
155
+ main_words = self.text_processor.count_words(main_content)
156
+ logger.info(f"Main content generated: {main_words} words")
157
+
158
+ # Update context after main content
159
+ context['current_section'] = 'main'
160
+ context['current_narrative'] = main_content[-1000:]
161
+
162
+ # Practical applications tied to main topics
163
+ practical = self._generate_section(
164
+ 'practical',
165
+ structure_data,
166
+ cleaned_text,
167
+ section_words['practical'],
168
+ include_examples,
169
+ context=context,
170
+ initial_prompt=initial_prompt
171
+ )
172
+ practical_words = self.text_processor.count_words(practical)
173
+ logger.info(f"Practical section generated: {practical_words} words")
174
+
175
+ # Update context for summary
176
+ context['current_section'] = 'practical'
177
+ context['current_narrative'] = practical[-500:]
178
+
179
+ # Summary with topic reinforcement
180
+ summary = self._generate_section(
181
+ 'summary',
182
+ structure_data,
183
+ cleaned_text,
184
+ section_words['summary'],
185
+ include_examples,
186
+ is_last=True,
187
+ context=context,
188
+ initial_prompt=initial_prompt
189
+ )
190
+ summary_words = self.text_processor.count_words(summary)
191
+ logger.info(f"Summary generated: {summary_words} words")
192
+
193
+ # Combine all sections
194
+ full_content = f"{intro}\n\n{main_content}\n\n{practical}\n\n{summary}"
195
+ total_words = self.text_processor.count_words(full_content)
196
+ logger.info(f"Total content generated: {total_words} words")
197
+
198
+ # Log warnings/errors but don't raise exceptions
199
+ self._validate_word_count(total_words, target_words, min_words, max_words)
200
+
201
+ # Validate coherence
202
+ self._validate_coherence(full_content, structure_data)
203
+ logger.info("Content coherence validated")
204
+
205
+ return full_content
206
+
207
+ except Exception as e:
208
+ logger.error(f"Error during content generation: {str(e)}")
209
+ # If we have partial content, return it
210
+ if 'full_content' in locals():
211
+ logger.warning("Returning partial content despite errors")
212
+ return full_content
213
+ raise # Re-raise only if we have no content at all
214
+
215
+     def _generate_detailed_structure(self,
+                                      text: str,
+                                      target_duration: int,
+                                      initial_prompt: Optional[str] = None) -> Dict:
+         """Generate detailed lecture structure with topics and objectives"""
+         logger.info("Generating detailed lecture structure")
+
+         user_instructions = f"\nAdditional user instructions:\n{initial_prompt}\n" if initial_prompt else ""
+
+         prompt = f"""
+ You are an expert educator creating a detailed lecture outline.
+ {user_instructions}
+ Analyze this transcript and create a structured JSON output with the following:
+
+ 1. Title of the lecture
+ 2. 3-5 clear learning objectives
+ 3. 3-4 main topics, each with:
+    - Title
+    - Key concepts
+    - Subtopics
+    - Time allocation (in minutes)
+    - Connection to learning objectives
+ 4. Practical application ideas
+ 5. Key terms to track
+
+ IMPORTANT: Response MUST be valid JSON. Format exactly like this, with no additional text:
+ {{
+     "title": "string",
+     "learning_objectives": ["string"],
+     "topics": [
+         {{
+             "title": "string",
+             "key_concepts": ["string"],
+             "subtopics": ["string"],
+             "duration_minutes": number,
+             "objective_links": [number]
+         }}
+     ],
+     "practical_applications": ["string"],
+     "key_terms": ["string"]
+ }}
+
+ Target duration: {target_duration} minutes
+
+ Transcript excerpt:
+ {text[:2000]}
+ """
+
+         try:
+             # Common parameters
+             params = {
+                 "model": self.model_name,
+                 "messages": [
+                     {"role": "system", "content": "You are an expert educator. Output ONLY valid JSON, no other text."},
+                     {"role": "user", "content": prompt}
+                 ],
+                 "temperature": 0.7,
+                 "max_tokens": self.MAX_TOKENS if self.use_thinking_model else 4000
+             }
+
+             # Add thinking config if using experimental model
+             if self.use_thinking_model:
+                 params["extra_body"] = {
+                     "thinking_config": {
+                         "include_thoughts": True
+                     }
+                 }
+
+             response = self.openai_client.chat.completions.create(**params)
+             content = response.choices[0].message.content.strip()
+             logger.debug(f"Raw structure response: {content}")
+
+             try:
+                 structure_data = json.loads(content)
+                 logger.info("Structure data parsed successfully")
+                 return structure_data
+             except json.JSONDecodeError as e:
+                 logger.warning(f"Failed to parse JSON directly: {str(e)}")
+
+                 # Try to extract JSON if it's wrapped in other text
+                 import re
+                 json_match = re.search(r'({[\s\S]*})', content)
+                 if json_match:
+                     try:
+                         structure_data = json.loads(json_match.group(1))
+                         logger.info("Structure data extracted and parsed successfully")
+                         return structure_data
+                     except json.JSONDecodeError:
+                         logger.warning("Failed to parse extracted JSON")
+
+                 # If both attempts fail, use fallback structure
+                 logger.warning("Using fallback structure")
+                 return self._generate_fallback_structure(text, target_duration)
+
+         except Exception as e:
+             logger.error(f"Error generating structure: {str(e)}")
+             return self._generate_fallback_structure(text, target_duration)
+
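The two-step parsing strategy above (direct `json.loads`, then a brace-delimited regex rescue) can be sketched in isolation; `extract_json` is a hypothetical helper name, not part of the module:

```python
import json
import re
from typing import Optional


def extract_json(content: str) -> Optional[dict]:
    """Parse a JSON object, tolerating surrounding prose (same fallback as above)."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Grab everything from the first '{' to the last '}' and retry
        match = re.search(r'({[\s\S]*})', content)
        if match:
            try:
                return json.loads(match.group(1))
            except json.JSONDecodeError:
                pass
    return None
```

Because the regex is greedy, it handles model chatter before and after the object, but it will still fail if the reply contains two separate JSON objects with prose between them — hence the structured fallback that follows.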
+     def _generate_fallback_structure(self, text: str, target_duration: int) -> Dict:
+         """Generate a basic fallback structure when JSON parsing fails"""
+         logger.info("Generating fallback structure")
+
+         # Generate a simpler structure prompt
+         prompt = f"""
+ Analyze this text and provide:
+ 1. A title (one line)
+ 2. Three learning objectives (one per line)
+ 3. Three main topics (one per line)
+ 4. Three key terms (one per line)
+
+ Text: {text[:1000]}
+ """
+
+         try:
+             response = self.openai_client.chat.completions.create(
+                 model=self.model_name,
+                 messages=[
+                     {"role": "system", "content": "You are an expert educator. Provide concise, line-by-line responses."},
+                     {"role": "user", "content": prompt}
+                 ],
+                 temperature=0.7,
+                 max_tokens=1000
+             )
+
+             lines = response.choices[0].message.content.strip().split('\n')
+             lines = [line.strip() for line in lines if line.strip()]
+
+             # Extract components from lines
+             title = lines[0] if lines else "Lecture"
+             objectives = [obj for obj in lines[1:4] if obj][:3]
+             topics = [topic for topic in lines[4:7] if topic][:3]
+             terms = [term for term in lines[7:10] if term][:3]
+
+             # Calculate minutes per topic
+             main_time = int(target_duration * 0.7)  # 70% for main content
+             topic_minutes = main_time // len(topics) if topics else main_time
+
+             # Create fallback structure
+             return {
+                 "title": title,
+                 "learning_objectives": objectives,
+                 "topics": [
+                     {
+                         "title": topic,
+                         "key_concepts": [topic],  # Use topic as key concept
+                         "subtopics": ["Overview", "Details", "Examples"],
+                         "duration_minutes": topic_minutes,
+                         "objective_links": [1]  # Link to first objective
+                     }
+                     for topic in topics
+                 ],
+                 "practical_applications": [
+                     "Real-world application example",
+                     "Interactive exercise",
+                     "Case study"
+                 ],
+                 "key_terms": terms
+             }
+
+         except Exception as e:
+             logger.error(f"Error generating fallback structure: {str(e)}")
+             # Return minimal valid structure
+             return {
+                 "title": "Lecture Overview",
+                 "learning_objectives": ["Understand key concepts", "Apply knowledge", "Analyze examples"],
+                 "topics": [
+                     {
+                         "title": "Main Topic",
+                         "key_concepts": ["Core concept"],
+                         "subtopics": ["Overview"],
+                         "duration_minutes": target_duration // 2,
+                         "objective_links": [1]
+                     }
+                 ],
+                 "practical_applications": ["Practical example"],
+                 "key_terms": ["Key term"]
+             }
+
+     def _generate_section(self,
+                           section_type: str,
+                           structure_data: Dict,
+                           original_text: str,
+                           target_words: int,
+                           include_examples: bool,
+                           context: Optional[Dict] = None,
+                           is_first: bool = False,
+                           is_last: bool = False,
+                           initial_prompt: Optional[str] = None) -> str:
+         """Generate content for a specific section with coherence tracking"""
+         logger.info(f"Generating {section_type} section (target: {target_words} words)")
+
+         user_instructions = f"\nUser's guiding instructions:\n{initial_prompt}\n" if initial_prompt else ""
+
+         # Base prompt with structure
+         prompt = f"""
+ You are an expert educator creating a detailed lecture transcript.
+ {user_instructions}
+ Generate the {section_type} section with EXACTLY {target_words} words.
+
+ Lecture Title: {structure_data['title']}
+ Learning Objectives: {', '.join(structure_data['learning_objectives'])}
+
+ Current section purpose:
+ """
+
+         # Add section-specific guidance
+         if section_type == 'introduction':
+             prompt += """
+ - Start with an engaging hook
+ - Present clear learning objectives
+ - Preview main topics
+ - Set expectations for the lecture
+ """
+         elif section_type == 'main':
+             prompt += f"""
+ - Cover these topics: {', '.join(t['title'] for t in structure_data['topics'])}
+ - Build progressively on concepts
+ - Include clear transitions
+ - Reference previous concepts
+ """
+         elif section_type == 'practical':
+             prompt += """
+ - Apply concepts to real-world scenarios
+ - Connect to previous topics
+ - Include interactive elements
+ - Reinforce key learning points
+ """
+         elif section_type == 'summary':
+             prompt += """
+ - Reinforce key takeaways
+ - Connect back to objectives
+ - Provide next steps
+ - End with a strong conclusion
+ """
+
+         # Add context if available
+         if context:
+             prompt += f"""
+
+ Context:
+ - Covered topics: {', '.join(context['covered_topics'])}
+ - Pending topics: {', '.join(context['pending_topics'])}
+ - Key terms used: {', '.join(context['key_terms'])}
+ - Recent narrative: {context['current_narrative']}
+ """
+
+         # Add requirements
+         prompt += f"""
+
+ Requirements:
+ 1. STRICT word count: Generate EXACTLY {target_words} words
+ 2. Include practical examples: {include_examples}
+ 3. Use clear transitions
+ 4. Include engagement points
+ 5. Use time markers [MM:SS]
+ 6. Reference specific content from transcript
+ 7. Maintain narrative flow
+ 8. Use key terms consistently
+ """
+
+         response = self.openai_client.chat.completions.create(
+             model=self.model_name,
+             messages=[
+                 {"role": "system", "content": "You are an expert educator creating a coherent lecture transcript."},
+                 {"role": "user", "content": prompt}
+             ],
+             temperature=0.7,
+             max_tokens=self._calculate_max_tokens(section_type, target_words)
+         )
+
+         content = response.choices[0].message.content
+         word_count = self.text_processor.count_words(content)
+         logger.info(f"Section generated: {word_count} words")
+
+         return content
+
+     def _calculate_max_tokens(self, section_type: str, target_words: int) -> int:
+         """Calculate appropriate max_tokens based on section and model"""
+         # 1 token ≈ 4 characters (1 word ≈ 1.33 tokens)
+         base_tokens = int(target_words * 1.5)  # Margin for formatting
+
+         if self.use_thinking_model:
+             # Allows up to 64k tokens, but caps each section
+             section_limits = {
+                 'introduction': 8000,
+                 'main': 32000,
+                 'practical': 16000,
+                 'summary': 8000
+             }
+             return min(base_tokens * 2, section_limits.get(section_type, 16000))
+
+         # Limits for other models
+         return min(base_tokens + 1000, self.MAX_TOKENS)
+
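The token budgeting above can be exercised as a standalone sketch; `max_tokens_for` is a hypothetical free-function version, and the `MAX_TOKENS` constant stands in for the class attribute, which is not shown in this diff:

```python
MAX_TOKENS = 4000  # assumed fallback cap; the real class attribute is defined elsewhere


def max_tokens_for(section_type: str, target_words: int, use_thinking_model: bool) -> int:
    """Mirror of _calculate_max_tokens: ~1.5 tokens/word with per-section caps."""
    base_tokens = int(target_words * 1.5)  # headroom over the ~1.33 tokens/word average
    if use_thinking_model:
        # Thinking models allow far more tokens, but each section is still capped
        section_limits = {'introduction': 8000, 'main': 32000,
                          'practical': 16000, 'summary': 8000}
        return min(base_tokens * 2, section_limits.get(section_type, 16000))
    return min(base_tokens + 1000, MAX_TOKENS)
```

For example, a 2,000-word main section on the thinking model budgets 6,000 tokens, while a 20,000-word request is clipped to the 32,000-token section cap.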
+     def _generate_main_content(self,
+                                structure_data: Dict,
+                                original_text: str,
+                                target_words: int,
+                                include_examples: bool,
+                                context: Dict,
+                                initial_prompt: Optional[str] = None) -> str:
+         """Generate main content with topic progression"""
+         logger.info(f"Generating main content (target: {target_words} words)")
+
+         # Calculate words per topic based on their duration ratios
+         total_duration = sum(t['duration_minutes'] for t in structure_data['topics'])
+         # Avoid division by zero
+         total_duration = total_duration if total_duration > 0 else 1
+
+         topic_words = {}
+
+         for topic in structure_data['topics']:
+             ratio = topic['duration_minutes'] / total_duration
+             topic_words[topic['title']] = int(target_words * ratio)
+
+         logger.info(f"Topic word allocations: {topic_words}")
+
+         # Generate content for each topic
+         topic_contents = []
+
+         for topic in structure_data['topics']:
+             topic_target = topic_words[topic['title']]
+
+             # Update context for topic
+             context['current_topic'] = topic['title']
+             if topic['title'] in context['pending_topics']:
+                 context['covered_topics'].append(topic['title'])
+                 context['pending_topics'].remove(topic['title'])
+             context['key_terms'].update(topic['key_concepts'])
+
+             # Generate topic content
+             topic_content = self._generate_section(
+                 f"main_topic_{topic['title']}",
+                 structure_data,
+                 original_text,
+                 topic_target,
+                 include_examples,
+                 context=context,
+                 initial_prompt=initial_prompt
+             )
+
+             topic_contents.append(topic_content)
+             context['current_narrative'] = topic_content[-1000:]
+
+         return "\n\n".join(topic_contents)
+
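The proportional word budgeting used above reduces to a few lines; `allocate_words` is a hypothetical standalone version, and the topic dictionaries follow the structure JSON's shape with illustrative numbers:

```python
def allocate_words(topics: list, target_words: int) -> dict:
    """Split target_words across topics in proportion to their planned minutes."""
    total = sum(t['duration_minutes'] for t in topics) or 1  # guard against zero
    return {t['title']: int(target_words * t['duration_minutes'] / total)
            for t in topics}


topics = [{'title': 'Basics', 'duration_minutes': 10},
          {'title': 'Advanced', 'duration_minutes': 20}]
# allocate_words(topics, 3000) gives Basics a third of the budget, Advanced two thirds
```

Note that `int()` truncation means the per-topic allocations can sum to slightly less than `target_words`; the surrounding word-count validation only logs such deviations rather than raising.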
+     def _validate_coherence(self, content: str, structure_data: Dict):
+         """Validate content coherence against structure"""
+         logger.info("Validating content coherence")
+
+         # Check for learning objectives
+         for objective in structure_data['learning_objectives']:
+             if not any(term.lower() in content.lower() for term in objective.split()):
+                 logger.warning(f"Learning objective not well covered: {objective}")
+
+         # Check for key terms
+         for term in structure_data['key_terms']:
+             if content.lower().count(term.lower()) < 2:
+                 logger.warning(f"Key term underutilized: {term}")
+
+         # Check topic coverage
+         for topic in structure_data['topics']:
+             if not any(concept.lower() in content.lower() for concept in topic['key_concepts']):
+                 logger.warning(f"Topic concepts not well covered: {topic['title']}")
+
+         logger.info("Coherence validation complete")
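The key-term check above is plain case-insensitive substring counting; a minimal standalone version (the name `underused_terms` is hypothetical) makes the behavior easy to verify:

```python
def underused_terms(content: str, key_terms: list, min_mentions: int = 2) -> list:
    """Return key terms mentioned fewer than min_mentions times (case-insensitive)."""
    lowered = content.lower()
    return [term for term in key_terms if lowered.count(term.lower()) < min_mentions]
```

Because this is substring matching, "graphs" counts as a mention of "graph"; that leniency is usually desirable here, but a term like "OS" would also match inside unrelated words.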
src/utils/__init__.py ADDED
File without changes
src/utils/pdf_processor.py ADDED
@@ -0,0 +1,59 @@
+ import PyPDF2
+ from PyPDF2.errors import PdfReadError
+
+ class PDFProcessor:
+     """Handles PDF file processing and text extraction"""
+
+     def __init__(self):
+         """Initialize PDF processor"""
+         pass
+
+     def extract_text(self, pdf_path: str) -> str:
+         """
+         Extract text content from a PDF file
+
+         Args:
+             pdf_path: Path to the PDF file
+
+         Returns:
+             str: Extracted text content
+
+         Raises:
+             FileNotFoundError: If the PDF file doesn't exist
+             PdfReadError: If the PDF file is invalid or corrupted
+         """
+         try:
+             with open(pdf_path, 'rb') as file:
+                 # Create PDF reader object
+                 reader = PyPDF2.PdfReader(file)
+
+                 # Extract text from all pages
+                 text = ""
+                 for page in reader.pages:
+                     text += page.extract_text() + "\n"
+
+                 return text.strip()
+
+         except FileNotFoundError:
+             raise FileNotFoundError(f"PDF file not found: {pdf_path}")
+         except PdfReadError as e:
+             raise PdfReadError(f"Error reading PDF file: {str(e)}") from e
+         except Exception as e:
+             raise RuntimeError(f"Unexpected error processing PDF: {str(e)}") from e
+
+     def get_metadata(self, pdf_path: str) -> dict:
+         """
+         Extract metadata from PDF file
+
+         Args:
+             pdf_path: Path to the PDF file
+
+         Returns:
+             dict: PDF metadata
+         """
+         try:
+             with open(pdf_path, 'rb') as file:
+                 reader = PyPDF2.PdfReader(file)
+                 return reader.metadata
+         except Exception as e:
+             return {"error": str(e)}
src/utils/text_processor.py ADDED
@@ -0,0 +1,84 @@
+ import re
+ from typing import List, Optional
+
+ class TextProcessor:
+     """Handles text preprocessing and cleaning"""
+
+     def __init__(self):
+         """Initialize text processor"""
+         self.sentence_endings = r'[.!?]'
+         self.word_pattern = r'\b\w+\b'
+
+     def clean_text(self, text: str) -> str:
+         """
+         Clean and normalize text
+
+         Args:
+             text: Input text to clean
+
+         Returns:
+             str: Cleaned text
+         """
+         # Remove extra whitespace
+         text = ' '.join(text.split())
+
+         # Fix common OCR errors
+         text = self._fix_ocr_errors(text)
+
+         # Normalize punctuation
+         text = self._normalize_punctuation(text)
+
+         return text.strip()
+
+     def split_into_sections(self, text: str) -> List[str]:
+         """
+         Split text into logical sections based on content
+
+         Args:
+             text: Input text to split
+
+         Returns:
+             List[str]: List of text sections
+         """
+         # Split on double newlines or section markers
+         sections = re.split(r'\n\s*\n|\n(?=[A-Z][^a-z]*:)', text)
+         return [s.strip() for s in sections if s.strip()]
+
+     def count_words(self, text: str) -> int:
+         """
+         Count words in text
+
+         Args:
+             text: Input text
+
+         Returns:
+             int: Word count
+         """
+         words = re.findall(self.word_pattern, text)
+         return len(words)
+
+     def _fix_ocr_errors(self, text: str) -> str:
+         """Fix common OCR errors"""
+         # Digit substitutions use lookarounds so they only apply inside words;
+         # standalone numbers are left untouched
+         replacements = {
+             r'[|]': 'I',                         # Vertical bar to I
+             r'(?<=[A-Za-z])0(?=[A-Za-z])': 'O',  # Zero to O only inside a word
+             r'(?<=[A-Za-z])1(?=[A-Za-z])': 'l',  # One to l only inside a word
+             r'\s+': ' '                          # Multiple spaces to single space
+         }
+
+         for pattern, replacement in replacements.items():
+             text = re.sub(pattern, replacement, text)
+         return text
+
+     def _normalize_punctuation(self, text: str) -> str:
+         """Normalize punctuation marks"""
+         # Replace multiple periods with single period
+         text = re.sub(r'\.{2,}', '.', text)
+
+         # Add space after punctuation if missing
+         text = re.sub(r'([.!?])([A-Z])', r'\1 \2', text)
+
+         # Fix spacing around punctuation
+         text = re.sub(r'\s+([.!?,])', r'\1', text)
+
+         return text
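A quick usage sketch of the helpers above, with the regexes copied from the class (the free-function names are hypothetical stand-ins for the methods):

```python
import re

WORD_PATTERN = r'\b\w+\b'  # same pattern the class stores in self.word_pattern


def count_words(text: str) -> int:
    """Count word tokens the same way TextProcessor.count_words does."""
    return len(re.findall(WORD_PATTERN, text))


def normalize_punctuation(text: str) -> str:
    """Apply the same three passes as TextProcessor._normalize_punctuation."""
    text = re.sub(r'\.{2,}', '.', text)               # collapse runs of periods
    text = re.sub(r'([.!?])([A-Z])', r'\1 \2', text)  # add space after sentence end
    text = re.sub(r'\s+([.!?,])', r'\1', text)        # drop space before punctuation
    return text
```

Note the pass ordering matters: collapsing `..` to `.` first means the space-insertion pass sees a single sentence-ending period, so `"Hello ..World"` comes out as `"Hello. World"`.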