# PDF2Audio: Technical Documentation

## Table of Contents

1. [Introduction](#introduction)
2. [Application Overview](#application-overview)
3. [File Structure](#file-structure)
4. [Core Components](#core-components)
   - [Data Models](#data-models)
   - [PDF Processing](#pdf-processing)
   - [Text Generation](#text-generation)
   - [Audio Generation](#audio-generation)
   - [Instruction Templates](#instruction-templates)
5. [User Interface](#user-interface)
   - [Main Layout](#main-layout)
   - [Input Controls](#input-controls)
   - [Output Display](#output-display)
   - [Editing Features](#editing-features)
6. [Workflow](#workflow)
7. [Key Functions](#key-functions)
8. [Integration Points](#integration-points)
9. [Customization Options](#customization-options)
10. [Conclusion](#conclusion)
## Introduction

PDF2Audio is a Gradio-based web application that converts PDF documents, markdown files, and text files into audio content, using OpenAI's GPT models for text generation and its text-to-speech (TTS) service for audio synthesis. Users upload documents, select from various instruction templates (podcast, lecture, summary, etc.), and customize the output with different voices and models.

This technical documentation explains the `app.py` file, which contains all of the PDF2Audio application's functionality. It is intended to help developers and designers understand the codebase and use it as a foundation for similar applications.

## Application Overview

PDF2Audio follows a straightforward workflow:

1. The user uploads one or more PDF, markdown, or text files
2. The user selects an instruction template and customizes settings
3. The application extracts text from the uploaded files
4. A large language model (LLM) processes the text according to the selected template
5. The generated dialogue is converted to audio using OpenAI's TTS service
6. The user can listen to the audio, view the transcript, edit it, and regenerate if needed

The application is built with the Gradio framework, which makes it easy to create web interfaces in Python. The backend uses OpenAI's API for both text generation and text-to-speech conversion.
## File Structure

The entire application is contained in a single `app.py` file, which includes:

- Import statements for required libraries
- Data model definitions
- Instruction templates for different output formats
- Core functionality for text extraction, dialogue generation, and audio synthesis
- Gradio UI components and layout
- Event handlers for user interactions
## Core Components

### Data Models

The application uses Pydantic models to structure the dialogue data:

```python
class DialogueItem(BaseModel):
    text: str
    speaker: Literal["speaker-1", "speaker-2"]


class Dialogue(BaseModel):
    scratchpad: str
    dialogue: List[DialogueItem]
```

These models ensure type safety and provide a structured way to handle dialogue data throughout the application.
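A quick round-trip shows what the models buy you: well-formed LLM output parses into typed objects, while an unknown speaker label is rejected immediately. This is a self-contained sketch (it redefines the two models locally) and assumes Pydantic v2, where `model_validate` is the parsing entry point:

```python
from typing import List, Literal

from pydantic import BaseModel, ValidationError


class DialogueItem(BaseModel):
    text: str
    speaker: Literal["speaker-1", "speaker-2"]


class Dialogue(BaseModel):
    scratchpad: str
    dialogue: List[DialogueItem]


# Valid data parses into typed objects.
dlg = Dialogue.model_validate({
    "scratchpad": "plan the episode",
    "dialogue": [
        {"speaker": "speaker-1", "text": "Welcome to the show."},
        {"speaker": "speaker-2", "text": "Thanks for having me."},
    ],
})
print(dlg.dialogue[0].speaker)  # speaker-1

# An unknown speaker label fails validation at parse time.
try:
    DialogueItem(text="hi", speaker="narrator")
except ValidationError:
    print("rejected")
```

Because the `speaker` field is a `Literal`, any label the LLM invents beyond the two expected speakers surfaces as a `ValidationError` rather than a silent downstream bug.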
### PDF Processing

PDF processing is handled with the pypdf library. The application extracts text from uploaded PDF files:

```python
if suffix == ".pdf":
    with file_path.open("rb") as f:
        reader = PdfReader(f)
        text = "\n\n".join(
            page.extract_text() for page in reader.pages if page.extract_text()
        )
    combined_text += text + "\n\n"
```
The application also supports markdown and plain text files:

```python
elif suffix in [".txt", ".md", ".mmd"]:
    with file_path.open("r", encoding="utf-8") as f:
        text = f.read()
    combined_text += text + "\n\n"
```
### Text Generation

Text generation is performed with OpenAI's GPT models through the `promptic` library's `llm` decorator. The application uses a custom `conditional_llm` wrapper to configure the LLM dynamically based on user selections:

```python
@conditional_llm(
    model=text_model,
    api_base=api_base,
    api_key=openai_api_key,
    reasoning_effort=reasoning_effort,
    do_web_search=do_web_search,
)
def generate_dialogue(text: str, intro_instructions: str, text_instructions: str,
                      scratch_pad_instructions: str, prelude_dialog: str,
                      podcast_dialog_instructions: str,
                      edited_transcript: str = None, user_feedback: str = None) -> Dialogue:
    # Function body contains the prompt template
    ...
```

The `generate_dialogue` function is also decorated with `@retry` to handle validation errors and retry the API call if necessary.
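The app uses Tenacity for this (see [Integration Points](#integration-points)); a stdlib-only sketch of what retry-on-`ValidationError` amounts to looks like the following. The decorator name and the local `ValidationError` stand-in are illustrative, not the app's actual code:

```python
import functools
import time


class ValidationError(Exception):
    """Stand-in for pydantic.ValidationError in this sketch."""


def retry_on_validation_error(max_attempts: int = 3, wait_seconds: float = 0.0):
    """Minimal stand-in for tenacity's @retry with retry_if_exception_type."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except ValidationError:
                    if attempt == max_attempts:
                        raise  # out of attempts: propagate the error
                    time.sleep(wait_seconds)
        return wrapper
    return decorator


calls = {"n": 0}

@retry_on_validation_error(max_attempts=3)
def flaky_generate():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValidationError("malformed dialogue")
    return "ok"

print(flaky_generate())  # ok (succeeds on the third attempt)
```

The point of retrying specifically on validation errors is that a malformed LLM response is usually transient: asking again often yields output that parses cleanly into the `Dialogue` model.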
### Audio Generation

Audio generation is handled by OpenAI's TTS API through the `get_mp3` function:

```python
def get_mp3(text: str, voice: str, audio_model: str, api_key: str = None,
            speaker_instructions: str = 'Speak in an emotive and friendly tone.') -> bytes:
    client = OpenAI(
        api_key=api_key or os.getenv("OPENAI_API_KEY"),
    )
    with client.audio.speech.with_streaming_response.create(
        model=audio_model,
        voice=voice,
        input=text,
        instructions=speaker_instructions,
    ) as response:
        with io.BytesIO() as file:
            for chunk in response.iter_bytes():
                file.write(chunk)
            return file.getvalue()
```
The application uses `concurrent.futures.ThreadPoolExecutor` to parallelize audio generation across dialogue lines, improving performance:

```python
with cf.ThreadPoolExecutor() as executor:
    futures = []
    for line in llm_output.dialogue:
        transcript_line = f"{line.speaker}: {line.text}"
        voice = speaker_1_voice if line.speaker == "speaker-1" else speaker_2_voice
        speaker_instructions = speaker_1_instructions if line.speaker == "speaker-1" else speaker_2_instructions
        future = executor.submit(get_mp3, line.text, voice, audio_model, openai_api_key, speaker_instructions)
        futures.append((future, transcript_line))
        characters += len(line.text)

    for future, transcript_line in futures:
        audio_chunk = future.result()
        audio += audio_chunk
        transcript += transcript_line + "\n\n"
```
### Instruction Templates

The application includes a comprehensive set of instruction templates for different output formats:

```python
INSTRUCTION_TEMPLATES = {
    "podcast": { ... },
    "deep research analysis": { ... },
    "clean rendering": { ... },
    "SciAgents material discovery summary": { ... },
    "lecture": { ... },
    "summary": { ... },
    "short summary": { ... },
    "podcast (French)": { ... },
    "podcast (German)": { ... },
    "podcast (Spanish)": { ... },
    "podcast (Portuguese)": { ... },
    "podcast (Hindi)": { ... },
    "podcast (Chinese)": { ... },
}
```

Each template contains five key components:

1. `intro`: High-level task description
2. `text_instructions`: How to process the input text
3. `scratch_pad`: Hidden brainstorming area for the model
4. `prelude`: Introduction to the main output
5. `dialog`: Main output instructions

These templates guide the LLM in generating content appropriate to the selected format.
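The five-field structure means selecting a template is just a dictionary lookup that fans out into the five instruction textboxes. A hypothetical sketch of that lookup (the prompt strings here are invented placeholders; the real templates are long prompts elided as `{ ... }` above):

```python
INSTRUCTION_TEMPLATES = {
    "podcast": {
        "intro": "Your task is to turn the input text into a podcast dialogue.",
        "text_instructions": "Read the text carefully and note the key ideas.",
        "scratch_pad": "Brainstorm angles and analogies here (not shown to users).",
        "prelude": "Now produce the final dialogue.",
        "dialog": "Write a natural two-speaker conversation covering the key ideas.",
    },
}


def update_instructions(template_name: str) -> tuple[str, str, str, str, str]:
    """Return the five instruction fields for the selected template."""
    t = INSTRUCTION_TEMPLATES[template_name]
    return t["intro"], t["text_instructions"], t["scratch_pad"], t["prelude"], t["dialog"]


fields = update_instructions("podcast")
print(len(fields))  # 5
```

Adding a new output format is then just a matter of adding one more five-field entry to the dictionary; no other code changes are required.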
## User Interface

### Main Layout

The Gradio UI is structured with a clean, responsive layout:

```python
with gr.Blocks(title="PDF to Audio", css="""
    #header {
        display: flex;
        align-items: center;
        justify-content: space-between;
        padding: 20px;
        background-color: transparent;
        border-bottom: 1px solid #ddd;
    }
    /* Additional CSS styles */
""") as demo:
    cached_dialogue = gr.State()

    with gr.Row(elem_id="header"):
        # Header content
        ...
    with gr.Row(elem_id="main_container"):
        with gr.Column(scale=2):
            # Input controls
            ...
        with gr.Column(scale=3):
            # Template selection and customization
            # Output components
            # Editing features
            ...
```
### Input Controls

The left column contains input controls for file upload and model selection:

```python
files = gr.Files(label="PDFs (.pdf), markdown (.md, .mmd), or text files (.txt)",
                 file_types=[".pdf", ".PDF", ".md", ".mmd", ".txt"])
openai_api_key = gr.Textbox(
    label="OpenAI API Key",
    visible=True,
    placeholder="Enter your OpenAI API Key here...",
    type="password"
)
text_model = gr.Dropdown(
    label="Text Generation Model",
    choices=STANDARD_TEXT_MODELS,
    value="o3-mini",
    info="Select the model to generate the dialogue text."
)
# Additional input controls for audio model, voices, etc.
```

### Output Display

The application provides several output components:

```python
audio_output = gr.Audio(label="Audio", format="mp3", interactive=False, autoplay=False)
transcript_output = gr.Textbox(label="Transcript", lines=25, show_copy_button=True)
original_text_output = gr.Textbox(label="Original Text", lines=10, visible=False)
error_output = gr.Textbox(visible=False)  # Hidden textbox to store error messages
```
### Editing Features

The application includes several features for editing and regenerating content:

1. **Transcript Editing**:

   ```python
   use_edited_transcript = gr.Checkbox(label="Use Edited Transcript", value=False)
   edited_transcript = gr.Textbox(label="Edit Transcript Here", lines=20, visible=False,
                                  show_copy_button=True, interactive=False)
   ```

2. **Line-by-Line Editing**:

   ```python
   with gr.Accordion("Edit dialogue line‑by‑line", open=False) as editor_box:
       df_editor = gr.Dataframe(
           headers=["Speaker", "Line"],
           datatype=["str", "str"],
           wrap=True,
           interactive=True,
           row_count=(1, "dynamic"),
           col_count=(2, "fixed"),
       )
   ```

3. **User Feedback**:

   ```python
   user_feedback = gr.Textbox(label="Provide Feedback or Notes", lines=10)
   ```
## Workflow

The application workflow is managed through event handlers that connect UI components to backend functions:

1. **Template Selection**:

   ```python
   template_dropdown.change(
       fn=update_instructions,
       inputs=[template_dropdown],
       outputs=[intro_instructions, text_instructions, scratch_pad_instructions,
                prelude_dialog, podcast_dialog_instructions]
   )
   ```

2. **Generate Audio**:

   ```python
   submit_btn.click(
       fn=validate_and_generate_audio,
       inputs=[
           files, openai_api_key, text_model, reasoning_effort, do_web_search, audio_model,
           speaker_1_voice, speaker_2_voice, speaker_1_instructions, speaker_2_instructions,
           api_base, intro_instructions, text_instructions, scratch_pad_instructions,
           prelude_dialog, podcast_dialog_instructions, edited_transcript, user_feedback,
       ],
       outputs=[audio_output, transcript_output, original_text_output, error_output, cached_dialogue]
   )
   ```

3. **Regenerate with Edits**:

   ```python
   regenerate_btn.click(
       fn=lambda use_edit, edit, *args: validate_and_generate_audio(
           *args[:12],               # All inputs up to podcast_dialog_instructions
           edit if use_edit else "", # Use the edited transcript only if the checkbox is checked
           *args[12:]                # user_feedback and original_text_output
       ),
       inputs=[
           use_edited_transcript, edited_transcript,
           # Additional inputs
       ],
       outputs=[audio_output, transcript_output, original_text_output, error_output, cached_dialogue]
   )
   ```

4. **Re-render Audio**:

   ```python
   rerender_btn.click(
       fn=render_audio_from_dialogue,
       inputs=[
           cached_dialogue, openai_api_key, audio_model,
           speaker_1_voice, speaker_2_voice, speaker_1_instructions, speaker_2_instructions,
       ],
       outputs=[audio_output, transcript_output],
   )
   ```
## Key Functions

### `validate_and_generate_audio`

This function serves as the entry point for audio generation, validating inputs and handling errors. Note that every return path must yield five values to match the five output components wired to the click handler:

```python
def validate_and_generate_audio(*args):
    files = args[0]
    if not files:
        return None, None, None, "Please upload at least one PDF (or MD/MMD/TXT) file before generating audio.", None
    try:
        audio_file, transcript, original_text, dialogue = generate_audio(*args)
        return audio_file, transcript, original_text, None, dialogue
    except Exception as e:
        return None, None, None, str(e), None
```

### `generate_audio`

This is the core function that orchestrates the entire process:

1. Validates the API key
2. Extracts text from the uploaded files
3. Configures and calls the LLM to generate dialogue
4. Processes any user edits or feedback
5. Generates audio for each dialogue line
6. Returns the audio file, transcript, and original text
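The six steps can be sketched as a small pipeline with the heavy dependencies injected as callables. Everything here is illustrative: `generate_audio_sketch` and the `extract_text`, `run_llm`, and `synthesize` parameters are stand-ins for the real extraction, LLM, and TTS code, not names from `app.py`:

```python
def generate_audio_sketch(files, api_key, extract_text, run_llm, synthesize):
    # 1. Validate the API key
    if not api_key:
        raise ValueError("An OpenAI API key is required.")
    # 2. Extract text from the uploaded files
    original_text = extract_text(files)
    # 3-4. Call the LLM (user edits/feedback would be folded into the prompt here)
    dialogue = run_llm(original_text)
    # 5. Generate audio for each line and build the transcript
    audio = b"".join(synthesize(line) for line in dialogue)
    transcript = "\n\n".join(dialogue)
    # 6. Return the audio bytes, transcript, and original text
    return audio, transcript, original_text


audio, transcript, original = generate_audio_sketch(
    files=["paper.pdf"],
    api_key="sk-test",
    extract_text=lambda fs: "some extracted text",
    run_llm=lambda text: ["speaker-1: Hello.", "speaker-2: Hi."],
    synthesize=lambda line: line.encode(),
)
print(transcript)
```

Structuring the orchestration this way makes each stage independently testable with stubs, as the lambdas above demonstrate.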
### `render_audio_from_dialogue`

This function re-renders audio from an existing dialogue without regenerating the text:

```python
def render_audio_from_dialogue(
    cached_dialogue,
    openai_api_key: str,
    audio_model: str,
    speaker_1_voice: str,
    speaker_2_voice: str,
    speaker_1_instructions: str,
    speaker_2_instructions: str,
) -> tuple[str, str]:
    # Function implementation
    ...
```

### `save_dialogue_edits`

This function saves edits made in the dataframe editor:

```python
def save_dialogue_edits(df, cached_dialogue):
    if cached_dialogue is None:
        raise gr.Error("Nothing to edit yet – run Generate Audio first.")
    import pandas as pd

    new_dlg = df_to_dialogue(pd.DataFrame(df, columns=["Speaker", "Line"]))
    # Regenerate the plain transcript so the user sees the change immediately
    transcript_str = "\n".join(f"{d.speaker}: {d.text}" for d in new_dlg.dialogue)
    # Return the updated state and transcript
    return new_dlg, gr.update(value=transcript_str), "Edits saved. Press *Re‑render* to hear them."
```
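The `df_to_dialogue` helper referenced above converts the edited rows back into a `Dialogue` object. Its actual implementation is not shown in this excerpt; the sketch below is a stdlib-only guess at its shape, using dataclasses as simplified stand-ins for the Pydantic models and plain `(speaker, line)` tuples instead of a pandas DataFrame:

```python
from dataclasses import dataclass, field


@dataclass
class DialogueItem:
    text: str
    speaker: str


@dataclass
class Dialogue:
    scratchpad: str = ""
    dialogue: list = field(default_factory=list)


def df_to_dialogue(rows) -> Dialogue:
    """Rebuild a Dialogue from (speaker, line) rows, dropping blank lines."""
    items = [
        DialogueItem(text=str(line).strip(), speaker=str(speaker).strip())
        for speaker, line in rows
        if str(line).strip()
    ]
    return Dialogue(scratchpad="", dialogue=items)


dlg = df_to_dialogue([("speaker-1", "Hello."), ("speaker-2", "  "), ("speaker-2", "Hi.")])
print(len(dlg.dialogue))  # 2
```

Dropping blank rows matters in practice: Gradio's dynamic dataframe lets users add empty rows, and passing empty strings to the TTS API would waste calls.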
## Integration Points

The application integrates with several external services and libraries:

1. **OpenAI API**: Used for both text generation (GPT models) and text-to-speech conversion
2. **Promptic**: A library for working with LLM prompts
3. **pypdf**: Used for extracting text from PDF files
4. **Gradio**: The web UI framework
5. **Pydantic**: Used for data validation and modeling
6. **Tenacity**: Used to implement retry logic

## Customization Options

The application offers several customization options:

1. **Instruction Templates**: Multiple pre-defined templates for different output formats
2. **Model Selection**: Support for various OpenAI models for both text and audio generation
3. **Voice Selection**: Multiple voice options for the speakers
4. **Voice Instructions**: Custom instructions for each speaker's voice
5. **API Base**: Option to use a custom API endpoint for text generation
6. **Web Search**: Option to enable web search during text generation
7. **Reasoning Effort**: Control over reasoning effort for compatible models
## Conclusion

PDF2Audio is a well-structured application that demonstrates effective use of modern AI APIs for content transformation. Its modular design and comprehensive feature set make it a solid foundation for similar applications.

Key strengths of the codebase include:

1. **Modularity**: Clear separation of concerns between text extraction, dialogue generation, and audio synthesis
2. **Extensibility**: Easy to add new instruction templates or customize existing ones
3. **Error Handling**: Robust error handling with informative user feedback
4. **Performance**: Parallel processing for audio generation
5. **User Experience**: Rich UI with multiple editing and customization options

Developers building similar applications can use this codebase as a starting point, focusing on extending functionality or improving specific aspects rather than starting from scratch.