Example Usage
This guide demonstrates how to use Schema Descriptor to generate descriptions for BigQuery datasets.
Prerequisites
Before you begin, make sure you have:
- A Google Cloud service account with access to BigQuery
- An OpenAI API key
- Schema Descriptor installed and configured (see the README.md)
Basic Usage
Step 1: Start the application
streamlit run app.py
This will open the application in your web browser.
Step 2: Configure Authentication
- Enter your OpenAI API key in the sidebar
- Upload your Google Cloud service account JSON key file
- The application will verify your credentials
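If the upload is rejected, it can help to confirm that the file is a service account key at all. The sketch below is not the application's own verification logic, which this guide does not show. It only checks for fields that Google Cloud service account JSON keys normally contain, and the function name `key_file_problems` is hypothetical.

```python
import json

# Fields normally present in a Google Cloud service account JSON key.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email", "token_uri")


def key_file_problems(key_json: str) -> list[str]:
    """Return a list of problems found in a key file's text (empty if none).

    Hypothetical helper: a structural check only. It does not contact
    Google Cloud and cannot tell whether the key is still valid.
    """
    info = json.loads(key_json)
    if not isinstance(info, dict):
        return ["not a JSON object"]
    # Report every required field that is absent or empty.
    problems = [name for name in REQUIRED_FIELDS if not info.get(name)]
    # A present but wrong "type" usually means a different kind of credential.
    if info.get("type") not in (None, "", "service_account"):
        problems.append("type (expected 'service_account')")
    return problems
```

An empty result only means the file has the expected shape. Whether the account can actually reach BigQuery still depends on its permissions.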
Step 3: Select Project and Dataset
- Enter your Google Cloud project ID
- Select a dataset from the dropdown menu
- Verify that the tables are displayed correctly
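To cross-check the table list outside the application, the tables in a dataset can be listed with the `google-cloud-bigquery` client. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and it is not taken from the application's source.

```python
def make_client(key_path: str):
    """Create a BigQuery client from a service account key file.

    Requires the google-cloud-bigquery package; imported lazily so that
    list_table_ids() below can be used without it.
    """
    from google.cloud import bigquery

    return bigquery.Client.from_service_account_json(key_path)


def list_table_ids(client, dataset_id: str) -> list[str]:
    """Return the table IDs in a dataset, sorted by name.

    `client` is expected to be a google.cloud.bigquery.Client; any object
    with a compatible list_tables() method also works.
    """
    return sorted(item.table_id for item in client.list_tables(dataset_id))
```

For example, `list_table_ids(make_client("key.json"), "my_dataset")` would return the table names for a dataset called `my_dataset`; both the file name and the dataset name are placeholders.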
Step 4: Configure Sampling Parameters
- Adjust the "Sample Size" slider to control how many rows to sample per table
- If your tables are partitioned, set date filters to sample a specific range
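The sampling settings above roughly correspond to a query with a row limit and, for partitioned tables, a date filter on the partition column. The query the application actually issues is not shown in this guide, so the following is an illustrative sketch. It assumes a `DATE` partition column, and `build_sample_query` is a hypothetical name.

```python
import datetime
import re

# Conservative patterns for `project.dataset.table` and a column name.
_TABLE_RE = re.compile(r"^[A-Za-z0-9_\-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_\-]+$")
_COLUMN_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_sample_query(table, sample_size, partition_column=None,
                       start=None, end=None):
    """Build a sampling query (hypothetical helper).

    `start` and `end` are datetime.date values and are required when
    `partition_column` is given. The column is assumed to be of type DATE.
    """
    if not _TABLE_RE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    sql = f"SELECT * FROM `{table}`"
    if partition_column is not None:
        if not _COLUMN_RE.match(partition_column):
            raise ValueError(f"unexpected column name: {partition_column!r}")
        if start is None or end is None:
            raise ValueError("start and end are required with partition_column")
        # Filtering on the partition column lets BigQuery skip other partitions.
        sql += (f" WHERE {partition_column} BETWEEN "
                f"DATE '{start.isoformat()}' AND DATE '{end.isoformat()}'")
    sql += f" LIMIT {int(sample_size)}"
    return sql
```

On a partitioned table, the date filter is what narrows the data scanned. The `LIMIT` clause caps the rows returned but generally does not by itself reduce the bytes a query reads.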
Step 5: Generate Descriptions
- Click "Check Cost" to see an estimate of the BigQuery usage (optional)
- Click "Create Data Descriptions" to start the process
- Watch the progress indicators as the application:
  - Samples data from each table
  - Sends information to the LLM
  - Generates descriptions for the dataset, tables, and columns
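A cost check like the one behind "Check Cost" can be approximated with a BigQuery dry run, which reports the bytes a query would process without running it. How the application computes its estimate is not documented here, so treat this as a sketch. The function names are hypothetical, and the price per TiB is passed in by the caller because on-demand pricing varies by region and over time.

```python
def estimate_bytes(client, sql: str) -> int:
    """Dry-run a query and return the number of bytes it would process.

    `client` is a google.cloud.bigquery.Client. A dry run does not execute
    the query and is not billed.
    """
    from google.cloud import bigquery  # requires google-cloud-bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def bytes_to_usd(num_bytes: int, usd_per_tib: float) -> float:
    """Convert bytes processed to an approximate on-demand cost.

    `usd_per_tib` is an assumption supplied by the caller; check the
    current BigQuery pricing page for your region.
    """
    return num_bytes / 2**40 * usd_per_tib
```

This covers BigQuery usage only. OpenAI API charges for the description step are separate and are not estimated by this sketch.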
Step 6: Review and Edit
- Review the automatically generated descriptions
- Edit any descriptions that need improvement or correction
- The editor supports markdown formatting for better readability
Step 7: Save to BigQuery
- When you're satisfied with the descriptions, click "Commit Changes to BigQuery"
- The application will update your BigQuery metadata with the new descriptions
- You'll see a confirmation message when complete
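For reference, writing descriptions back with the `google-cloud-bigquery` client might look like the sketch below. It is not the application's implementation, and the function names are hypothetical. The pure helper works on BigQuery's API representation of a schema (a list of dicts), which keeps it easy to test. Nested `RECORD` fields are copied unchanged.

```python
def apply_column_descriptions(schema, descriptions):
    """Return a copy of `schema` with descriptions applied.

    `schema` is a list of field dicts in BigQuery's API representation;
    `descriptions` maps column names to description text. Columns not
    named in `descriptions` are returned unchanged.
    """
    updated = []
    for field in schema:
        new_field = dict(field)  # shallow copy so the input is not mutated
        if field["name"] in descriptions:
            new_field["description"] = descriptions[field["name"]]
        updated.append(new_field)
    return updated


def commit_descriptions(client, table_id, table_description, column_descriptions):
    """Update a table's description and column descriptions in BigQuery."""
    from google.cloud import bigquery  # requires google-cloud-bigquery

    table = client.get_table(table_id)
    current = [field.to_api_repr() for field in table.schema]
    table.schema = [
        bigquery.SchemaField.from_api_repr(field)
        for field in apply_column_descriptions(current, column_descriptions)
    ]
    table.description = table_description
    # Only the listed properties are sent to the API.
    client.update_table(table, ["description", "schema"])
```

The service account needs permission to update table metadata for `commit_descriptions` to succeed; read-only access is enough for sampling but not for this step.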
Advanced Features
Custom Instructions
You can provide custom instructions to the LLM by entering them in the "Additional Instructions" field. For example:
- "Focus on data governance aspects"
- "Highlight PII and sensitive data fields"
- "Use technical terminology appropriate for financial data"
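One plausible way for such instructions to reach the model is to append them to the prompt. The application's real prompt is not shown in this guide, so the wording and the `build_prompt` name below are assumptions made purely for illustration.

```python
def build_prompt(table_name, columns, additional_instructions=""):
    """Assemble a description prompt (hypothetical wording).

    `additional_instructions` mirrors the "Additional Instructions" field;
    it is appended only when it contains non-whitespace text.
    """
    lines = [
        f"Describe the BigQuery table `{table_name}` and each of its columns.",
        "Columns: " + ", ".join(columns),
    ]
    extra = additional_instructions.strip()
    if extra:
        lines.append("Additional instructions: " + extra)
    return "\n".join(lines)
```

Under this design, short and specific instructions tend to work best, since they share the prompt with the sampled data and schema.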
Error Handling
If you encounter errors:
- Check the logs in the console where you started Streamlit
- Verify that your service account has the correct permissions
- For OpenAI API errors, check your rate limits and API key status
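Rate-limit errors are often transient, and a common remedy is to retry with exponential backoff. Whether the application already retries is not stated in this guide. The helper below is a generic, hypothetical sketch using only the standard library.

```python
import time


def call_with_backoff(func, retry_on=(Exception,), attempts=4,
                      base_delay=1.0, sleep=time.sleep):
    """Call `func()` and retry on the given exceptions.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises the
    last error once `attempts` calls have failed. `sleep` is injectable so
    the helper can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2**attempt)
```

With recent versions of the `openai` Python package, `retry_on=(openai.RateLimitError,)` would be a reasonable choice. Authentication errors should not be retried, since a bad API key will fail every time.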
Caching
The application caches LLM responses to save costs. If you want to regenerate descriptions:
- Clear the cache by restarting the application
- Or use the "Force Refresh" option, if your version of the application provides one
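The behavior described above is consistent with an in-memory cache keyed by the prompt, which is why a restart clears it. The application's actual cache is not shown here. The following sketch, with hypothetical names, illustrates the idea along with a force-refresh flag.

```python
import hashlib

# In-memory cache: lost when the process restarts.
_cache: dict[str, str] = {}


def cached_describe(prompt, generate, force_refresh=False):
    """Return a cached response for `prompt`, calling `generate` on a miss.

    `generate` is any callable that takes the prompt and returns text
    (for example, a wrapper around an LLM call). With force_refresh=True
    the cached entry is replaced by a fresh response.
    """
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if force_refresh or key not in _cache:
        _cache[key] = generate(prompt)
    return _cache[key]
```

Because the key is derived from the full prompt, changing the sample, the schema, or the additional instructions would produce a new entry rather than reuse an old one.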
Example Outputs
Below is an example of how your descriptions might look in BigQuery after using Schema Descriptor:
Dataset Description
Sales Data Warehouse (SDW)
This dataset contains comprehensive sales transaction data from our e-commerce platform. It includes customer information, product details, orders, and shipping data from January 2020 to present.
The data is refreshed daily through an ETL process and is used for sales reporting, customer analysis, and inventory management.
Table Description
Customer Orders Table
This table records all customer orders with associated metadata. Each row represents a unique order with details about the customer, timing, payment method, and order status.
The table is partitioned by order_date for efficient querying of specific time periods.
Column Descriptions
- customer_id: Unique identifier for the customer who placed the order
- order_date: Date when the order was placed (YYYY-MM-DD format)
- payment_method: Method used for payment (e.g., "credit_card", "paypal", "gift_card")
- order_total: Total monetary value of the order in USD, excluding tax and shipping
Conclusion
Schema Descriptor makes it easy to maintain comprehensive, accurate documentation for your BigQuery resources with minimal manual effort.
For more details on the application's features and configuration options, refer to the README.md.



