---
title: Bigquery Metadata Generator
emoji: 📈
colorFrom: green
colorTo: red
sdk: streamlit
sdk_version: 1.43.0
app_file: app.py
pinned: false
short_description: GPT-3.5 to generate column, table and dataset descriptions
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


# Schema Descriptor

A Streamlit application that automatically generates column, table, and dataset descriptions for BigQuery using OpenAI's language models.

## Overview

Schema Descriptor helps data teams create and maintain comprehensive documentation for their BigQuery datasets by:

1. Sampling data from tables
2. Generating human-readable descriptions using LLMs
3. Writing the descriptions back to BigQuery metadata
4. Providing a user interface to review and edit descriptions before committing
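The first two steps above can be sketched as a small pipeline. The function and parameter names here are illustrative only, not the app's actual API; sampling and description services are injected so the flow is visible without any cloud dependencies:

```python
def build_data_dictionary(tables, sample_rows, describe):
    """Illustrative pipeline: sample each table, then ask an LLM for a
    description. In the real app the results are reviewed in the UI
    before any write-back to BigQuery metadata happens."""
    descriptions = {}
    for table in tables:
        rows = sample_rows(table)                     # step 1: sample data
        descriptions[table] = describe(table, rows)   # step 2: generate text
    return descriptions                               # steps 3-4 follow review
```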

## Features

- **Authentication**: Secure authentication with Google Cloud Platform using service account keys
- **Cost Estimation**: Calculate the cost of BigQuery operations before running them
- **Customisable Sampling**: Control the number of rows sampled from each table
- **Date Filtering**: Automatic partition detection, allowing larger tables to be filtered by date
- **Interactive UI**: Edit generated descriptions before committing them to BigQuery
- **Caching**: LLM responses are cached to reduce API costs
- **Error Resilience**: Retry logic and fallback mechanisms for API failures
- **Progress Tracking**: Detailed progress information during long operations
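The caching and retry behaviour can be approximated with the standard library; this is a minimal sketch, and the actual `llm_service` implementation may differ. The `call` parameter stands in for the real OpenAI request:

```python
import time

def with_retries(fn, attempts=3, delay=0.0):
    """Call fn(); on failure, retry up to `attempts` times before giving up."""
    last = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last = exc
            time.sleep(delay)
    raise last

_cache = {}

def cached_describe(prompt, call):
    """Return a cached description; on a cache miss, invoke the (possibly
    flaky) LLM call with retries. `call` stands in for the OpenAI request."""
    if prompt not in _cache:
        _cache[prompt] = with_retries(lambda: call(prompt))
    return _cache[prompt]
```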

## Installation

1. Clone the repository
2. Create a virtual environment and activate it:
   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install dependencies in the correct order (this is important due to dependency constraints):
   ```
   # Install key dependencies with specific versions first
   pip install protobuf==3.20.3
   pip install altair==4.2.2
   pip install streamlit==1.12.0
   pip install openai==0.28.0
   
   # Install remaining packages
   pip install -r requirements.txt --no-deps
   ```

### Dependency Constraints

This project has specific dependency requirements due to compatibility constraints:

- **protobuf**: Must be exactly 3.20.3 to work with both Streamlit and Google Cloud libraries
- **altair**: Must be 4.2.2 to work with Streamlit 1.12.0
- **streamlit**: Version 1.12.0 is required
- **openai**: Version 0.28.0 is required for the current API integration

Installing dependencies in a different order or with different versions may cause errors.

For detailed information about dependencies, see [DEPENDENCY_NOTES.md](DEPENDENCY_NOTES.md).
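A quick way to verify the pins landed correctly; the package names and versions come from the list above:

```python
from importlib.metadata import version, PackageNotFoundError

PINS = {"protobuf": "3.20.3", "altair": "4.2.2",
        "streamlit": "1.12.0", "openai": "0.28.0"}

def check_pins(pins, get_version=version):
    """Return {package: installed_version_or_None} for every mismatch."""
    mismatches = {}
    for pkg, wanted in pins.items():
        try:
            got = get_version(pkg)
        except PackageNotFoundError:
            got = None
        if got != wanted:
            mismatches[pkg] = got
    return mismatches
```

Running `check_pins(PINS)` after installation should return an empty dict.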

## Usage

1. Run the Streamlit application:
   ```
   streamlit run app.py
   ```

2. Enter your OpenAI API key in the sidebar

3. Upload your Google Cloud service account key (JSON file) in the sidebar

4. Enter your BigQuery project and dataset IDs

5. (Optional) Adjust sampling parameters and date filters

6. Click "Check Cost" to estimate the cost of your operation

7. Click "Create Data Descriptions" to generate descriptions for your dataset and tables

8. Review and edit the descriptions in the main window

9. Click "Commit Changes to BigQuery" to save the descriptions back to your BigQuery metadata
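The cost check in step 6 is typically based on a BigQuery dry run, which reports bytes scanned without executing the query. The helper below converts that figure to dollars; the on-demand price per TiB is an assumption and varies by region, so check current BigQuery pricing:

```python
def estimate_query_cost(total_bytes_processed, price_per_tib=6.25):
    """Convert dry-run bytes scanned to an estimated on-demand cost in USD.
    price_per_tib is an assumed rate; BigQuery pricing varies by region."""
    return (total_bytes_processed / 2**40) * price_per_tib

# With google-cloud-bigquery (sketch, not executed here):
#   job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
#   job = client.query(sql, job_config=job_config)
#   print(estimate_query_cost(job.total_bytes_processed))
```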

For more detailed instructions with screenshots, see [docs/example_usage.md](docs/example_usage.md).

If you run into issues, check the [docs/troubleshooting.md](docs/troubleshooting.md) guide.
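Step 9's write-back goes through the BigQuery metadata API. A minimal sketch, assuming an authenticated `google.cloud.bigquery.Client` (injected here so the logic is shown without credentials; `SchemaField` objects are immutable, hence the rebuild):

```python
def commit_descriptions(client, table_ref, table_desc, column_descs):
    """Write a table description and per-column descriptions back to
    BigQuery metadata. `client` is a google.cloud.bigquery.Client."""
    table = client.get_table(table_ref)
    table.description = table_desc
    # SchemaField objects are immutable, so rebuild the schema with new text
    table.schema = [
        type(field)(field.name, field.field_type, mode=field.mode,
                    description=column_descs.get(field.name, field.description))
        for field in table.schema
    ]
    # Column descriptions live in the schema, so both fields are updated
    return client.update_table(table, ["description", "schema"])
```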

## Project Structure

### Core Application
- `app.py`: Main Streamlit application and UI
- `config.py`: Configuration settings and environment variables
- `errors.py`: Custom exception classes for error handling

### Services
- `services/auth_service.py`: Authentication with Google Cloud
- `services/bigquery_service.py`: BigQuery operations and metadata management
- `services/llm_service.py`: Language model integration with error handling
- `services/data_dictionary_service.py`: Core business logic for data dictionaries

### Utilities
- `utils/bq_utils.py`: BigQuery utility functions
- `utils/text_utils.py`: Text processing utilities
- `utils/progress_utils.py`: Progress tracking and reporting

## Requirements

- Python 3.9+
- Google Cloud service account with BigQuery access
- OpenAI API key

## Security Note

This application requires access to your BigQuery data and uses OpenAI's API. Please ensure:

1. Your service account has appropriate permissions
2. You review generated descriptions before committing them to ensure no sensitive data is exposed

## Contributing

We welcome contributions to improve Schema Descriptor! Please see the [CONTRIBUTING.md](CONTRIBUTING.md) file for guidelines and instructions.

## License

This project is licensed under the MIT License - see the LICENSE file for details.