Brazilian E-Commerce Dataset
Schema
| Table | PK | FK | TS |
|---|---|---|---|
| geolocation | geolocation_zip_code_prefix | - | - |
| products | product_id | - | - |
| customers | customer_id | customer_zip_code_prefix -> geolocation.geolocation_zip_code_prefix | - |
| sellers | seller_id | seller_zip_code_prefix -> geolocation.geolocation_zip_code_prefix | - |
| orders | order_id | customer_id -> customers.customer_id | - |
| order_items | order_id, order_item_id | order_id -> orders.order_id, product_id -> products.product_id, seller_id -> sellers.seller_id | sort by: order_item_id |
| order_payments | order_id, payment_sequential | order_id -> orders.order_id | sort by: payment_sequential |
| order_reviews | review_id | order_id -> orders.order_id | sort by: review_answer_timestamp |
Download and Prepare
Download the dataset from Kaggle. Put the downloaded
files under data/ in this folder.
The raw dataset does not fully follow its relational schema, so we preprocess it to clean the data.
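One way to see where the raw files depart from the schema is to look for foreign-key values with no matching parent key. The following is a minimal pure-Python sketch and is not taken from preprocess.py: the helper name find_orphans and the toy rows are hypothetical, and only the column names come from the schema table above.

```python
def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]


# Toy rows with made-up values standing in for the real CSV contents.
geolocation = [
    {"geolocation_zip_code_prefix": "01037"},
    {"geolocation_zip_code_prefix": "01046"},
]
customers = [
    {"customer_id": "c1", "customer_zip_code_prefix": "01037"},
    {"customer_id": "c2", "customer_zip_code_prefix": "99999"},
]

orphans = find_orphans(
    customers, "customer_zip_code_prefix",
    geolocation, "geolocation_zip_code_prefix",
)
print([row["customer_id"] for row in orphans])
```

In this toy input, only the customer whose zip code prefix is absent from geolocation is reported.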
Run preprocess.py; the preprocessed data is written to preprocessed/. Besides the basic changes, the core changes are:
- The product category names table is ignored because the category can be treated as categorical and the table's information is mainly textual, which is not a focus of this project.
- The geolocation information contains many different city names, which are textual and rely on general world knowledge to interpret. Cities could be treated as categorical, but the number of categories is far larger than in typical categorical columns. For simplicity, we drop cities.
- Some zip code prefixes in child tables of geolocation are not present in the geolocation table. These rows are removed.
- The geolocation zip code prefix is not unique, but it is used as a foreign key, which is essentially invalid. We group rows with the same zip code prefix, take the mean geolocation_lat and geolocation_lng, and use the first geolocation_state.
- The reviews table contains a lot of textual information, which is dropped.
- Neither review_id nor order_id is unique for order_reviews, although both are hashed ID values, so it is hard to infer the original intention of these two IDs. For simplicity, we keep only the first review for each review_id.
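The three row-level rules in the list above (aggregating geolocation, removing child rows with unknown zip code prefixes, and keeping the first review per review_id) can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the implementation in preprocess.py: the function names and the toy values are hypothetical, and rows are modelled as plain dictionaries.

```python
from statistics import mean


def aggregate_geolocation(rows):
    """Collapse rows sharing a zip code prefix into a single row."""
    groups = {}
    for row in rows:
        groups.setdefault(row["geolocation_zip_code_prefix"], []).append(row)
    return [
        {
            "geolocation_zip_code_prefix": prefix,
            "geolocation_lat": mean(r["geolocation_lat"] for r in group),
            "geolocation_lng": mean(r["geolocation_lng"] for r in group),
            # The first state seen for this prefix is kept.
            "geolocation_state": group[0]["geolocation_state"],
        }
        for prefix, group in groups.items()
    ]


def drop_orphans(child_rows, fk_column, valid_keys):
    """Keep only child rows whose foreign key exists in the parent table."""
    return [row for row in child_rows if row[fk_column] in valid_keys]


def keep_first(rows, key_column):
    """Keep the first row seen for each value of key_column."""
    seen, kept = set(), []
    for row in rows:
        if row[key_column] not in seen:
            seen.add(row[key_column])
            kept.append(row)
    return kept


# Toy rows with made-up values; only the column names follow the schema.
raw_geolocation = [
    {"geolocation_zip_code_prefix": "01037", "geolocation_lat": -23.0,
     "geolocation_lng": -46.0, "geolocation_state": "SP"},
    {"geolocation_zip_code_prefix": "01037", "geolocation_lat": -24.0,
     "geolocation_lng": -47.0, "geolocation_state": "SP"},
    {"geolocation_zip_code_prefix": "20000", "geolocation_lat": -22.5,
     "geolocation_lng": -43.0, "geolocation_state": "RJ"},
]
raw_customers = [
    {"customer_id": "c1", "customer_zip_code_prefix": "01037"},
    {"customer_id": "c2", "customer_zip_code_prefix": "99999"},
]
raw_reviews = [
    {"review_id": "r1", "order_id": "o1"},
    {"review_id": "r1", "order_id": "o2"},
    {"review_id": "r2", "order_id": "o3"},
]

geolocation = aggregate_geolocation(raw_geolocation)
valid_prefixes = {row["geolocation_zip_code_prefix"] for row in geolocation}
customers = drop_orphans(raw_customers, "customer_zip_code_prefix", valid_prefixes)
order_reviews = keep_first(raw_reviews, "review_id")
```

Aggregating geolocation first matters here: the set of valid prefixes used to filter customers and sellers is then taken from a table in which the prefix is a proper key.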
The content of simplified is the same as the content of preprocessed. For baseline models, a composite primary key made up of a foreign key and a local ID (as in order_items and order_payments) is ignored, and the table is treated as having no primary key.
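As a small illustration of the composite-key rule, the sketch below returns a primary key only when it consists of a single column. The PRIMARY_KEYS mapping is transcribed from the schema table; the function name baseline_primary_key is hypothetical, and how the baseline models actually consume this information is not specified here.

```python
# Primary keys transcribed from the schema table, one entry per table.
PRIMARY_KEYS = {
    "geolocation": ["geolocation_zip_code_prefix"],
    "products": ["product_id"],
    "customers": ["customer_id"],
    "sellers": ["seller_id"],
    "orders": ["order_id"],
    "order_items": ["order_id", "order_item_id"],
    "order_payments": ["order_id", "payment_sequential"],
    "order_reviews": ["review_id"],
}


def baseline_primary_key(table):
    """Return the single-column primary key, or None for a composite key."""
    columns = PRIMARY_KEYS[table]
    return columns[0] if len(columns) == 1 else None


# Tables that a baseline would treat as having no primary key.
keyless = [name for name in PRIMARY_KEYS if baseline_primary_key(name) is None]
print(keyless)
```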