Brazilian E-Commerce Dataset

Schema

| Table | PK | FK | TS |
| --- | --- | --- | --- |
| geolocation | geolocation_zip_code_prefix | - | - |
| products | product_id | - | - |
| customers | customer_id | customer_zip_code_prefix -> geolocation.geolocation_zip_code_prefix | - |
| sellers | seller_id | seller_zip_code_prefix -> geolocation.geolocation_zip_code_prefix | - |
| orders | order_id | customer_id -> customers.customer_id | - |
| order_items | order_id, order_item_id | order_id -> orders.order_id, product_id -> products.product_id, seller_id -> sellers.seller_id | sort by: order_item_id |
| order_payments | order_id, payment_sequential | order_id -> orders.order_id | sort by: payment_sequential |
| order_reviews | review_id | order_id -> orders.order_id | sort by: review_answer_timestamp |
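
The foreign keys in the schema imply a parent-before-child ordering of the tables. As a minimal sketch, the schema can be written down as a plain Python dict and ordered so that each table appears after the tables it references. The `SCHEMA` layout and the `parents_first` helper are hypothetical names chosen for illustration; they are not taken from this repository's code.

```python
# Hypothetical encoding of the schema table; the dict layout is an assumption
# made for illustration, not the format used by this repository.
# "fk" maps a local column to (parent table, parent column).
SCHEMA = {
    "geolocation": {"pk": ["geolocation_zip_code_prefix"], "fk": {}},
    "products": {"pk": ["product_id"], "fk": {}},
    "customers": {
        "pk": ["customer_id"],
        "fk": {"customer_zip_code_prefix": ("geolocation", "geolocation_zip_code_prefix")},
    },
    "sellers": {
        "pk": ["seller_id"],
        "fk": {"seller_zip_code_prefix": ("geolocation", "geolocation_zip_code_prefix")},
    },
    "orders": {
        "pk": ["order_id"],
        "fk": {"customer_id": ("customers", "customer_id")},
    },
    "order_items": {
        "pk": ["order_id", "order_item_id"],
        "fk": {
            "order_id": ("orders", "order_id"),
            "product_id": ("products", "product_id"),
            "seller_id": ("sellers", "seller_id"),
        },
    },
    "order_payments": {
        "pk": ["order_id", "payment_sequential"],
        "fk": {"order_id": ("orders", "order_id")},
    },
    "order_reviews": {
        "pk": ["review_id"],
        "fk": {"order_id": ("orders", "order_id")},
    },
}


def parents_first(schema):
    """Return table names ordered so every parent precedes its children.

    Depth-first traversal over the foreign-key graph; assumes the graph
    has no cycles, which holds for the schema above.
    """
    ordered, seen = [], set()

    def visit(table):
        if table in seen:
            return
        seen.add(table)
        for parent, _column in schema[table]["fk"].values():
            visit(parent)
        ordered.append(table)

    for table in schema:
        visit(table)
    return ordered


TABLE_ORDER = parents_first(SCHEMA)
```

An ordering like this is convenient whenever a child table can only be checked or generated once its parents are available.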

Download and Prepare

Download the dataset from Kaggle. Put the downloaded files under data/ in this folder.

The raw dataset does not fully conform to its relational schema, so we preprocess it to clean the data. Run preprocess.py; the preprocessed data is written to preprocessed/. Beyond basic cleaning, the core changes are:

  1. The product category names table is ignored because the category can be treated as categorical and the table's information is mainly textual, which is not a focus of this project.
  2. The geolocation data contains many distinct city names. These are textual, and extracting information from them relies on general world knowledge. Cities could be treated as categorical, but the number of categories is far larger than in typical categorical columns. For simplicity, we drop cities.
  3. Some zip code prefixes in the child tables of geolocation are not present in the geolocation table. These rows are removed.
  4. The geolocation zip code prefix is not unique, yet it is referenced as a foreign key, which is invalid. We group rows by zip code prefix, take the mean of geolocation_lat and geolocation_lng, and keep the first geolocation_state.
  5. The reviews table contains a lot of textual information, which is dropped.
  6. Neither review_id nor order_id is unique in order_reviews, and both are hashed ID values, so the original intent of the two IDs is hard to infer. For simplicity, we keep only the first review for each review_id.
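
Changes 3, 4, and 6 can be sketched with pandas as follows. This is an illustrative sketch on toy data, not the contents of preprocess.py: the function names are hypothetical, the toy values are made up, and the order in which the real script applies these steps is an assumption.

```python
import pandas as pd


def dedupe_geolocation(geo: pd.DataFrame) -> pd.DataFrame:
    """Change 4: one row per zip code prefix (mean lat/lng, first state)."""
    return geo.groupby("geolocation_zip_code_prefix", as_index=False).agg(
        geolocation_lat=("geolocation_lat", "mean"),
        geolocation_lng=("geolocation_lng", "mean"),
        geolocation_state=("geolocation_state", "first"),
    )


def drop_orphans(child: pd.DataFrame, fk: str, parent: pd.DataFrame, pk: str) -> pd.DataFrame:
    """Change 3: remove child rows whose foreign key is absent from the parent."""
    return child[child[fk].isin(parent[pk])].reset_index(drop=True)


def dedupe_reviews(reviews: pd.DataFrame) -> pd.DataFrame:
    """Change 6: keep the first row (in file order) for each review_id."""
    return reviews.drop_duplicates(subset="review_id", keep="first").reset_index(drop=True)


# Toy data (made up) that reproduces each problem on a small scale.
geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1000, 1000, 2000],
    "geolocation_lat": [-23.0, -24.0, -22.0],
    "geolocation_lng": [-46.0, -47.0, -43.0],
    "geolocation_state": ["SP", "SP", "RJ"],
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_zip_code_prefix": [1000, 2000, 3000],  # 3000 has no parent row
})
reviews = pd.DataFrame({
    "review_id": ["r1", "r1", "r2"],  # r1 appears twice
    "order_id": ["o1", "o2", "o3"],
})

geo_clean = dedupe_geolocation(geo)
customers_clean = drop_orphans(
    customers, "customer_zip_code_prefix", geo_clean, "geolocation_zip_code_prefix"
)
reviews_clean = dedupe_reviews(reviews)
```

In this sketch "first" means first in row order, so the result depends on how the input file is sorted.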

The content of simplified is the same as the content of preprocessed. For baseline models, a composite primary key formed by a foreign key plus a local ID is ignored, as if the table had no primary key.