yeongseonchoe committed on
Commit 6672631 · verified · 1 Parent(s): a2f30e9

docs: add organization card content

Files changed (1)
  1. README.md +60 -3
README.md CHANGED
@@ -1,10 +1,67 @@
1
  ---
2
  title: README
3
- emoji: 💻
4
- colorFrom: red
5
  colorTo: green
6
  sdk: static
7
  pinned: false
8
  ---
9
 
10
- Edit this `README.md` markdown file to author your organization card.
1
  ---
2
  title: README
3
+ emoji: "📊"
4
+ colorFrom: blue
5
  colorTo: green
6
  sdk: static
7
  pinned: false
8
  ---
9
 
10
+ # kpubdata: Korean Public Data for Everyone
11
+
12
+ Making Korean government open data accessible worldwide with a single line of code.
13
+
14
+ ```python
15
+ from datasets import load_dataset
16
+
17
+ ds = load_dataset("kpubdata/seoul-apartment-trades")
18
+ df = ds["train"].to_pandas()
19
+ ```
20
+
21
+ ## Mission
22
+
23
+ Korean public data ([data.go.kr](https://www.data.go.kr)) is valuable but hard to access:
24
+ complex API authentication, XML responses, Korean-only documentation,
25
+ and no standard formats like Parquet or HuggingFace Datasets.
26
+
27
+ We bridge the gap: raw public data, cleaned and published as HuggingFace Datasets.
28
+ No feature engineering, no opinions. Just honest, well-documented government data ready to use.
29
+
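The XML-to-table step described in the Mission section can be sketched with the standard library alone. This is a minimal illustration, not the kpubdata SDK's actual parser: the payload layout, the `item` element name, the field names, and the `xml_items_to_records` helper are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: element and field names are illustrative only,
# not the actual schema of any data.go.kr API.
SAMPLE_XML = """
<response>
  <body>
    <items>
      <item><district>Gangnam-gu</district><price>150000</price></item>
      <item><district>Mapo-gu</district><price>98000</price></item>
    </items>
  </body>
</response>
"""


def xml_items_to_records(xml_text: str) -> list[dict[str, str]]:
    """Flatten every <item> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("item")
    ]


records = xml_items_to_records(SAMPLE_XML)
```

A list of flat dicts like `records` is a convenient intermediate form because it can be handed to a tabular library and written out as Parquet without further reshaping.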
30
+ ## Principles
31
+
32
+ - **Source fidelity**: Original Korean text values preserved as-is. English column names for accessibility.
33
+ - **Schema honesty**: What is declared in the config is exactly what you get. No phantom columns, no all-null surprises.
34
+ - **Global-first documentation**: Dataset cards in English with Korean domain context explained for international users.
35
+ - **No feature engineering**: We publish clean raw data. Users add derived features (geocoding, distances, etc.) themselves, just like Kaggle.
36
+
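The schema honesty principle could be enforced with a check along the following lines. This is a sketch under stated assumptions: `schema_problems`, the sample rows, and the column names are hypothetical and are not taken from kpubdata-builder.

```python
def schema_problems(records, declared_columns):
    """Return human-readable problems; an empty list means the data matches."""
    problems = []
    declared = set(declared_columns)

    # Collect every column name that actually appears in the data.
    seen = set()
    for row in records:
        seen.update(row)

    # Phantom columns: declared in the config but absent from the data.
    for col in sorted(declared - seen):
        problems.append(f"declared but missing: {col}")

    # Columns present in the data that the config never declared.
    for col in sorted(seen - declared):
        problems.append(f"undeclared column: {col}")

    # Declared columns that exist but never hold a value.
    for col in sorted(declared & seen):
        if all(row.get(col) is None for row in records):
            problems.append(f"all-null column: {col}")

    return problems


# Hypothetical rows and declared schema, for illustration only.
rows = [
    {"district": "Gangnam-gu", "price": 150000, "floor": None},
    {"district": "Mapo-gu", "price": 98000, "floor": None},
]
issues = schema_problems(rows, ["district", "price", "floor", "built_year"])
```

Here `issues` reports one phantom column (`built_year`) and one all-null column (`floor`); a publishing step could refuse to upload whenever the list is non-empty.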
37
+ ## Available Datasets
38
+
39
+ | Dataset | Records | Period | Source | Description |
40
+ |---|---:|---|---|---|
41
+ | [seoul-apartment-trades](https://huggingface.co/datasets/kpubdata/seoul-apartment-trades) | ~234k | 2020–2024 | MOLIT via data.go.kr | Apartment sale transactions in Seoul, all 25 districts |
42
+
43
+ *More datasets coming: air quality, weather, transit, and more.*
44
+
45
+ ## How It Works
46
+
47
+ ```
48
+ [data.go.kr API] → [kpubdata SDK] → [kpubdata-builder pipeline] → [HuggingFace Dataset]
49
+ ```
50
+
51
+ 1. **[kpubdata](https://github.com/yeongseon/kpubdata)**: Python SDK that handles API auth, pagination, and response parsing for Korean public data portals
52
+ 2. **[kpubdata-builder](https://github.com/yeongseon/kpubdata-builder)**: Pipeline that fetches, transforms, validates, and publishes datasets to HuggingFace
53
+
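The fetch, transform, validate, publish flow described above might be wired together roughly as follows. Every name here (`fetch`, `transform`, `validate`, `publish`, `COLUMN_MAP`, `DECLARED_COLUMNS`) and the sample row are hypothetical stand-ins chosen for illustration; they are not the real kpubdata or kpubdata-builder APIs, and the network and upload steps are replaced with in-memory stubs.

```python
# All names below are hypothetical stand-ins, not the real kpubdata APIs.
COLUMN_MAP = {"시군구": "district", "거래금액": "price"}
DECLARED_COLUMNS = ["district", "price"]


def fetch():
    """Stand-in for the SDK step (auth, pagination, parsing are omitted)."""
    return [{"시군구": "강남구", "거래금액": "150,000"}]


def transform(rows):
    """Rename columns to English; values are passed through untouched."""
    return [
        {COLUMN_MAP.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


def validate(rows):
    """Reject any row whose columns differ from the declared schema."""
    for row in rows:
        if set(row) != set(DECLARED_COLUMNS):
            raise ValueError(f"unexpected columns: {sorted(row)}")
    return rows


def publish(rows, destination):
    """Stand-in for the upload step: append to an in-memory list."""
    destination.extend(rows)
    return len(rows)


published = []
count = publish(validate(transform(fetch())), published)
```

Keeping each stage a plain function that takes and returns rows makes it easy to test stages in isolation and to stop before publishing when validation fails.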
54
+ ## Contributing
55
+
56
+ We welcome contributions! If there is a Korean public dataset you would like to see on HuggingFace:
57
+
58
+ 1. Check if the source API is available on [data.go.kr](https://www.data.go.kr)
59
+ 2. Open an issue on [kpubdata-builder](https://github.com/yeongseon/kpubdata-builder/issues)
60
+ 3. Or submit a PR with a new dataset config (see [publishing standards](https://github.com/yeongseon/kpubdata-builder/blob/main/docs/hf-publishing-standards.md))
61
+
62
+ ## License
63
+
64
+ Datasets are published under licenses compatible with their original government data licenses.
65
+ Most Korean public data uses 공공누리 (Korea Open Government License), mapped to CC-BY-4.0.
66
+
67
+ See individual dataset cards for specific licensing details.