Expand org README: datasets-first reframe + clearer contribution paths
#3
by davanstrien HF Staff - opened
README.md
CHANGED
@@ -7,58 +7,28 @@ sdk: static
 pinned: false
 ---

-# BigLAM

-

-

-
-- Train and release open-source models for LAM-relevant tasks
-- Develop tools and approaches tailored to LAM use cases

-
-
-<details>
-<summary><strong>Background</strong></summary>
-
-BigLAM began as a [datasets hackathon](https://github.com/bigscience-workshop/lam) within the [BigScience](https://bigscience.huggingface.co/) project, a large-scale, open NLP collaboration.
-
-Our goal: make LAM datasets more discoverable and usable to support researchers, institutions, and ML practitioners working with cultural heritage data.
-</details>


-
-<summary><strong>What You'll Find</strong></summary>

-

-
-- **Models**: fine-tuned for tasks like:
-  - Art/historical image classification
-  - Document layout analysis and OCR
-  - Metadata quality assessment
-  - Named entity recognition in heritage texts
-- **Spaces**: tools for interactive exploration and demonstration
-</details>

-
-<summary><strong>Get Involved</strong></summary>

-

--
-- Join the discussion on [GitHub](https://github.com/bigscience-workshop/lam/discussions)
-- Contribute your own tools or data
-- Share your work using BigLAM resources
-</details>
-
-## Why It Matters
-
-Cultural heritage data is often underrepresented in machine learning. BigLAM helps address this by:
-
-- Supporting inclusive and responsible AI
-- Helping institutions experiment with ML for access, discovery, and preservation
-- Ensuring that ML systems reflect diverse human knowledge and expression
-- Developing tools and methods that work well with the unique formats, values, and needs of LAMs

+# BigLAM

+A community-run home for machine-learning-ready datasets from libraries, archives, and museums.

+Most cultural-heritage data wasn't originally prepared with ML workflows in mind: it lives in catalogue systems, IIIF endpoints, METS/MODS records, and various idiosyncratic formats that each institution has its own version of. BigLAM is a place where those datasets get repackaged into formats ML practitioners can actually load and work with, contributed by the people who know the source material best.

+The org started as a [datasets hackathon](https://github.com/bigscience-workshop/lam) inside the [BigScience](https://bigscience.huggingface.co/) project in 2022 and has grown into a standing community for cultural-heritage ML.

+## What's here

+The org is datasets-first: 46+ image, text, and tabular collections from libraries, archives, and museums, prepared so they load cleanly with the `datasets` library. A handful of [models](https://huggingface.co/biglam?other=model) and [spaces](https://huggingface.co/biglam?other=space) live here too, mostly early experiments from the BigScience-era hackathon.

+For task-specific, deployable models built on top of these datasets, see the sibling org [small-models-for-glam](https://huggingface.co/small-models-for-glam).

+## Contributing a dataset

+If you've prepared a LAM dataset that other researchers might use, the best home is usually your **institution's own Hugging Face organisation** (e.g. [`NationalLibraryOfScotland`](https://huggingface.co/NationalLibraryOfScotland)). Institutional ownership signals authority over the data and makes long-term maintenance easier. Setting up a new org on the Hub is [free and quick](https://huggingface.co/organizations/new).

+If your institution isn't on the Hub yet, or you'd prefer to host the dataset here, [open a discussion](https://huggingface.co/spaces/biglam/README/discussions) and we'll help get it set up under BigLAM. Useful additions are typically datasets where the format conversion (METS/ALTO → parquet, IIIF manifest → loadable image splits, etc.) has already been done and the licensing is clear enough for open release.

+**Already have a dataset here that should sit under your institution's org?** Open a discussion or issue on the dataset repo; we're happy to transfer ownership.

+---

+60+ contributors over the years. Day-to-day maintenance is light-touch; for help with a contribution, open a discussion and someone will see it.
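
The contribution guidance in the new README mentions format conversion such as METS/ALTO → parquet. As an illustration only, here is a minimal sketch of the first half of that step: flattening ALTO OCR output into one row per text line. It is not code from this PR or from any BigLAM dataset. The sample document, the `alto_to_rows` helper, and the row fields (`page_id`, `line_id`, `text`) are all assumptions made for the example, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO v3 fragment used only for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1">
            <String CONTENT="Cultural"/>
            <SP/>
            <String CONTENT="heritage"/>
          </TextLine>
          <TextLine ID="L2">
            <String CONTENT="data"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_to_rows(xml_text: str) -> list[dict]:
    """Flatten an ALTO document into one row per TextLine.

    Tags are matched by local name, so the ALTO schema version
    (and therefore the namespace URI) does not matter here.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        for line in page.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Direct children of a TextLine are String, SP (space), HYP, ...
            # Only String elements carry recognised words.
            words = [
                el.get("CONTENT", "")
                for el in line
                if local_name(el.tag) == "String"
            ]
            rows.append(
                {
                    "page_id": page.get("ID"),
                    "line_id": line.get("ID"),
                    "text": " ".join(words),
                }
            )
    return rows


if __name__ == "__main__":
    for row in alto_to_rows(SAMPLE_ALTO):
        print(row)
```

Rows shaped like this could then be handed to a tabular library (for example `pyarrow` or `datasets`) to write parquet. Real collections would likely also need hyphenation handling, coordinates, and links back to the page images, which this sketch leaves out.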