13Aluminium committed on
Commit e9133ed · verified · 1 Parent(s): e4c7b83

Delete README.md

Files changed (1)
  1. README.md +0 -75
README.md DELETED
@@ -1,75 +0,0 @@
- ---
- title: README
- emoji: 🐢
- colorFrom: indigo
- colorTo: indigo
- sdk: static
- pinned: true
- license: apache-2.0
- short_description: Sanskrit scripture datasets, structured for NLP tasks.
- thumbnail: >-
-   https://cdn-uploads.huggingface.co/production/uploads/66fa59a2ec6983f03c2dd4e0/lF5toNU3-x6igbMxiA1wI.jpeg
- ---
-
-
- ## 1. Shrimad Bhagavad Gita
- [Dataset on Hugging Face](https://huggingface.co/datasets/snskrt/Shrimad_Bhagavad_Gita)
-
- **Short Description:** A structured, chapter-wise dataset of the Śrīmad Bhagavad Gītā with expanded verse counts, enabling fine-grained analysis and modeling of each śloka.
-
- ## 2. Devi Bhagavatam
- [Dataset on Hugging Face](https://huggingface.co/datasets/snskrt/Devi_Bhagavatam)
-
- **Dataset Structure:** Each record (CSV/JSON) includes:
-
- - `skanda` (string): Skanda number, e.g. "1"
- - `adhyaya_number` (string): Adhyāya index, e.g. "१.१"
- - `adhyaya_title` (string): Sanskrit chapter title
- - `a_index` (int): auxiliary sequence index
- - `m_index` (int): main sequence index
- - `text` (string): full śloka text
-
- **Dataset Description:**
- This dataset contains a complete, structured representation of the Śrīmad Devī-bhāgavatam mahāpurāṇe in CSV format, breaking down scripture into Skandas, Adhyāyas, and individual ślokas. Suited for NLP tasks like feature extraction, classification, translation, summarization, and generation.
-
- **Size:** ~18,702 ślokas
-
- ## 3. Shiv Mahapuran
- [Dataset on Hugging Face](https://huggingface.co/datasets/snskrt/Shiv_Mahapuran)
-
- **Dataset Description:**
- This dataset contains a complete, structured representation of the Śiva Mahāpurāṇa (Śivapurāṇa) in CSV format. Data is organized into Saṃhitās (seven surviving Saṃhitās), Khaṇḍas, Adhyāyas, and individual ślokas, enabling precise NLP work on classical Sanskrit scripture.
-
- **Size:** ~24,489 ślokas
-
- **Dataset Structure:** Each record (CSV/JSON) includes:
-
- - `samhita` (string): Name of the Saṃhitā, e.g. "Rudrasaṃhitā"
- - `khanda` (string): Khanda name, e.g. "Parvati kand"
- - `khanda_number` (string): Khanda index, e.g. "1"
- - `adhyay` (string): Adhyāya title or number, e.g. "1.1"
- - `shloka_number` (int): Position of the śloka within the Adhyāya
- - `shloka_text` (string): Full Sanskrit text of the śloka
-
- ## 4. Shiv Puran OCR (Image-Text)
- [Dataset on Hugging Face](https://huggingface.co/datasets/snskrt/Shiv_Puran_Image_text_OCR)
-
- **Dataset Description:**
- A dataset of cropped śloka images from the Vidyeśvara-saṃhitā, paired with their transcribed text. Perfect for training or evaluating OCR systems on classical Sanskrit script.
-
- **Contents:**
- - 734 cropped śloka images
- - A CSV mapping each image filename to its corresponding śloka text
-
- ## 5. Shiv Puran OCR (Object Detection)
- [Dataset on Hugging Face](https://huggingface.co/datasets/snskrt/Shiv_puran_OCR)
-
- **Dataset Description:**
- Annotations and imagery to train object detection models that differentiate śloka vs. non-śloka content in scanned scripture pages. Once detected, ślokas can be cropped for OCR or parallel corpus creation.
-
- **Annotation Structure:**
- - Pages 0–102: Vidyeśvara Saṃhitā (manually annotated)
- - Pages 103–463: Rudra Saṃhitā (model-inferred + manual corrections)
- - Pages 464–508: Śat Rudra Saṃhitā (model-inferred + manual corrections)
-
- **Data Includes:** Bounding-box coordinates and metadata for each detected region.