Using NLP for Administrative Document Text
This model classifies The Gazette of India documents by identifying the most probable issuing ministry. It facilitates policy research by enabling the systematic analysis of administrative documents in India. This project is currently under development.
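As a rough illustration of the intended use, the sketch below loads the model through the Hugging Face `transformers` text-classification pipeline and returns the highest-scoring label for a piece of gazette text. It assumes the repository `ritwizs/gazette_model` loads with that pipeline and that each label corresponds to an issuing ministry; the helper names `load_classifier` and `most_probable_ministry` are hypothetical and not part of the project.

```python
from typing import Any, Callable, Dict

MODEL_ID = "ritwizs/gazette_model"  # repository name taken from this model card


def load_classifier(model_id: str = MODEL_ID) -> Callable[..., Any]:
    """Build a text-classification pipeline (assumes the repo is pipeline-compatible)."""
    # Imported lazily so the helper below works without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def most_probable_ministry(classifier: Callable[..., Any], text: str) -> Dict[str, Any]:
    """Return the highest-scoring {'label', 'score'} prediction for one document."""
    # truncation=True keeps long notices within the base model's input length limit.
    predictions = classifier(text, truncation=True)
    return max(predictions, key=lambda p: p["score"])
```

A call such as `most_probable_ministry(load_classifier(), notice_text)` would then return a dictionary with the predicted label and its score; the actual label names depend on how the model was trained.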
Why?
The Gazette of India is the official public journal and authorized legal document of the Government of India, published weekly by the Directorate of Printing under the Ministry of Housing and Urban Affairs. It serves as the medium for the government to disseminate official notices, laws, ordinances, and other statutory information, making such documents legally enforceable and accessible to the public.
For economic policy research, The Gazette is invaluable as it provides authoritative records of policy changes, legislative amendments, and regulatory updates. Researchers can trace the evolution of economic policies, assess their legal foundations, and analyze their impacts on various sectors, thereby gaining comprehensive insights into the government's economic strategies and interventions.
Training Dataset
I use various sources, including government websites and the Internet Archive, to scrape over 780,000 issues of The Gazette of India and numerous state gazettes. These sources usually organize gazettes by issuing ministry, which implicitly labels each document. Scanned documents are converted to text with Tesseract OCR unless otherwise noted.
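The implicit labelling can be sketched as follows, assuming (hypothetically) that the OCR output is stored as one `.txt` file per document inside a folder named after the issuing ministry. The directory layout and the function name `iter_labelled_texts` are illustrative assumptions, not a description of the actual pipeline.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_labelled_texts(root: Path) -> Iterator[Tuple[str, str]]:
    """Yield (text, ministry) pairs from an assumed layout root/<ministry>/<doc>.txt."""
    for ministry_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for txt_file in sorted(ministry_dir.glob("*.txt")):
            # The folder name serves as the label; no manual annotation is involved.
            yield txt_file.read_text(encoding="utf-8"), ministry_dir.name
```

Deriving labels from the source's own organization avoids hand annotation, at the cost of inheriting any misfiling present in the source.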
Acknowledgements and Credits
I am Ritwiz Sarma, an econ postgrad and data science enthusiast from the Madras School of Economics, India. I worked on this training dataset and model as part of my collaboration with Ankit Bhatia, PhD candidate at Johns Hopkins SAIS, Washington, DC, USA.
Model tree for ritwizs/gazette_model
Base model: google-bert/bert-base-uncased