# Generate paper JSON files from a collection XML file, with full-text extraction

This is a slightly re-arranged version of Sotaro Takeshita's code, which is available at https://github.com/gengo-proj/data-factory.

## Requirements

- Docker
- Python>=3.10
- Python packages:
  - acl-anthology-py>=0.4.3
  - bs4
  - jsonschema

## Setup

Start the Grobid Docker container:

```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```

Get the metadata from the ACL Anthology:

```bash
git clone git@github.com:acl-org/acl-anthology.git
```

## Usage

```bash
python src/data/acl_anthology_crawler.py \
    --base-output-dir \
    --pdf-output-dir \
    --anthology-data-dir ./acl-anthology/data/
```
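
## Checking that Grobid is reachable

Full-text extraction depends on the Grobid container started during setup, so it can help to confirm that the service answers before launching the crawler. The snippet below is a minimal sketch, not part of the original code: it assumes Grobid is published on `localhost:8070` (as in the `docker run` command) and that its `/api/isalive` health check is available. The names `GROBID_BASE_URL`, `grobid_endpoint`, and `is_grobid_alive` are hypothetical helpers introduced only for illustration.

```python
import http.client
import urllib.request

# Assumption: Grobid is published on localhost:8070, matching the docker run command.
GROBID_BASE_URL = "http://localhost:8070"


def grobid_endpoint(base_url: str, service: str) -> str:
    """Build the URL of a Grobid REST service, e.g. 'isalive'."""
    return f"{base_url.rstrip('/')}/api/{service}"


def is_grobid_alive(base_url: str = GROBID_BASE_URL, timeout: float = 5.0) -> bool:
    """Return True if the Grobid health check answers 'true', False otherwise."""
    url = grobid_endpoint(base_url, "isalive")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8").strip().lower()
            return body == "true"
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or malformed URL:
        # treat all of these as "service not reachable".
        return False


if __name__ == "__main__":
    print("Grobid reachable:", is_grobid_alive())
```

If the check prints `False`, the container is most likely still starting up or the port mapping differs from the one assumed here.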