This repository contains the project for the course 2110446 Data Science and Data Engineering.
Presentations: https://youtu.be/_z3PZ1i0JpA
How to run the project
Video: https://youtu.be/vnNhSXqAvZc
Prerequisites
- git
- docker
How to run
- Clone the repository
git clone https://huggingface.co/when-my-cat-learn-datasci/datasci-final-project-2024
- Change directory to the project
cd datasci-final-project-2024
- Start with the script
chmod +x start.sh
./start.sh
OR
- Build the docker image
docker compose build
- Start with docker compose
docker compose up
Project Structure
DataGathering
This folder is mainly for collecting data from other sources.
1. GoogleGeocoding
Gather the geolocation of each affiliation name using the Google Geocoding API.
| Directory/File | Description |
|---|---|
| GoogleGeocoding.ipynb | Python Jupyter notebook for gathering the latitude and longitude of each affiliation country |
| geocode_aff_country.csv | Contains the geolocation of each affiliation (aff_country, lat, lon) |
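The lookup that produces `geocode_aff_country.csv` can be sketched roughly as follows. This is a minimal sketch and not the notebook's actual code: the helper names (`build_geocode_url`, `extract_lat_lon`, `to_csv`) are hypothetical, and the sample response is hand-written in the JSON shape returned by the Geocoding API rather than real output. No network call is made here.

```python
import csv
import io
from urllib.parse import urlencode

# Public JSON endpoint of the Google Geocoding API.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Return the request URL for one address lookup (hypothetical helper)."""
    return f"{GEOCODE_ENDPOINT}?{urlencode({'address': address, 'key': api_key})}"


def extract_lat_lon(response: dict):
    """Return (lat, lon) from the first result, or None when nothing matched."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    location = response["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def to_csv(rows) -> str:
    """Serialize rows with the columns listed in the table above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["aff_country", "lat", "lon"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-written response in the API's JSON shape (assumption: not real data).
sample_response = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 1.5, "lng": 2.5}}}],
}
lat, lon = extract_lat_lon(sample_response)
print(to_csv([("Example University, Exampleland", lat, lon)]))
```

In a real run, the URL from `build_geocode_url` would be fetched once per affiliation with a valid API key, and affiliations with no match would be skipped or logged.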
2. ScopusAPI
Query abstract data from Scopus using the Scopus API.
| Directory/File | Description |
|---|---|
| ScopusAPI.ipynb | Python Jupyter notebook for fetching abstract data from the Scopus API |
| example_abs_data.json | Contains example data fetched from the Scopus abstract API |
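A request to the abstract API might be prepared as in the sketch below. The endpoint path and the `X-ELS-APIKey` header reflect Elsevier's Abstract Retrieval API as I understand it; whether the notebook uses this exact endpoint is an assumption, and the function names and the ID in the usage line are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Assumed endpoint of the Scopus Abstract Retrieval API.
ABSTRACT_ENDPOINT = "https://api.elsevier.com/content/abstract/scopus_id"


def build_abstract_request(scopus_id: str, api_key: str):
    """Return (url, headers) for one abstract lookup (hypothetical helper)."""
    url = f"{ABSTRACT_ENDPOINT}/{scopus_id}"
    headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
    return url, headers


def fetch_abstract(scopus_id: str, api_key: str) -> dict:
    """Fetch and decode one abstract record; needs network access and a valid key."""
    url, headers = build_abstract_request(scopus_id, api_key)
    with urlopen(Request(url, headers=headers)) as response:
        return json.load(response)


# Placeholder ID and key, for illustration only; nothing is sent here.
url, headers = build_abstract_request("0000000000", "YOUR_API_KEY")
print(url)
```

Keeping request construction separate from the network call makes it possible to check the URL and headers without an API key.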
3. SubjectCode
Scrape the Scopus subject areas from the web.
| Directory/File | Description |
|---|---|
| SubjectCode.ipynb | Python Jupyter notebook for scraping subject codes from the web |
| scopus_subject_areas.csv | Contains each subject code and its corresponding name (code, name) |
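Once scraped, the code-to-name mapping can be loaded for lookups in later steps. The sketch below assumes the CSV has a header row with exactly the `code` and `name` columns described above; `load_subject_areas` is a hypothetical helper, and the two sample rows are illustrative stand-ins for the real file.

```python
import csv
import io


def load_subject_areas(csv_file) -> dict:
    """Map each subject code to its name from a (code, name) CSV stream."""
    reader = csv.DictReader(csv_file)
    return {row["code"].strip(): row["name"].strip() for row in reader}


# Illustrative rows only; the real file is scopus_subject_areas.csv.
sample = io.StringIO(
    "code,name\n"
    "1000,Multidisciplinary\n"
    "1700,General Computer Science\n"
)
subject_names = load_subject_areas(sample)
print(subject_names.get("1700", "Unknown"))
```

With the real file, the same function would be called as `load_subject_areas(open("scopus_subject_areas.csv", newline="", encoding="utf-8"))`, assuming the file sits in the working directory.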
Pipeline
This folder is for data processing and data cleaning. It contains the following files:
| Directory/File | Description |
|---|---|
| Pipeline.ipynb | Python Jupyter notebook for extracting data from the raw data |
| spark-3.5.1-bin-hadoop3.tgz | Spark distribution used inside Docker to connect to Spark |
Raw and Raw_Extra
These folders contain the raw data and extra raw data used in the project.
Visualization
| Directory/File | Description |
|---|---|
| dockerfile | Dockerfile based on a Python image |
| main.py | Python script used by Streamlit to visualize the data |