Leveraging APIs for data sharing
JavaScript Object Notation (JSON Format)
Today, JSON is the de facto standard for exchanging data between web and mobile clients and back-end services. ref
When you’re writing frontend code in JavaScript, getting JSON data back makes it easy to load that data into an object tree and work with it. JSON also formats data more succinctly, which saves bandwidth and improves response times when sending messages back and forth to a server.
In a world of APIs, cloud computing, and ever-growing data, JSON has a big role to play in greasing the wheels of a modern, open web. ref
REST APIs
Over the course of the ’00s, another Web services technology, called Representational State Transfer, or REST, began to overtake [all other tools] for the purpose of transferring data. One of the big advantages of programming using REST APIs is that you can use multiple data formats — not just XML, but JSON and HTML as well. As web developers came to prefer JSON over XML, so did they favor REST over SOAP. As Kostyantyn Kharchenko put it on the Svitla blog, “In many ways, the success of REST is due to the JSON format because of its easy use on various platforms.” ref
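To make the REST idea concrete, here is a minimal sketch of calling a REST endpoint and getting JSON back. It assumes the public Rick and Morty API (used later in this section); the function name `get_character` is our own choice, and the standard library `urllib` is used so no extra packages are needed (the third-party `requests` package is a common alternative).

```python
import json
from urllib.request import urlopen

def get_character(character_id: int) -> dict:
    """Fetch one character from the Rick and Morty REST API as a dict."""
    # Each resource has its own URL; the server responds with JSON.
    url = f"https://rickandmortyapi.com/api/character/{character_id}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)  # parse the JSON body straight into a dict
```

Calling `get_character(1)` would issue a GET request and return a dictionary of fields chosen by the server, which is the key contrast with GraphQL below.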
GraphQL APIs
GraphQL, on the other hand, is a query language that gives the client the power to request exactly the fields and elements it wants to retrieve from the server. It is, loosely speaking, a kind of SQL for the Web. The client therefore has to know the available schema beforehand, which couples it to the server to some degree. (ref and another reference)
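The client-specified-fields idea can be sketched as follows, again assuming the Rick and Morty API's public GraphQL endpoint. The query string and the `run_query` helper name are our own; note that the client names the exact fields (`name`, `status`) it wants back.

```python
import json
from urllib.request import Request, urlopen

# The client declares exactly which fields it wants returned.
QUERY = """
{
  character(id: 1) {
    name
    status
  }
}
"""

def run_query(query: str, url: str = "https://rickandmortyapi.com/graphql") -> dict:
    """POST a GraphQL query as JSON and return the parsed JSON response."""
    payload = json.dumps({"query": query}).encode("utf-8")
    req = Request(url, data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

Unlike the REST sketch above, the response contains only the requested fields, but writing the query requires knowing the schema in advance.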
Using keys
Local Computer
The load_dotenv() function from the dotenv package will look for a file named .env in the current directory and add all its variable definitions to the os.environ dictionary. If a .env file is not found in the current directory, then the parent directory is searched for it. The search keeps going up the directory hierarchy until a .env file is found or the top-level directory is reached.
In your .env file, you store your keys as the variable name, an =, and the key, with no spaces or quotes.
SAFEGRAPH_KEY=LNNmQ
GITHUB_PAT=ghp_1y
You can import your key into your environment using the following Python code chunk. Other users would then need only the same dotenv package and process for your script to work, without anyone needing to store their key in their production script.
import os
from dotenv import load_dotenv
load_dotenv()
sfkey = os.environ.get("SAFEGRAPH_KEY")
Databricks
The recommended option is dbutils.secrets. However, the Community Edition doesn't allow its use. On Community Edition, you create your own compute each time, and you can store your keys under the Environment variables in the Spark tab, in the same format as the .env example above.
On a group compute environment, you don't want to store your keys in the Environment variables, because others with access to the compute can retrieve them. When I don't have access to the Databricks CLI, I create a second notebook called keys that stores my keys using a standard Python chunk to create the variables.
SAFEGRAPH_KEY = "LNNmQ"
GITHUB_PAT = "ghp_1y"
print("SAFEGRAPH_KEY and GITHUB_PAT created")
Then, I source this file using the following command in my working notebook:
%run ./keys
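For reference, the recommended dbutils.secrets approach on full Databricks looks like the sketch below. The scope and key names are assumptions; `dbutils` is predefined inside Databricks notebooks, so the guard lets the snippet run (harmlessly) elsewhere.

```python
# On full Databricks, dbutils is predefined in every notebook; secrets are
# stored in named scopes and never echoed in plain text.
try:
    sfkey = dbutils.secrets.get(scope="my-scope", key="SAFEGRAPH_KEY")  # noqa: F821
except NameError:
    # Not running inside a Databricks notebook, where dbutils exists.
    sfkey = None
```

Unlike the keys-notebook workaround, secrets retrieved this way are redacted in notebook output, which is why it is the recommended option when available.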
Parsing JSON
JSON is simply plain text that follows a specific format. Python handles JSON data by representing JSON objects as dictionaries and JSON arrays as lists.
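A tiny example makes the mapping concrete (the record below is made up for illustration):

```python
import json

raw = '{"name": "Rick Sanchez", "episode": ["S01E01", "S01E02"]}'
record = json.loads(raw)

# The JSON object becomes a dict; the JSON array becomes a list.
print(type(record))       # <class 'dict'>
print(record["episode"])  # ['S01E01', 'S01E02']
```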
Using Pandas json_normalize()
Normalize semi-structured JSON data into a flat table. reference
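A minimal sketch of what "flattening" means here, using made-up records shaped like API output: nested objects become dot-separated columns.

```python
import pandas as pd

# Hypothetical semi-structured records with a nested 'origin' object.
records = [
    {"name": "Rick",  "origin": {"name": "Earth (C-137)"}},
    {"name": "Morty", "origin": {"name": "Earth (Replacement)"}},
]

df = pd.json_normalize(records)
print(df.columns.tolist())  # ['name', 'origin.name']
```

The nested `origin.name` field becomes an ordinary flat column, ready for normal DataFrame operations.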
Using Pyspark with json (.jsonl)
The Spark 3.5.0 Documentation on JSON Files documents how to handle semi-structured JSON data.
Thought questions
Let's parse the data into a table.
- Use the `graphql_noauth.py` script and extend the current `query()` to include `image`, `origin`, and `episode`.
- Convert the episode field from key/value (wide format) to a JSON array (long format) with the keys `number`, `name`, and `air_date` and their associated arrays `[]` as the values.
- Use `pd.json_normalize()` to build your table. What do you notice?
- Save your new `.jsonl` output to the jsonlines format using the jsonlines Python package. Your file should look like output.jsonl
- Read the `.jsonl` file with `pd.read_json()` and the correct arguments.
.jsonl and array structure
Notice that the API returns the key/value pair for episode as
'episode': [{'episode': 'S01E10',
             'name': 'Close Rick-counters of the Rick Kind',
             'air_date': 'April 7, 2014'},
            {'episode': 'S03E07',
             'name': 'The Ricklantis Mixup',
             'air_date': 'September 10, 2017'}]
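One way to unpack such an array into one row per episode is `pd.json_normalize()` with `record_path`; the sketch below uses a made-up single-character record shaped like the response above, and the `ep_` prefix is our own choice.

```python
import pandas as pd

# Hypothetical record shaped like the API response shown above.
record = {
    "name": "Rick Sanchez",
    "episode": [
        {"episode": "S01E10", "name": "Close Rick-counters of the Rick Kind",
         "air_date": "April 7, 2014"},
        {"episode": "S03E07", "name": "The Ricklantis Mixup",
         "air_date": "September 10, 2017"},
    ],
}

# record_path walks into the episode array (one row per episode, long format);
# meta carries the character-level name onto every row; record_prefix avoids a
# column-name clash between the two 'name' keys.
long_df = pd.json_normalize(record, record_path="episode",
                            meta="name", record_prefix="ep_")
print(long_df)
```

Each element of the array becomes its own row, which is exactly the wide-to-long reshaping the thought questions ask about.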