F1 project - querying APIs using Python
In my previous article, I gave an overview of my project and the technology choices I made. This article starts the deep-dive series, beginning with querying the API using Python. We will go through the API queries and how I made them modular, so that I could reuse as much code as possible while keeping flexibility in how the data is processed.
We will first go through what I call simple data: data extracted from the API that has a low volume, can be fully refreshed on each run, and does not need any specific configuration for the call.
Then we will go through the more complex datasets I queried, and how I made the calls modular so that it is easy to know what will be called and how it works.
Simple data
For simple data, I identified a few data sources that are small enough to be queried often, and I used a single script to make all of these calls.
The data sources concerned are the following:
drivers: contains data about each driver, mainly descriptive information such as which team he is running for during which GP. This data is mandatory for the visualizations, since it makes them more personal and tells us who the driver behind each number is.
meetings: contains data about meetings. In Formula 1, a meeting is an event, generally a weekend, during which multiple sessions take place (qualifying, races...). In this source you find one row per meeting with information about it: its location, the name of the circuit, the date and more.
sessions: as just explained for meetings, a session is one part of a meeting. On a standard GP weekend you will find several sessions: practice sessions (1, 2, 3), qualifying and the race itself. This source gives details about each session, its date and its type. It will be used extensively to add details and make sure we process the right data.
pit: contains data about pit stops. I added it quite late in my analytics process, when I noticed I would need it for my analysis. If you do not know what a pit stop is, it is simply when a car returns to its garage so that the tires can be changed or the mechanics can fix something. I wanted this data to filter out non-racing data points and to identify when the pit stops happened.
The script itself and the logic behind it are straightforward: I set up some logging for better readability, so that logs are produced while the script runs and could be written out if I later automate it (not configured in the version below), then I query the API and finally write the result to a locally stored JSON file. The only trick is a for loop that goes through all the calls and does them one at a time.
Logging configuration:
Located at the top of the script just below the package imports, this is a block you will find in many of my scripts.
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler()]
)
This sets the format of the log lines that will appear at runtime and provides the basic configuration for the rest of the script.
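As an illustration (the timestamp below is made up), with this configuration a call like the following prints a log line in this shape:
logging.info("Success on the call: drivers")
# 2024-03-02 14:05:31,412 - INFO - Success on the call: drivers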
Get API data:
def get_api_data(api_base_url, api_call, params=None):
    """
    Fetches data from the API.
    Args:
        api_base_url (str): Base URL of the API
        api_call (str): Specific API endpoint
        params (dict): Optional query parameters
    Returns:
        requests.Response: the raw response returned by the API
    """
    full_url = api_base_url + api_call
    result_query = requests.get(full_url, params=params)
    return result_query
This one is again pretty straightforward: first I build the full URL from the base API URL and the specific call I want to make (this will be useful for the loop later on), then I use the requests package, specifically its get function, to fetch the data and return the response.
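As a quick usage sketch, reusing the base URL and one of the endpoints from the main function further down, this is how the function would be called on its own:
import requests  # imported at the top of the full script

response = get_api_data(
    api_base_url='https://api.openf1.org/v1/',
    api_call='drivers'
)
if response.status_code == 200:
    drivers = response.json()  # list of driver records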
Writing to JSON:
def write_to_json(data, base_path, file_name):
    """
    Writes the data to a JSON file.
    Args:
        data : the JSON data to be written
        base_path (str): the path to write the file
        file_name (str): the file name
    """
    full_path = os.path.join(base_path, f"{file_name}.json")
    with open(full_path, 'w') as f:
        json.dump(data, f, indent=4)
    logging.info(f"File saved successfully: {full_path}")
Here the function writes JSON data to a specific directory. We use the json package and its dump function to make sure the JSON is well formatted for further processing, and we add an info-level log containing the full path for readability.
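Chained with the previous function, a minimal usage sketch (paths and endpoint reused from the main function below) could look like this:
import os

save_dir = os.path.join(os.getcwd(), '01_sources/000_initial_extract')
os.makedirs(save_dir, exist_ok=True)

response = get_api_data('https://api.openf1.org/v1/', 'meetings')
if response.status_code == 200:
    write_to_json(data=response.json(), base_path=save_dir, file_name='meetings')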
Main function and the global orchestration:
def main():
    # Use forward slashes for the path
    save_path = '01_sources/000_initial_extract'
    api_base_url = 'https://api.openf1.org/v1/'
    api_call_list = ['drivers', 'sessions', 'meetings', 'pit']
    for call in api_call_list:
        # Set parameters for the sessions endpoint (keep race sessions only)
        params = {'session_type': 'Race'} if call == 'sessions' else None
        # Make the API call to get data
        step_1 = get_api_data(
            api_base_url=api_base_url,
            api_call=call,
            params=params
        )
        # Check if the API call was successful
        if step_1.status_code == 200:
            logging.info(f"Success on the call: {call}")
            step_2 = step_1.json()
            # Build the absolute save path from the current working directory
            save_path_abs = os.path.join(os.getcwd(), save_path)
            os.makedirs(save_path_abs, exist_ok=True)
            # Write the result to a JSON file
            write_to_json(data=step_2, base_path=save_path_abs, file_name=call)
        else:
            logging.error(f"Error on the call: {call}, Status Code: {step_1.status_code}")
This one looks a bit more massive but stays readable and easy enough to understand.
What it does is quite simple: I first set the variables I will need along the process, and I put all the API calls in a list that will be iterated over later on to produce the different calls and files.
Then I loop through the different calls in the list with a simple for loop, which lets me process each call individually. The one exception is the sessions call: I only want to keep the race sessions rather than pulling all qualifying and practice sessions, since that would add a lot of data I do not plan to use in the upcoming analysis.
Then I call the API with the previously defined function and check the result as follows: if the call returns a 200 status code I process the data further, otherwise I log an error and stop processing that dataset. When the call does succeed, I make sure the path I plan to write the files to exists, then use the write function to save the JSON file.
That is the global process for the simple data: call the API, check that the call succeeded and, if so, write the data as a JSON file to a specific repository.
Complex data
What I qualify as complex data is data that needs parameters for each API call and whose volume is a lot bigger. The overall process is the same, except that multiple calls are made per API, driven by several parameters, to make sure we capture all the data.
The complex data covers 4 main APIs that I used:
car_data: contains telemetry data about each car during the events. It is very detailed, and we will use it as the example for the script explanation.
laps: contains data about the laps of each driver along with some aggregated metrics. It will be used to identify which lap each timestamp belongs to.
position: the position data is key, since what is the use of racing if you cannot have positions? Its format is simple: it gives a position for each driver together with the date it starts applying, so if a driver overtakes and goes from second to first, there will be a new row for each of the two drivers involved with its start date. It can grow to many rows for drivers who changed position a lot during the race.
location: contains the car position as x, y and z coordinates. It has not been used in the global process, but it allows drawing the shape of the different circuits and comparing positions.
Generating the URLs
URL generation becomes a lot more complex this time, since the parameters change on each call: instead of querying the API once as I did for the simple data, I make multiple calls. This is a method I generally prefer, because it gives me a lot more control over the data I extract, and in case of a crash it lets me resume where I was instead of re-querying the whole API. I then put all the generated URLs into a DataFrame so that they can be queried later on.
In the case of car data, I had to split the queries quite a lot. Querying car_data in one go would lead to an absurd amount of data and a very long process, both on the API server side and on my end, so I split it into many smaller calls.
The first split is by session, then by driver (for the drivers who ran that session), which gives 20 calls per session. That was still not enough, so I added another split by speed; after some trial and error I settled on speed ranges of 100 km/h, which, for a given session and driver, leads to calls looking like this:
https://api.openf1.org/v1/car_data?driver_number=4&session_key=9159&speed>=0&speed<100
After generating all these URLs, they end up stored (one list per session) in a DataFrame that will be used moving forward.
def generate_urls_by_session(
    base_url: str,
    min_start: int,
    step_size: int,
    df: pd.DataFrame,
    n_calls: int
) -> pd.DataFrame:
    """
    Generates a DataFrame of session keys and corresponding API call URLs based on speed ranges.
    Parameters:
        base_url (str): The base URL for the API.
        min_start (int): Starting minimum speed.
        step_size (int): Incremental step size for speed ranges.
        df (pd.DataFrame): DataFrame containing 'session_key' and 'driver_number' columns.
        n_calls (int): Number of API call URLs to generate per driver and session.
    Returns:
        pd.DataFrame: DataFrame with columns ['session_key', 'urls'] where 'urls' is a list of generated URLs.
    """
    session_urls: Dict[str, List[str]] = {}
    for _, row in df.iterrows():
        session_key = row['session_key']
        driver_number = row['driver_number']
        urls = []
        for i in range(n_calls):
            min_value = min_start + i * step_size
            max_value = min_value + step_size
            url = (f"{base_url}?speed>={min_value}&speed<{max_value}"
                   f"&session_key={session_key}&driver_number={driver_number}")
            urls.append(url)
        session_urls.setdefault(session_key, []).extend(urls)
    return pd.DataFrame(list(session_urls.items()), columns=['session_key', 'urls'])
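To make the output concrete, here is a small usage sketch with a hypothetical one-row DataFrame (driver 4 in session 9159, matching the example URL above):
import pandas as pd

df_example = pd.DataFrame([{'session_key': 9159, 'driver_number': 4}])
session_df = generate_urls_by_session(
    base_url="https://api.openf1.org/v1/car_data",
    min_start=0,
    step_size=100,
    df=df_example,
    n_calls=4
)
# session_df holds one row for session 9159 with four URLs covering
# speed>=0&speed<100 up to speed>=300&speed<400
print(session_df.loc[0, 'urls'][0])
# https://api.openf1.org/v1/car_data?speed>=0&speed<100&session_key=9159&driver_number=4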
Checking existing files
This bit of code is very important for the rest of the script: it scans the existing files (the files already processed) and the result is later compared to the list of URLs generated in the previous step. Another solution would have been to log the processed files into a file, but scanning the directory gives more flexibility, ensures a file will not be overwritten, and removes a point of failure (a tracking file that could be manually edited or go out of sync).
The code simply scans all the files present in the directory and strips the naming convention and file extension to keep only the session_key embedded in each saved file's name.
def scan_output_repository(save_directory: Path) -> Set[str]:
    """
    Scans the directory for existing session JSON files and returns a set of session keys.
    Parameters:
        save_directory (Path): Path to the directory where JSON files are stored.
    Returns:
        Set[str]: Set of session keys for which JSON files already exist.
    """
    if not save_directory.exists():
        save_directory.mkdir(parents=True, exist_ok=True)
    existing_sessions = {
        f.stem.replace("car_data_", "")
        for f in save_directory.glob("*.json")
    }
    return existing_sessions
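For instance, if the output directory contained car_data_9159.json and car_data_9161.json (hypothetical file names, following the car_data_<session_key>.json convention used by the saving function below), the scan would return the corresponding session keys:
from pathlib import Path

existing = scan_output_repository(Path('./01_sources/001_initial_extract_advanced/car_data'))
print(existing)
# e.g. {'9159', '9161'}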
Querying and saving
The next step in my data processing is the function that queries the API and saves the data to a repository. Here I start reusing previously written code: I call the scan function from the previous step, compare its result with the generated URLs, and log every session that has already been processed (this step is optional, since those sessions could simply be skipped silently, but I prefer a log that is too verbose over one that is not verbose enough).
Once the comparison is done, the script queries the API: for each URL it checks whether the call succeeded, appends the data if it did, and finally writes the results to multiple files (one per session).
Why create one file per session, you may wonder?
My answer is simple: it is a lot easier to handle in case of an error. Suppose I find a data quality problem in session 7779; I can simply delete that file and the query will be relaunched for this specific session only (see the short recovery sketch after the function below). To my mind this is a best practice for ease of use and error management.
def query_and_save(session_df: pd.DataFrame, save_directory: Path) -> None:
    """
    Queries the API for each session and saves the response data to JSON files.
    Parameters:
        session_df (pd.DataFrame): DataFrame containing 'session_key' and 'urls'.
        save_directory (Path): Directory to save the JSON output files.
    """
    existing_sessions = scan_output_repository(save_directory)
    for _, row in session_df.iterrows():
        session_key = row['session_key']
        if str(session_key) in existing_sessions:
            logging.info(f"Session {session_key} already exists. Skipping...")
            continue
        urls: List[str] = row['urls']
        session_data: List[dict] = []  # Collect data for current session
        logging.info(f"--- Querying session {session_key} ---")
        for url in urls:
            try:
                response = requests.get(url)
                if response.status_code == 200:
                    data = response.json()
                    session_data.append(data)
                else:
                    logging.warning(f"Failed to query URL: {url}, Status Code: {response.status_code}")
            except Exception as e:
                logging.error(f"Error querying URL {url}: {e}")
        session_file = save_directory / f"car_data_{session_key}.json"
        with session_file.open('w', encoding='utf-8') as f:
            json.dump(session_data, f, indent=4)
        logging.info(f"Data for session {session_key} saved to {session_file}")
Main function
All my subroutines are now defined; the only thing left is the main function.
To summarize it simply, my main function first defines the paths used in the process, then builds the list of sessions and drivers I want to extract, generates the URLs from it, queries those URLs and finally saves the results to individual files. In between, a few parameters are set to feed the functions built previously.
def main() -> None:
    # Define paths using pathlib for better path handling
    data_path = Path('./01_sources/000_initial_extract')
    output_path = Path('./01_sources/001_initial_extract_advanced/car_data')
    save_path_abs = output_path.resolve()
    # Load driver data and session data
    driver_file = data_path / 'drivers.json'
    session_file = data_path / 'sessions.json'
    df_drivers = pd.read_json(driver_file)[['driver_number', 'session_key']]
    df_sessions = pd.read_json(session_file)[['session_key']]
    # Merge driver and session data
    df_driver_sessions = df_drivers.merge(df_sessions, on='session_key', how='inner').drop_duplicates()
    # Parameters for URL generation
    base_url = "https://api.openf1.org/v1/car_data"
    min_start = 0    # Starting minimum speed
    step_size = 100  # Incremental step for each speed range
    n_calls = 4      # Number of speed-range calls per driver and session
    logging.info("----- Generating URLs for API calls -----")
    session_df = generate_urls_by_session(base_url, min_start, step_size, df_driver_sessions, n_calls)
    logging.info(f"Sessions to process: {len(session_df)}")
    logging.info("----- Querying the API and saving the files -----")
    query_and_save(session_df, save_path_abs)
    logging.info("------ End of query ------")
Conclusion and moving forward
This was an example of the API querying logic I used on this project. I tried to build something as robust as possible while making sure no query runs indefinitely. The main thing to remember is that before querying an API you should think about the volume you want to extract: if it is small, the query can stay straightforward, but if the volume is massive you may prefer preparing your calls so that the API is queried multiple times through smaller sub-queries, which helps preserve data quality all along the process.
The next article will cover loading the data into the database, plus some preparation steps for cleansing it.
If you want to find the previous articles: