F1 project - querying APIs using Python
In my previous article, I gave an overview of my project and the technology choices I made. This article starts the deep-dive series, beginning with querying the API using Python. We will go through the API queries and how I made them modular, so that I could reuse as much code as possible while keeping flexibility in how the data is processed.
We will first go through what I call simple data: data extracted from the API that has a low volume, can be fully refreshed on each run, and does not need any specific configuration for the call.
Then we will go through the more complex datasets I queried, and how I made the calls modular so that it is easy to know what will be called and how it works.
Simple data
For simple data, I identified a few data sources that are small enough to be queried often, and I used a single script to make all of these calls.
The data sources concerned are the following:
drivers: contains data about each driver, mainly descriptive information such as which team he is running for during which GP. This data is mandatory for the visualizations, since it makes them more personal and tells us who the driver behind each number is.
meetings: contains data about meetings. In Formula 1, a meeting is an event, generally a weekend, during which multiple sessions take place (qualifying, races...). In this source you find one row per meeting with information about it: its location, the name of the circuit, the date and more.
sessions: as just explained for meetings, a session is one part of a meeting. On a standard GP weekend you will find several sessions: practice sessions (1, 2, 3), qualifying and the race itself. This source gives details about each session, its date and its type. It will be used extensively to add details and make sure we process the right data.
pit: contains data about pit stops. I added it quite late in my analytics process, when I noticed I would need it for my analysis. If you do not know what a pit stop is, it is simply when a car returns to its garage so that the tires can be changed or the mechanics can fix something. I wanted this data to filter out non-racing data points and to identify when the pit stops happened.
The script itself and the logic behind it are straightforward: I set up some logging for better readability, so that logs are produced while the script runs and could be written out if I later automate it (not configured in the version below), then I query the API and finally write the result to a locally stored JSON file. The only trick is a for loop that goes through all the calls and does them one at a time.
Logging configuration:
Located at the top of the script just below the package imports, this is a block you will find in many of my scripts.
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler()]
)
This sets the format of the log lines that will appear at runtime and provides the basic configuration for the rest of the script.
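As an illustration (the timestamp below is made up), with this configuration a call like the following prints a log line in this shape:
logging.info("Success on the call: drivers")
# 2024-03-02 14:05:31,412 - INFO - Success on the call: drivers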
Get API data:
def get_api_data(api_base_url, api_call, params=None):
    """
    Fetches data from the API.
    Args:
        api_base_url (str): Base URL of the API
        api_call (str): Specific API endpoint
        params (dict): Optional query parameters
    Returns:
        requests.Response: the raw response returned by the API
    """
    full_url = api_base_url + api_call
    result_query = requests.get(full_url, params=params)
    return result_query
This one is again pretty straightforward: first I build the full URL from the base API URL and the specific call I want to make (this will be useful for the loop later on), then I use the requests package, specifically its get function, to fetch the data and return the response.
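As a quick usage sketch, reusing the base URL and one of the endpoints from the main function further down, this is how the function would be called on its own:
import requests  # imported at the top of the full script

response = get_api_data(
    api_base_url='https://api.openf1.org/v1/',
    api_call='drivers'
)
if response.status_code == 200:
    drivers = response.json()  # list of driver records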
Writing to JSON:
def write_to_json(data, base_path, file_name):
    """
    Writes the data to a JSON file.
    Args:
        data : the JSON data to be written
        base_path (str): the path to write the file
        file_name (str): the file name
    """
    full_path = os.path.join(base_path, f"{file_name}.json")
    with open(full_path, 'w') as f:
        json.dump(data, f, indent=4)
    logging.info(f"File saved successfully: {full_path}")
Here the function writes JSON data to a specific directory. We use the json package and its dump function to make sure the JSON is well formatted for further processing, and we add an info-level log containing the full path for readability.
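Chained with the previous function, a minimal usage sketch (paths and endpoint reused from the main function below) could look like this:
import os

save_dir = os.path.join(os.getcwd(), '01_sources/000_initial_extract')
os.makedirs(save_dir, exist_ok=True)

response = get_api_data('https://api.openf1.org/v1/', 'meetings')
if response.status_code == 200:
    write_to_json(data=response.json(), base_path=save_dir, file_name='meetings')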
Main function and the global orchestration:
def main():
    # Use forward slashes for the path
    save_path = '01_sources/000_initial_extract'
    api_base_url = 'https://api.openf1.org/v1/'
    api_call_list = ['drivers', 'sessions', 'meetings', 'pit']
    for call in api_call_list:
        # Set parameters for the sessions endpoint (keep race sessions only)
        params = {'session_type': 'Race'} if call == 'sessions' else None
        # Make the API call to get data
        step_1 = get_api_data(
            api_base_url=api_base_url,
            api_call=call,
            params=params
        )
        # Check if the API call was successful
        if step_1.status_code == 200:
            logging.info(f"Success on the call: {call}")
            step_2 = step_1.json()
            # Build the absolute save path from the current working directory
            save_path_abs = os.path.join(os.getcwd(), save_path)
            os.makedirs(save_path_abs, exist_ok=True)
            # Write the result to a JSON file
            write_to_json(data=step_2, base_path=save_path_abs, file_name=call)
        else:
            logging.error(f"Error on the call: {call}, Status Code: {step_1.status_code}")
This one looks a bit more massive but stays readable and easy enough to understand.
What it does is quite simple: I first set the variables I will need along the process, and I put all the API calls in a list that will be iterated over later on to produce the different calls and files.
Then I loop through the different calls in the list with a simple for loop, which lets me process each call individually. The one exception is the sessions call: I only want to keep the race sessions rather than pulling all qualifying and practice sessions, since that would add a lot of data I do not plan to use in the upcoming analysis.
Then I call the API with the previously defined function and check the result as follows: if the call returns a 200 status code I process the data further, otherwise I log an error and stop processing that dataset. When the call does succeed, I make sure the path I plan to write the files to exists, then use the write function to save the JSON file.
That is the global process for the simple data: call the API, check that the call succeeded and, if so, write the data as a JSON file to a specific repository.
Complex data
What I qualify as complex data is data that needs parameters for each API call and whose volume is a lot bigger. The overall process is the same, except that multiple calls are made per API, driven by several parameters, to make sure we capture all the data.
The complex data covers 4 main APIs that I used:
car_data: contains telemetry data about each car during the events. It is very detailed, and we will use it as the example for the script explanation.
laps: contains data about the laps of each driver along with some aggregated metrics. It will be used to identify which lap each timestamp belongs to.
position: the position data is key, since what is the use of racing if you cannot have positions? Its format is simple: it gives a position for each driver together with the date it starts applying, so if a driver overtakes and goes from second to first, there will be a new row for each of the two drivers involved with its start date. It can grow to many rows for drivers who changed position a lot during the race.
location: contains the car position as x, y and z coordinates. It has not been used in the global process, but it allows drawing the shape of the different circuits and comparing positions.
Generating the URLs
URL generation becomes a lot more complex this time, since the parameters change on each call: instead of querying the API once as I did for the simple data, I make multiple calls. This is a method I generally prefer, because it gives me a lot more control over the data I extract, and in case of a crash it lets me resume where I was instead of re-querying the whole API. I then put all the generated URLs into a DataFrame so that they can be queried later on.
In the case of car data, I had to split the queries quite a lot. Querying car_data in one go would lead to an absurd amount of data and a very long process, both on the API server side and on my end, so I split it into many smaller calls.
The first split is by session, then by driver (for the drivers who ran that session), which gives 20 calls per session. That was still not enough, so I added another split by speed; after some trial and error I settled on speed ranges of 100 km/h, which, for a given session and driver, leads to calls looking like this:
https://api.openf1.org/v1/car_data?driver_number=4&session_key=9159&speed>=0&speed<100
After generating all these URLs, they end up stored (one list per session) in a DataFrame that will be used moving forward.
def generate_urls_by_session(
    base_url: str,
    min_start: int,
    step_size: int,
    df: pd.DataFrame,
    n_calls: int
) -> pd.DataFrame:
    """
    Generates a DataFrame of session keys and corresponding API call URLs based on speed ranges.
    Parameters:
        base_url (str): The base URL for the API.
        min_start (int): Starting minimum speed.
        step_size (int): Incremental step size for speed ranges.
        df (pd.DataFrame): DataFrame containing 'session_key' and 'driver_number' columns.
        n_calls (int): Number of API call URLs to generate per driver and session.
    Returns:
        pd.DataFrame: DataFrame with columns ['session_key', 'urls'] where 'urls' is a list of generated URLs.
    """
    session_urls: Dict[str, List[str]] = {}
    for _, row in df.iterrows():
        session_key = row['session_key']
        driver_number = row['driver_number']
        urls = []
        for i in range(n_calls):
            min_value = min_start + i * step_size
            max_value = min_value + step_size
            url = (f"{base_url}?speed>={min_value}&speed<{max_value}"
                   f"&session_key={session_key}&driver_number={driver_number}")
            urls.append(url)
        session_urls.setdefault(session_key, []).extend(urls)
    return pd.DataFrame(list(session_urls.items()), columns=['session_key', 'urls'])
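To make the output concrete, here is a small usage sketch with a hypothetical one-row DataFrame (driver 4 in session 9159, matching the example URL above):
import pandas as pd

df_example = pd.DataFrame([{'session_key': 9159, 'driver_number': 4}])
session_df = generate_urls_by_session(
    base_url="https://api.openf1.org/v1/car_data",
    min_start=0,
    step_size=100,
    df=df_example,
    n_calls=4
)
# session_df holds one row for session 9159 with four URLs covering
# speed>=0&speed<100 up to speed>=300&speed<400
print(session_df.loc[0, 'urls'][0])
# https://api.openf1.org/v1/car_data?speed>=0&speed<100&session_key=9159&driver_number=4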
Checking existing files
This bit of code is very important for the rest of the script: it scans the existing files (the files already processed) and the result is later compared to the list of URLs generated in the previous step. Another solution would have been to log the processed files into a file, but scanning the directory gives more flexibility, ensures a file will not be overwritten, and removes a point of failure (a tracking file that could be manually edited or go out of sync).
The code simply scans all the files present in the directory and strips the naming convention and file extension to keep only the session_key embedded in each saved file's name.
def scan_output_repository(save_directory: Path) -> Set[str]:
    """
    Scans the directory for existing session JSON files and returns a set of session keys.
    Parameters:
        save_directory (Path): Path to the directory where JSON files are stored.
    Returns:
        Set[str]: Set of session keys for which JSON files already exist.
    """
    if not save_directory.exists():
        save_directory.mkdir(parents=True, exist_ok=True)
    existing_sessions = {
        f.stem.replace("car_data_", "")
        for f in save_directory.glob("*.json")
    }
    return existing_sessions
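For instance, if the output directory contained car_data_9159.json and car_data_9161.json (hypothetical file names, following the car_data_<session_key>.json convention used by the saving function below), the scan would return the corresponding session keys:
from pathlib import Path

existing = scan_output_repository(Path('./01_sources/001_initial_extract_advanced/car_data'))
print(existing)
# e.g. {'9159', '9161'}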
Querying and saving
The next step in my data processing is the function that queries the API and saves the data to a repository. Here I start reusing previously written code: I call the scan function from the previous step, compare its result with the generated URLs, and log every session that has already been processed (this step is optional, since those sessions could simply be skipped silently, but I prefer a log that is too verbose over one that is not verbose enough).
Once the comparison is done, the script queries the API: for each URL it checks whether the call succeeded, appends the data if it did, and finally writes the results to multiple files (one per session).
Why create one file per session, you may wonder?
My answer is simple: it is a lot easier to handle in case of an error. Suppose I find a data quality problem in session 7779; I can simply delete that file and the query will be relaunched for this specific session only (see the short recovery sketch after the function below). To my mind this is a best practice for ease of use and error management.
def query_and_save(session_df: pd.DataFrame, save_directory: Path) -> None:
    """
    Queries the API for each session and saves the response data to JSON files.
    Parameters:
        session_df (pd.DataFrame): DataFrame containing 'session_key' and 'urls'.
        save_directory (Path): Directory to save the JSON output files.
    """
    existing_sessions = scan_output_repository(save_directory)
    for _, row in session_df.iterrows():
        session_key = row['session_key']
        if str(session_key) in existing_sessions:
            logging.info(f"Session {session_key} already exists. Skipping...")
            continue
        urls: List[str] = row['urls']
        session_data: List[dict] = []  # Collect data for current session
        logging.info(f"--- Querying session {session_key} ---")
        for url in urls:
            try:
                response = requests.get(url)
                if response.status_code == 200:
                    data = response.json()
                    session_data.append(data)
                else:
                    logging.warning(f"Failed to query URL: {url}, Status Code: {response.status_code}")
            except Exception as e:
                logging.error(f"Error querying URL {url}: {e}")
        session_file = save_directory / f"car_data_{session_key}.json"
        with session_file.open('w', encoding='utf-8') as f:
            json.dump(session_data, f, indent=4)
        logging.info(f"Data for session {session_key} saved to {session_file}")
Main function
All my subroutines are now defined; the only thing left is the main function.
To summarize it simply, my main function first defines the paths used in the process, then builds the list of sessions and drivers I want to extract, generates the URLs from it, queries those URLs and finally saves the results to individual files. In between, a few parameters are set to feed the functions built previously.
def main() -> None:
    # Define paths using pathlib for better path handling
    data_path = Path('./01_sources/000_initial_extract')
    output_path = Path('./01_sources/001_initial_extract_advanced/car_data')
    save_path_abs = output_path.resolve()
    # Load driver data and session data
    driver_file = data_path / 'drivers.json'
    session_file = data_path / 'sessions.json'
    df_drivers = pd.read_json(driver_file)[['driver_number', 'session_key']]
    df_sessions = pd.read_json(session_file)[['session_key']]
    # Merge driver and session data
    df_driver_sessions = df_drivers.merge(df_sessions, on='session_key', how='inner').drop_duplicates()
    # Parameters for URL generation
    base_url = "https://api.openf1.org/v1/car_data"
    min_start = 0    # Starting minimum speed
    step_size = 100  # Incremental step for each speed range
    n_calls = 4      # Number of speed-range calls per driver and session
    logging.info("----- Generating URLs for API calls -----")
    session_df = generate_urls_by_session(base_url, min_start, step_size, df_driver_sessions, n_calls)
    logging.info(f"Sessions to process: {len(session_df)}")
    logging.info("----- Querying the API and saving the files -----")
    query_and_save(session_df, save_path_abs)
    logging.info("------ End of query ------")
Conclusion and moving forward
This was an example of the API querying logic I used on this project. I tried to build something as robust as possible while making sure no query runs indefinitely. The main thing to remember is that before querying an API you should think about the volume you want to extract: if it is small, the query can stay straightforward, but if the volume is massive you may prefer preparing your calls so that the API is queried multiple times through smaller sub-queries, which helps preserve data quality all along the process.
The next article will cover loading the data into the database, plus some preparation steps for cleansing it.
If you want to find the previous articles: