Reverse-Engineering Screaming Frog: What I Learned Building a Python API Around Crawl Data

For years, my workflow with Screaming Frog looked the same:

  • Run a crawl.
  • Open the GUI.
  • Export CSVs.
  • Clean them.
  • Then finally start analyzing.

It worked. But it was deeply manual. Last month I started asking a simple question:

What if crawl data didn’t have to leave code at all?

This article is a summary of what I learned during the first month of building a Python library to read and automate Screaming Frog crawl data programmatically.

Not a launch post. Just a technical breakdown of the journey so far.

1. The Real Limitation Wasn’t Configuration — It Was Output

I had already built sf-config-tool, which lets you configure Screaming Frog crawls programmatically. That solved half the problem. But even using the CLI, the only way to extract data was through predefined exports: tab/filter combinations as flat CSV files.

  • No ad-hoc queries.
  • No direct access to the link graph.
  • Every new question meant another export.

That’s when I looked at the .dbseospider file format more closely.

2. What’s Inside a .dbseospider File?

A .dbseospider file is essentially:

  • A ZIP archive
  • Containing an Apache Derby database

That changes everything. Instead of exporting CSVs, you can connect directly to the database from Python.
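
Here is a minimal sketch of that idea, assuming the archive unpacks to a Derby database directory at its root and that a Derby JDBC driver jar is available locally (file paths are illustrative, and this is not the library's final API):

    import zipfile
    import jaydebeapi  # JDBC bridge for Python (pip install jaydebeapi)

    # A .dbseospider file is a ZIP; extract it to get the Derby database directory.
    with zipfile.ZipFile("crawl.dbseospider") as archive:
        archive.extractall("crawl_db")

    # Connect with Derby's embedded JDBC driver. If the database sits in a
    # subdirectory of the extracted archive, adjust the jdbc:derby: path accordingly.
    conn = jaydebeapi.connect(
        "org.apache.derby.jdbc.EmbeddedDriver",
        "jdbc:derby:crawl_db",
        jars="/path/to/derby.jar",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM APP.URLS")
    print("Crawled URLs:", cursor.fetchone()[0])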

Inside that database:

  • APP.URLS contains every crawled page (200+ columns)
  • APP.LINKS stores the full link graph
  • Duplicate tracking
  • PageSpeed data
  • Accessibility results
  • Complete HTTP headers
  • Dozens of meta tag slots
  • Lighthouse JSON blobs

Even better, the Derby database contains fields that are not surfaced in the default GUI views or exports. Accessing it directly was the first breakthrough.

3. Raw Data Is Not Usable Data

Once the database was accessible, a new problem appeared. The internal column names are not the ones users recognize.

Instead of:

  • Address
  • Status Code
  • Title 1

You get:

  • ENCODED_URL
  • RESPONSE_CODE
  • TITLE_1

To make this usable, I exported every possible tab and filter from Screaming Frog (628 CSV files) to understand how the internal schema maps to the user-facing labels. This mapping layer is now what allows querying crawl data in Python with familiar column names.
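
As an illustration, that mapping layer boils down to a dictionary plus a small query helper. The snippet below is a simplified sketch with only three entries, not the library's actual API:

    # Friendly GUI label -> internal Derby column name (tiny excerpt of the full map).
    COLUMN_MAP = {
        "Address": "ENCODED_URL",
        "Status Code": "RESPONSE_CODE",
        "Title 1": "TITLE_1",
    }

    def select_columns(cursor, labels, table="APP.URLS"):
        """Query Derby using the labels users already know from the GUI."""
        internal = [COLUMN_MAP[label] for label in labels]
        cursor.execute(f"SELECT {', '.join(internal)} FROM {table}")
        return [dict(zip(labels, row)) for row in cursor.fetchall()]

    # Reusing the cursor from the earlier connection sketch:
    rows = select_columns(cursor, ["Address", "Status Code", "Title 1"])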

4. The Link Graph Is the Real Power

One of the most powerful discoveries was the internal link graph. Screaming Frog stores every relationship:

  • Source URL
  • Destination URL
  • Anchor text
  • Link type
  • Follow/nofollow

In the GUI, you explore this manually. In code, it becomes a graph you can traverse.

That enables:

  • Orphan detection
  • Internal link audits
  • Anchor distribution analysis
  • Redirect chain detection
  • Canonical chain detection

All scriptable. All reproducible.
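
For example, a first pass at orphan detection is a couple of set operations. The column names below are assumptions based on the schema described above, not verified internals:

    # URLs that receive at least one internal link.
    cursor.execute("SELECT DISTINCT TARGET_URL FROM APP.LINKS")  # assumed column name
    linked = {row[0] for row in cursor.fetchall()}

    # URLs that were crawled and returned 200.
    cursor.execute("SELECT ENCODED_URL FROM APP.URLS WHERE RESPONSE_CODE = 200")
    crawled = {row[0] for row in cursor.fetchall()}

    # Crawled (e.g. discovered via the sitemap) but never linked to internally.
    orphan_candidates = crawled - linked
    print(f"{len(orphan_candidates)} orphan candidates")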

5. Computed Fields Required Rerunning Calculations

Not all values in the GUI are stored directly. “Indexability,” for example, is computed from:

  • robots.txt status
  • multiple meta robots tags
  • X-Robots-Tag headers
  • other internal flags

To reproduce GUI-consistent results, I had to reconstruct Screaming Frog’s decision logic and replicate it in SQL and Python. Accessing raw data isn’t enough; you need to rebuild the tool’s internal business logic.
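
To give a feel for what that means, here is a deliberately simplified sketch of an indexability check. The flag and column names are assumptions, and Screaming Frog’s real decision tree covers more cases:

    def is_indexable(row: dict) -> bool:
        """Simplified indexability check over one APP.URLS row (illustrative only)."""
        if row.get("RESPONSE_CODE") != 200:
            return False
        if row.get("BLOCKED_BY_ROBOTS_TXT"):           # hypothetical robots.txt flag
            return False
        robots_directives = " ".join([
            str(row.get("META_ROBOTS_1") or ""),
            str(row.get("X_ROBOTS_TAG_1") or ""),      # hypothetical header column
        ]).lower()
        if "noindex" in robots_directives:
            return False
        return True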

6. The Format Problem

Screaming Frog has three formats:

  • .seospider (serialized Java objects, not queryable)
  • CSV exports (flat, incomplete)
  • .dbseospider (Derby database, fully queryable)

The CLI can export CSVs and save .seospider, but not .dbseospider directly. That’s a major constraint.

The workaround:

  1. Load .seospider via CLI in DB storage mode
  2. Let Screaming Frog generate the Derby database
  3. Package that directory into a .dbseospider file

Now any crawl can become a portable, queryable database.
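
Step 3 is the only part that happens outside Screaming Frog, and it is essentially just re-zipping. A minimal sketch, assuming the .dbseospider layout simply mirrors the Derby database directory (worth checking against a GUI-saved file):

    import shutil

    # Zip the Derby database directory that Screaming Frog generated...
    shutil.make_archive("converted_crawl", "zip", "path/to/generated_derby_db")
    # ...then rename the archive so it carries the .dbseospider extension.
    shutil.move("converted_crawl.zip", "converted_crawl.dbseospider")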

7. Full Automation

With:

  • Configuration (sf-config-tool)
  • Format conversion
  • Database access
  • Mapping layer
  • Link graph traversal
  • Crawl diff comparison

It becomes possible to:

  • Start crawls
  • Convert formats
  • Query results
  • Compare crawls over time

All from Python. No GUI. No manual exports.

Real-World Use Case:

A monthly enterprise crawl (~80k URLs) where the SEO team needs to answer, every time:

  • Which internal pages are 404 right now?
  • Which pages have the most broken inlinks?
  • Which broken inlinks are rel=nofollow vs follow?
  • Which redirect chains are longer than 3 hops?
  • What changed vs last month (status, title, canonicals)?

Before: manual GUI filtering + multiple CSV exports + spreadsheet joins. Now: one Python workflow that queries Derby directly and outputs a prioritized fix list in minutes, fully reproducible month to month.
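
One of those questions, “which pages have the most broken inlinks”, reduces to a single join over the two tables described earlier (column names are again assumptions):

    cursor.execute("""
        SELECT l.TARGET_URL, COUNT(*) AS broken_inlinks
        FROM APP.LINKS l
        JOIN APP.URLS u ON u.ENCODED_URL = l.TARGET_URL
        WHERE u.RESPONSE_CODE = 404
        GROUP BY l.TARGET_URL
        ORDER BY 2 DESC
    """)
    for url, inlinks in cursor.fetchall():
        print(f"{inlinks:>5}  {url}")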

8. Alpha and Early Feedback

The first alpha batch is now running on real sites.

Some early feedback:

  • 10 lines of code replaced exporting and merging 3 CSVs
  • 45+ minutes of GUI work replaced by a script
  • Crawl diff cited as the most valuable feature

The most interesting part isn’t speed.

It’s that workflows that were previously manual can now be automated end-to-end.

