Reverse-Engineering Screaming Frog: What I Learned Building a Python API Around Crawl Data

For years, my workflow with Screaming Frog looked the same:

  • Run a crawl.
  • Open the GUI.
  • Export CSVs.
  • Clean them.
  • Then finally start analyzing.

It worked. But it was deeply manual. Last month I started asking a simple question:

What if crawl data didn’t have to leave code at all?

This article is a summary of what I learned during the first month of building a Python library to read and automate Screaming Frog crawl data programmatically.

Not a launch post. Just a technical breakdown of the journey so far.

1. The Real Limitation Wasn’t Configuration — It Was Output

I had already built sf-config-tool, which lets you configure Screaming Frog crawls programmatically. That solved half the problem. But even using the CLI, the only way to extract data was through predefined exports: tab/filter combinations as flat CSV files.

  • No ad-hoc queries.
  • No direct access to the link graph.
  • Every new question meant another export.

That’s when I looked at the .dbseospider file format more closely.

2. What’s Inside a .dbseospider File?

A .dbseospider file is essentially:

  • A ZIP archive
  • Containing an Apache Derby database

That changes everything. Instead of exporting CSVs, you can connect directly to the database from Python.
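
Here is a minimal sketch of that idea, assuming the archive unpacks to a Derby database directory at its root and that a Derby JDBC driver jar is available locally (file paths are illustrative, and this is not the library's final API):

    import zipfile
    import jaydebeapi  # JDBC bridge for Python (pip install jaydebeapi)

    # A .dbseospider file is a ZIP; extract it to get the Derby database directory.
    with zipfile.ZipFile("crawl.dbseospider") as archive:
        archive.extractall("crawl_db")

    # Connect with Derby's embedded JDBC driver. If the database sits in a
    # subdirectory of the extracted archive, adjust the jdbc:derby: path accordingly.
    conn = jaydebeapi.connect(
        "org.apache.derby.jdbc.EmbeddedDriver",
        "jdbc:derby:crawl_db",
        jars="/path/to/derby.jar",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM APP.URLS")
    print("Crawled URLs:", cursor.fetchone()[0])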

Inside that database:

  • APP.URLS contains every crawled page (200+ columns)
  • APP.LINKS stores the full link graph
  • Duplicate tracking
  • PageSpeed data
  • Accessibility results
  • Complete HTTP headers
  • Dozens of meta tag slots
  • Lighthouse JSON blobs

Even better, the Derby database contains fields that are not surfaced in the default GUI views or exports. Accessing it directly was the first breakthrough.

3. Raw Data Is Not Usable Data

Once the database was accessible, a new problem appeared. The internal column names are not the ones users recognize.

Instead of:

  • Address
  • Status Code
  • Title 1

You get:

  • ENCODED_URL
  • RESPONSE_CODE
  • TITLE_1

To make this usable, I exported every possible tab and filter from Screaming Frog (628 CSV files) to understand how the internal schema maps to the user-facing labels. This mapping layer is now what allows querying crawl data in Python with familiar column names.
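
As an illustration, that mapping layer boils down to a dictionary plus a small query helper. The snippet below is a simplified sketch with only three entries, not the library's actual API:

    # Friendly GUI label -> internal Derby column name (tiny excerpt of the full map).
    COLUMN_MAP = {
        "Address": "ENCODED_URL",
        "Status Code": "RESPONSE_CODE",
        "Title 1": "TITLE_1",
    }

    def select_columns(cursor, labels, table="APP.URLS"):
        """Query Derby using the labels users already know from the GUI."""
        internal = [COLUMN_MAP[label] for label in labels]
        cursor.execute(f"SELECT {', '.join(internal)} FROM {table}")
        return [dict(zip(labels, row)) for row in cursor.fetchall()]

    # Reusing the cursor from the earlier connection sketch:
    rows = select_columns(cursor, ["Address", "Status Code", "Title 1"])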

4. The Link Graph Is the Real Power

One of the most powerful discoveries was the internal link graph. Screaming Frog stores every relationship:

  • Source URL
  • Destination URL
  • Anchor text
  • Link type
  • Follow/nofollow

In the GUI, you explore this manually. In code, it becomes a graph you can traverse.

That enables:

  • Orphan detection
  • Internal link audits
  • Anchor distribution analysis
  • Redirect chain detection
  • Canonical chain detection

All scriptable. All reproducible.
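
For example, a first pass at orphan detection is a couple of set operations. The column names below are assumptions based on the schema described above, not verified internals:

    # URLs that receive at least one internal link.
    cursor.execute("SELECT DISTINCT TARGET_URL FROM APP.LINKS")  # assumed column name
    linked = {row[0] for row in cursor.fetchall()}

    # URLs that were crawled and returned 200.
    cursor.execute("SELECT ENCODED_URL FROM APP.URLS WHERE RESPONSE_CODE = 200")
    crawled = {row[0] for row in cursor.fetchall()}

    # Crawled (e.g. discovered via the sitemap) but never linked to internally.
    orphan_candidates = crawled - linked
    print(f"{len(orphan_candidates)} orphan candidates")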

5. Computed Fields Required Rerunning Calculations

Not all values in the GUI are stored directly. “Indexability,” for example, is computed from:

  • robots.txt status
  • multiple meta robots tags
  • X-Robots-Tag headers
  • other internal flags

To reproduce GUI-consistent results, I had to reconstruct Screaming Frog’s decision logic and replicate it in SQL and Python. Accessing raw data isn’t enough; you need to rebuild the tool’s internal business logic.
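
To give a feel for what that means, here is a deliberately simplified sketch of an indexability check. The flag and column names are assumptions, and Screaming Frog’s real decision tree covers more cases:

    def is_indexable(row: dict) -> bool:
        """Simplified indexability check over one APP.URLS row (illustrative only)."""
        if row.get("RESPONSE_CODE") != 200:
            return False
        if row.get("BLOCKED_BY_ROBOTS_TXT"):           # hypothetical robots.txt flag
            return False
        robots_directives = " ".join([
            str(row.get("META_ROBOTS_1") or ""),
            str(row.get("X_ROBOTS_TAG_1") or ""),      # hypothetical header column
        ]).lower()
        if "noindex" in robots_directives:
            return False
        return True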

6. The Format Problem

Screaming Frog has three formats:

  • .seospider (serialized Java objects, not queryable)
  • CSV exports (flat, incomplete)
  • .dbseospider (Derby database, fully queryable)

The CLI can export CSVs and save .seospider, but not .dbseospider directly. That’s a major constraint.

The workaround:

  1. Load .seospider via CLI in DB storage mode
  2. Let Screaming Frog generate the Derby database
  3. Package that directory into a .dbseospider file

Now any crawl can become a portable, queryable database.
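
Step 3 is the only part that happens outside Screaming Frog, and it is essentially just re-zipping. A minimal sketch, assuming the .dbseospider layout simply mirrors the Derby database directory (worth checking against a GUI-saved file):

    import shutil

    # Zip the Derby database directory that Screaming Frog generated...
    shutil.make_archive("converted_crawl", "zip", "path/to/generated_derby_db")
    # ...then rename the archive so it carries the .dbseospider extension.
    shutil.move("converted_crawl.zip", "converted_crawl.dbseospider")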

7. Full Automation

With:

  • Configuration (sf-config-tool)
  • Format conversion
  • Database access
  • Mapping layer
  • Link graph traversal
  • Crawl diff comparison

It becomes possible to:

  • Start crawls
  • Convert formats
  • Query results
  • Compare crawls over time

All from Python. No GUI. No manual exports.

Real-World Use Case:

A monthly enterprise crawl (~80k URLs) where the SEO team needs to answer, every time:

  • Which internal pages are 404 right now?
  • Which pages have the most broken inlinks?
  • Which broken inlinks are rel=nofollow vs follow?
  • Which redirect chains are longer than 3 hops?
  • What changed vs last month (status, title, canonicals)?

Before: manual GUI filtering + multiple CSV exports + spreadsheet joins. Now: one Python workflow that queries Derby directly and outputs a prioritized fix list in minutes, fully reproducible month to month.
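
One of those questions, “which pages have the most broken inlinks”, reduces to a single join over the two tables described earlier (column names are again assumptions):

    cursor.execute("""
        SELECT l.TARGET_URL, COUNT(*) AS broken_inlinks
        FROM APP.LINKS l
        JOIN APP.URLS u ON u.ENCODED_URL = l.TARGET_URL
        WHERE u.RESPONSE_CODE = 404
        GROUP BY l.TARGET_URL
        ORDER BY 2 DESC
    """)
    for url, inlinks in cursor.fetchall():
        print(f"{inlinks:>5}  {url}")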

8. Alpha and Early Feedback

The first alpha batch is now running on real sites.

Some early feedback:

  • 10 lines of code replaced exporting and merging 3 CSVs
  • 45+ minutes of GUI work replaced by a script
  • Crawl diff cited as the most valuable feature

The most interesting part isn’t speed.

It’s that workflows that were previously manual can now be automated end-to-end.

