A Static Code Analysis of 8 Open Source Giants

Glenn Engstrand

Published Apr 14, 2025

I recently released a free-to-use web app where you can ask software architecture questions about eight popular open-source services. The answers come from a RAG-focused LLM. This article summarizes a static code analysis of the eight projects that the "ask the software architect" app covers: Apache Cassandra, Debezium, Apache Druid, Elasticsearch, Apache Kafka, Apache Lucene, Neo4j, and Apache Spark. The analysis uses metrics such as file count, lines of code (LoC), cyclomatic complexity, and package fan-in to characterize the size, complexity, and structure of these codebases without running the code.
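To make the per-file metrics concrete, here is a minimal sketch of how LoC and cyclomatic complexity could be approximated for a Java source file. It is not the tooling used in the analysis: the function names, the keyword list, and the sample class are all assumptions made for illustration, and counting keywords with a regular expression is only a rough stand-in for a real parser.

```python
import re

# Branching constructs treated as decision points. This keyword list is an
# assumption for illustration; real tools parse the code instead of
# pattern-matching it, and they also ignore comments and string literals.
DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\|")


def count_loc(source: str) -> int:
    """Count non-blank lines that are not pure `//` comment lines."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("//")
    )


def approx_cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity as 1 + the number of decision points."""
    return 1 + len(DECISION_PATTERN.findall(source))


# Hypothetical sample, not taken from any of the eight projects.
SAMPLE = """\
// A tiny class used only to exercise the metrics.
public class Sample {
    int clamp(int x) {
        if (x < 0 || x > 100) {
            return 0;
        }
        for (int i = 0; i < x; i++) {
            x--;
        }
        return x;
    }
}
"""

print(count_loc(SAMPLE))                     # 11
print(approx_cyclomatic_complexity(SAMPLE))  # 4 (if, ||, for, plus 1)
```

File count is then simply the number of source files visited while applying these functions across a repository.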

Repository-level findings revealed significant, and perhaps surprising, variation. Among the eight services, Elasticsearch has the largest file count and the highest average complexity per file. Cassandra shows the highest median LoC, indicating typically larger files. Its code files are significantly larger than those of the other database technologies on the list, Elasticsearch and Neo4j.
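As a rough illustration of how per-file numbers roll up into repository-level figures like these, the sketch below computes file count, median LoC, and average complexity. The `summarize` function and every number in it are hypothetical and invented for the example; none of them come from the analysis.

```python
from statistics import mean, median


def summarize(files):
    """Aggregate (loc, complexity) pairs, one per source file."""
    locs = [loc for loc, _ in files]
    complexities = [complexity for _, complexity in files]
    return {
        "file_count": len(files),
        "median_loc": median(locs),
        "mean_complexity": mean(complexities),
    }


# Hypothetical per-file numbers, invented for illustration only.
repo = [(40, 3), (55, 5), (60, 4), (70, 6), (2000, 90)]
summary = summarize(repo)

print(summary["file_count"])       # 5
print(summary["median_loc"])       # 60
print(summary["mean_complexity"])  # 21.6
```

The single 2000-line file barely moves the median LoC but pulls the average complexity far above what most files show. That is why a high median, as reported for Cassandra, points to files that are typically large rather than to a few outliers.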

Diving deeper, package-level metrics identified specific "hotspots" within the projects. For instance, Spark's internal package was notably large, while specific Elasticsearch packages exhibited high variation in complexity or significant inter-dependency (high fan-in), the index.mapper package being one example. The analysis noted that Cassandra's tendency toward large files might stem from coding choices such as grouping related implementations in a single file.
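Package fan-in is commonly defined as the number of other packages that depend on a given package. A minimal sketch of that idea, assuming dependencies are read from Java `package` and `import` statements, might look like the following. The regular expressions, the `package_fan_in` function, and the `org.example` sources are hypothetical, and the analysis itself may define or measure fan-in differently.

```python
import re
from collections import defaultdict

# Matches `package a.b.c;` and captures the package name.
PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE)
# Matches `import a.b.c.Type;` or `import a.b.c.*;` and captures `a.b.c`.
# Static imports are skipped by this simplified pattern.
IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)\.[\w*]+\s*;", re.MULTILINE)


def package_fan_in(sources):
    """Map each imported package to the number of distinct other packages importing it."""
    importers = defaultdict(set)
    for source in sources:
        declared = PACKAGE_RE.search(source)
        if not declared:
            continue  # files in the default package are ignored
        own_package = declared.group(1)
        for target in IMPORT_RE.findall(source):
            if target != own_package:
                importers[target].add(own_package)
    return {package: len(users) for package, users in importers.items()}


# Hypothetical sources, invented for illustration only.
SOURCES = [
    "package org.example.search;\nimport org.example.mapper.FieldMapper;\n",
    "package org.example.query;\nimport org.example.mapper.*;\n"
    "import org.example.search.Searcher;\n",
    "package org.example.mapper;\nimport org.example.mapper.Helper;\n",
]

print(package_fan_in(SOURCES))
# {'org.example.mapper': 2, 'org.example.search': 1}
```

A package with high fan-in by this measure is one that many other packages import, so a change to it is more likely to ripple through the rest of the codebase.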

While acknowledging limitations, the study concludes that static analysis provides valuable quantitative insights, revealing diverse architectural landscapes and identifying areas of concentrated size, complexity, or dependency within these successful open-source systems.
