A Static Code Analysis of 8 Open Source Giants
I recently released a free-to-use web app where you can ask software architecture questions about eight popular open-source services. The answers come from a RAG-focused LLM. This article summarizes a static code analysis of the eight projects that the "ask the software architect" app covers: Apache Cassandra, Debezium, Apache Druid, Elasticsearch, Apache Kafka, Apache Lucene, Neo4j, and Apache Spark. The analysis uses metrics such as file count, lines of code (LoC), cyclomatic complexity, and package fan-in to understand the size, complexity, and structure of these codebases without running the code.
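To make the repository-level metrics concrete, here is a minimal Python sketch of how file count, median LoC, and an approximate cyclomatic complexity could be computed from Java source text. It is an illustration under stated assumptions, not the tooling used for the analysis: the function names (count_loc, approx_cyclomatic_complexity, summarize) are hypothetical, and complexity is estimated by counting branching keywords with a regular expression rather than by parsing the code.

```python
import re
import statistics

# Assumption: decision points are approximated by keyword/operator matches.
# A real analyzer would parse the code and ignore strings and comments.
DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\|")


def count_loc(source: str) -> int:
    """Count non-blank lines that are not pure '//' comment lines."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("//")
    )


def approx_cyclomatic_complexity(source: str) -> int:
    """Approximate complexity as 1 plus the number of decision points."""
    return 1 + len(DECISION_PATTERN.findall(source))


def summarize(files: dict[str, str]) -> dict[str, float]:
    """Aggregate per-file metrics for a mapping of file path -> source text."""
    locs = [count_loc(src) for src in files.values()]
    complexities = [approx_cyclomatic_complexity(src) for src in files.values()]
    return {
        "file_count": len(files),
        "median_loc": statistics.median(locs),
        "mean_complexity": statistics.mean(complexities),
    }
```

The median is used for LoC because a handful of very large files can pull the mean far from what a typical file looks like.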
The repository-level findings revealed significant variations, some of which you might find surprising. Among the eight services, Elasticsearch has both the largest file count and the highest average complexity per file. Cassandra has the highest median LoC, which indicates that its files are typically larger; they are significantly larger than those of the other database technologies on the list, Elasticsearch and Neo4j.
A closer look at package-level metrics identified specific "hotspots" within the projects. For instance, Spark's internal package was notably large, while certain Elasticsearch packages exhibited high complexity variation or significant inter-dependency (high fan-in), as in the index.mapper package. The analysis also noted that Cassandra's tendency toward large files might stem from coding choices such as grouping related implementations in single files.
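As a rough illustration of package fan-in, the sketch below counts, for each package, how many distinct other packages import from it. This definition is an assumption for the example, and the article's tooling may measure fan-in differently. The function name package_fan_in is hypothetical, static imports are ignored, and an import of a nested class would be attributed to its outer class rather than to the package.

```python
import re
from collections import defaultdict

# Matches "package a.b.c;" and captures the package name.
PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE)
# Matches "import a.b.C;" or "import a.b.*;" and captures "a.b".
# Static imports do not match this pattern and are skipped.
IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)\.(?:\w+|\*)\s*;", re.MULTILINE)


def package_fan_in(files: dict[str, str]) -> dict[str, int]:
    """Map each imported package to the number of distinct packages importing it."""
    importers: dict[str, set[str]] = defaultdict(set)
    for source in files.values():
        match = PACKAGE_RE.search(source)
        if match is None:
            continue  # no package declaration: skip the file
        own_package = match.group(1)
        for imported in IMPORT_RE.findall(source):
            if imported != own_package:
                importers[imported].add(own_package)
    return {pkg: len(sources) for pkg, sources in importers.items()}
```

A package with a high count here is one that many other parts of the codebase depend on, so changes to it tend to have a wide blast radius.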
While acknowledging limitations, the study concludes that static analysis provides valuable quantitative insights, revealing diverse architectural landscapes and identifying areas of concentrated size, complexity, or dependency within these successful open-source systems.