Diving Deeper: Can LLMs Author Complex Tech Docs? My RAG Experiment Revisited

The buzz around Large Language Models (LLMs) often touts their potential to automate content creation, a prospect particularly tantalizing for technical documentation where clarity, accuracy, and depth are paramount. But can these models move beyond simple text generation to author comprehensive documentation on intricate topics for an audience that already has some background? Driven by this question, I embarked on an experiment to gauge the readiness of several prominent LLMs – ChatGPT (o1, o3 mini high, 4.5 variants), Claude Sonnet 3.7, and Gemini 2.5 Pro – for this specific challenge, using Retrieval-Augmented Generation (RAG) as my test case.

My starting point wasn't a blank page, but rather a personal outline capturing key RAG concepts, from the general process to advanced techniques like multi-query retrieval and hierarchical clustering. This outline served as a scaffold, containing brief notes and some pseudocode examples. The challenge posed to the LLMs was multi-faceted: not just to expand on this skeleton, but to truly enrich it for a reader looking to deepen their RAG knowledge. I employed a two-prompt strategy for each model. The first prompt initiated the process, asking for a basic summary based on the outline. The crucial second prompt significantly raised the bar, requesting detailed explanations of underlying theories, replacement of pseudocode with concise, working code examples illustrating key concepts, verification of API relevance, and, importantly, the identification and filling of implicit knowledge gaps within the outline – tasks essential for creating valuable learning material beyond the basics.
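To give a sense of the enrichment the second prompt asked for, here is a minimal sketch of multi-query retrieval, one of the techniques covered in my outline. It is my own illustration rather than any model's output or the outline's actual pseudocode; rewrite_query and vector_search are hypothetical stand-ins for an LLM call and a vector-store lookup.

```python
# A minimal, illustrative sketch of multi-query retrieval.
# `rewrite_query` and `vector_search` are hypothetical stand-ins for an
# LLM call and a vector-store lookup; swap in whatever client you use.

def multi_query_retrieve(question, rewrite_query, vector_search,
                         n_variants=3, top_k=5):
    """Retrieve with several LLM-generated rephrasings of the same question,
    then merge the results, deduplicating by document id."""
    variants = [question] + [rewrite_query(question) for _ in range(n_variants)]

    seen, merged = set(), []
    for variant in variants:
        for doc in vector_search(variant, top_k=top_k):
            if doc["id"] not in seen:   # assumes each hit is a dict with an "id"
                seen.add(doc["id"])
                merged.append(doc)
    return merged
```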

Evaluating the outputs meant focusing on their helpfulness for someone already familiar with LLM concepts but seeking deeper RAG knowledge. I primarily considered:

  1. Value for Advanced Learning: How well did the document explain complex RAG theories and provide useful implementation details (like code) for advanced techniques?
  2. Outline Enrichment & Gap Filling: Did the model go beyond the basic outline to add nuances, context, or address implicit missing information relevant to a learner?
  3. Code Quality & Utility: Were the code examples (replacing pseudocode) accurate, illustrative of key concepts, and potentially useful for someone trying to implement these techniques?
  4. Clarity and Structure: Was the information presented in a clear, well-organized manner suitable for learning complex topics?

Based on these criteria, the models showed distinct strengths and weaknesses in this specific test:


[Image from the original post: per-model comparison summary]

In this experiment, Gemini 2.5 Pro demonstrated particular strengths in providing deep theoretical context and proactively addressing potential knowledge gaps, making its output potentially very valuable for understanding the 'why' behind techniques. Claude Sonnet 3.7 also performed well, especially in generating clear explanations and quite detailed code examples that closely followed the outline's technical pointers. When comparing the ChatGPT models specifically for a reader wanting to learn advanced RAG techniques, an interesting observation emerged. While ChatGPT 4.5 produced a generally readable document, it was often quite superficial on complex topics. In contrast, ChatGPT o1 and o3 mini high, although perhaps less polished overall, sometimes provided more concrete detail (like specific code or pseudocode snippets for advanced methods) than 4.5 did. For someone trying to grasp the specifics of an advanced technique, this added detail might make them more helpful than 4.5's brevity on those points.

To truly appreciate these differences, let's zoom in on how the models handled the RAPTOR (Hierarchical Clustering) section. My outline briefly described RAPTOR as clustering documents hierarchically with a Gaussian Mixture Model (GMM), mentioning UMAP for dimensionality reduction, the Bayesian Information Criterion (BIC) for choosing the number of clusters, and the general process; a small code sketch of that clustering step follows the comparison below. Here's how the models responded:

  • Gemini 2.5 Pro offered a thorough conceptual explanation, discussed the components, but also provided valuable context on RAPTOR's complexity and research status, explaining why simple code wasn't provided and offering conceptual steps instead.
  • Claude Sonnet 3.7 delivered impressive code detail, implementing clustering with PCA/UMAP, GMM, BIC, and even outlining retrieval logic.
  • ChatGPT o1 included theory points and relevant pseudocode for clustering using UMAP/GMM/BIC.
  • ChatGPT o3 mini high also provided theory and a code snippet for clustering with UMAP/GMM.
  • ChatGPT 4.5, however, offered just a single definitional sentence.

For a learner wanting to understand RAPTOR specifics, the outputs from Gemini (for concept/context) and Sonnet (for code) seem most valuable. That said, the tangible code/pseudocode snippets in o1 and o3 mini high arguably offer more learning material on this specific topic than the near-absent explanation in 4.5.
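For readers who want something concrete to compare those outputs against, below is a minimal, hypothetical sketch of the clustering step my outline describes: reduce document embeddings with UMAP, fit Gaussian Mixture Models over a range of cluster counts, and keep the count with the lowest BIC. It is my own illustration, not taken from any model's output, and it omits the per-cluster summarization and recursion that full RAPTOR uses to build its tree.

```python
# A minimal sketch of the clustering step behind RAPTOR-style hierarchical
# retrieval: reduce embeddings with UMAP, then pick the GMM cluster count
# with the lowest BIC. Real RAPTOR also summarizes each cluster and recurses
# to build a tree; that part is omitted here.
import numpy as np
import umap                                   # pip install umap-learn
from sklearn.mixture import GaussianMixture   # pip install scikit-learn

def cluster_embeddings(embeddings: np.ndarray, max_clusters: int = 10):
    # Project the high-dimensional document embeddings into a smaller space
    # so the mixture model can fit well-separated components.
    reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

    # Fit a GMM for each candidate cluster count and score it with BIC.
    candidates = list(range(2, max_clusters + 1))
    gmms = [GaussianMixture(n_components=k, random_state=0).fit(reduced)
            for k in candidates]
    bics = [g.bic(reduced) for g in gmms]

    # The lowest BIC gives the "optimal" number of clusters.
    best = gmms[int(np.argmin(bics))]
    return best.predict(reduced)              # one cluster label per document
```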

Beyond content quality, a persistent practical hurdle emerged during this process. I consistently found that ChatGPT models, across the versions tested, struggled with generating clean Markdown output when code blocks were included. The formatting would often break or render incorrectly, necessitating tedious manual cleanup. In several instances, I found it quicker to paste the flawed ChatGPT output into Claude simply to have it fix the Markdown structure – an ironic but necessary workaround that highlights friction in using certain tools directly for documentation generation.

Reflecting on this experiment, it's clear that while LLMs possess remarkable capabilities, their suitability for authoring complex technical documentation isn't uniform. For the specific, demanding task of enriching a RAG outline with theory, code, and proactive gap-filling, models like Gemini 2.5 Pro and Claude Sonnet 3.7 showed significant promise in this test, tackling complex requirements with notable success. Interestingly, even less advanced models like ChatGPT o1 and o3 mini high sometimes provided more useful detail on specific complex topics than the more recent 4.5 variant, depending on the learning goal. This highlights that the "best" model might depend heavily on the specific task and desired output characteristics. The potential for LLMs to assist technical writers is clear, but careful prompting, critical evaluation, and awareness of practical issues like formatting remain essential.


P.S. A Meta-Note on Using LLMs to Write About LLMs

It feels only right to mention that generating this very blog post involved its own layer of LLM interaction. I primarily used Gemini 2.5 Pro (the model drafting this text from my instructions) and ChatGPT o1 to help draft and refine the content based on my analysis and feedback.

Interestingly, this meta-process wasn't immune to the quirks we often see with LLMs. Even on a relatively straightforward task like summarizing our comparison, mistakes occurred. For instance, ChatGPT o1 initially got confused about which files were which during the analysis phase. Gemini 2.5 Pro, perhaps influenced by different default metrics or interpretations, initially suggested ChatGPT 4.5 outperformed o1 and o3 mini high, requiring correction based on the specific viewpoint of a learner seeking advanced details. Furthermore, I noticed Gemini's initial drafts tended to frame its own performance (as identified in the RAG generation task) quite favorably, while ChatGPT o1's attempts at comparison felt somewhat more neutral. Ensuring a more modest, balanced tone that reflected a single experiment required specific instruction. It's a useful reminder that even when using LLMs for summary or writing tasks, careful review, clear metric definition, and iterative refinement are crucial – the models are powerful tools, but not yet infallible authors or objective analysts.
