🦆 GitHub just shipped a "Rubber Duck" agent for Copilot CLI — and the data backs it up. The idea is simple but powerful: after the primary model writes code, a second model from a different AI family automatically reviews it. Why it works → Models from the same family share the same blind spots. Cross-architecture review catches a completely different class of errors than self-review. The results? 74.7% gap closure in code quality issues. This is basically institutionalizing what top engineers already do — getting a code review from someone with a different perspective. Currently available in Copilot CLI only. VS Code coming soon. 🔗 Credit: @burkeholland #GitHubCopilot #AI #CopilotCLI #CodeReview #SoftwareEngineering #DeveloperTools
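The control flow described above can be sketched in a few lines. This is a hedged illustration only: the real Copilot CLI implementation is not public, and the `builder`/`reviewer` callables here are stubs standing in for API calls to two different model families.

```python
# Sketch of cross-family review: a "builder" model produces code, a
# reviewer from a DIFFERENT model family critiques it, and the builder
# revises. Stub callables stand in for real model calls.

def cross_family_review(builder, reviewer, task, max_rounds=2):
    """Run build -> review -> revise until the reviewer has no findings."""
    code = builder(task, findings=None)
    findings = []
    for _ in range(max_rounds):
        findings = reviewer(task, code)   # different family, different blind spots
        if not findings:
            break
        code = builder(task, findings=findings)  # revise using the critique
    return code, findings

# Stub "models" to show the control flow:
def stub_builder(task, findings=None):
    return "v2" if findings else "v1"

def stub_reviewer(task, code):
    return [] if code == "v2" else ["off-by-one in loop bound"]

code, findings = cross_family_review(stub_builder, stub_reviewer, "write a loop")
# code == "v2", findings == []
```

The point of the pattern is the second argument: swapping in a reviewer trained on a different architecture is what changes the class of errors caught.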
What do you think the best model for coding is? The answer is different depending on who you ask. But it's the wrong question. The right question is: what's the best model for this job?

Different models are good at different things. Sonnet for planning. GPT 5.3 Codex for cranking out code. Opus for deep reasoning and design. Haiku for fast exploration. In GitHub Copilot CLI, you can use all of them, in the same workflow.

In my latest video, I walk through how to compose multi-model workflows in the CLI:
- Override the model on the fly
- Build custom agents that each use a different model for their specialty
- Use the built-in /fleet command to automatically parallelize and delegate to the right agent
- Create an adversarial review skill where multiple models review each other's work

https://lnkd.in/gme48rsq
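The per-task pairings in the post amount to a routing table. A minimal sketch, assuming hypothetical model identifiers (the task names and the `route` helper are illustrative; Copilot CLI's actual /fleet delegation logic is not shown here):

```python
# Hypothetical task -> model routing table mirroring the post's pairings.
# The model identifier strings are illustrative, not official API names.

MODEL_FOR_TASK = {
    "plan":    "claude-sonnet",    # planning
    "code":    "gpt-5.3-codex",    # cranking out code
    "design":  "claude-opus",      # deep reasoning and design
    "explore": "claude-haiku",     # fast exploration
}

def route(task_kind, default="claude-sonnet"):
    """Pick the model for this job, falling back to a sensible default."""
    return MODEL_FOR_TASK.get(task_kind, default)

route("code")     # -> "gpt-5.3-codex"
route("triage")   # -> "claude-sonnet" (no specialist registered)
```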
Multi-model AI workflows in GitHub Copilot CLI
-
Everyone’s talking about powerful models like Claude and OpenAI. But in reality? Most engineering teams are quietly defaulting to tools like GitHub Copilot. Why? 👉 Because integration beats capability in the real world.

Copilot sits inside the developer workflow. Inside the IDE. Inside the pull requests. Inside the daily grind. No context switching. No extra setup. Just… there.

And that’s exactly the lesson we’re missing in Test Engineering. We don’t just need powerful AI. We need AI that’s deeply integrated into how we already work.

⚙️ This is where RAG comes in — not as a concept, but as a layer:
* Plugged into your test frameworks
* Connected to your logs, test cases, and defect history
* Working inside CI/CD pipelines, not outside them

#GitHubCopilot #RAG #AITransformation #AIAdoption #TestEngineering
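What "RAG as a layer over defect history" can look like, in miniature. This is a sketch under stated assumptions: the corpus is invented, and real systems would use embeddings and a vector store rather than this naive keyword-overlap scoring.

```python
# Minimal RAG-style retrieval over a defect-history corpus: score each
# document by keyword overlap with the query, then splice the top hits
# into the prompt so the model answers from real project context.

def retrieve(query, corpus, k=2):
    q = set(query.lower().split())
    # Rank documents by how many query words they share (naive but illustrative).
    scored = sorted(corpus, key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def grounded_prompt(query, corpus):
    context = "\n".join(retrieve(query, corpus))
    return f"Context from defect history:\n{context}\n\nQuestion: {query}"

defects = [
    "login test flaky when session cookie expires mid run",
    "checkout pipeline fails on null price in cart fixture",
    "report generator times out on large CSV exports",
]
prompt = grounded_prompt("why is the login test flaky", defects)
# prompt now contains the session-cookie defect as context
```

The design point is that the retrieval step runs inside the pipeline, before generation, so the model never has to guess at project-specific history.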
-
🚀 Just discovered something seriously powerful for devs working with AI + codebases: GitNexus — a zero-server code intelligence engine that turns any repo into a living knowledge graph. Drop in a repo → get full architecture visibility, dependency mapping, and AI-ready context. No more blind edits. No more missing dependencies. This is how AI should understand code. 🔗 https://lnkd.in/ddBR2HEV #AI #DeveloperTools #OpenSource #MachineLearning #CodeAnalysis #DevTools #GitHub
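One ingredient of that kind of code intelligence can be sketched with the standard library. To be clear, this is not GitNexus's pipeline (which isn't shown in the post); it only illustrates the idea of deriving a dependency graph from source code, here via Python's `ast` module.

```python
# Build a module -> imports dependency map by parsing Python sources.
# A real code-intelligence engine would resolve symbols, calls, and
# cross-language edges; this shows only the import-graph core.

import ast

def import_graph(sources):
    """sources: {module_name: source_code} -> {module_name: set of imported names}"""
    graph = {}
    for name, code in sources.items():
        deps = set()
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                deps.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module)
        graph[name] = deps
    return graph

graph = import_graph({
    "app":   "import db\nfrom utils import helper",
    "db":    "import sqlite3",
    "utils": "",
})
# graph["app"] == {"db", "utils"}
```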
-
I stopped using one AI for everything. Now I use two. The difference surprised me. Here's the actual workflow.

Most people run Claude Code like a solo dev. One prompt. One model. Ship it. It works. But it tops out. Claude is genuinely good at building: long context, complex architecture, multi-file sessions. It also over-engineers sometimes, runs out of tokens, and drifts on longer tasks. Which is fine. I just stopped pretending one tool covers all of it.

What changed things for me: the OpenAI Codex plugin running inside Claude Code. Not a replacement. More like a second set of eyes on the same work.

What it looks like in practice: Claude builds the feature. While it's still running, I kick off /codex:adversarial-review in the background. Codex looks at the same codebase from a different angle. When Claude wraps up, I feed those findings back in. Claude fixes what it missed. Both running at the same time. No waiting.

The demo that made it click:
- Claude built a full 2D roguelike dungeon crawler from scratch.
- Codex did an adversarial review and found actual bugs: save states silently resetting, and staircase logic that permanently locked the player.
- Claude would have shipped it. Codex caught it before that happened.

That's not Claude failing. That's just what happens when a second reviewer looks at the same code.

What you actually get from running both:
• Claude handles the build, context, and long-run architecture
• Codex runs a background adversarial pass on the same code
• Feed the results back and Claude fixes the gaps it missed
• No context switching, both run at the same time
• More coverage, same amount of time

Pick your model carefully. But build the workflow around it. That's where the real gap opens up.

The full walkthrough is in the attachment: every step of the process, with visuals from the actual demo.

#ClaudeCode #OpenAICodex #AITools #GenerativeAI #DeveloperTools #CodingWithAI #BuildInPublic #AIEngineering #SoftwareDevelopment #GenAI
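The "both running at the same time" part of that workflow is just concurrent dispatch plus a join. A sketch with stub model calls (the real work happens in Claude Code and the Codex plugin; `build_feature`, `adversarial_review`, and `fix` are hypothetical stand-ins):

```python
# Kick off the adversarial reviewer in a background thread while the
# builder keeps working, join when the build wraps up, then feed the
# findings back for a fix pass. Stubs stand in for the two model calls.

from concurrent.futures import ThreadPoolExecutor

def build_feature(task):
    return f"feature for {task}"              # stands in for Claude building

def adversarial_review(task):
    return ["save state silently resets"]     # stands in for Codex reviewing

def fix(code, findings):
    return code + " (fixed: " + "; ".join(findings) + ")"

def build_with_background_review(task):
    with ThreadPoolExecutor(max_workers=2) as pool:
        review = pool.submit(adversarial_review, task)  # background pass
        code = build_feature(task)                      # main build continues
        findings = review.result()                      # join: "no waiting"
    return fix(code, findings) if findings else code

result = build_with_background_review("dungeon crawler")
```

The shape matters more than the stubs: the review overlaps the build instead of serializing after it, which is where the "more coverage, same amount of time" claim comes from.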
-
GitHub Launches Rubber Duck Experimental Feature for Copilot CLI 📌 GitHub’s Rubber Duck experiment brings a second AI mind to Copilot CLI, cross-checking code plans with a different model family to catch “confident mistakes” before they escalate. It boosts accuracy on complex tasks - closing 74.7% of performance gaps and catching critical bugs like silent overwrites or dependency conflicts. Now in experimental mode, it’s a bold leap toward smarter, more reliable AI coding assistants. 🔗 Read more: https://lnkd.in/dfnCgvZS #Githubcopilotcli #Rubberduck #Multimodelreview #Experimentalfeature #Aiassistant
-
Today I learned something that quietly changed how I think about AI tools.

I've been working through the Data Engineering ZoomCamp and got into Kestra's AI Copilot feature. The idea is straightforward: instead of hand-writing YAML for your workflow configs, you describe what you want in plain English and the Copilot generates the flow code for you.

But the more interesting part was understanding why it actually works well. The answer is RAG: Retrieval-Augmented Generation. Without it, an AI assistant is just working from whatever it learned during training. With RAG, it pulls in live, relevant context before it responds, in this case Kestra's own documentation and workflow patterns. That's what lets it give you accurate, specific output instead of generic guesses.

It clicked for me why this matters in data engineering specifically. Pipelines are detailed and unforgiving. A hallucinated config or a wrong parameter name breaks everything. Grounding the AI in real documentation before it generates anything isn't a nice-to-have; it's the whole point.

Kestra recently raised $25M and reported over 2 billion workflows executed in 2025 alone, which tells you orchestration tooling is becoming serious infrastructure, not just a nice abstraction on top of cron jobs.

Still early in the ZoomCamp but the depth keeps surprising me. If you're curious about the data engineering space, follow along.

#DataEngineering #DEZoomCamp #Kestra #RAG #LearningInPublic #Python
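The "a wrong parameter name breaks everything" point can be made concrete with a schema check on generated config before it's trusted. The allowed keys below are hypothetical, not Kestra's real flow schema; this only sketches the idea of validating LLM-generated config against what your system actually supports.

```python
# Validate a generated workflow-task config against an allowed key set,
# so a hallucinated parameter name is caught before the pipeline runs.
# ALLOWED_TASK_KEYS is an invented schema for illustration only.

ALLOWED_TASK_KEYS = {"id", "type", "commands", "retry"}

def invalid_keys(task_config):
    """Return any keys the schema does not recognize, sorted for stable output."""
    return sorted(set(task_config) - ALLOWED_TASK_KEYS)

generated = {"id": "extract", "type": "shell", "comands": ["python etl.py"]}
invalid_keys(generated)   # ["comands"]  <- the hallucinated typo is caught
```

RAG reduces how often this check fires by grounding generation in the real docs; the check is the backstop for when it still gets a name wrong.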
-
Did Claude Opus 4.7 just get... dumber? I’ve been using 4.6 for weeks and it was just a beast. But last few days with 4.7 have been a struggle - logic gaps, more iterations with same prompts, and it just feels like it's lost its edge. Honestly, even GPT-5.4 is starting to look good by comparison. Ah, and here is the real kicker: GitHub Copilot just locked us into 4.7, no more 4.6 option 😢 As an engineer, I hate it when "upgrades" feel like a forced regression: it's like we finally learned to use fire, and now we're back to banging rocks together and hoping for a spark 🫠 How are you finding Claude models' quality lately? Is it only me, or you also feel like you're back in the caves? #AI #Claude #GithubCopilot #AIWorkflows #SoftwareEngineering #PostVibe
-
Why the "Rubber Duck" is the most important update to Copilot in 2026.

GitHub just dropped an experimental feature that solves a massive headache for dev managers: AI hallucinations in multi-file refactors. It’s called Rubber Duck mode, and it’s a brilliant move in agentic design. Instead of one model checking its own homework (which rarely works: bias in, bias out), Copilot now pairs your primary model with a "reviewer" from a completely different AI family.

How it works: if you're using Claude as your primary coder, GitHub spins up GPT-5.4 as the "Rubber Duck" to critique the plan before a single line of code is written.

The result? Early benchmarks show it closes nearly 75% of the performance gap on complex, 70+ step tasks. It’s catching the silent logic errors that usually don't surface until a production bug report hits your desk.

In my view, 2026 isn't about which LLM is "smarter." It’s about which multi-agent architecture provides the strongest guardrails for our teams.

#GitHubCopilot #AI #SoftwareEngineering #AgenticAI #GenAI
-
During my last sprint, I handed off bug fixes to an AI agent—and the difference was clear. Tools like Copilot and Claude Code don’t just finish lines; they understand the context and suggest entire functions. This shift from simple autocomplete to smart code generation means fewer errors and quicker turnaround. Integrations with VS Code and JetBrains keep everything seamless, so I stay in the zone without switching apps. If you haven’t tried it yet, give Cursor or Copilot a week and see how much repetitive coding they take off your plate. What coding task would you want an AI to handle next? 🤖✨ #AI #SoftwareEngineering #AIDevelopment
-
Beyond the Prototype: When Streamlit Hits the Wall 🧱

In the world of AI development, we often start with Streamlit because it’s the "Gold Standard" for rapid prototyping. It’s fantastic for getting a UI up in minutes, but as my team and I recently discovered, high-complexity projects eventually demand a more robust architecture.

The Challenge: RAG + a 7B-Parameter Giant
We recently integrated a Mistral-7B model into a complex deep learning project to implement RAG (Retrieval-Augmented Generation). The goal? Allow the model to "read" our existing complex codebase and answer architectural questions in real time.

The Reality Check: While Streamlit is great for smaller models, trying to load a 7-billion-parameter LLM alongside a heavy RAG pipeline pushed the framework to its limits. We encountered:
- Memory overload: standard hosting environments struggled to keep the model weights in VRAM.
- Inference latency: Streamlit’s "rerun-on-interaction" logic can become a bottleneck when handling heavy deep learning workloads.
- State management: handling long, complex retrieval chains requires more granular control of the backend.

The Engineering Takeaway: For beginners, Streamlit is a 10/10. But for production-grade AI, you eventually have to transition to a decoupled architecture:
- Backend: FastAPI or Flask to handle the heavy lifting and model serving.
- Containerization: Docker to manage the massive environment requirements.
- Frontend: a dedicated framework that doesn't reload the entire model on every button click.

Building is about choosing the right tool for the job, not just the easiest one. Has anyone else hit a performance ceiling with "easy-to-use" frameworks when scaling their LLMs? Let's discuss in the comments. 👇

#MLOps #Mistral7B #AIEngineering #Streamlit #RAG #DeepLearning #Scalability
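The core of the decoupled-backend argument is load-once-serve-many: a backend process loads the weights at startup and reuses them, instead of re-executing on every interaction. A minimal sketch with a stub "model" (in a FastAPI or Flask service, `get_model()` would run once per worker process; the counter is only there to make the behavior visible):

```python
# Why rerun-on-interaction hurts with a 7B model: if every interaction
# re-ran the script, the expensive load would repeat. Caching the loader
# means the weights load once per process, however many requests arrive.

from functools import lru_cache

LOAD_COUNT = 0

@lru_cache(maxsize=1)
def get_model():
    global LOAD_COUNT
    LOAD_COUNT += 1           # stands in for loading 7B weights into VRAM
    return object()           # stub model handle

def handle_request(prompt):
    model = get_model()       # cached: no reload per request
    return f"answered: {prompt}"

for p in ["q1", "q2", "q3"]:
    handle_request(p)
# LOAD_COUNT == 1: the "model" loaded once, not once per interaction
```

Streamlit offers its own caching decorators for the same reason; the decoupled architecture goes further by also moving the load out of the UI process entirely.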
Cross-architecture review is genuinely clever; same-family models failing together is a real problem we hit constantly when chaining LLMs. The 74.7% gap-closure number is doing heavy lifting here, though; curious what the baseline was. Teams building multi-agent review pipelines on Definable wire different model providers per node specifically to exploit this, and the routing layer makes swapping architectures per task surprisingly painless.