LLMs have become disturbingly capable pen-testers. With 579 lines of Python scaffolding code, an LLM can autonomously compromise an Active Directory network: privilege escalation, lateral movement, domain dominance, the whole chain, as tested against the GOAD (Game of Active Directory) testbed. We've just released a new version of Cochise (https://lnkd.in/dMJFCN-u), our open-source prototype for autonomous assumed-breach pentesting, with a focus on simplicity and readability. If you're researching LLM-based offensive security, it's meant as a baseline and starting point. The accompanying paper was accepted at ACM TOSEM, and I'll be presenting at ICSE in Rio de Janeiro next week. If you're there and want to grab a coffee or an after-conference drink, message me.
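The post's core claim is that a few hundred lines of glue suffice because the tradecraft lives in the model, not the scaffold. A minimal sketch of what such a loop could look like (hypothetical code, not Cochise's actual implementation; `query_model` is a stub standing in for a real LLM call):

```python
import subprocess

def query_model(history):
    """Stub standing in for a real LLM call. A real scaffold would send
    the command/output history to a model and receive the next shell
    command back; Cochise's actual prompting and interface will differ."""
    return "echo recon-step" if not history else "done"

def run_autonomous_loop():
    """Drive the model -> command -> observe loop until the model stops.
    The loop itself is trivial; the capability is in the model."""
    history = []
    while True:
        command = query_model(history)
        if command == "done":
            return history
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        history.append({"cmd": command, "out": result.stdout.strip()})
```

The point the post makes is visible here: nothing in the scaffold encodes privilege escalation or lateral movement; it only relays commands and observations.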
Andreas, interesting baseline, and the GOAD results are compelling. The 579-line scaffold is the point worth noting: the capability isn't in the code, it's in the model. The scaffold just removes the friction. That's exactly the threat model we've been building against. NIGHTFALL takes the opposite approach: 47 purpose-built offensive tools, zero LLM dependency, zero external API keys, and an Ed25519 cryptographic gate on destructive operations. The difference matters in production engagements, where you need a reproducible, auditable, evidence-chained result rather than a model making autonomous decisions you can't fully explain to a client. Cochise proves the concept; the question for practitioners is whether a proof of concept is sufficient for real engagements. Congratulations on the TOSEM acceptance.
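The "cryptographic gate on destructive operations" mentioned above can be sketched as a signature check before execution. NIGHTFALL is described as using Ed25519; since Ed25519 isn't in the Python standard library, this sketch substitutes HMAC-SHA256 to stay self-contained, and every name in it is illustrative rather than taken from the actual tool:

```python
import hmac
import hashlib

# Illustrative operator-held secret; a real gate would use an Ed25519
# private key held by the operator, with only the public key deployed.
APPROVAL_KEY = b"operator-held secret"

def sign_operation(op: str, key: bytes = APPROVAL_KEY) -> str:
    """Operator-side: produce an approval signature for one operation."""
    return hmac.new(key, op.encode(), hashlib.sha256).hexdigest()

def gated_execute(op: str, signature: str, key: bytes = APPROVAL_KEY) -> str:
    """Tool-side: refuse any destructive operation lacking a valid signature."""
    if not hmac.compare_digest(sign_operation(op, key), signature):
        raise PermissionError(f"unsigned destructive operation refused: {op}")
    return f"executed: {op}"
```

The design intent, as the comment frames it, is auditability: each destructive action carries an explicit, verifiable operator approval rather than an autonomous model decision.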
Genuinely would be interested to see a tool like this go head-to-head with an experienced AD pentester in an unknown lab that isn't purposefully vulnerable.
Very cool! Thank you for sharing
Based on a lab people solved earlier? You can just do it with a skill.md.