IBM released a new benchmark for evaluating enterprise agents on multi-hop, multi-source reasoning, called VAKRA 🙌🏻 Interesting to see gpt-oss-120b take second place, with Gemini-3-Flash topping the leaderboard 👏 Pretty cool how they consistently release enterprise agentic work (that no other lab does!). Get started with their demo → https://lnkd.in/dc_jVuW4
Merve Noyan Football stats and Disney voice actors are zero-blast-radius queries. No regulatory consequence if the answer is wrong. No audit trail needed. No fail-closed gate required. They're pedagogically clean but strategically misleading: they make "complex reasoning" look like a tool orchestration problem, when the actual enterprise problem is what happens when the AI reasons incorrectly about a payment instruction or a clinical decision. The whole framing treats tool invocation as the hard problem.
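To make the "fail-closed gate" point concrete, here is a minimal sketch of what such a gate might look like. All names (`AgentAnswer`, `fail_closed_gate`, the domains and threshold) are invented for illustration and are not part of VAKRA or any specific product:

```python
from dataclasses import dataclass

# Hypothetical high-stakes domains where a wrong answer has real consequences.
HIGH_STAKES = {"payments", "clinical"}

@dataclass
class AgentAnswer:
    text: str
    confidence: float  # model-reported confidence in [0.0, 1.0]
    domain: str

def fail_closed_gate(answer: AgentAnswer, threshold: float = 0.9):
    """Release the answer only if it clears the gate; otherwise escalate.

    Fail-closed: in a high-stakes domain, a low-confidence answer is
    never returned to the user, it is routed to a human reviewer.
    """
    if answer.domain in HIGH_STAKES and answer.confidence < threshold:
        return ("ESCALATE", None)
    return ("OK", answer.text)
```

The asymmetry is the point: a football-stats query sails through regardless of confidence, while a payment instruction below the threshold is blocked by default.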
The multi-hop degradation problem is real — each API hop is a potential context loss point. Curious how VAKRA accounts for compounding reasoning errors across hops vs. single-hop failures.
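A quick back-of-the-envelope model of why multi-hop degradation matters: if hops fail independently, end-to-end accuracy decays geometrically with depth. The numbers below are illustrative assumptions, not VAKRA results:

```python
def chain_success(p: float, hops: int) -> float:
    """End-to-end success probability of a chain of independent hops,
    each succeeding with probability p."""
    return p ** hops

# A per-hop accuracy that looks fine degrades quickly with depth:
# 95% per hop over 5 hops is only ~77% end to end.
print(round(chain_success(0.95, 5), 2))  # → 0.77
```

Real agents are worse than this independence assumption suggests, since context lost at one hop can corrupt every later hop, which is why distinguishing compounding errors from single-hop failures matters for a benchmark.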
@Kevin yes, fair critique, but this benchmark is targeting a specific slice of the problem. Multi-hop, multi-source, tool-grounded reasoning has been relatively underexplored, and it's a real challenge in practice. Knowing when to use a retriever versus a structured API, especially under tool-use constraints, is a decision agents routinely face when deployed. And of course this work doesn't replace the need to evaluate numerical reasoning, temporal consistency, domain-specific knowledge, instruction adherence, fail-safe behavior in high-stakes settings, and other forms of 'complex reasoning'.