IBM released a new benchmark for evaluating enterprise agents on multi-hop, multi-source reasoning, called VAKRA 🙌🏻 Interesting to see gpt-oss-120b take second place, with Gemini-3-Flash topping the leaderboard 👏 Pretty cool how they consistently release enterprise agentic work (that no other lab does!). Get started with their demo → https://lnkd.in/dc_jVuW4
Merve Noyan Football stats and Disney voice actors are zero-blast-radius queries. No regulatory consequence if the answer is wrong. No audit trail needed. No fail-closed gate required. They're pedagogically clean but strategically misleading: they make "complex reasoning" look like a tool orchestration problem, when the actual enterprise problem is what happens when the AI reasons incorrectly about a payment instruction or a clinical decision. The whole framing treats tool invocation as the hard problem.
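To make the "fail-closed gate" point concrete, here is a minimal sketch of what such a gate might look like. All names (`AgentAnswer`, `fail_closed_gate`, the domains and threshold) are invented for illustration and are not part of VAKRA or any specific product:

```python
from dataclasses import dataclass

# Hypothetical high-stakes domains where a wrong answer has real consequences.
HIGH_STAKES = {"payments", "clinical"}

@dataclass
class AgentAnswer:
    text: str
    confidence: float  # model-reported confidence in [0.0, 1.0]
    domain: str

def fail_closed_gate(answer: AgentAnswer, threshold: float = 0.9):
    """Release the answer only if it clears the gate; otherwise escalate.

    Fail-closed: in a high-stakes domain, a low-confidence answer is
    never returned to the user, it is routed to a human reviewer.
    """
    if answer.domain in HIGH_STAKES and answer.confidence < threshold:
        return ("ESCALATE", None)
    return ("OK", answer.text)
```

The asymmetry is the point: a football-stats query sails through regardless of confidence, while a payment instruction below the threshold is blocked by default.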
The multi-hop degradation problem is real — each API hop is a potential context loss point. Curious how VAKRA accounts for compounding reasoning errors across hops vs. single-hop failures.
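A quick back-of-the-envelope model of why multi-hop degradation matters: if hops fail independently, end-to-end accuracy decays geometrically with depth. The numbers below are illustrative assumptions, not VAKRA results:

```python
def chain_success(p: float, hops: int) -> float:
    """End-to-end success probability of a chain of independent hops,
    each succeeding with probability p."""
    return p ** hops

# A per-hop accuracy that looks fine degrades quickly with depth:
# 95% per hop over 5 hops is only ~77% end to end.
print(round(chain_success(0.95, 5), 2))  # → 0.77
```

Real agents are worse than this independence assumption suggests, since context lost at one hop can corrupt every later hop, which is why distinguishing compounding errors from single-hop failures matters for a benchmark.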
@Kevin yes, fair critique, but this benchmark is targeting a specific slice of the problem. Multi-hop, multi-source, tool-grounded reasoning has been relatively underexplored, and it's a real challenge in practice. Knowing when to use a retriever versus a structured API, especially under tool-use constraints, is a decision agents routinely face when deployed. And of course this work doesn't replace the need to evaluate numerical reasoning, temporal consistency, domain-specific knowledge, instruction adherence, fail-safe behavior in high-stakes settings, and other forms of 'complex reasoning'.