Debugging AI, One Edge Case at a Time

Over the past few days, I’ve pushed and merged a few PRs across different repos. Nothing flashy, but the kind of work that actually matters when you run AI systems in production.

A few examples:

  • NemoClaw → added real-world deployment notes (hardware constraints, known issues, NIM warnings). Basically the stuff you wish you knew before things break
  • NemoClaw → fixed installer/uninstaller issues that were causing inconsistent setups
  • OpenShell → updated routing logic for GPT-5+ compatibility (max_completion_tokens)
  • Megatron-LM → fixed a Python shutdown crash in async calls (the kind of bug that only shows up at the worst time)
  • kvpress → improved decoding state handling for more predictable outputs
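
The GPT-5+ routing change is the kind of compatibility shim that's easy to sketch: newer OpenAI-style models reject the legacy `max_tokens` parameter and expect `max_completion_tokens` instead. A minimal illustration in Python; the function name and the model-prefix check are my own assumptions, not the actual OpenShell code:

```python
def build_request_params(model: str, limit: int) -> dict:
    """Route the token-limit parameter based on the model family.

    Assumption (illustrative): models from "gpt-5" onward, like the
    o-series reasoning models, only accept `max_completion_tokens`,
    while older chat models still take `max_tokens`.
    """
    uses_completion_tokens = model.startswith(("gpt-5", "o1", "o3"))
    key = "max_completion_tokens" if uses_completion_tokens else "max_tokens"
    return {"model": model, key: limit}
```

The point of routing on the model name rather than catching the API error is that the request never leaves your infrastructure malformed in the first place.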

What I enjoy about this kind of work is that it sits right in the messy middle:

between GPUs, drivers, APIs, and actual usage in production.

And honestly, that’s where most problems are.

Not the model.

Not the theory.

But everything around it.

That’s also where I tend to focus:

making AI systems stable, reproducible, and usable outside of demos.

  • infra that doesn’t randomly break
  • deployments you can actually trust
  • systems that behave the same way twice
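
"Behaves the same way twice" usually starts with pinning down randomness. A minimal sketch of the idea, using only the standard library; the function name and seed are illustrative, not taken from any of the repos above:

```python
import random

def deterministic_sample(items: list, k: int, seed: int = 42) -> list:
    """Sample k items reproducibly: same inputs + same seed -> same output."""
    # A local RNG instance avoids depending on (or mutating) global state,
    # which is what makes runs diverge between environments.
    rng = random.Random(seed)
    return rng.sample(items, k)
```

The same discipline extends to dependency pins, driver versions, and container digests: remove every source of implicit state and the system stops surprising you.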

I’m based in Switzerland and currently open to senior roles around:

AI infrastructure, platform engineering, or anything where things need to actually work at scale.

If that’s what you’re building, happy to chat.

#AI #MLOps #Infrastructure #NVIDIA #LLM

More articles by Maxime Grenu