After 15+ years building computer vision systems, I tested Opus 4.7 on one of the simplest CV tasks: detecting cars in an aerial parking lot image.

The result? It took 4–5 minutes and still missed cars, produced false positives, and placed points on empty spaces. Bounding box results were even worse: boxes in completely wrong locations.

I also tested Codex, which did significantly better at pointing (found 24 cars with accurate placement). But it still took nearly 3 minutes. Meanwhile, a YOLO model does the same in 30 milliseconds.

Here's the takeaway for anyone building with AI right now: multimodal LLMs are incredible for reasoning, summarization, and code generation. But they are not a replacement for dedicated computer vision models.

For standard detection → train a model or use YOLO.
For open-vocabulary tasks like "find only red cars" → use Qwen 3 VL or MoonDream.

Both respond in milliseconds, cost a fraction of the tokens, and produce far more accurate results.

The right model for the right task. Always.

Full breakdown in the video. Link in comments.

#ComputerVision #AI #MachineLearning #YOLO #DeepLearning #ObjectDetection #LLM #AIEngineering #MLOps #OpenCV
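For anyone who wants the dedicated-model baseline from this comparison, here's a minimal sketch using the Ultralytics API; "parking_lot.jpg" is a hypothetical image, and yolo11n.pt is the smallest pretrained YOLO11 checkpoint (COCO weights, which include a "car" class):

```python
# Minimal sketch: car detection with a pretrained YOLO11 nano model.
# A single image runs in tens of milliseconds on a GPU.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")            # pretrained COCO weights
results = model("parking_lot.jpg")    # one forward pass

for box in results[0].boxes:
    label = results[0].names[int(box.cls)]
    if label == "car":
        print(label, box.xyxy.tolist(), float(box.conf))
```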
Yup, MoonDream.
What about models like Grounding DINO?
However, you can create a car detection model with Opus 4.7 in 5 minutes, everything except the training itself: it finds a dataset, picks a backbone, and writes up the code.
I also have a positive experience with Qwen 3 VL. This one specifically: Qwen3-VL-30B-A3B-Instruct-Q4_K_M: https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF/blob/main/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf In my tests it performed better than Qwen 3.5 or Gemma 4 (even larger, less quantized variants!) on the same open-vocabulary task. And it ran locally, partially offloaded to the GPU.
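A minimal sketch of that local setup, assuming llama-cpp-python as the runtime; n_gpu_layers is what controls the partial GPU offload, and the vision side (the model's mmproj file plus a matching multimodal chat handler) is omitted because it varies by build:

```python
# Hypothetical sketch: load the quantized GGUF above with partial GPU offload.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf",  # file from the link above
    n_gpu_layers=24,   # partial offload: tune to fit your VRAM
    n_ctx=8192,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What does open-vocabulary detection mean?"}]
)
print(out["choices"][0]["message"]["content"])
# Image input additionally needs the mmproj file and a compatible chat
# handler, which depends on your llama-cpp-python version.
```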
Totally agree with "right model for the right task". More often than not, I hear engineers/developers (including experienced ones) constantly advocating these huge LLMs (think Swiss Army knives) for all vision-related tasks. When asked why it takes so long to get results (e.g., 30 seconds to 1 minute), I get the standard response that it's the LLM provider's problem and not theirs. When pressed further on how to improve the evaluation metrics, the usual answer of "improve prompt engineering" comes up. *Sigh* BTW, I've had good experience with the YOLO, RF-DETR, and Qwen3.x-VL families whilst getting decent latency.
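For reference, the RF-DETR path is similarly short; a minimal sketch, assuming Roboflow's rfdetr package and a hypothetical "parking_lot.jpg" (predict() returns a supervision Detections object):

```python
# Minimal sketch: detection with a pretrained RF-DETR base model.
from PIL import Image
from rfdetr import RFDETRBase

model = RFDETRBase()                    # downloads pretrained COCO weights
image = Image.open("parking_lot.jpg")
detections = model.predict(image, threshold=0.5)
print(detections.xyxy)        # bounding boxes
print(detections.confidence)  # per-box scores
```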
Yeah, the YOLO models are the best for detection. They can be trained from scratch, and you can adapt the architecture to your needs. You don't need random blocks from papers: external blocks tend to hurt once you scale, so use YOLO's own internal blocks when modifying the architecture, e.g. adding a P2 head to YOLO11, using the seg or OBB variants, or integrating PSA/C2PSA blocks into YOLOv8 (or YOLOv8-seg/OBB). That can add around 1 ms of latency, but it's manageable given the better attention on small objects. Bolting other block types into the block file raises the chance of model failure once you extend to large datasets, and it tends to reduce accuracy and speed. So YOLO is better: stick to the base architecture, and in special cases swap in its own architecture-related blocks by replacing them in code. YOLO can be enhanced by changing blocks, but only its internal ones.
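In Ultralytics terms, that internal-block customization is usually done through a model YAML rather than by editing the block file directly; a minimal sketch, assuming the stock P2 variant YAML that ships with the package and a hypothetical "parking_lots.yaml" dataset config:

```python
# Minimal sketch: build a YOLOv8 variant with an extra P2 head for small
# objects from Ultralytics' stock YAML, then train it from scratch.
from ultralytics import YOLO

model = YOLO("yolov8n-p2.yaml")   # untrained; adds a P2 head for small objects
model.train(
    data="parking_lots.yaml",     # hypothetical dataset config
    epochs=100,
    imgsz=1280,                   # higher resolution helps aerial imagery
)
```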
Special-purpose tools always beat general-purpose tools, 24 hours a day. It's like asking whether a Swiss Army knife is better for peeling potatoes than a peeler - maybe a lousy analogy, but you get the point.