After 15+ years building computer vision systems, I tested Opus 4.7 on one of the simplest CV tasks: detecting cars in an aerial parking lot image.

The result? It took 4–5 minutes and still missed cars, produced false positives, and placed points on empty spaces. Bounding box results were even worse: boxes in completely wrong locations.

I also tested Codex, which did significantly better at pointing (found 24 cars with accurate placement). But it still took nearly 3 minutes. Meanwhile, a YOLO model does the same in 30 milliseconds.

Here's the takeaway for anyone building with AI right now: multimodal LLMs are incredible for reasoning, summarization, and code generation. But they are not a replacement for dedicated computer vision models.

For standard detection → train a model or use YOLO.
For open-vocabulary tasks like "find only red cars" → use Qwen 3 VL or MoonDream.

Both respond in milliseconds, cost a fraction of the tokens, and produce far more accurate results.

The right model for the right task. Always.

Full breakdown in the video. Link in comments.

#ComputerVision #AI #MachineLearning #YOLO #DeepLearning #ObjectDetection #LLM #AIEngineering #MLOps #OpenCV
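For anyone who wants the dedicated-model baseline from this comparison, here's a minimal sketch using the Ultralytics API; "parking_lot.jpg" is a hypothetical image, and yolo11n.pt is the smallest pretrained YOLO11 checkpoint (COCO weights, which include a "car" class):

```python
# Minimal sketch: car detection with a pretrained YOLO11 nano model.
# A single image runs in tens of milliseconds on a GPU.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")            # pretrained COCO weights
results = model("parking_lot.jpg")    # one forward pass

for box in results[0].boxes:
    label = results[0].names[int(box.cls)]
    if label == "car":
        print(label, box.xyxy.tolist(), float(box.conf))
```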
Yup, MoonDream.
What about models like Grounding DINO?
However, you can create a car detection model with Opus 4.7 in 5 minutes, everything except the training itself: it finds a dataset, picks a backbone, and writes up the code.
I also have a positive experience with Qwen 3 VL. This one specifically: Qwen3-VL-30B-A3B-Instruct-Q4_K_M: https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF/blob/main/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf In my tests it performed better than Qwen 3.5 or Gemma 4 (even larger, less quantized variants!) on the same open-vocabulary task. And it ran locally, partially offloaded to the GPU.
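A minimal sketch of that local setup, assuming llama-cpp-python as the runtime; n_gpu_layers is what controls the partial GPU offload, and the vision side (the model's mmproj file plus a matching multimodal chat handler) is omitted because it varies by build:

```python
# Hypothetical sketch: load the quantized GGUF above with partial GPU offload.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf",  # file from the link above
    n_gpu_layers=24,   # partial offload: tune to fit your VRAM
    n_ctx=8192,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What does open-vocabulary detection mean?"}]
)
print(out["choices"][0]["message"]["content"])
# Image input additionally needs the mmproj file and a compatible chat
# handler, which depends on your llama-cpp-python version.
```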
Totally agree with "right model for the right task". More often than not, I hear engineers/developers (including experienced ones) constantly advocating these huge LLMs (think Swiss Army knives) for all vision-related tasks. When asked why it takes so long to get results (e.g., 30 seconds to 1 minute), I get the standard response that it's the LLM provider's problem and not theirs. When pressed further on how to improve the evaluation metrics, the usual answer of "improve prompt engineering" comes up. *Sigh* BTW, I've had good experience with the YOLO, RF-DETR, and Qwen3.x-VL families whilst getting decent latency.
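For reference, the RF-DETR path is similarly short; a minimal sketch, assuming Roboflow's rfdetr package and a hypothetical "parking_lot.jpg" (predict() returns a supervision Detections object):

```python
# Minimal sketch: detection with a pretrained RF-DETR base model.
from PIL import Image
from rfdetr import RFDETRBase

model = RFDETRBase()                    # downloads pretrained COCO weights
image = Image.open("parking_lot.jpg")
detections = model.predict(image, threshold=0.5)
print(detections.xyxy)        # bounding boxes
print(detections.confidence)  # per-box scores
```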
Yeah, the YOLO models are the best for detection. They can be trained from scratch, and you can adapt the architecture to your needs. You don't need random blocks from papers: external blocks tend to hurt once you scale, so use YOLO's own internal blocks when modifying the architecture, e.g. adding a P2 head to YOLO11, using the seg or OBB variants, or integrating PSA/C2PSA blocks into YOLOv8 (or YOLOv8-seg/OBB). That can add around 1 ms of latency, but it's manageable given the better attention on small objects. Bolting other block types into the block file raises the chance of model failure once you extend to large datasets, and it tends to reduce accuracy and speed. So YOLO is better: stick to the base architecture, and in special cases swap in its own architecture-related blocks by replacing them in code. YOLO can be enhanced by changing blocks, but only its internal ones.
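In Ultralytics terms, that internal-block customization is usually done through a model YAML rather than by editing the block file directly; a minimal sketch, assuming the stock P2 variant YAML that ships with the package and a hypothetical "parking_lots.yaml" dataset config:

```python
# Minimal sketch: build a YOLOv8 variant with an extra P2 head for small
# objects from Ultralytics' stock YAML, then train it from scratch.
from ultralytics import YOLO

model = YOLO("yolov8n-p2.yaml")   # untrained; adds a P2 head for small objects
model.train(
    data="parking_lots.yaml",     # hypothetical dataset config
    epochs=100,
    imgsz=1280,                   # higher resolution helps aerial imagery
)
```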
Special-purpose tools always beat general-purpose tools, 24 hours a day. It's like asking whether a Swiss Army knife is better for peeling potatoes than a peeler - maybe a lousy analogy, but you get the point.