Improvements in Image Classification Techniques

Summary

Improvements in image classification techniques refer to advances in how computers and artificial intelligence systems identify and categorize images, making these systems faster, more accurate, and easier to interpret. Recent innovations combine physics principles, new AI architectures, and smart training strategies to enable reliable image recognition for tasks ranging from medical diagnostics to industrial quality control.

- Embrace hybrid solutions: Consider combining optical components and electronic neural networks to dramatically reduce the time and power needed for image classification, especially when processing large datasets.
- Fine-tune with care: Start with a pre-trained model and carefully adjust only the necessary layers based on how different your own images are from the model's original data, using thoughtful data augmentation when samples are limited.
- Prioritize interpretability: Use tools and frameworks that help explain how AI models make decisions, breaking down complex internal processes into understandable steps for safer and more trustworthy image recognition.

Optical and hybrid convolutional neural networks (CNNs) have recently attracted growing interest for low-latency, low-power image classification and computer-vision tasks. However, implementing optical nonlinearity is challenging, and omitting the nonlinear layers of a standard CNN significantly reduces accuracy. In a recent paper published in Advanced Photonics Nexus (in collaboration with Eli Shlizerman and authored by Minho Choi, Anna Wirth-Singh, and Jinlin Xiang), we use knowledge distillation to compress a modified AlexNet down to a single linear convolutional layer plus an electronic backend (two fully connected layers), obtaining performance comparable to a purely electronic CNN with five convolutional layers and three fully connected layers. We implement the convolution optically by engineering the point spread function of an inverse-designed meta-optic. With this hybrid approach, we estimate a reduction in multiply-accumulate operations from 17M in a conventional electronic modified AlexNet to only 86k in the compressed hybrid network enabled by the optical front end, over two orders of magnitude of reduction in latency and power consumption. Furthermore, we experimentally demonstrate that the classification accuracy of the system exceeds 93% on the MNIST dataset of handwritten digits. The paper can be found at: https://lnkd.in/gfkkadJR In a follow-up work, we extended this approach to CIFAR-10 and ImageNet with even greater improvements in power and latency: https://lnkd.in/gEks2TcQ
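Below is a minimal PyTorch sketch of the kind of compressed student network described above: a single bias-free linear convolution standing in for the meta-optic front end, followed by a two-layer fully connected electronic backend. The channel count, kernel size, and pooling step are illustrative guesses rather than the paper's configuration, and the knowledge-distillation training loop against the AlexNet teacher is omitted.

```python
# Hedged sketch of a hybrid optical/electronic student network.
# Layer sizes are assumptions, not the published architecture.
import torch
import torch.nn as nn

class HybridStudent(nn.Module):
    def __init__(self, num_kernels=8, hidden=64, num_classes=10):
        super().__init__()
        # "Optical" front end: a purely linear convolution (no bias, no
        # activation), the part implementable as an engineered point
        # spread function of a meta-optic.
        self.optical_conv = nn.Conv2d(1, num_kernels, kernel_size=7, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(4)  # crude stand-in for sensor readout
        # Electronic backend: two fully connected layers.
        self.backend = nn.Sequential(
            nn.Flatten(),
            nn.Linear(num_kernels * 4 * 4, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):  # x: (B, 1, 28, 28) MNIST images
        return self.backend(self.pool(self.optical_conv(x)))

model = HybridStudent()
print(model(torch.randn(2, 1, 28, 28)).shape)  # torch.Size([2, 10])
```

---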
In the last blog I talked about the importance of classical ML/DL. This post focuses on fine-tuning image models. (Part 4 of #ArchitectingAI)

Pre-trained image models are powerful. Fine-tuning them correctly is the real skill. Transfer learning lets you start with a backbone (ResNet, MobileNet, EfficientNet) already trained on millions of images and adapt it to your problem: less data, faster training, better results. I applied this to classify surface defects in an industrial steel project. It works well and rewards a meticulous approach; each wrong decision compounds. 5 key things actually matter:

1. **The frozen layers are already doing most of the work.** *Fine-tuning is refinement, not relearning.* Early layers in a pre-trained model capture universal patterns (edges, textures, shapes) that transfer across domains. Later layers are task-specific. Freezing the backbone and training only a new classification head gets you most of the way there. Unfreezing the whole network is rarely worth it.

2. **Domain gap determines your fine-tuning strategy.** *The further your data is from ImageNet, the more you need to unfreeze.* Natural images transfer easily. Industrial textures, medical scans, and satellite imagery have a larger domain gap and less directly applicable features. Know your gap before deciding how many layers to unfreeze: more gap means more fine-tuning needed.

3. **Data augmentation is not optional with small datasets.** *The model will memorise your data if you give it the chance.* Augmentation creates diversity the model hasn't seen (rotations, skews, flips, contrast shifts, brightness changes, blur). It forces generalisation over memorisation, which is critical when domain-specific data is limited. Apply it before model-specific preprocessing; in the wrong order you end up silently augmenting corrupted inputs.

4. **Freeze first. Fine-tune second.** *Always in that order.* A randomly initialised head will undo what you borrowed. New classifier weights start random, and early gradients are large and noisy; if the backbone is already unfrozen, they overwrite representations learned from millions of images. Train the head first, stabilise it, then unfreeze selectively.

5. **Treat pre-trained weights as fragile during fine-tuning.** *Use a much smaller learning rate.* A standard learning rate undoes the representations you were trying to preserve. Unfreeze only the last few layers; more layers at a higher learning rate tend to overfit or underfit fast.

Transfer learning is powerful because it builds on what's already been learned. Know your domain gap. Know what to freeze. Know when to fine-tune. Do it with care.
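As a concrete illustration of the freeze-first, fine-tune-second recipe above, here is a minimal PyTorch sketch. The ResNet-50 backbone, the four-class head, and the learning rates are placeholder assumptions, not prescriptions from the post.

```python
# Minimal sketch of the freeze-then-fine-tune workflow.
# Assumptions: ResNet-50, 4 defect classes, a train loop you provide.
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Start from ImageNet weights and freeze the whole backbone.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():
    p.requires_grad = False

# 2. Replace the head. Its weights start random, so train it first
#    while the backbone stays frozen.
num_classes = 4  # placeholder for your task
model.fc = nn.Linear(model.fc.in_features, num_classes)
model = model.to(device)
head_opt = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
# ... train the head for a few epochs here until it stabilises ...

# 3. Then unfreeze only the last residual stage and fine-tune it with a
#    much smaller learning rate than the head used: pre-trained weights
#    are fragile.
for p in model.layer4.parameters():
    p.requires_grad = True
ft_opt = torch.optim.AdamW([
    {"params": model.layer4.parameters(), "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-4},
])
```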
---
How Chain-of-Explanation (CoE) Reveals What Deep Vision Models "Actually" Learn

👉 Can we trust explanations from AI systems if their "reasoning" involves tangled concepts?

Modern deep vision models excel at tasks like image classification, but their decision-making process often resembles a black box. While concept-based explanation methods aim to decode these models by identifying interpretable visual patterns, they face two critical limitations:
- Manual concept annotation struggles with scalability and adaptability
- Polysemantic neurons (single neurons activating for multiple unrelated concepts) create unreliable explanations

Why Should We Care About Concept Polysemanticity? Imagine a neuron that fires for both "dog paws" and "wheel rims." Traditional explanation methods might label it simply as "textured surfaces," missing crucial context. This ambiguity:
1. Reduces trust in model explanations
2. Hampers debugging and improvement efforts
3. Limits deployment in safety-critical applications like medical imaging

The CoE paper introduces a measurable solution: Concept Polysemanticity Entropy (CPE), quantifying how "confused" a neuron's concept representation truly is.

👉 What Makes CoE Different? Instead of treating neurons as single-concept detectors, CoE:
1. Automatically describes concepts, using large vision-language models (LVLMs) to generate linguistic explanations of activated image regions
2. Disentangles polysemantic neurons into atomic concepts (e.g., decomposing "textured surfaces" into ["tire treads", "animal fur", "brick patterns"])
3. Builds explanation chains that trace how these atomic concepts interact across network layers to form decisions

👉 How Does This Work in Practice?
1. Global concept mapping: LVLMs analyze top-activating image patches for each neuron and generate natural language descriptions (e.g., "striped patterns" instead of just showing cat fur examples)
2. Context-aware filtering: for a given input image, semantic entailment models select the most relevant atomic concept from a neuron's polysemantic set
3. Decision pathway visualization: filtered concepts are connected across layers into human-readable chains, for example "Wavy edges → Feather textures → Bird wing structures → [Final prediction: Bald Eagle]"

👉 Why These Results Matter
- 36% average improvement in explanation accuracy over existing methods (GPT-4o/human evaluations)
- First quantitative metric (CPE) for evaluating concept clarity in neural networks
- Enables precise audits of model behavior: high-CPE layers flag areas needing architectural refinement

The framework isn't just about explaining models; it's a diagnostic tool for building more robust, interpretable vision systems. By converting entangled activations into contextualized concept chains, CoE bridges the gap between artificial and human-understandable reasoning.
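As a toy illustration of the CPE idea (the paper's exact definition may differ), one can treat the atomic concepts assigned to a neuron's top-activating patches as a categorical distribution and measure its Shannon entropy: a monosemantic neuron scores near zero, a polysemantic one scores high.

```python
# Toy sketch of Concept Polysemanticity Entropy (CPE): Shannon entropy
# of a neuron's concept distribution. Illustrative only, not the paper's
# exact formulation.
from collections import Counter
import math

def cpe(concept_labels: list[str]) -> float:
    """Entropy (bits) of the concepts assigned to a neuron's top patches."""
    counts = Counter(concept_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Monosemantic neuron: every top-activating patch maps to one concept.
print(cpe(["tire treads"] * 8))                       # 0.0 bits
# Polysemantic neuron: activations split across unrelated concepts.
print(cpe(["tire treads"] * 4 + ["animal fur"] * 4))  # 1.0 bit
```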
---
What if a 200-year-old physics equation could teach a neural network to see?

The damped harmonic oscillator, the same equation that describes a spring, a pendulum, and a guitar string, turns out to produce exactly the representations you need for medical image analysis.

Here's the idea: when you push a signal through an oscillator, you get three things for free from the dynamics:
→ Position (q): what the image contains
→ Momentum (p): where things change (boundaries, edges, textures)
→ Energy (H): how important each region is (a built-in saliency map)

Nobody designed these outputs for computer vision. They fall out of Hamilton's equations. But they happen to be precisely what you need for both segmentation and classification tasks in images! For segmentation: momentum highlights boundaries, and energy tells the decoder where to pay attention. For classification: pool all three into a compact vector and you have texture complexity, spatial activity, and content, all from one equation.

We call the framework HamVision, and we tested it across 10 medical imaging benchmarks (skin lesions, cardiac MRI, thyroid ultrasound, blood cells, pathology, and more). The results surprised us:
- State-of-the-art Dice scores on 4/4 segmentation benchmarks
- State-of-the-art accuracy on 2/6 classification benchmarks
- All with only 8.57M parameters, 2 to 12x smaller than competing methods

The part I find most interesting is not the numbers. It's that the momentum maps genuinely encode boundary information and the energy maps genuinely highlight salient regions, without ever being told to. These properties emerge entirely from the physics. You can open the model, look at the oscillator's state, and understand what it's doing in physical terms.

This work reinforces something I keep coming back to: physics isn't just a source of applications for AI, it's a source of ideas. The inductive biases that nature discovered over billions of years (conservation laws, symmetries, oscillatory dynamics) are computational primitives waiting to be repurposed.

The paper is now on arXiv: https://lnkd.in/dBqhmS2D
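For intuition, here is a small NumPy sketch of the oscillator readout idea, not the HamVision architecture itself: one damped oscillator per image row is driven column by column with pixel intensities, and q, p, H are read out along the way. The constants k, c, and dt are arbitrary illustrative choices.

```python
# Sketch: q tracks content, p spikes where intensity changes (edges),
# H flags active regions. NOT the published HamVision model.
import numpy as np

def oscillator_maps(image, k=1.0, c=0.5, dt=0.2):
    h, w = image.shape
    q = np.zeros((h, w))
    p = np.zeros((h, w))
    q_t = np.zeros(h)  # oscillator positions, one per row
    p_t = np.zeros(h)  # oscillator momenta
    for x in range(w):  # scan across columns, treating x as "time"
        u = image[:, x]
        # Damped Hamilton's equations with the spring pulled toward u:
        #   dq/dt = p,   dp/dt = -k (q - u) - c p
        q_t = q_t + dt * p_t
        p_t = p_t + dt * (-k * (q_t - u) - c * p_t)
        q[:, x] = q_t
        p[:, x] = p_t
    H = 0.5 * p**2 + 0.5 * k * (q - image) ** 2  # energy about equilibrium
    return q, p, H

img = np.zeros((64, 64))
img[:, 32:] = 1.0  # a vertical step edge
q, p, H = oscillator_maps(img)
print(np.abs(p).argmax(axis=1)[:3])  # momentum peaks near the edge at x=32
```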
---
This paper reviews the latest advances in AI technologies for skin cancer detection, focusing on how AI, especially DL and transformer-based models, is improving diagnostic accuracy, interpretability, and integration into clinical practice.

1️⃣ Skin cancer, particularly malignant melanoma, poses significant health risks and highlights the urgent need for early and accurate detection tools.
2️⃣ AI models like CNNs, YOLO, and transformer-based architectures now match or exceed dermatologists in tasks such as lesion classification and segmentation.
3️⃣ AI systems like 3DermSpot, DermoSight, and DermaSensor have received regulatory approvals and are used for real-time, noninvasive skin cancer screening in primary care.
4️⃣ Advanced data preprocessing methods, including noise reduction and data augmentation, are essential to improving image quality and model performance.
5️⃣ Transformer and multimodal models (e.g., SkinViT, SkinGPT-4, MONet) enhance diagnostic accuracy, robustness, and interpretability by integrating clinical text and image data.
6️⃣ Reinforcement learning and hybrid architectures further boost model adaptability, accuracy, and trustworthiness in real-world settings.
7️⃣ Key challenges remain: model interpretability, computational demands, dataset biases, and integration with clinical workflows and data privacy standards.
8️⃣ Future directions include optimizing resource efficiency, improving data diversity, and building interpretable, privacy-preserving AI tools for widespread clinical adoption.

✍🏻 Zhengyu Yu, Chao Xin, Yingzhe Yu, Jingjing Xia, Lianyi Han. AI dermatology: Reviewing the frontiers of skin cancer detection technologies. Intelligent Oncology. 2025. DOI: 10.1016/j.intonc.2025.03.002
---
🧪 New Machine Learning Research: Is ImageNet Outperformed by a Single Video?

How can a single long, unlabeled video rival the famous ImageNet dataset? Research by Shashank Venkataramanan (Inria), Mamshad Nayeem Rizve (University of Central Florida), João Carreira (Google DeepMind), Yuki Asano (University of Amsterdam), and Yannis Avrithis (Institute of Advanced Research in Artificial Intelligence, IARAI) explores this possibility through their method DORA.

- Research goal: The study aims to demonstrate that egocentric videos, like virtual "Walking Tours," can train image encoders as effectively as large-scale datasets like ImageNet.
- Research methodology: The researchers introduced DORA, a self-supervised method that learns from 10 long, high-resolution videos (average 1.5 hours each). The method tracks and identifies objects over time, using cross-attention in vision transformers.
- Key findings: DORA pretrained on a single Walking Tours video achieved 44.5% accuracy on ImageNet classification, outperforming existing models trained on curated datasets. It also improved object discovery scores to 56.2% on Pascal VOC 2012, a 5% gain over traditional methods.
- Practical implications: The method is well suited to tasks requiring data efficiency, such as urban scene understanding and autonomous vehicle navigation. With minimal labeled data, DORA offers an accessible approach to developing robust visual encoders.

#LabelYourData #MachineLearning #Innovation #AIResearch #MLResearch #ComputerVision #SelfSupervisedLearning #ImageNet #DataAnnotation
---
VariViT: A Vision Transformer for Variable Image Sizes
https://lnkd.in/eN3gidkx

Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding-box crop size produces input images with highly variable foreground-to-background ratios, and resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff.

We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times.

In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: this https URL

Newsletter: https://lnkd.in/emCkRuA
More stories: https://lnkd.in/eMFcEekQ
LinkedIn: https://lnkd.in/ehrfPYQ6

#AINewsClips #AI #ML #ArtificialIntelligence #MachineLearning #ComputerVision
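One common way to handle a variable number of patches, shown below as a hedged sketch rather than VariViT's actual scheme, is to bicubically interpolate the learned 2D positional-embedding grid to the new patch-grid size, as is often done when fine-tuning ViTs at new resolutions.

```python
# Sketch: resizing learned ViT positional embeddings to a variable patch
# grid via bicubic interpolation. VariViT's published scheme may differ.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_hw, new_hw):
    """pos_embed: (1, 1 + old_h*old_w, dim), with a leading [CLS] token."""
    cls_tok, grid = pos_embed[:, :1], pos_embed[:, 1:]
    dim = grid.shape[-1]
    grid = grid.reshape(1, *old_hw, dim).permute(0, 3, 1, 2)  # (1, dim, H, W)
    grid = F.interpolate(grid, size=new_hw, mode="bicubic", align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], dim)
    return torch.cat([cls_tok, grid], dim=1)

pe = torch.randn(1, 1 + 14 * 14, 768)             # embeddings for a 14x14 grid
pe_new = resize_pos_embed(pe, (14, 14), (9, 17))  # variable-size input
print(pe_new.shape)  # torch.Size([1, 154, 768])
```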
---
Pyramid Vision Transformers (PVTs) bridge the gap between CNNs and transformers. ❤️

Early Vision Transformers had a column-like architecture that was primarily designed for image classification and was not well suited to dense prediction tasks such as semantic segmentation, instance segmentation, and object detection. As a result, their output feature maps tended to be spatially coarse and blurry. Dense prediction tasks require multi-resolution feature representations.

PVT addresses this by building a hierarchical architecture: transformer blocks are stacked across multiple stages, and at each stage the feature map is patchified again and passed to the next level. This creates a pyramid of features at different spatial scales, similar to CNN backbones.

To make the computation efficient at higher resolutions, PVT introduces a Spatial Reduction Attention mechanism, which projects the key and value tensors into a lower-dimensional space before computing attention. This significantly reduces computational cost while preserving global context.
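Here is a hedged PyTorch sketch of spatial-reduction attention, simplified relative to the actual PVT implementation: keys and values come from a spatially downsampled copy of the tokens, so attention cost drops roughly from O(N²) to O(N·N/R²) for reduction ratio R.

```python
# Simplified spatial-reduction attention in the spirit of PVT; the real
# model adds per-stage embeddings, custom projections, and other details.
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim, num_heads=8, sr_ratio=4):
        super().__init__()
        # Strided conv that shrinks the key/value token grid by sr_ratio.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens of an h x w feature map, N = h * w.
        B, N, C = x.shape
        x2d = x.transpose(1, 2).reshape(B, C, h, w)
        kv = self.sr(x2d).flatten(2).transpose(1, 2)  # (B, N / sr_ratio**2, C)
        kv = self.norm(kv)
        # Queries keep full resolution; keys/values use the reduced tokens.
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out

sra = SpatialReductionAttention(dim=64)
tokens = torch.randn(2, 32 * 32, 64)   # a 32x32 feature map as 1024 tokens
print(sra(tokens, 32, 32).shape)       # torch.Size([2, 1024, 64])
```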