At the Intersection of Deep Learning and Computer Vision

I had to beat Saturday inertia and a cold November morning to spring out of bed and rush to a symposium organized by the IEEE Bangalore Section. Its website promised an array of speakers from both industry and academia, on topics that I keep hearing about and puzzling over but have little knowledge of or experience working on.

In the beginning

In his keynote address, Rajeev held up a crystal ball on how Amazon is trying to improve the customer experience with its product search engine. Using deep learning techniques such as recursive neural networks, a customer's query with specific product characteristics is semantically matched to product specification tags and images.

The 100-plus delegates keenly lapped up the loss functions mentioned in the presentation and how each is handled - log loss, siamese loss, twin loss and hinge loss. I was, however, more attentive when the speaker raised a semantic mismatch issue: a customer asking about the weight-bearing capacity of a specific camera tripod was answered with the weight of the tripod. Sellers who populate product metadata poorly are as much to blame as the algorithms for a poor customer experience during product search on the website.
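
Out of curiosity, I later sketched what two of the mentioned loss functions actually compute. This is my own toy illustration in NumPy, not Amazon's implementation; the embeddings, scores and margin are made up:

```python
import numpy as np

def hinge_loss(score_pos, score_neg, margin=1.0):
    """Ranking hinge loss: push the matching product's score above
    a non-matching one's by at least `margin`."""
    return np.maximum(0.0, margin - score_pos + score_neg)

def contrastive_loss(emb_a, emb_b, is_match, margin=1.0):
    """Siamese/contrastive loss: pull matching (query, product)
    embeddings together, push non-matching pairs apart."""
    d = np.linalg.norm(emb_a - emb_b)
    if is_match:
        return 0.5 * d ** 2
    return 0.5 * np.maximum(0.0, margin - d) ** 2

# toy embeddings for a query and two products
q = np.array([1.0, 0.0])
good = np.array([0.9, 0.1])   # relevant product, close to the query
bad = np.array([-1.0, 0.5])   # irrelevant product, far from the query

print(contrastive_loss(q, good, True))   # small: pair already close
print(contrastive_loss(q, bad, False))   # 0.0: pair farther than the margin
print(hinge_loss(score_pos=2.0, score_neg=0.5))  # 0.0: margin satisfied
```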

Deep convolutional neural networks were also employed for tasks such as image classification, color extraction and logo extraction, to name a few. To ensure that error rates are incrementally reduced, Amazon audits the answers that customers receive for their queries.

A gentle primer

Gowri is a veteran of the computer vision space, with publications spanning more than a decade. The professor was on stage to give the audience a primer on computer vision. For outsiders to the field like me, the session was a bonus.

All image recognition begins with digitizing the image - each pixel represented by red, green and blue intensity values. Techniques such as histogram transformation, histogram equalization, correlation masks, convolution with its kernel flip, and de-noising help in understanding an image. Segmentation and filtering are used to identify boundaries between components of an image and recognize the objects within.
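
To see what histogram equalization does, here is a small NumPy sketch I put together afterwards (my own illustration, not the professor's code): it stretches a low-contrast image so its intensities use the full 0-255 range.

```python
import numpy as np

def equalize_histogram(img):
    """Histogram equalization for an 8-bit grayscale image: remap
    intensities so the cumulative distribution becomes roughly linear."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    cdf_min = cdf[hist.nonzero()[0][0]]           # cdf at darkest used level
    scale = 255.0 / (cdf[-1] - cdf_min)
    lut = np.clip(np.round((cdf - cdf_min) * scale), 0, 255).astype(np.uint8)
    return lut[img]                               # apply the lookup table

# a low-contrast image: all values squeezed into [100, 120]
rng = np.random.default_rng(0)
dark = rng.integers(100, 121, size=(64, 64)).astype(np.uint8)
out = equalize_histogram(dark)
print(dark.min(), dark.max())  # narrow range
print(out.min(), out.max())    # 0 255
```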

The professor emphasized the need to understand the basics, even though recent innovations such as AlexNet and VGGNet have fast-tracked research in computer vision and pattern recognition using deep learning algorithms. In fact, as per the professor, the "convolutional" neural network should have been termed a correlation neural network, since no kernel flipping is done.
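
The flipping remark is easy to demonstrate. In the toy sketch below (my own, not from the talk), "CNN-style" convolution is really cross-correlation, and the two operations agree only when the kernel is symmetric under a 180° flip:

```python
import numpy as np

def cross_correlate(img, k):
    """'Convolution' as implemented in most deep learning frameworks:
    slide the kernel over the image with NO flipping (cross-correlation)."""
    h, w = k.shape
    H, W = img.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + h, j:j + w] * k)
    return out

def true_convolve(img, k):
    """Textbook convolution: flip the kernel along both axes first."""
    return cross_correlate(img, k[::-1, ::-1])

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 2.0], [3.0, 4.0]])      # asymmetric kernel
sym = np.array([[1.0, 2.0], [2.0, 1.0]])    # symmetric under a 180° flip

print(np.allclose(cross_correlate(img, k), true_convolve(img, k)))      # False
print(np.allclose(cross_correlate(img, sym), true_convolve(img, sym)))  # True
```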

Particularly interesting was the work by the professor's students to identify frames in videos. Most cricket highlight videos focus on the boundaries and sixes from a match. But what if a young player wants to understand how different bowlers bowl a yorker and prevent the batsman from scoring? The task requires going through the video frame by frame and identifying only those frames limited to pitch coverage and the bowler's action.

A Google search on the professor's work yielded many papers for later reading.

Learning with no prior knowledge

Soma unfurled the variety of research that she and her students are involved in. The IISc professor was on a different plane when she began with cross-modal matching. In my database management days, I was comfortable tuning database searches that were purely text based. Here was a researcher explaining how to semantically match text with images!

Almost 20 minutes passed before I understood the full import of what was being discussed. Whatever the input mode - text, video, audio, image - a generalized semantic-preserving hashing is applied to generate a hash code, and the hash codes are then matched. This would help eCommerce sites match customer queries against coarsely tagged images of objects.
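
As I understood it, the trick is that semantically close vectors should get nearby binary codes, whatever their modality. Here is a crude, made-up illustration of that idea using a fixed random projection - not the generalized semantic-preserving hashing from the talk, where the mapping itself is learned:

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, BITS = 16, 32
# One shared projection stands in for the learned encoders: the real
# research is in learning mappings that put text and image features
# into the same space before binarization.
proj = rng.normal(size=(DIM, BITS))

def to_hash(feature):
    """Binarize a feature vector into a BITS-bit code."""
    return (feature @ proj > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.sum(a != b))

query_text = rng.normal(size=DIM)                        # a text embedding
matching_img = query_text + 0.05 * rng.normal(size=DIM)  # semantically close
unrelated_img = rng.normal(size=DIM)

q, m, u = to_hash(query_text), to_hash(matching_img), to_hash(unrelated_img)
print(hamming(q, m), hamming(q, u))  # typically far smaller for the close pair
```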

Zero-shot learning is a research topic that the professor's students have started working on. In its basic form, the concept is about recognizing objects of a class that was never shown to the learning engine during training.

The group had earlier engaged in research to recognize an image using information from a set of images belonging to a different class. On screen, the professor presented research on training an engine to recognize a zebra for the first time after it had been trained only on horses.
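
A common way zero-shot learning is made to work is through shared attributes: describe each class by attributes such as "striped" or "has a mane", and an unseen class can then be recognized from its description alone. Here is my own toy sketch of that idea - the class signatures and "image features" are invented, and the attribute predictor is a stand-in:

```python
import numpy as np

# Each class is described by human-readable attributes, so a class never
# seen in training (zebra) can still be recognized from its description.
attributes = ["four_legged", "has_mane", "striped", "domesticated"]
class_signatures = {
    "horse": np.array([1, 1, 0, 1]),
    "cow":   np.array([1, 0, 0, 1]),
    "zebra": np.array([1, 1, 1, 0]),  # unseen at training time
}

def predict_attributes(image_features):
    """Stand-in for a trained attribute predictor (in practice a CNN
    trained only on seen classes such as horse and cow)."""
    return image_features  # here the features already live in attribute space

def zero_shot_classify(image_features):
    scores = predict_attributes(image_features)
    # pick the class whose signature best matches the predicted attributes
    return max(class_signatures,
               key=lambda c: scores @ class_signatures[c]
                             / (np.linalg.norm(class_signatures[c]) + 1e-9))

# a "zebra photo": striped, maned, four-legged, wild
zebra_like = np.array([0.9, 0.8, 0.95, 0.1])
print(zero_shot_classify(zebra_like))  # zebra
```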

I thought it would be some time before brick-and-mortar shops are wholly replaced by these sites. It looked like a lot of people will be employed in universities and organizations before many more people in retail shops are replaced. The discussions at the symposium sort of explained why I still find shopping on Bengaluru's famous Commercial Street more engaging than buying on eCommerce websites.

There is more involved inside a human mind than a set of text or image matches, even with the sophistication brought about by deep neural network algorithms. Open innovation has taken us much closer to replicating the human mind, but the distance still to be covered is a long one.

Impressive steps to improve accuracy

Pramod highlighted how, at Flipkart, he was involved in developing systems that understand a customer's query and match it with the right products and their images. It will take me a few summers to understand the underlying method of t-Distributed Stochastic Neighbor Embedding (t-SNE)! For the deeply interested, I can only point to the well-cited paper that formed the basis of the day's talk. The attempt, however, is to improve the accuracy of assessing a customer's intent and place the right product in front of her, nudging a purchase.
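
For those who, like me, want to at least see t-SNE in action before attempting the paper, here is a minimal usage sketch with scikit-learn. The "product embeddings" are my own toy data, not Flipkart's:

```python
import numpy as np
from sklearn.manifold import TSNE

# toy "product embeddings": two well-separated clusters in 50-D
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.5, size=(30, 50))
cluster_b = rng.normal(5.0, 0.5, size=(30, 50))
X = np.vstack([cluster_a, cluster_b])

# t-SNE embeds the 60 points in 2-D while trying to keep each point's
# high-dimensional neighbors as its neighbors in the map
emb = TSNE(n_components=2, perplexity=10, init="random",
           random_state=0).fit_transform(X)
print(emb.shape)  # (60, 2)
```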

Arijit was referred to by an earlier speaker and credited with discovering how indexed videos improved engagement. In the symposium talk, however, Arijit referred to his work at Amazon, and the presentation was based on a paper he had published. The crux of the talk was training an engine using a concept called the Generative Adversarial Network. I had to visualize two actors - one generating plausible images and another attempting to tell them apart from real ones - each sharpening the other until the accuracy of recognition improves.
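
The two-actor setup can be sketched end to end. Below is a deliberately tiny 1-D GAN of my own construction - not the paper's model: the generator is a linear map, the discriminator a single logistic unit, and the two are updated in alternation. The learning rate, seed and data distribution are arbitrary choices:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(1)
a, b = 1.0, 0.0    # generator G(z) = a*z + b, starts far from the data
w, c = 0.1, 0.0    # discriminator D(x) = sigmoid(w*x + c)
lr, batch = 0.05, 64

for step in range(500):
    real = rng.normal(3.0, 1.0, batch)   # real data ~ N(3, 1)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b

    # discriminator step: raise D on real samples, lower it on fakes
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w -= lr * (np.mean((d_real - 1) * real) + np.mean(d_fake * fake))
    c -= lr * (np.mean(d_real - 1) + np.mean(d_fake))

    # generator step: move fakes toward where D currently says "real"
    d_fake = sigmoid(w * (a * z + b) + c)
    a -= lr * np.mean((d_fake - 1) * w * z)
    b -= lr * np.mean((d_fake - 1) * w)

print(f"generator offset b = {b:.2f}")  # should drift toward the real mean of 3
```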

Tathagato walked us through his startup's journey in improving health care using a combination of product innovation and deep learning techniques. In an example where doctors and pathologists could miss detecting malarial parasites in a blood sample, SigTuple's device came to the rescue. Here was an example of a potential mistake born of fatigue being overcome by the pattern recognition accuracy of deep learning algorithms.

Intel, platinum sponsor of the event, is deep into building Xeon-based embedded platforms that incorporate deep learning through its OpenVINO toolkit. Rajesh walked through Intel's journey, particularly in factories across China. In one example, aluminium castings were inspected by technicians for defects such as cracks. The inspections are now automated using computer vision, with accuracy improved by deep learning algorithms.

Neeraj was brief but interesting in his explanation of how Dell, gold sponsor of the event, uses technology for inspecting assembled laptops. Of particular interest was the use of a generative adversarial network to generate textual descriptions of orders from scanned images and match them against the actual order details for semantic matching.

Improving engagement

Ashwini explained how his startup, Kalpnik Technologies, provides an engaging experience through its VR wear: devotees can put on the kit and have a devotional experience inside select temples. Ashwini then chatted with Shriram about the challenges of implementing AR/VR solutions. Shriram explained how augmenting a radioscopy image of the abdomen with images of other relevant parts helps the investigating doctor better understand the causes of symptoms. As Shriram spoke, I visualized a small snaky probe with a camera writhing through my blood vessel, allowing the doctor to view my innards, while the AR system superimposed relevant images from a database to help the doctor make sense of the probe's location and what he might be seeing. The challenges, however, are maintaining the plane of the image, matching the speed of movement, and other limitations of current technologies in the space.

Anirban's talk was the last of the day, but the issue it covered - identifying a person across datasets sourced from multiple cameras - was an interesting one. The problem is termed re-identification. It is about allowing an operator, such as a policeman, to recognize, for example, a reported missing person across multiple camera feeds, improving the chances of finding them. The method involved an engaging interaction for accurate detection, aided by computer vision and deep learning techniques.
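
A common framing of re-identification, as I understood it, is matching appearance embeddings of the query person against every camera's gallery. A made-up miniature version follows - the embeddings here are random stand-ins for what a CNN trained with a metric-learning loss would produce:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy re-identification: each camera holds a gallery of appearance
# embeddings; the missing person's embedding is matched against all.
rng = np.random.default_rng(7)
query = rng.normal(size=128)                     # missing person's embedding
gallery = {
    "camera_1": [rng.normal(size=128) for _ in range(3)],   # other people
    "camera_2": [query + 0.2 * rng.normal(size=128)],       # same person, new view
}

best = max(
    ((cam, i, cosine(query, emb))
     for cam, embs in gallery.items() for i, emb in enumerate(embs)),
    key=lambda t: t[2],
)
print(best[0])  # camera_2
```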

Manish is invested in making learning videos engaging. His startup VideoKen allows users to push in a normal video and get back the video plus a table of contents plus an index. Today most videos - whether for learning or marketing - do not get viewed beyond the first few seconds. Arijit's (mentioned earlier) insight was that videos are not watched linearly. Providing a table of contents lets the viewer jump to an interesting location, the way a reader would engage with a textbook. Manish's team used deep learning techniques such as Long Short-Term Memory networks on a combination of attributes from the video to automatically generate these textual props.
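
The LSTM pipeline itself is beyond a quick sketch, but the underlying idea of finding chapter boundaries can be illustrated with a much simpler bag-of-words comparison - my own toy example, not VideoKen's method. A boundary is declared where consecutive transcript windows stop sharing vocabulary:

```python
import numpy as np

# A crude stand-in for automatic chaptering: embed each transcript window
# as a bag of words and mark a chapter boundary where consecutive windows
# stop sharing vocabulary.
windows = [
    "gradient descent learning rate loss",
    "loss gradient descent convergence",
    "convolution kernel image filter",
    "filter kernel feature map image",
]

vocab = sorted({w for win in windows for w in win.split()})

def bow(text):
    """Unit-normalized bag-of-words vector over the shared vocabulary."""
    v = np.array([text.split().count(w) for w in vocab], dtype=float)
    return v / np.linalg.norm(v)

boundaries = [
    i + 1
    for i in range(len(windows) - 1)
    if bow(windows[i]) @ bow(windows[i + 1]) < 0.3  # low overlap -> new topic
]
print(boundaries)  # [2]: the talk shifts from optimization to convolution
```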

Niranjan directs drama as well as researching at the intersection of entertainment, computer vision and deep learning at TCS. He highlighted several use cases that leveraged multiple passes through long short-term memory networks. One example was placing a Coke can in frames that showed a desert. This allows a content owner to exercise her rights and creatively monetize by placing an advertisement without intruding on or messing up the user's experience while watching the content (say, a movie).

In conclusion

Universities, entrepreneurs and industry are continuously exploring along multiple edges. Multiple domains are visibly interacting at their edges with other domains, creating a variety of opportunities that will potentially take humanity to a different plane. While yesterday's shiny new things are today's commodities, it is worth noting Gowri's recommendation that all inventions are built on a detailed understanding and appreciation of yesterday's discoveries.

It is now almost a decade since I was introduced to digital technologies in my work environment. My time at the symposium was a quick 10-hour trip to a deep learning wonderland. It will not automatically change my work profile tomorrow, but I am convinced of what might be plausible in an AI-first world. The world that I painted in my earlier blog is possibly an interesting journey along that long edge.



