Perform Pose Estimation using Computer Vision

What is Computer Vision?

Computer vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From an engineering perspective, it seeks to understand and automate tasks that the human visual system can do.

Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and for extracting high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the form of decisions. Understanding in this context means the transformation of visual images (the input of the retina) into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.

What is Pose Estimation?


Human pose estimation and tracking is a computer vision task that involves detecting, associating, and tracking semantic keypoints. Examples of semantic keypoints are "right shoulder", "left knee", or the "left brake light of a vehicle".

Tracking semantic keypoints in live video footage requires substantial computational resources, which has limited the accuracy of pose estimation. With the latest advances, new applications with real-time requirements are becoming possible, such as self-driving cars and last-mile delivery robots.

Today, the most powerful image processing models are based on convolutional neural networks (CNNs). Hence, state-of-the-art methods are typically based on designing the CNN architecture tailored particularly for human pose inference.
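CNN-based pose estimators are commonly trained to predict one heatmap per joint rather than raw coordinates. As a minimal illustration of this idea (not tied to any specific model), the following NumPy sketch renders the Gaussian target heatmap that a network would be trained to match for a single keypoint:

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma=2.0):
    """Render a 2D Gaussian centered on a keypoint location.

    Heatmap-based CNN pose estimators are typically trained so
    that each output channel matches such a map for one joint.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Target for a keypoint at x=20, y=30 on a 64x48 grid.
heatmap = gaussian_heatmap(64, 48, center=(20, 30))
# The peak of the heatmap sits exactly on the keypoint.
print(np.unravel_index(heatmap.argmax(), heatmap.shape))  # (30, 20)
```

During training, the network's predicted heatmaps are regressed against such targets; at inference time, the argmax of each predicted channel recovers the joint location.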

Bottom-up vs. Top-down methods

All approaches for pose estimation can be grouped into bottom-up and top-down methods.

  • Bottom-up methods estimate each body joint first and then group them to form a unique pose. Bottom-up methods were pioneered with DeepCut (a method we will cover later in more detail).
  • Top-down methods run a person detector first and estimate body joints within the detected bounding boxes.
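The two strategies can be contrasted as pipelines. In this sketch, `detect_people`, `estimate_joints_in_box`, `detect_all_joints`, and `group_joints` are hypothetical stand-in stubs for trained models, used only to show the order of operations:

```python
# Hypothetical stubs standing in for trained models.
def detect_people(image):
    """Person detector: returns bounding boxes (x, y, w, h)."""
    return [(0, 0, 50, 100), (60, 0, 50, 100)]

def estimate_joints_in_box(image, box):
    """Single-person pose model applied inside one box."""
    x, y, w, h = box
    return {"left_knee": (x + w // 2, y + 3 * h // 4)}

def detect_all_joints(image):
    """Joint detector over the whole image (no person boxes)."""
    return [("left_knee", (25, 75)), ("left_knee", (85, 75))]

def group_joints(joints):
    """Group detected joints into per-person poses (trivially here)."""
    return [{name: pos} for name, pos in joints]

def top_down(image):
    # 1) detect people, 2) estimate joints inside each box
    return [estimate_joints_in_box(image, b) for b in detect_people(image)]

def bottom_up(image):
    # 1) detect all joints, 2) group them into individual poses
    return group_joints(detect_all_joints(image))

print(len(top_down(None)), len(bottom_up(None)))  # 2 2
```

Note the trade-off this structure implies: top-down cost grows with the number of detected people, while bottom-up runs the joint detector once and pays instead for the grouping step.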

Pose Estimation with Deep Learning

With the rapid development of deep learning in recent years, deep learning has been shown to outperform classical computer vision methods in various tasks, such as image segmentation and object detection. Accordingly, deep learning techniques have brought significant advances and performance gains to pose estimation as well.

Next, we will list and review the popular pose estimation methods.

The Most Popular Pose Estimation Methods


  • High-Resolution Net (HRNet)
  • OpenPose
  • DeepCut
  • AlphaPose

Deep Learning-based Pose Estimation Methods

  • High-Resolution Net (HRNet): a neural network for human pose estimation, used to locate keypoints (joints) of a specific person or object in an image. Its advantage over other architectures is that, whereas most existing methods recover high-resolution representations of postures from the low-resolution output of a high-to-low resolution network, HRNet maintains high-resolution representations throughout the estimation process.
  • OpenPose: one of the most popular bottom-up approaches for multi-person human pose estimation. OpenPose is an open-source, real-time multi-person detection system with high accuracy in detecting body, foot, hand, and facial keypoints. A further advantage is its API, which gives users the flexibility to select source images from cameras, webcams, and other inputs, which is especially important for embedded system applications.
  • DeepCut: another popular bottom-up approach for multi-person human pose estimation. The model works by detecting the number of people in an image and then predicting the joint locations for each person. DeepCut can be applied to images or videos containing multiple people or objects, for example, in football or basketball footage.
  • AlphaPose: a popular top-down method of pose estimation. It is useful for detecting poses in the presence of inaccurate human bounding boxes, i.e., it is designed to estimate human poses well even when the detected bounding boxes are imperfect. AlphaPose is applicable to both single- and multi-person pose estimation in images or video frames.
  • DeepPose: a human pose estimator that leverages deep neural networks. Its DNN regresses the locations of all body joints directly, built from convolutional, pooling, and fully connected layers.
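Heatmap-based models such as HRNet and OpenPose output one heatmap per joint, and keypoint coordinates are read off by taking each channel's peak. A minimal NumPy decoding sketch (the heatmaps here are synthetic stand-ins, not real model output):

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Convert (num_joints, H, W) heatmaps into (x, y) keypoint
    coordinates by taking the argmax of each channel."""
    coords = []
    for hm in heatmaps:
        row, col = np.unravel_index(hm.argmax(), hm.shape)
        coords.append((int(col), int(row)))  # (x, y) order
    return coords

# Two synthetic 4x4 heatmaps with known peaks.
hms = np.zeros((2, 4, 4))
hms[0, 1, 2] = 1.0  # joint 0 peaks at x=2, y=1
hms[1, 3, 0] = 1.0  # joint 1 peaks at x=0, y=3
print(decode_heatmaps(hms))  # [(2, 1), (0, 3)]
```

Real decoders usually add sub-pixel refinement around the peak, but the argmax step above is the core of how coordinates are recovered from heatmaps.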

Use Cases and Applications of Pose Estimation

  1. Human Activity Estimation
  2. Motion Transfer and Augmented Reality
  3. Training Robots
  4. Motion Tracking for Consoles
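Human activity estimation often starts from simple geometric features computed over the estimated keypoints, such as joint angles. A small sketch of that step (the hip/knee/ankle coordinates are made up for illustration):

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by points a-b-c,
    e.g. the knee angle from hip, knee, and ankle keypoints."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# Hypothetical hip / knee / ankle positions of a straight leg.
hip, knee, ankle = (0, 0), (0, 50), (0, 100)
print(round(joint_angle(hip, knee, ankle)))  # 180
```

Tracking how such angles change over consecutive frames is one common basis for classifying activities like sitting, walking, or squatting.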



