BlazePose: On-device Real-time Body Pose Tracking

From: 鈴木広大
Revision as of 22:29, 12 September 2025 (Fri) by AhmadTabor7


We present BlazePose, a lightweight convolutional neural network architecture for human pose estimation that is tailored for real-time inference on mobile devices. During inference, the network produces 33 body keypoints for a single person and runs at over 30 frames per second on a Pixel 2 phone. This makes it particularly suited to real-time use cases like fitness tracking and sign language recognition. Our main contributions include a novel body pose tracking solution and a lightweight body pose estimation neural network that uses both heatmaps and regression to keypoint coordinates.

Human body pose estimation from images or video plays a central role in various applications such as health tracking, sign language recognition, and gestural control. This task is challenging due to a wide variety of poses, numerous degrees of freedom, and occlusions. The common approach is to produce heatmaps for each joint along with refining offsets for each coordinate. While this choice of heatmaps scales to multiple people with minimal overhead, it makes the model for a single person considerably larger than is suitable for real-time inference on mobile phones.



In this paper, we address this particular use case and demonstrate significant speedup of the model with little to no quality degradation. In contrast to heatmap-based techniques, regression-based approaches, while less computationally demanding and more scalable, attempt to predict the mean coordinate values, often failing to address the underlying ambiguity. We extend this idea in our work and use an encoder-decoder network architecture to predict heatmaps for all joints, followed by another encoder that regresses directly to the coordinates of all joints. The key insight behind our work is that the heatmap branch can be discarded during inference, making the network sufficiently lightweight to run on a mobile phone. Our pipeline consists of a lightweight body pose detector followed by a pose tracker network. The tracker predicts keypoint coordinates, the presence of the person on the current frame, and the refined region of interest for the current frame. When the tracker indicates that there is no human present, we re-run the detector network on the next frame.
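The detector-then-tracker control flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `run_pipeline`, `detect`, and `track` are hypothetical names, and the two callables stand in for the detector and tracker networks.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical types for illustration only.
ROI = Tuple[float, float, float, float]
Keypoints = List[Tuple[float, float]]
TrackResult = Tuple[Keypoints, bool, ROI]  # keypoints, person present, refined ROI


def run_pipeline(
    frames: Iterable[object],
    detect: Callable[[object], Optional[ROI]],
    track: Callable[[object, ROI], TrackResult],
) -> List[Optional[Keypoints]]:
    """Call the detector only when no ROI is carried over from the previous frame."""
    roi: Optional[ROI] = None
    outputs: List[Optional[Keypoints]] = []
    for frame in frames:
        if roi is None:
            roi = detect(frame)        # stand-in for the body pose detector
        if roi is None:
            outputs.append(None)       # detector found nobody on this frame
            continue
        keypoints, present, refined = track(frame, roi)
        if present:
            outputs.append(keypoints)
            roi = refined              # reuse the refined ROI on the next frame
        else:
            outputs.append(None)
            roi = None                 # forces re-detection on the next frame
    return outputs
```

Under this scheme the detector cost is paid only on the first frame and after the tracker loses the person, which is what keeps the per-frame cost close to that of the tracker alone.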



The majority of modern object detection solutions rely on the Non-Maximum Suppression (NMS) algorithm for their last post-processing step. This works well for rigid objects with few degrees of freedom. However, this algorithm breaks down for scenarios that include highly articulated poses like those of humans, e.g. people waving or hugging. This is because multiple, ambiguous boxes satisfy the intersection over union (IoU) threshold for the NMS algorithm. To overcome this limitation, we focus on detecting the bounding box of a relatively rigid body part like the human face or torso. We observed that in many cases, the strongest signal to the neural network about the position of the torso is the person's face (as it has high-contrast features and fewer variations in appearance). To make such a person detector fast and lightweight, we make the strong, yet for AR applications valid, assumption that the head of the person should always be visible for our single-person use case. This face detector predicts additional person-specific alignment parameters: the middle point between the person's hips, the size of the circle circumscribing the whole person, and incline (the angle between the lines connecting the two mid-shoulder and mid-hip points).
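To make the NMS ambiguity concrete, here is a small sketch of IoU and greedy NMS applied to made-up boxes. The coordinates, the scores, and the 0.5 threshold are assumptions chosen for illustration and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep a box only if it does not overlap a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


# Two people hugging: full-body boxes overlap heavily, face boxes do not.
body_boxes = [(0.0, 0.0, 10.0, 20.0), (2.0, 0.0, 12.0, 20.0)]
face_boxes = [(3.0, 0.0, 5.0, 2.0), (7.0, 0.0, 9.0, 2.0)]
scores = [0.9, 0.8]

kept_bodies = nms(body_boxes, scores)  # second person is suppressed
kept_faces = nms(face_boxes, scores)   # both faces survive
```

In this toy case the two body boxes have an IoU of about 0.67, so one real person is discarded, while the compact face boxes stay separate.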



This allows us to be consistent with the respective datasets and inference networks. Compared to the majority of existing pose estimation solutions that detect keypoints using heatmaps, our tracking-based solution requires an initial pose alignment. We restrict our dataset to those cases where either the whole person is visible, or where hips and shoulders keypoints can be confidently annotated. To ensure the model supports heavy occlusions that are not present in the dataset, we use substantial occlusion-simulating augmentation. Our training dataset consists of 60K images with a single or few people in the scene in common poses and 25K images with a single person in the scene performing fitness exercises. All of these images were annotated by humans. We adopt a combined heatmap, offset, and regression approach, as shown in Figure 4. We use the heatmap and offset loss only in the training stage and remove the corresponding output layers from the model before running inference.
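The text does not say how the occlusion-simulating augmentation is implemented. One common way to simulate occlusion, shown here purely as an assumed sketch, is to paint a random rectangle over the training image. The function name `occlude` and its parameters are hypothetical.

```python
import random
from typing import List, Optional

Image = List[List[int]]  # grayscale image as rows of pixel values


def occlude(
    image: Image,
    max_fraction: float = 0.5,
    fill: int = 0,
    rng: Optional[random.Random] = None,
) -> Image:
    """Return a copy of the image with one random rectangle overwritten by `fill`."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Rectangle size is at most `max_fraction` of each image dimension.
    rect_h = rng.randint(1, max(1, int(height * max_fraction)))
    rect_w = rng.randint(1, max(1, int(width * max_fraction)))
    top = rng.randint(0, height - rect_h)
    left = rng.randint(0, width - rect_w)
    out = [row[:] for row in image]  # copy so the input stays untouched
    for y in range(top, top + rect_h):
        for x in range(left, left + rect_w):
            out[y][x] = fill
    return out
```

The keypoint annotations are left unchanged by such an augmentation, so the model is still asked to predict joints hidden behind the rectangle.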



Thus, we effectively use the heatmap to supervise the lightweight embedding, which is then used by the regression encoder network. This approach is partially inspired by the Stacked Hourglass approach of Newell et al. We actively use skip-connections between all the stages of the network to achieve a balance between high- and low-level features. However, the gradients from the regression encoder are not propagated back to the heatmap-trained features (note the gradient-stopping connections in Figure 4). We have found this to not only improve the heatmap predictions, but also substantially increase the coordinate regression accuracy. A relevant pose prior is a vital part of the proposed solution. We deliberately limit supported ranges for the angle, scale, and translation during augmentation and data preparation when training. This allows us to lower the network capacity, making the network faster while requiring fewer computational and thus energy resources on the host device. Based on either the detection stage or the previous frame keypoints, we align the person so that the point between the hips is located at the center of the square image passed as the neural network input.
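The alignment step can be illustrated with a small coordinate transform. This is a sketch under assumptions rather than the authors' code: it assumes image coordinates with y pointing down, rotates so that the line from mid-hip to mid-shoulder points straight up, and scales so that the circumscribing circle fills the square crop. The names `make_alignment`, `radius`, and `crop_size` are hypothetical.

```python
import math
from typing import Callable, Tuple

Point = Tuple[float, float]


def make_alignment(
    mid_hip: Point, mid_shoulder: Point, radius: float, crop_size: float
) -> Callable[[Point], Point]:
    """Build a function mapping image points into the aligned square crop."""
    # Angle of the hip-to-shoulder line measured from "up" (negative y).
    theta = math.atan2(mid_shoulder[0] - mid_hip[0], -(mid_shoulder[1] - mid_hip[1]))
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    scale = crop_size / (2.0 * radius)  # circle diameter spans the crop
    half = crop_size / 2.0

    def to_crop(p: Point) -> Point:
        dx, dy = p[0] - mid_hip[0], p[1] - mid_hip[1]
        # Rotate by -theta so the hip-to-shoulder line becomes vertical.
        x = cos_t * dx + sin_t * dy
        y = -sin_t * dx + cos_t * dy
        return (half + scale * x, half + scale * y)

    return to_crop


# A person lying sideways: shoulders are to the right of the hips.
to_crop = make_alignment(mid_hip=(100.0, 200.0), mid_shoulder=(150.0, 200.0),
                         radius=128.0, crop_size=256.0)
hip_in_crop = to_crop((100.0, 200.0))        # lands at the crop center
shoulder_in_crop = to_crop((150.0, 200.0))   # lands directly above the center
```

Because every input crop is normalized this way, the network only needs to handle a limited range of angle, scale, and translation.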