Realtime Multi-Person 2D Pose Estimationを試してみた！

画像から人の姿勢が分析できるOpenPose

こんにちは。

AI coordinatorの清水秀樹です。

DeepLearning系の参考書を読む度によく目にするRealtime Multi-Person 2D Pose Estimationを試してみたので、その内容の紹介になります。

別名OpenPoseとも言うようです。

簡単にどんなものかと言うと、人の姿勢を推論できるソフトウェアです。

上の動画のように、大変精度の高い推論が試せるようになっています。

はっきり言って、凄すぎです。

姿勢だけでなく、顔の輪郭や手の形の推論までできました。

かなり不気味なホラー画像となっていますが、そこはご容赦ください。

そして、画像からも分かるように、顔の輪郭や目の形、眉毛や口の位置まで推論できます。

かなり高性能ですね。

と言うわけで、参考サイトを元にここで紹介している通りに実行して頂ければ、簡単にOpenPoseを試すことができますので、興味がある方はぜひチャレンジしてみてください。

参考元サイトの紹介

GitHubに紹介されています。

日本語での説明があるところが嬉しですね。

開発環境

iMac (27-inch, Late 2012)

プロセッサ 2.9 GHz intel Core i5

macOS Sierra バージョン 10.12.4

Anaconda3-4.2.0-MacOSX-x86_64

python 3.5.2

chainer 3.1.0

ソースコードの紹介

caffemodel使用しますので、訓練済みcaffemodelのダウンロードが必要です。

cd models
wget http://posefs1.perception.cs.cmu.edu/OpenPose/models/pose/coco/pose_iter_440000.caffemodel
wget http://posefs1.perception.cs.cmu.edu/OpenPose/models/face/pose_iter_116000.caffemodel
wget http://posefs1.perception.cs.cmu.edu/OpenPose/models/hand/pose_iter_102000.caffemodel
python convert_model.py posenet pose_iter_440000.caffemodel coco_posenet.npz
python convert_model.py facenet pose_iter_116000.caffemodel facenet.npz
python convert_model.py handnet pose_iter_102000.caffemodel handnet.npz
cd ..

上記はGitHubで紹介されているまま掲載していますが、wgetはmacのターミナルでは使用できません。

そのため、直接URLにアクセスしてダウンロードしました。

wgetをmacでも使えるようにして欲しいですね。

それとも簡単に使える方法があるのでしょうか？

誰かご存知の方がいらっしゃったら、教えてください。

さて、基本的にGitHubに紹介されているソースコードそのままでも試せますが、人と顔と手の全てを動画で推論できるソースコードがなかったので、適当に作ってみました。

その代わりCPUですと、相当遅いです。

やっぱり高性能GPU搭載マシーンが欲しいところですね。

import cv2
import argparse
import chainer
from entity import params
from pose_detector import PoseDetector, draw_person_pose
from hand_detector import HandDetector, draw_hand_keypoints
from face_detector import FaceDetector, draw_face_keypoints

chainer.using_config('enable_backprop', False)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Pose detector')
    parser.add_argument('--gpu', '-g', type=int, default=-1, help='GPU ID (negative value indicates CPU)')
    args = parser.parse_args()

    # load model
    pose_detector = PoseDetector("posenet", "models/coco_posenet.npz", device=args.gpu)
    hand_detector = HandDetector("handnet", "models/handnet.npz", device=args.gpu)
    face_detector = FaceDetector("facenet", "models/facenet.npz", device=args.gpu)

    cap = cv2.VideoCapture('1.mp4')
    #cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    #cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

    while True:
        # get video frame
        ret, img = cap.read()

        if not ret:
            print("Failed to capture image")
            break

        # inference
        print("Estimating pose...")
        person_pose_array = pose_detector(img)
        res_img = cv2.addWeighted(img, 0.6, draw_person_pose(img, person_pose_array), 0.4, 0)

        # each person detected
        for person_pose in person_pose_array:
            unit_length = pose_detector.get_unit_length(person_pose)

            # face estimation
            print("Estimating face keypoints...")
            cropped_face_img, bbox = pose_detector.crop_face(img, person_pose, unit_length)
            if cropped_face_img is not None:
                face_keypoints = face_detector(cropped_face_img)
                res_img = draw_face_keypoints(res_img, face_keypoints, (bbox[0], bbox[1]))
                #cv2.rectangle(res_img, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (255, 255, 255), 1)

            # hands estimation
            print("Estimating hands keypoints...")
            hands = pose_detector.crop_hands(img, person_pose, unit_length)
            if hands["left"] is not None:
                hand_img = hands["left"]["img"]
                bbox = hands["left"]["bbox"]
                hand_keypoints = hand_detector(hand_img, hand_type="left")
                res_img = draw_hand_keypoints(res_img, hand_keypoints, (bbox[0], bbox[1]))
                #cv2.rectangle(res_img, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (255, 255, 255), 1)

            if hands["right"] is not None:
                hand_img = hands["right"]["img"]
                bbox = hands["right"]["bbox"]
                hand_keypoints = hand_detector(hand_img, hand_type="right")
                res_img = draw_hand_keypoints(res_img, hand_keypoints, (bbox[0], bbox[1]))
                #cv2.rectangle(res_img, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (255, 255, 255), 1)

        cv2.imshow("result", res_img)

        # escを押したら終了。
        if cv2.waitKey(1) == 27:
            break

    #終了
    cap.release()
    cv2.destroyAllWindows()

最近では物体検出や画像認識だけでなく、Realtime Multi-Person 2D Pose Estimationのように人の姿勢まで簡単に推論できるようになってきているようです。

また、ARと言った技術も出てきていますので、そのうち物体までの距離を測って人のサイズを推論するような学習モデルも出てきそうですね。

そんなモデルが出てきたら、色々なことに使えそうです。

それではまた。