Hand Detection and Tracking in Python using OpenCV and MediaPipe

Aditee Gautam
May 16, 2024


Hand recognition is an active research field in Human-Computer Interaction. In this machine learning project on hand gesture recognition, we are going to build a real-time hand gesture recognizer using the MediaPipe framework together with OpenCV in Python.

OpenCV is a real-time computer vision and image-processing library written in C/C++; we’ll use it through the opencv-python package, which provides its Python bindings.

How to use MediaPipe?

MediaPipe is a customizable machine learning solutions framework developed by Google. It is an open-source and cross-platform framework, and it is very lightweight. MediaPipe comes with some pre-trained ML solutions such as face detection, pose estimation, hand recognition, object detection, etc.

We’ll first use MediaPipe to recognize the hand and its key points.

MediaPipe’s Hand module utilizes key points to detect and track hands in images or video frames. These key points, also known as landmarks, represent specific anatomical points on the hand. MediaPipe returns a total of 21 such key points for each detected hand. Here’s a brief overview of these key points:

  0. Wrist: the base of the hand.

1–4. Thumb: four points running from the base of the thumb to its tip (the CMC, MCP, and IP joints, plus the tip).

5–8. Index finger: four points from the knuckle to the tip (the MCP, PIP, and DIP joints, plus the tip).

9–12. Middle finger: four points (the MCP, PIP, and DIP joints, plus the tip).

13–16. Ring finger: four points (the MCP, PIP, and DIP joints, plus the tip).

17–20. Pinky finger: four points (the MCP, PIP, and DIP joints, plus the tip).

These key points provide spatial information about the hand’s position, orientation, and configuration, which can be used for various tasks such as gesture recognition, hand tracking, virtual reality interactions, and more.

For example, by analyzing the relative positions and movements of these key points over time, you can infer gestures or hand poses, enabling intuitive interactions with applications and devices.
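To make that concrete, here is a minimal sketch of how landmark positions can drive a gesture check. It detects a “pinch” by measuring the normalized distance between the thumb tip and index-finger tip; the 0.05 threshold is an illustrative assumption, not a MediaPipe constant:

import math
import mediapipe as mp

mpHands = mp.solutions.hands

def is_pinching(hand_landmarks, threshold=0.05):
    # Landmark coordinates are normalized to [0, 1], so this distance
    # is resolution-independent. The threshold is an illustrative choice.
    thumb_tip = hand_landmarks.landmark[mpHands.HandLandmark.THUMB_TIP]
    index_tip = hand_landmarks.landmark[mpHands.HandLandmark.INDEX_FINGER_TIP]
    return math.hypot(thumb_tip.x - index_tip.x, thumb_tip.y - index_tip.y) < threshold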

In summary, MediaPipe’s Hand module leverages these hand key points to accurately detect and track hands in real-time, opening up possibilities for a wide range of interactive and immersive experiences.

Download:

  1. Python: Download Python | Python.org
  2. PyCharm: Download PyCharm: The Python IDE for data science and web development by JetBrains

Run these commands to install the packages in your Python setup:

  1. Install OpenCV:

pip install opencv-python


2. Install MediaPipe:

pip install mediapipe

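If both installs succeeded, the packages should import cleanly; a quick sanity check, run from the Python interpreter:

import cv2
import mediapipe
print(cv2.__version__, mediapipe.__version__)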

The hand-recognition code follows these steps:

  1. Import necessary packages:
import cv2
import time
import math
import mediapipe as mp

2. Initialize the MediaPipe model:

self.handsMp = mp.solutions.hands
self.hands = self.handsMp.Hands()
self.mpDraw = mp.solutions.drawing_utils

· The mp.solutions.hands module implements the hand-recognition pipeline, so we create a reference to it and store it in self.handsMp.

· The handsMp.Hands() constructor configures the model. Its keyword arguments include static_image_mode, max_num_hands (the maximum number of hands the model will detect in a single frame; the default is 2), min_detection_confidence, and min_tracking_confidence. MediaPipe can detect multiple hands in a single frame, but in this project we only read the landmarks of the first detected hand. Called with no arguments, as above, the constructor uses its defaults; the sketch below shows how to set them explicitly.

· mp.solutions.drawing_utils will draw the detected key points for us so that we don’t have to draw them manually.
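A minimal sketch of passing those parameters explicitly (these keyword arguments are MediaPipe’s documented ones; the values shown are simply its defaults):

import mediapipe as mp

hands = mp.solutions.hands.Hands(
    static_image_mode=False,       # video mode: track hands across frames instead of re-detecting each frame
    max_num_hands=2,               # maximum number of hands detected per frame
    min_detection_confidence=0.5,  # minimum confidence for the initial palm detection
    min_tracking_confidence=0.5,   # minimum confidence to keep tracking between frames
)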

3. Read frames from a webcam:

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

if not cap.isOpened():
    print("Cannot open camera")
    exit()

while True:
    ret, frame = cap.read()
    cv2.imshow('frame', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

· We create a VideoCapture object and pass the argument 0, which is the camera ID of the system. Here a single webcam is connected, so the default ID of 0 is correct; if you have multiple webcams, change the argument to match the camera you want.

· cap.read() reads each frame from the webcam, returning a success flag (ret) along with the frame itself.

· cv2.imshow() shows the frame in a new OpenCV window.

· cv2.waitKey(1) polls the keyboard between frames; the loop keeps the window open until the ‘q’ key is pressed. A more defensive version of this loop is sketched below.
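One caveat the loop above glosses over: cap.read() returns ret as False when no frame could be grabbed (for example, if the camera is disconnected). A sketch of a more defensive loop, with cleanup on exit:

while True:
    ret, frame = cap.read()
    if not ret:                                # frame grab failed, e.g. camera unplugged
        print("Can't receive frame, exiting...")
        break
    cv2.imshow('frame', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):      # exit when 'q' is pressed
        break

cap.release()                                  # free the camera
cv2.destroyAllWindows()                        # close the display window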

4. Detect hand keypoints:

def findFingers(self, frame, draw=True):
    imgRGB = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    self.results = self.hands.process(imgRGB)
    if self.results.multi_hand_landmarks:
        for handLms in self.results.multi_hand_landmarks:
            if draw:
                self.mpDraw.draw_landmarks(frame, handLms, self.handsMp.HAND_CONNECTIONS)
    return frame

def findPosition(self, frame, handNo=0, draw=True):
    xList = []
    yList = []
    bbox = []
    self.lmsList = []
    if self.results.multi_hand_landmarks:
        myHand = self.results.multi_hand_landmarks[handNo]
        for id, lm in enumerate(myHand.landmark):
            h, w, c = frame.shape
            cx, cy = int(lm.x * w), int(lm.y * h)
            xList.append(cx)
            yList.append(cy)
            self.lmsList.append([id, cx, cy])
            if draw:
                cv2.circle(frame, (cx, cy), 5, (255, 0, 255), cv2.FILLED)

        xmin, xmax = min(xList), max(xList)
        ymin, ymax = min(yList), max(yList)
        bbox = xmin, ymin, xmax, ymax
        if draw:
            cv2.rectangle(frame, (xmin - 20, ymin - 20), (xmax + 20, ymax + 20),
                          (0, 255, 0), 2)

    return self.lmsList, bbox

· MediaPipe works with RGB images, but OpenCV reads images in BGR format, so we convert each frame to RGB with the cv2.cvtColor() function.

· The process() function takes an RGB frame and returns a results object.

· Then we check whether any hand was detected, using results.multi_hand_landmarks.

· After that, we loop through each detected landmark and store its pixel coordinates in the list self.lmsList.

· The image width (x) and height (y) are multiplied with the landmark values because the model returns normalized results: every coordinate is a value between 0 and 1. A worked example follows below.

· And finally, using the mpDraw.draw_landmarks() function, we draw all the landmarks on the frame.
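As a quick worked example of that normalization: on a 640×480 frame, a landmark returned at (x=0.5, y=0.25) maps to pixel coordinates as follows:

h, w = 480, 640                        # frame height and width in pixels
lm_x, lm_y = 0.5, 0.25                 # normalized landmark coordinates from MediaPipe
cx, cy = int(lm_x * w), int(lm_y * h)  # pixel coordinates
print(cx, cy)                          # prints: 320 120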

Full code for hand tracking and detection

The complete source code of the hand-gesture-recognition project, saved as HandTrackingDynamic.py:

import cv2
import mediapipe as mp
import time
import math


class HandTrackingDynamic:
    def __init__(self, mode=False, maxHands=2, detectionCon=0.5, trackCon=0.5):
        self.__mode__ = mode
        self.__maxHands__ = maxHands
        self.__detectionCon__ = detectionCon
        self.__trackCon__ = trackCon
        self.handsMp = mp.solutions.hands
        # Pass the stored configuration through to MediaPipe.
        self.hands = self.handsMp.Hands(static_image_mode=self.__mode__,
                                        max_num_hands=self.__maxHands__,
                                        min_detection_confidence=self.__detectionCon__,
                                        min_tracking_confidence=self.__trackCon__)
        self.mpDraw = mp.solutions.drawing_utils
        self.tipIds = [4, 8, 12, 16, 20]  # landmark IDs of the five fingertips

    def findFingers(self, frame, draw=True):
        imgRGB = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        self.results = self.hands.process(imgRGB)
        if self.results.multi_hand_landmarks:
            for handLms in self.results.multi_hand_landmarks:
                if draw:
                    self.mpDraw.draw_landmarks(frame, handLms, self.handsMp.HAND_CONNECTIONS)
        return frame

    def findPosition(self, frame, handNo=0, draw=True):
        xList = []
        yList = []
        bbox = []
        self.lmsList = []
        if self.results.multi_hand_landmarks:
            myHand = self.results.multi_hand_landmarks[handNo]
            for id, lm in enumerate(myHand.landmark):
                h, w, c = frame.shape
                cx, cy = int(lm.x * w), int(lm.y * h)
                xList.append(cx)
                yList.append(cy)
                self.lmsList.append([id, cx, cy])
                if draw:
                    cv2.circle(frame, (cx, cy), 5, (255, 0, 255), cv2.FILLED)

            xmin, xmax = min(xList), max(xList)
            ymin, ymax = min(yList), max(yList)
            bbox = xmin, ymin, xmax, ymax
            print("Hands Keypoint")
            print(bbox)
            if draw:
                cv2.rectangle(frame, (xmin - 20, ymin - 20), (xmax + 20, ymax + 20),
                              (0, 255, 0), 2)

        return self.lmsList, bbox

    def findFingerUp(self):
        fingers = []

        # Thumb: compare x-coordinates of the tip and IP joint
        # (this test assumes a right hand facing the camera).
        if self.lmsList[self.tipIds[0]][1] > self.lmsList[self.tipIds[0] - 1][1]:
            fingers.append(1)
        else:
            fingers.append(0)

        # Other fingers: a finger counts as up if its tip is above its PIP joint.
        for id in range(1, 5):
            if self.lmsList[self.tipIds[id]][2] < self.lmsList[self.tipIds[id] - 2][2]:
                fingers.append(1)
            else:
                fingers.append(0)

        return fingers

    def findDistance(self, p1, p2, frame, draw=True, r=15, t=3):
        x1, y1 = self.lmsList[p1][1:]
        x2, y2 = self.lmsList[p2][1:]
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2

        if draw:
            cv2.line(frame, (x1, y1), (x2, y2), (255, 0, 255), t)
            cv2.circle(frame, (x1, y1), r, (255, 0, 255), cv2.FILLED)
            cv2.circle(frame, (x2, y2), r, (255, 0, 0), cv2.FILLED)
            cv2.circle(frame, (cx, cy), r, (0, 0, 255), cv2.FILLED)
        length = math.hypot(x2 - x1, y2 - y1)

        return length, frame, [x1, y1, x2, y2, cx, cy]


def main():
    ctime = 0
    ptime = 0
    cap = cv2.VideoCapture(0)
    detector = HandTrackingDynamic()
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    if not cap.isOpened():
        print("Cannot open camera")
        exit()

    while True:
        ret, frame = cap.read()
        if not ret:  # stop if no frame could be grabbed
            break

        frame = detector.findFingers(frame)
        lmsList, bbox = detector.findPosition(frame)
        if len(lmsList) != 0:
            print(lmsList[0])

        ctime = time.time()
        fps = 1 / (ctime - ptime)
        ptime = ctime

        cv2.putText(frame, str(int(fps)), (10, 70), cv2.FONT_HERSHEY_PLAIN, 3, (255, 0, 255), 3)

        cv2.imshow('frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
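The main() loop above doesn’t exercise the two remaining helpers, but they can be combined for gesture logic. A hedged sketch of how they might be called inside the loop (findFingerUp assumes findPosition has already populated lmsList, and its thumb test assumes a right hand facing the camera):

lmsList, bbox = detector.findPosition(frame)
if len(lmsList) != 0:
    fingers = detector.findFingerUp()     # e.g. [0, 1, 1, 0, 0]
    print("Fingers up:", fingers.count(1))
    # Distance between thumb tip (landmark 4) and index fingertip (landmark 8)
    length, frame, info = detector.findDistance(4, 8, frame)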

Result: running HandTrackingDynamic.py opens the webcam feed with the 21 hand landmarks drawn, a green bounding box around the detected hand, and the FPS counter overlaid in the top-left corner.
