How to Build Gesture Controls for AR Apps with MediaPipe Hand Tracking on iOS & Android

By Almaz Khalilov

TL;DR

  • You'll build: An AR demo app where you can control virtual objects using hand gestures (tracked via the phone's camera).
  • You'll do: Get MediaPipe Hand Tracking code → Install the SDK → Run a sample app on iOS and Android → Integrate hand tracking into your own AR app → Test gestures on a device.
  • You'll need: No account (open-source SDK), a smartphone with a camera (iPhone or ARCore-supported Android), and development tools (Xcode, Android Studio).

1) What is MediaPipe Hand Tracking?

Fig 1. MediaPipe Hand Tracking example output – 21 hand landmarks (finger joints) are tracked and shown as dots, with brighter dots indicating points closer to the camera.

MediaPipe Hand Tracking is a high-fidelity, real-time hand and finger tracking solution from Google. It uses machine learning to infer 21 3D keypoints (landmarks) of each hand from a single camera frame. It can track multiple hands at once (e.g. both hands) and even determines left vs. right hand automatically. Under the hood, MediaPipe employs a two-step pipeline: first a palm detector finds the hand's location, then a hand landmark model finds the 21 joint positions in that region. This model was trained on 30,000 real-world images plus synthetic hand images to improve robustness – it can handle various hand poses and even partial occlusion of fingers.

What it enables

  • Markerless, controller-free input: Use your bare hands as input for AR/VR apps instead of physical controllers. The camera feed is analyzed to track hand movements and gestures in real time, enabling natural interaction.
  • 21-point hand skeleton tracking: Get precise 2D/3D coordinates for 21 key hand landmarks (finger joints and tips) per hand. This high-detail tracking allows detecting complex hand poses and gestures (e.g. pinches, thumbs-up) by analyzing the landmark geometry.
  • Cross-platform on-device ML: Runs in real-time on modern mobile devices (Android & iOS) without needing a cloud server. MediaPipe is optimized for mobile GPUs/CPUs, achieving 30fps hand tracking on a phone and even handling two hands simultaneously.

When to use it

  • Hands-free AR/VR interactions: Ideal for mobile AR apps or VR experiences where users don't have dedicated controllers. For example, you can build touchless gaming setups or public AR installations where users use hand signs to interact.
  • Creative apps and education: Use gestures in interactive games, educational tools, or art apps. E.g., rehab or exercise games where patients perform hand movements as input, or sign language learning apps that recognize hand shapes.
  • No-touch user interfaces: In scenarios where touching a screen is impractical (kiosks, automotive HUDs, hygienic environments), hand tracking allows natural control. It's useful for UI navigation with mid-air gestures (swipes, grabs) in AR heads-up displays or smartphone AR without blocking the view.

Current limitations

  • Camera field of view: The hand must be visible to the camera at all times. The interaction area is limited to what the camera sees. For mobile AR, this often means the user can realistically use one hand (while holding the phone with the other), and gestures need to be performed within arm's length of the device.
  • Lighting and background: As a vision-based solution, tracking accuracy can degrade in poor lighting or against complex backgrounds. Low or strongly tinted lighting can significantly hurt performance. You may need to ensure good, even lighting or inform users to avoid backgrounds that confuse the hand detector (e.g. skin-toned or very cluttered backgrounds).
  • No direct depth sensing: MediaPipe provides relative 3D coordinates of the hand, but it doesn't know the absolute distance from the camera. For true spatial interactions (placing virtual objects in the world), you'll integrate with AR frameworks (ARCore/ARKit) to map 2D hand positions to real-world scale (see the sketch after this list). The hand landmarks' depth is relative – e.g. fingertips vs. palm – but not an absolute distance from the device.
  • Performance considerations: Real-time hand tracking is computationally intensive. Older or low-end devices might achieve lower frame rates. Running this alongside AR rendering can tax the device, so use GPU acceleration if possible and optimize your app's frame processing. (MediaPipe's pipeline avoids redundant work by only re-running the palm detector when tracking is lost to maintain speed.)
  • Preview SDK status: MediaPipe Hands (as part of MediaPipe Solutions) was initially released as a preview and beta. As of this writing, the APIs are fairly stable, but still marked as evolving. Check release notes for changes, and be cautious about shipping to production – test thoroughly on target devices and be prepared for updates as the library matures.
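To illustrate the AR framework hand-off mentioned above, here is a minimal Kotlin sketch, assuming an ARCore session is already running, that the MediaPipe input image is aligned with the on-screen camera view, and ignoring rotation/crop compensation. The helper name is ours, not part of either SDK:

    import com.google.ar.core.Anchor
    import com.google.ar.core.Frame
    import com.google.mediapipe.tasks.components.containers.NormalizedLandmark

    // Hypothetical helper: turn a normalized hand landmark into a world-space anchor
    // by raycasting against ARCore's detected geometry.
    fun anchorFromLandmark(
        frame: Frame,                    // current ARCore frame
        landmark: NormalizedLandmark,    // e.g. the index fingertip (landmark 8)
        viewWidthPx: Float,
        viewHeightPx: Float
    ): Anchor? {
        // Landmark coordinates are normalized [0, 1]; convert to screen pixels.
        val xPx = landmark.x() * viewWidthPx
        val yPx = landmark.y() * viewHeightPx
        // hitTest returns hits sorted by distance; take the closest surface, if any.
        val hit = frame.hitTest(xPx, yPx).firstOrNull() ?: return null
        return hit.createAnchor()
    }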

2) Prerequisites

Access requirements

  • No sign-up required: MediaPipe is open-source. You don't need a special developer account or portal access. All libraries and sample code are publicly available on GitHub and Google's Maven/CocoaPods. (License is Apache 2.0, free for commercial use.)
  • MediaPipe model file: You will need the pre-trained hand landmark model file. Google provides a .task model file (e.g. hand_landmarker.task) that the SDK uses. We'll download this in the steps below – no API key required, just a direct download.

Platform setup

iOS

  • Xcode 13+ (with iOS 13.0 or later SDK). Ensure you can run apps on a device running iOS 13 or higher (iOS 16+ recommended for ARKit support).
  • CocoaPods installed (for integrating the MediaPipe SDK). The MediaPipe iOS Tasks library is distributed as a Pod. Make sure you have the latest CocoaPods and run pod repo update if needed.
  • Physical iPhone or iPad (highly recommended). The hand tracking uses the camera; it won't fully work on the iOS Simulator (Simulator has no live camera feed by default). Use an actual device for testing AR + camera features.

Android

  • Android Studio Arctic Fox (2020.3.1) or newer with Android SDK 30+ (Android 11 or higher). Ensure you have a recent build environment for compatibility with the MediaPipe AAR.
  • Gradle 7+ and Kotlin 1.5+ (if using Kotlin). MediaPipe's Android solution is in an AAR library accessible via Gradle. Use a relatively modern Gradle version as required by Android Studio.
  • Physical Android phone (recommended). While you can use an emulator that emulates camera input, it's simpler to test on a real device with a camera. The device should support ARCore if you plan to do world AR (see Integration section), but for just hand tracking a device with Android 8.0+ and a decent camera is sufficient.

Hardware or mock

  • Smartphone with rear or front camera: This solution needs only a standard camera. No depth sensor, special glove, or wearable is required – "No controllers required" means your hand is the only "input device". For AR apps, you'll typically use the rear camera for world tracking, but the hand tracker works with front cameras too (e.g. selfie AR filters).
  • Your hand 🙂 (or a recorded video): To test, you'll use real hand movements. Optionally, for automated testing you can use a recorded video of hand gestures as a mock input (feed it into the app) to simulate gestures consistently during development. This can be handy in CI environments or when you can't physically perform gestures.
  • Good lighting environment: Not a hardware requirement per se, but ensure the area where you test has adequate lighting. This will significantly improve tracking stability (as discussed, dark or harsh lighting can impair the hand detection).

3) Get Access to MediaPipe Hand Tracking

  1. Go to the MediaPipe project repository: Navigate to the official MediaPipe sample projects on GitHub. The hand tracking example code is part of the google-ai-edge/mediapipe-samples repo. You can clone it with Git:

    git clone https://github.com/google-ai-edge/mediapipe-samples.git
    
    
  2. Retrieve the hand tracking sample: (Optional) After cloning, you can save time by using a sparse checkout to fetch only the hand tracking examples:

    cd mediapipe-samples
    git sparse-checkout init --cone
    git sparse-checkout set examples/hand_landmarker
    
    

    This will pull down only the hand landmarker sample apps (for iOS and Android). If you prefer, you can also download the repository ZIP and extract it, but the repo is large; sparse checkout is convenient to avoid downloading unnecessary files.

  3. No special permissions needed: There is no need to request access or join any beta program – the SDK is available publicly. (Google's AI Edge Portal mention is for analytics/benchmarking, not required for using the library.) Accepting terms isn't needed beyond the open-source license, which you agree to by using the code.

  4. Create a project (if integrating into your own app): If you plan to integrate into an existing app, ensure you have a project created in Xcode/Android Studio. You might make a copy of the example as a starting point. If just running the sample, you can skip this.

  5. Download the model file:

    • iOS: Download the hand landmark model (e.g. hand_landmarker.task) from Google's site. You can find it in the MediaPipe documentation or directly via the URL (for example, using wget as shown in Google's guide). Once downloaded, add this file into your Xcode project (e.g., drag it into the project navigator, ensuring "Copy items if needed" is checked). This will bundle the model in your app.
    • Android: Similarly, download hand_landmarker.task and place it in your app's app/src/main/assets/ directory. In the sample code, they use a constant path to load this asset. Make sure the asset file name matches what the code expects (the sample uses MP_HAND_LANDMARKER_TASK which is set to "hand_landmarker.task").

Done when: you have the MediaPipe Hand Landmarker sample code on your machine, and the model file (hand_landmarker.task) is downloaded and added to the project (iOS bundle or Android assets). At this point, you're ready to build and run the sample apps with hand tracking capability.


4) Quickstart A — Run the Sample App (iOS)

Goal

Run the official iOS hand tracking sample app and verify that hand landmark detection works on a real device's camera (and by extension, that gestures can be recognized with your hands in front of the device).

Step 1 — Get the sample

  • Clone or open project: If you followed the steps above, open the Xcode project located at mediapipe-samples/examples/hand_landmarker/ios/ (it may be an Xcode workspace if CocoaPods are used). The sample is typically an Xcode project configured with the necessary files.
  • Alternatively, download the sample as a ZIP from GitHub. Unzip it and open HandLandmarker.xcodeproj (or .xcworkspace) from the iOS example directory in Xcode.

Step 2 — Install dependencies

  • Install MediaPipe SDK via CocoaPods: The sample uses CocoaPods to include the MediaPipe Tasks SDK. Navigate to the ios sample folder in Terminal and run pod install. This will fetch the MediaPipeTasksVision pod (which contains the Hand Landmarker library). After this, open the generated .xcworkspace if not already open.
    • Note: Ensure you have CocoaPods installed (sudo gem install cocoapods if not) and an updated pods repo. The Podfile in the sample should already list pod 'MediaPipeTasksVision', which brings in the hand tracking SDK.
  • (SPM alternative): If preferred, you could integrate via Swift Package Manager by referencing the MediaPipe GitHub, but as of now the official docs recommend CocoaPods. Stick to the Pod for simplicity.

Step 3 — Configure app

Before running, check a few configuration details:

  • Add the model to the app bundle: If you haven't already, ensure hand_landmarker.task is added to the Xcode project and is included in the app target. In Xcode, you should see this file in the Project navigator (the sample might already include a placeholder or instructions to add it). In code, the sample will load it via Bundle.main.path(forResource: "hand_landmarker", ofType: "task") – if that returns nil, the file isn't bundled correctly.
  • Bundle Identifier: Optionally, set a unique Bundle ID for the sample app in the project settings if you intend to run it on your device (especially if an app with the same ID is already installed). For example, change com.google.mediapipe.handlandmarker to something like com.yourname.HandSample and update the provisioning profile if needed.
  • Privacy permission: In Info.plist, verify that a NSCameraUsageDescription key is present with a string explaining why the app uses the camera. iOS will require this for camera access. (The sample likely has this, but double-check to avoid a runtime crash when accessing the camera.)
  • Other capabilities: No special entitlements are needed since we are just using the camera. (If this were ARKit-based, you'd also enable ARKit capabilities, but the sample focuses on the ML aspect.)

Step 4 — Run

  1. Select the target device: In Xcode, choose your iPhone as the run destination (connect your device via USB or network and select it in the scheme dropdown).
  2. Build & Run: Hit the Run button (▶️). Xcode will compile the app and install it on the device. The first build may take a bit longer as it compiles the MediaPipe pod.
  3. Launch and observe: The app should launch on your iPhone, likely opening a camera view with some UI to switch between live camera and perhaps still image mode (depending on sample implementation).

Step 5 — Connect to camera input

  • Grant camera access: The first time it runs, you'll get an iOS permission prompt for camera usage. Tap "Allow". If you miss this or hit "Don't Allow," you'll need to enable it in Settings for the app to function.
  • Point the camera at your hand: Aim the device's camera so that one of your hands is fully visible in the frame. It may help to have a plain background behind your hand initially.
  • (No external wearable or controller is needed – your hand is the input. Just ensure the camera can see it clearly.)

Verify

  • Landmarks visible: You should see the app drawing hand landmarks on your hand in real-time – typically 21 points overlaying your fingers and palm joints. In some samples, connections between points (a skeleton outline) are drawn. Move your hand around; the landmarks should follow.
  • Basic gestures register: Try making a fist, open hand, or a thumbs-up. While the sample might not explicitly label gestures, you can verify the landmarks move accordingly (e.g., points bunch together for a fist, etc.). If the sample has a mode to detect gestures (some MediaPipe examples have a "Gesture Recognizer" mode), you might see text feedback for recognized gestures.
  • Multiple hands (if supported): If you have a friend nearby or can manage using the front camera, try getting two hands in view. The model can track two hands simultaneously if configured. The sample might only track one by default, but it's a good stress test – it should at least not crash with a second hand in frame.

Common issues

  • Build errors (Xcode): If you get build errors about missing modules or architectures, ensure you opened the .xcworkspace (not the .xcodeproj) after installing pods. Also check that you're building for a physical device – the pod may not include Simulator slices on Apple Silicon Macs, so run on a device.
  • Black camera view: If the app runs but shows a black screen, likely camera permission was denied or not working. Check the device Settings -> Privacy -> Camera and make sure your app has access. Also ensure your device isn't using the camera elsewhere (though iOS generally handles that).
  • No landmarks detected: If the camera feed is visible but no hand points appear:
    • Make sure the model file is loaded (check Xcode debug output; if it can't find the model, you'll see an error). If so, add the model file properly and rebuild.
    • Ensure your hand is within frame and reasonably centered. The detector might not pick up a hand cut off at the edge or too small/too large in the view.
    • Try better lighting or a simpler background if the environment is challenging.
  • App crashes on launch: This could be due to missing permission usage description (check Info.plist) or some code issue. Run with the debugger to see the log. A common mistake is forgetting to add the model or using an incorrect path, which might throw an exception when initializing the HandLandmarker – the sample code should handle errors, but it's worth checking. Fix any such issues (e.g., correct resource name) and try again.

5) Quickstart B — Run the Sample App (Android)

Goal

Run the official Android hand tracking sample app and verify that it can detect hand landmarks via the phone camera. This confirms the MediaPipe hand tracking is working on Android, so you can then integrate it into an AR app.

Step 1 — Get the sample

  • Import project: Using Android Studio, import the Android sample project found in mediapipe-samples/examples/hand_landmarker/android/. If you cloned with sparse checkout, open that folder as a project. It should contain an Android app module (with Gradle files, etc.).
  • Gradle sync: Android Studio will likely prompt to sync Gradle. Let it download any Gradle wrapper and plugin updates. The project should be configured with the necessary dependencies (we'll verify in the next step).

Step 2 — Configure dependencies

  • MediaPipe Tasks dependency: The sample app uses the MediaPipe Tasks Vision library. In the app-level build.gradle, ensure you see the dependency declaration for Hand Landmarker:

    implementation 'com.google.mediapipe:tasks-vision:latest.release'
    
    

    This brings in the MediaPipe vision AAR which includes hand tracking. The latest.release will resolve to the latest version (e.g., 0.10.+). If Gradle fails to find it, make sure you have Google's Maven repository enabled in your project (repositories { google() ... } in build.gradle). The sample likely has it set already.

  • Authentication token (not needed): MediaPipe artifacts are public, so you do not need any API token in gradle.properties. If you see references to needing a token in some contexts, that might be for certain Google services, but not for this library. The dependency should download without credentials.

  • Sync project: After confirming the dependency, click "Sync Project with Gradle Files" in Android Studio. It should download the com.google.mediapipe:tasks-vision library and any others. If it completes without errors, you have the SDK ready.

Step 3 — Configure app

  • Set applicationId (optional): The sample comes with an applicationId (e.g. com.google.mediapipe.handlandmarker). If this clashes with an existing app on your device, you can change it in the app's build.gradle (applicationId "com.yourcompany.handdemo"). Generally, it's fine as is.

  • Add permissions in AndroidManifest: Open AndroidManifest.xml. Ensure it includes:

    <uses-permission android:name="android.permission.CAMERA" />
    
    

    The sample should have this already, as camera access is essential. If the sample allows reading from the gallery, it might also have READ_EXTERNAL_STORAGE or the newer READ_MEDIA_IMAGES for Android 13+. For our purposes, camera permission is key.

  • AR support (if planning ARCore): (Not in sample by default) If you intend to extend this to an ARCore app, you would add:

    <uses-feature android:name="android.hardware.camera.ar" android:required="true"/>
    
    

    and include ARCore dependencies. The sample doesn't do AR, so this isn't needed just to test hand tracking, but remember this for integration later.

  • Check model file placement: Verify that hand_landmarker.task is present in app/src/main/assets/. The sample's code will try to load this asset by name. If you don't see it, add the model file (download as mentioned earlier) into the assets folder. Without it, the app will not be able to initialize the hand tracker.

Step 4 — Run

  1. Connect your Android device: Enable USB debugging on your phone and connect it to your PC. Ensure Android Studio recognizes it (you might need to approve the PC RSA key on the phone).
  2. Select Run configuration: In Android Studio, select the app module run configuration (usually already selected by default). Choose your device as the target.
  3. Run the app: Click the Run ▶️ button. The app will build (Gradle will compile the code and package the APK) and install it on your device. Watch the Run console for any issues.
  4. Launch on device: The app should auto-launch on your phone (or you can find the app icon, typically "Hand Landmarker", and open it).

Step 5 — Connect to camera input

  • Approve camera permission: Android will prompt you "Allow app to take pictures and record video?" when it first tries to use the camera. Grant this permission.
  • Test with your hand: You should see the camera preview on-screen. Place your hand in front of the camera. Move it around to see if the app begins drawing landmarks on your hand.
  • (No Bluetooth/wearable connection is needed – just the camera feed. If the sample has a switch for different modes (image vs live), make sure it's in live camera mode to continuously track.)

Verify

  • Hand landmarks drawn: The app should overlay points (and possibly connecting lines) on your hand in the camera view. You might see colored dots on each knuckle and fingertip. This means the hand tracking ML model is running successfully.
  • Real-time performance: Move your hand moderately fast or change gestures; the tracking should follow with minimal lag (on a modern phone, typically ~30 FPS). If you see a significant delay or choppy updates, the device might be struggling – check logcat for warnings (e.g., fallback to CPU if no GPU available can slow it down).
  • Gesture functioning: Try a thumbs-up or peace sign. The sample might not display gesture names (unless you're running a GestureRecognizer variant of the sample), but visually confirm that the landmarks align with the gesture (e.g., for thumbs-up, you should see the thumb's points extended upward, other fingers curled).
  • Multiple hands: If you have the ability, test two hands in view. The default config might only track one (to save resources). If only one hand is being tracked at a time, that's expected with default numHands=1. You could increase it in code to test, but be aware of performance. At least ensure the presence of a second hand doesn't break tracking of the first.

Common issues

  • Gradle build fails (dependency not found): If Gradle cannot resolve com.google.mediapipe:tasks-vision, ensure you have mavenCentral() or google() in the repositories. Also check your internet connection for Gradle. The dependency is on Google's Maven; no login needed. In some cases, you might need to add google() in both project and module build.gradle files.
  • App crashes on launch: Check logcat for errors. A common culprit is the model file missing in assets – the app might throw an exception when trying to load it. The logcat would show something like "Asset not found" or file open error. If so, add the hand_landmarker.task to assets and retry. Another cause could be missing camera permission in the manifest (should have been added in step 3) – on Android 11+, if you forgot that, the app will crash when requesting it. Add the <uses-permission> and reinstall.
  • No camera feed / blank screen: If you granted permission but see no camera preview, your device might not support the camera API being used or it's in use by another app. Close other camera apps. Also, some devices (or if using an emulator without camera) will show black. Use a real device with a working camera. Check logcat for any camera-specific errors.
  • Landmarks flicker or not stable: If the hand is at the edge of the frame or partially out, the detector may lose it and re-detect repeatedly. This can cause flicker. To improve stability, keep the hand fully in view and not too small. Also, rapid motion can momentarily break tracking until the palm detector catches up.
  • Low performance on older device: If running on a lower-end Android, you might get low FPS. Ensure the sample is using the GPU if available (the MediaPipe Tasks library may default to CPU on Android unless you enable acceleration via options). You can try enabling the GPU delegate in code (if not already) by building the HandLandmarkerOptions with a BaseOptions delegate set to GPU, or simply accept that older phones will have some lag.
  • App shows Connected = false or similar: Some samples might display a status. If it's saying not connected or waiting, it could be expecting a secondary device or sensor. However, the Hand Landmarker sample should just start. If there's a UI element showing status, read any on-screen instructions or check if you need to tap something to start the camera. Usually, it should start automatically.

6) Integration Guide — Add MediaPipe Hand Tracking to an Existing Mobile AR App

Goal

Integrate the MediaPipe Hand Tracking SDK into your own app and enable a hand-gesture-driven feature in an AR experience. We'll outline how to set up the hand tracking pipeline inside an app (iOS or Android) and use it to drive an example interaction (e.g., controlling a virtual object via gestures).

Architecture

Think of the integration in layers:

  • Camera + AR Feed → Hand Tracker → Gesture events → App logic. Your app likely already uses ARKit (iOS) or ARCore (Android) for the camera feed and world tracking. You'll tap into that camera feed (or use a separate one) to pass images to MediaPipe. The MediaPipe HandLandmarker runs on each frame to detect hand landmarks. From those, you determine if a certain gesture is made. That then triggers app UI or game logic – for example, "pinch fingers" = grab or select an AR object.

A possible structure:

  • App UI layer: e.g., a Unity scene or native AR view that shows virtual objects.
  • MediaPipe Hand Tracker (SDK client): manages the ML model, processes camera frames to output landmarks.
  • Gesture recognizer logic: interprets landmarks to high-level gestures (e.g., forms a fist, doing a thumbs-up).
  • Application state controller: takes gesture events and performs actions in the AR scene (e.g., toggling an object, moving something, or taking a photo, depending on your feature).

Step 1 — Install the SDK

Bring MediaPipe into your project as a dependency:

iOS (Xcode project):

  • Add the MediaPipe Tasks SDK via CocoaPods or Swift Package. E.g., in your Podfile:

    pod 'MediaPipeTasksVision'
    
    

    then pod install. This gives you access to HandLandmarker and related classes in Swift/ObjC. If using Swift Package Manager, point it to Google's repo (if supported) or add the .xcframework manually (advanced).

  • Import the library in code with import MediaPipeTasksVision.

Android (Gradle):

  • In your app module build.gradle, add:

    implementation "com.google.mediapipe:tasks-vision:0.10.29"  // or latest version
    
    

    (Replace latest.release with an explicit version for stability, e.g., 0.10.29, a 2025 release.) Ensure your app's minSdkVersion meets the minimum required by the MediaPipe Tasks library (check the setup guide for the version you use; recent releases target API 24+). Sync Gradle to pull the library.

  • Import classes in your Kotlin/Java code:

    import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarker
    import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarkerResult
    // etc.
    
    

    These classes come from the tasks-vision package.

Step 2 — Add permissions

Integrating into an AR app means your app likely already has camera permissions, but double-check:

iOS (Info.plist):

  • Ensure NSCameraUsageDescription is present with a user-facing reason (e.g., "This app uses the camera to track your hand and enable AR interactions."). Without this, iOS will block camera access.
  • If you plan to use the front camera, the same NSCameraUsageDescription covers it – no extra key is needed for hand tracking. (NSFaceIDUsageDescription relates to Face ID authentication, not camera capture, so it isn't required here.)
  • NSMicrophoneUsageDescription – not needed for hand tracking unless your app uses audio.
  • Privacy - Motion usage – not needed here.
  • Bluetooth permission – not needed (no external device).

Android (AndroidManifest.xml):

  • The app must have <uses-permission android:name="android.permission.CAMERA" />.
  • If using Android 13+, and you want to pick images from gallery as fallback, add READ_MEDIA_IMAGES permission (or legacy READ_EXTERNAL_STORAGE for older Android).
  • If your AR app uses other sensors or features, those permissions should already be in place (e.g., coarse location for ARCore's geospatial features, if any – not related to hand tracking though).
  • You might also include <uses-feature android:name="android.hardware.camera.ar" ...> if distributing an ARCore-dependent app so that only ARCore-capable devices install it.

Step 3 — Create a thin client wrapper

It's a good practice to encapsulate the hand tracking logic in its own component, so your UI code isn't cluttered with ML details. Create classes/services like:

  • HandTrackingManager: This will initialize the HandLandmarker SDK and manage the camera frame processing. For example, it might open the camera (if not already open via the AR session) or subscribe to AR frame updates. It provides methods like startTracking() and stopTracking(). It could also hold state such as whether a hand is currently detected.
  • GestureRecognitionService: Using the output from the HandTrackingManager (landmarks), this component interprets gestures. For instance, it could have logic to detect a "pinch" by measuring distance between thumb tip and index tip landmarks, or detect a "fist" by comparing curled finger landmark angles. This can be as simple or complex as needed – initially, implement one gesture for your feature.
  • PermissionsService: (if you don't have one) – to handle requesting camera permission and any other permission gracefully at runtime. Ensure this is called before starting camera/hand tracking.

Integrate these in your app's lifecycle:

  • Initialize HandTrackingManager (and load the model) on app launch or when you enter the AR feature screen. E.g., on iOS, you'll load the model into memory by creating HandLandmarker with the model path; on Android, call HandLandmarker.createFromOptions(context, options) to get an instance. This may take a brief moment (it loads the TFLite model).
  • Connect to the camera feed: If using ARKit/ARCore, you can get the camera frames from those APIs (ARKit's ARFrame.capturedImage, ARCore's frame.acquireCameraImage() or via OpenGL texture). Or simpler, start a secondary CameraX or Camera2 pipeline just for hand tracking. Choose one to avoid conflicts (ARCore can sometimes share camera if configured for CPU image output).
  • Each frame (or each AR update): pass the image to MediaPipe. On iOS, you might use HandLandmarker's detectAsync API (live-stream mode) or a synchronous detect call to feed a CMSampleBuffer or UIImage each frame. On Android, call handLandmarker.detectForVideo(mpImage, timestampMs) in video mode, or use live-stream mode and set a result listener that delivers results asynchronously (a minimal Kotlin setup sketch follows this list).
  • Handle the result: You'll get a set of hand landmarks (and possibly a handedness label, and a flag if hand present). If landmarks are present, feed them to your GestureRecognitionService to see if a target gesture is performed.
  • Emit events: If a gesture of interest is detected, have the system call back into your UI/game logic, e.g., via a delegate or LiveData/Flow (Android) or a Swift delegate/closure (iOS).
  • Manage lifecycle: Start tracking when the AR view appears (and permission is granted), stop tracking when view disappears (to free camera/CPU). Also handle errors: e.g., if model fails to load or camera fails to open, show an error to user.

Definition of done:

  • MediaPipe hand tracker initializes without errors when the feature starts (e.g., no missing model file, no incompatible device issues).
  • The camera feed is successfully being processed by the hand tracker (you get landmark results in your logs or callbacks).
  • When your hand is in view, the system detects it (even if you're not yet acting on it, you can log "Hand detected" to verify).
  • When you perform the chosen gesture, the system recognizes it (e.g., you log "Pinch gesture detected").
  • Clean up happens on exit – the camera and tracker stop so they don't consume resources in other parts of the app.

Step 4 — Add a minimal UI screen

Design a simple interface or feedback mechanism for the hand-controlled feature. This could be within your AR view or an overlay on it:

  • "Enable Hand Control" toggle: You might have a UI switch or button to turn on/off hand tracking (if it's not always on). This can call your HandTrackingManager start/stop.
  • Connection/status indicator: Since there's no external device, this could simply be an on-screen icon or text like "👋 Tracking" when a hand is detected. It gives feedback that the system sees your hand. For example, you could put a small green dot or "Hand: ✔️" when a hand is present, and gray or "Hand: ❌" when none.
  • Virtual object or target UI: If your feature is to control an object, place that object in the AR scene. For a quick demo, maybe a 3D cube or a balloon that appears.
  • Gesture action button (if needed): In some cases, you might still include a traditional UI button for fallback. But ideally, the gesture itself triggers the action. However, having a debug button "Perform Action" can help testing alongside the gesture.
  • Result display: E.g., if the gesture triggers capturing a photo or spawning an object, show the result. For instance, if a screenshot is taken, show a thumbnail. If an object is placed, visibly show it in scene or list it.

For our running example (say, pinch to spawn a virtual object), the UI could be:

  • A short text: "Pinch your index and thumb to place a cube."
  • A counter or indicator that an object was placed.
  • Perhaps a "Reset" button to remove all spawned objects (for testing repeatedly).

Keep the UI minimal so as not to clutter the AR view – use translucent overlays or small icons. The focus is that the user's hand is doing the main interaction.
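To tie the pieces together, here is a rough Kotlin sketch of this wiring. HandTrackingManager, ArSceneFacade, and the callback names are placeholders for your own components, not SDK types:

    // Hypothetical glue code for the minimal UI above. All names here are our own
    // placeholders; only the idea (status text + gesture-driven toggle) matters.
    interface ArSceneFacade { fun toggleDemoObject() }

    class HandTrackingManager {
        var onHandPresenceChanged: (Boolean) -> Unit = {}
        var onPinch: () -> Unit = {}
        // ... camera + HandLandmarker plumbing lives here ...
    }

    class HandControlUi(
        private val showStatus: (String) -> Unit,  // e.g. updates an overlay TextView
        private val arScene: ArSceneFacade
    ) {
        fun bind(manager: HandTrackingManager) {
            manager.onHandPresenceChanged = { present ->
                showStatus(if (present) "Hand: ✔️" else "Hand: ❌")
            }
            manager.onPinch = { arScene.toggleDemoObject() } // pinch places/removes the cube
        }
    }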


7) Feature Recipe — Use a Pinch Gesture to Toggle an AR Object

Goal

Implement a specific gesture-driven feature: when the user pinches their thumb and index finger together (like an "air click" gesture), the app will spawn a virtual object in the AR scene (or toggle its visibility). Pinch again, and the object disappears. This simulates a basic "select" action using a hand gesture, without touching the screen.

UX flow

  1. Connected & ready: The app is running the camera and has hand tracking active (user's hand is being tracked).
  2. User performs pinch: They bring thumb and index finger together in view of the camera.
  3. Gesture recognized: The app detects the pinch gesture.
  4. Action triggered: The app places a virtual object (e.g., a 3D cube) at a predefined location in the AR scene, or toggles it on/off.
  5. Feedback given: The object appears (if toggled on), possibly accompanied by a brief confirmation (sound or visual flash). The user opens their hand or un-pinches.
  6. User can repeat: Each distinct pinch (thumb-to-index touch) will toggle the object's presence or state.

(This is a simple recipe – in a real scenario, you might raycast from the hand to place objects where the user is "pinching", but that involves more AR logic. Here we keep it straightforward.)

Implementation checklist

  • Hand tracking running: Ensure the HandTrackingManager is updating landmarks every frame.
  • Permissions okay: Confirm camera permission was granted (otherwise this feature should be disabled or prompt user).
  • Detect pinch gesture: In your gesture recognizer, implement logic to detect pinch:
    • Identify landmark indices for thumb tip and index finger tip (in MediaPipe, thumb tip is landmark 4, index tip is 8 in the 21-point scheme).
    • Calculate the normalized distance between those two points. For example, in 2D image coordinates: dx = x_thumb - x_index, dy = y_thumb - y_index, distance = sqrt(dx² + dy²).
    • Determine a threshold: when fingers are apart, distance is larger; when pinched, distance becomes very small. You might set a threshold like "distance < 0.1 (in normalized units)" to consider it a pinch. Tune this if needed.
    • Additionally, ensure it's a deliberate pinch: you might require that the other fingers are relatively straight or otherwise not confounding the measurement (for a basic version, the distance check alone is fine).
  • Debounce gesture: Avoid toggling repeatedly while fingers are together. You can maintain a boolean state: pinchActive. When distance goes below threshold and pinchActive was false, that means a new pinch event just started → trigger the action and set pinchActive = true. When distance goes above threshold (fingers released) set pinchActive = false. This way, each pinch (press and release) triggers only once.
  • Trigger AR action: Once a pinch event is recognized (thumb just touched index):
    • If object is not in scene, place it; if it is, remove it. (For example, if using ARKit/ARCore, you might add or remove an ARAnchor with a model. Or if using SceneKit/Sceneform, toggle node visibility.)
    • As feedback, perhaps change the object's appearance briefly or play a sound.
  • Handle multiple pinches: Each time the user pinches again (after unpinching), the toggle happens again.
  • Edge cases: If the user keeps holding pinch for a long time, your debounce logic ensures it doesn't spawn multiple items. If the user's hand goes out of view while pinched, you might consider that a canceled gesture – but in this simple toggle, it's okay; it will just remain in whatever state last toggled.

Pseudocode

Here's a simplified pseudocode for the pinch detection logic within a frame update loop:


    // Swift-style pseudocode (the same logic applies in Kotlin/Java)

    var pinchActive = false    // state to track if currently pinching
    var objectVisible = false

    func onNewHandLandmarks(landmarks: [Point]) {
        guard landmarks.count == 21 else { return } // need a full hand

        let thumbTip = landmarks[4]
        let indexTip = landmarks[8]

        // compare squared distance in normalized coords (avoids a square root)
        let dx = thumbTip.x - indexTip.x
        let dy = thumbTip.y - indexTip.y
        let distSq = dx * dx + dy * dy
        let threshold: Float = 0.005 // tuned squared-distance threshold for "touch"

        if distSq < threshold {
            if !pinchActive {
                // Pinch just began
                pinchActive = true
                objectVisible.toggle() // flip state
                if objectVisible {
                    placeVirtualObject()            // show object in AR
                    showFeedback("✅ Object placed")
                } else {
                    removeVirtualObject()           // hide object
                    showFeedback("❌ Object removed")
                }
            }
        } else {
            // fingers apart – reset so the next pinch can trigger again
            pinchActive = false
        }
    }

This assumes onNewHandLandmarks is called each time we get an update (perhaps every video frame or every few frames).

You would integrate this into your HandTrackingManager's callback. For example, on Android, if using live stream mode you set a ResultListener that gives you a HandLandmarkerResult each frame; from that you extract the first hand's landmarks and apply the logic (a Kotlin sketch follows). On iOS, you can call the synchronous detect method for each frame, or use the live-stream delegate to receive results asynchronously.
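Here is a rough Kotlin version of the same pinch logic for the Android result-listener path; landmark indices 4 and 8 are the thumb and index fingertips, and the threshold is an illustrative starting point:

    import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarkerResult

    // Rough Kotlin counterpart of the pseudocode above, meant to be called from the
    // live-stream ResultListener. The threshold is a starting value to tune.
    class PinchDetector(private val onPinchToggle: () -> Unit) {
        private var pinchActive = false

        fun onResult(result: HandLandmarkerResult) {
            val hand = result.landmarks().firstOrNull()
            if (hand == null) {
                pinchActive = false          // hand lost: reset so we don't get stuck
                return
            }
            val thumbTip = hand[4]
            val indexTip = hand[8]
            val dx = thumbTip.x() - indexTip.x()
            val dy = thumbTip.y() - indexTip.y()
            val distSq = dx * dx + dy * dy

            if (distSq < 0.005f) {           // fingers touching (normalized coords)
                if (!pinchActive) {
                    pinchActive = true
                    onPinchToggle()          // fire exactly once per pinch
                }
            } else {
                pinchActive = false          // released
            }
        }
    }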

Troubleshooting

  • Pinch not detected reliably: You might need to adjust the threshold. If it's too low, slight gaps won't register; if too high, you might get false positives when the fingers are merely near each other. Print out the distance values while testing to fine-tune. Also, keep the hand from being too far from the camera – if the hand is very small in frame, the normalized distance may always be small. If so, consider using a ratio instead (e.g., the thumb-to-index distance divided by the distance between the index MCP (knuckle) and the wrist, as a scale) – see the sketch after this list.
  • Gesture misfires when not intended: Sometimes if you bring other fingers close, or your hand rotates profile, the thumb/index might appear close in 2D even if not pinching (occlusion). If you get false triggers, you can add an extra check: for example, also ensure that the angle of the index finger is bent (if doing more complex geometry), or that other fingertip distances are larger (to ensure it's specifically thumb-index touching). For basic usage, you might live with an occasional false positive but document that best practice is to face the camera directly.
  • No response in AR scene: If the object isn't appearing/disappearing, verify that the gesture logic is actually toggling objectVisible. Add logging or on-screen text when gesture triggers to ensure that part works. If it does, then the issue is likely with how you place/remove the AR object:
    • Check that you are adding the AR anchor on the main thread (for ARKit/ARCore).
    • If using a fixed position, ensure it's within view. For example, you might just place the object 1 meter in front of the camera for simplicity.
    • If nothing is visible, try a simpler approach: e.g., instead of AR, just overlay a 2D icon on the screen to confirm the toggle works. Then debug the 3D placement.
  • Continuous gesture (holding pinch) expectations: Our implementation toggles on pinch start. If you intended something like "keep holding pinch to hold an object", then you'd do a different approach (e.g., spawn on pinch start and maybe move it until pinch ends). For now, we toggle, so that should be clear to the user (maybe mention "pinch again to remove").
  • Multiple gestures interference: If you later add more gesture types, make sure to prioritize or differentiate them clearly. For example, an open palm vs pinch – if the user transitions, one gesture should end before another begins to avoid confusion. Using state machines can help (not needed for single pinch gesture).
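As referenced in the first troubleshooting item, a scale-normalized pinch check can be sketched like this (the 0.3 ratio threshold is illustrative):

    import com.google.mediapipe.tasks.components.containers.NormalizedLandmark
    import kotlin.math.hypot

    // Scale-normalized pinch check: compare the thumb-index distance against the
    // hand's own size (index knuckle to wrist) so the same threshold works whether
    // the hand is near or far from the camera. Indices: 0 = wrist, 4 = thumb tip,
    // 5 = index MCP (knuckle), 8 = index tip.
    fun isPinching(hand: List<NormalizedLandmark>, ratioThreshold: Float = 0.3f): Boolean {
        fun dist(a: NormalizedLandmark, b: NormalizedLandmark): Float =
            hypot(a.x() - b.x(), a.y() - b.y())

        val pinchDist = dist(hand[4], hand[8])
        val handScale = dist(hand[5], hand[0])        // rough hand size in the image
        if (handScale <= 0f) return false
        return pinchDist / handScale < ratioThreshold // threshold is illustrative
    }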

8) Testing Matrix

Now that you have hand gesture controls integrated, test your app under various scenarios to ensure reliability:

  • Recorded video input (mock) → Expected: hand tracking works on known inputs. Use a prerecorded video of a hand performing the pinch. Feed it into the app (if your pipeline allows) or simulate frames. The app should detect the gesture and toggle the object consistently. This helps in automated tests (CI) to validate the gesture logic – see the sketch below.
  • Real device – ideal conditions (close range) → Expected: low-latency, accurate tracking. Test on a physical device with the hand roughly 30–60 cm from the camera, in good lighting, against a plain background. The gesture should be recognized quickly (within about half a second) and reliably every time. This is the baseline functionality.
  • Real device – arm's length or partial view → Expected: tracking may drop if the hand is too far or partially out of frame. If the user's hand is at the edge of the camera view or a bit far away, detection might not find it. The expected behavior is that the app simply doesn't toggle in those cases (no false triggers). Document to users that the hand should stay in frame. Once tracking reacquires the hand, gestures should work again normally.
  • Background app / screen off → Expected: graceful pause and resume. When the app goes to the background or the phone locks, the camera and tracking should pause (to save battery). Upon resuming, the system should reinitialize if needed. There should be no crash, and tracking should continue when the app becomes active again. The virtual object state should remain (e.g., if it was on screen, it stays toggled on after resume).
  • Permission denied at runtime → Expected: shows an error and no tracking. If the user denies camera access, the feature cannot work. Your app should detect this and show a message like "Camera permission is required for hand gesture control." The app should not crash; it should either disable the hand tracking feature or prompt the user again (the former is better UX). Ensure the object doesn't spawn on its own in this case.
  • Hand removed mid-gesture → Expected: no action until a new pinch after re-detection. If the user starts pinching but moves the hand out of view before releasing, the system might never register the "release," leaving pinchActive stuck at true. Handle this in code: if no hand is detected for some frames, set pinchActive = false. Then, when the hand comes back, a new pinch can trigger the action. There should be no spurious toggle when the hand disappears; the user just pinches again after returning.
  • Low light / dark environment → Expected: may fail to detect (acceptable). Test in a dim room. The hand likely won't be detected well, or at all. The expected outcome is simply that nothing happens (or tracking is intermittent). This is acceptable, but note it. If your app is meant for low-light use, consider informing the user or providing an IR-based solution (beyond scope). No crashes should occur.
  • Busy background (complex scene) → Expected: minor false detections possible. Test with a background that has skin-like colors or lots of texture. The detector might occasionally see a "hand" where there isn't one (rare) or get confused. Ideally, no gestures trigger unexpectedly. If you notice false positives, you can tighten the detection confidence threshold (MediaPipe allows setting a confidence cutoff in options).

Each of these scenarios helps ensure your gesture control feature works robustly. Pay special attention to the background/lighting conditions, as those are the main factors affecting camera-based tracking.
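For the recorded-input scenario in the matrix above, one lightweight approach is to export a few frames from a test clip and run the landmarker in IMAGE mode inside an instrumented test. A hedged Kotlin sketch follows; the asset name is hypothetical:

    import android.content.Context
    import android.graphics.BitmapFactory
    import com.google.mediapipe.framework.image.BitmapImageBuilder
    import com.google.mediapipe.tasks.core.BaseOptions
    import com.google.mediapipe.tasks.vision.core.RunningMode
    import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarker
    import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarker.HandLandmarkerOptions

    // Sketch for mock-input testing: run the landmarker in IMAGE mode on a frame
    // exported from a recorded gesture video. "pinch_frame_0.png" is a hypothetical
    // asset you would add to your (test) assets directory.
    fun handDetectedInFrame(context: Context, assetName: String = "pinch_frame_0.png"): Boolean {
        val options = HandLandmarkerOptions.builder()
            .setBaseOptions(BaseOptions.builder().setModelAssetPath("hand_landmarker.task").build())
            .setRunningMode(RunningMode.IMAGE)
            .build()

        val landmarker = HandLandmarker.createFromOptions(context, options)
        try {
            val bitmap = context.assets.open(assetName).use(BitmapFactory::decodeStream)
                ?: return false                       // asset missing or not decodable
            val result = landmarker.detect(BitmapImageBuilder(bitmap).build())
            return result.landmarks().isNotEmpty()    // true if at least one hand was found
        } finally {
            landmarker.close()
        }
    }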


9) Observability and Logging

To maintain and debug the feature in production, add logging and analytics for key events. This will help you understand how users are interacting and where issues might arise:

  • Startup events: Log when hand tracking starts and stops.
    • hand_tracking_start – when the camera and model are activated.
    • hand_tracking_stop – when they are turned off (e.g., leaving AR mode).
  • Permission state: Log if camera permission is not granted or if the user was prompted.
    • hand_tracking_permission_denied – could be a one-time log if user permanently denied, so you know how many users can't use the feature.
  • Hand detection events:
    • hand_found – each time a hand is detected after not being present.
    • hand_lost – each time tracking loses the hand (e.g., hand leaves frame). You might also log duration the hand was present.
  • Gesture attempt events:
    • pinch_start – when a pinch gesture is first recognized (even if it doesn't lead to a toggle, e.g., because a pinch was already active).
    • pinch_end – when the pinch is released.
    • If you implement other gestures, similarly log fist_start, etc.
  • Gesture action outcome:
    • pinch_toggle_on – object was shown.
    • pinch_toggle_off – object was hidden.
    • If you have more complex outcomes, log those (e.g., if pinch selects one of several objects, log which one).
  • Performance metrics: It could be useful to log the average processing time or frame rate:
    • hand_tracking_fps (maybe periodically log how many frames per second the tracking runs, or log if it drops below a threshold).
    • hand_tracking_inference_ms – average milliseconds per frame for the hand landmark model. If this isn't exposed by the high-level API, you could instrument it yourself (e.g., by timing around the detect call).
  • Error logging: If the MediaPipe throws an error (e.g., model load failed, camera failed):
    • hand_tracking_error with details (exception messages, etc.).
    • If using a logging service, surface these so you know if many users are hitting an issue.

By analyzing these logs, you can answer questions like: How often do users engage with hand gestures? Are there many false triggers (e.g., hand_found but no pinch_start)? Does tracking frequently stop and start (maybe indicating users moving out of frame a lot)?

Additionally, if you have analytics, you might track a funnel:

  • How many users enter the AR feature vs how many actually perform a gesture successfully. This can gauge usability.

Make sure logging does not spam too much (e.g., don't log every single frame). Focus on state changes and user-triggered events. Logging and analytics will ensure you can observe the feature's performance in the wild and iterate on improvements (like adjusting gesture recognition parameters or updating the model version in future).
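As a small illustration of logging only on state transitions (event names follow the list above; the log callback stands in for whatever analytics or logging facility you use):

    // Emits the events listed above only when state actually changes, so per-frame
    // updates don't flood the logs. `log` is a placeholder for your analytics sink.
    class HandTrackingTelemetry(private val log: (String) -> Unit) {
        private var handPresent = false
        private var pinching = false

        fun onFrame(handDetected: Boolean, pinchDetected: Boolean) {
            if (handDetected != handPresent) {
                handPresent = handDetected
                log(if (handDetected) "hand_found" else "hand_lost")
            }
            if (pinchDetected != pinching) {
                pinching = pinchDetected
                log(if (pinchDetected) "pinch_start" else "pinch_end")
            }
        }
    }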


10) FAQ

  • Q: Do I need any special hardware (depth camera, gloves, etc.) to use MediaPipe hand tracking?

    A: No special hardware is needed – just a standard RGB camera on your device. MediaPipe Hand Tracking works with the regular camera feed, using AI to infer depth and landmarks. You don't need depth sensors or markers on the hand. This makes it usable on most smartphones. Just be mindful of lighting and camera positioning for best results.

  • Q: Which devices and platforms are supported?

    A: MediaPipe Hand Tracking supports Android and iOS devices. It's been demonstrated to run in real time even on mid-range phones. For Android, devices should ideally support ARCore if you're doing AR (for world tracking), but the hand tracker itself doesn't require ARCore. On iOS, any device running iOS 13+ with a camera (iPhone 6s or later) can run it; newer devices (with A12 Bionic or later) will perform better thanks to neural processing units. There is also a web version and Python version of MediaPipe Hands for other environments, but for this guide we focus on mobile apps.

  • Q: Can I use this in a production app?

    A: Yes, many apps use MediaPipe for hand tracking (e.g., some filter apps, games). The code is Apache-licensed. However, note that as of late 2023 it was labeled preview, so you should test thoroughly. The models are quite robust (trained on many images) and work in real time, but you'll need to account for edge cases (lighting, etc.) in your UX. Also, keep the SDK updated – Google continues to improve it (the version 0.10+ in 2025 is much improved over earlier iterations). Always comply with platform privacy rules (e.g., if you use the camera in background, disclose it, etc.). In short, it's production-capable, but do your due diligence with testing and user guidance.

  • Q: What gestures can it recognize out of the box?

    A: By default, the Hand Landmarker gives you raw landmark positions – it doesn't label gestures. You (the developer) interpret those for gestures you care about. Google provides a Gesture Recognizer task as well, which builds on the hand landmarks to classify a set of common gestures (like "Closed Fist", "Open Palm", "Pointing", "Victory/Peace", etc.). If you use that, you can get high-level gesture names directly. The set of gestures is somewhat limited (mainly basic ones and some letter shapes, depending on the model). For custom gestures (say a unique sign or a complex movement), you would need to implement logic or even train a custom model. MediaPipe's flexibility allows you to use the landmarks as inputs for your own classification algorithm (even a simple rules-based as we did for pinch, or a machine learning model if needed).

  • Q: Can I track multiple hands and do interactions between them?

    A: Yes, MediaPipe can track multiple hands (you can configure the max number of hands, e.g., 2). It will give you separate landmark sets for each hand and an identifier for left vs right. You can definitely enable two-hand interactions (like pinch your hands together, or one hand controls something the other hand reacts to). Keep in mind performance will be a bit lower with two hands and more complex logic. Also, if the hands overlap in view, tracking can be momentarily confused (one hand covering the other might cause one to disappear until visible again). But many demos (and research) have used two-hand tracking successfully. Just set numHands to 2 and design your gestures accordingly (and test the edge cases of hands switching or overlapping).

  • Q: How does this compare to platform-specific solutions (ARKit Hand Tracking or others)?

    A: Apple's ARKit on iPhone and iPad (as of iOS 17) does not provide full hand-skeleton tracking – full hand tracking with joint data is part of visionOS, not mobile ARKit. On iOS, Apple's Vision framework does offer a hand pose request (VNDetectHumanHandPoseRequest, introduced in iOS 14) that detects hand joints in camera frames, similar in spirit to MediaPipe, and it can track multiple hands. If you target only iOS and want a fully native approach, that's a reasonable option; MediaPipe's advantage is that it's cross-platform and tuned for real-time video (Vision may run at a lower frame rate on some devices, though it benefits from the Apple Neural Engine). On Android, there isn't an official platform equivalent – third-party libraries like ManoMotion exist, and ML Kit offers body pose detection, but neither gives a full 21-point hand skeleton – so MediaPipe fills that gap. If you need one codebase for both platforms, MediaPipe is a great choice; if you're on a single platform, weigh these trade-offs. In this guide, we chose MediaPipe for broad applicability and proven performance.

  • Q: Can I use hand tracking for complex gestures like sign language recognition?

    A: MediaPipe provides the raw landmarks, which are a great starting point for sign language recognition research. However, recognizing full sign language involves not just static hand poses but also motion, both hands, and sometimes facial expressions. You would need to build a classification system on top of the landmarks (such as a sequence model – e.g., an LSTM or temporal CNN – that interprets sequences of hand poses as signs). There are academic projects doing exactly this. MediaPipe's gesture recognizer isn't advanced enough to handle full sign language out of the box; it's aimed at basic gestures. So yes, you can use MediaPipe as a foundation (it's likely one of the best real-time hand trackers for this), but the sign language recognition itself would be your custom implementation or model. Expect to gather training data and develop an ML model if you aim for comprehensive sign language interpretation.