Exclusive: How does Microsoft Xbox Kinect work?

T3 explores Kinect's innards at Microsoft's Redmond Campus

The innards and the processes involved with the motion-control box explained

Since its unveiling at 2009’s E3, Microsoft Kinect (or Project Natal, as it was originally known) has caused a bit of a stir. The Xbox’s foray into the motion-controlled gaming war is a controller-less one, which has left many sceptical as to its accuracy, and those who have tried it bewildered as to just how it works.

Microsoft invited T3 for a UK-exclusive, all-access tour round the labs at the company’s HQ in Redmond, near Seattle, to see exactly how Kinect works and the work that’s gone into it. Kinect is made up of three distinct subsystems that each marry hardware and software, all of which are detailed below:

How does Xbox Kinect work? Movement tracking

Kinect’s optical setup is what allows it to track your movements in real time. It’s ridiculously complicated and made up of tech that's been around for about 15 years, but allows for effects and functions that have only been available at huge expense up until very recently.

It’s made of two main parts: a projector and an IR VGA camera. The former bounces a laser (don’t worry, Microsoft insists it’s safe) across the entire field of play, which the camera picks up to separate you from your sofa on what’s called a ‘depth field’. This is essentially every pixel of IR that Kinect gets back, shown in a colour that depends on how close it is to the system. That way bodies appear a bright shade of red, green etc, and things further away appear grey.
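
To make the depth-field idea concrete, here is a minimal sketch of separating a player from the sofa by distance alone. It is not Microsoft's code: the function name, the millimetre units, the rule that 0 means "no reading" and the near and far limits are all assumptions made for this example.

```python
def split_foreground(depth_mm, near_mm, far_mm):
    """Return a mask that is True where a pixel's depth lies in [near_mm, far_mm].

    A depth of 0 is treated as "no reading" (an assumption of this sketch).
    """
    return [
        [d != 0 and near_mm <= d <= far_mm for d in row]
        for row in depth_mm
    ]


# Toy 3x4 depth frame in millimetres: a "player" at about 2 m in front of
# a "sofa" at about 3.5 m, with one pixel that returned no reading.
frame = [
    [3500, 3500, 3500, 3500],
    [3500, 2000, 2050, 3500],
    [3500, 1980,    0, 3500],
]
mask = split_foreground(frame, near_mm=1000, far_mm=2500)
player_pixels = sum(cell for row in mask for cell in row)
```

A real system would have to do far more than threshold by distance, but the sketch shows why a per-pixel depth reading makes the player-versus-furniture split tractable in the first place.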

The software takes this image and runs it through a host of filters so that Kinect can work out what’s a person and what’s not. The system follows a basic set of guidelines, such as ‘a body is from x-foot tall to x-foot tall’ and ‘a person has two arms and two legs’, to work out that your coffee table and dog aren’t extra players. It’s also taught to pick you out if you’re wearing baggy clothes or have hair coming over your shoulders. When we saw this as the developers see it, it was impressively accurate at sussing out each body part (right shoulder, for example) from not much information.
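
The guideline approach can be sketched as a simple plausibility check. Everything here is hypothetical: the `Blob` type, its fields and the height limits are stand-ins invented for illustration, since Microsoft did not share the actual figures or filters.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    """A foreground region, described by hypothetical summary measurements."""
    name: str
    height_m: float
    limb_count: int


def looks_like_person(blob, min_height_m, max_height_m):
    """Apply two rule-of-thumb checks: plausible height, and four limbs."""
    return min_height_m <= blob.height_m <= max_height_m and blob.limb_count == 4


blobs = [
    Blob("player", 1.75, 4),
    Blob("coffee table", 0.45, 4),
    Blob("dog", 0.60, 4),
]
# The height limits are placeholders chosen for this example only.
players = [b.name for b in blobs if looks_like_person(b, 1.0, 2.2)]
```

In this toy run only the player survives: the table and the dog both have four "limbs", but neither is tall enough to pass the height rule.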

Once that’s sorted, it converts body part identification into a skeleton with moving joints. Kinect is preloaded with 200 common poses, so that it can fill in the blanks if you make a move that obstructs the camera’s view of your entire skeleton. The only downside we could see was that fingers aren’t mapped individually on the skeleton, meaning that those dreams of holding a pretend gun and pulling the trigger for Kinect FPS games are over.
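
One plausible way to "fill in the blanks" from a stored set of poses is a nearest-match lookup: compare the joints the camera can see against each stored pose, then borrow the hidden joints from the closest one. This is our own illustration, not a description of Kinect's actual method, and the joint names, coordinates and two-pose library are made up.

```python
def fill_missing_joints(observed, pose_library):
    """Fill joints marked None using the closest stored pose.

    Closeness is the sum of squared distances over the visible joints.
    """
    def distance(pose):
        return sum(
            (pose[j][0] - xy[0]) ** 2 + (pose[j][1] - xy[1]) ** 2
            for j, xy in observed.items()
            if xy is not None
        )

    best = min(pose_library, key=distance)
    return {j: (xy if xy is not None else best[j]) for j, xy in observed.items()}


# Two made-up poses, each mapping a joint name to (x, y) in arbitrary units.
library = [
    {"head": (0, 10), "left_hand": (-4, 6), "right_hand": (4, 6)},    # arms out
    {"head": (0, 10), "left_hand": (-1, 12), "right_hand": (1, 12)},  # arms up
]
# The right hand is hidden from the camera in this frame.
seen = {"head": (0, 10), "left_hand": (-1, 11), "right_hand": None}
skeleton = fill_missing_joints(seen, library)
```

Here the visible left hand sits much closer to the "arms up" pose, so the hidden right hand is taken from that pose.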

The system does all this continuously at 30fps.

What about that promo trailer where Kinect signs in players just by looking at them? We saw that work in real life. The reality is that you’ll need to go through an ‘enrolment’ process for that to happen. It’s a short one, and works by mixing your skeletal measurements with some basic facial recognition software. Microsoft says that if you drastically change your appearance you’ll need to re-enrol.

The problem facing the microphone subsystem is that it needs to be sensitive to voices up to 10 feet away, while being able to ignore ambient noises and any sounds other than your voice. To solve this problem, the Microsoft lab went to 250 homes with 16 microphones and took a host of recordings from different setups, determining in the end the very best mic positioning.

The end result is an array of four downward-facing mics (so that the front of Kinect stays clean and grille-free), spaced one on the left and three on the right. In fact, this specific microphone placement is the only reason why Kinect is as wide as it is.

This array works best at picking up voices at distance, but it still needs help. The onboard processing unit cancels out noise that it determines is coming from your beefy 5.1 surround system, while a software system called ‘Beam Forming’ uses the camera to work out where you are and create an envelope of sound around you. This homes in on the sound of your voice and ignores your friends or family members either side of you.
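
Microsoft didn't detail the maths behind its beam forming, but the textbook version of the idea is called delay-and-sum: sound from one spot reaches each mic at a slightly different time, so you shift each mic's signal to line them up and then average. The sketch below assumes the per-mic delays (in samples) are already known, for instance from the player's position, and uses a made-up test signal.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its arrival delay (in samples) and average them."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


# A toy "voice" that reaches mic k exactly k samples late.
voice = [0, 1, 0, -1, 0, 1, 0, -1]
channels = [[0] * k + voice for k in range(4)]

# Steering at the talker undoes the delays, so the voice adds up cleanly.
steered = delay_and_sum(channels, delays=[0, 1, 2, 3])
# Steering elsewhere (no delays) leaves the copies misaligned, so they
# partly cancel each other out.
unsteered = delay_and_sum(channels, delays=[0, 0, 0, 0])

steered_peak = max(abs(x) for x in steered)
unsteered_peak = max(abs(x) for x in unsteered)
```

In this toy case the steered output reproduces the voice at full strength, while the unsteered output peaks at a quarter of that, which is the basic effect behind favouring one talker over the people either side.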

Kinect has an ‘acoustical model’ for countries and individual regional dialects, built from hundreds of hours of actors from round the world talking through various sayings.

Just like the optics, this is happening all the time. The sound recognition works on an open-mic system, meaning the microphones are listening at all times - ready to take commands such as ‘Xbox Pause’ at any point during movie playback.

The final part of Kinect’s tech-laden underbelly is its motor. Microsoft spent time looking at the differences in living spaces across American, European and Asian homes, and realised that the camera needed to be able to move up and down to calibrate to each specific space.

When you get your hands on Kinect, you might notice that the base is quite weighty. While this is in part to stop the unit from falling over, it’s also due to the motor being located there. It’s able to move the unit’s head up and down, plus or minus 30 degrees, meaning that (when placed at the optimal 3-6ft height) it can still work regardless of your TV unit’s height.
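
To see why a tilt range like that copes with different shelf heights, here is a rough bit of trigonometry. The function name, the player distance and the chest-height target are our own assumptions for illustration; only the plus-or-minus 30 degree limit and the 3-6ft placement come from what we were told.

```python
import math

TILT_LIMIT_DEG = 30.0  # the plus-or-minus range quoted above


def tilt_to_target(sensor_height_m, target_height_m, distance_m):
    """Return the tilt in degrees needed to aim at the target, clamped to the limit."""
    angle = math.degrees(math.atan2(target_height_m - sensor_height_m, distance_m))
    return max(-TILT_LIMIT_DEG, min(TILT_LIMIT_DEG, angle))


# Sensor on a low TV unit (0.9 m, roughly 3 ft) aiming at chest height
# (1.2 m) on a player standing 2.5 m away: a small upward tilt.
low_shelf = tilt_to_target(0.9, 1.2, 2.5)
# Sensor at about 6 ft (1.8 m): the same target needs a downward tilt.
high_shelf = tilt_to_target(1.8, 1.2, 2.5)
```

With these example numbers the low shelf needs roughly 7 degrees of upward tilt and the high shelf roughly 13 degrees down, both comfortably inside the quoted range.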

T3 was taken round Microsoft’s labs, where we were shown the motor being tested under extreme heat and for extended use (thousands of tilts a day over the course of several months), as well as for accuracy to within one degree. A quiet room has also been built to make sure that the motor and tilting can’t be heard by the user. Microsoft insists that the action is around 24 decibels loud, while the average living space is about 40.
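
Decibels are logarithmic, so the gap between those two figures is bigger than it looks. As a quick worked example, assuming both numbers are sound levels on the same scale (the function name is ours):

```python
def power_ratio(louder_db, quieter_db):
    """Convert a difference in decibels into a ratio of sound power."""
    return 10 ** ((louder_db - quieter_db) / 10)


# Living room at about 40 dB versus the motor at about 24 dB.
ratio = power_ratio(40, 24)
```

A 16 dB gap works out at a factor of roughly 40 in sound power, which is why the motor should be lost in ordinary room noise.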

The motor also operates the camera’s zoom function, which allows it to expand the play space. The example we were shown was somebody joining a video chat by walking in behind the sofa, at which point the picture zoomed out to compensate and allow them into the frame. There is also a fan that only kicks in when needed, so as not to interfere with the microphones.