How to build a holodeck, week 30

Meshing speedups and Tango upgrades, this week. Even some video!


Week 30 - Kinect V2 depth scan performance

Meshing performance

I spent the first half of the week looking at potential performance optimisations for the depth-to-voxel process (and a little on the meshing process too). At the end of last week, I was hitting around 300ms for the depth-to-voxel scan and about 600ms for the meshing process, both of which are an order of magnitude larger than I really want.

I initially looked at speeding up the “let’s test every voxel in every cube” process, which is how Chisel does chiselling. This sounds a bit expensive, and it is, but a core part of how the removal / carving voxels process works is that you have to confirm, per voxel, that the voxel is empty (which means testing it, or setting it, or both).
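For anyone following along, here is a rough sketch of what a per-voxel test can look like: project the voxel centre into the depth image and compare its distance against the depth reading. This is not Chisel's code or the code from this project; `Intrinsics`, `classify_voxel` and the `truncation` band are hypothetical names, and a simple pinhole camera with +z pointing forward is assumed.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Hypothetical pinhole intrinsics: focal lengths and principal point in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


def classify_voxel(centre_cam, depth_image, intr, truncation=0.05):
    """Classify one voxel centre given in camera space (metres, +z forward).

    depth_image is a list of rows of depth readings in metres; 0 means no reading.
    """
    x, y, z = centre_cam
    if z <= 0:
        return "outside"  # behind the camera
    # Project the voxel centre onto the image plane.
    u = int(round(intr.fx * x / z + intr.cx))
    v = int(round(intr.fy * y / z + intr.cy))
    if not (0 <= u < intr.width and 0 <= v < intr.height):
        return "outside"  # not inside the frustum
    d = depth_image[v][u]
    if d <= 0:
        return "unknown"  # the sensor returned nothing for this pixel
    if z < d - truncation:
        return "empty"  # in front of the measured surface, so it can be carved
    if z <= d + truncation:
        return "surface"  # within the band around the measured surface
    return "occluded"  # behind the surface, nothing can be said about it
```

Running something like this for every voxel in every chunk inside the frustum is what makes the full scan expensive, but it is also what gives you the "empty" answers that carving depends on.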

I spent Monday exploring alternative methods of scanning the space, such that instead of scanning the entire volume inside the camera frustum, I only scan based on the number of pixels in the depth image, as this is a much smaller constant than the number of voxels in the volume. Maybe I can work backwards from the image to determine the voxels?

So, I did the math to work out the back plane of pixels in world space, quantised this to voxels, and then back-projected based on the depth reading to find the actual voxels that need to interact. This is about 3 times faster for me - I lose a lot of the benefits of iterating over a coherent memory region (the CPU caches do a great job of holding the entire chunk of voxels, so iterating over them isn’t as expensive as you’d expect) but I’m doing about 1/25th the work, so it’s still faster.
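To make the image-driven approach a bit more concrete, here is a minimal sketch of going from depth pixels to voxel indices. It uses the textbook pinhole back-projection rather than the exact back-plane math described above, and every name in it (`back_project`, `camera_to_world`, `voxel_index`, `voxels_from_depth`) is made up for illustration. It assumes depth in metres, a 3x3 row-major camera-to-world rotation, and no lens distortion.

```python
import math


def back_project(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with a depth reading in metres -> point in camera space."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p, rotation, translation):
    """Apply a camera-to-world pose: 3x3 row-major rotation, then translation."""
    return tuple(
        sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def voxel_index(p_world, voxel_size):
    """Quantise a world-space point to integer voxel coordinates."""
    return tuple(math.floor(c / voxel_size) for c in p_world)


def voxels_from_depth(depth_image, fx, fy, cx, cy, rotation, translation, voxel_size):
    """Collect the set of voxels hit by any valid depth pixel."""
    touched = set()
    for v, row in enumerate(depth_image):
        for u, d in enumerate(row):
            if d <= 0:
                continue  # no reading for this pixel
            p_cam = back_project(u, v, d, fx, fy, cx, cy)
            touched.add(voxel_index(camera_to_world(p_cam, rotation, translation), voxel_size))
    return touched
```

The loop count here is bounded by the number of pixels rather than the number of voxels in the frustum, which is where the saving comes from. The resulting voxels are scattered across chunks, so the memory access pattern is much less cache-friendly.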

The main thing I lose using this method is the carving cues - I can effectively carve on front-facing surfaces, but I can’t carve what I don’t know about. If I back-project a depth at the far end of the frustum (because an object that was there is no longer there), I would have to test every voxel along the entire ray to carve them out, and because this is not nicely ordered in memory, the cost of that is going to be huge.
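To show what "test every voxel along the entire ray" would involve, here is a sketch of a standard grid traversal (in the style of Amanatides and Woo) that walks from the camera to the back-projected point one voxel at a time. This is an illustration only, not something implemented in the project, and `voxels_along_ray` is a hypothetical name.

```python
import math


def voxels_along_ray(origin, end, voxel_size):
    """Yield the integer voxel indices crossed by the segment origin -> end."""
    pos = [math.floor(c / voxel_size) for c in origin]
    last = [math.floor(c / voxel_size) for c in end]
    direction = [e - o for o, e in zip(origin, end)]

    step, t_max, t_delta = [], [], []
    for axis in range(3):
        d = direction[axis]
        if d > 0:
            step.append(1)
            boundary = (pos[axis] + 1) * voxel_size
            t_max.append((boundary - origin[axis]) / d)
            t_delta.append(voxel_size / d)
        elif d < 0:
            step.append(-1)
            boundary = pos[axis] * voxel_size
            t_max.append((boundary - origin[axis]) / d)
            t_delta.append(-voxel_size / d)
        else:
            # The ray never crosses a boundary on this axis.
            step.append(0)
            t_max.append(math.inf)
            t_delta.append(math.inf)

    yield tuple(pos)
    while pos != last:
        axis = t_max.index(min(t_max))  # the next boundary the ray reaches
        if t_max[axis] > 1.0:
            break  # the next crossing is past the end of the segment
        pos[axis] += step[axis]
        t_max[axis] += t_delta[axis]
        yield tuple(pos)
```

Every yielded voxel would need a lookup in whichever chunk it falls in, so a ray to the far end of the frustum can touch many chunks in an order that has nothing to do with how they are laid out in memory.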

After messing about on Tuesday with more memory layout and small-grain optimisations (where do I do the vector math and what can I cache, mostly) I got the back-projection working in around 30ms, and the voxel integration of these arbitrary voxels running in around 100ms - so hitting under 150ms at the cost of no carving. This is running on a single core and I’m not attempting to do any sort of time-slicing or prioritisation of more-modified chunks, and I think the depth buffer itself could help hint at those areas.

I’m going to go back to looking at optimising the original implementation next week, as I think I can probably halve the cost on that with a bit more thinking - but it suffers some of the same flaws as I’m expecting with the carving process anyway.

The Marching Cubes meshing cost is also substantial (150-200ms) but I’ve not really looked at that much yet.

There are undoubtedly more algorithmic optimisations I can make, but we’re almost in the territory where splitting things up into smaller jobs and batching them up makes more sense, so I’m going to hold off on spending too much more time on performance for now and get the rest of the system working.

Multiple devices, multiple feeds, and a Tango upgrade

My next goal is to get multiple, and moving, cameras all feeding into the same data pool. Unfortunately the Unity Kinect plugin doesn’t support multiple sensors on one machine (even though technically I should be able to do it) and I don’t have a good enough spare machine to run my other Kinects on - so it’s time to build a new PC! Parts are ordered, and hopefully next week I’ll get that up and running.

In the meantime, I do have a selection of hardware with depth and texture feeds, and one of those even does position/rotation tracking! Yes, it’s time to dig the Tango out again.

Google finally fixed the problems with running under Unity 5.4, so I spent Wednesday thru Friday getting my Tango integration up to date with the latest release (Vega) and fixing all the things that got broken due to that - the main one being that they’ve changed how you get textures from the camera. What was originally there (when I did this months ago) was a pretty convoluted process where you take three buffers (Y,U,V) in the YCrCb space and then use a shader to convert this into an RGB RenderTexture.
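For reference, the arithmetic behind that kind of conversion looks roughly like the sketch below, written as a per-pixel CPU function rather than a shader. It assumes full-range BT.601 (JFIF) coefficients; the coefficients and value ranges the Tango SDK actually uses may differ, and `ycbcr_to_rgb` and `clamp_byte` are names invented for this example.

```python
def clamp_byte(x):
    """Round to the nearest integer and clamp to the 0..255 byte range."""
    return max(0, min(255, int(round(x))))


def ycbcr_to_rgb(y, cb, cr):
    """Convert one pixel from full-range BT.601 YCbCr (0..255 each) to RGB."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp_byte(r), clamp_byte(g), clamp_byte(b)
```

With both chroma values at 128 the pixel is a neutral grey, so the luma value passes straight through to all three channels. A shader version does the same multiply-adds per fragment after sampling the three planes.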

That was all working nicely - except now in Vega, this is borked. And there’s NO replacement. You have to use some very silly (imho) code from their AR Overlay to render the camera image, or you have to do it entirely by hand from a YUV byte buffer. Both of these methods stink - what’s needed is “GetLatestFrame” into a Texture2D or RenderTexture.

Still, with some info from Danny at SignalGarden (thanks Danny!) I got the camera rendertexture method working, and got my Scanner app back up and running on Unity 5.4 and Vega. That took me most of the way to Friday afternoon.

Switching back to the depth feed app, it quickly became obvious that the way I’m throwing textures and meshes around in the scanner is very - let’s say, provincial. I’ve got many assumptions about Tango baked into that code, and a lot of them came from my initial investigations into the Tango poses and camera offsets.

So, I’ve begun a bit of a rewrite to move things over to feeds.

What’s a feed?

A feed is some chunk of data (a camera image, such as RGB or depth, or a point cloud) that has relevance to the world construction model. A feed needs to know WHERE and WHEN it came from (which means a pose, otherwise known as camera extrinsics, and a timestamp). It also needs to know the properties of the device that took the image, so in the case of depth or RGB the feed needs to provide the camera intrinsics as well as the pose.

If I pass the intrinsics around (that’s focal lengths, principal points, image size in pixels, near and far planes in world space, distortion coefficients etc) as well as the extrinsics, I can accurately project this data into the world, regardless of where the feed comes from. If you have a feed coming from a device that doesn’t know its extrinsics, there are ways to calibrate that (from my manual calibration all the way up to smart calibration based off other devices in the scene).

This means, in theory, any arbitrary camera / depth scanner device could be a feed into the same dataset.
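As a rough picture of what a feed might carry, here is a minimal sketch of the data described above. The rewrite itself is still in progress, so this is only a guess at a shape: `Feed`, `Pose`, `Intrinsics` and all of their fields are hypothetical names, not the actual types in the project.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class Intrinsics:
    """Properties of the device that captured the data (hypothetical layout)."""
    fx: float  # focal lengths in pixels
    fy: float
    cx: float  # principal point in pixels
    cy: float
    width: int  # image size in pixels
    height: int
    near: float  # near and far planes in world units
    far: float
    distortion: Tuple[float, ...] = ()  # distortion coefficients, if any


@dataclass
class Pose:
    """Extrinsics: WHERE the device was, as a camera-to-world transform."""
    position: Tuple[float, float, float]
    rotation: Tuple[Tuple[float, float, float], ...]  # 3x3, row-major


@dataclass
class Feed:
    kind: str  # e.g. "depth", "rgb" or "point_cloud"
    timestamp: float  # WHEN the data was captured, in seconds
    data: Any  # the image or point cloud itself
    intrinsics: Optional[Intrinsics] = None  # needed for depth and RGB images
    pose: Optional[Pose] = None  # None until the device has been calibrated

    def is_placeable(self) -> bool:
        """True when there is enough information to project into the world."""
        return self.pose is not None
```

Keeping the pose optional is one way to represent a device that doesn't know its own extrinsics yet: the feed can still be received and timestamped, and gets projected into the world once calibration fills the pose in.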

So, that’s the plan - I’m busy rewriting the Tango depth and texture processing to be feeds. That’s going to take a little while.

Next week …

Hopefully finishing up the feed rewrite, and then getting on to networking, so I can get multiple feeds from multiple devices all populating the mesh. Maybe getting texture projection working too. If I can get that working, I might attempt using the Vive camera to do some texturing on a mesh.

Lots of cool experiments to try in the next few weeks!

Written on September 17, 2016