You want to be prepared for the future of augmented reality: the day when we’re walking around wearing glasses and seeing ‘stuff’ in the world around us.
You’ve figured out that it’s going to be a big deal. You saw some of the press for Magic Leap. You’ve played around in virtual reality and understand that there’s something special about ‘tricking’ our eyes into seeing new worlds.
You’ve seen Minority Report and Blade Runner 2049.

You’ve maybe even played AR Lego or Pokemon Go and you can see how virtual avatars, little mini-worlds and 3D billboards will be able to pop up in the world around us.

You’re all in on AR.
You’re agnostic about what you learn, so it’s not like you need years of experience in WebGL or something.
The AR world is your oyster. So where do you begin?
AR Isn’t What You Think It Is
If someone right out of school, or someone who’s retired and wants to take on a new career, were to ask me, “What would you recommend? Where do I begin?”
I’d tell them to learn SwiftUI.
This is, perhaps, a straw man to make a point. And I’ll come back to SwiftUI.
But the main reason for this recommendation would be this:
Magic Leap, VR and phone-based AR experiences have conditioned us to think about augmented reality in a way that won’t match what we’ll experience in our daily lives once we’re wearing glasses all day.
Yes, you can learn how to develop for VR. Yes, there’s tremendous value in learning how to use Unity.
Yes, one day we’ll walk around town and characters from Star Wars will be mixed in with the ‘real people’ and those building-sized avatars will pop out of billboards like in Blade Runner.
But even when that happens the value of AR will lie less in those moments when reality and virtuality blend together. Instead, the power of AR will lie in an always-on ambient awareness of our surroundings, with just-in-time micro transactions that turn the physical world into an interface.
If you have the luxury of planning for a longer time horizon, the real value of AR will be best expressed in your ability to deliver reactive content based on place.
And therefore the best place to start is with reactive “nugget-sized” design (perhaps linked to explorations of the AR cloud or machine learning).
Two Glasses, Two Markets
Qualcomm presented the following chart at the AWE conference:

This is one of the most important slides to understand if you want to plan for the future.
The top section is representative of the Quest. These are devices that stand on their own. Because they stand on their own they also carry ‘bulk’. They aren’t the kinds of head-mounted devices you’ll wear out for dinner with friends.
The bottom section is representative of Nreal or Focals by North. These are devices you can wear in more places and that look “lighter”, but frankly they don’t do very much.
Over time, these two form factors will merge. In 5-10 years there will be “glasses” which can do BOTH AR and VR. They won’t need a separate device attached like a phone or PC because content will be streamed via edge computing and 5G.
But regardless of how you want to interpret this chart or which devices you place in which category, there’s a simple way to summarize where we’re at:
- There is a generation of head-mounted devices that you use at home, in your living room, and which can deliver really rich ‘experiences’. (Magic Leap was supposed to be one of these, but the ‘experiences’ part fell flat).
- There is another generation of devices, with more around the corner (from Apple, Snapchat, Google and others) which you’ll wear every day but which, because of the laws of physics and the current state of displays and miniaturization, won’t have the same visual fidelity and immersion.
But VR headsets are available now! A new Quest will rock the market and hopefully help Facebook sell…what? Several million more devices?
If you want to make money or do cool stuff, shouldn’t you build for the market as it exists today?
Sure. But be prepared: the skills you learn won’t easily transfer over to that second category.
When Apple launches glasses, they won’t be launching a “walk-around” Quest. They’ll be launching prescription glasses that just HAPPEN to do a few cool things, that pair nicely with your iPhone or Watch or TV, but that aren’t meant to create the illusion that you’re walking around a Star Wars planet when you visit the local mall.
Apple will focus on glasses that are prescription-ready and that help life to “pop”:

And, being Apple, they will sell…millions? Tens of millions? Your guess is as good as mine.
But the number of sales isn’t important. What’s important is how much of your day-to-day time they will consume.
If Apple (or Google or Snapchat or Amazon) launches glasses that you can wear all day, those glasses will “own” maybe 14 hours of your day and cede 2 hours to the time you spend in your living room playing on your PS5 or wearing your Oculus Quest.
What Will Glasses DO?
You have a pair of AR glasses. You wear them on your drive to work because they’re prescription glasses. You pop into Starbucks and buy a coffee. You head up to the office and boot up your iPad with the Magic Keyboard. At night, you zone out and watch Netflix after playing a game with the kids.
In this scenario, how often will your glasses display the kinds of “rich 3D content” that you saw in those Magic Leap videos?
How often will you use gesture recognition to manipulate a 3D scene, building your own virtual Lego on the dining room table, for example?
Will you ALWAYS wear them? Or will you take them off so that you can “pop in” to VR?
Now, let’s just PRETEND for a minute that “rich 3D” were possible. Do you have little 3D avatars on the desk beside you while you work? Do virtual billboards pop out at you, Blade Runner/Minority Report style?
Honestly, I think most people would turn their glasses off.
It’s the AR dystopia visualized in that famous video.
Maybe it’s unavoidable.
But more likely, and especially at the beginning, AR will be a lot more subtle. It will be gentle on the eyes. We won’t have news items flashing up all the time; there won’t be blinking billboards and dancing avatars. It will be useful and relatively unobtrusive.
- On the drive to work you’ll get subtle wayfinding which nudges you around a traffic jam ahead
- At Starbucks, your rewards points will float above the cash register and you will be able to pay by nodding your head
- At work, the home screen widgets that Apple launched back in 2020 will magically “float” off the edges of your iPad
- At home, Netflix will have a new “AR pops” layer with pop-up storylines that appear next to the screen. When you watch a sports event on Apple TV, you’ll be able to zoom in on a particular player and pull up her stats for the season
Key Technologies Facilitating “Walk-Around” AR
What is it that facilitates all of this?
Sure, it will be deeply informed by the work done in VR. But “gesture recognition” won’t be something developers need to sort out. It will be BAKED in by the makers of the glasses and extended by things like head nods, blinks or eye movement.
The VR folks are solving tough problems around gesture recognition, and they think those skills will be super important in “AR”. They won’t be, because they’ll already be solved problems by the time AR arrives. (And gestures will be a relatively minor affordance.)
Mostly AR will be informed by three things:
- Location context (or the “AR Cloud”)
- Machine learning, which provides the “juice” for object and scene recognition
- And reactive-style programming
These three things fall under the larger umbrella of “spatial computing” which is a paradigm change in how we think about what our technology will DO.
Location Context
The key to “walk-around” AR is the ability of your device to understand where it is, what’s around it and to then precisely ‘place’ content and interactions.
LiDAR and your camera can scan your immediate environment. But to extend this to city-scale experiences, to persist this content across sessions and users, you need something more.
You need an AR Cloud.
This year, Apple announced location-based AR experiences. This is Apple’s version of an AR Cloud. It represents a ‘scan’ of the physical world and allows you to place digital objects in precise locations.
Apple takes a 3D map of the world from Apple Maps:

Then creates a “localization map”, a point cloud that allows your device to precisely interpret what’s around it:


Which then allows you to precisely place content. It also means multiple users can see the same thing in the same place, and that content can persist across sessions, thanks to the localization map.
The text floating above the building isn’t real, but it is precisely locked in position:

AR geo anchors, available in five cities for now (with “more cities available over the summer”), will be a key facilitating technology for “walk-around” glasses.
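To make that concrete, here’s a minimal sketch of placing geo-anchored content with ARKit’s location anchors. It assumes an existing RealityKit ARView; the coordinates and the label text are placeholders, not a real deployment:

```swift
import ARKit
import RealityKit
import CoreLocation

// A minimal sketch: pin a text label to a real-world coordinate using ARKit's
// location anchors (ARGeoAnchor). Coordinates and label are placeholders.
func placeGeoAnchoredLabel(in arView: ARView) {
    // Geo tracking only works in supported cities, so check availability first.
    ARGeoTrackingConfiguration.checkAvailability { available, _ in
        guard available else { return }

        DispatchQueue.main.async {
            arView.session.run(ARGeoTrackingConfiguration())

            // Anchor content to a latitude/longitude/altitude in the real world.
            let coordinate = CLLocationCoordinate2D(latitude: 37.7955, longitude: -122.3937)
            let geoAnchor = ARGeoAnchor(coordinate: coordinate, altitude: 11.0)
            arView.session.add(anchor: geoAnchor)

            // Attach visible content; once the device localizes, it stays locked to that spot.
            let anchorEntity = AnchorEntity(anchor: geoAnchor)
            anchorEntity.addChild(ModelEntity(mesh: .generateText("Ferry Building")))
            arView.scene.addAnchor(anchorEntity)
        }
    }
}
```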
Machine Learning/AI and Intelligence
The next big development in the AR Cloud will be semantic mapping. It’s one thing to have a ‘scan’ of the world, represented by point clouds. It’s another to know what the objects are in those scans.
We see hints of this in current AR technology, where your phone can make sense of the world in a limited way. At WWDC, for example, Apple extended the capabilities of ARKit and the usefulness of LiDAR scans:
- You can now easily do object occlusion. For example, digital objects can hide behind a physical tree

- Occlusion is supplemented by ‘scene detection’. You can now detect vertical and curved surfaces and use things like ray casting so that digital objects can climb up a tree or collide with the ground (see the sketch after this list).
- You can now re-render reality itself using Create ML, which can be used to train style transfer models:

- You can now use machine learning to detect body movement, using Create ML to track actions. It can be used, for example, to track a tennis match or a yoga session.
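Here’s the sketch referenced above: a rough example (not production code) that turns on LiDAR scene reconstruction so digital objects are occluded by and collide with the real world, then ray-casts from a screen tap to place an object on whatever surface is there. It assumes a RealityKit ARView and a LiDAR-equipped device:

```swift
import ARKit
import RealityKit

// A sketch: LiDAR scene reconstruction plus ray casting.
// Assumes a LiDAR-equipped device and an existing RealityKit ARView.
func configureSceneUnderstanding(for arView: ARView) {
    let config = ARWorldTrackingConfiguration()
    if ARWorldTrackingConfiguration.supportsSceneReconstruction(.meshWithClassification) {
        config.sceneReconstruction = .meshWithClassification  // scanned mesh, with labels
    }
    // Let the scanned world occlude virtual objects and act as a physics surface.
    arView.environment.sceneUnderstanding.options.insert([.occlusion, .physics])
    arView.session.run(config)
}

// Ray-cast from a tap to place a small sphere on a real surface (flat or curved).
func placeSphere(at screenPoint: CGPoint, in arView: ARView) {
    guard let hit = arView.raycast(from: screenPoint,
                                   allowing: .estimatedPlane,
                                   alignment: .any).first else { return }
    let anchor = AnchorEntity(world: hit.worldTransform)
    anchor.addChild(ModelEntity(mesh: .generateSphere(radius: 0.05),
                                materials: [SimpleMaterial(color: .blue, isMetallic: false)]))
    arView.scene.addAnchor(anchor)
}
```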
These are just hints at how machine learning is used to ‘see’ and interpret the world around us. (And they’re an important reason why Apple vastly improved Create ML and why they seemed to talk about LiDAR in every session).
But the real magic will come when our phones don’t just detect people (and their actions) or vertical and horizontal planes.
What happens when your phone knows that something is a chair and that the chair comes from IKEA? What happens when it knows what kind of dog you own (see Snapchat!)?
This kind of semantic understanding is critical to how powerful spatial computing will become.
Spatial Computing
Spatial maps. Machine learning. Devices that can ‘see’ the world around them and understand what that world is made up of: cars and pedestrians, a Starbucks counter and a display of muffins, an iPad work set-up with a Magic Keyboard and a bunch of emails you’re trying to sort through while keeping an eye on the weather for a game of golf.
The power of spatial computing lies in the convergence of awareness and action.
I know I’m entering a Starbucks, I know I’m approaching the counter, and I know that I will want to earn rewards points and pay for my coffee.
This spatial awareness matched to action is what makes spatial computing so important.
And it will have its most visible manifestation in what we see through our glasses (or, sure, in how well our cars drive themselves, among other things).
The compute layer will be mostly ambient.
Think of it this way: BEFORE you walk into Starbucks, you’ll walk from the parking lot and along the sidewalk to the store. Computing will be happening, but nothing will happen that the wearer SEES.
Sometimes things WILL happen and you still won’t see them: the temperature at home will change when you walk in the front door, an Uber will be booked to take you to the airport based on an updated flight time, and you’ll only be made aware of it a few minutes before you need to leave.
When this ‘ambience’ is tied not just to physical locations (“I’m at the store”) but to precisely calibrated scans of those locations (“oh, here’s the can of soup you wanted to buy”) we will start to rely on our devices to think for us, to ping us only when needed, to let us interact with situations and environments in small and subtle ways.
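You don’t even need glasses to prototype that kind of ambience. Here’s a rough sketch of the idea using plain geofencing; the thermostat call below is a hypothetical stand-in for whatever action your app actually takes:

```swift
import CoreLocation

// A sketch of "ambient" behaviour using a geofence: when you arrive home, something
// happens with no UI at all. setThermostat(to:) is a hypothetical placeholder.
final class AmbientTriggers: NSObject, CLLocationManagerDelegate {
    private let manager = CLLocationManager()

    override init() {
        super.init()
        manager.delegate = self
        manager.requestAlwaysAuthorization()

        let home = CLCircularRegion(
            center: CLLocationCoordinate2D(latitude: 45.4215, longitude: -75.6972), // placeholder
            radius: 50,
            identifier: "home")
        home.notifyOnEntry = true
        manager.startMonitoring(for: home)
    }

    func locationManager(_ manager: CLLocationManager, didEnterRegion region: CLRegion) {
        guard region.identifier == "home" else { return }
        setThermostat(to: 21) // nothing is shown to the user; the environment just responds
    }

    private func setThermostat(to degrees: Int) { /* hypothetical home-automation call */ }
}
```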
Which Brings Us To SwiftUI
Now, I won’t do a big deep dive into SwiftUI other than to say a few things:
- This year, Apple demonstrated its deep commitment to SwiftUI. The message to developers was clear: “use this. It’s important”
- SwiftUI is what underpins App Clips, Widgets, Watch Complications and a whole bunch of other stuff
- SwiftUI is tailor-made for creating ‘bite sized’ nuggets of content, and takes some of the pain away for the developer: you don’t need to know what screen you’re targeting. It is loosely coupled to the end display
But most important is that SwiftUI is reactive. (OK, this gets a little geeky: it’s actually functionally reactive when used with Combine).
By which I mean that SwiftUI is structured around a single source of truth, and changes to that single source of truth trigger changes in the ‘view’. You no longer need to listen for changes; the framework does it for you.
Think about what that means for spatial computing: you will need ways to respond to changes in the location/spatial status of the user, or the spatial/data status of objects.
SwiftUI (the view) and Combine (the ‘publisher’ of data) are exactly the types of frameworks that make a natural fit with those little bite-sized pieces of content you’ll expect to see in your glasses.
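Here’s a minimal sketch of that pattern (PlaceContext and nearbyOffer are hypothetical names, not a real framework API): a Combine publisher feeds a single source of truth, and a tiny SwiftUI view re-renders whenever it changes, with no listeners or manual refreshes anywhere:

```swift
import SwiftUI
import Combine

// A minimal sketch of a reactive "nugget". PlaceContext and nearbyOffer are
// hypothetical names; the publisher could come from CoreLocation, an ARKit
// session, or a server push.
final class PlaceContext: ObservableObject {
    @Published var nearbyOffer: String?   // the single source of truth

    init(offers: AnyPublisher<String?, Never>) {
        offers
            .receive(on: DispatchQueue.main)
            .assign(to: &$nearbyOffer)    // Combine keeps the truth up to date
    }
}

struct GlanceView: View {
    @ObservedObject var context: PlaceContext

    var body: some View {
        // No change listeners: when nearbyOffer changes, SwiftUI re-renders this view.
        Text(context.nearbyOffer ?? "Nothing nearby")
    }
}
```

The view doesn’t care where the data comes from; swap the publisher and the same little “nugget” works in a widget, an App Clip, or (presumably) whatever the glasses run.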
Where To Begin With AR
Everyone will make their own decisions based on what they already know, the markets they’re trying to serve, how much ‘runway’ they have for generating revenue, or what kinds of cool things they dream of creating.
But there seems to be this broader idea that “augmented reality” begins and ends with the kinds of things you develop in Unity, or the types of lessons we can carry over from VR.
It doesn’t. SOME of AR will… and at some point, 7-10 years from now, we’ll stop having two “buckets”. VR and AR will merge, both in our glasses and in how we design for them.
Today, it’s a bit like deciding whether to build The Last of Us Part II or Twitter.
Yes, The Last of Us is a great game. It’s immersive, and it will make a lot of money. But you’ll play it for 30 hours and then buy another game.
But users engage with Twitter every day. In bite-sized pieces. It’s a years-long engagement.
Both are successful. Both are valid things to build.
But there’s more than one path to our augmented future, and a few years from now the experiences we have wearing glasses won’t look anything at all like VR. Instead, the experiences that are with us day after day will be subtle and bite-sized, and we’ll use short voice commands and head nods for interactivity.
If you want to be ready for the day when we can move through this spatially computed world, don’t start with a Quest. Start with a widget. Work to make that widget really, really smart.
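A widget is a tiny thing: a timeline of entries and a view. Here’s a minimal sketch (CoffeeEntry, CoffeeProvider and the hard-coded points are placeholders); the interesting work is making getTimeline genuinely smart about place and context:

```swift
import WidgetKit
import SwiftUI

// A minimal widget sketch. CoffeeEntry/CoffeeProvider and the hard-coded values
// are placeholders; a "smart" version would fetch real, location-aware data.
struct CoffeeEntry: TimelineEntry {
    let date: Date
    let rewardPoints: Int
}

struct CoffeeProvider: TimelineProvider {
    func placeholder(in context: Context) -> CoffeeEntry {
        CoffeeEntry(date: Date(), rewardPoints: 120)
    }
    func getSnapshot(in context: Context, completion: @escaping (CoffeeEntry) -> Void) {
        completion(placeholder(in: context))
    }
    func getTimeline(in context: Context, completion: @escaping (Timeline<CoffeeEntry>) -> Void) {
        // This is where the "smart" part goes: what does the user need, here and now?
        let entry = CoffeeEntry(date: Date(), rewardPoints: 120)
        completion(Timeline(entries: [entry], policy: .atEnd))
    }
}

@main
struct CoffeeWidget: Widget {
    var body: some WidgetConfiguration {
        StaticConfiguration(kind: "CoffeeWidget", provider: CoffeeProvider()) { entry in
            Text("\(entry.rewardPoints) reward points")
        }
        .configurationDisplayName("Coffee Rewards")
    }
}
```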
And you’ll start to understand that the power lies in the ambient, reactive nature of what you’ve built, and just how transformative this technology will be.