What is Apple Spatial Computing?
and how many people have misunderstood what it means …
You have probably come across many reviews of the Apple Vision Pro (AVP) and concluded that it is not for you and is likely to be a failure. But what if it is not that the AVP is not ready, but that the computing we do today is not ready for the AVP?
This is a continuation of my earlier article on the Apple Vision Pro (AVP). If you are keen, you can find it here: https://medium.com/@jangthye/apple-vision-pro-a-glorified-display-device-d810a3bc6824
Apple has declared that the AVP is for Spatial Computing and defines this as technology in which computers blend data from the world around us in a natural way. This is a very broad statement, and many vendors have jumped on the bandwagon, peddling new devices with me-too capabilities similar to what the AVP can do (especially in providing a “virtual” display using a headset or glasses). It is very important to note the word “natural”, which is what many folks miss when comparing the abilities of the AVP with those of Virtual Reality (VR) devices.
However, there seems to be a consensus that the AVP is too expensive and cannot justify its cost, based on the capabilities seen by the early users who bought and tested the device. Many people reach this conclusion by judging its capabilities against their past experience with existing 3D and VR applications. But few have thought deeply about what 3D computing really is and how we are meant to use it.
Apple defines 3D computing, or spatial computing, as computers blending data from the world around us in a natural way. This means not just using data captured from the surrounding objects and people, but also from other parts of the world (the real world, not a virtual one). “In a natural way” means that we humans directly “sense” the data using our physical abilities: sight, sound, touch, taste, and smell. The last two senses are probably not the focus of computers (mostly because what we produce for taste and smell is usually for private consumption). There is also how we sense and interact using the space and physical objects around us. For example, the AVP lets us use our vision to focus on a point in our field of view and, together with our hands and fingers, make gestures that interact with the application running in the AVP. Very likely, there will be more hand and finger gestures we can use in the future.
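To make this concrete, here is a minimal SwiftUI sketch of how that gaze-and-pinch interaction reaches an app on visionOS: a plain tap gesture fires when the user looks at the view and pinches, so no special code is needed to receive the “natural” input. The view and its layout are illustrative, not taken from any Apple sample.

```swift
import SwiftUI

// A minimal visionOS SwiftUI sketch: on Apple Vision Pro, a standard tap
// gesture fires when the user looks at the view and pinches their fingers,
// so ordinary SwiftUI controls already pick up the gaze + pinch input.
struct GreetingView: View {
    @State private var tapCount = 0

    var body: some View {
        Text("Looked at and pinched \(tapCount) times")
            .padding()
            .glassBackgroundEffect()   // visionOS glass material behind the text
            .onTapGesture {
                tapCount += 1          // triggered by gaze + pinch on Vision Pro
            }
    }
}
```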
Today, with our current computing, we essentially interact with a 2D screen using audio, video, and tactile sensing. We are also interacting with either 1D or 2D data. What do I mean by 1D or 2D data? 1D data is data from a single dimension, essentially one single type of data; in computing, we call these scalar values. Examples of 1D data are a key press on your keyboard or a mouse button click. 2D data refers to a set of two 1D values collected as a unit; in computing, we often represent this as an array or vector of two dimensions. Examples of 2D data are a mouse move or a touch on the screen, where the two values indicate the position of the mouse or the touch location. Multiple 1D and 2D inputs can be combined as input to a computing device, such as a PS5 game controller with two joysticks and multiple buttons that can all be activated at the same time. Essentially, the inputs are all 1D or 2D; although there is an accelerometer in the controller, its output is not processed as 3D input, as it is mostly used to sense the rotation of the controller. So what, then, is 3D data? As an extension of 2D data, we add another dimension. An instance of 3D data can be the geometric location of a point in space, with X, Y, and Z coordinate values. Another example is the displacement vector (in three dimensions) when your hand moves from one location to another. So how do we perform computing on these 1D, 2D, and 3D data?
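To illustrate the distinction, here is a small Swift sketch of what 1D, 2D, and 3D samples might look like as data types. The type names are illustrative only, not part of any Apple API.

```swift
import simd

// A small sketch of the 1D / 2D / 3D data distinction described above.
// KeyPress, TouchSample, and SpatialSample are illustrative names.

// 1D: a single scalar value, e.g. whether a key is pressed.
struct KeyPress { let isDown: Bool }

// 2D: two scalars taken together, e.g. a touch location on a screen.
struct TouchSample { let position: SIMD2<Float> }   // (x, y) in points

// 3D: three scalars taken together, e.g. a point or a hand position in space.
struct SpatialSample { let position: SIMD3<Float> } // (x, y, z) in meters

// A displacement vector in 3D is just the difference of two 3D positions.
func displacement(from a: SIMD3<Float>, to b: SIMD3<Float>) -> SIMD3<Float> {
    b - a
}
```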
Let’s start with 1D computing. Essentially, 1D computing is the ability to process one stream of data (over time) and generate one stream of output values (also over time). In humans, this is akin to our senses of sight, sound, touch, taste, and smell. In 1D computing, we read one piece of information, process it, and output one piece of information. If we read the information across a time interval, we may extract additional value from that 1D data (e.g. pressure from touch sensing, or interpreting temperature rising or falling). 1D computing can offer more value when multiple 1D senses are combined, such as using the input of left and right microphones for a stereoscopic sense of sound direction. This mirrors our natural ability to tell direction by detecting the sounds reaching the left and right ears at different volumes or phases (sounds are waves that can arrive with phase differences after traveling different distances to reach a target). With simple electrical circuits, we have been doing 1D computing in many of our daily-use appliances; a simple set of analog-to-digital converters connected to a basic microprocessor could easily be used for 1D computing.
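As a toy example of combining two 1D streams, here is a Swift sketch that estimates which side a sound came from by comparing left and right microphone levels. The threshold and the simple level-difference rule are illustrative only.

```swift
import Foundation

// A toy sketch of 1D computing: two independent 1D streams (left and right
// microphone levels) are combined to estimate the direction a sound came from.
enum SoundDirection { case left, center, right }

func estimateDirection(leftLevel: Float, rightLevel: Float,
                       threshold: Float = 0.1) -> SoundDirection {
    let difference = leftLevel - rightLevel
    if difference > threshold { return .left }      // louder in the left ear
    if difference < -threshold { return .right }    // louder in the right ear
    return .center
}

// Example: a sample that is clearly louder in the right channel.
let direction = estimateDirection(leftLevel: 0.2, rightLevel: 0.6)  // .right
```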
The folks in the picture above are all doing 2D computing, with paper and laptops. We started 2D computing as early as when humans began to draw and write on cave walls and other physical surfaces that could hold markings. We represent information in 2D. When we started using paper (or papyrus, parchment, cloth, leather), and later paintings and photographs, these markings could be moved or transported to another person, hence providing communication. When mainframes arrived in the 1960s, we got screens that showed 2D data over time; repeating and morphing the display elements a little over time gives us animation. With the advent of high-fidelity (high-resolution) displays, we started to have images, either captured from the real world or synthesized digitally, shown on monitor screens and looking even better than our paintings and photographs. With displays of higher resolution and higher frame rates, we could play video recordings of real-world scenes on monitors. With advancements in computer graphics, we can now create high-fidelity graphics that mirror the real world and provide the basis for Virtual Reality (VR) displays. Modern computer games often employ VR capabilities to provide an immersive experience that lets players see and feel physically involved in the game (albeit with just the display and controllers). This is achieved through digital 3D models that form the foundation of the game’s scene. But everything is virtual: the scene is not based on a real physical location, the actors or creatures you interact with are not real creatures or humans, and your movement is simply a viewport within the scene. Now, is this 3D computing?
Technically, we are still using a 2D screen with 1D or 2D input sensors. Controllers and keyboards are essentially 1D or 2D input devices (a key pressed or a joystick moved). Even with an accelerometer in the controller, the software still maps the movement in 2D. The digital model in VR may be encoded in 3D, but all interactions are processed in 2D, and the results are seen via the 2D display screen. For example, in a first-person shooter like Call of Duty, you use two joysticks to position your operator and shoot within the VR scene: the left joystick captures your relative position on the horizontal plane, while the right joystick captures your aim up/down and left/right. In real 3D, you would carry a real gun at a position in space and shoot at the target at its 3D geometric position.
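To make that 2D mapping concrete, here is a small Swift sketch showing how two joystick (x, y) pairs typically drive movement on a ground plane and the yaw/pitch of the camera. Nothing in the input itself is 3D; the names and speed constants are illustrative.

```swift
import Foundation
import simd

// A sketch of why controller input stays 2D: each joystick delivers an (x, y)
// pair, and a typical first-person game maps those pairs onto movement on the
// ground plane and onto yaw/pitch of the camera.
struct PlayerState {
    var position = SIMD2<Float>(0, 0)   // position on the horizontal plane
    var yaw: Float = 0                  // look left/right, in radians
    var pitch: Float = 0                // look up/down, in radians
}

func apply(leftStick: SIMD2<Float>, rightStick: SIMD2<Float>,
           to state: inout PlayerState, deltaTime: Float) {
    let moveSpeed: Float = 2.0          // meters per second (illustrative)
    let turnSpeed: Float = 1.5          // radians per second (illustrative)

    // Left stick: 2D translation on the plane, rotated by the current yaw.
    let forward = SIMD2<Float>(sinf(state.yaw), cosf(state.yaw))
    let right = SIMD2<Float>(forward.y, -forward.x)
    state.position += (forward * leftStick.y + right * leftStick.x) * moveSpeed * deltaTime

    // Right stick: 2D rotation of the viewpoint (yaw and pitch only).
    state.yaw += rightStick.x * turnSpeed * deltaTime
    state.pitch += rightStick.y * turnSpeed * deltaTime
}
```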
So, for 3D computing, we need to see objects in 3D (with width, breadth, and height) and we need to interact with them in 3D as well (say, touching something at a precise geometric location in space). Currently, sight and sound are our only ways to sense in 3D: we can see the width and height of an object and judge our distance from it, and we can hear the direction of a sound in 3D. We can also use the physical location of our hands and body and interact via small bodily gestures, such as a touch of our fingers or a blink of our eyes.
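On visionOS, that kind of precise 3D interaction is surfaced to apps as spatial gestures. Here is a hedged sketch, assuming a RealityKit view with a single tappable entity, of reading the 3D location of a look-and-pinch tap; the sphere is illustrative content, and the conversion into scene coordinates is left out.

```swift
import SwiftUI
import RealityKit

// A hedged visionOS sketch: a spatial tap on a RealityKit entity reports the
// 3D point where the interaction happened. The sphere is illustrative content;
// it needs input-target and collision components to receive gestures.
struct TapLocationView: View {
    var body: some View {
        RealityView { content in
            let sphere = ModelEntity(mesh: .generateSphere(radius: 0.05))
            sphere.components.set(InputTargetComponent())
            sphere.generateCollisionShapes(recursive: false)
            content.add(sphere)
        }
        .gesture(
            SpatialTapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    // 3D location of the tap in the gesture's coordinate space.
                    print("Tapped at \(value.location3D)")
                }
        )
    }
}
```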
I hope you can see that the majority of our computer-based applications are not 3D computing (there are, however, many industrial applications that are). Apple has defined AVP Spatial Computing to support the natural way we do 3D computing, which means moving and sensing with our bodies in the space around us, rather than through devices attached to our bodies. The AVP has multiple cameras and a LiDAR scanner that detect your hand and finger gestures, as well as objects within the AVP’s field of view. With this in mind, what is the approach towards AVP Spatial Computing? Why don’t we simply port VR applications to run on the AVP? Many “new” devices or glasses are being introduced with a “display” capability similar to the AVP’s. That merely replaces a physical display with a “new” type of display in the glasses; it is not a 3D-capable computing device.
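As an aside on what the headset exposes to developers, here is a hedged sketch of reading hand data through ARKit on visionOS, written from memory of the API; entitlements, authorization checks, and error handling are simplified assumptions to verify against Apple’s documentation.

```swift
import ARKit

// A hedged sketch of reading hand data on Apple Vision Pro via ARKit on visionOS.
func trackHands() async {
    let session = ARKitSession()
    let handTracking = HandTrackingProvider()

    do {
        try await session.run([handTracking])
        for await update in handTracking.anchorUpdates {
            let anchor = update.anchor
            // Each anchor reports which hand it is and its transform in world space.
            print("\(anchor.chirality) hand at \(anchor.originFromAnchorTransform.columns.3)")
        }
    } catch {
        print("Hand tracking unavailable: \(error)")
    }
}
```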
One approach would be to take existing applications and make them 3D-like, such as viewing content on a 2D display while mimicking the physical environment in which we would normally watch it. You can see this in the cinematic experiences provided by Apple TV+ and Disney Plus, which let you watch movies and videos as if you were seated in a real cinema.
You watch the movie or video on a large 2D screen positioned a few feet in front of you, and the sound from the movie appears to come from that screen.
To ease migration, Apple provides development libraries compatible with iPad applications, so they can be displayed in a 2D window within the 3D space of the AVP. The AVP’s interactions can also be mapped onto the 2D touch inputs these iPad applications expect.
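Here is a minimal sketch of that “flat window in 3D space” model on visionOS: a standard SwiftUI scene, essentially the same code an iPad app would use, which the system presents as a 2D window the user can place anywhere in the room.

```swift
import SwiftUI

// A minimal sketch: a standard SwiftUI window, which visionOS presents as a
// flat 2D panel floating in the user's space.
@main
struct FlatWindowApp: App {
    var body: some Scene {
        WindowGroup {
            VStack(spacing: 16) {
                Text("An ordinary 2D interface")
                Button("Tap (look and pinch)") { print("Pressed") }
            }
            .padding()
        }
    }
}
```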
Another way to support spatial computing is to use it to relive 3D experiences through photos and videos. A simple way would be to view the 3D content within a 2D display inside the AVP, which would easily support existing 3D media content. An even better approach is to provide a fully immersive experience, where the media fills the entire field of view of the AVP. You can see examples of this in Apple’s Immersive Video series on the AVP.
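Here is a hedged sketch of that fully immersive mode on visionOS: an ImmersiveSpace scene takes over the whole field of view instead of living inside a flat window. The placeholder sphere stands in for real media content, and in a real app the space would be opened with SwiftUI’s openImmersiveSpace action.

```swift
import SwiftUI
import RealityKit

// A hedged visionOS sketch: an ImmersiveSpace fills the user's surroundings
// rather than appearing inside a window. The sphere is placeholder content.
// In a real app this would be (part of) the @main App's scene body.
struct ImmersiveMediaApp: App {
    var body: some Scene {
        ImmersiveSpace(id: "Immersive") {
            RealityView { content in
                let sphere = ModelEntity(mesh: .generateSphere(radius: 0.2),
                                         materials: [SimpleMaterial(color: .cyan, isMetallic: false)])
                sphere.position = [0, 1.5, -1]   // roughly eye height, one meter ahead
                content.add(sphere)
            }
        }
        .immersionStyle(selection: .constant(.full), in: .full)
    }
}
```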
One immersive video features a dinosaur moving out of the screen towards you (you need the AVP for the full experience, which you can try at an Apple Store).
Since the AVP is essentially a headset with see-through cameras, Augmented Reality is a natural application. Apple has just added object tracking capability in visionOS 2.0, which makes it easy to build applications that recognize and track the location and motion of objects in the AVP’s field of view. Many digital twin applications will benefit from this, and we can expect many industrial enterprises to build such applications on the AVP. visionOS 2.0 adds many development frameworks to support the needs of such applications (e.g. Volumetric APIs, Custom Hover Effects, Multiple Video Views) and to add 3D annotations on top of what you see through the AVP’s cameras.
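Here is a heavily hedged sketch of what that object tracking might look like in code, written from memory of the visionOS 2 ARKit additions; the file name is a placeholder, and the exact API names should be checked against Apple’s documentation.

```swift
import Foundation
import ARKit

// A heavily hedged sketch of visionOS 2 object tracking: a reference object
// (trained with Create ML) is loaded, and ARKit reports an anchor whenever
// that physical object is detected. "Toy.referenceobject" is a placeholder.
func trackReferenceObject() async {
    guard let url = Bundle.main.url(forResource: "Toy", withExtension: "referenceobject") else {
        print("Reference object file not found")
        return
    }
    do {
        let referenceObject = try await ReferenceObject(from: url)
        let provider = ObjectTrackingProvider(referenceObjects: [referenceObject])

        let session = ARKitSession()
        try await session.run([provider])

        for await update in provider.anchorUpdates {
            // The anchor's transform is where a 3D annotation would be attached.
            print("Object seen at \(update.anchor.originFromAnchorTransform.columns.3)")
        }
    } catch {
        print("Object tracking failed: \(error)")
    }
}
```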
Another interesting suite of applications would be what we often do on a tabletop. Humans have used their hands and tools to craft things on tabletops for ages. visionOS 2.0 introduces a new framework called TabletopKit to support building tabletop applications. TabletopKit lets developers quickly build shared, collaborative app experiences centered around a table, such as board games or a manufacturing workstation. GRL Games is using TabletopKit to create Haunted Chess, a murder mystery board game where players use 3D chess pieces and holographic cards to help solve the mystery. Check out Apple’s Developer video on TabletopKit here.
It would be interesting to see a 3D version of Mahjong, a popular tile game usually played at a table by four people.
Another interesting suite of use cases is 3D telepresence: the ability for multiple individuals to remotely join a common 3D session or room and work together, using their faces (Personas) and hands to interact with the other participants. Do you remember the Cisco TelePresence system from some years ago (https://en.wikipedia.org/wiki/Cisco_TelePresence)?
It started as a room-to-room teleconferencing solution (in October 2006) where each room had a half-oval table seating 3 to 6 participants. During a conference call, participants could see each other across three large TV screens, with cameras showing the participants in the other room. The system tried to mimic having the 6 to 12 participants in one meeting room, where they could see and hear each other (with directional sound) across the oval table. Now, with the AVP and its Persona feature, it would be possible to build a virtual 3D room that participants join remotely, much like Cisco TelePresence, but as a 3D virtual room session. This may sound similar to some of the VR meeting solutions from Meta and others, but those cannot support natural 3D interaction. On the AVP, Zoom and FaceTime let an AVP user join a web conference, but they do not offer a 3D room where participants can interact in a 3D fashion, or manipulate other 3D entities (defined in a Universal Scene Description file) placed in the room.
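As a small building block for that idea, here is a hedged RealityKit sketch that loads a Universal Scene Description (USDZ) model as an entity each participant’s app could place in a shared immersive space. The file name and scale are illustrative, and the synchronization layer (e.g. SharePlay) is out of scope.

```swift
import Foundation
import RealityKit

// A hedged sketch: load a USDZ model as a RealityKit entity that each
// participant's app could place into a common immersive space.
// "ConferenceModel.usdz" is a placeholder name.
func loadSharedModel() throws -> Entity {
    guard let url = Bundle.main.url(forResource: "ConferenceModel", withExtension: "usdz") else {
        throw CocoaError(.fileNoSuchFile)
    }
    let entity = try Entity.load(contentsOf: url)
    entity.scale = [0.5, 0.5, 0.5]   // shrink to tabletop size (illustrative value)
    return entity
}
```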
Just to summarize: with Spatial Computing, we need to see in 3D, hear in 3D, and feel in 3D. The AVP is probably the only computer today capable of supporting these 3D features; no other device offers this level of 3D capability. Our spatial computing needs are often confused with what we get from modern gaming and VR headsets, which all use 2D interaction with some form of simulated 3D effect (e.g. using controllers to move through a virtual 3D environment). We should not underestimate the AVP platform: when a new technology platform is introduced, it often faces objections simply because its potential is not yet appreciated. Just look at the Compaq portable computer when it was first introduced in 1982, and how our mobile computing has evolved since.
When it first became available, the consensus was that it was only useful for enterprise users (running DOS applications), given the wide availability of desktop PCs. At that time, the world had just taken 2D computing to new levels in fixed locations (a desktop could be stationed in any room, not just a computer terminal room), and we did not even have the graphics capability to push 2D computing further. The AVP brings a new dimension to our computing capability, just as the Compaq portable brought users mobility. We are just starting the world of 3D, or Spatial, Computing. The hardware will get lighter and more capable in time, as more 3D or Spatial Computing applications appear. I hope this article helps you see the AVP in a different light.