Apple Vision Pro and Computational Reality

Cheng Jang Thye
8 min read · Feb 12, 2024

By now, you have probably read many reviews of the Apple Vision Pro and may have drawn your own conclusion on whether it is worth your money. I am not here to tell you how great the Vision Pro is or what a waste of money it is. I just want to draw some parallels with the history of human-computer interface development and share some thoughts on how spatial computing could evolve on the Apple Vision Pro.

(Source: Apple.com)

In my previous article, I discussed how the Vision Pro could be the harbinger of the next generation of human interface devices:

https://medium.com/antaeus-ar/apple-vision-pro-harbinger-of-the-next-generation-of-human-interface-devices-988d5b554e23

Now that the technical specifications of the device are available, everyone can get some idea of what the experience could be like. There are many YouTube videos showing what it is like to use the Vision Pro. Technically, this is inadequate, since YouTube videos are two-dimensional (2D) while the Vision Pro lets you see in three dimensions (3D). Nevertheless, it is still a good approximation. Some of you may be wowed by the new 3D experience, while others will conclude that we are still experiencing mostly 2D-style interactions in the Vision Pro due to the lack of applications that are 3D capable. Well, this is to be expected: the 3D experience is so new that not many applications have been built with it in mind. Let's take a walk back through the history of the human-computer interface to get an idea of how the new experience can evolve.

Interactive human-computer interfaces started with the terminal. We typed a line of input commands, and the computer (typically a mainframe or minicomputer) replied with one or more lines of text, probably performing thousands of computations on thousands of bytes (KB) of information along the way. This might seem very limited, but we soon found ways to overcome the limitation by extending into the time dimension, outputting multiple lines of text in sequence. For example, it is possible to get a program to give us something of a 2D experience:

(If you copy the following lines into a fixed-width font window, you'll be able to see the picture more clearly)

_________________________¶¶¶¶¶
_______________________¶¶¶¶¶11¶¶
_____________________¶¶¶¶111111¶¶
___________________¶¶¶111111111¶¶¶¶
__________________¶¶¶11111111111¶¶¶¶¶¶
_________________¶¶111111111111111111¶¶¶
________________¶¶1111111111111111111111¶
_______________¶¶1111111111111111111111¶¶
_____________¶¶¶111111111111111111111¶¶¶
__________¶¶¶¶¶1111111111111111111111¶¶¶
_________¶¶¶111111111111111111111111111¶¶¶
_______¶¶¶1111111111111111111111111111111¶¶¶
______¶¶11111111111111111111111111111111111¶¶
_____¶¶1111111111111111111111111111111111111¶¶
____¶¶111111111111111111111111111111111111111¶¶
___¶¶11111111111111111111111111111111111111111¶¶
__¶¶1111111111111111111111111¶11111111111111111¶¶
__¶¶¶111111111111111111111¶¶¶¶11111111111111111¶¶
_¶¶¶¶¶¶¶¶1111111111111¶¶¶¶¶¶¶¶¶11111111111111111¶¶
_¶¶¶¶¶¶¶¶¶¶¶¶111111¶¶¶¶¶¶¶¶¶¶¶¶¶1111111111111111¶¶
¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶111111111111111111111¶
¶¶1111¶________¶¶¶_________¶111111111111111111111¶
¶¶111¶_____¶¶¶_¶¶____¶¶_____¶11111111111111111111¶
¶¶111¶_____¶¶¶_¶¶___¶¶¶¶____¶11111111111111111111¶
¶¶1111¶______¶¶¶¶____¶¶____¶111111111111111111111¶
_¶1111¶¶___¶¶¶___¶¶_______¶¶11111111111111111111¶¶
_¶¶11111¶¶¶¶_______¶¶¶¶¶¶¶1111111111111111111111¶¶
__¶1111¶¶¶___________¶¶¶11111111111111111111111¶¶
__¶¶1¶¶¶_______________¶¶¶¶¶¶¶¶¶¶¶1111111111111¶¶
___¶¶¶¶___$$$$$$$$$$$$$¶¶_____¶¶¶¶¶¶¶111111111¶¶
____¶¶1¶¶$$$$$$$$_____¶¶____________¶¶¶111111¶¶
_____¶¶11¶¶_________¶¶¶_______________¶¶¶111¶¶
______¶¶11¶¶¶______¶¶___________________¶11¶¶
_______¶¶1¶__¶¶¶__¶¶____________________¶¶¶¶
_________¶¶¶____¶¶_____________________¶¶¶
___________¶¶¶_______________________¶¶¶
____________¶¶¶¶_________________¶¶¶¶
________________¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶

Source: https://fsymbols.com/text-art/

Computers of this era were rather simplistic in their data processing. Most of what we needed then was to store and retrieve information. This might be to create a bank account, transfer money from one account to another, register the birth of a child, keep track of immigration, and so on. Information was typically stored in a file structure where records could be retrieved sequentially or by some kind of search criteria. Information was highly encoded (meaning we used codes to represent information, such as “M” for male, rather than storing the complete spelling of the word). The data we managed was on the scale of KBs (kilobytes, or thousands of bytes) to MBs (megabytes, or millions of bytes).
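As a rough illustration (the record layout, field widths, and codes below are my own invention, not from any particular system), a Swift sketch of such a highly encoded, fixed-width record might look like this:

```swift
import Foundation

// Hypothetical fixed-width record of the flat-file era, sketched in Swift.
// Fields are short codes ("M" for male, "SG" for Singapore) rather than full words,
// so each record stays tiny and a whole file fits in KBs to MBs.
struct AccountRecord {
    let accountNumber: String   // assumed 10 characters
    let genderCode: Character   // "M" or "F" instead of "Male"/"Female"
    let countryCode: String     // assumed 2 characters, e.g. "SG"
    let balanceCents: Int       // money as integer cents

    /// One fixed-width line, the way a flat file or tape record would store it.
    var flatFileLine: String {
        // 10 + 1 + 2 + 12 = 25 bytes per record
        accountNumber + String(genderCode) + countryCode + String(format: "%012ld", balanceCents)
    }
}

let record = AccountRecord(accountNumber: "0000123456", genderCode: "M",
                           countryCode: "SG", balanceCents: 150_000)
print(record.flatFileLine)   // "0000123456MSG000000150000"
// A million such records is only about 25 MB, right at the scale described above.
```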

Let’s now move on to the 2D human-computer interface. This is the current common interface standard, and we are all very familiar with it. But when it first became available, it too was limited in the user experience it offered (text displayed in rows and columns), and initially we were still consuming applications built for a single dimension (we still ran terminal-based applications). However, we started to add depth to the interface: we added sub-windows that could be layered in a stack. Correspondingly, we started to organize our data into hierarchies or graph structures that encode the relationships between data, and the sub-windows were used to navigate those hierarchies and relationships. The data we managed was in the range of MBs (megabytes) to GBs (gigabytes).
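A minimal sketch of that kind of hierarchical data (the node type and the sample tree are illustrative, not taken from any particular system): this is the sort of structure a stack of sub-windows lets you drill into, one level at a time.

```swift
// A simple tree of named nodes: each sub-window in a 2D interface typically
// shows one level of a hierarchy like this and opens a child in a new window.
final class Node {
    let name: String
    var children: [Node] = []
    init(_ name: String) { self.name = name }
    func add(_ child: Node) { children.append(child) }
}

// Depth-first walk, indented by depth, the way a folder tree is displayed.
func printTree(_ node: Node, depth: Int = 0) {
    print(String(repeating: "  ", count: depth) + node.name)
    for child in node.children { printTree(child, depth: depth + 1) }
}

let customers = Node("Customers")
let alice = Node("Alice")
alice.add(Node("Account 0001"))
alice.add(Node("Account 0002"))
customers.add(alice)
printTree(customers)
// Customers
//   Alice
//     Account 0001
//     Account 0002
```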

(Source: Photo by Raymond Hsu on Unsplash)

As the interface improved in fidelity (the 2D look more closely resembling reality through ever higher pixel densities) and we added audio, image, and video capabilities to our computers (to play back and record sound and video), our programs began processing much more data. We started to store information in multimedia formats (audio, images, video). Technically, this media content is still a single-dimensional encoding of its real-world analog. A sound recording is a time series of values captured through analog-to-digital sampling of sound. An image is a one-time sample of pixels from a rectangular array of light sensors capturing light in a moment. A video recording is a time series of such images. They are all encodings of data captured through sensors that convert a physical observation into a set of digital values (e.g. a frame of video), and the video is seen by rapidly displaying the frames in proper time sequence (e.g. 30 frames per second).
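Some back-of-the-envelope arithmetic (the sample rates and resolutions are typical figures I have chosen, not the article's) shows why these sensor encodings pushed data volumes up so quickly:

```swift
// Rough arithmetic for raw, uncompressed media sizes.

// Audio: CD-quality sampling is a time series of values.
let sampleRate = 44_100.0          // samples per second
let bytesPerSample = 2.0           // 16-bit samples
let channels = 2.0                 // stereo
let audioBytesPerMinute = sampleRate * bytesPerSample * channels * 60
print("1 minute of raw CD audio ≈ \(audioBytesPerMinute / 1_000_000) MB")        // ≈ 10.6 MB

// Video: each frame is a grid of pixels sampled from the sensor,
// and playback is just showing frames rapidly in time order.
let width = 1920.0, height = 1080.0
let bytesPerPixel = 3.0            // 24-bit colour
let framesPerSecond = 30.0
let videoBytesPerMinute = width * height * bytesPerPixel * framesPerSecond * 60
print("1 minute of raw 1080p video ≈ \(videoBytesPerMinute / 1_000_000_000) GB") // ≈ 11.2 GB
```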

(Source: Photo by Jakob Owens on Unsplash)

As our 2D display capabilities grew (in both display technology and the GPU technology that generates the bits to be drawn on the screen), we started to have 3D viewing simulations. Visual cues in the interface, realistic images, and surface textures let our 2D displays render ever more lifelike versions of the objects we see in the real physical world. The popularity of 3D games and industrial CAD/CAM applications helped propel the technology toward ever higher fidelity in displaying 3D objects on a 2D screen (such as point-cloud representations of a physical airplane engine or an F1 racing car that can be viewed from any angle). The data we have to manage now scales to multiple GBs and even TBs (terabytes).
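For a sense of scale (the per-point layout and the point count are rough assumptions of mine), a point-cloud model like the airplane-engine example quickly reaches gigabytes:

```swift
// Each point in a point cloud carries a position and a colour; detailed
// industrial scans can run to hundreds of millions of points.
struct Point {
    var x: Float, y: Float, z: Float   // 3 × 4 bytes of position
    var r: UInt8, g: UInt8, b: UInt8   // 3 × 1 byte of colour
}

let bytesPerPoint = MemoryLayout<Point>.stride   // 16 bytes including alignment padding
let pointCount = 500_000_000                     // an assumed detailed scan
let totalBytes = bytesPerPoint * pointCount
print("≈ \(totalBytes / 1_000_000_000) GB for one model")   // ≈ 8 GB
```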

Just look at the realism in a recent session of Call of Duty (Modern Warfare III):

(Source: Captured from a game session on PlayStation)

This, however, does not scale well as we increase the number of objects to be viewed simultaneously or in the same session. Lots of information needs to be downloaded to the display device (laptop, desktop, or console) to sustain a continuous interactive session. So, instead of encoding data captured from physical objects, we now create abstract models of physical objects and use 3D computer graphics algorithms to render these objects on our 2D displays. This is what is usually known as virtual reality, but I prefer the term computational reality, as we now have more ways (such as AI tools) to create lifelike views of almost anything, and multiple people can interact with this reality simultaneously. The transient states of a computational reality can also be stored and stood up (instantiated) when needed. This is similar to saving state in computer games, where we can recover from a saved point and replay the game. But I would imagine computational reality being capable of saving state continuously, just like our physical reality.
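Purely as a hypothetical sketch of what continuously saved, instantiable state could look like (none of these types or fields come from an actual product), the idea might be expressed like this:

```swift
import Foundation

// Hypothetical "computational reality" state: we store abstract models of objects,
// not raw sensor data, and snapshot the whole scene over time.
struct ObjectState: Codable {
    var id: UUID
    var modelName: String      // reference to an abstract 3D model
    var position: [Double]     // x, y, z
    var orientation: [Double]  // quaternion
}

struct RealitySnapshot: Codable {
    var timestamp: Date
    var objects: [ObjectState]
}

// Saving state continuously is then just appending snapshots to a timeline...
var timeline: [RealitySnapshot] = []
timeline.append(RealitySnapshot(timestamp: Date(), objects: []))

// ...and "standing up" the reality again means decoding the snapshot you want.
let saved = try! JSONEncoder().encode(timeline.last!)                    // force-try for brevity
let restored = try! JSONDecoder().decode(RealitySnapshot.self, from: saved)
print("Restored snapshot from \(restored.timestamp)")
```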

With the advent of 3D interface technologies, we started to see 3D headsets and 3D-capable (TV) screens with special glasses. Many of the early products had poor fidelity and usability, until the appearance of the Apple Vision Pro.

You can find more information about how the Vision Pro interface works and how applications can be built here:

https://developer.apple.com/visionos/learn/

But essentially you will work with windows, volumes, and spaces. Windows let you fit your existing 2D application into a 2D display element on the Vision Pro. This is vital for accessing the large catalog of 2D applications that run on current devices, just as 2D displays first supported 1D applications running in a terminal. A number of iPad applications were supported on day one of the Vision Pro launch, and Apple has also made it easy to view your MacBook screen in the Vision Pro through the Mac Virtual Display feature.
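As a minimal sketch (the app and view names are my own placeholders), a plain SwiftUI WindowGroup is all it takes to present an existing 2D interface as a window on visionOS:

```swift
import SwiftUI

@main
struct FlatPanelApp: App {
    var body: some Scene {
        // On visionOS this appears as a flat 2D panel floating in the user's space.
        WindowGroup {
            // Any ordinary 2D SwiftUI view works here unchanged.
            VStack {
                Text("Hello from a 2D window on Vision Pro")
                Button("Tap me") { print("Tapped") }
            }
            .padding()
        }
    }
}
```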

Volumes enable the addition of depth to the 2D view of an existing application, or you can build an entirely new application that lives in the 3D space defined by the volume. Spaces let you use the entire field of view of the Vision Pro for an immersive 3D experience of a world, an experience similar to the teleportation mentioned by some YouTubers and to the portal concept in my earlier article. 3D games fit this metaphor easily, and the same technology could be adapted to build new kinds of application services for computational reality. This is probably where the killer app for the Apple Vision Pro will be found.
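Here is a minimal sketch of the other two presentation styles (the identifiers, asset names, and view contents are illustrative placeholders): a volume is a bounded 3D window, while a space takes over the user's whole surroundings.

```swift
import SwiftUI
import RealityKit

@main
struct SpatialApp: App {
    var body: some Scene {
        // A volume: a window with depth, sized in physical units.
        WindowGroup(id: "globe-volume") {
            Model3D(named: "Globe")   // load a bundled 3D asset
        }
        .windowStyle(.volumetric)
        .defaultSize(width: 0.5, height: 0.5, depth: 0.5, in: .meters)

        // A space: an immersive scene that fills the user's surroundings.
        ImmersiveSpace(id: "immersive-world") {
            RealityView { content in
                // Build the immersive scene's entities here.
                content.add(ModelEntity(mesh: .generateSphere(radius: 0.2)))
            }
        }
        .immersionStyle(selection: .constant(.full), in: .full)
    }
}

// Opening the space from a button somewhere in the UI:
// @Environment(\.openImmersiveSpace) var openImmersiveSpace
// Button("Enter") { Task { await openImmersiveSpace(id: "immersive-world") } }
```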

To support such applications, the amount of data we have to manage starts at the GB level, and the processing complexity calls for multiple GPUs. Today's Vision Pro might not have enough GPU power for the most sophisticated 3D rendering applications, but this will certainly improve over time. Creating a sufficient level of immersion requires a fairly complex project and sizable effort. It will be interesting to see how this evolves in the market, but this is certainly the dawn of a new era of high-fidelity, immersive spatial computing and computational reality.

Clever use of time as another dimension may be an interesting concept for 3D, where you can freeze a 3D scene and let the user move their viewpoint to different locations within it (check out JigSpace). Apple has shown the use of Persona for live communication between Vision Pro users: the user's face is a form of computational reality, created from the Vision Pro's camera sensors and AI algorithms that generate a live view of the face otherwise covered by the headset. Some modern movies, like Avatar, are created in a similar way: sensors are attached to the actors' bodies, filming is more like recording the positions and accelerations of those sensors as the actors act out the scenes, and the final movie is generated through CGI.

There will certainly be new technologies created to produce 3D content more effectively, just as happened in the 2D world. The next few years will be very interesting times for spatial computing and computational reality.


Cheng Jang Thye

An IT guy by profession, a fan of multiple sports, a husband with a loving wife and family, and a thinker wondering what is happening to our world.