In the thirty years since development of Ircam’s Spat began, professional and consumer audio technology has progressed along several parallel threads, including sensory immersion, electronic transmission, content formats and creation tools. In the not-too-distant future, the authoring, consumption or performance of a significant portion of our media and music experiences may leverage a global set of frameworks and ecosystems often referred to as the Metaverse. In time, more and more of our media experiences (currently divided among separate content industries such as music, movies and podcasts) may become cloud-based, navigable, non-destructive, ubiquitous, interoperable and adaptable to listener conditions. In this talk, we attempt to distill elements of this vision and some of the challenges it entails, including the adoption of a common spatial audio rendering description model and “externalized” binaural audio reproduction for AR/VR sound.