In speech, a diphone is defined as a transition between two phonemes. In a musical context, this could be taken to mean not only a transition between two vocal sounds, but more generally a transition between
two sounds of any kind, whether instrumental, vocal, or recorded « sound objects ». As such, the
definition of a musical « diphone » could also be extended to include a single stable sound or silence.
The idea of synthesis using diphones was conceived in the late 1980s by Xavier Rodet, in an attempt to
address the problem of successfully synthesizing a musical phrase using both transient and stable sounds.
Using traditional analog studio techniques, a series of transient and stable sounds can be concatenated,
or pieced together, by splicing small pieces of tape end to end. In today’s digital studios, this is done by
cross-fading. In both cases, the result often sounds unconvincing because the inner
contents of the spliced or faded sounds generally do not match. In Rodet’s system of « generalized
diphone control and synthesis », sounds are carefully analyzed, and it is the analysis data which is
« spliced » or « faded » together by interpolating the values between neighboring segments of analysis
data. The resulting interpolated analysis can then be resynthesized, providing a much cleaner and more
natural result than could be obtained through simple tape-splicing or cross-fading.
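
The idea of interpolating analysis data, rather than mixing audio, can be illustrated with a minimal sketch. It is not Diphone's actual implementation. It assumes, purely for illustration, that an analysis frame is a list of additive-synthesis partials, each a (frequency, amplitude) pair, and that both frames list their partials in matching order. The names `Partial`, `Frame`, and `interpolate_frames` are hypothetical, and the resynthesis step is omitted.

```python
from typing import List, Tuple

# Hypothetical data model, assumed for this sketch only:
Partial = Tuple[float, float]   # (frequency in Hz, linear amplitude)
Frame = List[Partial]           # one analysis frame: a list of partials


def interpolate_frames(a: Frame, b: Frame, steps: int) -> List[Frame]:
    """Linearly interpolate, partial by partial, from frame a to frame b.

    Returns steps + 1 frames: the first corresponds to a, the last to b.
    Assumes both frames hold the same number of partials in matching order.
    """
    if len(a) != len(b):
        raise ValueError("frames must have the same number of partials")
    if steps < 1:
        raise ValueError("steps must be at least 1")

    frames: List[Frame] = []
    for i in range(steps + 1):
        t = i / steps  # 0.0 at the end of the first segment, 1.0 at the start of the second
        frames.append([
            (fa + t * (fb - fa), aa + t * (ab - aa))
            for (fa, aa), (fb, ab) in zip(a, b)
        ])
    return frames


# Last frame of one segment and first frame of the next (made-up values).
end_of_first: Frame = [(440.0, 1.0), (880.0, 0.5)]
start_of_second: Frame = [(660.0, 0.8), (1320.0, 0.2)]

transition = interpolate_frames(end_of_first, start_of_second, steps=4)
for frame in transition:
    print(frame)
```

In this sketch each partial glides from its value in the first segment to its value in the second. A cross-fade of the two audio signals would instead contain both sets of frequencies at once during the overlap, which is one way the inner contents of the two sounds can fail to match.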
Diphone was first implemented on UNIX workstations in 1988 by Xavier Rodet and Philippe Depalle using source-filter synthesis and, later, additive synthesis. There was only a very rudimentary graphical user
interface which allowed analysis data segments to be placed in a consecutive manner on the screen. The
first Macintosh version of Diphone, completed in 1996, was (and continues to be) programmed by Adrien