Max-Planck-Institut für Informatik

Spotlight: Released Monday, 04 June 2012

Automated Music Processing

Meinard Müller

The digital revolution has brought about a massive increase in the availability and distribution of music-related documents of various modalities, comprising textual, audio, and visual material. The development of techniques and tools for organizing, structuring, retrieving, navigating, and presenting music-related data has therefore become a major strand of research; the field is often referred to as music information retrieval (MIR). Major challenges arise from the richness and diversity of music in form and content, leading to novel and exciting research problems. As examples, we discuss three current research tasks in the context of content-based audio retrieval.

The task of audio identification (also called audio fingerprinting) is to identify a particular audio recording within a given music collection using a small audio fragment as query input. Even for large-scale music collections and in the presence of signal distortions such as background noise, MP3 compression artifacts, and uniform temporal distortions, recent algorithms for audio identification yield good recognition rates and are used in commercial products such as Shazam. However, existing identification algorithms cannot deal with strong non-linear temporal distortions or with other musically motivated variations concerning, for example, articulation or instrumentation.
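The core idea behind such fingerprinting systems can be illustrated with a toy sketch: hash characteristic spectral peaks, then let matching hashes vote for a (recording, time-offset) pair. This is a drastic simplification of real systems (which use robust constellation maps of many peaks per frame, not the single strongest bin assumed here), but the hash-and-vote mechanism is the same.

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Magnitude spectrogram of a mono signal (frames x frequency bins)."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def fingerprints(spec, fan_out=5):
    """Toy fingerprints: pair each frame's strongest bin with the
    strongest bins of the next few frames; a hash is (f1, f2, dt),
    stored together with the anchor frame index."""
    peaks = [(t, int(np.argmax(spec[t]))) for t in range(len(spec))]
    prints = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            prints.append(((f1, f2, t2 - t1), t1))
    return prints

def identify(db, query_prints):
    """Each hash shared between query and a database recording votes
    for one (recording, time-offset) pair; the pair with the most
    votes identifies the recording and the query's position in it."""
    votes = {}
    for song_id, song_prints in db.items():
        index = {}
        for h, t in song_prints:
            index.setdefault(h, []).append(t)
        for h, tq in query_prints:
            for ts in index.get(h, []):
                key = (song_id, ts - tq)
                votes[key] = votes.get(key, 0) + 1
    return max(votes, key=votes.get)
```

Because matching is based on exact hash collisions plus a consistent time offset, this scheme is fast and robust to additive noise, but, as noted above, it breaks down under strong non-linear tempo changes, which shift the offsets apart.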

The task of audio matching can be seen as an extension of audio identification. Given a short query audio fragment, the goal is to automatically retrieve all musically related fragments contained in the documents (e.g., audio recordings, video clips) of a given music collection. As opposed to traditional audio identification, one allows semantically motivated variations as they typically occur in different performances and arrangements of a piece of music. For example, two performances may exhibit significant non-linear global and local differences in tempo, articulation, and phrasing, as well as variations in executing ritardandi, accelerandi, fermatas, or ornamentations. Furthermore, one has to deal with considerable dynamical and spectral deviations, which are due to differences in instrumentation, loudness, tone color, accentuation, and so on. Recent matching procedures, which can deal with some of these variations, are based on so-called chroma features. Such features closely correlate with the musical aspect of harmony and have turned out to be a powerful mid-level representation applicable to a variety of multimodal retrieval scenarios.

As an illustration, Figure 1 shows an interface for the simultaneous presentation of visual data (sheet music) and acoustic data (audio recordings). The first measures of the third movement (Rondo) of Beethoven's Piano Sonata Op. 13 (Pathétique) are shown. Using a visual query (the measures marked in green; the theme of the Rondo), all audio documents that contain matches are retrieved. A single audio recording may contain several matches (green rectangles); the theme occurs four times in the Rondo.
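The chroma features mentioned above can be sketched in a few lines: fold the spectral energy of each frame onto the twelve pitch classes and normalize away loudness. This is a much-simplified version of what production toolboxes compute (e.g., `librosa.feature.chroma_stft`, which additionally handles tuning estimation and smoothing); the A4 = 440 Hz reference and the cutoff below A0 are assumptions of this sketch.

```python
import numpy as np

def chroma(x, sr=8000, n_fft=1024, hop=512):
    """Simplified chroma features: map each FFT bin to its pitch class
    (C, C#, ..., B) relative to A4 = 440 Hz and sum the energies."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1 / sr)
    c = np.zeros((len(spec), 12))
    for k, f in enumerate(freqs):
        if f < 27.5:  # skip DC and bins below A0
            continue
        pitch_class = int(round(12 * np.log2(f / 440.0))) % 12
        c[:, pitch_class] += spec[:, k]
    # normalize each frame so that loudness differences cancel out
    norms = np.linalg.norm(c, axis=1, keepdims=True)
    return c / np.maximum(norms, 1e-12)
```

Because octave information is discarded and each frame is normalized, the same harmonic progression played in a different octave, at a different loudness, or on a different instrument yields similar chroma sequences, which is exactly the invariance that audio matching exploits.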
Audio identification and audio matching are instances of fragment-level retrieval scenarios, where time-sensitive similarity measures are needed to locally compare the query with subsections of a document. In contrast, in document-level retrieval, a single similarity measure is used to globally compare entire documents. One recently studied instance of document-level retrieval is referred to as cover song identification, where the goal is to identify different versions of the same piece of music within a database (including cover, remake, and remix versions). In this context, too, chroma-based audio features in combination with local alignment techniques (e.g., the Smith-Waterman algorithm) have been applied successfully. By using so-called shingling techniques in combination with locality-sensitive hashing (LSH), the document search can be significantly accelerated.
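The local alignment step can be sketched with a textbook Smith-Waterman recurrence over two feature sequences. For readability the sketch below compares sequences of hashable symbols (think of them as quantized chroma frames) with a simple match/mismatch score; real cover song systems instead score continuous chroma vectors and add transposition invariance, both of which are omitted here.

```python
def smith_waterman(a, b, match=1.0, mismatch=-1.0, gap=-0.5):
    """Smith-Waterman local alignment score between sequences a and b.
    H[i][j] is the best score of any local alignment ending at
    a[i-1], b[j-1]; clamping at 0 lets alignments restart anywhere,
    which is what makes the comparison local rather than global."""
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

A cover version that shares the theme of a piece, even when embedded in otherwise different material, produces a high local alignment score, whereas an unrelated piece scores near zero; ranking a database by this score yields the document-level retrieval described above. Since the recurrence costs O(nm) per document pair, shingling with LSH is used to prune the candidate set before alignment.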

About the author:
Meinard Müller has been Junior Research Group Leader of the group Multimedia Information Retrieval and Music Processing within the Cluster of Excellence on Multimodal Computing and Interaction since 2007 and is affiliated with the MPI-INF. He received his PhD in Computer Science from the University of Bonn in 2001 and completed his Habilitation there in 2007. Besides music information retrieval (MIR), he is interested in multimedia information retrieval and music signal processing. In 2012 he accepted a call from the University of Bonn.

Contact: meinard (at)
