A Look at Some Phenomena Underlying Timing and Rhythm

Douglas Eck
Departments of Cognitive Science and Computer Science
Indiana University

Introduction

In starting a discussion about timing and rhythm phenomena it's worth observing that humans are very good at discerning small changes in the duration of an auditory signal. Creelman [1962] provides evidence for a discrimination threshold of roughly 15% of total duration for filled intervals in the range 40-400 msec. A better threshold of 10% is shown for filled and open intervals in the range 200-2000 msec [Hirsh, Monahan, Grant and Singh 1990, introduction]. Since listeners cannot synchronize effectively with IOIs outside the range of 200 msec to 1800 msec [Fraisse 1982] we may conclude that people are very good at perceiving duration across the range of what might be considered rhythmic tempos.

No one debates the correctness of these data. However, one important debate in the field of rhythm perception is how to reconcile data which calls for local sensitivity with data which requires structural sensitivity. By local sensitivity I mean two things: sensitivity to local time duration and sensitivity to local attributes of an auditory event such as pitch and amplitude. By structural sensitivity I mean the ability to find global regularities such as tonal or metrical groupings in a stream of music. One can also think of the distinction between local and structural information in terms of the resolving power required by models sensitive to such information. For example, since the data on local information shows that listeners can judge durational differences as small as 6 msec (15% of 40 msec), one may conclude that a model accounting for local effects must resolve certain durations down to at least 6 msec. Also, since the data on structural sensitivity shows that listeners focus attention on regularly recurring points in the temporal stream in a way that is invariant across different absolute durations, one may conclude that a model sensitive to structural effects would have the power to sense periodicities in general.

Note that a commitment is not required on how such a resolving device would be implemented. Consider the argument one could make about the 6 msec duration example: "Since a period of 6 msec equals a frequency of 166.6Hz and since signal processing theory requires that reliable sampling be done at a minimum of twice the fastest signal being sampled--the Nyquist frequency--a model needs a clock with a frequency of 333.3Hz or a period of 3 msec in order to get the desired resolving power." This argument is flawed: a single fast clock is only one way to gain the ability to resolve durations down to 6 msec and in fact it is probably the wrong one given that humans can only achieve such resolution after lots of training on specific kinds of signals.
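The arithmetic behind the quoted argument can be checked directly. The sketch below is only an illustration of the numbers involved; the function names are hypothetical, and nothing in it is a claim about how listeners actually achieve this resolution.

```python
def weber_jnd_msec(duration_msec, weber_fraction):
    """Smallest detectable change for a base duration under a fixed Weber fraction."""
    return duration_msec * weber_fraction


def naive_clock_requirement(resolution_msec):
    """Clock rate and period that the quoted (flawed) argument would demand.

    Treats the target resolution as the period of the fastest signal and
    doubles the corresponding frequency, as in the sampling argument.
    """
    signal_hz = 1000.0 / resolution_msec
    clock_hz = 2.0 * signal_hz
    clock_period_msec = 1000.0 / clock_hz
    return signal_hz, clock_hz, clock_period_msec


jnd = weber_jnd_msec(40.0, 0.15)  # 15% of a 40 msec tone
signal_hz, clock_hz, period = naive_clock_requirement(jnd)
print(f"JND: {jnd:.1f} msec")
print(f"signal: {signal_hz:.1f} Hz, clock: {clock_hz:.1f} Hz, period: {period:.1f} msec")
```

The numbers reproduce the figures in the quoted argument, which shows that the flaw lies in the assumption of a single fast clock rather than in the arithmetic.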

The two kinds of information, local and structural, map onto two general hypotheses about listeners' perception of time. The first hypothesis might be called the Psychoacoustic Hypothesis. It claims that local interactions account for most if not all of the variability in durational perception data. Therefore, under the psychoacoustic hypothesis, more complicated mechanisms sensitive to structure are not necessary. A second hypothesis which might be called the Structural Hypothesis holds that global aspects of signal structure such as metrical organization and (in music) melodic grouping by necessity affect a listener's perception of time. Therefore, a theory of timing and rhythm perception is incomplete without accounting for these effects.

Like most simple two-way splits, this is in many ways a false dichotomy. I'll try to bring together aspects from both areas by discussing hypotheses which do not fit neatly into either category. These alternate hypotheses, collected under the heading of Hybrid Hypotheses, hold (in differing ways) that a complete theory of timing and rhythm perception will need to take into account both the local and the structural aspects of music. Furthermore, some researchers suggest that non-temporal structure such as melodic structure in music is inextricably bound to the perception of temporal regularity and hence cannot be ignored.

Psychoacoustic Hypothesis

As was mentioned in the introduction, Creelman [1962] showed that for filled intervals people do a good job of detecting small changes in durations (down to 10% or 15% of total duration). He performed experiments using two tones separated by a fixed-duration pause. The task for the listener was to judge whether the second tone was longer than the first one. Creelman concluded that listeners only attended to local information in the signal. This comes as no surprise since two tones provide very little in the way of structural information. More importantly, Creelman provided what many consider a best just noticeable difference (JND) for durational judgment and his data is used by many who follow.

Monahan and Hirsh [1990] provided more evidence that local temporal information is sufficient to make judgments about duration. Building on earlier data from Hirsh, Monahan and Grant [1990], they presented rhythmical patterns consisting of isochronous pulse trains with two beats removed. Listeners were asked to judge whether specific beats had or had not been delayed. Here are their eight test patterns with Xs representing beats and underscores representing rests:

   x_x_xxxx, xx_x_xxx, xxx_x_xx
   x_xx_xxx, xx_xx_xx, xxx_xx_x
   x_xxx_xx, xx_xxx_x
The patterns were presented with different beats in each used as the target. Then the patterns were reversed and presented again. Monahan and Hirsh noted that, despite the fact that the reversed patterns had different metrical interpretations (using the rules from Povel [1981]) the same Weber fraction model of average interval holds. That is, at least for slow presentation of non-cycled patterns, listeners did not take advantage of global metrical structure as predicted by Povel [1981]. They worked only with local information, namely the average of the preceding and ensuing IOIs. Though Povel's model accounts for slightly more of the variance, Monahan and Hirsh preferred the average interval model due to its relative simplicity. Note, however, that if these patterns are cycled their metrical interpretations (accent assignments) change. For these cycled patterns, a model sensitive to global structure like Povel [1981] accounts for more of the variance (Povel and Essens [1985], Handel [1989]).
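One way to picture an average interval model is sketched below. This is an assumption-laden illustration, not Monahan and Hirsh's fitted model: the function name, the 200 msec beat period, and the 10% Weber fraction are all hypothetical values chosen for the example. The only idea taken from the text is that the judgment for a beat depends on the average of the preceding and ensuing IOIs.

```python
def average_interval_thresholds(pattern, beat_msec=200.0, weber_fraction=0.10):
    """Predicted delay threshold for each interior sounded beat.

    ASSUMPTION: threshold = weber_fraction * mean(preceding IOI, ensuing IOI).
    beat_msec and weber_fraction are illustrative values, not fitted ones.
    'x' marks a sounded beat and '_' a rest.
    Returns a list of (grid position, threshold in msec) pairs.
    """
    positions = [i for i, ch in enumerate(pattern) if ch == "x"]
    onsets = [i * beat_msec for i in positions]
    results = []
    # The first and last beats lack one neighbouring interval, so skip them.
    for k in range(1, len(onsets) - 1):
        preceding = onsets[k] - onsets[k - 1]
        ensuing = onsets[k + 1] - onsets[k]
        threshold = weber_fraction * (preceding + ensuing) / 2.0
        results.append((positions[k], threshold))
    return results


for position, threshold in average_interval_thresholds("x_x_xxxx"):
    print(position, round(threshold, 1))
```

Because the sketch looks only at the two intervals around each beat, it assigns the same threshold to a beat whether the pattern is played forward or reversed, provided the neighbouring intervals are the same. That is the sense in which such a model ignores metrical interpretation.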

Halpern and Darwin [1982] also provided a Weber fraction model. In their experiment, four clicks were played for a listener. The first three were isochronous (varied over several IOIs) while the fourth click ranged from 10% early to 10% late as compared to the IOI of each of the first three. Halpern and Darwin showed that discrimination of the onset of this fourth click approached 15% of the duration of a single tone, or 6 msec for single 40 msec tones, as suggested by Creelman [1962]. They noted that under minimal uncertainty, temporal position doesn't matter at all; only the local inter-onset information was exploited by listeners. However, they also noted that when the first three clicks are presented with non-uniform IOIs, the task gets much harder, showing that at the very least listeners exploit the periodicity present in the easier isochronous presentations, a kind of global structure.

Espinoza-Varas and Watson [1986] offered a similar result. They presented short ten-tone sequences (each of the ten tones was 40 msec in duration) with one tone sometimes lengthened in duration. They asked listeners to judge whether or not a target had been lengthened. They noted that under low uncertainty (subjects were given ample training and information to help them, such as the location of the target tone) listeners do such a good job at the task that it is unlikely that they are exploiting temporal or spectral positioning of the signal. As did Monahan and Hirsh, they observed that later tones were easier to judge. Furthermore, they observed that when the ten tones were of nonuniform duration (still summing to 400 msec) the task was much harder even with low uncertainty. Hence they conclude that even short (400 msec) stimuli afford rhythm tracking.

These results have been concerned with local duration, a temporal cue. Other data, some from the same experiments, support the conclusion that certain changes in local non-temporal information also affect duration judgments. Hirsh, Monahan and Grant [1982] note that changing the pitch of the delayed tone makes duration judgment harder. Espinoza-Varas and Watson [1986] note that when pitch is always changing, higher frequency tones are easier to resolve. On the other hand, other local non-temporal dimensions do not seem to matter. Creelman [1962] shows that an increase in signal voltage does not help much in durational judgments once the signal is loud enough to be reliably heard. Also, Creelman notes that non-auditory information, in his case a marker light coinciding with the onset of a tone, does not help listeners resolve duration. One can conclude that certain kinds of non-temporal local information can hinder performance, presumably by demanding attentional resources for another dimension like pitch, but that non-temporal local information rarely helps listeners resolve duration better than they would on simple same-pitched sinusoidal tones. Furthermore, it seems clear that a complete model of timing and rhythm cognition will need to be sensitive in some way to local durational effects.

Metrical Hypothesis

The Metrical Hypothesis places strong importance on relative time. Much support exists for the thesis that rhythm cognition requires sensitivity to whole number ratios between intervals in a signal. Povel [1981] shows that when people are asked to reproduce temporal patterns, they shorten or lengthen intervals to make them conform to a one-to-one or a two-to-one ratio. Martin [1972] provides anecdotal evidence that relative time is important not only in music but in speech. He notes that slips of the tongue follow only a small set of speech accent patterns. He also observes that whispered speech is made more intelligible by over-emphasizing stressed syllables.

Povel and Essens [1985] worked with isochronous repeating pulse trains that had certain elements removed (simple metrical patterns). Here is an example of one of their patterns. Note that unlike those of Monahan and Hirsh, these patterns were cycled.

 xxx_x__xx__xx___ 
 1234567890123456
Povel and Essens [1985] show that it is easier for subjects to tap with simple metrical patterns that have strong metrical interpretations. They used metrical preference rules suggested in Povel [1981] to assign metrical strength. (Though an in-depth definition of "metrical" and a discussion of Povel's metrical preference rules are outside of the scope of this paper, it is important to note that a metrical pattern is one with particular hierarchical temporal organization. Specifically, the hierarchy is organized in terms of low-order rational relationships between different levels.) Povel and Essens [1985] note as well that, for those patterns without a strong metrical interpretation, subjects would hear the patterns in terms of element grouping rather than timing structure and that they would have a very hard time perceiving duration accurately. In this way they show that for longer, cycled patterns, a theory which only looks at local information is not sufficient to account for variance.
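To make the notation concrete, the sketch below reads the example pattern as a repeating cycle and lists its inter-onset intervals in grid units. The second function is a deliberately crude proxy and not Povel's metrical preference rules: it only counts how many ticks of a hypothetical evenly spaced clock coincide with sounded beats. Both function names are made up for this illustration.

```python
def cyclic_iois(pattern):
    """Inter-onset intervals, in grid units, of a pattern that repeats.

    'x' marks an onset and '_' a rest. The last interval wraps around from
    the final onset to the first onset of the next cycle.
    """
    positions = [i for i, ch in enumerate(pattern) if ch == "x"]
    n = len(pattern)
    return [
        (positions[(k + 1) % len(positions)] - positions[k]) % n or n
        for k in range(len(positions))
    ]


def onsets_on_clock(pattern, period, phase=0):
    """Count clock ticks that land on an onset (crude proxy, an assumption).

    Returns (ticks that coincide with an onset, total ticks in one cycle).
    """
    ticks = range(phase, len(pattern), period)
    hits = sum(1 for t in ticks if pattern[t] == "x")
    return hits, len(ticks)


example = "xxx_x__xx__xx___"
print(cyclic_iois(example))
print(onsets_on_clock(example, period=4))
print(onsets_on_clock(example, period=3))
```

For this pattern a clock of period 4 starting on the first beat meets an onset at every tick, while a clock of period 3 does not. This is one informal way to see why some groupings of the same pattern feel more strongly metrical than others.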

Haken, Kelso and Bunz [1985] present data on asynchronous finger-wagging which suggests that relative time is important in simple motor control tasks. They asked subjects to wag their index fingers out of phase and measured the phase alignment as subjects sped up the wagging. They provide a model which uses only relative phase and instantaneous frequency to account for the behavior exhibited when subjects sped up and were unable to keep their fingers out of phase. Though this model achieves its phase transition through sensitivity to real world frequency (and so is not only concerned with relative time), the fact that it considers relative phase an important macroscopic parameter makes it interesting as an example of a system exploiting relative time structure in motor control.
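The phase transition can be illustrated with the commonly cited form of the relative-phase equation for this model, d(phi)/dt = -a sin(phi) - 2b sin(2 phi), where phi is the relative phase between the two fingers. The sketch below is a minimal Euler integration under stated assumptions: the parameter values are illustrative, and the link to wagging speed is represented only by the usual assumption that the ratio b/a shrinks as frequency increases. In this equation, anti-phase motion (phi = pi) is stable only while b/a exceeds 1/4.

```python
import math


def hkb_relative_phase(phi0, a, b, dt=0.001, steps=20000):
    """Integrate d(phi)/dt = -a*sin(phi) - 2*b*sin(2*phi) with Euler steps.

    phi0 is the starting relative phase in radians; a and b are the
    coupling parameters (illustrative values, not fitted to data).
    """
    phi = phi0
    for _ in range(steps):
        phi += dt * (-a * math.sin(phi) - 2.0 * b * math.sin(2.0 * phi))
    return phi


start = math.pi - 0.1  # anti-phase wagging with a small perturbation

# Slow wagging (assumed b/a > 1/4): the perturbation dies out.
slow = hkb_relative_phase(start, a=1.0, b=1.0)

# Fast wagging (assumed b/a < 1/4): anti-phase is lost, in-phase takes over.
fast = hkb_relative_phase(start, a=1.0, b=0.1)

print(f"slow: relative phase settles near {slow:.3f} rad")
print(f"fast: relative phase settles near {fast:.3f} rad")
```

The only state variable is relative phase, which is the point made above: the behavior is described in terms of a relative-time quantity, while the transition itself is driven by a parameter tied to absolute frequency.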

Jones and Boltz [1989] present data showing that duration expectancies can be explicitly manipulated by changing global structure. By changing the hierarchical structure of a simple folk melody they were able to vary the predictions made by listeners on when the song would end. Jones and Boltz show that manipulations at a relatively high level in a metrical hierarchy (the phrase level) are able to influence durational predictions.

Pitt and Samuel [1990] refine an experiment by Shields, McHugh and Martin [1974] and show that in speech, structural information aids listeners in finding a target phoneme in an auditory stream. They played sentences where the target phoneme would fall either on or off of a metrically-accented point in the speech stream. The target phoneme was embedded in a word modified so that local accent did not play a role (e.g. per-MIT the verb and PER-mit the noun were combined to form the uniformly-accented PER-MIT). Normal sentence rhythm provided enough information so that listeners would more reliably hear the target phoneme when it coincided with stress. Pitt and Samuel concluded that normal sentence rhythm is predictive, but not overly predictive of stress. They conclude that their experiments provide limited support for "the Attentional Bounce Hypothesis" (their term for sensitivity to rhythmic structure).

There are many more examples of how structural information affects judgments in the realm of rhythm. Handel [1989] and Fraisse [1982] document many reliable perceptual effects found by manipulating different structural dimensions. The accent of each k-th beat in a sequence produces a perceived lengthening of the interval preceding the accented beat. Stretching (delaying) each k-th interval in a sequence causes a perceived amplification of certain beats. Modifying the pitch of each k-th event in a sequence causes the modified elements to seem accented. Listeners take advantage of structural information to make auditory durational judgments and to focus attention in music and in speech. Structural information also plays a role in motor control tasks such as finger wagging. Taken as a whole, the data suggests that human beings are acutely sensitive to relative time structure and to global structure in general. A complete model of timing and rhythm cognition will need to exhibit similar sensitivity.

Hybrid Hypotheses

Alternate hypotheses exist which blend different aspects of the two outlined above. I'll make no attempt to paint a coherent picture of these hypotheses but rather will simply try to draw out some interesting observations that they bring to light.

Repp [1992, in press] suggests that performed music contains expressive timing information which requires both relative and absolute timing sensitivity. By lengthening or shortening different IOIs in computer-controlled performances he showed that changes are easier to hear in some structural positions than others. He claims that perceptual biases were responsible for some of this behavior and that the biases are mirrored in expressive performances of the same music. Though the nature of expressive timing is hotly contested (Repp [1990], Clynes [1990]) it is clear that small variations in inter-onset timing can make the difference between a piece of music sounding "wooden" or "mechanical" and it sounding "alive" or "realized." Nor is expressive timing arbitrary. Repp [in press] notes that expressive performance is closely tied to musical structure, particularly rhythmic grouping. If this is the case then it seems clear that a complete theory of timing and rhythm cognition needs to account for both relative-time relationships between layers of a metrical hierarchy and absolute-time local perturbations which are an important part of performed music.

The effects of tempo change on perceived rhythmic structure do not cleanly fit into either of the two hypotheses outlined above. Since tempo detection involves predictions about when the next beat will occur, it clearly requires some kind of structural sensitivity. Perhaps the tempo of a single metronome might be known through local effects only, but the tempo of a complicated piece of music is only known after (at least) one level of the metrical hierarchy is extracted. Yet relative time is not enough to account for the effects of large tempo changes. That is, a model sensitive to only the ratios between completely arbitrary levels cannot account for the fact that the same piece played at a much faster tempo will yield a different metrical interpretation. Furthermore, base tempo is not decoupled from absolute time: people have preferred tempos which are defined in terms of milliseconds, not intervals [Fraisse 1982]. These observations lead to the conclusion that a theory which accounts for tempo changes must be sensitive in one way or another to both relative and absolute time. This represents a combination of the two hypotheses outlined above.

Finally, Bamberger [1991] asked eight- and nine-year-olds to produce pictorial representations of musical rhythms they had heard. One of her main observations was that metrical interpretations and phrasal interpretations developed concurrently. (By phrasal, I mean interpretations which paid more attention to tonal groupings than to metrical groupings. A phrasal grouping might group together notes which would be otherwise separated by a metrical grouping.) Bamberger was clear in her observation that children did not start by producing "primitive" phrasal interpretations and then move on to more "sophisticated" metrical interpretations. Rather, as different children developed, they produced increasingly sophisticated drawings of both types. The important point here is that the ability to group a musical stream by melodic phrase develops concurrently with the ability to group it by absolute and/or relative time. Therefore one can argue that a new hybrid hypothesis of timing and rhythm cognition must account for this. To ignore tonal grouping in favor of metrical grouping is to work against Bamberger's developmental data.

Conclusions

A complete theory of timing and rhythm phenomena must account for both relative and absolute timing effects. This is not to say that one should embrace naive time (Port, Cummins and McAuley [1995]) by presuming a fast absolute clock in the brain. On the contrary, the relative time phenomena suggest strongly that we do not simply possess a highly accurate ticking device and that there are ways other than fast clocks to get good timing resolution. For example, one might use a network of slower oscillators which resolve fast local timing with phase coupling and complex interaction. Another potential solution lies in joining relative time software models with physical systems such as robotic arms in order to exploit the rich dynamics of mass springs.
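As a rough illustration of why a slow oscillator need not imply coarse timing, the sketch below converts a small delay into a phase difference on a much slower cycle. It does not model coupling or interaction between oscillators; the 500 msec period and the function names are hypothetical choices for the example.

```python
import math


def phase_at(event_msec, period_msec):
    """Phase (radians, from 0 up to 2*pi) of an oscillator at an event time."""
    return 2.0 * math.pi * ((event_msec % period_msec) / period_msec)


def phase_shift_degrees(delay_msec, period_msec):
    """Phase difference, in degrees, produced by delaying an event."""
    return 360.0 * delay_msec / period_msec


# A 6 msec delay measured against a hypothetical 500 msec (2 Hz) cycle.
shift = phase_shift_degrees(6.0, 500.0)
print(f"phase shift: {shift:.2f} degrees")
```

The delay is far shorter than the oscillator's period, yet it corresponds to a definite phase difference. Whether a biological system could read out such a difference reliably is a separate question that this arithmetic does not address.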
A bibliography is available
deck@indiana.edu
Last modified: Thu Apr 9 12:13:49 EST 1998