No one debates the correctness of these data. However, one important debate in the field of rhythm perception is how to reconcile data which call for local sensitivity with data which require structural sensitivity. By local sensitivity I mean two things: sensitivity to local duration and sensitivity to local attributes of an auditory event such as pitch and amplitude. By structural sensitivity I mean the ability to find global regularities such as tonal or metrical groupings in a stream of music. One can also think of the distinction between local and structural information in terms of the resolving power required by models sensitive to such information. For example, since the data on local information show that listeners can judge durational differences as small as 6 msec (15% of 40 msec), one may conclude that a model accounting for local effects must resolve certain durations down to at least 6 msec. Also, since the data on structural sensitivity show that listeners focus attention on regularly recurring points in the temporal stream in a way that is invariant across different absolute durations, one may conclude that a model sensitive to structural effects would have the power to sense periodicities in general.
Note that these conclusions do not require a commitment to how such a resolving device would be implemented. Consider the argument one could make about the 6 msec duration example: "Since a period of 6 msec equals a frequency of 166.6 Hz and since signal processing theory requires that reliable sampling be done at a minimum of twice the highest frequency being sampled--the Nyquist rate--a model needs a clock with a frequency of 333.3 Hz or a period of 3 msec in order to get the desired resolving power." This argument is flawed: a single fast clock is only one way to gain the ability to resolve durations down to 6 msec, and in fact it is probably the wrong one, given that humans can only achieve such resolution after lots of training on specific kinds of signals.
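The figures in the single-clock argument can be checked with a few lines of arithmetic. The sketch below only reproduces the numbers quoted above (15% of 40 msec, the equivalent frequency, and the doubled clock rate); the function names are illustrative, and nothing here endorses the single-clock argument itself.

```python
# Worked arithmetic for the figures quoted in the text.
# Function names are illustrative, not taken from any cited model.

WEBER_FRACTION = 0.15     # 15%, as quoted in the text
TONE_DURATION_MS = 40.0   # 40 msec tone, as quoted in the text


def smallest_judged_difference_ms(duration_ms, weber_fraction):
    """Durational difference corresponding to a given Weber fraction."""
    return duration_ms * weber_fraction


def single_clock_requirement(resolution_ms):
    """Return (signal_hz, clock_hz, clock_period_ms) for the single-clock argument.

    The argument treats the resolution as the period of a signal and then
    doubles the corresponding frequency to obtain the clock rate.
    """
    signal_hz = 1000.0 / resolution_ms
    clock_hz = 2.0 * signal_hz
    clock_period_ms = 1000.0 / clock_hz
    return signal_hz, clock_hz, clock_period_ms


resolution_ms = smallest_judged_difference_ms(TONE_DURATION_MS, WEBER_FRACTION)
signal_hz, clock_hz, clock_period_ms = single_clock_requirement(resolution_ms)
print(f"resolution: {resolution_ms:.1f} msec")
print(f"signal: {signal_hz:.1f} Hz, clock: {clock_hz:.1f} Hz, "
      f"clock period: {clock_period_ms:.1f} msec")
```

The text's 166.6 Hz and 333.3 Hz are these same quantities with the repeating decimal truncated.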
The two kinds of information, local and structural, map onto two general hypotheses about listeners' perception of time. The first hypothesis might be called the Psychoacoustic Hypothesis. It claims that local interactions account for most if not all of the variability in durational perception data. Therefore, under the Psychoacoustic Hypothesis, more complicated mechanisms sensitive to structure are not necessary. A second hypothesis, which might be called the Structural Hypothesis, holds that global aspects of signal structure such as metrical organization and (in music) melodic grouping necessarily affect a listener's perception of time. Therefore, a theory of timing and rhythm perception is incomplete without accounting for these effects.
Like most simple two-way splits, this is in many ways a false dichotomy. I'll try to bring together aspects from both areas by discussing hypotheses which do not fit neatly into either category. These alternative hypotheses, collected under the heading of Hybrid Hypotheses, hold (in differing ways) that a complete theory of timing and rhythm perception will need to take into account both the local and the structural aspects of music. Furthermore, some researchers suggest that non-temporal structure such as melodic structure in music is inextricably bound to the perception of temporal regularity and hence cannot be ignored.
Monahan and Hirsh [1990] provided more evidence that local temporal information is sufficient to make judgments about duration. Building on earlier data from Hirsh, Monahan and Grant [1990], they presented rhythmical patterns consisting of isochronous pulse trains with two beats removed. Listeners were asked to judge whether specific beats had or had not been delayed. Here are their eight test patterns, with each x representing a beat and each underscore representing a rest:
x_x_xxxx, xx_x_xxx, xxx_x_xx
x_xx_xxx, xx_xx_xx, xxx_xx_x
x_xxx_xx, xx_xxx_x
The patterns were presented with different beats in each used as the target. Then the patterns were reversed and presented again. Monahan and Hirsh noted that, despite the fact that the reversed patterns had different metrical interpretations (using the rules from Povel [1981]) the same Weber fraction model of average interval holds. That is, at least for slow presentation of non-cycled patterns, listeners did not take advantage of global metrical structure as predicted by Povel [1981]. They worked only with local information, namely the average of the preceding and ensuing IOIs. Though Povel's model accounts for slightly more of the variance, Monahan and Hirsh preferred the average interval model due to its relative simplicity. Note, however, that if these patterns are cycled their metrical interpretations (accent assignments) change. For these cycled patterns, a model sensitive to global structure like Povel [1981] accounts for more of the variance (Povel and Essens [1985], Handel [1989]).
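To make the average-interval idea concrete, the following sketch turns one of the patterns above into onset times and computes the mean of the IOIs immediately before and after a target beat. It is only an illustration under stated assumptions: the base IOI of 200 msec is a hypothetical value, the function names are invented here, and the sketch is not Monahan and Hirsh's actual fitting procedure.

```python
# A minimal sketch of a "local average interval" computation.
# BASE_IOI_MS is a hypothetical value chosen only for illustration.

BASE_IOI_MS = 200


def onset_times(pattern, base_ioi_ms):
    """Onset time of every sounded beat ('x'); '_' marks a removed beat."""
    return [i * base_ioi_ms for i, symbol in enumerate(pattern) if symbol == "x"]


def local_average_ioi(pattern, target_beat, base_ioi_ms):
    """Mean of the IOIs preceding and following the target beat.

    target_beat indexes the sounded beats (0 = first 'x'), so it must have
    a sounded neighbour on both sides.
    """
    onsets = onset_times(pattern, base_ioi_ms)
    if not 0 < target_beat < len(onsets) - 1:
        raise ValueError("target beat needs a preceding and an ensuing beat")
    preceding = onsets[target_beat] - onsets[target_beat - 1]
    ensuing = onsets[target_beat + 1] - onsets[target_beat]
    return (preceding + ensuing) / 2.0


pattern = "x_x_xxxx"
reversed_pattern = pattern[::-1]          # "xxxx_x_x"

# Third sounded beat of the original and its mirror image in the reversal.
forward_average = local_average_ioi(pattern, 2, BASE_IOI_MS)
mirrored_average = local_average_ioi(reversed_pattern, 3, BASE_IOI_MS)
print(forward_average, mirrored_average)
```

In this example the target beat and its mirror image in the reversed pattern have the same neighbouring intervals (in swapped order), so a purely local average treats them identically even though the two patterns may receive different metrical interpretations.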
Halpern and Darwin [1982] also provided a Weber fraction model. In their experiment, four clicks were played for a listener. The first three were isochronous (varied over several IOIs) while the fourth click ranged from 10% early to 10% late as compared to the IOI of each of the first three. Halpern and Darwin showed that discrimination of the onset of this fourth click approached 15% of the duration of a single tone, or 6 msec for single 40 msec tones, as suggested by Creelman [1962]. They noted that under minimal uncertainty, temporal position doesn't matter at all; only the local inter-onset information was exploited by listeners. However, they also noted that when the first three clicks are presented with non-uniform IOIs, the task gets much harder, showing that at the very least listeners exploit the periodicity present in the easier isochronous presentations, a kind of global structure.
Espinoza-Varas and Watson [1986] offered a similar result. They presented short ten-tone sequences (each of the ten tones was 40 msec in duration) with one tone sometimes lengthened in duration. They asked listeners to judge whether or not a target had been lengthened. They noted that under low uncertainty (subjects were given ample training and helpful information such as the location of the target tone) listeners do such a good job at the task that it is unlikely that they are exploiting temporal or spectral positioning of the signal. As did Monahan and Hirsh, they observed that later tones were easier to judge. Furthermore, they observed that when the ten tones were of nonuniform duration (still summing to 400 msec) the task was much harder even with low uncertainty. Hence they concluded that even short (400 msec) stimuli afford rhythm tracking.
These results have been concerned with local duration, a temporal cue. Other data, some from the same experiments, support the conclusion that certain changes in local non-temporal information also affect duration judgments. Hirsh, Monahan and Grant [1990] note that changing the pitch of the delayed tone makes duration judgment harder. Espinoza-Varas and Watson [1986] note that when pitch is always changing, higher frequency tones are easier to resolve. On the other hand, other local non-temporal dimensions do not seem to matter. Creelman [1962] shows that an increase in signal voltage does not help much in durational judgments once the signal is loud enough to be reliably heard. Also, Creelman notes that non-auditory information, in his case a marker light coinciding with the onset of a tone, does not help listeners resolve duration. One can conclude that certain kinds of non-temporal local information can hinder performance, presumably by demanding attentional resources for another dimension like pitch, but that non-temporal local information rarely helps listeners resolve duration better than they would on simple same-pitched sinusoidal tones. Furthermore, it seems clear that a complete model of timing and rhythm cognition will need to be sensitive in some way to local durational effects.
Povel and Essens [1985] worked with isochronous repeating pulse trains that had certain elements removed (simple metrical patterns). Here is an example of one of their patterns. Note that, unlike the patterns of Monahan and Hirsh, these patterns were cycled.
xxx_x__xx__xx___
1234567890123456
Povel and Essens [1985] show that it is easier for subjects to tap with simple metrical patterns that have strong metrical interpretations. They used metrical preference rules suggested in Povel [1981] to assign metrical strength. (Though an in-depth definition of "metrical" and a discussion of Povel's metrical preference rules are outside of the scope of this paper, it is important to note that a metrical pattern is one with a particular hierarchical temporal organization. Specifically, the hierarchy is organized in terms of low-order rational relationships between different levels.) Povel and Essens [1985] note as well that, for those patterns without a strong metrical interpretation, subjects would hear the patterns in terms of element grouping rather than timing structure and that they would have a very hard time perceiving duration accurately. In this way they show that for longer, cycled patterns, a theory which only looks at local information is not sufficient to account for the variance.
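The difference between a cycled pattern and a one-shot pattern can be illustrated with the example pattern above. The sketch below computes the IOIs of the cycled pattern (including the interval that wraps around to the first beat) and counts how often the ticks of a candidate periodic clock coincide with sounded beats. The coincidence count is a deliberately crude stand-in chosen for this illustration; it is not the set of preference rules from Povel [1981] or the model of Povel and Essens [1985], and the function names are invented here.

```python
# A crude illustration of periodic structure in a cycled pattern.
# The coincidence count is NOT Povel's metrical preference rules;
# it is a simplified stand-in used only to show the idea.

PATTERN = "xxx_x__xx__xx___"   # example pattern from the text, 16 positions


def cyclic_iois(pattern):
    """IOIs in grid units, including the wrap-around interval of the cycle."""
    onsets = [i for i, symbol in enumerate(pattern) if symbol == "x"]
    n = len(pattern)
    return [(onsets[(k + 1) % len(onsets)] - onsets[k]) % n or n
            for k in range(len(onsets))]


def clock_hits(pattern, period, phase):
    """Number of clock ticks (phase, phase + period, ...) that land on a beat."""
    if len(pattern) % period != 0:
        raise ValueError("period must divide the cycle length")
    return sum(1 for tick in range(phase, len(pattern), period)
               if pattern[tick] == "x")


def best_phase(pattern, period):
    """Phase with the most coincidences; the lowest phase wins ties."""
    scores = {phase: clock_hits(pattern, period, phase)
              for phase in range(period)}
    best = max(scores, key=scores.get)
    return best, scores[best]


print(cyclic_iois(PATTERN))
print(best_phase(PATTERN, 4))
```

For this pattern a clock with a period of four grid units, starting on the first beat, coincides with a sounded beat on every tick, whereas the other phases of the same clock mostly land on rests. A purely local model has no access to this kind of regularity.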
Haken, Kelso and Bunz [1985] present data on asynchronous finger-wagging which suggest that relative time is important in simple motor control tasks. They asked subjects to wag their index fingers out of phase and measured the phase alignment as subjects sped up the wagging. They provide a model which uses only relative phase and instantaneous frequency to account for the behavior exhibited when subjects sped up and were unable to keep their fingers out of phase. Though this model achieves its phase transition through sensitivity to real-world frequency (and so is not only concerned with relative time), the fact that it considers relative phase an important macroscopic parameter makes it interesting as an example of a system exploiting relative time structure in motor control.
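The phase transition can be sketched numerically. The code below uses the commonly cited form of the relative-phase equation associated with this model, dphi/dt = -a sin(phi) - 2b sin(2 phi), integrated with a simple Euler step. The equation is not given in the text above, and the parameter values, step size, and function names are assumptions chosen only for illustration; in accounts of the model the ratio b/a is usually described as falling as movement frequency rises.

```python
import math

# A minimal sketch of relative-phase dynamics of the form
#   dphi/dt = -a*sin(phi) - 2*b*sin(2*phi)
# Parameter values and step size are assumptions for illustration only.


def phase_rate(phi, a, b):
    """Rate of change of relative phase phi (radians)."""
    return -a * math.sin(phi) - 2.0 * b * math.sin(2.0 * phi)


def settle_relative_phase(phi0, a, b, dt=0.01, steps=5000):
    """Euler-integrate the relative phase from phi0 and return its final value."""
    phi = phi0
    for _ in range(steps):
        phi += dt * phase_rate(phi, a, b)
    return phi


start = math.pi + 0.1   # slightly perturbed out-of-phase movement

# Large b/a (assumed here to stand for slow movement): out of phase is stable.
slow_final = settle_relative_phase(start, a=1.0, b=1.0)

# Small b/a (assumed here to stand for fast movement): only in phase is stable.
fast_final = settle_relative_phase(start, a=1.0, b=0.1)

print(round(math.cos(slow_final), 3), round(math.cos(fast_final), 3))
```

Linearizing around phi = pi gives a growth rate of a - 4b, so in this form the out-of-phase pattern is stable only while b/a exceeds 1/4. Below that ratio a small perturbation carries the system to the in-phase pattern, which is the qualitative behavior described above.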
Jones and Boltz [1989] present data showing that duration expectancies can be explicitly manipulated by changing global structure. By changing the hierarchical structure of a simple folk melody they were able to vary the predictions made by listeners on when the song would end. Jones and Boltz show that manipulations at a relatively high level in a metrical hierarchy (the phrase level) are able to influence durational predictions.
Pitt and Samuel [1990] refine an experiment by Shields, McHugh and Martin [1974] and show that in speech, structural information aids listeners in finding a target phoneme in an auditory stream. They played sentences where the target phoneme would fall either on or off a metrically accented point in the speech stream. The target phoneme was embedded in a word modified so that local accent did not play a role (e.g. per-MIT the verb and PER-mit the noun were combined to form the uniformly accented PER-MIT). Normal sentence rhythm provided enough information so that listeners would more reliably hear the target phoneme when it coincided with stress. Pitt and Samuel concluded that normal sentence rhythm is predictive, but not overly predictive, of stress. They also concluded that their experiments provide limited support for "the Attentional Bounce Hypothesis" (their term for sensitivity to rhythmic structure).
There are many more examples of how structural information affects judgments in the realm of rhythm. Handel [1989] and Fraisse [1982] document many reliable perceptual effects found by manipulating different structural dimensions. Accenting every k-th beat in a sequence produces a perceived lengthening of the interval preceding the accented beat. Stretching (delaying) every k-th interval in a sequence causes a perceived amplification of certain beats. Modifying the pitch of every k-th event in a sequence causes the modified elements to seem accented. Listeners take advantage of structural information to make auditory durational judgments and to focus attention in music and in speech. Structural information also plays a role in motor control tasks such as finger wagging. Taken as a whole, the data suggest that human beings are acutely sensitive to relative time structure and to global structure in general. A complete model of timing and rhythm cognition will need to exhibit similar sensitivity.
Repp [1992, in press] suggests that performed music contains expressive timing information which requires both relative and absolute timing sensitivity. By lengthening or shortening different IOIs in computer-controlled performances he showed that changes are easier to hear in some structural positions than in others. He claims that perceptual biases were responsible for some of this behavior and that the biases are mirrored in expressive performances of the same music. Though the nature of expressive timing is hotly contested (Repp [1990], Clynes [1990]), it is clear that small variations in inter-onset timing can make the difference between a piece of music sounding "wooden" or "mechanical" and it sounding "alive" or "realized." Nor is expressive timing arbitrary. Repp [in press] notes that expressive performance is closely tied to musical structure, particularly rhythmic grouping. If this is the case, then it seems clear that a complete theory of timing and rhythm cognition needs to account for both the relative-time relationships between layers of a metrical hierarchy and the local absolute-time perturbations which are an important part of performed music.
The effects of tempo change on perceived rhythmic structure do not cleanly fit into either of the two hypotheses outlined above. Since tempo detection involves predictions about when the next beat will occur, it clearly requires some kind of structural sensitivity. Perhaps the tempo of a single metronome might be known through local effects only, but the tempo of a complicated piece of music is only known after (at least) one level of the metrical hierarchy is extracted. Yet relative time is not enough to account for the effects of large tempo changes. That is, a model sensitive to only the ratios between completely arbitrary levels cannot account for the fact that the same piece played at a much faster tempo will yield a different metrical interpretation. Furthermore, base tempo is not decoupled from absolute time: people have preferred tempos which are defined in terms of milliseconds, not intervals [Fraisse 1982]. These observations lead to the conclusion that a theory which accounts for tempo changes must be sensitive in one way or another to both relative and absolute time. This represents a combination of the two hypotheses outlined above.
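A small numerical sketch shows why ratios alone cannot capture the effect of a large tempo change. Speeding up a sequence leaves every IOI ratio untouched, yet the interval closest to a fixed preferred beat period moves to a different level of the hierarchy. The IOI values, the speed-up factor, and the preferred period of 600 msec are hypothetical numbers chosen for illustration, and the function names are invented here.

```python
from fractions import Fraction

# Hypothetical values for illustration only.
PREFERRED_PERIOD_MS = 600            # assumed preferred beat period
SLOW_IOIS_MS = [600, 300, 300, 1200]  # assumed IOIs of a short sequence


def scale_tempo(iois_ms, speedup):
    """IOIs after playing the same sequence `speedup` times faster."""
    return [Fraction(ioi) / Fraction(speedup) for ioi in iois_ms]


def relative_structure(iois_ms):
    """Each IOI expressed as a ratio to the shortest IOI."""
    shortest = min(iois_ms)
    return [Fraction(ioi) / Fraction(shortest) for ioi in iois_ms]


def level_nearest_preferred(iois_ms, preferred_ms):
    """The distinct IOI closest in absolute time to the preferred period."""
    return min(sorted(set(iois_ms)), key=lambda ioi: abs(ioi - preferred_ms))


fast_iois_ms = scale_tempo(SLOW_IOIS_MS, 4)

same_ratios = relative_structure(SLOW_IOIS_MS) == relative_structure(fast_iois_ms)
slow_level = level_nearest_preferred(SLOW_IOIS_MS, PREFERRED_PERIOD_MS)
fast_level = level_nearest_preferred(fast_iois_ms, PREFERRED_PERIOD_MS)

print(same_ratios)
print(slow_level / min(SLOW_IOIS_MS), fast_level / min(fast_iois_ms))
```

With these numbers the ratios are identical at both tempos, but the interval nearest the preferred period is twice the shortest IOI at the slow tempo and four times the shortest IOI at the fast tempo. A model that saw only ratios could not distinguish the two cases.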
Finally, Bamberger [1991] asked eight- and nine-year-olds to produce pictorial representations of musical rhythms they had heard. One of her main observations was that metrical interpretations and phrasal interpretations developed concurrently. (By phrasal, I mean interpretations which paid more attention to tonal groupings than to metrical groupings. A phrasal grouping might group together notes which would be otherwise separated by a metrical grouping.) Bamberger was clear in her observation that children did not start by producing "primitive" phrasal interpretations and then move on to more "sophisticated" metrical interpretations. Rather, as different children developed, they produced increasingly sophisticated drawings of both types. The important point here is that the ability to group a musical stream by melodic phrase develops concurrently with the ability to group it by absolute and/or relative time. Therefore one can argue that a new hybrid hypothesis of timing and rhythm cognition must account for this. To ignore tonal grouping in favor of metrical grouping is to work against Bamberger's developmental data.