Audio time stretching and pitch scaling

{{Short description|Changing the speed or duration of an audio signal without affecting its pitch}}
{{Redirect|Timestretch|the album|Timestretch (album){{!}}''Timestretch'' (album)}}

'''Time stretching''' is the process of changing the speed or duration of an [[audio signal processing|audio signal]] without affecting its [[pitch (music)|pitch]]. '''Pitch scaling''' is the opposite: the process of changing the pitch without affecting the speed. [[Pitch shift]] is pitch scaling implemented in an [[effects unit]] and intended for live performance. [[Pitch control]] is a simpler process which affects pitch and speed simultaneously by slowing down or speeding up a recording. 

These processes are often used to match the pitches and tempos of two pre-recorded clips for mixing when the clips cannot be reperformed or resampled. Time stretching is often used to adjust [[radio commercial]]s<ref>{{cite magazine |url=http://www.tvtechnology.com/features/audio_notes/f_audionotes.shtml |archive-url=https://web.archive.org/web/20080527184101/http://www.tvtechnology.com/features/audio_notes/f_audionotes.shtml |archive-date=2008-05-27 |title=Dolby, The Chipmunks And NAB2004}}</ref> and the audio of [[television advertisement]]s<ref>{{Cite web|url=https://www.atarimagazines.com/creative/v9n7/122_Variable_speech.php|title=Variable speech.|website=www.atarimagazines.com}}</ref> to fit exactly into the 30 or 60 seconds available. It can be used to conform longer material to a designated time slot, such as a 1-hour broadcast. 

==Resampling==
The simplest way to change the duration or pitch of an audio recording is to change the playback speed. For a [[digital audio]] recording, this can be accomplished through [[sample rate conversion]]. With this method, the frequencies in the recording are always scaled by the same ratio as the speed, transposing the perceived pitch up or down in the process. Slowing down the recording to increase its duration lowers the pitch, while speeding it up to shorten its duration raises the pitch, creating the so-called [[Alvin and the Chipmunks#Recording technique|Chipmunk effect]]. When resampling audio to a notably lower pitch, a source recorded at a higher sample rate may be preferable, because slowing down playback reproduces the signal at a lower resolution and therefore reduces the perceived clarity of the sound. Conversely, when resampling audio to a notably higher pitch, it may be preferable to incorporate an interpolation (low-pass) filter, because frequencies that exceed the [[Nyquist frequency]] (determined by the sampling rate of the audio reproduction software or device) usually create undesired distortions, a phenomenon known as [[Sampling_(signal_processing)#Practical_considerations|aliasing]].
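
The coupling of speed and pitch under resampling can be illustrated with a minimal sketch. The helper names (<code>resample_linear</code>, <code>dominant_frequency</code>), the 8 kHz sample rate and the 440 Hz test tone are assumptions chosen for illustration. Linear interpolation is used only for brevity and applies no anti-aliasing filter.

```python
import numpy as np

def resample_linear(signal, speed):
    """Play `signal` back `speed` times faster using linear interpolation."""
    n_out = int(round(len(signal) / speed))
    # Fractional read positions in the original signal
    positions = np.arange(n_out) * speed
    return np.interp(positions, np.arange(len(signal)), signal)

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the largest spectral peak."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

sample_rate = 8000                         # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate   # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)       # 440 Hz test tone

fast = resample_linear(tone, 2.0)   # half the duration, one octave up
slow = resample_linear(tone, 0.5)   # twice the duration, one octave down

print(len(fast), dominant_frequency(fast, sample_rate))
print(len(slow), dominant_frequency(slow, sample_rate))
```

Doubling the speed halves the number of samples and moves the spectral peak to about 880 Hz; halving the speed doubles the length and moves the peak to about 220 Hz.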

== Frequency domain ==

=== Phase vocoder ===
{{Main|Phase vocoder}}
One way of stretching the length of a signal without affecting the pitch is to build a [[phase vocoder]] after Flanagan, Golden, and Portnoff.

Basic steps:
#compute the instantaneous frequency/amplitude relationship of the signal using the [[Short-time Fourier transform|STFT]], which is the [[discrete Fourier transform]] of a short, overlapping and smoothly windowed block of samples;
#apply some processing to the Fourier transform magnitudes and phases (like resampling the FFT blocks); and
#perform an inverse STFT by taking the inverse Fourier transform on each chunk and adding the resulting waveform chunks, also called overlap and add (OLA).<ref>{{cite journal
 | author = Jont B. Allen
 |date=June 1977
 | title = Short Time Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform
 | journal = IEEE Transactions on Acoustics, Speech, and Signal Processing
 | volume = ASSP-25
 | number = 3
 | pages = 235–238
}}</ref>
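
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name <code>phase_vocoder_stretch</code>, the Hann window, the frame size of 1024 and the synthesis hop of 256 are all choices made for this example, and no transient handling or phase locking is attempted.

```python
import numpy as np

def phase_vocoder_stretch(x, alpha, n_fft=1024, hop_s=256):
    """Time-stretch x by roughly `alpha` (alpha > 1 lengthens the signal)."""
    hop_a = max(1, int(round(hop_s / alpha)))   # analysis hop in samples
    window = np.hanning(n_fft)
    # Phase advance of each bin's centre frequency over one analysis hop
    omega = 2.0 * np.pi * np.arange(n_fft // 2 + 1) * hop_a / n_fft
    starts = range(0, len(x) - n_fft, hop_a)
    out = np.zeros((len(starts) - 1) * hop_s + n_fft)
    norm = np.zeros_like(out)
    prev_phase = synth_phase = None
    for i, start in enumerate(starts):
        # Step 1: one STFT frame (windowed block, then DFT)
        spectrum = np.fft.rfft(window * x[start:start + n_fft])
        mag, phase = np.abs(spectrum), np.angle(spectrum)
        # Step 2: rescale the measured phase advance to the synthesis hop
        if i == 0:
            synth_phase = phase.copy()
        else:
            delta = phase - prev_phase - omega
            delta -= 2.0 * np.pi * np.round(delta / (2.0 * np.pi))  # wrap to [-pi, pi]
            synth_phase += (omega + delta) * (hop_s / hop_a)
        prev_phase = phase
        # Step 3: inverse DFT and overlap-add (OLA)
        frame = np.fft.irfft(mag * np.exp(1j * synth_phase), n=n_fft) * window
        pos = i * hop_s
        out[pos:pos + n_fft] += frame
        norm[pos:pos + n_fft] += window ** 2
    # Compensate for the summed squared windows
    return out / np.maximum(norm, 1e-8)

sample_rate = 16000                        # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)       # steady 440 Hz test tone
stretched = phase_vocoder_stretch(tone, alpha=1.5)
print(len(tone), len(stretched))
```

Because the analysis hop is rounded to a whole number of samples, the achieved stretch factor is only approximately <code>alpha</code>. For a steady sinusoid the output is longer while its frequency stays close to the original.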

The phase vocoder handles [[Sine wave|sinusoid]] components well, but early implementations introduced considerable smearing on [[transient (acoustics)|transient]] ("beat") waveforms at all non-integer compression/expansion rates, which renders the results phasey and diffuse. Recent improvements allow better quality results at all compression/expansion ratios but a residual smearing effect still remains.

The phase vocoder technique can also be used to perform pitch shifting, chorusing, timbre manipulation, harmonizing, and other unusual modifications, all of which can be changed as a function of time.

[[Image:Sinusoidal Analysis & Synthesis (McAulay-Quatieri 1988).svg|thumb|300px|Sinusoidal analysis/synthesis system (based on {{harvnb|McAulay|Quatieri|1988|p=161}})<ref>{{citation
 |last1        = McAulay
 |first1       = R. J.
 |last2        = Quatieri
 |first2       = T. F.
 |author-link2  = Thomas F. Quatieri
 |title        = Speech Processing Based on a Sinusoidal Model
 |url          = http://www.ll.mit.edu/publications/journal/pdf/vol01_no2/1.2.3.speechprocessing.pdf
 |journal      = The Lincoln Laboratory Journal
 |volume       = 1
 |issue        = 2
 |date         = 1988
 |pages        = 153&ndash;167
 |access-date  = 2014-09-07
 |archive-url  = https://web.archive.org/web/20120521071601/http://www.ll.mit.edu/publications/journal/pdf/vol01_no2/1.2.3.speechprocessing.pdf
 |archive-date = 2012-05-21
 |url-status     = dead
}}</ref>]]

=== Sinusoidal spectral modeling ===
{{See also|Spectral modeling synthesis}}

Another method for time stretching relies on a [[Spectral modelling synthesis|spectral model]] of the signal.  In this method, peaks are identified in frames using the [[Short-time Fourier transform|STFT]] of the signal, and sinusoidal "tracks" are created by connecting peaks in adjacent frames.  The tracks are then re-synthesized at a new time scale.  This method can yield good results on both polyphonic and percussive material, especially when the signal is separated into sub-bands.  However, this method is more computationally demanding than other methods.{{citation needed|date=December 2012}}

[[File:MonophonicSoundCylinderModel.svg|thumb|200px|right|Modelling a monophonic sound as observation along a helix of a function with a cylinder domain]]

== Time domain ==

=== SOLA ===
{{See also|PSOLA}}

[[Rabiner]] and Schafer in 1978 put forth an alternate solution that works in the [[time domain]]: attempt to find the [[periodic signal|period]] (or equivalently the [[fundamental frequency]]) of a given section of the wave using some [[pitch detection algorithm]] (commonly the peak of the signal's [[autocorrelation]], or sometimes [[cepstrum|cepstral]] processing), and [[fade (audio engineering)|crossfade]] one period into another.
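
The period-finding step can be sketched with a plain autocorrelation peak search. The function name <code>estimate_period</code>, the lag limits and the synthetic test tone are assumptions for illustration; practical pitch detectors add normalisation and safeguards against octave errors.

```python
import numpy as np

def estimate_period(frame, min_lag, max_lag):
    """Return the lag (in samples) at which the autocorrelation peaks."""
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep only the non-negative lags
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search the allowed lag range for the strongest peak
    return int(min_lag + np.argmax(acf[min_lag:max_lag + 1]))

sample_rate = 8000                      # assumed sample rate in Hz
t = np.arange(2048) / sample_rate
# Harmonic test tone with a 200 Hz fundamental (true period: 40 samples)
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)

period = estimate_period(frame, min_lag=20, max_lag=400)
print(period, sample_rate / period)     # estimated period and fundamental
```

Once the period is known, a crossfade that spans a whole number of periods keeps the waveform continuous when a period is repeated or removed.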

This is called [[time-domain harmonic scaling]]<ref>{{cite journal
 | author = David Malah
 |date=April 1979
 | title = Time-domain algorithms for harmonic bandwidth reduction and time scaling of speech signals
 | journal = IEEE Transactions on Acoustics, Speech, and Signal Processing
 | volume = ASSP-27
 | issue = 2
 | pages = 121–133
}}</ref> or the synchronized overlap-add method (SOLA). It performs somewhat faster than the phase vocoder on slower machines, but fails when the autocorrelation mis-estimates the period of a signal with complicated harmonics (such as [[orchestra]]l pieces).

[[Adobe Audition]] (formerly Cool Edit Pro) seems to solve this by looking for the period closest to a center period that the user specifies, which should be an integer multiple of the tempo, and between 30 [[hertz|Hz]] and the lowest bass frequency.

This approach is much more limited in scope than phase-vocoder-based processing, but it can be made much less processor-intensive, which suits real-time applications. It provides the most coherent results{{Citation needed|date=June 2017}} for single-pitched sounds such as voice or musically monophonic instrument recordings.

High-end commercial audio processing packages either combine the two techniques (for example by separating the signal into sinusoid and transient waveforms), or use other techniques based on the [[wavelet]] transform, or artificial neural network processing{{Citation needed|date=June 2016}}, producing the highest-quality time stretching.

=== Frame-based approach ===
[[File:GeneralizedPrinciple TSM.png|thumb|450px|Frame-based approach of many TSM procedures]]

In order to preserve an audio signal's pitch when stretching or compressing its duration, many time-scale modification (TSM) procedures follow a frame-based approach.<ref>{{cite journal
 | author = Jonathan Driedger and Meinard Müller
 | title = A Review of Time-Scale Modification of Music Signals
 | journal = Applied Sciences
 | volume = 6
 | issue = 2
 | pages = 57
 | year = 2016
 | doi=10.3390/app6020057
| doi-access = free
 }}</ref>
Given an original discrete-time audio signal, this strategy's first step is to split the signal 
into short ''analysis frames'' of fixed length.
The analysis frames are spaced by a fixed number of samples, called the ''analysis hopsize'' <math>H_a\in\mathbb{N}</math>.
To achieve the actual time-scale modification, the analysis frames are then temporally relocated
to have a ''synthesis hopsize'' <math>H_s\in\mathbb{N}</math>.
This frame relocation results in a modification of the signal's duration by a ''stretching factor'' of
<math>\alpha=H_s/H_a</math>.
However, simply superimposing the unmodified analysis frames typically results in undesired artifacts
such as phase discontinuities or amplitude fluctuations.
To prevent these kinds of artifacts, the analysis frames are adapted to form ''synthesis frames'', prior to
the reconstruction of the time-scale modified output signal.

The strategy of how to derive the synthesis frames from the analysis frames is a key difference among
different TSM procedures.
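
The frame relocation itself can be sketched as plain overlap-add (OLA), in which the synthesis frames are simply the windowed analysis frames. The function name <code>ola_stretch</code>, the Hann window and the frame length are assumptions for this example. Because the frames are not adapted, the phase and amplitude artifacts described above are to be expected on most material.

```python
import numpy as np

def ola_stretch(x, hop_a, hop_s, frame_len=1024):
    """Relocate frames from analysis hop `hop_a` to synthesis hop `hop_s`.

    The stretching factor is alpha = hop_s / hop_a. `hop_s` should be
    smaller than `frame_len` so that the relocated frames still overlap.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop_a
    out = np.zeros((n_frames - 1) * hop_s + frame_len)
    norm = np.zeros_like(out)
    for m in range(n_frames):
        # Analysis frame m starts at m * hop_a in the input ...
        frame = x[m * hop_a : m * hop_a + frame_len] * window
        # ... and is placed at m * hop_s in the output
        out[m * hop_s : m * hop_s + frame_len] += frame
        norm[m * hop_s : m * hop_s + frame_len] += window
    # Divide by the summed windows to even out the amplitude
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                # arbitrary test signal
stretched = ola_stretch(x, hop_a=256, hop_s=512)   # alpha = 2
print(len(x), len(stretched))
```

The output is close to <code>alpha</code> times the input length; the small difference comes from the input samples left over after the last complete frame.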

== Speed hearing and speed talking ==
For the specific case of speech, time stretching can be performed using [[PSOLA]].

[[Time-compressed speech]] is the representation of verbal text in compressed time. While one might expect speeding up to reduce comprehension, Herb Friedman says that "Experiments have shown that the brain works most efficiently if the information rate through the ears—via speech—is the 'average' reading rate, which is about 200–300 wpm (words per minute), yet the average rate of speech is in the neighborhood of 100–150 wpm."<ref>[http://www.atarimagazines.com/creative/v9n7/122_Variable_speech.php Variable Speech], Creative Computing Vol. 9, No. 7 / July 1983 / p. 122</ref>

Listening to time-compressed speech is seen as the equivalent of [[speed reading]].{{By whom|date=April 2023}}<ref>{{Cite web|url=http://www.nevsblog.com/2006/06/23/listen-to-podcasts-in-half-the-time/|title=Listen to podcasts in half the time|access-date=2008-07-24|archive-date=2011-08-29|archive-url=https://web.archive.org/web/20110829034255/http://www.nevsblog.com/2006/06/23/listen-to-podcasts-in-half-the-time|url-status=dead}}</ref><ref>{{cite web |title=Speeding iPods |url=http://cid.lib.byu.edu/?p=128 |archive-url=https://web.archive.org/web/20060902102443/http://cid.lib.byu.edu/?p=128 |archive-date=2006-09-02}}</ref>

== Pitch scaling ==
{{multiple image |align=right |direction=vertical |width=220
 | image1   = H7600 Harmonizer Effects Processor by Eventide.tif
 | caption1 = Pitch shifting (frequency scaling) is provided by the [[Eventide, Inc|Eventide]] [[Harmonizer]]
 | image2   = BodeFrequencyShifter.jpg
 | caption2 = Frequency shifting, as provided by the [[Harald Bode|Bode]] Frequency Shifter, ''does not'' preserve frequency ratios or harmony.
}}

These techniques can also be used to [[transposition (music)|transpose]] an audio sample while holding speed or duration constant.  This may be accomplished by time stretching and then resampling back to the original length.  Alternatively, the frequency of the sinusoids in a [[sinusoidal model]] may be altered directly, and the signal reconstructed at the appropriate time scale.

Transposing can be called ''[[frequency]] scaling'' or ''[[pitch shift]]ing'', depending on perspective.

For example, one could move the pitch of every note up by a perfect fifth, keeping the tempo the same.
One can view this transposition as "pitch shifting", "shifting" each note up 7 keys on a piano keyboard, or adding a fixed amount on the [[Mel scale]], or adding a fixed amount in linear [[pitch space]].
One can view the same transposition as "frequency scaling", "scaling" (multiplying) the frequency of every note by 3/2.
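
The two views are related by a logarithm. The sketch below assumes [[12 equal temperament|twelve-tone equal temperament]], in which one semitone corresponds to a frequency ratio of 2<sup>1/12</sup>; the function names are chosen for this example.

```python
import math

def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)

def ratio_to_semitones(ratio):
    """Shift in semitones that corresponds to a frequency ratio."""
    return 12.0 * math.log2(ratio)

fifth_equal_tempered = semitones_to_ratio(7)
fifth_just = 3 / 2

print(round(fifth_equal_tempered, 4))            # 1.4983
print(round(ratio_to_semitones(fifth_just), 3))  # 7.02
```

Under this assumption, seven keys on the keyboard give a ratio very close to, but not exactly, 3/2.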

Musical transposition preserves the ratios of the [[harmonic]] frequencies that determine the sound's [[timbre]], unlike the ''frequency shift'' performed by [[amplitude modulation]], which adds a fixed frequency offset to the frequency of every note. (In theory one could perform a literal ''pitch scaling'' in which the musical pitch space location is scaled [a higher note would be shifted at a greater interval in linear pitch space than a lower note], but that is highly unusual, and not musical.{{Citation needed|date=November 2012}})
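
The difference between scaling and shifting can be checked numerically on a small set of harmonics. The frequencies and the 110 Hz offset below are arbitrary values chosen for illustration.

```python
def ratios(freqs):
    """Express each frequency as a multiple of the first (the fundamental)."""
    return [f / freqs[0] for f in freqs]

harmonics = [220.0, 440.0, 660.0]           # fundamental and two harmonics

scaled = [f * 1.5 for f in harmonics]       # frequency scaling (transposition)
shifted = [f + 110.0 for f in harmonics]    # frequency shifting (fixed offset)

print(ratios(harmonics))  # [1.0, 2.0, 3.0]
print(ratios(scaled))     # [1.0, 2.0, 3.0]
print(ratios(shifted))    # no longer whole-number multiples
```

After scaling, the partials remain whole-number multiples of the new fundamental; after shifting, they do not, which changes the timbre.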

Time domain processing works much better here, as smearing is less noticeable, but scaling vocal samples distorts the [[formant]]s into a sort of [[Alvin and the Chipmunks]]-like effect, which may be desirable or undesirable. A process that preserves the formants and character of a voice involves analyzing the signal with a [[vocoder|channel vocoder]] or [[Linear predictive coding|LPC]] vocoder plus any of several [[pitch detection algorithm]]s and then resynthesizing it at a different fundamental frequency.

A detailed description of older analog recording techniques for pitch shifting can be found at {{slink|Alvin and the Chipmunks|Recording technique}}.

== DJing ==
Time stretching and pitch scaling are used extensively by [[Disc jockey|DJs]] in addition to [[beatmixing]] when playing and creating [[DJ mix|sets]]. In order to seamlessly blend two tracks together, the tempo of one track can be adjusted to match the other so that the beats line up. Pitch scaling is commonly used to retain the original pitch of a track whose tempo has been changed. Pitch scaling is also used by DJs for [[harmonic mixing]], to transform tracks into compatible keys so that they sound pleasing when mixed together. Time stretching and pitch scaling are included in modern DJ hardware ([[CDJ|CDJs]] and [[DJ controller|DJ controllers]]) and software (such as [[VirtualDJ]], [[Serato]], and Rekordbox).
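
The arithmetic behind tempo matching can be sketched as follows. The tempos are hypothetical values, and the function name <code>tempo_match</code> is chosen for this example.

```python
import math

def tempo_match(track_bpm, target_bpm):
    """Return the speed factor needed to reach `target_bpm`, and the pitch
    change in semitones that would result if the speed were changed by
    resampling alone (that is, without pitch correction)."""
    speed = target_bpm / track_bpm
    pitch_change = 12.0 * math.log2(speed)
    return speed, pitch_change

# Hypothetical example: bring a 126 BPM track up to 128 BPM
speed, semitones = tempo_match(126.0, 128.0)
print(round(speed, 4), round(semitones, 2))
```

In this example the track must play about 1.6% faster. Without pitch correction it would rise by roughly a quarter of a semitone, which time stretching avoids.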

== Music production ==
Time stretching and pitch scaling are used in [[digital audio workstation]] software for working with [[Loop (music)|music loops]], sound clips which can be repeated and transposed to form a song. The pitch and tempo of multiple loops are aligned to create tracks. Notable software includes [[Acid Pro]] with its "Acidized" loops feature and [[FL Studio]].

== In consumer software ==

Pitch-corrected audio time stretching is found in every modern [[web browser]] as part of the [[HTML]] standard for media playback.<ref>{{cite web |title=HTMLMediaElement.playbackRate - Web APIs |url=https://developer.mozilla.org/en-US/docs/Web/API/HTMLMediaElement/playbackRate |website=MDN |access-date=1 September 2021}}</ref> Similar controls are ubiquitous in media applications and frameworks such as [[GStreamer]] and [[Unity (game engine)|Unity]].

==See also==
*[[Beatmatching]]
*[[Dynamic tonality]] — real-time changes of [[Dynamic tuning|tuning]] and [[Dynamic timbres|timbre]]
*[[Pitch correction]]
*[[Scrubbing (audio)]]
*[[Nightcore]]

== References ==
<references />

== External links ==
*[http://blogs.zynaptiq.com/bernsee/time-pitch-overview/ Time Stretching and Pitch Shifting Overview] A comprehensive overview of current time and pitch modification techniques by Stephan Bernsee
*[http://blogs.zynaptiq.com/bernsee/pitch-shifting-using-the-ft/ Stephan Bernsee's smbPitchShift C source code] C source code for doing frequency domain pitch manipulation
*[https://github.com/janesconference/KievII/blob/master/dsp/pitchshift.js pitchshift.js from KievII] A Javascript pitchshifter based on smbPitchShift code, from the open source [https://github.com/janesconference/KievII KievII library]
*[http://www.panix.com/~jens/pvoc-dolson.par The Phase Vocoder: A Tutorial] - A good description of the phase vocoder
*[http://www.ee.columbia.edu/~dpwe/papers/LaroD99-pvoc.pdf New Phase-Vocoder Techniques for Pitch-Shifting, Harmonizing and Other Exotic Effects]
*[https://web.archive.org/web/20040617224423/http://www.ircam.fr/equipes/analyse-synthese/roebel/paper/dafx2003.pdf A new Approach to Transient Processing in the Phase Vocoder]
*[https://web.archive.org/web/20061223184759/http://keizai.yokkaichi-u.ac.jp/~ikeda/research/picola.html PICOLA and TDHS]
*[http://www.guitarpitchshifter.com How to build a pitch shifter] Theory, equations, figures and performances of a real-time guitar pitch shifter running on a DSP chip
*[http://www.zynaptiq.com/ztx/ ZTX Time Stretching Library] Free and commercial versions of a popular 3rd party time stretching library for iOS, Linux, Windows and Mac OS X
*[https://licensing.zplane.de/technology#elastique Elastique by zplane] commercial cross-platform library, mainly used by DJ and DAW manufacturers
*[http://www.qneomusic.com Voice Synth] from Qneo - specialized synthesizer for creative voice sculpting
*[https://www.audiolabs-erlangen.de/resources/MIR/TSMtoolbox/ TSM toolbox] Free MATLAB implementations of various Time-Scale Modification procedures
*{{web archive |url=https://web.archive.org/web/20230202074135/http://www.paulnasca.com/open-source-projects#TOC-Paul-s-Extreme-Sound-Stretch |title=PaulStretch}}, a well-known algorithm for extreme (>10&times;) time stretching
*[https://bungee.parabolaresearch.com Bungee] open source and commercial libraries for real time audio stretching
*[https://breakfastquay.com/rubberband/ Rubber Band] — open source library for time stretching and pitch shifting
*[https://www.surina.net/soundtouch/ SoundTouch] — open-source library for changing the tempo, pitch and playback rate

{{Music production}}

{{DEFAULTSORT:Audio time-scale pitch modification}}
[[Category:Audio engineering]]
[[Category:Digital signal processing]]
[[Category:Sound effects]]