== Technical derivation ==
{{multiple image |align=right |image1=FreqHRTF.jpg |width1=260 |caption1=A sample of the frequency response of the two ears: * ''<span style="color:#080;">green curve</span>'': left ear ''X''<sub>L</sub>(''f'') * ''<span style="color:#008;">blue curve</span>'': right ear ''X''<sub>R</sub>(''f'') for a sound source from the upper front (0° [[azimuth]], 15° [[elevation]]); the horizontal axis is frequency [×10<sup>4</sup> Hz], the vertical axis is magnitude of the frequency response [dB]}}
{{multiple image |align=right |image2=HRTFazimuth.png |width2=260 |caption2=An example of how the HRTF varies with azimuth relative to a reference point}}

Linear systems analysis defines the [[transfer function]] as the complex ratio between the output signal spectrum and the input signal spectrum as a function of frequency. Blauert (1974; cited in Blauert, 1981) initially defined the transfer function as the free-field transfer function (FFTF). Other terms include the free-field to eardrum transfer function and the pressure transformation from the free field to the eardrum. Less specific descriptions include the pinna transfer function, the outer-ear transfer function, the pinna response, and the directional transfer function (DTF).

The transfer function ''H''(''f'') of any linear [[time-invariant system]] at frequency ''f'' is:

:''H''(''f'') = Output(''f'') / Input(''f'')

One method of obtaining the HRTF for a given source location is therefore to measure the head-related impulse response (HRIR), ''h''(''t''), at the eardrum for an impulse ''δ''(''t'') placed at the source. The HRTF ''H''(''f'') is the [[Fourier transform]] of the HRIR ''h''(''t'').

Even when measured for a "dummy head" of idealized geometry, HRTFs are complicated functions of [[frequency]] and the [[spherical coordinate system|three spatial variables]]. For distances greater than 1 m from the head, however, the HRTF can be said to attenuate inversely with range. It is this [[far field]] HRTF, ''H''(''f'', ''θ'', ''φ''), that has most often been measured. At closer range, the difference in level between the ears can grow quite large, even in the low-frequency region where negligible level differences are observed in the far field.

HRTFs are typically measured in an [[anechoic chamber]] to minimize the influence of early reflections and [[reverberation]] on the measured response. They are measured at small increments of ''θ'' such as 15° or 30° in the horizontal plane, with [[interpolation]] used to synthesize HRTFs for arbitrary positions of ''θ''. Even with small increments, however, interpolation can lead to front-back confusion, and optimizing the interpolation procedure is an active area of research.

In order to maximize the [[signal-to-noise ratio]] (SNR) in a measured HRTF, the impulse being generated should be of high volume. In practice, however, it can be difficult to generate impulses at high volumes and, if generated, they can be damaging to human ears, so it is more common for HRTFs to be calculated directly in the [[frequency domain]] using a frequency-swept [[sine wave]] or [[maximum length sequence]]s. User fatigue is still a problem, however, highlighting the need for interpolation based on fewer measurements.
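The two estimation routes described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a measurement procedure: the function names are hypothetical, the sweep-based estimate assumes the raw emitted and recorded signals are available, and the small constant guarding the division is a placeholder for proper regularization.

<syntaxhighlight lang="python">
import numpy as np

def hrtf_from_hrir(hrir, fs):
    # H(f) is the Fourier transform of the measured HRIR h(t).
    H = np.fft.rfft(hrir)
    freqs = np.fft.rfftfreq(len(hrir), d=1.0 / fs)
    return freqs, H

def hrtf_from_sweep(recorded, sweep, eps=1e-12):
    # Frequency-domain estimate H(f) = Y(f) / X(f), as used with a
    # frequency-swept sine: divide the spectrum recorded at the eardrum
    # by the spectrum of the emitted sweep. eps is a crude guard
    # against division by zero, not a proper regularizer.
    n = max(len(recorded), len(sweep))
    Y = np.fft.rfft(recorded, n)
    X = np.fft.rfft(sweep, n)
    return Y / (X + eps)
</syntaxhighlight>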
The head-related transfer function is involved in resolving the [[cone of confusion]], the set of source positions for which the [[interaural time difference]] (ITD) and [[interaural level difference]] (ILD) are identical, so that locations on the surface of the cone cannot be distinguished by those cues alone. When a sound reaches the ear, it can enter the ear canal directly or after reflecting off the [[Pinna (anatomy)|pinna]], arriving a fraction of a second later. Since the sound contains many frequencies, many copies of the signal enter the ear canal at different times, each delayed according to how its frequency is reflected and diffracted by the structures of the ear. These copies overlap and interfere: where the phases of the signals match, frequencies are enhanced, and where they do not, frequencies are canceled. Essentially, the brain looks for frequency notches in the signal that correspond to particular known directions of sound.{{Citation needed|date=January 2008}}

If another person's ears were substituted, the individual would not immediately be able to localize sound, as the patterns of enhancement and cancellation would differ from those the person's auditory system is accustomed to. However, after some weeks, the auditory system would adapt to the new head-related transfer function.<ref>{{Cite journal |url=http://www.mbfys.ru.nl/~johnvo/papers/nn98.pdf |journal=Nature Neuroscience |date=September 1998 |title=Relearning sound localization with new ears |volume=1 |issue=5 |pages=417–421 |doi=10.1038/1633 |last1=Hofman |first1=Paul M. |last2=Van Riswick |first2=JG |last3=Van Opstal |first3=AJ |pmid=10196533 |s2cid=10088534}}</ref> The inter-subject variability in the spectra of HRTFs has been studied through cluster analyses.<ref>{{cite journal |last1=So |first1=R.H.Y. |last2=Ngan |first2=B. |last3=Horner |first3=A. |last4=Leung |first4=K.L. |last5=Braasch |first5=J. |last6=Blauert |first6=J. |year=2010 |title=Toward orthogonal non-individualized head-related transfer functions for forward and backward directional sound: cluster analysis and an experimental study |journal=Ergonomics |volume=53 |issue=6 |pages=767–781}}</ref>
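To make the notch-detection idea above concrete, the following Python sketch locates notch frequencies in an HRTF magnitude response. The function name and the prominence threshold are illustrative assumptions, not an established algorithm from the cited literature.

<syntaxhighlight lang="python">
import numpy as np
from scipy.signal import find_peaks

def spectral_notches(H, freqs, prominence_db=6.0):
    # Work on the magnitude response in dB.
    mag_db = 20.0 * np.log10(np.abs(H) + 1e-12)
    # Notches (minima of the magnitude) are peaks of the negated curve;
    # the prominence threshold keeps only pronounced notches.
    idx, _ = find_peaks(-mag_db, prominence=prominence_db)
    return freqs[idx]
</syntaxhighlight>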
To assess the variation between individuals' ears, we can limit our perspective to the degrees of freedom of the head and its relation to the spatial domain. This eliminates the tilt and other coordinate parameters that add complexity. For the purpose of calibration we are concerned only with the direction of the source relative to the ears, hence a specific degree of freedom. Some of the ways in which we can deduce an expression to calibrate the HRTF are:
# Localization of sound in virtual auditory space<ref name="Carlile, S 1996">{{Cite book |last=Carlile |first=S. |title=Virtual Auditory Space: Generation and Applications |date=1996 |publisher=Springer |isbn=9783662225967 |edition=1 |location=Berlin, Heidelberg}}</ref>
# HRTF phase synthesis<ref name=Tashev>{{cite book |author=Tashev, Ivan |title=2014 Information Theory and Applications Workshop (ITA) |chapter=HRTF phase synthesis via sparse representation of anthropometric features |date=2014 |pages=1–5 |doi=10.1109/ITA.2014.6804239 |isbn=978-1-4799-3589-5 |s2cid=13232557}}</ref>
# HRTF magnitude synthesis<ref name=Bilinski>{{cite book |author=Bilinski, Piotr |author2=Ahrens, Jens |author3=Thomas, Mark RP |author4=Tashev, Ivan |author5=Platt, John C |title=2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |chapter=HRTF magnitude synthesis via sparse representation of anthropometric features |publisher=IEEE ICASSP, Florence, Italy |date=2014 |pages=4468–4472 |doi=10.1109/ICASSP.2014.6854447 |isbn=978-1-4799-2893-4 |s2cid=5619011 |chapter-url=https://hal.inria.fr/hal-01097303/file/Piotr%20Bilinski%20-%20ICASSP%202014%20-%20HRTF%20Synthesis.pdf}}</ref>

=== Localization of sound in virtual auditory space ===
A basic assumption in the creation of a virtual auditory space is that if the acoustical waveforms present at a listener's eardrums are the same under headphones as in the free field, then the listener's experience should also be the same. Typically, sounds generated from headphones are perceived as originating from within the head; in a virtual auditory space, the headphones should be able to "externalize" the sound. Using the HRTF, sounds can be spatially positioned using the technique described below.<ref name="Carlile, S 1996"/>

Let ''x''{{sub|1}}(''t'') represent an electrical signal driving a loudspeaker and ''y''{{sub|1}}(''t'') the signal received by a microphone inside the listener's eardrum. Similarly, let ''x''{{sub|2}}(''t'') represent the electrical signal driving a headphone and ''y''{{sub|2}}(''t'') the microphone response to that signal. The goal of the virtual auditory space is to choose ''x''{{sub|2}}(''t'') such that ''y''{{sub|2}}(''t'') = ''y''{{sub|1}}(''t''). Applying the Fourier transform to these signals yields the following two equations:

:''Y''{{sub|1}} = ''X''{{sub|1}}''LFM'', and
:''Y''{{sub|2}} = ''X''{{sub|2}}''HM'',

where ''L'' is the transfer function of the loudspeaker in the free field, ''F'' is the HRTF, ''M'' is the microphone transfer function, and ''H'' is the headphone-to-eardrum transfer function. Setting ''Y''{{sub|1}} = ''Y''{{sub|2}} and solving for ''X''{{sub|2}} yields

:''X''{{sub|2}} = ''X''{{sub|1}}''LF''/''H''.

By observation, the desired transfer function is

:''T'' = ''LF''/''H''.

Therefore, theoretically, if ''x''{{sub|1}}(''t'') is passed through this filter and the resulting ''x''{{sub|2}}(''t'') is played on the headphones, it should produce the same signal at the eardrum. Since the filter applies only to a single ear, a second filter must be derived for the other ear. This process is repeated for many positions in the virtual environment to create an array of head-related transfer functions for each position to be recreated, while ensuring that the sampling conditions satisfy the [[Nyquist rate|Nyquist criterion]].
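A minimal sketch of applying the filter ''T'' = ''LF''/''H'' in the discrete frequency domain follows. The function name is hypothetical; it assumes the three one-sided spectra have already been measured at matching lengths, and the small constant added to ''H'' is a stand-in for the careful regularization that inverting a headphone response actually requires.

<syntaxhighlight lang="python">
import numpy as np

def virtualize(x1, L, F, H, eps=1e-8):
    # L, F, H: complex one-sided spectra (all the same length) of the
    # loudspeaker, HRTF, and headphone-to-eardrum transfer functions.
    n = 2 * (len(L) - 1)          # FFT length matching the spectra
    X1 = np.fft.rfft(x1, n)
    T = L * F / (H + eps)         # desired transfer function T = LF/H
    return np.fft.irfft(X1 * T, n)
</syntaxhighlight>

In practice each ear needs its own filter, derived from its own HRTF and headphone response, and the division by ''H'' is regularized because headphone responses can contain deep notches.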
=== HRTF phase synthesis ===
Phase estimation is less reliable in the very low part of the frequency band, and in the upper frequencies the phase response is affected by the features of the pinna. Earlier studies also show that the HRTF phase response is mostly linear and that listeners are insensitive to the details of the interaural phase spectrum as long as the interaural time delay (ITD) of the combined low-frequency part of the waveform is maintained. The phase response of a subject's HRTF can therefore be modeled as a time delay that depends on the direction and elevation of the source.<ref name="Tashev"/>

The scaling factor of this delay is a function of the anthropometric features. For example, over a training set of ''N'' subjects, each HRTF phase is considered and a single ITD scaling factor is described as the average delay of the group. This computed scaling factor can estimate the time delay as a function of the direction and elevation for any given individual. Converting the time delay to a phase response for the left and the right ears is trivial.

The HRTF phase can be described by the [[Interaural time difference#Duplex theory|ITD]] scaling factor, which is in turn quantified by the anthropometric data of a given individual taken as the source of reference. For the generic case we consider a sparse vector

: <math> \beta = [ \beta_1, \beta_2, \ldots, \beta_N ]^T </math>

that represents the subject's anthropometric features as a linear superposition of the anthropometric features from the training data (''y''{{sup|'}} = ''β''{{sup|T}}''X''), and then apply the same sparse vector directly to the scaling vector ''H''. This task can be written as a minimization problem, for a non-negative shrinking parameter ''λ'':

: <math> \beta = \operatorname{argmin}\limits_\beta \left( \sum_{a=1}^A \left( y_a - \sum_{n=1}^N \beta_n X_{n,a} \right)^2 + \lambda \sum_{n=1}^N |\beta_n| \right) </math>

From this, the ITD scaling factor value ''H''{{sup|'}} is estimated as:

: <math> H' = \sum_{n=1}^N \beta_n H_n, </math>

where the ITD scaling factors for all persons in the dataset are stacked in a vector ''H'' ∈ '''''R'''''{{sup|''N''}}, so the value ''H''{{sub|''n''}} corresponds to the scaling factor of the ''n''-th person.

=== HRTF magnitude synthesis ===
The above minimization problem is solved using the [[least absolute shrinkage and selection operator]] (LASSO). We assume that the HRTFs are represented by the same relation as the anthropometric features.<ref name="Bilinski"/> Therefore, once we learn the sparse vector ''β'' from the anthropometric features, we apply it directly to the HRTF tensor data, and the subject's HRTF values ''H''{{sup|'}} are given by:

: <math> H'_{d,k} = \sum_{n=1}^N \beta_n H_{n,d,k}, </math>

where the HRTFs for each subject are described by a tensor of size ''D'' × ''K'', with ''D'' the number of HRTF directions and ''K'' the number of frequency bins. All HRTFs of the training set are stacked in a tensor ''H'' ∈ '''''R'''''{{sup|''N''×''D''×''K''}}, so the value ''H''{{sub|''n'',''d'',''k''}} corresponds to the ''k''-th frequency bin for the ''d''-th HRTF direction of the ''n''-th person, and ''H''{{sup|'}}{{sub|''d'',''k''}} corresponds to the ''k''-th frequency bin for the ''d''-th direction of the synthesized HRTF.
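The sparse-representation step common to both syntheses can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited authors' implementation: the data here are random placeholders, the dimensions and the shrinkage parameter ''alpha'' are arbitrary, and scikit-learn's LASSO solver stands in for whichever solver the original work used.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import Lasso  # LASSO solver for the sparse vector beta

# Hypothetical training data: N subjects, A anthropometric features each,
# and one HRTF tensor per subject (D directions x K frequency bins).
N, A, D, K = 50, 8, 72, 128
rng = np.random.default_rng(0)
X = rng.normal(size=(N, A))        # anthropometric features of the training set
H = rng.normal(size=(N, D, K))     # stacked HRTFs of the training set
y_new = rng.normal(size=A)         # features of the new (test) subject

# Express the new subject's features as a sparse non-negative combination
# of the training subjects' features: y_new ~ X.T @ beta, with an L1 penalty.
lasso = Lasso(alpha=0.1, fit_intercept=False, positive=True, max_iter=10000)
lasso.fit(X.T, y_new)              # design matrix is A x N, target has length A
beta = lasso.coef_                 # sparse vector of length N

# Apply the same sparse weights directly to the stacked HRTF tensor:
# H_synth[d, k] = sum_n beta[n] * H[n, d, k]
H_synth = np.tensordot(beta, H, axes=1)   # shape (D, K)
</syntaxhighlight>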