== Algorithm ==

=== Scale-space extrema detection ===

We begin by detecting points of interest, which are termed ''keypoints'' in the SIFT framework. The image is [[Convolution|convolved]] with Gaussian filters at different scales, and then the differences of successive [[Gaussian blur|Gaussian-blurred]] images are taken. Keypoints are then taken as maxima/minima of the [[Difference of Gaussians]] (DoG) that occur at multiple scales. Specifically, a DoG image <math>D \left( x, y, \sigma \right)</math> is given by

:<math>D \left( x, y, \sigma \right) = L \left( x, y, k_i\sigma \right) - L \left( x, y, k_j\sigma \right)</math>,

where <math>L \left( x, y, k\sigma \right)</math> is the convolution of the original image <math>I \left( x, y \right)</math> with the [[Gaussian blur]] <math>G \left( x, y, k\sigma \right)</math> at scale <math>k\sigma</math>, i.e.,

:<math>L \left( x, y, k\sigma \right) = G \left( x, y, k\sigma \right) * I \left( x, y \right)</math>

Hence a DoG image between scales <math>k_i\sigma</math> and <math>k_j\sigma</math> is just the difference of the Gaussian-blurred images at scales <math>k_i\sigma</math> and <math>k_j\sigma</math>.

For [[scale space]] extrema detection in the SIFT algorithm, the image is first convolved with Gaussian blurs at different scales. The convolved images are grouped by octave (an octave corresponds to doubling the value of <math>\sigma</math>), and the value of <math>k_i</math> is selected so that we obtain a fixed number of convolved images per octave. Then the Difference-of-Gaussian images are taken from adjacent Gaussian-blurred images per octave.

Once DoG images have been obtained, keypoints are identified as local minima/maxima of the DoG images across scales. This is done by comparing each pixel in the DoG images to its eight neighbors at the same scale and nine corresponding neighboring pixels in each of the neighboring scales. If the pixel value is the maximum or minimum among all compared pixels, it is selected as a candidate keypoint.

This keypoint detection step is a variation of one of the [[blob detection]] methods developed by Lindeberg by detecting scale-space extrema of the scale-normalized Laplacian;<ref name="Lin94Book" /><ref name="Lindeberg1998" /> that is, detecting points that are local extrema with respect to both space and scale, in the discrete case by comparisons with the nearest 26 neighbors in a discretized scale-space volume. The difference of Gaussians operator can be seen as an approximation to the Laplacian, with the implicit normalization in the [[pyramid (image processing)|pyramid]] also constituting a discrete approximation of the scale-normalized Laplacian.<ref name="Lindeberg2012" /> Another real-time implementation of scale-space extrema of the Laplacian operator has been presented by Lindeberg and Bretzner based on a hybrid pyramid representation,<ref name="Lindenberg2003" /> which was used for human–computer interaction by real-time gesture recognition in Bretzner et al. (2002).<ref>Lars Bretzner, Ivan Laptev, Tony Lindeberg, [http://kth.diva-portal.org/smash/record.jsf?pid=diva2%3A462620&dswid=608 "Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering"], Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 20–21 May 2002, pages 423–428. {{ISBN|0-7695-1602-5}}, {{doi|10.1109/AFGR.2002.1004190}}</ref>
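The construction described above can be illustrated with a short Python sketch. This is not Lowe's reference implementation: it assumes [[NumPy]] and [[SciPy]], processes a single octave only, and its function name and default parameters are chosen purely for illustration.

<syntaxhighlight lang="python">
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, sigma=1.6, k=2 ** (1 / 3), num_blurs=6):
    """Candidate keypoints as 26-neighbor extrema of a single-octave DoG stack."""
    # Gaussian-blurred images at successive scales sigma * k^i.
    blurred = [gaussian_filter(image.astype(float), sigma * k ** i)
               for i in range(num_blurs)]
    # Difference-of-Gaussian images between adjacent scales.
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(num_blurs - 1)])
    candidates = []
    # Compare each interior sample with its 8 neighbors at the same scale
    # and 9 neighbors in each adjacent scale (26 in total).
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                patch = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                value = dog[s, y, x]
                if value == patch.max() or value == patch.min():
                    candidates.append((x, y, s))
    return dog, candidates
</syntaxhighlight>

A full implementation would repeatedly downsample the image to form further octaves and pass every candidate through the localization and filtering steps described next.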
=== Keypoint localization ===

[[File:Sift keypoints filtering.jpg|thumb|After scale-space extrema are detected (their location being shown in the uppermost image) the SIFT algorithm discards low-contrast keypoints (remaining points are shown in the middle image) and then filters out those located on edges. The resulting set of keypoints is shown in the last image.]]

Scale-space extrema detection produces too many keypoint candidates, some of which are unstable. The next step in the algorithm is to perform a detailed fit to the nearby data for accurate location, scale, and ratio of [[principal curvatures]]. This information allows the rejection of points which are low-contrast (and are therefore sensitive to noise) or poorly localized along an edge.

==== Interpolation of nearby data for accurate position ====

First, for each candidate keypoint, interpolation of nearby data is used to accurately determine its position. The initial approach was to just locate each keypoint at the location and scale of the candidate keypoint.<ref name="Lowe1999" /> The new approach calculates the interpolated location of the extremum, which substantially improves matching and stability.<ref name="Lowe2004" /> The interpolation is done using the quadratic [[Taylor expansion]] of the Difference-of-Gaussian scale-space function, <math>D \left( x, y, \sigma \right)</math>, with the candidate keypoint as the origin. This Taylor expansion is given by:

:<math>D(\textbf{x}) = D + \frac{\partial D}{\partial \textbf{x}}^T\textbf{x} + \frac{1}{2}\textbf{x}^T \frac{\partial^2 D}{\partial \textbf{x}^2} \textbf{x}</math>

where D and its derivatives are evaluated at the candidate keypoint and <math>\textbf{x} = \left( x, y, \sigma \right)^T</math> is the offset from this point. The location of the extremum, <math>\hat{\textbf{x}}</math>, is determined by taking the derivative of this function with respect to <math>\textbf{x}</math> and setting it to zero. If the offset <math>\hat{\textbf{x}}</math> is larger than <math>0.5</math> in any dimension, this indicates that the extremum lies closer to another candidate keypoint. In this case, the candidate keypoint is changed and the interpolation is performed about that point instead. Otherwise the offset is added to its candidate keypoint to get the interpolated estimate for the location of the extremum. A similar subpixel determination of the locations of scale-space extrema is performed in the real-time implementation based on hybrid pyramids developed by Lindeberg and his co-workers.<ref name="Lindenberg2003" />

==== Discarding low-contrast keypoints ====

To discard the keypoints with low contrast, the value of the second-order Taylor expansion <math>D(\textbf{x})</math> is computed at the offset <math>\hat{\textbf{x}}</math>. If this value is less than <math>0.03</math>, the candidate keypoint is discarded. Otherwise it is kept, with final scale-space location <math>\textbf{y} + \hat{\textbf{x}}</math>, where <math>\textbf{y}</math> is the original location of the keypoint.

==== Eliminating edge responses ====

The DoG function will have strong responses along edges, even if the candidate keypoint is not robust to small amounts of noise. Therefore, in order to increase stability, we need to eliminate the keypoints that have poorly determined locations but have high edge responses.

For poorly defined peaks in the DoG function, the [[principal curvature]] across the edge would be much larger than the principal curvature along it. Finding these principal curvatures amounts to solving for the [[Eigenvalues and eigenvectors|eigenvalues]] of the second-order [[Hessian matrix]], '''H''':

:<math> \textbf{H} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix} </math>

The eigenvalues of '''H''' are proportional to the principal curvatures of D. For SIFT's purposes, it suffices to consider the ratio of the two eigenvalues: let <math>\alpha</math> be the larger one and <math>\beta</math> the smaller one, with ratio <math>r = \alpha/\beta</math>. The trace of '''H''', i.e., <math>D_{xx} + D_{yy}</math>, gives us the sum of the two eigenvalues, while its determinant, i.e., <math>D_{xx} D_{yy} - D_{xy}^2</math>, yields the product. The ratio <math> \text{R} = \operatorname{Tr}(\textbf{H})^2 / \operatorname{Det}(\textbf{H})</math> can be shown to be equal to <math>(r+1)^2/r</math>, which depends only on the ratio of the eigenvalues rather than their individual values. R is minimal when the eigenvalues are equal to each other. Therefore, the higher the [[absolute difference]] between the two eigenvalues, which is equivalent to a higher absolute difference between the two principal curvatures of D, the higher the value of R. It follows that, for some threshold eigenvalue ratio <math>r_{\text{th}}</math>, if R for a candidate keypoint is larger than <math>(r_{\text{th}} + 1)^2/r_{\text{th}}</math>, that keypoint is poorly localized and hence rejected. The new approach uses <math>r_{\text{th}} = 10</math>.<ref name="Lowe2004" />

This processing step for suppressing responses at edges is a transfer of a corresponding approach in the [[Harris corner detector|Harris operator]] for corner detection. The difference is that the measure for thresholding is computed from the Hessian matrix instead of a [[Structure tensor|second-moment matrix]].
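The three tests above (offset interpolation, contrast, edge response) can be combined into one refinement routine. The following sketch is illustrative only: it reuses the hypothetical <code>dog</code> array from the previous sketch, computes derivatives by finite differences, and takes the thresholds <math>0.03</math> and <math>r_{\text{th}} = 10</math> from this section.

<syntaxhighlight lang="python">
import numpy as np

def refine_keypoint(dog, s, y, x, contrast_thresh=0.03, r_th=10.0):
    """Quadratic fit around a candidate extremum; returns (offset, keep)."""
    d = dog[s, y, x]
    # Gradient of D by central differences, in (scale, y, x) order.
    g = 0.5 * np.array([dog[s + 1, y, x] - dog[s - 1, y, x],
                        dog[s, y + 1, x] - dog[s, y - 1, x],
                        dog[s, y, x + 1] - dog[s, y, x - 1]])
    # Hessian of D by finite differences.
    H = np.empty((3, 3))
    H[0, 0] = dog[s + 1, y, x] - 2 * d + dog[s - 1, y, x]
    H[1, 1] = dog[s, y + 1, x] - 2 * d + dog[s, y - 1, x]
    H[2, 2] = dog[s, y, x + 1] - 2 * d + dog[s, y, x - 1]
    H[0, 1] = H[1, 0] = 0.25 * (dog[s + 1, y + 1, x] - dog[s + 1, y - 1, x]
                                - dog[s - 1, y + 1, x] + dog[s - 1, y - 1, x])
    H[0, 2] = H[2, 0] = 0.25 * (dog[s + 1, y, x + 1] - dog[s + 1, y, x - 1]
                                - dog[s - 1, y, x + 1] + dog[s - 1, y, x - 1])
    H[1, 2] = H[2, 1] = 0.25 * (dog[s, y + 1, x + 1] - dog[s, y + 1, x - 1]
                                - dog[s, y - 1, x + 1] + dog[s, y - 1, x - 1])
    # Offset of the interpolated extremum: solve H @ offset = -g.
    offset = -np.linalg.solve(H, g)
    if np.any(np.abs(offset) > 0.5):
        return offset, False  # extremum lies nearer another sample point
    # Contrast test: |D| at the interpolated extremum.
    if abs(d + 0.5 * g.dot(offset)) < contrast_thresh:
        return offset, False
    # Edge test on the 2x2 spatial block of the Hessian (y, x components).
    tr = H[1, 1] + H[2, 2]
    det = H[1, 1] * H[2, 2] - H[1, 2] ** 2
    if det <= 0 or tr ** 2 / det >= (r_th + 1) ** 2 / r_th:
        return offset, False
    return offset, True
</syntaxhighlight>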
=== Orientation assignment ===

In this step, each keypoint is assigned one or more orientations based on local image gradient directions. This is the key step in achieving [[rotational invariance|invariance to rotation]], as the keypoint descriptor can be represented relative to this orientation and thereby achieve invariance to image rotation.

First, the Gaussian-smoothed image <math>L \left( x, y, \sigma \right)</math> at the keypoint's scale <math>\sigma</math> is taken so that all computations are performed in a scale-invariant manner. For an image sample <math>L \left( x, y \right)</math> at scale <math>\sigma</math>, the gradient magnitude, <math>m \left( x, y \right)</math>, and orientation, <math>\theta \left( x, y \right)</math>, are precomputed using pixel differences:

:<math>m \left( x, y \right) = \sqrt{\left( L \left( x+1, y \right) - L \left( x-1, y \right) \right)^2 + \left( L \left( x, y+1 \right) - L \left( x, y-1 \right) \right)^2}</math>

:<math>\theta \left( x, y \right) = \mathrm{atan2}\left(L \left( x, y+1 \right) - L \left( x, y-1 \right), L \left( x+1, y \right) - L \left( x-1, y \right) \right)</math>

The magnitude and direction calculations for the gradient are done for every pixel in a neighboring region around the keypoint in the Gaussian-blurred image L. An orientation histogram with 36 bins is formed, with each bin covering 10 degrees. Each sample in the neighboring window added to a histogram bin is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a <math>\sigma</math> that is 1.5 times the scale of the keypoint. The peaks in this histogram correspond to dominant orientations. Once the histogram is filled, the orientations corresponding to the highest peak and local peaks that are within 80% of the highest peak are assigned to the keypoint. In the case of multiple orientations being assigned, an additional keypoint is created having the same location and scale as the original keypoint for each additional orientation.
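As an illustration, the orientation histogram might be computed as follows. The sketch below makes simplifying assumptions (a fixed square window, nearest-bin assignment, no parabolic interpolation of peak positions) and returns bin-centre angles rather than interpolated orientations; the function name and parameters are hypothetical.

<syntaxhighlight lang="python">
import numpy as np

def dominant_orientations(L, x, y, sigma, radius=8):
    """36-bin gradient-orientation histogram around (x, y); returns peak angles."""
    hist = np.zeros(36)
    weight_sigma = 1.5 * sigma  # Gaussian window, 1.5 times the keypoint scale
    for j in range(-radius, radius + 1):
        for i in range(-radius, radius + 1):
            yy, xx = y + j, x + i
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue  # skip samples whose pixel differences leave the image
            dx = L[yy, xx + 1] - L[yy, xx - 1]
            dy = L[yy + 1, xx] - L[yy - 1, xx]
            magnitude = np.hypot(dx, dy)
            theta = np.degrees(np.arctan2(dy, dx)) % 360.0
            weight = np.exp(-(i * i + j * j) / (2 * weight_sigma ** 2))
            hist[int(theta // 10) % 36] += weight * magnitude
    peak = hist.max()
    # Keep the highest peak and any local peak within 80% of it (bin centres).
    return [b * 10 + 5 for b in range(36)
            if hist[b] >= 0.8 * peak
            and hist[b] > hist[(b - 1) % 36]
            and hist[b] > hist[(b + 1) % 36]]
</syntaxhighlight>

Each returned angle would spawn its own keypoint, as described above.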
=== Keypoint descriptor ===

Previous steps found keypoint locations at particular scales and assigned orientations to them. This ensured invariance to image location, scale and rotation. Now we want to compute a descriptor vector for each keypoint such that the descriptor is highly distinctive and partially invariant to the remaining variations, such as illumination and 3D viewpoint. This step is performed on the image closest in scale to the keypoint's scale.

First a set of orientation histograms is created on 4×4 pixel neighborhoods with 8 bins each. These histograms are computed from magnitude and orientation values of samples in a 16×16 region around the keypoint such that each histogram contains samples from a 4×4 subregion of the original neighborhood region. The image gradient magnitudes and orientations are sampled around the keypoint location, using the scale of the keypoint to select the level of Gaussian blur for the image. In order to achieve orientation invariance, the coordinates of the descriptor and the gradient orientations are rotated relative to the keypoint orientation. The magnitudes are further weighted by a Gaussian function with <math>\sigma</math> equal to one half the width of the descriptor window. The descriptor then becomes a vector of all the values of these histograms. Since there are 4 × 4 = 16 histograms each with 8 bins, the vector has 128 elements. This vector is then normalized to unit length in order to enhance invariance to affine changes in illumination. To reduce the effects of non-linear illumination a threshold of 0.2 is applied and the vector is again normalized. The thresholding process, also referred to as clamping, can improve matching results even when non-linear illumination effects are not present.<ref name=":0">Kirchner, Matthew R. "[https://arxiv.org/abs/1811.03173 Automatic thresholding of SIFT descriptors]." In ''Image Processing (ICIP), 2016 IEEE International Conference on'', pp. 291–295. IEEE, 2016.</ref> The threshold of 0.2 was empirically chosen, and by replacing the fixed threshold with one systematically calculated, matching results can be improved.<ref name=":0" />

Although the dimension of the descriptor, i.e. 128, seems high, descriptors with lower dimension than this do not perform as well across the range of matching tasks,<ref name="Lowe2004" /> and the computational cost remains low due to the approximate BBF (see below) method used for finding the nearest neighbor. Longer descriptors continue to do better, but not by much, and there is an additional danger of increased sensitivity to distortion and occlusion. It is also shown that feature matching accuracy is above 50% for viewpoint changes of up to 50 degrees. Therefore, SIFT descriptors are invariant to minor affine changes.
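The final normalization and clamping stage described above is simple to state in code. The following is a minimal sketch, assuming the 4×4×8 array of histogram values has already been computed; the function name and argument layout are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def finalize_descriptor(histograms, clamp=0.2):
    """Flatten 4x4x8 orientation histograms into a normalized 128-vector."""
    v = np.asarray(histograms, dtype=float).ravel()  # 4 * 4 * 8 = 128 values
    v /= max(np.linalg.norm(v), 1e-12)  # unit length: affine illumination invariance
    v = np.minimum(v, clamp)            # clamp to reduce non-linear illumination effects
    v /= max(np.linalg.norm(v), 1e-12)  # renormalize after clamping
    return v
</syntaxhighlight>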
To test the distinctiveness of the SIFT descriptors, matching accuracy is also measured against a varying number of keypoints in the testing database, and it is shown that matching accuracy decreases only very slightly for very large database sizes, thus indicating that SIFT features are highly distinctive.