=== Scale-space extrema detection ===
The first step is to detect points of interest, which are termed ''keypoints'' in the SIFT framework. The image is [[Convolution|convolved]] with Gaussian filters at different scales, and then the differences of successive [[Gaussian blur|Gaussian-blurred]] images are taken. Keypoints are then taken as maxima/minima of the [[Difference of Gaussians]] (DoG) that occur at multiple scales. Specifically, a DoG image <math>D \left( x, y, \sigma \right)</math> is given by

:<math>D \left( x, y, \sigma \right) = L \left( x, y, k_i\sigma \right) - L \left( x, y, k_j\sigma \right)</math>,
:where <math>L \left( x, y, k\sigma \right)</math> is the convolution of the original image <math>I \left( x, y \right)</math> with the [[Gaussian blur]] kernel <math>G \left( x, y, k\sigma \right)</math> at scale <math>k\sigma</math>, i.e.,

:<math>L \left( x, y, k\sigma \right) = G \left( x, y, k\sigma \right) * I \left( x, y \right)</math>

Hence a DoG image between scales <math>k_i\sigma</math> and <math>k_j\sigma</math> is just the difference of the Gaussian-blurred images at scales <math>k_i\sigma</math> and <math>k_j\sigma</math>. For [[scale space]] extrema detection in the SIFT algorithm, the image is first convolved with Gaussian blurs at different scales. The convolved images are grouped by octave (an octave corresponds to doubling the value of <math>\sigma</math>), and the value of <math>k_i</math> is selected so that a fixed number of convolved images is obtained per octave. The Difference-of-Gaussian images are then computed from adjacent Gaussian-blurred images within each octave.
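
The construction of one octave can be illustrated with a minimal Python sketch that uses only the standard library. The function names, the clamped border handling, the kernel truncation at about three standard deviations, and the defaults (a base scale of 1.6 and ''s'' + 3 blurred images for ''s'' intervals, following the convention in Lowe's paper) are assumptions made for illustration, not part of the description above.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at about 3 sigma (assumption)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(kernel[j + radius] * row[min(max(x + j, 0), width - 1)]
             for j in range(-radius, radius + 1))
         for x in range(width)]
        for row in image
    ]

def transpose(image):
    return [list(column) for column in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def dog_octave(image, sigma=1.6, intervals=3):
    """Return the scales and the DoG images of a single octave.

    k is chosen so that the scale doubles after `intervals` steps.
    """
    k = 2 ** (1.0 / intervals)
    sigmas = [sigma * k ** i for i in range(intervals + 3)]
    blurred = [gaussian_blur(image, s) for s in sigmas]
    # Each DoG image is the more-blurred image minus the less-blurred one.
    dogs = [
        [[hi_px - lo_px for lo_px, hi_px in zip(lo_row, hi_row)]
         for lo_row, hi_row in zip(lo, hi)]
        for lo, hi in zip(blurred, blurred[1:])
    ]
    return sigmas, dogs
```

Here an image is a list of rows of floating-point intensities. A full implementation would also downsample the image by a factor of two before building the next octave, which this sketch omits.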

Once DoG images have been obtained, keypoints are identified as local minima/maxima of the DoG images across scales. This is done by comparing each pixel in the DoG images to its eight neighbors at the same scale and the nine corresponding neighboring pixels in each of the two neighboring scales, 26 neighbors in total. If the pixel value is the maximum or minimum among all compared pixels, it is selected as a candidate keypoint.
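
The neighbor comparison can be sketched in Python as follows. The function names are hypothetical, the DoG stack is assumed to be a list of equally sized images (lists of rows) ordered by scale, and the use of strict inequalities, which rejects ties, is an assumption rather than something stated above.

```python
def is_extremum(dogs, s, y, x):
    """True if dogs[s][y][x] is strictly above or strictly below all 26 neighbors."""
    value = dogs[s][y][x]
    neighbors = [
        dogs[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)      # scale below, same scale, scale above
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return value > max(neighbors) or value < min(neighbors)

def find_candidate_keypoints(dogs):
    """Scan all interior positions of the stack and collect (scale, row, column)."""
    height, width = len(dogs[0]), len(dogs[0][0])
    return [
        (s, y, x)
        for s in range(1, len(dogs) - 1)   # first and last scale lack a neighbor
        for y in range(1, height - 1)
        for x in range(1, width - 1)
        if is_extremum(dogs, s, y, x)
    ]

# A 3 x 3 x 3 stack with a single peak in the middle.
stack = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
stack[1][1][1] = 1.0
print(find_candidate_keypoints(stack))  # [(1, 1, 1)]
```

Only positions with a full set of 26 neighbors are tested, so the first and last DoG images of an octave and the image border never yield candidates in this sketch.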

This keypoint detection step is a variation of one of the [[blob detection]] methods developed by Lindeberg, which detects scale-space extrema of the scale-normalized Laplacian;<ref name="Lin94Book" /><ref name="Lindeberg1998" /> that is, it detects points that are local extrema with respect to both space and scale, in the discrete case by comparison with the nearest 26 neighbors in a discretized scale-space volume. The difference of Gaussians operator can be seen as an approximation to the Laplacian, with the implicit normalization in the [[pyramid (image processing)|pyramid]] also constituting a discrete approximation of the scale-normalized Laplacian.<ref name="Lindeberg2012" /> Another real-time implementation of scale-space extrema of the Laplacian operator has been presented by Lindeberg and Bretzner based on a hybrid pyramid representation,<ref name="Lindenberg2003" /> which was used for human-computer interaction by real-time gesture recognition in Bretzner et al. (2002).<ref>Lars Bretzner, Ivan Laptev, Tony Lindeberg [http://kth.diva-portal.org/smash/record.jsf?pid=diva2%3A462620&dswid=608 "Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering"], Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 21–21 May 2002, pages 423–428. {{ISBN|0-7695-1602-5}}, {{doi|10.1109/AFGR.2002.1004190}}</ref>
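
The approximation of the Laplacian by the difference of Gaussians can be made explicit with a first-order sketch, assuming that adjacent scales differ by a constant factor <math>k</math> close to 1. The Gaussian kernel satisfies the diffusion equation

:<math>\frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G</math>,

and replacing the derivative by a finite difference between the scales <math>\sigma</math> and <math>k\sigma</math> gives

:<math>G \left( x, y, k\sigma \right) - G \left( x, y, \sigma \right) \approx \left( k - 1 \right) \sigma^2 \nabla^2 G</math>.

Convolving both sides with the image <math>I \left( x, y \right)</math> yields <math>D \left( x, y, \sigma \right) \approx \left( k - 1 \right) \sigma^2 \nabla^2 L</math>, which is the scale-normalized Laplacian <math>\sigma^2 \nabla^2 L</math> up to the factor <math>k - 1</math>. Because this factor is the same at every scale when <math>k</math> is fixed, it does not change the location of the extrema.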