Scale-invariant feature transform
== Stages ==

=== Scale-invariant feature detection ===

Lowe's method for image feature generation transforms an image into a large collection of feature vectors, each of which is invariant to image translation, scaling, and rotation, partially invariant to illumination changes, and robust to local geometric distortion. These features share similar properties with neurons in the [[primary visual cortex]] that encode basic forms, color, and movement for object detection in primate vision.<ref name="Serre2005" /> Key locations are defined as maxima and minima of the [[difference of Gaussians]] function applied in [[scale space]] to a series of smoothed and resampled images. Low-contrast candidate points and responses along edges are discarded, and dominant orientations are assigned to the localized key points; these steps make the key points more stable for matching and recognition. SIFT descriptors robust to local affine distortion are then obtained by considering pixels around a radius of the key location, blurring, and resampling local image orientation planes.

=== Feature matching and indexing ===

Indexing consists of storing SIFT keys and identifying matching keys from the new image. Lowe used a modification of the [[k-d tree]] algorithm called the '''[[Best bin first|best-bin-first]] search''' (BBF) method<ref name="Beis1997" /> that can identify the [[nearest neighbor search|nearest neighbors]] with high probability using only a limited amount of computation. The BBF algorithm uses a modified search ordering for the [[k-d tree]] algorithm so that bins in feature space are searched in the order of their closest distance from the query location; this search order is determined efficiently with a [[heap (data structure)|heap]]-based [[priority queue]]. A candidate for each keypoint is obtained by identifying its nearest neighbor in the database of keypoints from training images.
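The best-bin-first idea can be sketched as below. This is a simplified illustration, not Lowe's implementation: the tree layout, the per-axis lower bound, and the pruning rule are illustrative choices, and real SIFT descriptors are 128-dimensional rather than 2D.

```python
import heapq
import numpy as np

def build_kdtree(points, indices=None, depth=0):
    # node = (axis, split value, point index, left subtree, right subtree)
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % points.shape[1]
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return (axis, points[indices[mid]][axis], indices[mid],
            build_kdtree(points, indices[:mid], depth + 1),
            build_kdtree(points, indices[mid + 1:], depth + 1))

def bbf_nearest(tree, points, query, max_checks=200):
    """Approximate nearest neighbor: visit tree bins in order of a
    lower bound on their distance to the query (heap-based priority
    queue), cutting the search off after max_checks node visits."""
    best_dist, best_idx = np.inf, -1
    counter = 0                    # tiebreaker so the heap never compares nodes
    heap = [(0.0, counter, tree)]
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if node is None or bound >= best_dist:
            continue               # this bin cannot contain a closer point
        checks += 1
        axis, split, idx, left, right = node
        dist = np.linalg.norm(points[idx] - query)
        if dist < best_dist:
            best_dist, best_idx = dist, idx
        diff = query[axis] - split
        near, far = (left, right) if diff <= 0 else (right, left)
        counter += 1
        heapq.heappush(heap, (bound, counter, near))
        counter += 1
        heapq.heappush(heap, (max(bound, abs(diff)), counter, far))
    return best_idx, best_dist
```

With a generous `max_checks` the search is exact; the cutoff trades a small loss of accuracy for bounded computation, which is the point of BBF.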
The nearest neighbors are defined as the keypoints with minimum [[Euclidean distance]] from the given descriptor vector. Lowe<ref name=Lowe2004 /> decided whether to keep or discard a candidate match by comparing the distance to its nearest neighbor against the distance to the closest keypoint from a different object class: if the ratio of these two distances exceeds 0.8, the match is rejected, because a nearest neighbor that is only marginally closer than a keypoint belonging to another class is likely to be ambiguous. This criterion eliminated 90% of the false matches while discarding less than 5% of the correct matches. To further improve efficiency, the best-bin-first search was cut off after checking the first 200 nearest-neighbor candidates. For a database of 100,000 keypoints, this provides a speedup over exact nearest-neighbor search of about 2 orders of magnitude, yet results in less than a 5% loss in the number of correct matches.

=== Cluster identification by Hough transform voting ===

The [[Hough transform]] is used to cluster reliable model [[Hypothesis|hypotheses]], searching for keys that agree on a particular model [[Pose (computer vision)|pose]]. Each feature votes for all object poses that are consistent with it, so the transform identifies clusters of features with a consistent interpretation. When clusters of features are found to vote for the same pose of an object, the probability of the interpretation being correct is much higher than for any single feature.
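The distance-ratio criterion described above can be illustrated with a brute-force matcher; a minimal sketch, in which the descriptor values in any test data are made up for illustration:

```python
import numpy as np

def ratio_test_matches(query_desc, train_desc, ratio=0.8):
    """Keep a match only when the nearest training descriptor is clearly
    closer than the second nearest (the 0.8 distance-ratio threshold)."""
    matches = []
    for i, q in enumerate(query_desc):
        d = np.linalg.norm(train_desc - q, axis=1)  # Euclidean distances
        nearest, second = np.argsort(d)[:2]
        if d[nearest] < ratio * d[second]:          # unambiguous match
            matches.append((i, nearest))
    return matches
```

A query descriptor that is nearly equidistant from two training keys fails the test and produces no match, which is exactly the ambiguity the threshold is meant to reject.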
An entry in a [[hash table]] is created predicting the model location, orientation, and scale from the match hypothesis. The hash table is searched to identify all clusters of at least 3 entries in a bin, and the bins are sorted into decreasing order of size. Each SIFT keypoint specifies 2D location, scale, and orientation, and each matched keypoint in the database has a record of its parameters relative to the training image in which it was found. The similarity transform implied by these 4 parameters is only an approximation to the full 6 degree-of-freedom pose space for a 3D object and also does not account for any non-rigid deformations. Therefore, Lowe<ref name=Lowe2004 /> used broad bin sizes of 30 degrees for orientation, a factor of 2 for scale, and 0.25 times the maximum projected training image dimension (using the predicted scale) for location. The SIFT key samples generated at the larger scale are given twice the weight of those at the smaller scale, so the larger scale in effect filters the most likely neighbors for checking at the smaller scale; this also improves recognition performance by giving more weight to the least-noisy scale. To avoid the problem of boundary effects in bin assignment, each keypoint match votes for the 2 closest bins in each dimension, giving a total of 16 entries for each hypothesis and further broadening the pose range.

=== Model verification by linear least squares ===

Each identified cluster is then subject to a verification procedure in which a [[linear least squares (mathematics)|linear least squares]] solution is performed for the parameters of the [[affine transformation]] relating the model to the image.
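The broad-bin voting scheme above can be sketched as follows. The match tuples and image dimension in any test input are made-up values, and orientation wrap-around is ignored for brevity:

```python
from collections import defaultdict
from itertools import product
from math import floor

def pose_votes(matches, max_training_dim):
    """Vote each match into the 2 closest bins per pose dimension
    (orientation, log2 scale, x, y): 16 hash-table entries per match.
    Bin widths follow the broad sizes above: 30 degrees for orientation,
    a factor of 2 for scale, 0.25 * max training dimension for location."""
    votes = defaultdict(list)
    loc_bin = 0.25 * max_training_dim
    for m, (theta_deg, log2_scale, x, y) in enumerate(matches):
        coords = (theta_deg / 30.0, log2_scale, x / loc_bin, y / loc_bin)
        # the 2 closest integer bins along each dimension
        options = [(floor(c - 0.5), floor(c + 0.5)) for c in coords]
        for bins in product(*options):
            votes[bins].append(m)
    return votes
```

Two matches with similar pose parameters then land in a shared bin even when they straddle a bin boundary, which is the point of voting for both neighbors in each dimension.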
The affine transformation of a model point [x y]<sup>T</sup> to an image point [u v]<sup>T</sup> can be written as

:<math> \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix} </math>

where the model translation is [t<sub>x</sub> t<sub>y</sub>]<sup>T</sup> and the affine rotation, scale, and stretch are represented by the parameters m<sub>1</sub>, m<sub>2</sub>, m<sub>3</sub> and m<sub>4</sub>. To solve for the transformation parameters, the equation above can be rewritten to gather the unknowns into a column vector:

:<math> \begin{bmatrix} x & y & 0 & 0 & 1 & 0 \\ 0 & 0 & x & y & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{bmatrix} = \begin{bmatrix} u \\ v \\ \vdots \end{bmatrix} </math>

This equation shows a single match, but any number of further matches can be added, with each match contributing two more rows to the first and last matrix. At least 3 matches are needed to provide a solution. We can write this linear system as

:<math>A\hat{\mathbf{x}} \approx \mathbf{b},</math>

where ''A'' is a known ''m''-by-''n'' [[Matrix (mathematics)|matrix]] (usually with ''m'' > ''n''), '''x''' is an unknown ''n''-dimensional parameter [[vector space|vector]], and '''b''' is a known ''m''-dimensional measurement vector. The vector <math>\hat{\mathbf{x}}</math> that minimizes the sum of squared residuals <math>\|A\hat{\mathbf{x}} - \mathbf{b}\|^2</math> is a solution of the '''normal equation'''

:<math> A^T \! A \hat{\mathbf{x}} = A^T \mathbf{b}. </math>

The solution of the system of linear equations is given in terms of the matrix <math>(A^TA)^{-1}A^T</math>, called the [[Moore–Penrose inverse|pseudoinverse]] of ''A'', by

:<math> \hat{\mathbf{x}} = (A^T\!A)^{-1} A^T \mathbf{b}, </math>

which minimizes the sum of the squares of the distances from the projected model locations to the corresponding image locations.
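Stacking matches row by row as above gives a minimal least-squares solve; this sketch uses NumPy's `lstsq`, which yields the same minimizer as the normal-equation/pseudoinverse formula but is numerically safer:

```python
import numpy as np

def fit_affine(model_pts, image_pts):
    """Least-squares affine transform taking model (x, y) to image (u, v).
    Each match contributes two rows; at least 3 matches are required."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(model_pts, image_pts):
        rows.append([x, y, 0.0, 0.0, 1.0, 0.0])
        rows.append([0.0, 0.0, x, y, 0.0, 1.0])
        rhs.extend([u, v])
    A, b = np.array(rows), np.array(rhs)
    # minimizes ||A p - b||^2, i.e. solves the normal equation A^T A p = A^T b
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    M = p[:4].reshape(2, 2)        # [[m1, m2], [m3, m4]]
    t = p[4:]                      # [tx, ty]
    return M, t
```

With exactly 3 non-collinear matches the system is determined and the residual is zero; with more matches the solution averages out measurement noise.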
=== Outlier detection ===

[[Outlier]]s can now be removed by checking for agreement between each image feature and the model, given the parameter solution. Given the [[linear least squares (mathematics)|linear least squares]] solution, each match is required to agree within half the error range that was used for the parameters in the [[Hough transform]] bins. As outliers are discarded, the linear least squares solution is re-solved with the remaining points, and the process is iterated. If fewer than 3 points remain after discarding [[outlier]]s, the match is rejected. In addition, a top-down matching phase is used to add any further matches that agree with the projected model position, which may have been missed from the [[Hough transform]] bin due to the similarity transform approximation or other errors.

The final decision to accept or reject a model hypothesis is based on a detailed probabilistic model.<ref name="Lowe2001" /> This method first computes the expected number of false matches to the model pose, given the projected size of the model, the number of features within the region, and the accuracy of the fit. A [[Bayesian probability]] analysis then gives the probability that the object is present based on the actual number of matching features found. A model is accepted if the final probability for a correct interpretation is greater than 0.98. Lowe's SIFT-based object recognition gives excellent results except under wide illumination variations and under non-rigid transformations.
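The iterative re-fitting loop above can be sketched as below; the fixed 2-pixel tolerance is an illustrative stand-in for "half the error range used for the Hough bins", and the top-down re-matching phase is omitted:

```python
import numpy as np

def refit_discarding_outliers(model_pts, image_pts, tol=2.0):
    """Alternate between a least-squares affine fit and discarding matches
    whose projected model location disagrees with the image location by
    more than tol, until the inlier set is stable. Returns inlier indices,
    or None once fewer than 3 matches remain."""
    model = np.asarray(model_pts, float)
    image = np.asarray(image_pts, float)
    idx = np.arange(len(model))
    while True:
        if len(idx) < 3:                       # too few matches for a solution
            return None
        P, Q = model[idx], image[idx]
        A = np.zeros((2 * len(idx), 6))
        A[0::2, 0:2] = P; A[0::2, 4] = 1.0     # rows [x, y, 0, 0, 1, 0]
        A[1::2, 2:4] = P; A[1::2, 5] = 1.0     # rows [0, 0, x, y, 0, 1]
        p, *_ = np.linalg.lstsq(A, Q.reshape(-1), rcond=None)
        M, t = p[:4].reshape(2, 2), p[4:]
        resid = np.linalg.norm(model[idx] @ M.T + t - image[idx], axis=1)
        keep = idx[resid <= tol]
        if len(keep) == len(idx):              # converged: no more outliers
            return idx
        idx = keep
```

Note that a plain least-squares fit tolerates only moderate outliers; in the full pipeline the Hough clustering step has already filtered grossly inconsistent matches before this loop runs.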