== Applications ==

=== Object recognition using SIFT features ===
Given SIFT's ability to find distinctive keypoints that are invariant to location, scale and rotation, and robust to [[affine transformation]]s (changes in [[Linear scale|scale]], [[rotation]], [[Shear mapping|shear]], and position) and changes in illumination, these keypoints are usable for object recognition. The steps are given below.

* First, SIFT features are obtained from the input image using the algorithm described above.
* These features are matched to the SIFT feature database obtained from the training images. The matching is done through a Euclidean-distance based nearest-neighbor approach. To increase robustness, matches are rejected for those keypoints for which the ratio of the nearest-neighbor distance to the second-nearest-neighbor distance is greater than 0.8. This discards many of the false matches arising from background clutter. Finally, to avoid the expensive search required for finding the Euclidean-distance-based nearest neighbor, an approximate algorithm called the best-bin-first algorithm is used.<ref name=Beis1997 /> This is a fast method for returning the nearest neighbor with high probability, and can give a speedup by a factor of 1000 while finding the nearest neighbor of interest 95% of the time. (A code sketch of this matching step is given at the end of this section.)
* Although the distance-ratio test described above discards many of the false matches arising from background clutter, matches belonging to different objects may remain. Therefore, to make object identification more robust, features that belong to the same object are clustered using the [[Hough transform]], and matches that are left out of the clustering process are rejected. The Hough transform identifies clusters of features that vote for the same object pose. When clusters of features are found to vote for the same pose of an object, the probability of the interpretation being correct is much higher than for any single feature. Each keypoint votes for the set of object poses that are consistent with the keypoint's location, scale, and orientation. ''Bins'' that accumulate at least 3 votes are identified as candidate object/pose matches.
* For each candidate cluster, a least-squares solution for the best estimated affine projection parameters relating the training image to the input image is obtained. If the projection of a keypoint through these parameters lies within half the error range that was used for the parameters in the Hough transform bins, the keypoint match is kept. If fewer than 3 points remain after discarding outliers for a bin, then the object match is rejected. The least-squares fitting is repeated until no more rejections take place. This works better for planar surface recognition than for 3D object recognition, since the affine model is no longer accurate for 3D objects.
* One study<ref name="Sirmacek2009" /> proposed an approach that uses SIFT descriptors for multiple object detection; the approach was tested on aerial and satellite images.

SIFT features can essentially be applied to any task that requires identification of matching locations between images. Work has been done on applications such as recognition of particular object categories in 2D images, 3D reconstruction, motion tracking and segmentation, robot localization, image panorama stitching and [[Epipolar geometry|epipolar]] calibration. Some of these are discussed in more detail below.
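The feature extraction and ratio-test matching steps above can be illustrated with a minimal sketch using the OpenCV implementation of SIFT. The image file names are placeholders, and a brute-force matcher stands in for the approximate best-bin-first search; only the 0.8 ratio threshold follows the description above.

<syntaxhighlight lang="python">
import cv2

# Training image and input (query) image in grayscale;
# the file names are placeholders for illustration only.
train_img = cv2.imread("training.png", cv2.IMREAD_GRAYSCALE)
query_img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute 128-dimensional SIFT descriptors.
sift = cv2.SIFT_create()
kp_train, des_train = sift.detectAndCompute(train_img, None)
kp_query, des_query = sift.detectAndCompute(query_img, None)

# Euclidean-distance nearest-neighbor matching (brute force here;
# the original system used the approximate best-bin-first search).
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn_matches = matcher.knnMatch(des_query, des_train, k=2)

# Distance-ratio test: keep a match only when the nearest neighbor is
# clearly closer than the second-nearest neighbor (ratio below 0.8).
good_matches = [m for m, n in knn_matches if m.distance < 0.8 * n.distance]
print(len(good_matches), "matches survive the ratio test")
</syntaxhighlight>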
=== Robot localization and mapping ===
In this application,<ref name="Se2001" /> a trinocular stereo system is used to determine 3D estimates for keypoint locations. Keypoints are used only when they appear in all 3 images with consistent disparities, resulting in very few outliers. As the robot moves, it localizes itself using feature matches to the existing 3D map, and then incrementally adds features to the map while updating their 3D positions using a [[Kalman filter]]. This provides a robust and accurate solution to the problem of robot localization in unknown environments. Recent 3D solvers exploit keypoint directions, an often disregarded but useful measurement available in SIFT, to solve trinocular geometry from three keypoints<ref name="SIFTOrientationTrifocal">{{cite arXiv |last1=Fabbri |first1=Ricardo |last2=Duff |first2=Timothy |last3=Fan |first3=Hongyi |last4=Regan |first4=Margaret |last5=de Pinho |first5=David |last6=Tsigaridas |first6=Elias |last7=Wampler |first7=Charles |last8=Hauenstein |first8=Jonathan |last9=Kimia |first9=Benjamin |last10=Leykin |first10=Anton |last11=Pajdla |first11=Tomas |date=23 Mar 2019 |title=Trifocal Relative Pose from Lines at Points and its Efficient Solution |eprint=1903.09755 |class=cs.CV }}</ref> and absolute pose from only two keypoints.<ref name="SIFTOrientationPose">{{cite book |last1=Fabbri |first1=Ricardo |last2=Giblin |first2=Peter |last3=Kimia |first3=Benjamin |title=Computer Vision – ECCV 2012 |chapter=Camera Pose Estimation Using First-Order Curve Differential Geometry |date=2012 |volume=7575 |pages=231–244 |url=https://rfabbri.github.io/stuff/fabbri-giblin-kimia-eccv2012-final-ext.pdf |doi=10.1007/978-3-642-33765-9_17 |series=Lecture Notes in Computer Science |isbn=978-3-642-33764-2 |s2cid=15402824 }}</ref> These orientation measurements reduce the number of required correspondences, further increasing robustness.

=== Panorama stitching ===
SIFT feature matching can be used in [[image stitching]] for fully automated [[panorama]] reconstruction from non-panoramic images. The SIFT features extracted from the input images are matched against each other to find the ''k'' nearest neighbors for each feature. These correspondences are then used to find ''m'' candidate matching images for each image. [[homography|Homographies]] between pairs of images are then computed using [[Random sample consensus|RANSAC]], and a probabilistic model is used for verification. Because there is no restriction on the input images, graph search is applied to find [[Connected component (graph theory)|connected components]] of image matches, such that each connected component will correspond to a panorama. Finally, for each connected component, [[bundle adjustment]] is performed to solve for joint camera parameters, and the panorama is rendered using [[multi-band blending]]. Because of the SIFT-inspired object recognition approach to panorama stitching, the resulting system is insensitive to the ordering, orientation, scale and illumination of the images. The input images can contain multiple panoramas and noise images (some of which may not even be part of the composite image), and panoramic sequences are recognized and rendered as output.<ref name="Brown2003" />
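The pairwise geometric verification step of panorama stitching (feature matching followed by a RANSAC homography fit) can be sketched as below with OpenCV; the file names and the reprojection threshold are placeholders, and the connected-component search, bundle adjustment and multi-band blending stages are omitted.

<syntaxhighlight lang="python">
import cv2
import numpy as np

# Two overlapping photographs (placeholder file names).
img_a = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

# Extract SIFT features from both images.
sift = cv2.SIFT_create()
kp_a, des_a = sift.detectAndCompute(img_a, None)
kp_b, des_b = sift.detectAndCompute(img_b, None)

# k-nearest-neighbor matching followed by the distance-ratio test.
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_a, des_b, k=2)
good = [m for m, n in matches if m.distance < 0.8 * n.distance]

# Fit a homography between the image pair with RANSAC; the resulting
# inlier count can feed the probabilistic verification of the pair.
src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print("homography inliers:", int(inlier_mask.sum()))
</syntaxhighlight>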
=== 3D scene modeling, recognition and tracking ===
This application uses SIFT features for [[3D single-object recognition|3D object recognition]] and [[3D modeling]] in the context of [[augmented reality]], in which synthetic objects with accurate pose are superimposed on real images. SIFT matching is done for a number of 2D images of a scene or object taken from different angles. This is used with [[bundle adjustment]] initialized from an [[essential matrix]] or [[trifocal tensor]] to build a sparse 3D model of the viewed scene and to simultaneously recover camera poses and [[Geometric camera calibration|calibration]] parameters. Then the position, orientation and size of the virtual object are defined relative to the coordinate frame of the recovered model. For online [[match moving]], SIFT features are again extracted from the current video frame and matched to the features already computed for the world model, resulting in a set of 2D-to-3D correspondences. These correspondences are then used to compute the current camera pose for the virtual projection and final rendering. A regularization technique is used to reduce the jitter in the virtual projection.<ref name="Gordon2006" /> SIFT directions have also been used to increase the robustness of this process.<ref name="SIFTOrientationTrifocal" /><ref name="SIFTOrientationPose" /> 3D extensions of SIFT have also been evaluated for [[true 3D]] object recognition and retrieval.<ref name=Flitton2010 /><ref name="flitton13interestpoint">{{cite journal| author=Flitton, G.T., Breckon, T.P., Megherbi, N.| title=A Comparison of 3D Interest Point Descriptors with Application to Airport Baggage Object Detection in Complex CT Imagery| journal=Pattern Recognition| volume=46| issue=9| pages=2420–2436| year=2013| doi=10.1016/j.patcog.2013.02.008| bibcode=2013PatRe..46.2420F| hdl=1826/15213| hdl-access=free}}</ref>
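The pose-from-correspondences step of match moving can be sketched as follows, assuming the 2D-to-3D matches and the camera intrinsics are already available. The points and intrinsics below are synthetic placeholders rather than output of the system described above, and a robust PnP solver stands in for the regularized pose computation.

<syntaxhighlight lang="python">
import cv2
import numpy as np

# Synthetic stand-ins for the reconstructed world model: 3D points
# projected through a known pose to create matching 2D frame points.
rng = np.random.default_rng(0)
object_points = rng.uniform(-1.0, 1.0, (20, 3))
true_rvec = np.array([0.1, -0.2, 0.05])
true_tvec = np.array([0.0, 0.0, 5.0])
camera_matrix = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])
image_points, _ = cv2.projectPoints(object_points, true_rvec, true_tvec,
                                    camera_matrix, None)

# Robustly estimate the camera pose from the 2D-to-3D correspondences;
# with real data the 2D points would come from SIFT matches between the
# current frame and the world model rather than a synthetic projection.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points,
                                             camera_matrix, None)
if ok:
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the recovered pose
    print("recovered rotation:\n", R)
    print("recovered translation:", tvec.ravel())
</syntaxhighlight>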
=== 3D SIFT-like descriptors for human action recognition ===
Extensions of the SIFT descriptor to 2+1-dimensional spatio-temporal data in the context of [[human action recognition]] in video sequences have been studied.<ref name="Flitton2010" /><ref name="Laptev2004" /><ref name="Laptev2007" /><ref name="Scovanner2007" /> The computation of local position-dependent histograms in the 2D SIFT algorithm is extended from two to three dimensions to describe SIFT features in a spatio-temporal domain. For application to human action recognition in a video sequence, sampling of the training videos is carried out either at spatio-temporal interest points or at randomly determined locations, times and scales. The spatio-temporal regions around these interest points are then described using the 3D SIFT descriptor. These descriptors are then clustered to form a spatio-temporal [[bag of words model]]. 3D SIFT descriptors extracted from the test videos are then matched against these ''words'' for human action classification. The authors report much better results with their 3D SIFT descriptor approach than with other approaches such as simple 2D SIFT descriptors and gradient magnitude.<ref name="Niebles2006" />

=== Analyzing the human brain in 3D magnetic resonance images ===
The feature-based [[Brain morphometry|morphometry]] (FBM) technique<ref name="Toews2010" /> uses extrema in a difference-of-Gaussian scale space to analyze and classify 3D [[magnetic resonance image]]s (MRIs) of the human brain. FBM models the image probabilistically as a collage of independent features, conditional on image geometry and group labels, e.g. healthy subjects and subjects with [[Alzheimer's disease]] (AD). Features are first extracted in individual images from a 4D difference-of-Gaussian scale space, then modeled in terms of their appearance, geometry and group co-occurrence statistics across a set of images. FBM was validated in the analysis of AD using a set of ~200 volumetric MRIs of the human brain, automatically identifying established indicators of AD in the brain and classifying mild AD in new images with a classification rate of 80%.<ref name=Toews2010 />
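The difference-of-Gaussian feature extraction underlying FBM can be sketched for a single 3D volume as follows; the volume, scale values and threshold are placeholders, and the full 4D search over space and scale, as well as the probabilistic appearance and geometry modeling, are omitted.

<syntaxhighlight lang="python">
import numpy as np
from scipy import ndimage

# Placeholder 3D volume standing in for a brain MRI scan.
volume = np.random.default_rng(0).random((64, 64, 64))

# Difference-of-Gaussian responses over a handful of scales.
sigmas = [1.0, 1.6, 2.56, 4.1]
blurred = [ndimage.gaussian_filter(volume, s) for s in sigmas]
dog = [b2 - b1 for b1, b2 in zip(blurred[:-1], blurred[1:])]

# Candidate features: voxels that are local maxima of the DoG response
# within their own scale level (a simplification of the full 4D search).
for level, d in enumerate(dog):
    is_peak = (d == ndimage.maximum_filter(d, size=3)) & (d > 0.005)
    print("scale level", level, "->", int(is_peak.sum()), "candidate extrema")
</syntaxhighlight>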