== Overview ==
{{Technical|section|date=October 2010}}

For any object in an image, we can extract interest points that provide a "feature description" of the object. This description, extracted from a training image, can then be used to locate the object in a new (previously unseen) image containing other objects. For this to work reliably, the features must remain detectable even if the image is scaled, noisy, or differently illuminated. Such points usually lie in high-contrast regions of the image, such as object edges.

Another important characteristic of these features is that the relative positions between them in the original scene should not change from one image to another. For example, if only the four corners of a door were used as features, they would work regardless of the door's position; but if points in the frame were also used, recognition would fail when the door is opened or closed. Similarly, features located on articulated or flexible objects typically fail to match if the object's internal geometry changes between the two images being processed. In practice, SIFT detects and uses a much larger number of features from the images, which reduces the impact of such local variations on the overall matching error.

SIFT<ref name="patent" /> can robustly identify objects even among clutter and under partial occlusion, because the SIFT feature descriptor is invariant to [[Scaling (geometry)|uniform scaling]], [[Orientation (geometry)|orientation]], and illumination changes, and partially invariant to [[Affine transformation|affine distortion]].<ref name="Lowe1999" /> This section summarizes the original SIFT algorithm and mentions a few competing techniques for object recognition under clutter and partial occlusion.
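The scale-invariant detection step can be sketched as finding local extrema in a difference-of-Gaussians stack. This is a simplified illustration, not Lowe's full implementation: the octave structure is omitted, and the function name, scale spacing, and contrast threshold here are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(image, sigma=1.6, k=2 ** 0.5, num_scales=4, threshold=0.03):
    """Find candidate keypoints as local extrema of a difference-of-Gaussians stack."""
    img = image.astype(np.float64)
    # Blur at geometrically spaced scales and subtract adjacent levels.
    blurred = [gaussian_filter(img, sigma * k ** i) for i in range(num_scales)]
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(num_scales - 1)])
    # A pixel is a candidate if it is the maximum or minimum of its 3x3x3
    # neighbourhood (across space and scale) and its response is strong enough.
    maxima = (dog == maximum_filter(dog, size=3)) & (dog > threshold)
    minima = (dog == minimum_filter(dog, size=3)) & (dog < -threshold)
    scales, ys, xs = np.nonzero(maxima | minima)
    return list(zip(ys, xs, scales))

# Toy example: a bright blob should yield at least one extremum near its centre.
img = np.zeros((64, 64))
img[28:36, 28:36] = 1.0
keypoints = dog_extrema(img)
```

The full algorithm additionally refines each candidate with a sub-pixel fit and discards low-contrast and edge responses before computing descriptors.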
The SIFT descriptor is based on image measurements in terms of ''receptive fields''<ref name="KoeDoo87" /><ref name="KoeDoo92" /><ref name="Lin13BICY" /><ref name="Lin13-AdvImgPhy" /> over which ''local scale invariant reference frames''<ref name="Lin13PONE" /><ref name="Lin14CompVis" /> are established by ''local scale selection''.<ref name="Lin94Book" /><ref name="Lindeberg1998" /><ref name="Lin14CompVis" /> A general theoretical explanation is given in the Scholarpedia article on SIFT.<ref name="Lindeberg2012" />

{| class="wikitable"
! Problem !! Technique !! Advantage
|-
| Key localization / scale / rotation || [[Difference of Gaussians]] / [[Scale-space representation|scale-space pyramid]] / orientation assignment || Accuracy, stability, scale and rotational invariance
|-
| Geometric distortion || Blurring / resampling of local image orientation planes || Affine invariance
|-
| Indexing and matching || [[Nearest neighbor search|Nearest neighbor]] / [[Best Bin First]] search || Efficiency / speed
|-
| Cluster identification || [[Hough Transform]] voting || Reliable [[Pose (computer vision)|pose]] models
|-
| Model verification / outlier detection || [[Linear least squares]] || Better error tolerance with fewer matches
|-
| Hypothesis acceptance || [[Bayesian Probability]] analysis || Reliability
|}

=== Types of features ===
{{Unreferenced section|date=April 2022}}
The detection and description of local image features can help in object recognition. SIFT features are local, based on the appearance of the object at particular interest points, and invariant to image scale and rotation. They are also robust to changes in illumination, noise, and minor changes in viewpoint. In addition, they are highly distinctive, relatively easy to extract, and allow correct object identification with a low probability of mismatch.
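The linear-least-squares verification step listed in the table above can be sketched as fitting a 2D affine transform to putative point matches and flagging outliers by residual. This is an illustrative sketch, not the full verification procedure; the function names, point data, and outlier threshold are assumptions.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src points onto dst points.

    Each match contributes two linear equations in the six affine parameters
    (a, b, c, d, tx, ty); three matches suffice, more over-determine the fit.
    """
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    n = len(src)
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = src   # a*x + b*y + tx = x'
    A[0::2, 4] = 1.0
    A[1::2, 2:4] = src   # c*x + d*y + ty = y'
    A[1::2, 5] = 1.0
    params, *_ = np.linalg.lstsq(A, dst.reshape(-1), rcond=None)
    return params  # (a, b, c, d, tx, ty)

def apply_affine(params, pts):
    a, b, c, d, tx, ty = params
    pts = np.asarray(pts, float)
    return np.stack([a * pts[:, 0] + b * pts[:, 1] + tx,
                     c * pts[:, 0] + d * pts[:, 1] + ty], axis=1)

# Example: points related by a rotation and translation, plus one bad match.
rng = np.random.default_rng(0)
src = rng.uniform(0, 100, (8, 2))
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
dst = src @ R.T + np.array([5.0, -2.0])
dst[0] += 40.0  # simulated mismatch
params = fit_affine(src, dst)
residuals = np.linalg.norm(apply_affine(params, src) - dst, axis=1)
outliers = np.nonzero(residuals > 3 * np.median(residuals))[0]
```

The mismatched point produces the largest residual, which is how the verification stage discards matches inconsistent with the hypothesized pose.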
They are relatively easy to match against a (large) database of local features, but the high dimensionality can be an issue, so probabilistic algorithms such as [[k-d tree]]s with [[best bin first]] search are generally used. Object description by a set of SIFT features is also robust to partial occlusion: as few as three SIFT features from an object are enough to compute its location and pose. Recognition can be performed in close to real time, at least for small databases on modern computer hardware.{{Citation needed|date=August 2008}}
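The nearest-neighbour matching step can be sketched with a k-d tree and Lowe's distance-ratio test, which keeps a match only when the nearest database descriptor is clearly closer than the second nearest. The 0.8 ratio follows Lowe's paper; the toy 2-D "descriptors" (real SIFT descriptors are 128-D) and the function name are made up for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(query, database, ratio=0.8):
    """Match query descriptors to a database, keeping only distinctive matches."""
    tree = cKDTree(database)
    dist, idx = tree.query(query, k=2)  # two nearest neighbours per query
    matches = []
    for qi in range(len(query)):
        d1, d2 = dist[qi]
        if d1 < ratio * d2:  # ratio test: reject ambiguous matches
            matches.append((qi, idx[qi][0]))
    return matches

# Toy example with 2-D stand-ins for descriptors.
db = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
queries = np.array([[0.5, 0.0],   # clearly closest to db[0]
                    [5.0, 0.0]])  # equidistant from db[0] and db[1]: rejected
matches = match_descriptors(queries, db)
```

The first query is matched to database entry 0; the second is discarded because its two nearest neighbours are equally close, so the match is not distinctive.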