Distance matrix
== Information retrieval ==

=== Distance matrices using Gaussian mixture distance ===
*[https://www.researchgate.net/publication/220723359_Evaluation_of_Distance_Measures_Between_Gaussian_Mixture_Models_of_MFCCs]

The Gaussian mixture distance supports accurate [[nearest neighbor search]] for information retrieval. Under an established Gaussian finite mixture model for the distribution of the data in the database, the Gaussian mixture distance is formulated by minimizing the [[Kullback-Leibler divergence]] between the distribution of the retrieval data and the data in the database. In a comparison with the well-known [[Euclidean distance|Euclidean]] and [[Mahalanobis distance|Mahalanobis]] distances under a precision performance measure, experimental results indicate that the Gaussian mixture distance function is superior to the others for different types of test data.

Also worth noting in information retrieval is the [[Fish School Search]] algorithm, which uses distance matrices to model the collective behavior of fish schools. After a feeding operator updates the fishes' weights, the collective volitive movement either moves each fish relative to the school's barycenter <math>B(t)</math> by

Eq. A:
:<math> x_i(t+1)=x_{i}(t)- step_{vol}\, rand(0,1)\frac{x_{i}(t) - B(t)}{distance(x_{i}(t),B(t))}, </math>

or

Eq. B:
:<math> x_i(t+1)=x_{i}(t)+step_{vol}\, rand(0,1)\frac{x_{i}(t) - B(t)}{distance(x_{i}(t),B(t))}. </math>

Here <math>step_{vol}</math> defines the size of the maximum volume displacement performed with the distance matrix, specifically a [[Euclidean distance]] matrix.
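The volitive movement update above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference Fish School Search implementation: the function name, the weighted barycenter, and the `expand` flag (Eq. A contracts the school toward the barycenter, Eq. B expands it away) are assumptions chosen for clarity.

```python
import numpy as np

def volitive_move(positions, weights, step_vol, expand=False, rng=None):
    """Collective volitive movement of Fish School Search (sketch).

    positions: (n, d) array of fish positions x_i(t)
    weights:   (n,) array of fish weights (from the feeding operator)
    step_vol:  maximum volume displacement step (step_vol in Eq. A/B)
    expand:    False applies Eq. A (contract toward B), True applies Eq. B
    """
    rng = np.random.default_rng() if rng is None else rng
    # Barycenter B(t): weighted mean of the fish positions.
    B = np.average(positions, axis=0, weights=weights)
    # Euclidean distance of every fish to the barycenter.
    d = np.linalg.norm(positions - B, axis=1, keepdims=True)
    d = np.where(d == 0.0, 1e-12, d)  # guard against division by zero
    r = rng.uniform(0.0, 1.0, size=(positions.shape[0], 1))
    step = step_vol * r * (positions - B) / d
    # Eq. A: x - step (contraction); Eq. B: x + step (expansion).
    return positions + step if expand else positions - step
```

With equal weights the barycenter is the plain mean, and a contraction step strictly shortens each fish's distance to it.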
=== Evaluation of the similarity or dissimilarity of cosine similarity and distance matrices ===
[[File:SimilarityTOidistance.png|none|thumb|Conversion formula between cosine similarity and Euclidean distance]]
*[https://www.sciencedirect.com/science/article/pii/S0020025507002630]

The [[cosine similarity]] measure is perhaps the most frequently applied proximity measure in information retrieval; it measures the angle between documents in the search space. The sampling distribution of a mean is generated by repeated sampling from the same population and recording the sample means obtained; this forms a distribution of different means, which has its own mean and variance. For data that can be negative as well as positive, the [[null distribution]] for cosine similarity is the distribution of the [[dot product]] of two independent random unit vectors, which has mean zero and variance 1/n. [[Euclidean distance]], by contrast, is invariant to mean-correction.

=== Clustering documents ===
Implementing [[hierarchical clustering]] with distance-based metrics to organize and group similar documents requires a distance matrix. The distance matrix represents the degree of association between each pair of documents; it is used to form clusters of closely associated documents, which in turn support retrieval of the documents relevant to a user's query.

=== Isomap ===
[[Isomap]] uses distance matrices of [[geodesic distance]]s to compute lower-dimensional embeddings. This helps address collections of documents that reside in a space of very many dimensions and enables document clustering.
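The conversion between cosine similarity and Euclidean distance referred to by the figure above can be demonstrated directly: for vectors normalized to unit length, the squared Euclidean distance satisfies ||u − v||² = 2(1 − cos(u, v)). A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def euclidean_from_cosine(u, v):
    # For unit-normalized vectors: ||u - v||^2 = 2 * (1 - cos(u, v)),
    # since ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v = 2 - 2 u.v.
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    cos_sim = float(np.dot(u, v))
    # max() guards against tiny negative values from floating-point error.
    return np.sqrt(max(0.0, 2.0 * (1.0 - cos_sim)))
```

The converted value agrees with the Euclidean distance computed directly on the normalized vectors.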
=== Neighborhood Retrieval Visualizer (NeRV) ===
The Neighborhood Retrieval Visualizer (NeRV) is an algorithm for both unsupervised and supervised visualization that uses distance matrices to place similar data points near one another on a display. The distance matrix needed for unsupervised NeRV can be computed from fixed pairwise input distances. Supervised NeRV instead requires formulating a supervised distance metric with which the input distances are computed.
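The fixed pairwise input distances that unsupervised NeRV starts from are simply a Euclidean distance matrix over the data. A sketch of that computation (the function name is illustrative; NeRV itself then optimizes a low-dimensional embedding against this matrix, which is not shown):

```python
import numpy as np

def pairwise_distances(X):
    """Pairwise Euclidean distance matrix D[i, j] = ||x_i - x_j||
    over the rows of X, using the identity
    ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i . x_j."""
    sq = np.sum(X * X, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clamp tiny negative values caused by floating-point error.
    return np.sqrt(np.maximum(D2, 0.0))
```

The result is symmetric with a zero diagonal, as any distance matrix must be.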