=== Performance measures ===
Evaluation is important in assessing the effectiveness of recommendation algorithms. To measure the [[effectiveness]] of recommender systems, and to compare different approaches, three types of [[evaluation]]s are available: user studies, [[A/B testing|online evaluations (A/B tests)]], and offline evaluations.<ref name=":0" /> Commonly used metrics are the [[mean squared error]] and [[root mean squared error]], the latter having been used in the Netflix Prize. Information retrieval metrics such as [[precision and recall]] or [[Discounted Cumulative Gain|DCG]] are useful for assessing the quality of a recommendation method. Diversity, novelty, and coverage are also considered important aspects of evaluation.<ref>Lathia, N., Hailes, S., Capra, L., Amatriain, X.: [http://www.academia.edu/download/46585553/lathia_sigir10.pdf Temporal diversity in recommender systems]{{dead link|date=July 2022|bot=medic}}{{cbignore|bot=medic}}. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 210–217. ACM, New York</ref> However, many of the classic evaluation measures have been heavily criticized.<ref name="Turpin2001">{{cite book| author1=Turpin, Andrew H| author2=Hersh, William| chapter=Why batch and user evaluations do not give the same results| title=Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval| year=2001| pages=225–231}}</ref> Evaluating the performance of a recommendation algorithm on a fixed test dataset is inherently challenging, as it is impossible to accurately predict the reactions of real users to the recommendations. Hence any metric that computes the effectiveness of an algorithm on offline data will be imprecise.

User studies are comparatively small in scale: a few dozen or a few hundred users are presented with recommendations created by different recommendation approaches, and the users then judge which recommendations are best.

In A/B tests, recommendations are typically shown to thousands of users of a real product, and users are randomly assigned to at least two different recommendation approaches. Effectiveness is measured with implicit measures such as [[conversion rate]] or [[click-through rate]].

Offline evaluations are based on historic data, e.g. a dataset that contains information about how users previously rated movies.<ref>{{Cite web|url=https://grouplens.org/datasets/movielens/|title=MovieLens dataset|date=2013-09-06}}</ref> The effectiveness of a recommendation approach is then measured by how well it predicts the users' ratings in the dataset. While a rating is an explicit expression of whether a user liked a movie, such information is not available in all domains. For instance, in the domain of citation recommender systems, users typically do not rate a citation or a recommended article. In such cases, offline evaluations may use implicit measures of effectiveness; for instance, a recommender system may be considered effective if it recommends as many as possible of the articles contained in a research article's reference list.
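The offline setup described above can be illustrated with a minimal sketch that computes the root mean squared error for rating prediction and precision at ''k'' for top-''k'' recommendation. The test-set layout and the <code>predict</code> and <code>recommend</code> functions are assumptions made for illustration and are not taken from any particular library.

<syntaxhighlight lang="python">
# Minimal sketch of an offline evaluation. Assumes a held-out test set of
# (user, item, rating) tuples and two hypothetical callables supplied by the
# recommender under evaluation: predict(user, item) -> float and
# recommend(user, k) -> list of k item ids.
import math
from collections import defaultdict

def rmse(test_ratings, predict):
    """Root mean squared error between predicted and actual ratings."""
    errors = [(predict(u, i) - r) ** 2 for u, i, r in test_ratings]
    return math.sqrt(sum(errors) / len(errors))

def precision_at_k(test_ratings, recommend, k=10, like_threshold=4.0):
    """Average fraction of the top-k recommendations the user actually liked."""
    liked = defaultdict(set)
    for u, i, r in test_ratings:
        if r >= like_threshold:          # treat high ratings as "relevant"
            liked[u].add(i)
    precisions = []
    for u, relevant in liked.items():
        top_k = recommend(u, k)
        precisions.append(sum(1 for i in top_k if i in relevant) / k)
    return sum(precisions) / max(len(precisions), 1)
</syntaxhighlight>

Treating ratings at or above a threshold as "relevant" is a common simplification when converting explicit ratings into relevance judgments; in purely implicit-feedback domains such as citation recommendation, relevance would instead come from, for example, reference-list membership as described above.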
However, this kind of offline evaluation is viewed critically by many researchers.<ref name=":4">{{Cite journal|last1=Chen |first1=Hung-Hsuan|last2=Chung |first2=Chu-An|last3=Huang |first3= Hsin-Chien|last4=Tsui |first4=Wen|date=2017-09-01|title=Common Pitfalls in Training and Evaluating Recommender Systems|journal=ACM SIGKDD Explorations Newsletter|volume=19|pages=37–45|language=EN|doi=10.1145/3137597.3137601|s2cid=10651930}}</ref><ref>{{Cite book|title=User Modeling, Adaptation, and Personalization|url=https://archive.org/details/usermodelingadap00pero|url-access=limited|last1=Jannach|first1=Dietmar|last2=Lerche|first2=Lukas|last3=Gedikli|first3=Fatih|last4=Bonnin|first4=Geoffray|chapter=What Recommenders Recommend – an Analysis of Accuracy, Popularity, and Sales Diversity Effects |date=2013-06-10|publisher=Springer Berlin Heidelberg|isbn=978-3-642-38843-9|editor-last=Carberry|editor-first=Sandra|series=Lecture Notes in Computer Science|volume=7899 |pages=[https://archive.org/details/usermodelingadap00pero/page/n44 25]–37|language=en|doi=10.1007/978-3-642-38844-6_3|editor-last2=Weibelzahl|editor-first2=Stephan|editor-last3=Micarelli|editor-first3=Alessandro|editor-last4=Semeraro|editor-first4=Giovanni|citeseerx = 10.1.1.465.96}}</ref><ref name=":1">{{Cite book|last1=Turpin|first1=Andrew H.|last2=Hersh|first2=William|title=Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval |chapter=Why batch and user evaluations do not give the same results |date=2001-01-01|series=SIGIR '01|location=New York, NY, USA|publisher=ACM|pages=[https://archive.org/details/proceedingsof24t0000inte/page/225 225–231]|doi=10.1145/383952.383992|isbn=978-1-58113-331-8|citeseerx=10.1.1.165.5800|s2cid=18903114|chapter-url=https://archive.org/details/proceedingsof24t0000inte/page/225}}</ref><ref name=":0" /> For instance, it has been shown that the results of offline evaluations have low correlation with the results of user studies or A/B tests.<ref name=":1" /><ref>{{Cite book|title=Research and Advanced Technology for Digital Libraries|volume = 9316|last1=Langer|first1=Stefan|date=2015-09-14|publisher=Springer International Publishing|isbn=978-3-319-24591-1|editor-last=Kapidakis|editor-first=Sarantos|series=Lecture Notes in Computer Science|pages=153–168|language=en|doi=10.1007/978-3-319-24592-8_12|editor-last2=Mazurek|editor-first2=Cezary|editor-last3=Werla|editor-first3=Marcin|chapter = A Comparison of Offline Evaluations, Online Evaluations, and User Studies in the Context of Research-Paper Recommender Systems}}</ref> A dataset popular for offline evaluation has been shown to contain duplicate data and thus to lead to wrong conclusions in the evaluation of algorithms.<ref name="BasaranNtoutsi2017">{{cite book|last1=Basaran|first1=Daniel|title=Proceedings of the 2017 SIAM International Conference on Data Mining|last2=Ntoutsi|first2=Eirini|last3=Zimek|first3=Arthur|year=2017|pages=390–398|doi=10.1137/1.9781611974973.44|isbn=978-1-61197-497-3}}</ref> Often, the results of so-called offline evaluations do not correlate with actual user satisfaction.<ref>{{Cite book|last1=Beel|first1=Joeran|last2=Genzmehr|first2=Marcel|last3=Langer|first3=Stefan|last4=Nürnberger|first4=Andreas|last5=Gipp|first5=Bela|title=Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation |chapter=A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation |date=2013-01-01|series=RepSys '13|location=New York, NY, USA|publisher=ACM|pages=7–14|doi=10.1145/2532508.2532511|isbn=978-1-4503-2465-6|citeseerx=10.1.1.1031.973|s2cid=8202591}}</ref> This is probably because offline training is highly biased toward highly reachable items, and offline testing data is heavily influenced by the outputs of the online recommendation module.<ref name=":4" /><ref name="cañamares2018">{{cite conference|author1=Cañamares, Rocío|author2=Castells, Pablo|title=Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems|conference=41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018)|location=Ann Arbor, Michigan, USA|date=July 2018|publisher=ACM|pages=415–424|doi=10.1145/3209978.3210014|url=http://ir.ii.uam.es/pubs/sigir2018.pdf|access-date=2021-03-05|archive-date=2021-04-14|archive-url=https://web.archive.org/web/20210414070127/http://ir.ii.uam.es/pubs/sigir2018.pdf}}</ref> Researchers have concluded that the results of offline evaluations should be viewed critically.<ref name="cañamares2020">{{cite journal|author1=Cañamares, Rocío|author2=Castells, Pablo|author3=Moffat, Alistair|date=March 2020|title=Offline Evaluation Options for Recommender Systems|journal=Information Retrieval|volume=23|issue=4|publisher=Springer|pages=387–410|doi=10.1007/s10791-020-09371-3|s2cid=213169978|url=http://ir.ii.uam.es/pubs/irj2020.pdf}}</ref>
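The popularity bias noted above can also be made concrete with a sketch. A non-personalized baseline that simply recommends the globally most frequent items often scores well under offline metrics such as the ones sketched earlier, because held-out interactions concentrate on items that are already popular and highly reachable. The (user, item, rating) layout and the names below are assumptions for illustration and are not taken from the cited studies.

<syntaxhighlight lang="python">
# Sketch of a non-personalized popularity baseline (illustrative only).
# Comparing its offline score against a personalized method gives a rough
# sense of how much of that score popularity alone explains.
from collections import Counter

def popularity_recommender(train_ratings, k=10):
    """Return a recommend(user, k) function that ignores the user entirely."""
    counts = Counter(item for _, item, _ in train_ratings)
    top_items = [item for item, _ in counts.most_common(k)]
    return lambda user, k=k: top_items[:k]
</syntaxhighlight>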