== Evaluation ==

=== Performance measures ===
Evaluation is important in assessing the effectiveness of recommendation algorithms. To measure the [[effectiveness]] of recommender systems and to compare different approaches, three types of [[evaluation]] are available: user studies, [[A/B testing|online evaluations (A/B tests)]], and offline evaluations.<ref name=":0" /> Commonly used metrics are the [[mean squared error]] and [[root mean squared error]], the latter having been used in the Netflix Prize. Information retrieval metrics such as [[precision and recall]] or [[Discounted Cumulative Gain|DCG]] are useful for assessing the quality of a recommendation method. Diversity, novelty, and coverage are also considered important aspects of evaluation.<ref>Lathia, N., Hailes, S., Capra, L., Amatriain, X.: [http://www.academia.edu/download/46585553/lathia_sigir10.pdf Temporal diversity in recommender systems]{{dead link|date=July 2022|bot=medic}}{{cbignore|bot=medic}}. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 210–217. ACM, New York</ref> However, many of the classic evaluation measures are highly criticized.<ref name="Turpin2001">{{cite book| author1=Turpin, Andrew H| author2=Hersh, William| chapter=Why batch and user evaluations do not give the same results| title=Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval| year=2001| pages=225–231}}</ref>

Evaluating the performance of a recommendation algorithm on a fixed test dataset will always be challenging, because it is impossible to accurately predict the reactions of real users to the recommendations. Hence any metric that computes the effectiveness of an algorithm on offline data will be imprecise.

User studies are rather small in scale: a few dozen or hundred users are presented with recommendations created by different recommendation approaches, and the users then judge which recommendations are best.

In A/B tests, recommendations are typically shown to thousands of users of a real product, and the system randomly assigns each user to one of at least two different recommendation approaches. The effectiveness is measured with implicit measures such as [[conversion rate]] or [[click-through rate]].

Offline evaluations are based on historic data, e.g. a dataset that contains information about how users previously rated movies.<ref>{{Cite web|url=https://grouplens.org/datasets/movielens/|title=MovieLens dataset|date=2013-09-06}}</ref> The effectiveness of a recommendation approach is then measured by how well it predicts the users' ratings in the dataset. While a rating is an explicit expression of whether a user liked a movie, such information is not available in all domains. For instance, in the domain of citation recommender systems, users typically do not rate a citation or recommended article. In such cases, offline evaluations may use implicit measures of effectiveness: for example, a citation recommender system may be considered effective if it recommends as many of the articles in a research article's reference list as possible.
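The offline metrics named above can be illustrated with a small, self-contained sketch. The following Python example is purely illustrative and is not taken from any of the cited works; the toy ratings, item names, and helper functions are hypothetical, and real offline evaluations would use large held-out datasets such as MovieLens.

<syntaxhighlight lang="python">
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and held-out ratings."""
    errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(errors) / len(errors))

def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k recommended items that the user actually liked."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / k

# Toy offline evaluation on hypothetical data.
predicted_ratings = [4.2, 3.1, 5.0, 2.4]   # ratings the recommender predicts
actual_ratings = [4.0, 3.5, 4.5, 2.0]      # ratings the users actually gave
print("RMSE:", round(rmse(predicted_ratings, actual_ratings), 3))

recommended = ["movie_a", "movie_b", "movie_c", "movie_d", "movie_e"]
liked = {"movie_a", "movie_c", "movie_f"}  # held-out items the user liked
print("precision@5:", precision_at_k(recommended, liked, k=5))
</syntaxhighlight>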
However, this kind of offline evaluation is viewed critically by many researchers.<ref name=":4">{{Cite journal|last1=Chen |first1=Hung-Hsuan|last2=Chung |first2=Chu-An|last3=Huang |first3= Hsin-Chien|last4=Tsui |first4=Wen|date=2017-09-01|title=Common Pitfalls in Training and Evaluating Recommender Systems|journal=ACM SIGKDD Explorations Newsletter|volume=19|pages=37–45|language=EN|doi=10.1145/3137597.3137601|s2cid=10651930}}</ref><ref>{{Cite book|title=User Modeling, Adaptation, and Personalization|url=https://archive.org/details/usermodelingadap00pero|url-access=limited|last1=Jannach|first1=Dietmar|last2=Lerche|first2=Lukas|last3=Gedikli|first3=Fatih|last4=Bonnin|first4=Geoffray|chapter=What Recommenders Recommend – an Analysis of Accuracy, Popularity, and Sales Diversity Effects |date=2013-06-10|publisher=Springer Berlin Heidelberg|isbn=978-3-642-38843-9|editor-last=Carberry|editor-first=Sandra|series=Lecture Notes in Computer Science|volume=7899 |pages=[https://archive.org/details/usermodelingadap00pero/page/n44 25]–37|language=en|doi=10.1007/978-3-642-38844-6_3|editor-last2=Weibelzahl|editor-first2=Stephan|editor-last3=Micarelli|editor-first3=Alessandro|editor-last4=Semeraro|editor-first4=Giovanni|citeseerx = 10.1.1.465.96}}</ref><ref name=":1">{{Cite book|last1=Turpin|first1=Andrew H.|last2=Hersh|first2=William|title=Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval |chapter=Why batch and user evaluations do not give the same results |date=2001-01-01|series=SIGIR '01|location=New York, NY, USA|publisher=ACM|pages=[https://archive.org/details/proceedingsof24t0000inte/page/225 225–231]|doi=10.1145/383952.383992|isbn=978-1-58113-331-8|citeseerx=10.1.1.165.5800|s2cid=18903114|chapter-url=https://archive.org/details/proceedingsof24t0000inte/page/225}}</ref><ref name=":0" /> For instance, it has been shown that results of offline evaluations have low correlation with results from user studies or A/B tests.<ref name=":1" /><ref>{{Cite book|title=Research and Advanced Technology for Digital Libraries|volume = 9316|last1=Langer|first1=Stefan|date=2015-09-14|publisher=Springer International Publishing|isbn=978-3-319-24591-1|editor-last=Kapidakis|editor-first=Sarantos|series=Lecture Notes in Computer Science|pages=153–168|language=en|doi=10.1007/978-3-319-24592-8_12|editor-last2=Mazurek|editor-first2=Cezary|editor-last3=Werla|editor-first3=Marcin|chapter = A Comparison of Offline Evaluations, Online Evaluations, and User Studies in the Context of Research-Paper Recommender Systems}}</ref> A dataset popular for offline evaluation has been shown to contain duplicate data and thus to lead to wrong conclusions in the evaluation of algorithms.<ref name="BasaranNtoutsi2017">{{cite book|last1=Basaran|first1=Daniel|title=Proceedings of the 2017 SIAM International Conference on Data Mining|last2=Ntoutsi|first2=Eirini|last3=Zimek|first3=Arthur|year=2017|pages=390–398|doi=10.1137/1.9781611974973.44|isbn=978-1-61197-497-3}}</ref> Often, the results of so-called offline evaluations do not correlate with actually assessed user satisfaction.<ref>{{Cite book|last1=Beel|first1=Joeran|last2=Genzmehr|first2=Marcel|last3=Langer|first3=Stefan|last4=Nürnberger|first4=Andreas|last5=Gipp|first5=Bela|title=Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation |chapter=A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation |date=2013-01-01|series=RepSys '13|location=New York, NY, USA|publisher=ACM|pages=7–14|doi=10.1145/2532508.2532511|isbn=978-1-4503-2465-6|citeseerx=10.1.1.1031.973|s2cid=8202591}}</ref> This is probably because offline training is highly biased toward the highly reachable items, and offline testing data is highly influenced by the outputs of the online recommendation module.<ref name=":4" /><ref name="cañamares2018">{{cite conference|author1=Cañamares, Rocío|author2=Castells, Pablo|title=Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems|conference=41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018)|location=Ann Arbor, Michigan, USA|date=July 2018|publisher=ACM|pages=415–424|doi=10.1145/3209978.3210014|url=http://ir.ii.uam.es/pubs/sigir2018.pdf|access-date=2021-03-05|archive-date=2021-04-14|archive-url=https://web.archive.org/web/20210414070127/http://ir.ii.uam.es/pubs/sigir2018.pdf}}</ref> Researchers have concluded that the results of offline evaluations should be viewed critically.<ref name="cañamares2020">{{cite journal|author1=Cañamares, Rocío|author2=Castells, Pablo|author3=Moffat, Alistair|date=March 2020|title=Offline Evaluation Options for Recommender Systems|journal=Information Retrieval|volume=23|issue=4|publisher=Springer|pages=387–410|doi=10.1007/s10791-020-09371-3|s2cid=213169978|url=http://ir.ii.uam.es/pubs/irj2020.pdf}}</ref>

=== Beyond accuracy ===
Typically, research on recommender systems is concerned with finding the most accurate recommendation algorithms. However, a number of other factors are also important.
* '''Diversity''' – Users tend to be more satisfied with recommendations when there is higher intra-list diversity, e.g. items from different artists (a sketch of one common way to compute intra-list diversity follows this list).<ref name="Ziegler2005">{{cite book| vauthors=Ziegler CN, McNee SM, Konstan JA, Lausen G| chapter=Improving recommendation lists through topic diversification| title=Proceedings of the 14th international conference on World Wide Web| year=2005| pages=22–32}}</ref><ref name="castells2015">{{cite book |last1=Castells|first1=Pablo|last2=Hurley|first2= Neil J.|last3=Vargas|first3= Saúl |editor1-last=Ricci|editor1-first=Francesco|editor2-last=Rokach|editor2-first=Lior|editor3-last=Shapira |editor3-first=Bracha |title=Recommender Systems Handbook|date=2015|publisher=Springer US|isbn=978-1-4899-7637-6|edition=2 |chapter=Novelty and Diversity in Recommender Systems|chapter-url = https://link.springer.com/chapter/10.1007/978-1-4899-7637-6_26|doi=10.1007/978-1-4899-7637-6_26|pages=881–918 }}</ref>
* '''Recommender persistence''' – In some situations it is more effective to re-show recommendations,<ref name="Beel2013e">{{cite book|author1=Joeran Beel |author2=Stefan Langer |author3=Marcel Genzmehr |author4=Andreas Nürnberger | chapter=Persistence in Recommender Systems: Giving the Same Recommendations to the Same Users Multiple Times| title=Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013)|date=September 2013| volume=8092| pages=390–394| publisher=Springer|editor1=Trond Aalberg |editor2=Milena Dobreva |editor3=Christos Papatheodorou |editor4=Giannis Tsakonas |editor5=Charles Farrugia | series=Lecture Notes of Computer Science (LNCS)| chapter-url=http://docear.org/papers/persistence_in_recommender_systems_--_giving_the_same_recommendations_to_the_same_users_multiple_times.pdf | access-date=1 November 2013}}</ref> or to let users re-rate items,<ref name="Cosley2003">{{cite book| author1=Cosley, D.| author2= Lam, S.K.| author3= Albert, I.| author4=Konstan, J.A.| author5= Riedl, J | chapter=Is seeing believing?: how recommender system interfaces affect users' opinions| title=Proceedings of the SIGCHI conference on Human factors in computing systems| year=2003| pages=585–592| s2cid=8307833|chapter-url=https://pdfs.semanticscholar.org/d7d5/47012091d11ba0b0bf4a6630c5689789c22e.pdf}}</ref> than to show new items. There are several reasons for this; for instance, users may ignore items when they are shown for the first time because they had no time to inspect the recommendations carefully.
* '''Privacy''' – Recommender systems usually have to deal with privacy concerns<ref name="Pu2012">{{cite journal| author1=Pu, P.| author2=Chen, L.| author3=Hu, R.| title=Evaluating recommender systems from the user's perspective: survey of the state of the art| journal=User Modeling and User-Adapted Interaction| year=2012| pages=1–39|url=http://doc.rero.ch/record/317166/files/11257_2011_Article_9115.pdf}}</ref> because users have to reveal sensitive information. Building [[user profiles]] using collaborative filtering can be problematic from a privacy point of view. Many European countries have a strong culture of [[information privacy|data privacy]], and every attempt to introduce any level of user [[Profiling (information science)|profiling]] can result in a negative customer response. Much research has been conducted on ongoing privacy issues in this space. The [[Netflix Prize]] is particularly notable for the detailed personal information released in its dataset. Ramakrishnan et al. have conducted an extensive overview of the trade-offs between personalization and privacy and found that the combination of weak ties (unexpected connections that provide serendipitous recommendations) and other data sources can be used to uncover the identities of users in an anonymized dataset.<ref name="privacyoverview">{{cite journal |author1 = Naren Ramakrishnan |author2 = Benjamin J. Keller |author3 = Batul J. Mirza |author4 = Ananth Y. Grama |author5 = George Karypis |journal = IEEE Internet Computing |title = Privacy risks in recommender systems |year = 2001 |volume = 5 |issue = 6 |url = https://archive.org/details/sigir2002proceed0000inte/page/54 |isbn = 978-1-58113-561-9 |publisher = [[IEEE Educational Activities Department]] |location = Piscataway, NJ |pages = [https://archive.org/details/sigir2002proceed0000inte/page/54 54–62] |doi = 10.1109/4236.968832 |citeseerx = 10.1.1.2.2932 |s2cid = 1977107 }}</ref>
* '''User demographics''' – Beel et al. found that user demographics may influence how satisfied users are with recommendations.<ref name="Beel2013f">{{cite book|author1=Joeran Beel |author2=Stefan Langer |author3=Andreas Nürnberger |author4=Marcel Genzmehr | chapter=The Impact of Demographics (Age and Gender) and Other User Characteristics on Evaluating Recommender Systems| title=Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013)|date=September 2013| pages=400–404| publisher=Springer|editor1=Trond Aalberg |editor2=Milena Dobreva |editor3=Christos Papatheodorou |editor4=Giannis Tsakonas |editor5=Charles Farrugia | chapter-url=http://docear.org/papers/the_impact_of_users'_demographics_(age_and_gender)_and_other_characteristics_on_evaluating_recommender_systems.pdf | access-date=1 November 2013}}</ref> In their paper they show that elderly users tend to be more interested in recommendations than younger users.
* '''Robustness''' – When users can participate in the recommender system, the issue of fraud must be addressed.<ref name="Konstan2012">{{cite journal| vauthors=Konstan JA, Riedl J| title=Recommender systems: from algorithms to user experience| journal=User Modeling and User-Adapted Interaction| volume=22| issue=1–2| year=2012| pages=1–23|url=https://link.springer.com/content/pdf/10.1007/s11257-011-9112-x.pdf| doi=10.1007/s11257-011-9112-x| s2cid=8996665| doi-access=free}}</ref>
* '''Serendipity''' – [[Serendipity]] is a measure of "how surprising the recommendations are".<ref name="Ricci2011">{{cite book| vauthors=Ricci F, Rokach L, Shapira B, Kantor BP| title=Recommender systems handbook| year=2011| pages=1–35| bibcode=2011rsh..book.....R}}</ref><ref name=castells2015/> For instance, a recommender system that recommends milk to a customer in a grocery store might be perfectly accurate, but it is not a good recommendation because milk is an obvious item for the customer to buy. "[Serendipity] serves two purposes: First, the chance that users lose interest because the choice set is too uniform decreases. Second, these items are needed for algorithms to learn and improve themselves".<ref>{{Cite journal|last1=Möller|first1=Judith|last2=Trilling|first2=Damian|last3=Helberger|first3=Natali|last4=van Es|first4=Bram|date=2018-07-03|title=Do not blame it on the algorithm: an empirical assessment of multiple recommender systems and their impact on content diversity|url=https://www.tandfonline.com/doi/full/10.1080/1369118X.2018.1444076|journal=Information, Communication & Society|language=en|volume=21|issue=7|pages=959–977|doi=10.1080/1369118X.2018.1444076|s2cid=149344712|issn=1369-118X|hdl=11245.1/4242e2e0-3beb-40a0-a6cb-d8947a13efb4|hdl-access=free}}</ref>
* '''Trust''' – A recommender system is of little value to a user if the user does not trust the system.<ref name="Montaner2002">{{cite book| last1=Montaner|first1= Miquel|last2= López|first2= Beatriz|last3= de la Rosa|first3= Josep Lluís| chapter=Developing trust in recommender agents| title=Proceedings of the first international joint conference on Autonomous agents and multiagent systems: part 1| year=2002| pages=304–305|chapter-url=https://www.researchgate.net/publication/221454720}}</ref> Trust can be built by explaining how the system generates recommendations and why it recommends an item.
* '''Labelling''' – User satisfaction with recommendations may be influenced by the labeling of the recommendations.<ref name="Beel2013a">{{cite conference|vauthors=Beel, Joeran, Langer, Stefan, Genzmehr, Marcel| chapter=Sponsored vs. Organic (Research Paper) Recommendations and the Impact of Labeling| title=Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013)|date=September 2013| pages=395–399| chapter-url=http://docear.org/papers/sponsored_vs._organic_(research_paper)_recommendations_and_the_impact_of_labeling.pdf|editor1=Trond Aalberg |editor2=Milena Dobreva |editor3=Christos Papatheodorou |editor4=Giannis Tsakonas |editor5=Charles Farrugia | access-date=2 December 2013}}</ref> For instance, in the cited study the [[click-through rate]] (CTR) for recommendations labeled as "Sponsored" was lower (CTR = 5.93%) than the CTR for identical recommendations labeled as "Organic" (CTR = 8.86%). Recommendations with no label performed best (CTR = 9.87%) in that study.
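Intra-list diversity, referenced in the Diversity point above, is commonly computed as the average pairwise dissimilarity between the items in a single recommendation list. The following Python sketch is illustrative only and is not drawn from the cited works; the genre tags and the choice of Jaccard distance as the dissimilarity measure are assumptions made for the example.

<syntaxhighlight lang="python">
from itertools import combinations

def jaccard_distance(a, b):
    """Dissimilarity between two items represented as sets of attributes (e.g. genres)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def intra_list_diversity(recommended, item_attributes):
    """Average pairwise dissimilarity over all pairs of items in the recommendation list."""
    pairs = list(combinations(recommended, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard_distance(item_attributes[i], item_attributes[j]) for i, j in pairs)
    return total / len(pairs)

# Hypothetical items described by genre tags.
item_attributes = {
    "song_1": {"rock", "indie"},
    "song_2": {"rock"},
    "song_3": {"jazz", "piano"},
}
print(intra_list_diversity(["song_1", "song_2", "song_3"], item_attributes))  # ~0.83
</syntaxhighlight>

A list whose items share most attributes scores close to 0, while a list of mutually dissimilar items scores close to 1; a list can then be re-ranked to trade a small amount of accuracy for higher diversity, as in the topic-diversification approach cited above.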
=== Reproducibility ===
Recommender systems are notoriously difficult to evaluate offline, with some researchers claiming that this has led to a [[reproducibility crisis]] in recommender systems publications. The topic of reproducibility seems to be a recurrent issue in some machine learning publication venues, but it does not have a considerable effect beyond the world of scientific publication. In the context of recommender systems, a 2019 paper surveyed a small number of hand-picked publications applying deep learning or neural methods to the top-k recommendation problem, published in top conferences (SIGIR, KDD, WWW, [[ACM Conference on Recommender Systems|RecSys]], IJCAI), and found that on average fewer than 40% of the articles could be reproduced by the authors of the survey, with as little as 14% in some conferences. The article considers a number of potential problems in today's research scholarship and suggests improved scientific practices in that area.<ref>{{cite journal |last1=Ferrari Dacrema |first1=Maurizio |last2=Boglio |first2=Simone |last3=Cremonesi |first3=Paolo |last4=Jannach |first4=Dietmar |title=A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research |journal=ACM Transactions on Information Systems |date=8 January 2021 |volume=39 |issue=2 |pages=1–49 |doi=10.1145/3434185 |url=https://dl.acm.org/doi/10.1145/3434185 |arxiv=1911.07698|hdl=11311/1164333 |s2cid=208138060 }}</ref><ref>{{cite book |last1=Ferrari Dacrema |first1=Maurizio |last2=Cremonesi |first2=Paolo |last3=Jannach |first3=Dietmar |title=Proceedings of the 13th ACM Conference on Recommender Systems |chapter=Are we really making much progress? A worrying analysis of recent neural recommendation approaches |series=RecSys '19 |date=2019 |pages=101–109 |doi=10.1145/3298689.3347058 |hdl=11311/1108996 |chapter-url=https://dl.acm.org/authorize?N684126 |access-date=16 October 2019 |publisher=ACM|arxiv=1907.06902 |isbn=978-1-4503-6243-6 |s2cid=196831663 }}</ref><ref>{{cite book |last1=Rendle |first1=Steffen |last2=Krichene |first2=Walid |last3=Zhang |first3=Li |last4=Anderson |first4=John |title=Fourteenth ACM Conference on Recommender Systems |chapter=Neural Collaborative Filtering vs. Matrix Factorization Revisited |date=22 September 2020 |pages=240–248 |doi=10.1145/3383313.3412488|arxiv=2005.09683 |isbn=978-1-4503-7583-2 |doi-access=free }}</ref> More recent work on benchmarking a set of the same methods came to qualitatively very different results,<ref>{{cite book|last1=Sun|first1=Zhu|last2=Yu|first2=Di|last3=Fang|first3=Hui|last4=Yang|first4=Jie|last5=Qu|first5=Xinghua|last6=Zhang|first6=Jie|last7=Geng|first7=Cong|title=Fourteenth ACM Conference on Recommender Systems |chapter=Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison |chapter-url=https://dl.acm.org/doi/10.1145/3383313.3412489|year=2020|pages=23–32|publisher=ACM|doi=10.1145/3383313.3412489|isbn=978-1-4503-7583-2|s2cid=221785064}}</ref> whereby neural methods were found to be among the best-performing methods.
Deep learning and neural methods for recommender systems have been used in the winning solutions of several recent recommender system challenges, such as WSDM<ref>{{cite journal|last1=Schifferer|first1=Benedikt|last2=Deotte|first2=Chris|last3=Puget|first3=Jean-François|last4=de Souza Pereira|first4=Gabriel|last5=Titericz|first5=Gilberto|last6=Liu|first6=Jiwei|last7=Ak|first7=Ronay|title=Using Deep Learning to Win the Booking.com WSDM WebTour21 Challenge on Sequential Recommendations|url=https://web.ec.tuwien.ac.at/webtour21/wp-content/uploads/2021/03/shifferer.pdf|journal=WSDM '21: ACM Conference on Web Search and Data Mining|publisher=ACM|access-date=April 3, 2021|archive-date=March 25, 2021|archive-url=https://web.archive.org/web/20210325063047/https://web.ec.tuwien.ac.at/webtour21/wp-content/uploads/2021/03/shifferer.pdf|url-status=dead}}</ref> and the [[RecSys Challenge]].<ref>{{cite book|last1=Volkovs|first1=Maksims|last2=Rai|first2=Himanshu|last3=Cheng|first3=Zhaoyue|last4=Wu|first4=Ga|last5=Lu|first5=Yichao|last6=Sanner|first6=Scott|title=Proceedings of the ACM Recommender Systems Challenge 2018 |chapter=Two-stage Model for Automatic Playlist Continuation at Scale |chapter-url=https://dl.acm.org/doi/10.1145/3267471.3267480|year=2018|pages=1–6|publisher=ACM|doi=10.1145/3267471.3267480|isbn=978-1-4503-6586-4|s2cid=52942462}}</ref> Moreover, neural and deep learning methods are widely used in industry, where they are extensively tested.<ref name="ntfx">Yves Raimond, Justin Basilico [https://www2.slideshare.net/moustaki/deep-learning-for-recommender-systems-86752234 Deep Learning for Recommender Systems], Deep Learning Re-Work SF Summit 2018</ref><ref name="yt"/><ref name="amzn"/>

The topic of reproducibility is not new in recommender systems. By 2011, [[Michael Ekstrand|Ekstrand]], [[Joseph A. Konstan|Konstan]], et al. criticized that "it is currently difficult to reproduce and extend recommender systems research results," and that evaluations are "not handled consistently".<ref>{{Cite book|last1=Ekstrand|first1=Michael D.|last2=Ludwig|first2=Michael|last3=Konstan|first3=Joseph A.|last4=Riedl|first4=John T.|title=Proceedings of the fifth ACM conference on Recommender systems |chapter=Rethinking the recommender research ecosystem |date=2011-01-01|series=RecSys '11|location=New York, NY, USA|publisher=ACM|pages=133–140|doi=10.1145/2043932.2043958|isbn=978-1-4503-0683-6|s2cid=2215419}}</ref> Konstan and Adomavicius conclude that "the Recommender Systems research community is facing a crisis where a significant number of papers present results that contribute little to collective knowledge [...] often because the research lacks the [...]
evaluation to be properly judged and, hence, to provide meaningful contributions."<ref>{{Cite book|last1=Konstan|first1=Joseph A.|last2=Adomavicius|first2=Gediminas|title=Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation |chapter=Toward identification and adoption of best practices in algorithmic recommender systems research |date=2013-01-01|series=RepSys '13|location=New York, NY, USA|publisher=ACM|pages=23–28|doi=10.1145/2532508.2532513|isbn=978-1-4503-2465-6|s2cid=333956}}</ref> As a consequence, much research about recommender systems can be considered not reproducible.<ref name=":2">{{Cite journal|last1=Breitinger|first1=Corinna|last2=Langer|first2=Stefan|last3=Lommatzsch|first3=Andreas|last4=Gipp|first4=Bela|date=2016-03-12|title=Towards reproducibility in recommender-systems research|journal=User Modeling and User-Adapted Interaction|language=en|volume=26|issue=1|pages=69–101|doi=10.1007/s11257-016-9174-x|s2cid=388764|issn=0924-1868|url=http://nbn-resolving.de/urn:nbn:de:bsz:352-0-324818}}</ref> Hence, operators of recommender systems find little guidance in the current research for answering the question of which recommendation approaches to use in a recommender system. [[Alan Said|Said]] and [[Alejandro Bellogín|Bellogín]] conducted a study of papers published in the field and benchmarked some of the most popular recommendation frameworks, finding large inconsistencies in results even when the same algorithms and data sets were used.<ref>{{Cite book|last1=Said|first1=Alan|last2=Bellogín|first2=Alejandro|title=Proceedings of the 8th ACM Conference on Recommender systems |chapter=Comparative recommender system evaluation |date=2014-10-01|series=RecSys '14|location=New York, NY, USA|publisher=ACM|pages=129–136|doi= 10.1145/2645710.2645746|isbn=978-1-4503-2668-1|hdl=10486/665450|s2cid=15665277}}</ref> Some researchers demonstrated that minor variations in the recommendation algorithms or scenarios led to strong changes in the effectiveness of a recommender system. They conclude that seven actions are necessary to improve the current situation:<ref name=":2" /> "(1) survey other research fields and learn from them, (2) find a common understanding of reproducibility, (3) identify and understand the determinants that affect reproducibility, (4) conduct more comprehensive experiments (5) modernize publication practices, (6) foster the development and use of recommendation frameworks, and (7) establish best-practice guidelines for recommender-systems research."