# Information Retrieval
Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources (as Wikipedia puts it).
The included documents are:

- 📜 Graph of Word and TW-IDF - Francois Rousseau & Michalis Vazirgiannis
A traditional IR system stores term-specific statistics (typically a term's frequency in each document, known as TF) in an index. Such a model ignores dependencies between terms and treats a document's terms as occurring independently of one another (hence the name bag-of-words model). In this paper the authors represent each document as a graph that encodes dependencies between terms, replacing the TF statistic with a new TW (term weight) statistic derived from that graph, and achieve significantly better results than popular existing models. The paper won an honorable mention at CIKM 2013.
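The graph-of-word idea can be sketched in a few lines of Python. This is a minimal, illustrative version, not the paper's implementation: it assumes a directed graph with an edge from each term to the terms that follow it within a fixed sliding window, and uses a term's indegree as its TW weight (the window size of 4 and the helper names are my own choices):

```python
from collections import defaultdict

def graph_of_word(terms, window=4):
    """Build a directed graph over a document's unique terms: each term
    points to the distinct terms that follow it within a sliding window."""
    edges = defaultdict(set)
    for i, source in enumerate(terms):
        for target in terms[i + 1 : i + window]:
            if target != source:
                edges[source].add(target)
    return edges

def tw(edges):
    """TW(t, d): the indegree of term t in the document's graph-of-word,
    replacing raw term frequency as the local weight."""
    indegree = defaultdict(int)
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1
    return dict(indegree)

doc = "information retrieval obtains information relevant to a need".split()
weights = tw(graph_of_word(doc))
```

Unlike TF, repeating a term only raises its weight if the repetitions bring new neighboring terms, which is the dependency signal the paper exploits.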
- 📜 Okapi System - Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford
This paper introduces the now-famous Okapi information retrieval framework and, with it, the BM25 ranking function for ranked retrieval. It is one of the first implementations of a probabilistic retrieval framework in the literature. BM25 is a bag-of-words retrieval function. Its IDF (inverse document frequency) term can be interpreted via information theory: if a query term q appears in n(q) documents out of a collection of D documents, the probability that a randomly picked document contains the term is p(q) = n(q) / D. The information content of that event, following Shannon, is -log p(q) = log(D / n(q)). Smoothing by adding a constant to both numerator and denominator yields the IDF term used in BM25. BM25 has been shown to be one of the best probabilistic weighting schemes. The paper was originally in PostScript form; the committer converted it to PDF with ps2pdf, per the Papers We Love guidelines.
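The derivation above can be made concrete with a short sketch. This is an illustrative rendering, not the paper's code: the IDF uses the classic Robertson-Sparck Jones smoothing (adding 0.5 to numerator and denominator), and the k1, b defaults and function names are my own assumptions:

```python
import math

def bm25_idf(n_q, D):
    """Smoothed IDF: log((D - n(q) + 0.5) / (n(q) + 0.5)).
    Without smoothing this is log(D / n(q)), the information content
    -log p(q) of seeing the term in a random document."""
    return math.log((D - n_q + 0.5) / (n_q + 0.5))

def bm25_term_score(tf, n_q, D, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """One query term's BM25 contribution: IDF times a saturated,
    length-normalized TF. k1 controls saturation, b controls how much
    document length normalization is applied (defaults are illustrative)."""
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return bm25_idf(n_q, D) * norm
```

A document's score for a query is the sum of these per-term contributions; rarer terms (small n(q)) carry more weight, and repeated occurrences within one document saturate rather than grow linearly.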