<a href="#Download_the_XOWA_search_databases_from_archive.org"><span class="tocnumber">3.1</span> <span class="toctext">Download the XOWA search databases from archive.org</span></a>
</li>
<li class="toclevel-2 tocsection-5">
<a href="#Use_page-length_instead_of_PageRank"><span class="tocnumber">3.2</span> <span class="toctext">Use page-length instead of PageRank</span></a>
</li>
<li class="toclevel-2 tocsection-6">
<a href="#Use_PageRank_but_limit_to_1_iteration"><span class="tocnumber">3.3</span> <span class="toctext">Use PageRank but limit to 1 iteration</span></a>
</li>
<li class="toclevel-2 tocsection-7">
<a href="#Use_PageRank_but_limit_to_1000_iteration"><span class="tocnumber">3.4</span> <span class="toctext">Use PageRank but limit to 1000 iterations</span></a>
XOWA stores this data in an <a href="/site/en.wikipedia.org/wiki/Inverted_index">inverted index</a>. In database terms, the data lives in two tables: search_word and search_link.
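The idea behind those two tables can be sketched as follows. This is an illustrative model only; XOWA's actual schema and tokenization rules may differ:

```python
# Sketch of an inverted index over page titles, mirroring the idea behind
# XOWA's search_word / search_link tables (names and structure here are
# illustrative, not XOWA's real schema).
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping page_id -> title text."""
    search_word = {}                 # word -> word_id  (like a search_word table)
    search_link = defaultdict(list)  # word_id -> page_ids  (like search_link)
    for page_id, title in pages.items():
        for word in title.lower().split():
            word_id = search_word.setdefault(word, len(search_word))
            search_link[word_id].append(page_id)
    return search_word, search_link

pages = {1: "Earth", 2: "History of Earth", 3: "History of France"}
words, links = build_index(pages)
hits = links[words["history"]]  # page_ids whose title contains "history"
```

Looking up a word then costs one dictionary probe plus a scan of its posting list, rather than a scan of every title.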
XOWA then downloads a list of pagelinks from Wikimedia's dump servers. For example, for the 2016-03 English Wikipedia, the link is <a href="http://dumps.wikimedia.org/enwiki/20160305/enwiki-20160305-pagelinks.sql.gz" rel="nofollow" class="external free">http://dumps.wikimedia.org/enwiki/20160305/enwiki-20160305-pagelinks.sql.gz</a>
XOWA then applies a series of calculations to compute a score for each page. For more info, see <a href="/wiki/Help/Features/Search/Score" id="xolnki_3" title="Help/Features/Search/Score">Help/Features/Search/Score</a>
Due to the nature of the PageRank algorithm, substantial additional time and disk space are needed. These requirements are especially dramatic for English Wikipedia:
</p>
<ul>
<li>
<b>125+ GB of hard disk space needed</b>: The pagelinks dump is 4.7 GB compressed (.gz), expands to 40 GB (.sql), and requires about 80 GB of scratch space (.sqlite3).
</li>
<li>
<b>8+ hours of processing time needed</b>: The PageRank algorithm is computationally expensive on three fronts:
<ul>
<li>
English Wikipedia has 16.3 million pages
</li>
<li>
These pages are interconnected by over 1 billion links
</li>
<li>
PageRank needs approximately 20 iterations to rank all pages completely.
Note: on a <a href="/wiki/Help/Admin/Environment/Machine" id="xolnki_4" title="Help/Admin/Environment/Machine">machine with a fast processor and an SSD</a>, this process takes <b>only</b> about 2 hours.
Monthly versions of English Wikipedia's search databases will be posted to <a href="https://archive.org/edit/Xowa_enwiki_latest" rel="nofollow" class="external free">https://archive.org/edit/Xowa_enwiki_latest</a>. You can simply download these databases (about 2 GB) and replace your copies.
XOWA can use page-length instead, skipping both the pagelinks download (125+ GB) and the PageRank running time (8+ hours). However, the resulting scores are less accurate than PageRank's. Specifically, long pages like "List of ..." will receive a high page score.
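The bias described above is easy to see in a minimal sketch of length-based scoring. The log damping and the byte counts below are assumptions for illustration, not XOWA's actual formula or data:

```python
# Illustrative page-length scoring: a cheap stand-in for PageRank that needs
# no link graph at all. The log() damping is an assumption for this sketch,
# not XOWA's actual scoring formula.
import math

def length_score(page_len_bytes):
    # Logarithmic damping keeps very long pages from dominating completely,
    # but long "List of ..." pages still outrank ordinary articles.
    return math.log(1 + page_len_bytes)

# Hypothetical page lengths in bytes (made-up numbers for illustration)
pages = {"Earth": 120_000, "List of sovereign states": 350_000}
ranked = sorted(pages, key=lambda t: length_score(pages[t]), reverse=True)
```

Because the score depends only on length, the list article wins even though "Earth" is the page most searchers want first.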
This option will still require a lot of disk space, but will limit the running time to a few hours. To use this option, do the same as above, but change "PageRank iteration count" to 1.
This option will create the full version of PageRank search indexes. To use this option, do the same as above, but change "PageRank iteration count" to 1000.
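The trade-off between 1 iteration and 1000 iterations can be seen in a standard power-iteration sketch of PageRank. This mirrors the textbook algorithm with a capped iteration count, like the "PageRank iteration count" setting above; XOWA's implementation details may differ:

```python
# Textbook PageRank via power iteration, with a capped iteration count.
# One iteration gives a rough ranking; more iterations converge toward the
# stationary distribution (in practice ~20 suffice, per the text above).
def pagerank(links, damping=0.85, max_iters=20, tol=1e-6):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        # Stop early once the ranks stop changing meaningfully
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            rank = new
            break
        rank = new
    return rank

graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
scores = pagerank(graph, max_iters=1000)
```

With `max_iters=1`, the ranking reflects only direct in-links; with a high cap, the early-stop condition usually triggers long before the cap is reached.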
<b>Recreate through <a href="/wiki/Dashboard/Wiki_maintenance" id="xolnki_6" title="Dashboard/Wiki maintenance">Dashboard/Wiki_maintenance</a></b>: A search index is built when a wiki is first created. You can also recreate it at <a href="/wiki/Dashboard/Wiki_maintenance" id="xolnki_7" title="Dashboard/Wiki maintenance">Dashboard/Wiki_maintenance</a>
<li><a href="http://dumps.wikimedia.org/backup-index.html" title="Get wiki database dumps directly from Wikimedia">Wikimedia dumps</a></li>
<li><a href="https://archive.org/search.php?query=xowa" title="Search archive.org for XOWA files">XOWA @ archive.org</a></li>
<li><a href="http://en.wikipedia.org" title="Visit Wikipedia (and compare to XOWA!)">English Wikipedia</a></li>
</ul>
</div>
</div>
<div class="portal" id='xowa-portal-donate'>
<h3>Donate</h3>
<div class="body">
<ul>
<li><a href="https://archive.org/donate/index.php" title="Support archive.org!">archive.org</a></li><!-- listed first due to recent fire damage: http://blog.archive.org/2013/11/06/scanning-center-fire-please-help-rebuild/ -->