slower: searches on small wikis finish in under a second, but each search on en.wikipedia.org can take an hour or more
</td>
<td>
fast: en.wikipedia.org searches can execute in less than a second.
</td>
</tr>
<tr>
<td>
disk space
</td>
<td>
no additional space is needed
</td>
<td>
additional space is needed. en.wikipedia.org will use at least 9 GB
</td>
</tr>
<tr>
<td>
syntax
</td>
<td>
uses the same syntax as title search. See <a href="http://xowa.org/home/wiki/App/Search.html" id="xolnki_2" title="App/Search">App/Search</a>
</td>
<td>
uses Lucene syntax. See <a href="https://lucene.apache.org/core/2_9_4/queryparsersyntax.html" rel="nofollow" class="external text">the Lucene search page</a> as well as below.
Options can be configured at <a href="http://xowa.org/home/wiki/Special:XowaCfg%3Fgrp%3Dxowa.addon.fulltext_search.html" id="xolnki_3" title="Special:XowaCfg?grp=xowa.addon.fulltext search">Special:XowaCfg?grp=xowa.addon.fulltext search</a>
</p>
<p>
In addition, the Special:XowaSearch page has a copy of the more frequently used options.
The best reference for Lucene syntax is probably <a href="https://lucene.apache.org/core/2_9_4/queryparsersyntax.html" rel="nofollow" class="external text">the Lucene search page</a>. The following is an edited version of that page.
Body is the HTML of a page without the markup. For example, <code>&lt;span title='some more words'&gt;word&lt;/span&gt;</code> is indexed as just <code>word</code>; <code>span</code>, <code>title</code>, <code>some</code>, <code>more</code>, and <code>words</code> are ignored.
</p>
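<p>
The idea of a body field that keeps text and drops markup can be sketched in a few lines. This is only an illustration using Python's standard <code>html.parser</code> module; the names <code>TextOnly</code> and <code>body_text</code> are hypothetical, and this is not how XOWA itself builds the field.
</p>

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes; tag names and attribute values are discarded."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for tags or attributes.
        self.parts.append(data)


def body_text(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(body_text("<span title='some more words'>word</span>"))
```

<p>
Only the text between the tags survives, so a search for <code>some</code> would not match this fragment.
</p>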
<p>
In addition, XOWA uses three other fields: page_id, title, and page_score. These are included for system purposes only.
Lucene supports fuzzy searches based on the Levenshtein distance (edit distance) algorithm. To do a fuzzy search, put the tilde symbol, "~", at the end of a single-word term. For example, to search for a term similar in spelling to "roam", use the fuzzy search: <code>roam~</code>
</p>
<p>
This search will find terms like <code>foam</code> and <code>roams</code>.
</p>
<p>
An additional (optional) parameter can specify the required similarity. The value is between 0 and 1; the closer it is to 1, the more similar a term must be in order to match. For example: <code>roam~0.8</code>
</p>
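<p>
To make the edit distance behind fuzzy search concrete, here is a minimal sketch of the classic Levenshtein algorithm. The function name <code>edit_distance</code> is hypothetical, and the way Lucene converts a distance into the 0 to 1 similarity value is not reproduced here.
</p>

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


for candidate in ("foam", "roams", "roam"):
    print(candidate, edit_distance("roam", candidate))
```

<p>
Both <code>foam</code> and <code>roams</code> are one edit away from <code>roam</code>, which is why a fuzzy search for <code>roam~</code> can find them.
</p>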
<p>
The default that is used if the parameter is not given is 0.5.
Lucene supports finding words that are within a specific distance of each other. To do a proximity search, put the tilde symbol, "~", at the end of a phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search: <code>"jakarta apache"~10</code>
Lucene provides the relevance level of matching documents based on the terms found. To boost a term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.
</p>
<p>
Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for
</p>
<p>
<code>jakarta apache</code>
</p>
<p>
and you want the term "jakarta" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:
</p>
<p>
<code>jakarta^4 apache</code>
</p>
<p>
This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in the example:
</p>
<p>
<code>"jakarta apache"^4 "Apache Lucene"</code>
</p>
<p>
By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).
Boolean operators allow terms to be combined through logic operators. Lucene supports AND, "+", OR, NOT and "-" as Boolean operators. (Note: Boolean operators must be ALL CAPS.)
The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exists in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word OR.
</p>
<p>
To search for documents that contain either "jakarta apache" or just "jakarta", use the query: <code>"jakarta apache" OR jakarta</code>
The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets. The symbol &amp;&amp; can be used in place of the word AND.
</p>
<p>
To search for documents that contain "jakarta apache" and "Apache Lucene", use the query: <code>"jakarta apache" AND "Apache Lucene"</code>
The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol ! can be used in place of the word NOT.
</p>
<p>
To search for documents that contain "jakarta apache" but not "Apache Lucene", use the query:
</p>
<p>
<code>"jakarta apache" NOT "Apache Lucene"</code>
</p>
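<p>
The set analogy used for OR, AND, and NOT can be made concrete with a toy example. The index below is invented purely for illustration (the page ids and the <code>docs</code> helper are hypothetical), and real Lucene also scores and ranks its results, which this sketch ignores.
</p>

```python
# Hypothetical toy index: phrase -> ids of the pages that contain it.
index = {
    "jakarta apache": {1, 2, 3},
    "Apache Lucene": {2, 3, 4},
    "jakarta": {1, 2, 5},
}


def docs(term):
    """Return the set of page ids containing the term (empty if unknown)."""
    return index.get(term, set())


# "jakarta apache" OR jakarta  -> union
either = docs("jakarta apache") | docs("jakarta")
# "jakarta apache" AND "Apache Lucene"  -> intersection
both = docs("jakarta apache") & docs("Apache Lucene")
# "jakarta apache" NOT "Apache Lucene"  -> difference
without = docs("jakarta apache") - docs("Apache Lucene")

print(sorted(either), sorted(both), sorted(without))
```

<p>
In this picture NOT is a difference, so it always needs a set of matching pages on its left side to subtract from.
</p>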
<p>
Note: The NOT operator cannot be used with just one term. For example, the following search will return no results: <code>NOT "jakarta apache"</code>
<li><a href="http://dumps.wikimedia.org/backup-index.html" title="Get wiki database dumps directly from Wikimedia">Wikimedia dumps</a></li>
<li><a href="https://archive.org/search.php?query=xowa" title="Search archive.org for XOWA files">XOWA @ archive.org</a></li>
<li><a href="http://en.wikipedia.org" title="Visit Wikipedia (and compare to XOWA!)">English Wikipedia</a></li>
</ul>
</div>
</div>
<div class="portal" id='xowa-portal-donate'>
<h3>Donate</h3>
<div class="body">
<ul>
<li><a href="https://archive.org/donate/index.php" title="Support archive.org!">archive.org</a></li><!-- listed first due to recent fire damages: http://blog.archive.org/2013/11/06/scanning-center-fire-please-help-rebuild/ -->