mirror of https://github.com/gnosygnu/xowa.git synced 2026-03-02 03:49:30 +00:00
gnosygnu
2020-10-19 09:44:07 -04:00
parent f78ca4456c
commit e70c900140
108 changed files with 4919 additions and 2611 deletions

@@ -422,7 +422,7 @@
</dl>
<ul>
<li>
<b>Requires separate post-processing generation step</b>: The wikitext dumps were automatically generated by downloading an XML dump. The HTML dumps requires another post-processing step that is not simple to run (See: <a href="/wiki/Dev/Command-line/Dumps" id="xolnki_13" title="Dev/Command-line/Dumps" class="xowa-visited">Dev/Command-line/Dumps</a>)
<b>Requires separate post-processing generation step</b>: The wikitext dumps were automatically generated by downloading an XML dump. The HTML dumps requires another post-processing step that is not simple to run (See: <a href="/wiki/Dev/Command-line/Dumps" id="xolnki_13" title="Dev/Command-line/Dumps">Dev/Command-line/Dumps</a>)
</li>
</ul>
<dl>
@@ -451,7 +451,7 @@
The new XOWA Search Engine uses PageRank to rate pages by importance. Although this works well for Wikipedia, it sometimes overrates pages which exist for encyclopedic book-keeping.
</p>
<p>
For example, a lot of Wikipedia pages will have a small box called "Authority Control" at the bottom of the page. This box will have links to other pages like <a href="https://en.wikipedia.org/wiki/Integrated_Authority_Control" rel="nofollow" class="external free">https://en.wikipedia.org/wiki/Integrated_Authority_Control</a> If a million pages have this Integrated Authority Control link, then PageRank rates this page highly. ("1 million pages link to it!") However, the page itself is fairly short, and is not really one of the most important articles in Wikipedia (it would score higher than India, Insect, Italy, etc).
For example, a lot of Wikipedia pages will have a small box called "Authority Control" at the bottom of the page. This box will have links to other pages like <a href="/site/en.wikipedia.org/wiki/Integrated_Authority_Control">https://en.wikipedia.org/wiki/Integrated_Authority_Control</a> If a million pages have this Integrated Authority Control link, then PageRank rates this page highly. ("1 million pages link to it!") However, the page itself is fairly short, and is not really one of the most important articles in Wikipedia (it would score higher than India, Insect, Italy, etc).
</p>
<p>
v3.6.3 tries to reduce the importance of these pages if these articles are "short". This heuristic was already present in the previous versions of the search engine, but has been further tweaked.