<!DOCTYPE html>
<html dir="ltr">
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
<title>Dev/Design/Data dump format - XOWA</title>
<link rel="shortcut icon" href="https://gnosygnu.github.io/xowa/xowa_logo.png" />
<link rel="stylesheet" href="https://gnosygnu.github.io/xowa/xowa_common.css" type="text/css">
</head>
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject skin-vector action-submit vector-animateLayout" spellcheck="false">
<div id="mw-page-base" class="noprint"></div>
<div id="mw-head-base" class="noprint"></div>
<div id="content" class="mw-body">
<h1 id="firstHeading" class="firstHeading"><span>Dev/Design/Data dump format</span></h1>
<div id="bodyContent" class="mw-body-content">
<div id="siteSub">From XOWA: the free, open-source, offline wiki application</div>
<div id="contentSub"></div>
<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">
<p>
The Wikimedia data dump files are released in compressed form, as either <a href="http://en.wikipedia.org/bzip2" rel="nofollow" class="external text">bzip2</a> or <a href="http://en.wikipedia.org/gzip" rel="nofollow" class="external text">gzip</a>. Prior to v0.5.2, XOWA required the files to be uncompressed before it could read them. As of v0.5.2, the user can choose to read directly from either the compressed file or the uncompressed file.
</p>
<div id="toc" class="toc">
<div id="toctitle" class="toctitle">
<h2>
Contents
</h2>
</div>
<ul>
<li class="toclevel-1 tocsection-1">
<a href="#bzip2:_disk_space_vs_speed"><span class="tocnumber">1</span> <span class="toctext">bzip2: disk space vs speed</span></a>
</li>
<li class="toclevel-1 tocsection-2">
<a href="#bzip2:_Application_install_(GUI)"><span class="tocnumber">2</span> <span class="toctext">bzip2: Application install (GUI)</span></a>
</li>
<li class="toclevel-1 tocsection-3">
<a href="#Command-line_install"><span class="tocnumber">3</span> <span class="toctext">Command-line install</span></a>
</li>
<li class="toclevel-1 tocsection-4">
<a href="#gzip"><span class="tocnumber">4</span> <span class="toctext">gzip</span></a>
</li>
<li class="toclevel-1 tocsection-5">
<a href="#References"><span class="tocnumber">5</span> <span class="toctext">References</span></a>
</li>
</ul>
</div>
<h2>
<span class="mw-headline" id="bzip2:_disk_space_vs_speed">bzip2: disk space vs speed</span>
</h2>
<p>
Currently, reading directly from a bzip2 file is much slower than unzipping it first and reading from the resulting .xml file.<sup id="cite_ref-0" class="reference"><a href="#cite_note-0">[1]</a></sup>
</p>
<p>
For example, using a 10 GB English Wikipedia dump file:
</p>
<ul>
<li>
<b>unzip</b> takes 120 minutes and 40 GB of extra disk space. This process consists of unzipping to .xml with 7-zip (40 min, 40 GB) and then importing the wiki (80 min).
</li>
<li>
<b>bzip2</b> takes 330 minutes and 0 GB of extra disk space. This process consists of reading directly from the .bz2 file (250 min, 0 GB) and importing the wiki (80 min).
</li>
</ul>
<p>
If you have the extra disk space, use the <b>unzip</b> route. If you are low on disk space, use the <b>bzip2</b> route instead.
</p>
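<p>
The two routes can be illustrated with a minimal Python sketch. This is not XOWA's code (XOWA is written in Java): the function names <code>read_direct</code> and <code>unzip_then_read</code> are hypothetical, and a tiny temporary file stands in for a real dump. It only shows where the extra disk space comes from, not the timings quoted above.
</p>
<pre>
```python
import bz2
import os
import shutil
import tempfile


def read_direct(bz2_path):
    """bzip2 route: stream lines straight from the .bz2 file; no extra file is written."""
    count = 0
    with bz2.open(bz2_path, "rt", encoding="utf-8") as src:
        for _ in src:
            count += 1
    return count


def unzip_then_read(bz2_path, xml_path):
    """unzip route: decompress to a separate .xml file first, then read that file."""
    with bz2.open(bz2_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    count = 0
    with open(xml_path, "rt", encoding="utf-8") as src:
        for _ in src:
            count += 1
    return count


def demo():
    """Build a small stand-in dump and run both routes on it."""
    work_dir = tempfile.mkdtemp()
    bz2_path = os.path.join(work_dir, "sample.xml.bz2")
    xml_path = os.path.join(work_dir, "sample.xml")
    with bz2.open(bz2_path, "wt", encoding="utf-8") as out:
        out.write("line one\nline two\nline three\n")
    direct = read_direct(bz2_path)
    unzipped = unzip_then_read(bz2_path, xml_path)
    # Size of the decompressed copy: the extra disk space the unzip route needs.
    extra_bytes = os.path.getsize(xml_path)
    shutil.rmtree(work_dir)
    return direct, unzipped, extra_bytes
```
</pre>
<p>
Both functions see the same lines; only the unzip route leaves a decompressed copy on disk.
</p>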
<h2>
<span class="mw-headline" id="bzip2:_Application_install_(GUI)">bzip2: Application install (GUI)</span>
</h2>
<p>
By default, the application install uses the <b>unzip</b> route.
</p>
<p>
To change it to the <b>bzip2</b> route:
</p>
<ul>
<li>
Go to <a href="/wiki/Options/Import" id="xolnki_2" title="Options/Import">Options/Import</a>
</li>
<li>
Change <b>Custom wiki commands</b> to <code>wiki.download,wiki.import</code>
</li>
</ul>
<dl>
<dd>
Note: the key step is to remove <code>wiki.unzip</code> after <code>wiki.download</code>
</dd>
</dl>
<h2>
<span class="mw-headline" id="Command-line_install">Command-line install</span>
</h2>
<p>
The <code>core_init</code> build step now has an extra property: <code>src_bz2_fil_</code>. A sample invocation:
</p>
<pre>
.add('simple.wikipedia.org', 'core.init').src_bz2_fil_('/home/download/simplewiki-latest-pages-articles.bz2').owner
</pre>
<p>
Note that XOWA can also auto-detect the appropriate file. For example, given the directory <code>/xowa/wiki/simple.wikipedia.org/</code>:
</p>
<ul>
<li>
If a .bz2 file is there, it will use it
</li>
<li>
If a .xml file is there, it will use it
</li>
<li>
If both a .bz2 file and a .xml file are there, it will use the .xml file, since reading the .xml file is faster
</li>
<li>
If neither is there, it will fail
</li>
</ul>
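<p>
The auto-detect rules above can be sketched in Python as follows. This is an illustration under assumptions, not XOWA's implementation: the function name <code>pick_dump_file</code>, the <code>dump.*</code> file names, and the choice of the first match in sorted order are all hypothetical.
</p>
<pre>
```python
import os
import tempfile


def pick_dump_file(wiki_dir):
    """Return the dump file to import from wiki_dir, preferring .xml over .bz2."""
    names = sorted(os.listdir(wiki_dir))
    xml_files = [n for n in names if n.endswith(".xml")]
    bz2_files = [n for n in names if n.endswith(".bz2")]
    if xml_files:
        # Covers ".xml only" and "both present": the .xml file is faster to read.
        return os.path.join(wiki_dir, xml_files[0])
    if bz2_files:
        return os.path.join(wiki_dir, bz2_files[0])
    # Neither file is present: fail.
    raise FileNotFoundError("no .xml or .bz2 dump file in " + wiki_dir)


def demo():
    """Apply the rule to three directories: .bz2 only, .xml only, and both."""
    results = []
    for present in (["dump.bz2"], ["dump.xml"], ["dump.bz2", "dump.xml"]):
        with tempfile.TemporaryDirectory() as wiki_dir:
            for name in present:
                open(os.path.join(wiki_dir, name), "w").close()
            results.append(os.path.basename(pick_dump_file(wiki_dir)))
    return results
```
</pre>
<p>
Checking for the .xml file first means the "both present" case needs no separate branch.
</p>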
<h2>
<span class="mw-headline" id="gzip">gzip</span>
</h2>
<p>
Currently, gzip is only used for the /category2/ system.
</p>
<ul>
<li>
For application setup, .gz is always used (there is no unzipping)
</li>
<li>
For CLI, either .gz or .sql can be used. Note that usage follows the same rules as described above for .bz2 / .xml.
</li>
</ul>
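<p>
As a rough Python illustration of reading either form in the CLI case, the sketch below opens a .gz file directly and falls back to plain text for a .sql file. The helper name <code>open_sql_dump</code> and the sample content are hypothetical and do not reflect XOWA's internals.
</p>
<pre>
```python
import gzip
import os
import tempfile


def open_sql_dump(path):
    """Open a dump as text, whether it is compressed (.gz) or plain (.sql)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def demo():
    """Write the same stand-in content as .sql.gz and .sql, then read both back."""
    text = "INSERT INTO example VALUES (1);\n"
    with tempfile.TemporaryDirectory() as work_dir:
        gz_path = os.path.join(work_dir, "sample.sql.gz")
        sql_path = os.path.join(work_dir, "sample.sql")
        with gzip.open(gz_path, "wt", encoding="utf-8") as out:
            out.write(text)
        with open(sql_path, "wt", encoding="utf-8") as out:
            out.write(text)
        results = []
        for path in (gz_path, sql_path):
            with open_sql_dump(path) as src:
                results.append(src.read())
    return results
```
</pre>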
<h2>
<span class="mw-headline" id="References">References</span>
</h2>
<ol class="references">
<li id="cite_note-0">
<span class="mw-cite-backlink"><a href="#cite_ref-0">^</a></span> <span class="reference-text">This seems to be a result of Java's lack of support for an unsigned byte data type, as well as the other performance advantages of a native C/C++ application (7-zip on Windows; bzip2 on Linux).</span>
</li>
</ol>
</div>
</div>
</div>
<div id="mw-head" class="noprint">
<div id="left-navigation">
<div id="p-namespaces" class="vectorTabs">
<h3>Namespaces</h3>
<ul>
<li id="ca-nstab-main" class="selected"><span><a id="ca-nstab-main-href" href="index.html">Page</a></span></li>
</ul>
</div>
</div>
</div>
<div id='mw-panel' class='noprint'>
<div id='p-logo'>
<a style="background-image: url(https://gnosygnu.github.io/xowa/xowa_logo.png);" href="http://xowa.org/" title="Visit the main page"></a>
</div>
<div class="portal" id='xowa-portal-home'>
<h3>XOWA</h3>
<div class="body">
<ul>
<li><a href="http://xowa.org/index.html" title='Visit the main page'>Main page</a></li>
<li><a href="http://xowa.org/screenshots.html" title='See screenshots of XOWA'>Screenshots</a></li>
<li><a href="https://www.youtube.com/watch?v=q0qbXYXEH6M" title="See a video of XOWA Desktop in action">Video</a></li>
<li><a href="http://xowa.org/home/wiki/Help/Download_XOWA.html" title='Download the XOWA application'>Download XOWA</a></li>
<li><a href="http://xowa.org/home/wiki/Dashboard/Image_databases.html" title='Download offline wikis and image databases'>Download wikis</a></li>
</ul>
</div>
</div>
<div class="portal" id='xowa-portal-started'>
<h3>Getting started</h3>
<div class="body">
<ul>
<li><a href="http://xowa.org/home/wiki/App/Setup/System_requirements.html" title='Get XOWA&apos;s system requirements'>Requirements</a></li>
<li><a href="http://xowa.org/home/wiki/App/Setup/Installation.html" title='Get instructions for installing XOWA'>Installation</a></li>
<li><a href="http://xowa.org/home/wiki/App/Import/Simple_Wikipedia.html" title='Learn how to set up Simple Wikipedia'>Simple Wikipedia</a></li>
<li><a href="http://xowa.org/home/wiki/App/Import/English_Wikipedia.html" title='Learn how to set up English Wikipedia'>English Wikipedia</a></li>
<li><a href="http://xowa.org/home/wiki/App/Import/Other_wikis.html" title='Learn how to set up other Wikipedias'>Other Wikipedias</a></li>
</ul>
</div>
</div>
<div class="portal" id='xowa-portal-android'>
<h3>Android</h3>
<div class="body">
<ul>
<li><a href="http://xowa.org/home/wiki/Android/Setup.html" title='Setup XOWA on your Android device'>Setup</a></li>
<li><a href="https://www.youtube.com/watch?v=jsMTBxGweUw" title="See a video of XOWA Android in action">Video</a></li>
</ul>
</div>
</div>
<div class="portal" id='xowa-portal-help'>
<h3>Help</h3>
<div class="body">
<ul>
<li><a href="http://xowa.org/home/wiki/Help/About.html" title='Get more information about XOWA'>About</a></li>
<li><a href="http://xowa.org/home/wiki/Help/Contents.html" title='View a list of help topics'>Contents</a></li>
<li><a href="http://xowa.org/home/wiki/Help/Media.html" title='Read what others have written about XOWA'>Media</a></li>
<li><a href="http://xowa.org/home/wiki/Help/Feedback.html" title='Questions? Comments? Leave feedback for XOWA'>Feedback</a></li>
</ul>
</div>
</div>
<div class="portal" id='xowa-portal-blog'>
<h3>Blog</h3>
<div class="body">
<ul>
<li><a href="http://xowa.org/home/wiki/Blog.html" title='Follow XOWA&apos;s development process'>Current</a></li>
</ul>
</div>
</div>
<div class="portal" id='xowa-portal-links'>
<h3>Links</h3>
<div class="body">
<ul>
<li><a href="http://dumps.wikimedia.org/backup-index.html" title="Get wiki database dumps directly from Wikimedia">Wikimedia dumps</a></li>
<li><a href="https://archive.org/search.php?query=xowa" title="Search archive.org for XOWA files">XOWA @ archive.org</a></li>
<li><a href="http://en.wikipedia.org" title="Visit Wikipedia (and compare to XOWA!)">English Wikipedia</a></li>
</ul>
</div>
</div>
<div class="portal" id='xowa-portal-donate'>
<h3>Donate</h3>
<div class="body">
<ul>
<li><a href="https://archive.org/donate/index.php" title="Support archive.org!">archive.org</a></li><!-- listed first due to recent fire damages: http://blog.archive.org/2013/11/06/scanning-center-fire-please-help-rebuild/ -->
<li><a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector" title="Support Wikipedia!">Wikipedia</a></li>
<li><a href="http://xowa.org/home/wiki/Help/Donate.html" title="Support XOWA!">XOWA</a></li>
</ul>
</div>
</div>
</div>
</body>
</html>