mirror of
https://github.com/gnosygnu/xowa.git
synced 2026-03-02 03:49:30 +00:00
'v3.4.2.6'
This commit is contained in:
538
Dev/Command-line/Dumps.html
Normal file
@@ -0,0 +1,538 @@
|
||||
<!DOCTYPE html>
|
||||
<html dir="ltr">
|
||||
<head>
|
||||
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
|
||||
<title>Dev/Command-line/Dumps - XOWA</title>
|
||||
<link rel="shortcut icon" href="http://xowa.org/xowa_logo.png" />
|
||||
<link rel="stylesheet" href="http://xowa.org/xowa_common.css" type="text/css">
|
||||
<style>
|
||||
.console {font-family: monospace; color: #EEEEEE ; background-color: black ; border: medium solid black;}
|
||||
.code
|
||||
,.path
|
||||
,.url {font-family: monospace; color: black ; background-color: #f9f9f9 ; border: medium solid #f9f9f9;}
|
||||
.bold {font-weight: 900;}
|
||||
</style>
|
||||
|
||||
|
||||
</head>
|
||||
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject skin-vector action-submit vector-animateLayout" spellcheck="false">
|
||||
<div id="mw-page-base" class="noprint"></div>
|
||||
<div id="mw-head-base" class="noprint"></div>
|
||||
<div id="content" class="mw-body">
|
||||
<h1 id="firstHeading" class="firstHeading"><span>Dev/Command-line/Dumps</span></h1>
|
||||
<div id="bodyContent" class="mw-body-content">
|
||||
<div id="siteSub">From XOWA: the free, open-source, offline wiki application</div>
|
||||
<div id="contentSub"></div>
|
||||
<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">
|
||||
|
||||
<p>
|
||||
XOWA can generate two types of dumps: file-dumps and html-dumps.
|
||||
</p>
|
||||
<table class="metadata plainlinks ambox ambox-delete" style="">
|
||||
<tr>
|
||||
<td class="mbox-empty-cell">
|
||||
</td>
|
||||
<td class="mbox-text" style="">
|
||||
<p>
|
||||
<span class="mbox-text-span">Please note that this script is for power users. It is not meant for casual users.</span>
|
||||
</p>
|
||||
<p>
|
||||
<span class="mbox-text-span">Please read through these instructions carefully. If you fail to follow these instructions, you may end up downloading millions of images by accident, and have your IP address banned by Wikimedia.</span>
|
||||
</p>
|
||||
<p>
|
||||
<span class="mbox-text-span">Also, the script will change in the future, and without any warning. There is no backward compatibility. Although the XOWA databases have a fixed format, the scripts do not. If you discover that your script breaks, please refer to this page, contact me for assistance, or go through the code.</span>
|
||||
</p>
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
<p>
|
||||
<br>
|
||||
</p>
|
||||
<div id="toc" class="toc">
|
||||
<div id="toctitle">
|
||||
<h2>
|
||||
Contents
|
||||
</h2>
|
||||
</div>
|
||||
<ul>
|
||||
<li class="toclevel-1 tocsection-1">
|
||||
<a href="#Overview"><span class="tocnumber">1</span> <span class="toctext">Overview</span></a>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-2">
|
||||
<a href="#Requirements"><span class="tocnumber">2</span> <span class="toctext">Requirements</span></a>
|
||||
<ul>
|
||||
<li class="toclevel-2 tocsection-3">
|
||||
<a href="#commons.wikimedia.org_.28thum"><span class="tocnumber">2.1</span> <span class="toctext">commons.wikimedia.org (thum</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-4">
|
||||
<a href="#www.wikidata.org"><span class="tocnumber">2.2</span> <span class="toctext">www.wikidata.org</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-5">
|
||||
<a href="#Hardware"><span class="tocnumber">2.3</span> <span class="toctext">Hardware</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-6">
|
||||
<a href="#Internet-connectivity_.28optional.29"><span class="tocnumber">2.4</span> <span class="toctext">Internet-connectivity (optional)</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-7">
|
||||
<a href="#Pre-existing_image_databases_for_your_wiki_.28optional.29"><span class="tocnumber">2.5</span> <span class="toctext">Pre-existing image databases for your wiki (optional)</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-8">
|
||||
<a href="#gfs"><span class="tocnumber">3</span> <span class="toctext">gfs</span></a>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-9">
|
||||
<a href="#Terms"><span class="tocnumber">4</span> <span class="toctext">Terms</span></a>
|
||||
<ul>
|
||||
<li class="toclevel-2 tocsection-10">
|
||||
<a href="#lnki"><span class="tocnumber">4.1</span> <span class="toctext">lnki</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-11">
|
||||
<a href="#orig"><span class="tocnumber">4.2</span> <span class="toctext">orig</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-12">
|
||||
<a href="#xfer"><span class="tocnumber">4.3</span> <span class="toctext">xfer</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-13">
|
||||
<a href="#fsdb"><span class="tocnumber">4.4</span> <span class="toctext">fsdb</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-14">
|
||||
<a href="#Script"><span class="tocnumber">5</span> <span class="toctext">Script</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Overview">Overview</span>
|
||||
</h2>
|
||||
<p>
|
||||
The download-thumbs script downloads all thumbs for pages in a specific wiki. It works in the following way:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
It loads a page.
|
||||
</li>
|
||||
<li>
|
||||
It converts the wikitext to HTML
|
||||
<ul>
|
||||
<li>
|
||||
If thumb mode is enabled, it compiles a list of [[File]] links.
|
||||
</li>
|
||||
<li>
|
||||
If HTML-dump mode is enabled, it saves the HTML into the XOWA html databases.
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>
|
||||
It repeats until there are no more pages
|
||||
</li>
|
||||
<li>
|
||||
If thumb mode is enabled, it performs the following additional steps:
|
||||
<ul>
|
||||
<li>
|
||||
It analyzes the list of [[File]] links to come up with a unique list of thumbs.
|
||||
</li>
|
||||
<li>
|
||||
It downloads the thumbs and creates the XOWA file databases.
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
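<p>
As a rough sketch, the steps above appear to map onto the following commands, taken from the script in the Script section. The mapping is inferred from the comments in that script and is not an authoritative description of the internals. The excerpt is abridged and only makes sense inside <code>app.bldr.cmds { }</code> after the wiki itself has been imported:
</p>
<pre class='code'>
// load each page, convert its wikitext to HTML, and gather the [[File]] links (lnkis)
add ('simple.wikipedia.org' , 'file.lnki_temp') {
ns_ids = '0|4|14';
// gen_hdump = 'y';   // uncomment to also save the HTML into the html databases (experimental)
}

// reduce the gathered links to a unique list of thumbs
add ('simple.wikipedia.org' , 'file.lnki_regy');
add ('simple.wikipedia.org' , 'file.orig_regy');
add ('simple.wikipedia.org' , 'file.xfer_temp.thumb');
add ('simple.wikipedia.org' , 'file.xfer_regy');

// download the thumbs and create the XOWA file databases
add ('simple.wikipedia.org' , 'file.fsdb_make') {src_bin_mgr__wmf_enabled = 'y';}
</pre>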
|
||||
<p>
|
||||
The script for simple wikipedia is listed below.
|
||||
</p>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Requirements">Requirements</span>
|
||||
</h2>
|
||||
<h3>
|
||||
<span class="mw-headline" id="commons.wikimedia.org_.28thum">commons.wikimedia.org (thum</span>
|
||||
</h3>
|
||||
<p>
|
||||
You will need the latest version of commons.wikimedia.org. Note that if you have an older version, you will have missing images or wrong size information.
|
||||
</p>
|
||||
<p>
|
||||
For example, if you have a commons.wikimedia.org from 2015-04-22 and are trying to import a 2015-05-17 English Wikipedia, then any new images added after 2015-04-22 will not be picked up.
|
||||
</p>
|
||||
<h3>
|
||||
<span class="mw-headline" id="www.wikidata.org">www.wikidata.org</span>
|
||||
</h3>
|
||||
<p>
|
||||
You also need to have the latest version of www.wikidata.org. Note that English Wikipedia and other wikis use Wikidata through the {{#property}} call or Module code. If you have an earlier version, then data will be missing or out of date.
|
||||
</p>
|
||||
<h3>
|
||||
<span class="mw-headline" id="Hardware">Hardware</span>
|
||||
</h3>
|
||||
<p>
|
||||
You should have a recent-generation machine with relatively high-performance hardware, especially if you're planning to generate images for English Wikipedia.
|
||||
</p>
|
||||
<p>
|
||||
For context, here is my current machine setup for generating the image dumps:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
Processor: Intel Core i7-4770K; 3.5 GHz with 8 MB L3 cache
|
||||
</li>
|
||||
<li>
|
||||
Memory: 16 GB DDR3 SDRAM, DDR3-1600 (PC3-12800)
|
||||
</li>
|
||||
<li>
|
||||
Hard Drive: 1TB 10,000 RPM 64MB Cache SATA 6.0Gb/s
|
||||
</li>
|
||||
<li>
|
||||
Operating System: openSUSE 13.2
|
||||
</li>
|
||||
</ul>
|
||||
<p>
|
||||
(Note: The hardware was assembled in late 2013 for about 1,600 US dollars.)
|
||||
</p>
|
||||
<p>
|
||||
For English Wikipedia, it still takes about 50 hours for the entire process.
|
||||
</p>
|
||||
<h3>
|
||||
<span class="mw-headline" id="Internet-connectivity_.28optional.29">Internet-connectivity (optional)</span>
|
||||
</h3>
|
||||
<p>
|
||||
You should have a broadband connection to the internet. The script will need to download dump files from Wikimedia, and some dump files (like those for English Wikipedia) will be tens of gigabytes in size.
|
||||
</p>
|
||||
<p>
|
||||
You can opt to download these files separately and place them in the appropriate location beforehand. However, the script below assumes that the machine is always online. If you are offline, you will need to comment out the "util.download" lines yourself.
|
||||
</p>
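<p>
For example, a minimal sketch of an offline run would disable each download line with a <code>//</code> comment. This assumes the dump files are already in the location XOWA expects; the lines themselves are copied from the script below:
</p>
<pre class='code'>
// downloads disabled: dump files were fetched manually beforehand
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'categorylinks';}
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'page_props';}
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'image';}
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pagelinks';}

// the import itself is left unchanged
add ('simple.wikipedia.org' , 'text.init');
</pre>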
|
||||
<h3>
|
||||
<span class="mw-headline" id="Pre-existing_image_databases_for_your_wiki_.28optional.29">Pre-existing image databases for your wiki (optional)</span>
|
||||
</h3>
|
||||
<p>
|
||||
XOWA will automatically re-use the images from existing image databases so that you do not have to redownload them. This is particularly useful for large wikis where redownloading millions of images would be unwanted.
|
||||
</p>
|
||||
<p>
|
||||
It is strongly advised that you download the image database for your wiki. You can find a full list here: <a href="http://xowa.sourceforge.net/image_dbs.html" rel="nofollow" class="external free">http://xowa.sourceforge.net/image_dbs.html</a>. Note that if an image database does not exist for your wiki, you can still use the script.
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
If you have v1 image databases, they should be placed in <code>/xowa/file/wiki_domain-prv</code>. For example, English Wikipedia should have <code>/xowa/file/en.wikipedia.org-prv/fsdb.main/fsdb.bin.0000.sqlite3</code>
|
||||
</li>
|
||||
<li>
|
||||
If you have v2 image databases, they should be placed in <code>/xowa/wiki/wiki_domain/prv</code>. For example, English Wikipedia should have <code>/xowa/wiki/en.wikipedia.org/prv/en.wikipedia.org-file-ns.000-db.001.xowa</code>
|
||||
</li>
|
||||
</ul>
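<p>
The <code>file.fsdb_make</code> step in the script below has a setting that tells XOWA which of the two formats the existing databases use. The following excerpt is a sketch only: the script on this page shows the value <code>'v1'</code>, and the value <code>'v2'</code> is an assumption based on the comment next to that setting. The other settings of the step are omitted here:
</p>
<pre class='code'>
add ('simple.wikipedia.org' , 'file.fsdb_make') {
// assumption: 'v2' selects the .xowa format; only 'v1' appears in the script on this page
src_bin_mgr__fsdb_version = 'v2';
}
</pre>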
|
||||
<h2>
|
||||
<span class="mw-headline" id="gfs">gfs</span>
|
||||
</h2>
|
||||
<p>
|
||||
The script is written in the <code>gfs</code> format. This is a custom scripting format specific to XOWA. It is similar to JSON, but also supports commenting.
|
||||
</p>
|
||||
<p>
|
||||
Unfortunately the error-handling for gfs is quite minimal. When making changes, please do them in small steps and be prepared to go to backups.
|
||||
</p>
|
||||
<p>
|
||||
The following is a brief list of rules:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
Comments are either single-line, running from "//" to the end of the line ("\n"), or multi-line, enclosed in "/*" and "*/". For example: <code>// single-line comment</code> or <code>/* multi-line comment */</code>
|
||||
</li>
|
||||
<li>
|
||||
Booleans are "y" and "n" (yes / no or true / false). For example: <code>enabled = 'y';</code>
|
||||
</li>
|
||||
<li>
|
||||
Numbers are 32-bit integers and are not enclosed in quotes. For example, <code>count = 10000;</code>
|
||||
</li>
|
||||
<li>
|
||||
Strings are surrounded by apostrophes (') or quotes ("). For example: <code>key = 'val';</code>
|
||||
</li>
|
||||
<li>
|
||||
Statements are terminated by a semi-colon (;). For example: <code>procedure1;</code>
|
||||
</li>
|
||||
<li>
|
||||
Statements can take arguments in parentheses. For example: <code>procedure1('argument1', 'argument2', 'argument3');</code>
|
||||
</li>
|
||||
<li>
|
||||
Statements are grouped with curly braces ({}). For example: <code>group {procedure1; procedure2; procedure3;}</code>
|
||||
</li>
|
||||
</ul>
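<p>
Taken together, the rules above look roughly like this in practice. The names (<code>group1</code>, <code>procedure1</code>, and so on) are placeholders for illustration and are not real XOWA commands:
</p>
<pre class='code'>
// single-line comment
/* multi-line
   comment */
group1 {
enabled = 'y';                          // boolean: 'y' or 'n'
count = 10000;                          // 32-bit integer, no quotes
key = "val";                            // string, in apostrophes or quotes
procedure1;                             // statement without arguments
procedure2('argument1', 'argument2');   // statement with arguments
}
</pre>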
|
||||
<h2>
|
||||
<span class="mw-headline" id="Terms">Terms</span>
|
||||
</h2>
|
||||
<h3>
|
||||
<span class="mw-headline" id="lnki">lnki</span>
|
||||
</h3>
|
||||
<p>
|
||||
A <code>lnki</code> is short for "<b>l</b>i<b>nk</b> <b>i</b>nternal". It refers to all wikitext with the double bracket syntax: [[A]]. A more elaborate example for files would be [[File:A.png|thumb|200x300px|upright=.80]]. Note that the abbreviation was chosen to differentiate it from <code>lnke</code>, which is short for "<b>l</b>i<b>nk</b> <b>e</b>xternal". For the purposes of the script, all lnki data comes from the current wiki's data dump.
|
||||
</p>
|
||||
<h3>
|
||||
<span class="mw-headline" id="orig">orig</span>
|
||||
</h3>
|
||||
<ul>
|
||||
<li>
|
||||
An <code>orig</code> is short for "<b>orig</b>inal file". It refers to the original file metadata. For the purposes of this script, all orig data comes from commons.wikimedia.org.
|
||||
</li>
|
||||
</ul>
|
||||
<h3>
|
||||
<span class="mw-headline" id="xfer">xfer</span>
|
||||
</h3>
|
||||
<ul>
|
||||
<li>
|
||||
An <code>xfer</code> is short for "transfer file". It refers to the actual file to be downloaded.
|
||||
</li>
|
||||
</ul>
|
||||
<h3>
|
||||
<span class="mw-headline" id="fsdb">fsdb</span>
|
||||
</h3>
|
||||
<ul>
|
||||
<li>
|
||||
The <code>fsdb</code> is short for "<b>f</b>ile <b>s</b>ystem <b>d</b>ata<b>b</b>ase". It refers to the internal table format of the XOWA image databases.
|
||||
</li>
|
||||
</ul>
|
||||
<p>
|
||||
<br>
|
||||
</p>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Script">Script</span>
|
||||
</h2>
|
||||
<pre class='code'>
|
||||
app.bldr.pause_at_end_('n');
|
||||
app.scripts.run_file_by_type('xowa_cfg_app');
|
||||
app.bldr.cmds {
|
||||
// build commons database; this only needs to be done once, whenever commons is updated
|
||||
add ('commons.wikimedia.org' , 'util.cleanup') {delete_all = 'y';}
|
||||
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'pages-articles';}
|
||||
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'categorylinks';}
|
||||
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'page_props';}
|
||||
add ('commons.wikimedia.org' , 'util.download') {dump_type = 'image';}
|
||||
add ('commons.wikimedia.org' , 'text.init');
|
||||
add ('commons.wikimedia.org' , 'text.page');
|
||||
add ('commons.wikimedia.org' , 'text.cat.core');
|
||||
add ('commons.wikimedia.org' , 'text.cat.link');
|
||||
add ('commons.wikimedia.org' , 'text.cat.hidden');
|
||||
add ('commons.wikimedia.org' , 'text.term');
|
||||
add ('commons.wikimedia.org' , 'text.css');
|
||||
add ('commons.wikimedia.org' , 'wiki.image');
|
||||
add ('commons.wikimedia.org' , 'file.page_regy') {build_commons = 'y';}
|
||||
add ('commons.wikimedia.org' , 'wiki.page_dump.make');
|
||||
add ('commons.wikimedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
|
||||
add ('commons.wikimedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
|
||||
|
||||
// build wikidata database; this only needs to be done once, whenever wikidata is updated
|
||||
add ('www.wikidata.org' , 'util.cleanup') {delete_all = 'y';}
|
||||
add ('www.wikidata.org' , 'util.download') {dump_type = 'pages-articles';}
|
||||
add ('www.wikidata.org' , 'util.download') {dump_type = 'categorylinks';}
|
||||
add ('www.wikidata.org' , 'util.download') {dump_type = 'page_props';}
|
||||
add ('www.wikidata.org' , 'util.download') {dump_type = 'image';}
|
||||
add ('www.wikidata.org' , 'text.init');
|
||||
add ('www.wikidata.org' , 'text.page');
|
||||
add ('www.wikidata.org' , 'text.cat.core');
|
||||
add ('www.wikidata.org' , 'text.cat.link');
|
||||
add ('www.wikidata.org' , 'text.cat.hidden');
|
||||
add ('www.wikidata.org' , 'text.term');
|
||||
add ('www.wikidata.org' , 'text.css');
|
||||
add ('www.wikidata.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
|
||||
|
||||
// build simple.wikipedia.org
|
||||
add ('simple.wikipedia.org' , 'util.cleanup') {delete_all = 'y';}
|
||||
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
|
||||
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'categorylinks';}
|
||||
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'page_props';}
|
||||
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'image';}
|
||||
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pagelinks';}
|
||||
add ('simple.wikipedia.org' , 'text.init');
|
||||
add ('simple.wikipedia.org' , 'text.page') {
|
||||
// calculate redirect_id for #REDIRECT pages. needed for html databases
|
||||
redirect_id_enabled = 'y';
|
||||
}
|
||||
add ('simple.wikipedia.org' , 'text.search');
|
||||
|
||||
// upload desktop css
|
||||
add ('simple.wikipedia.org' , 'text.css');
|
||||
|
||||
// upload mobile css
|
||||
add ('simple.wikipedia.org' , 'text.css') {css_key = 'xowa.mobile'; /* css_dir = 'C:\xowa\user\anonymous\wiki\simple.wikipedia.org-mobile\html\'; */}
|
||||
|
||||
add ('simple.wikipedia.org' , 'text.cat.core');
|
||||
add ('simple.wikipedia.org' , 'text.cat.link');
|
||||
add ('simple.wikipedia.org' , 'text.cat.hidden');
|
||||
add ('simple.wikipedia.org' , 'text.term');
|
||||
|
||||
// create local "page" tables in each "text" database for "lnki_temp"
|
||||
add ('simple.wikipedia.org' , 'wiki.page_dump.make');
|
||||
|
||||
// create a redirect table for pages in the File namespace
|
||||
add ('simple.wikipedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
|
||||
|
||||
// create an "image" table to get the metadata for all files in the current wiki
|
||||
add ('simple.wikipedia.org' , 'wiki.image');
|
||||
|
||||
// parse all page-to-page links
|
||||
add ('simple.wikipedia.org' , 'wiki.page_link');
|
||||
|
||||
// calculate a score for each page using the page-to-page links
|
||||
add ('simple.wikipedia.org' , 'search.page__page_score') {iteration_max = 100;}
|
||||
|
||||
// update link score statistics for the search tables
|
||||
add ('simple.wikipedia.org' , 'search.link__link_score') {page_rank_enabled = 'y';}
|
||||
|
||||
// update word count statistics for the search_word table
|
||||
add ('simple.wikipedia.org' , 'search.word__link_count');
|
||||
|
||||
// cleanup all downloaded files as well as temporary files
|
||||
add ('simple.wikipedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
|
||||
|
||||
// parse every page in the listed namespace and gather data on their lnkis.
|
||||
// this step will take the longest amount of time.
|
||||
add ('simple.wikipedia.org' , 'file.lnki_temp') {
|
||||
// save data every # of pages
|
||||
commit_interval = 10000;
|
||||
|
||||
// update progress every # of pages
|
||||
progress_interval = 50;
|
||||
|
||||
// free memory by flushing internal caches every # of pages
|
||||
cleanup_interval = 50;
|
||||
|
||||
// specify the amount of page text to read into memory at a time, in MB. For example, 25 means read approximately 25 MB of page text into memory
|
||||
select_size = 25;
|
||||
|
||||
// namespaces to parse. See en.wikipedia.org/wiki/Help:Namespaces
|
||||
ns_ids = '0|4|14';
|
||||
|
||||
// enable generation of ".html" databases. This is experimental and will increase processing time by 20% - 25%
|
||||
// gen_hdump = 'y';
|
||||
}
|
||||
|
||||
// aggregate the lnkis
|
||||
add ('simple.wikipedia.org' , 'file.lnki_regy');
|
||||
|
||||
// generate orig metadata for files in the current wiki (for example, for pages in en.wikipedia.org/wiki/File:*)
|
||||
add ('simple.wikipedia.org' , 'file.page_regy') {build_commons = 'n';}
|
||||
|
||||
// generate all orig metadata for all lnkis
|
||||
add ('simple.wikipedia.org' , 'file.orig_regy');
|
||||
|
||||
// generate list of files to download based on "orig_regy" and XOWA image code
|
||||
add ('simple.wikipedia.org' , 'file.xfer_temp.thumb');
|
||||
|
||||
// aggregate list one more time
|
||||
add ('simple.wikipedia.org' , 'file.xfer_regy');
|
||||
|
||||
// identify images that have already been downloaded
|
||||
add ('simple.wikipedia.org' , 'file.xfer_regy_update');
|
||||
|
||||
// download images. This step may also take a long time, depending on how many images are needed
|
||||
add ('simple.wikipedia.org' , 'file.fsdb_make') {
|
||||
commit_interval = 1000; progress_interval = 200; select_interval = 10000;
|
||||
ns_ids = '0|4|14';
|
||||
|
||||
// specify whether original wiki databases are v1 (.sqlite3) or v2 (.xowa)
|
||||
src_bin_mgr__fsdb_version = 'v1';
|
||||
|
||||
// always redownload certain files
|
||||
src_bin_mgr__fsdb_skip_wkrs = 'page_gt_1|small_size';
|
||||
|
||||
// allow downloads from wikimedia
|
||||
src_bin_mgr__wmf_enabled = 'y';
|
||||
}
|
||||
|
||||
// generate registry of original metadata by file title
|
||||
add ('simple.wikipedia.org' , 'file.orig_reg');
|
||||
|
||||
// drop page_dump tables
|
||||
add ('simple.wikipedia.org' , 'wiki.page_dump.drop');
|
||||
}
|
||||
app.bldr.run;
|
||||
</pre>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
<div id="mw-head" class="noprint">
|
||||
<div id="left-navigation">
|
||||
<div id="p-namespaces" class="vectorTabs">
|
||||
<h3>Namespaces</h3>
|
||||
<ul>
|
||||
<li id="ca-nstab-main" class="selected"><span><a id="ca-nstab-main-href" href="index.html">Page</a></span></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div id='mw-panel' class='noprint'>
|
||||
<div id='p-logo'>
|
||||
<a style="background-image: url(http://xowa.org/xowa_logo.png);" href="index.html" title="Visit the main page"></a>
|
||||
</div>
|
||||
<div class="portal" id='xowa-portal-home'>
|
||||
<h3>XOWA</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/index.html" title='Visit the main page'>Main page</a></li>
|
||||
<li><a href="http://xowa.org/screenshots.html" title='See screenshots of XOWA'>Screenshots</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Download_XOWA.html" title='Download the XOWA application'>Download XOWA</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Dashboard/Image_databases.html" title='Download offline wikis and image databases'>Download wikis</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-stargin'>
|
||||
<h3>Getting started</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Setup/System_requirements.html" title="Get XOWA's system requirements">Requirements</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Setup/Installation.html" title='Get instructions for installing XOWA'>Installation</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/Simple_Wikipedia.html" title='Learn how to set up Simple Wikipedia'>Simple Wikipedia</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/English_Wikipedia.html" title='Learn how to set up English Wikipedia'>English Wikipedia</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/Other_wikis.html" title='Learn how to set up other Wikipedias'>Other Wikipedias</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-help'>
|
||||
<h3>Help</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/About.html" title='Get more information about XOWA'>About</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Contents.html" title='View a list of help topics'>Contents</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Media.html" title='Read what others have written about XOWA'>Media</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Feedback.html" title='Questions? Comments? Leave feedback for XOWA'>Feedback</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-blog'>
|
||||
<h3>Blog</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Blog.html" title="Follow XOWA's development process">Current</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-links'>
|
||||
<h3>Links</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://dumps.wikimedia.org/backup-index.html" title="Get wiki database dumps directly from Wikimedia">Wikimedia dumps</a></li>
|
||||
<li><a href="https://archive.org/search.php?query=xowa" title="Search archive.org for XOWA files">XOWA @ archive.org</a></li>
|
||||
<li><a href="http://en.wikipedia.org" title="Visit Wikipedia (and compare to XOWA!)">English Wikipedia</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-donate'>
|
||||
<h3>Donate</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="https://archive.org/donate/index.php" title="Support archive.org!">archive.org</a></li><!-- listed first due to recent fire damages: http://blog.archive.org/2013/11/06/scanning-center-fire-please-help-rebuild/ -->
|
||||
<li><a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector" title="Support Wikipedia!">Wikipedia</a></li>
|
||||
<!-- <li><a href="" title="Support XOWA! (but only after you've supported archive.org and Wikipedia)">XOWA</a></li> -->
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
283
Dev/Command-line/Overview.html
Normal file
@@ -0,0 +1,283 @@
|
||||
<!DOCTYPE html>
|
||||
<html dir="ltr">
|
||||
<head>
|
||||
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
|
||||
<title>Dev/Command-line/Overview - XOWA</title>
|
||||
<link rel="shortcut icon" href="http://xowa.org/xowa_logo.png" />
|
||||
<link rel="stylesheet" href="http://xowa.org/xowa_common.css" type="text/css">
|
||||
<style>
|
||||
.console {font-family: monospace; color: #EEEEEE ; background-color: black ; border: medium solid black;}
|
||||
.code
|
||||
,.path
|
||||
,.url {font-family: monospace; color: black ; background-color: #f9f9f9 ; border: medium solid #f9f9f9;}
|
||||
.bold {font-weight: 900;}
|
||||
</style>
|
||||
|
||||
|
||||
</head>
|
||||
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject skin-vector action-submit vector-animateLayout" spellcheck="false">
|
||||
<div id="mw-page-base" class="noprint"></div>
|
||||
<div id="mw-head-base" class="noprint"></div>
|
||||
<div id="content" class="mw-body">
|
||||
<h1 id="firstHeading" class="firstHeading"><span>Dev/Command-line/Overview</span></h1>
|
||||
<div id="bodyContent" class="mw-body-content">
|
||||
<div id="siteSub">From XOWA: the free, open-source, offline wiki application</div>
|
||||
<div id="contentSub"></div>
|
||||
<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">
|
||||
|
||||
<p>
|
||||
XOWA can import a wiki using a plain text file and a command-line.
|
||||
</p>
|
||||
<div id="toc" class="toc">
|
||||
<div id="toctitle">
|
||||
<h2>
|
||||
Contents
|
||||
</h2>
|
||||
</div>
|
||||
<ul>
|
||||
<li class="toclevel-1 tocsection-1">
|
||||
<a href="#Import_simple.wikipedia.org_through_the_command-line"><span class="tocnumber">1</span> <span class="toctext">Import simple.wikipedia.org through the command-line</span></a>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-2">
|
||||
<a href="#Import_a_different_wiki_by_editing_the_build_script"><span class="tocnumber">2</span> <span class="toctext">Import a different wiki by editing the build script</span></a>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-3">
|
||||
<a href="#Import_a_wiki_with_a_manual_download"><span class="tocnumber">3</span> <span class="toctext">Import a wiki with a manual download</span></a>
|
||||
<ul>
|
||||
<li class="toclevel-2 tocsection-4">
|
||||
<a href="#Download_the_wiki_dump"><span class="tocnumber">3.1</span> <span class="toctext">Download the wiki dump</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-5">
|
||||
<a href="#Specify_location_of_the_wiki_dump"><span class="tocnumber">3.2</span> <span class="toctext">Specify location of the wiki dump</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-6">
|
||||
<a href="#Script"><span class="tocnumber">4</span> <span class="toctext">Script</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Import_simple.wikipedia.org_through_the_command-line">Import simple.wikipedia.org through the command-line</span>
|
||||
</h2>
|
||||
<ul>
|
||||
<li>
|
||||
Open up a command-line. For example, on Windows, run <span class='bold'>cmd</span>
|
||||
</li>
|
||||
<li>
|
||||
Run the following: <span class='console'>java -jar C:\000\200_dev\110_java\400_xowa\bin\ --cmd_file C:\xowa\xowa_build.gfs --app_mode cmd</span>
|
||||
</li>
|
||||
<li>
|
||||
Wait about 10 minutes for the script to complete
|
||||
</li>
|
||||
<li>
|
||||
Launch XOWA and enter <span class='url'>simple.wikipedia.org</span> in the URL bar
|
||||
</li>
|
||||
</ul>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Import_a_different_wiki_by_editing_the_build_script">Import a different wiki by editing the build script</span>
|
||||
</h2>
|
||||
<ul>
|
||||
<li>
|
||||
Open the following file in a <a href="http://xowa.org/wiki/home/page/Dev/Environment/Text_editor.html" id="xolnki_2" title="Dev/Environment/Text editor">text editor</a>: <span class='path'>C:\xowa\xowa_build.gfs</span>. See Script below for the full text.
|
||||
</li>
|
||||
<li>
|
||||
Replace all instances of <span class='bold'>simple.wikipedia.org</span> with the domain name of the wiki you want to import. For example, for English Wikipedia, use <span class='bold'>en.wikipedia.org</span>
|
||||
</li>
|
||||
<li>
|
||||
Run the command-line import again.
|
||||
</li>
|
||||
<li>
|
||||
Launch XOWA and enter the domain name in the URL bar.
|
||||
</li>
|
||||
</ul>
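<p>
For instance, after the replacement, the first download line of the script would read as follows for English Wikipedia. Only the domain changes; the command and its settings stay the same:
</p>
<pre class='code'>
// before: add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
add ('en.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
</pre>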
|
||||
<h2>
|
||||
<span class="mw-headline" id="Import_a_wiki_with_a_manual_download">Import a wiki with a manual download</span>
|
||||
</h2>
|
||||
<h3>
|
||||
<span class="mw-headline" id="Download_the_wiki_dump">Download the wiki dump</span>
|
||||
</h3>
|
||||
<ul>
|
||||
<li>
|
||||
Navigate to <a href="https://dumps.wikimedia.org/enwiki" rel="nofollow" class="external free">https://dumps.wikimedia.org/enwiki</a>
|
||||
</li>
|
||||
<li>
|
||||
Click on the <b>latest</b> directory
|
||||
</li>
|
||||
<li>
|
||||
Download the file just under "<b>Articles, templates, media/file descriptions, and primary meta-pages.</b>". It should read <b>enwiki-latest-pages-articles.xml.bz2</b>
|
||||
</li>
|
||||
</ul>
|
||||
<dl>
|
||||
<dd>
|
||||
The download is 11+ GB and may take anywhere between 2 and 5 hours to complete.
|
||||
</dd>
|
||||
<dd>
|
||||
If you also want talk pages, you should download the "<b>Recombine all pages, current versions only.</b>" version. It should read <b>enwiki-latest-pages-meta-current.xml.bz2</b>. Note that this dump is twice the size of the regular dump.
|
||||
</dd>
|
||||
</dl>
|
||||
<h3>
|
||||
<span class="mw-headline" id="Specify_location_of_the_wiki_dump">Specify location of the wiki dump</span>
|
||||
</h3>
|
||||
<ul>
|
||||
<li>
|
||||
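In the build script, replace the <span class='code'>text.init</span> line with the following, setting <span class='code'>src_bz2_fil</span> to the path of the dump you downloaded: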
|
||||
</li>
|
||||
</ul>
|
||||
<dl>
|
||||
<dd>
|
||||
<span class='code'>add ('simple.wikipedia.org', 'text.init') {src_bz2_fil = '/your_directory/simplewiki-20130103-pages-articles.xml.bz2';}</span>
|
||||
</dd>
|
||||
</dl>
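<p>
If you prefer to make this change programmatically, the following Python sketch rewrites the <code>text.init</code> line of a script held in a string. The helper name <code>set_local_dump</code> is hypothetical, and the sketch assumes the line is written as in the script on this page.
</p>

```python
import re


def set_local_dump(script_text, domain, bz2_path):
    """Replace the plain text.init line with one that sets src_bz2_fil."""
    pattern = re.compile(
        r"add\s*\(\s*'" + re.escape(domain) + r"'\s*,\s*'text\.init'\s*\)\s*;"
    )
    replacement = "add ('{0}', 'text.init') {{src_bz2_fil = '{1}';}}".format(domain, bz2_path)
    # a function is used as the replacement so that backslashes in
    # Windows paths are inserted literally
    new_text, count = pattern.subn(lambda match: replacement, script_text)
    if count != 1:
        raise ValueError("expected exactly one text.init line, found {0}".format(count))
    return new_text


line = "add ('simple.wikipedia.org' , 'text.init');"
print(set_local_dump(line, "simple.wikipedia.org",
                     "/your_directory/simplewiki-20130103-pages-articles.xml.bz2"))
```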
|
||||
<h2>
|
||||
<span class="mw-headline" id="Script">Script</span>
|
||||
</h2>
|
||||
<pre class='code'>
|
||||
// do not show a "Press enter to continue" at the end of the script
|
||||
app.bldr.pause_at_end = 'n';
|
||||
|
||||
// run xowa.gfs
|
||||
app.scripts.run_file_by_type('xowa_cfg_app');
|
||||
|
||||
// import wiki; for more info see [[Dev/Command-line]]
|
||||
app.bldr.cmds {
|
||||
// delete all files in directory; note that subdirectories and file databases ("-file.xowa") will not be deleted
|
||||
add ('simple.wikipedia.org' , 'util.cleanup') {delete_all = 'y';}
|
||||
|
||||
// download main dump file; contains all articles
|
||||
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
|
||||
|
||||
// download categorylinks file; contains links from category to pages
|
||||
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'categorylinks';}
|
||||
|
||||
// download page_props file; contains information on hidden categories
|
||||
add ('simple.wikipedia.org' , 'util.download') {dump_type = 'page_props';}
|
||||
|
||||
// start wiki import
|
||||
add ('simple.wikipedia.org' , 'text.init');
|
||||
|
||||
// import articles
|
||||
add ('simple.wikipedia.org' , 'text.page');
|
||||
|
||||
// generate search data
|
||||
add ('simple.wikipedia.org' , 'text.search');
|
||||
|
||||
// generate main category data
|
||||
add ('simple.wikipedia.org' , 'text.cat.core');
|
||||
|
||||
// import category links
|
||||
add ('simple.wikipedia.org' , 'text.cat.link');
|
||||
|
||||
// apply hidden categories
|
||||
add ('simple.wikipedia.org' , 'text.cat.hidden');
|
||||
|
||||
// end import
|
||||
add ('simple.wikipedia.org' , 'text.term');
|
||||
|
||||
// import css into wiki
|
||||
add ('simple.wikipedia.org' , 'text.css');
|
||||
|
||||
// cleanup temp files; delete xml, sql, bz2 and gz
|
||||
add ('simple.wikipedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
|
||||
}
|
||||
|
||||
// run cmds
|
||||
app.bldr.run;
|
||||
</pre>
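<p>
When editing the script, it can help to see the order in which commands are queued. The following Python sketch lists the <code>add</code> commands found in a script. It is only an illustration: it is not a gfs parser, the name <code>list_commands</code> is hypothetical, and it assumes every command is written as <code>add ('domain', 'command')</code> as in the script above.
</p>

```python
import re

ADD_CALL = re.compile(r"add\s*\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)")


def list_commands(script_text):
    """Return (domain, command) pairs in the order they appear in the script."""
    return [(match.group(1), match.group(2)) for match in ADD_CALL.finditer(script_text)]


script = """
app.bldr.cmds {
  add ('simple.wikipedia.org' , 'util.cleanup') {delete_all = 'y';}
  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
  add ('simple.wikipedia.org' , 'text.init');
}
app.bldr.run;
"""
for domain, command in list_commands(script):
    print(domain, command)
```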
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
<div id="mw-head" class="noprint">
|
||||
<div id="left-navigation">
|
||||
<div id="p-namespaces" class="vectorTabs">
|
||||
<h3>Namespaces</h3>
|
||||
<ul>
|
||||
<li id="ca-nstab-main" class="selected"><span><a id="ca-nstab-main-href" href="index.html">Page</a></span></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div id='mw-panel' class='noprint'>
|
||||
<div id='p-logo'>
|
||||
<a style="background-image: url(http://xowa.org/xowa_logo.png);" href="index.html" title="Visit the main page"></a>
|
||||
</div>
|
||||
<div class="portal" id='xowa-portal-home'>
|
||||
<h3>XOWA</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/index.html" title='Visit the main page'>Main page</a></li>
|
||||
<li><a href="http://xowa.org/screenshots.html" title='See screenshots of XOWA'>Screenshots</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Download_XOWA.html" title='Download the XOWA application'>Download XOWA</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Dashboard/Image_databases.html" title='Download offline wikis and image databases'>Download wikis</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-stargin'>
|
||||
<h3>Getting started</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Setup/System_requirements.html" title="Get XOWA's system requirements">Requirements</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Setup/Installation.html" title='Get instructions for installing XOWA'>Installation</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/Simple_Wikipedia.html" title='Learn how to set up Simple Wikipedia'>Simple Wikipedia</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/English_Wikipedia.html" title='Learn how to set up English Wikipedia'>English Wikipedia</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/Other_wikis.html" title='Learn how to set up other Wikipedias'>Other Wikipedias</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-help'>
|
||||
<h3>Help</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/About.html" title='Get more information about XOWA'>About</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Contents.html" title='View a list of help topics'>Contents</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Media.html" title='Read what others have written about XOWA'>Media</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Feedback.html" title='Questions? Comments? Leave feedback for XOWA'>Feedback</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-blog'>
|
||||
<h3>Blog</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Blog.html" title="Follow XOWA's development process">Current</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-links'>
|
||||
<h3>Links</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://dumps.wikimedia.org/backup-index.html" title="Get wiki database dumps directly from Wikimedia">Wikimedia dumps</a></li>
|
||||
<li><a href="https://archive.org/search.php?query=xowa" title="Search archive.org for XOWA files">XOWA @ archive.org</a></li>
|
||||
<li><a href="http://en.wikipedia.org" title="Visit Wikipedia (and compare to XOWA!)">English Wikipedia</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-donate'>
|
||||
<h3>Donate</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="https://archive.org/donate/index.php" title="Support archive.org!">archive.org</a></li><!-- listed first due to recent fire damages: http://blog.archive.org/2013/11/06/scanning-center-fire-please-help-rebuild/ -->
|
||||
<li><a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector" title="Support Wikipedia!">Wikipedia</a></li>
|
||||
<!-- <li><a href="" title="Support XOWA! (but only after you've supported archive.org and Wikipedia)">XOWA</a></li> -->
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
121
Dev/Command-line/Script.html
Normal file
@@ -0,0 +1,121 @@
|
||||
<!DOCTYPE html>
|
||||
<html dir="ltr">
|
||||
<head>
|
||||
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
|
||||
<title>Dev/Command-line/Script - XOWA</title>
|
||||
<link rel="shortcut icon" href="http://xowa.org/xowa_logo.png" />
|
||||
<link rel="stylesheet" href="http://xowa.org/xowa_common.css" type="text/css">
|
||||
|
||||
</head>
|
||||
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject skin-vector action-submit vector-animateLayout" spellcheck="false">
|
||||
<div id="mw-page-base" class="noprint"></div>
|
||||
<div id="mw-head-base" class="noprint"></div>
|
||||
<div id="content" class="mw-body">
|
||||
<h1 id="firstHeading" class="firstHeading"><span>Dev/Command-line/Script</span></h1>
|
||||
<div id="bodyContent" class="mw-body-content">
|
||||
<div id="siteSub">From XOWA: the free, open-source, offline wiki application</div>
|
||||
<div id="contentSub"></div>
|
||||
<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">
|
||||
|
||||
<table class="metadata plainlinks ambox ambox-delete" style="">
|
||||
<tr>
|
||||
<td class="mbox-empty-cell">
|
||||
</td>
|
||||
<td class="mbox-text" style="">
|
||||
<p>
|
||||
<span class="mbox-text-span">Note: This page is obsolete. It is preserved for historical reference only.</span>
|
||||
</p>
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
</body>
|
||||
</html>
|
||||
209
Dev/Command-line/Site_meta.html
Normal file
@@ -0,0 +1,209 @@
|
||||
<!DOCTYPE html>
|
||||
<html dir="ltr">
|
||||
<head>
|
||||
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
|
||||
<title>Dev/Command-line/Site meta - XOWA</title>
|
||||
<link rel="shortcut icon" href="http://xowa.org/xowa_logo.png" />
|
||||
<link rel="stylesheet" href="http://xowa.org/xowa_common.css" type="text/css">
|
||||
<style>
|
||||
.console {font-family: monospace; color: #EEEEEE ; background-color: black ; border: medium solid black;}
|
||||
.code
|
||||
,.path
|
||||
,.url {font-family: monospace; color: black ; background-color: #f9f9f9 ; border: medium solid #f9f9f9;}
|
||||
.bold {font-weight: 900;}
|
||||
</style>
|
||||
|
||||
</head>
|
||||
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject skin-vector action-submit vector-animateLayout" spellcheck="false">
|
||||
<div id="mw-page-base" class="noprint"></div>
|
||||
<div id="mw-head-base" class="noprint"></div>
|
||||
<div id="content" class="mw-body">
|
||||
<h1 id="firstHeading" class="firstHeading"><span>Dev/Command-line/Site meta</span></h1>
|
||||
<div id="bodyContent" class="mw-body-content">
|
||||
<div id="siteSub">From XOWA: the free, open-source, offline wiki application</div>
|
||||
<div id="contentSub"></div>
|
||||
<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">
|
||||
|
||||
<div id="toc" class="toc">
|
||||
<div id="toctitle">
|
||||
<h2>
|
||||
Contents
|
||||
</h2>
|
||||
</div>
|
||||
<ul>
|
||||
<li class="toclevel-1 tocsection-1">
|
||||
<a href="#Background"><span class="tocnumber">1</span> <span class="toctext">Background</span></a>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-2">
|
||||
<a href="#Process"><span class="tocnumber">2</span> <span class="toctext">Process</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
<p>
|
||||
XOWA can download the metadata for the Wikimedia wikis.
|
||||
</p>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Background">Background</span>
|
||||
</h2>
|
||||
<p>
|
||||
Wikimedia exposes an API for accessing the meta-data for a given wiki. For example, for English Wikipedia, the following will return most of the meta-data around the wiki installation.
|
||||
</p>
|
||||
<pre class='code'>
|
||||
https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general|namespaces|statistics|interwikimap|namespacealiases|specialpagealiases|libraries|extensions|skins|magicwords|functionhooks|showhooks|extensiontags|protocols|defaultoptions|languages
|
||||
</pre>
|
||||
<p>
|
||||
XOWA can call this API to download the metadata for each wiki and save it in a database for data-processing. XOWA currently uses this info to resolve namespaces, but it will also incorporate other metadata from this API in future releases.
|
||||
</p>
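<p>
To illustrate how the request above is put together, the following Python sketch builds the same address for any wiki domain. It is not XOWA's code; the names <code>SIPROPS</code> and <code>siteinfo_url</code> are hypothetical, and the property list is copied from the example address.
</p>

```python
from urllib.parse import urlencode

# properties requested in the example address above
SIPROPS = [
    "general", "namespaces", "statistics", "interwikimap", "namespacealiases",
    "specialpagealiases", "libraries", "extensions", "skins", "magicwords",
    "functionhooks", "showhooks", "extensiontags", "protocols",
    "defaultoptions", "languages",
]


def siteinfo_url(domain, siprops=SIPROPS):
    """Build the siteinfo query address for a wiki domain."""
    query = urlencode(
        {"action": "query", "meta": "siteinfo", "siprop": "|".join(siprops)},
        safe="|",  # keep the pipe separators readable instead of %7C
    )
    return "https://{0}/w/api.php?{1}".format(domain, query)


print(siteinfo_url("en.wikipedia.org"))
```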
|
||||
<h2>
|
||||
<span class="mw-headline" id="Process">Process</span>
|
||||
</h2>
|
||||
<p>
|
||||
Assuming you are on a Windows system with XOWA installed at <code>C:\xowa</code>
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
Create a plain text-file called "C:\xowa\build_site_meta.gfs"
|
||||
</li>
|
||||
<li>
|
||||
Save the following text to the file:
|
||||
</li>
|
||||
</ul>
|
||||
<pre class='code'>
|
||||
app.bldr.pause_at_end_('n');
|
||||
app.scripts.run_file_by_type('xowa_cfg_app');
|
||||
app.bldr.cmds {
|
||||
|
||||
// NOTE: wiki doesn't matter; just use any wiki name that is on your system
|
||||
add('simple.wikipedia.org', 'util.site_meta') {
|
||||
|
||||
// path of the database to generate; default is C:\xowa\bin\any\xowa\cfg\wiki\site_meta.sqlite3
|
||||
db_url = 'C:\xowa\site_meta__enwiki.sqlite3';
|
||||
|
||||
// skip any wikis which have been downloaded after this time. default is now() - 1 day
|
||||
// the purpose of this argument is to avoid calling the API again if it has already been called recently.
|
||||
// for example, if the script runs for 800 wikis and fails for 3 wikis,
|
||||
// you can rerun the script again and it will only download the 3 failed ones; not all 800
|
||||
cutoff_time = '2015-07-01';
|
||||
|
||||
// list of wikis to download; note that each wiki must be separated by a new-line. default is all wikis listed in [[Dashboard/Import/Online]]
|
||||
wikis =
|
||||
'en.wikipedia.org
|
||||
en.wiktionary.org';
|
||||
}
|
||||
}
|
||||
app.bldr.run;
|
||||
</pre>
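<p>
The effect of <code>cutoff_time</code> can be pictured with the following Python sketch. It only restates the rule in the script comments (skip wikis downloaded after the cutoff) and is not XOWA's implementation; the function name and the <code>last_downloaded</code> mapping are assumptions made for illustration.
</p>

```python
from datetime import datetime


def wikis_to_download(wikis, last_downloaded, cutoff_time):
    """Return the wikis whose metadata still needs to be fetched."""
    pending = []
    for wiki in wikis:
        fetched_at = last_downloaded.get(wiki)
        # skip any wiki that was already downloaded after the cutoff
        if fetched_at is not None and fetched_at > cutoff_time:
            continue
        pending.append(wiki)
    return pending


wikis = ["en.wikipedia.org", "en.wiktionary.org", "simple.wikipedia.org"]
last_downloaded = {
    "en.wikipedia.org": datetime(2015, 7, 2),   # fetched after the cutoff
    "en.wiktionary.org": datetime(2015, 6, 30), # fetched before the cutoff
}
print(wikis_to_download(wikis, last_downloaded, datetime(2015, 7, 1)))
```

<p>
In this example only the wiki fetched after the cutoff is skipped; a wiki that was never fetched, or was fetched before the cutoff, is downloaded again.
</p>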
|
||||
<ul>
|
||||
<li>
|
||||
Run the file with the following:
|
||||
</li>
|
||||
</ul>
|
||||
<pre class='code'>
|
||||
java -jar xowa_windows.jar --app_mode cmd --cmd_file C:\xowa\build_site_meta.gfs
|
||||
</pre>
|
||||
<ul>
|
||||
<li>
|
||||
Open C:\xowa\site_meta__enwiki.sqlite3 in a sqlite shell and run the following:
|
||||
</li>
|
||||
</ul>
|
||||
<pre class='code'>
|
||||
SELECT * FROM site_statistic;
|
||||
</pre>
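<p>
The same query can be run from Python's built-in <code>sqlite3</code> module instead of a sqlite shell. The sketch below reads whatever columns the table has rather than assuming a schema. The demonstration uses an in-memory database with made-up columns (<code>site_abrv</code>, <code>stat_pages</code>), which are placeholders and not the real table layout; to read the generated file, connect to <code>C:\xowa\site_meta__enwiki.sqlite3</code> instead.
</p>

```python
import sqlite3


def read_table(conn, table_name):
    """Return (column_names, rows) for every row in table_name."""
    # table names cannot be bound as SQL parameters, so accept plain identifiers only
    if not table_name.replace("_", "").isalnum():
        raise ValueError("unexpected table name: " + table_name)
    cursor = conn.execute("SELECT * FROM " + table_name)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


# demonstration database; the column names here are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site_statistic (site_abrv TEXT, stat_pages INTEGER)")
conn.execute("INSERT INTO site_statistic VALUES ('enwiki', 123)")
columns, rows = read_table(conn, "site_statistic")
print(columns)
print(rows)
```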
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
</body>
|
||||
</html>
|
||||
538
Dev/Command-line/Thumbs.html
Normal file
@@ -0,0 +1,538 @@
|
||||
<!DOCTYPE html>
|
||||
<html dir="ltr">
|
||||
<head>
|
||||
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
|
||||
<title>Dev/Command-line/Thumbs - XOWA</title>
|
||||
<link rel="shortcut icon" href="http://xowa.org/xowa_logo.png" />
|
||||
<link rel="stylesheet" href="http://xowa.org/xowa_common.css" type="text/css">
|
||||
<style>
|
||||
.console {font-family: monospace; color: #EEEEEE ; background-color: black ; border: medium solid black;}
|
||||
.code
|
||||
,.path
|
||||
,.url {font-family: monospace; color: black ; background-color: #f9f9f9 ; border: medium solid #f9f9f9;}
|
||||
.bold {font-weight: 900;}
|
||||
</style>
|
||||
|
||||
</head>
|
||||
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject skin-vector action-submit vector-animateLayout" spellcheck="false">
|
||||
<div id="mw-page-base" class="noprint"></div>
|
||||
<div id="mw-head-base" class="noprint"></div>
|
||||
<div id="content" class="mw-body">
|
||||
<h1 id="firstHeading" class="firstHeading"><span>Dev/Command-line/Thumbs</span></h1>
|
||||
<div id="bodyContent" class="mw-body-content">
|
||||
<div id="siteSub">From XOWA: the free, open-source, offline wiki application</div>
|
||||
<div id="contentSub"></div>
|
||||
<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">
|
||||
|
||||
<p>
|
||||
XOWA can generate two types of dumps: file-dumps and html-dumps.
|
||||
</p>
|
||||
<table class="metadata plainlinks ambox ambox-delete" style="">
|
||||
<tr>
|
||||
<td class="mbox-empty-cell">
|
||||
</td>
|
||||
<td class="mbox-text" style="">
|
||||
<p>
|
||||
<span class="mbox-text-span">Please note that this script is for power users. It is not meant for casual users.</span>
|
||||
</p>
|
||||
<p>
|
||||
<span class="mbox-text-span">Please read through these instructions carefully. If you fail to follow these instructions, you may end up downloading millions of images by accident, and have your IP address banned by Wikimedia.</span>
|
||||
</p>
|
||||
<p>
|
||||
<span class="mbox-text-span">Also, the script will change in the future, and without any warning. There is no backward compatibility. Although the XOWA databases have a fixed format, the scripts do not. If you discover that your script breaks, please refer to this page, contact me for assistance, or go through the code.</span>
|
||||
</p>
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
<p>
|
||||
<br>
|
||||
</p>
|
||||
<div id="toc" class="toc">
|
||||
<div id="toctitle">
|
||||
<h2>
|
||||
Contents
|
||||
</h2>
|
||||
</div>
|
||||
<ul>
|
||||
<li class="toclevel-1 tocsection-1">
|
||||
<a href="#Overview"><span class="tocnumber">1</span> <span class="toctext">Overview</span></a>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-2">
|
||||
<a href="#Requirements"><span class="tocnumber">2</span> <span class="toctext">Requirements</span></a>
|
||||
<ul>
|
||||
<li class="toclevel-2 tocsection-3">
|
||||
<a href="#commons.wikimedia.org_.28thum"><span class="tocnumber">2.1</span> <span class="toctext">commons.wikimedia.org (thum</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-4">
|
||||
<a href="#www.wikidata.org"><span class="tocnumber">2.2</span> <span class="toctext">www.wikidata.org</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-5">
|
||||
<a href="#Hardware"><span class="tocnumber">2.3</span> <span class="toctext">Hardware</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-6">
|
||||
<a href="#Internet-connectivity_.28optional.29"><span class="tocnumber">2.4</span> <span class="toctext">Internet-connectivity (optional)</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-7">
|
||||
<a href="#Pre-existing_image_databases_for_your_wiki_.28optional.29"><span class="tocnumber">2.5</span> <span class="toctext">Pre-existing image databases for your wiki (optional)</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-8">
|
||||
<a href="#gfs"><span class="tocnumber">3</span> <span class="toctext">gfs</span></a>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-9">
|
||||
<a href="#Terms"><span class="tocnumber">4</span> <span class="toctext">Terms</span></a>
|
||||
<ul>
|
||||
<li class="toclevel-2 tocsection-10">
|
||||
<a href="#lnki"><span class="tocnumber">4.1</span> <span class="toctext">lnki</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-11">
|
||||
<a href="#orig"><span class="tocnumber">4.2</span> <span class="toctext">orig</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-12">
|
||||
<a href="#xfer"><span class="tocnumber">4.3</span> <span class="toctext">xfer</span></a>
|
||||
</li>
|
||||
<li class="toclevel-2 tocsection-13">
|
||||
<a href="#fsdb"><span class="tocnumber">4.4</span> <span class="toctext">fsdb</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-14">
|
||||
<a href="#Script"><span class="tocnumber">5</span> <span class="toctext">Script</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Overview">Overview</span>
|
||||
</h2>
|
||||
<p>
|
||||
The download-thumbs script downloads all thumbs for pages in a specific wiki. It works in the following way:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
It loads a page.
|
||||
</li>
|
||||
<li>
|
||||
It converts the wikitext to HTML
|
||||
<ul>
|
||||
<li>
|
||||
If thumb mode is enabled, it compiles a list of [[File]] links.
|
||||
</li>
|
||||
<li>
|
||||
If HTML-dump mode is enabled, it saves the HTML into XOWA html databases
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>
|
||||
It repeats until there are no more pages
|
||||
</li>
|
||||
<li>
|
||||
If thumb mode is enabled, it performs the following additional steps:
|
||||
<ul>
|
||||
<li>
|
||||
It analyzes the list of [[File]] links to come up with a unique list of thumbs.
|
||||
</li>
|
||||
<li>
|
||||
It downloads the thumbs and creates the XOWA file databases.
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
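<p>
The thumb-mode steps above can be sketched in a few lines of Python. This is a simplified illustration, not XOWA's implementation: it scans raw wikitext with a regular expression instead of converting pages to HTML, it de-duplicates by file name only (the real script works out a unique list of thumbs), and the name <code>unique_file_links</code> is hypothetical.
</p>

```python
import re

# matches the file name in [[File:Name.ext|...]]
FILE_LINK = re.compile(r"\[\[\s*File\s*:\s*([^\]|]+)")


def unique_file_links(wikitexts):
    """Collect file names from each page, keeping the first occurrence of each."""
    seen = {}
    for text in wikitexts:
        for match in FILE_LINK.finditer(text):
            seen.setdefault(match.group(1).strip(), None)
    return list(seen)


pages = [
    "[[File:A.png|thumb|200x300px]] some text [[B]]",
    "[[File:A.png|thumb]] and [[File:C.jpg]]",
]
print(unique_file_links(pages))
```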
|
||||
<p>
|
||||
The script for Simple Wikipedia is listed below.
|
||||
</p>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Requirements">Requirements</span>
|
||||
</h2>
|
||||
<h3>
|
||||
<span class="mw-headline" id="commons.wikimedia.org_.28thum">commons.wikimedia.org (thum</span>
|
||||
</h3>
|
||||
<p>
|
||||
You will need the latest version of commons.wikimedia.org. Note that if you have an older version, you will have missing images or wrong size information.
|
||||
</p>
|
||||
<p>
|
||||
For example, if you have a commons.wikimedia.org from 2015-04-22 and are trying to import a 2015-05-17 English Wikipedia, then any new images added after 2015-04-22 will not be picked up.
|
||||
</p>
|
||||
<h3>
|
||||
<span class="mw-headline" id="www.wikidata.org">www.wikidata.org</span>
|
||||
</h3>
|
||||
<p>
|
||||
You also need to have the latest version of www.wikidata.org. Note that English Wikipedia and other wikis use Wikidata through the {{#property}} call or Module code. If you have an earlier version, then data will be missing or out of date.
|
||||
</p>
|
||||
<h3>
|
||||
<span class="mw-headline" id="Hardware">Hardware</span>
|
||||
</h3>
|
||||
<p>
|
||||
You should have a recent-generation machine with relatively high-performance hardware, especially if you're planning to generate images for English Wikipedia.
|
||||
</p>
|
||||
<p>
|
||||
For context, here is my current machine setup for generating the image dumps:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
Processor: Intel Core i7-4770K; 3.5 GHz with 8 MB L3 cache
|
||||
</li>
|
||||
<li>
|
||||
Memory: 16 GB DDR3 SDRAM 1600 (PC3 12800)
|
||||
</li>
|
||||
<li>
|
||||
Hard Drive: 1TB 10,000 RPM 64MB Cache SATA 6.0Gb/s
|
||||
</li>
|
||||
<li>
|
||||
Operating System: openSUSE 13.2
|
||||
</li>
|
||||
</ul>
|
||||
<p>
|
||||
(Note: The hardware was assembled in late 2013 for about 1,600 US dollars.)
|
||||
</p>
|
||||
<p>
|
||||
For English Wikipedia, it still takes about 50 hours for the entire process.
|
||||
</p>
|
||||
<h3>
|
||||
<span class="mw-headline" id="Internet-connectivity_.28optional.29">Internet-connectivity (optional)</span>
|
||||
</h3>
|
||||
<p>
|
||||
You should have a broadband connection to the internet. The script will need to download dump files from Wikimedia, and some dump files (like English Wikipedia) will be in the tens of GB.
|
||||
</p>
|
||||
<p>
|
||||
You can opt to download these files separately and place them in the appropriate location beforehand. However, the script below assumes that the machine is always online. If you are offline, you will need to comment out the "util.download" lines yourself.
|
||||
</p>
|
||||
<h3>
|
||||
<span class="mw-headline" id="Pre-existing_image_databases_for_your_wiki_.28optional.29">Pre-existing image databases for your wiki (optional)</span>
|
||||
</h3>
|
||||
<p>
|
||||
XOWA will automatically re-use the images from existing image databases so that you do not have to redownload them. This is particularly useful for large wikis where redownloading millions of images would be unwanted.
|
||||
</p>
|
||||
<p>
|
||||
It is strongly advised that you download the image database for your wiki. You can find a full list here: <a href="http://xowa.sourceforge.net/image_dbs.html" rel="nofollow" class="external free">http://xowa.sourceforge.net/image_dbs.html</a>. Note that if an image database does not exist for your wiki, you can still use the script.
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
If you have v1 image databases, they should be placed in <code>/xowa/file/wiki_domain-prv</code>. For example, English Wikipedia should have <code>/xowa/file/en.wikipedia.org-prv/fsdb.main/fsdb.bin.0000.sqlite3</code>
|
||||
</li>
|
||||
<li>
|
||||
If you have v2 image databases, they should be placed in <code>/xowa/wiki/wiki_domain/prv</code>. For example, English Wikipedia should have <code>/xowa/wiki/en.wikipedia.org/prv/en.wikipedia.org-file-ns.000-db.001.xowa</code>
|
||||
</li>
|
||||
</ul>
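<p>
The choice between v1 and v2 ties in with the <code>src_bin_mgr__fsdb_version</code> setting of the <code>file.fsdb_make</code> command in the script further down, which is shown there with <code>'v1'</code>. As a hedged sketch, a setup with v2 (.xowa) image databases would presumably change that one value. The value <code>'v2'</code> is inferred from the script's own comment and is an assumption, not something confirmed on this page.
</p>
<pre class='code'>
add ('simple.wikipedia.org' , 'file.fsdb_make') {
  commit_interval = 1000; progress_interval = 200; select_interval = 10000;
  ns_ids = '0|4|14';

  // assumption: 'v2' selects the .xowa databases placed under /xowa/wiki/wiki_domain/prv
  src_bin_mgr__fsdb_version = 'v2';

  src_bin_mgr__fsdb_skip_wkrs = 'page_gt_1|small_size';
  src_bin_mgr__wmf_enabled = 'y';
}
</pre>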
|
||||
<h2>
|
||||
<span class="mw-headline" id="gfs">gfs</span>
|
||||
</h2>
|
||||
<p>
|
||||
The script is written in the <code>gfs</code> format. This is a custom scripting format specific to XOWA. It is similar to JSON, but also supports commenting.
|
||||
</p>
|
||||
<p>
|
||||
Unfortunately, the error handling for gfs is quite minimal. When making changes, make them in small steps and keep backups that you can revert to.
|
||||
</p>
|
||||
<p>
|
||||
The following is a brief list of rules:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
Comments are either single-line (starting with "//" and running to the end of the line) or multi-line (enclosed between "/*" and "*/"). For example: <code>// single-line comment</code> or <code>/* multi-line comment*/</code>
|
||||
</li>
|
||||
<li>
|
||||
Booleans are "y" and "n" (yes / no or true / false). For example: <code>enabled = 'y';</code>
|
||||
</li>
|
||||
<li>
|
||||
Numbers are 32-bit integers and are not enclosed in quotes. For example, <code>count = 10000;</code>
|
||||
</li>
|
||||
<li>
|
||||
Strings are surrounded by apostrophes (') or quotes ("). For example: <code>key = 'val';</code>
|
||||
</li>
|
||||
<li>
|
||||
Statements are terminated by a semi-colon (;). For example: <code>procedure1;</code>
|
||||
</li>
|
||||
<li>
|
||||
Statements can take arguments in parentheses. For example: <code>procedure1('argument1', 'argument2', 'argument3');</code>
|
||||
</li>
|
||||
<li>
|
||||
Statements are grouped with curly braces ({}). For example: <code>group {procedure1; procedure2; procedure3;}</code>
|
||||
</li>
|
||||
</ul>
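<p>
Put together, the rules above look roughly like this. The fragment reuses commands from the script further down purely to illustrate the syntax; it is a sketch, not a complete or recommended build script.
</p>
<pre class='code'>
// single-line comment: everything up to the end of the line is ignored
/* multi-line comment:
   everything between the markers is ignored */

// curly braces group the statements that follow
app.bldr.cmds {
  // a statement with arguments in parentheses, terminated by a semi-colon
  add ('simple.wikipedia.org' , 'text.init');

  add ('simple.wikipedia.org' , 'text.page') {
    // boolean: 'y' or 'n'
    redirect_id_enabled = 'y';
  }

  add ('simple.wikipedia.org' , 'file.lnki_temp') {
    // number: 32-bit integer, not enclosed in quotes
    commit_interval = 10000;

    // string: surrounded by apostrophes (or quotes)
    ns_ids = '0|4|14';
  }
}
</pre>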
|
||||
<h2>
|
||||
<span class="mw-headline" id="Terms">Terms</span>
|
||||
</h2>
|
||||
<h3>
|
||||
<span class="mw-headline" id="lnki">lnki</span>
|
||||
</h3>
|
||||
<p>
|
||||
A <code>lnki</code> is short for "<b>l</b>i<b>nk</b> <b>i</b>nternal". It refers to all wikitext with the double bracket syntax: [[A]]. A more elaborate example for files would be [[File:A.png|thumb|200x300px|upright=.80]]. Note that the abbreviation was chosen to differentiate it from <code>lnke</code>, which is short for "<b>l</b>i<b>nk</b> <b>e</b>xternal". For the purposes of the script, all lnki data comes from the current wiki's data dump.
|
||||
</p>
|
||||
<h3>
|
||||
<span class="mw-headline" id="orig">orig</span>
|
||||
</h3>
|
||||
<ul>
|
||||
<li>
|
||||
An <code>orig</code> is short for "<b>orig</b>inal file". It refers to the original file metadata. For the purposes of this script, all orig data comes from commons.wikimedia.org.
|
||||
</li>
|
||||
</ul>
|
||||
<h3>
|
||||
<span class="mw-headline" id="xfer">xfer</span>
|
||||
</h3>
|
||||
<ul>
|
||||
<li>
|
||||
An <code>xfer</code> is short for "transfer file". It refers to the actual file to be downloaded.
|
||||
</li>
|
||||
</ul>
|
||||
<h3>
|
||||
<span class="mw-headline" id="fsdb">fsdb</span>
|
||||
</h3>
|
||||
<ul>
|
||||
<li>
|
||||
The <code>fsdb</code> is short for "<b>f</b>ile <b>s</b>ystem <b>d</b>ata<b>b</b>ase". It refers to the internal table format of the XOWA image databases.
|
||||
</li>
|
||||
</ul>
|
||||
<p>
|
||||
<br>
|
||||
</p>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Script">Script</span>
|
||||
</h2>
|
||||
<pre class='code'>
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
app.bldr.cmds {
  // build commons database; this only needs to be done once each time commons is updated
  add ('commons.wikimedia.org' , 'util.cleanup') {delete_all = 'y';}
  add ('commons.wikimedia.org' , 'util.download') {dump_type = 'pages-articles';}
  add ('commons.wikimedia.org' , 'util.download') {dump_type = 'categorylinks';}
  add ('commons.wikimedia.org' , 'util.download') {dump_type = 'page_props';}
  add ('commons.wikimedia.org' , 'util.download') {dump_type = 'image';}
  add ('commons.wikimedia.org' , 'text.init');
  add ('commons.wikimedia.org' , 'text.page');
  add ('commons.wikimedia.org' , 'text.cat.core');
  add ('commons.wikimedia.org' , 'text.cat.link');
  add ('commons.wikimedia.org' , 'text.cat.hidden');
  add ('commons.wikimedia.org' , 'text.term');
  add ('commons.wikimedia.org' , 'text.css');
  add ('commons.wikimedia.org' , 'wiki.image');
  add ('commons.wikimedia.org' , 'file.page_regy') {build_commons = 'y';}
  add ('commons.wikimedia.org' , 'wiki.page_dump.make');
  add ('commons.wikimedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
  add ('commons.wikimedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}

  // build wikidata database; this only needs to be done once each time wikidata is updated
  add ('www.wikidata.org' , 'util.cleanup') {delete_all = 'y';}
  add ('www.wikidata.org' , 'util.download') {dump_type = 'pages-articles';}
  add ('www.wikidata.org' , 'util.download') {dump_type = 'categorylinks';}
  add ('www.wikidata.org' , 'util.download') {dump_type = 'page_props';}
  add ('www.wikidata.org' , 'util.download') {dump_type = 'image';}
  add ('www.wikidata.org' , 'text.init');
  add ('www.wikidata.org' , 'text.page');
  add ('www.wikidata.org' , 'text.cat.core');
  add ('www.wikidata.org' , 'text.cat.link');
  add ('www.wikidata.org' , 'text.cat.hidden');
  add ('www.wikidata.org' , 'text.term');
  add ('www.wikidata.org' , 'text.css');
  add ('www.wikidata.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}

  // build simple.wikipedia.org
  add ('simple.wikipedia.org' , 'util.cleanup') {delete_all = 'y';}
  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'categorylinks';}
  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'page_props';}
  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'image';}
  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pagelinks';}
  add ('simple.wikipedia.org' , 'text.init');
  add ('simple.wikipedia.org' , 'text.page') {
    // calculate redirect_id for #REDIRECT pages. needed for html databases
    redirect_id_enabled = 'y';
  }
  add ('simple.wikipedia.org' , 'text.search');

  // upload desktop css
  add ('simple.wikipedia.org' , 'text.css');

  // upload mobile css
  add ('simple.wikipedia.org' , 'text.css') {css_key = 'xowa.mobile'; /* css_dir = 'C:\xowa\user\anonymous\wiki\simple.wikipedia.org-mobile\html\'; */}

  add ('simple.wikipedia.org' , 'text.cat.core');
  add ('simple.wikipedia.org' , 'text.cat.link');
  add ('simple.wikipedia.org' , 'text.cat.hidden');
  add ('simple.wikipedia.org' , 'text.term');

  // create local "page" tables in each "text" database for "lnki_temp"
  add ('simple.wikipedia.org' , 'wiki.page_dump.make');

  // create a redirect table for pages in the File namespace
  add ('simple.wikipedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}

  // create an "image" table to get the metadata for all files in the current wiki
  add ('simple.wikipedia.org' , 'wiki.image');

  // parse all page-to-page links
  add ('simple.wikipedia.org' , 'wiki.page_link');

  // calculate a score for each page using the page-to-page links
  add ('simple.wikipedia.org' , 'search.page__page_score') {iteration_max = 100;}

  // update link score statistics for the search tables
  add ('simple.wikipedia.org' , 'search.link__link_score') {page_rank_enabled = 'y';}

  // update word count statistics for the search_word table
  add ('simple.wikipedia.org' , 'search.word__link_count');

  // cleanup all downloaded files as well as temporary files
  add ('simple.wikipedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}

  // parse every page in the listed namespaces and gather data on their lnkis.
  // this step will take the longest amount of time.
  add ('simple.wikipedia.org' , 'file.lnki_temp') {
    // save data every # of pages
    commit_interval = 10000;

    // update progress every # of pages
    progress_interval = 50;

    // free memory by flushing internal caches every # of pages
    cleanup_interval = 50;

    // specify how much page text to read into memory at a time, in MB. For example, 25 means read approximately 25 MB of page text into memory
    select_size = 25;

    // namespaces to parse. See en.wikipedia.org/wiki/Help:Namespaces
    ns_ids = '0|4|14';

    // enable generation of ".html" databases. This is experimental and will increase processing time by 20% - 25%
    // gen_hdump = 'y';
  }

  // aggregate the lnkis
  add ('simple.wikipedia.org' , 'file.lnki_regy');

  // generate orig metadata for files in the current wiki (for example, for pages in en.wikipedia.org/wiki/File:*)
  add ('simple.wikipedia.org' , 'file.page_regy') {build_commons = 'n';}

  // generate all orig metadata for all lnkis
  add ('simple.wikipedia.org' , 'file.orig_regy');

  // generate list of files to download based on "orig_regy" and XOWA image code
  add ('simple.wikipedia.org' , 'file.xfer_temp.thumb');

  // aggregate list one more time
  add ('simple.wikipedia.org' , 'file.xfer_regy');

  // identify images that have already been downloaded
  add ('simple.wikipedia.org' , 'file.xfer_regy_update');

  // download images. This step may also take a long time, depending on how many images are needed
  add ('simple.wikipedia.org' , 'file.fsdb_make') {
    commit_interval = 1000; progress_interval = 200; select_interval = 10000;
    ns_ids = '0|4|14';

    // specify whether original wiki databases are v1 (.sqlite3) or v2 (.xowa)
    src_bin_mgr__fsdb_version = 'v1';

    // always redownload certain files
    src_bin_mgr__fsdb_skip_wkrs = 'page_gt_1|small_size';

    // allow downloads from wikimedia
    src_bin_mgr__wmf_enabled = 'y';
  }

  // generate registry of original metadata by file title
  add ('simple.wikipedia.org' , 'file.orig_reg');

  // drop page_dump tables
  add ('simple.wikipedia.org' , 'wiki.page_dump.drop');
}
app.bldr.run;
</pre>
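<p>
The script leaves html-dump generation switched off. As a minimal sketch based only on the commented-out line in the <code>file.lnki_temp</code> block, enabling it would mean removing the leading <code>//</code> from <code>gen_hdump</code>. Per the script's own comments, this is experimental, adds roughly 20% - 25% to processing time, and relies on <code>redirect_id_enabled = 'y'</code> in the <code>text.page</code> block, which the script already sets.
</p>
<pre class='code'>
add ('simple.wikipedia.org' , 'file.lnki_temp') {
  commit_interval = 10000;
  progress_interval = 50;
  cleanup_interval = 50;
  select_size = 25;
  ns_ids = '0|4|14';

  // uncommented to generate ".html" databases (experimental)
  gen_hdump = 'y';
}
</pre>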
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
<div id="mw-head" class="noprint">
|
||||
<div id="left-navigation">
|
||||
<div id="p-namespaces" class="vectorTabs">
|
||||
<h3>Namespaces</h3>
|
||||
<ul>
|
||||
<li id="ca-nstab-main" class="selected"><span><a id="ca-nstab-main-href" href="index.html">Page</a></span></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div id='mw-panel' class='noprint'>
|
||||
<div id='p-logo'>
|
||||
<a style="background-image: url(http://xowa.org/xowa_logo.png);" href="index.html" title="Visit the main page"></a>
|
||||
</div>
|
||||
<div class="portal" id='xowa-portal-home'>
|
||||
<h3>XOWA</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/index.html" title='Visit the main page'>Main page</a></li>
|
||||
<li><a href="http://xowa.org/screenshots.html" title='See screenshots of XOWA'>Screenshots</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Download_XOWA.html" title='Download the XOWA application'>Download XOWA</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Dashboard/Image_databases.html" title='Download offline wikis and image databases'>Download wikis</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-stargin'>
|
||||
<h3>Getting started</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Setup/System_requirements.html" title="Get XOWA's system requirements">Requirements</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Setup/Installation.html" title='Get instructions for installing XOWA'>Installation</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/Simple_Wikipedia.html" title='Learn how to set up Simple Wikipedia'>Simple Wikipedia</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/English_Wikipedia.html" title='Learn how to set up English Wikipedia'>English Wikipedia</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/Other_wikis.html" title='Learn how to set up other Wikipedias'>Other Wikipedias</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-help'>
|
||||
<h3>Help</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/About.html" title='Get more information about XOWA'>About</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Contents.html" title='View a list of help topics'>Contents</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Media.html" title='Read what others have written about XOWA'>Media</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Feedback.html" title='Questions? Comments? Leave feedback for XOWA'>Feedback</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-blog'>
|
||||
<h3>Blog</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Blog.html" title="Follow XOWA's development process">Current</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-links'>
|
||||
<h3>Links</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://dumps.wikimedia.org/backup-index.html" title="Get wiki database dumps directly from Wikimedia">Wikimedia dumps</a></li>
|
||||
<li><a href="https://archive.org/search.php?query=xowa" title="Search archive.org for XOWA files">XOWA @ archive.org</a></li>
|
||||
<li><a href="http://en.wikipedia.org" title="Visit Wikipedia (and compare to XOWA!)">English Wikipedia</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-donate'>
|
||||
<h3>Donate</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="https://archive.org/donate/index.php" title="Support archive.org!">archive.org</a></li><!-- listed first due to recent fire damages: http://blog.archive.org/2013/11/06/scanning-center-fire-please-help-rebuild/ -->
|
||||
<li><a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector" title="Support Wikipedia!">Wikipedia</a></li>
|
||||
<!-- <li><a href="" title="Support XOWA! (but only after you've supported archive.org and Wikipedia)">XOWA</a></li> -->
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
206
Dev/Command-line/Wikidata.html
Normal file
@@ -0,0 +1,206 @@
|
||||
<!DOCTYPE html>
|
||||
<html dir="ltr">
|
||||
<head>
|
||||
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
|
||||
<title>Dev/Command-line/Wikidata - XOWA</title>
|
||||
<link rel="shortcut icon" href="http://xowa.org/xowa_logo.png" />
|
||||
<link rel="stylesheet" href="http://xowa.org/xowa_common.css" type="text/css">
|
||||
<style>
|
||||
.console {font-family: monospace; color: #EEEEEE ; background-color: black ; border: medium solid black;}
|
||||
.code
|
||||
,.path
|
||||
,.url {font-family: monospace; color: black ; background-color: #f9f9f9 ; border: medium solid #f9f9f9;}
|
||||
.bold {font-weight: 900;}
|
||||
</style>
|
||||
|
||||
</head>
|
||||
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject skin-vector action-submit vector-animateLayout" spellcheck="false">
|
||||
<div id="mw-page-base" class="noprint"></div>
|
||||
<div id="mw-head-base" class="noprint"></div>
|
||||
<div id="content" class="mw-body">
|
||||
<h1 id="firstHeading" class="firstHeading"><span>Dev/Command-line/Wikidata</span></h1>
|
||||
<div id="bodyContent" class="mw-body-content">
|
||||
<div id="siteSub">From XOWA: the free, open-source, offline wiki application</div>
|
||||
<div id="contentSub"></div>
|
||||
<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">
|
||||
|
||||
<div id="toc" class="toc">
|
||||
<div id="toctitle">
|
||||
<h2>
|
||||
Contents
|
||||
</h2>
|
||||
</div>
|
||||
<ul>
|
||||
<li class="toclevel-1 tocsection-1">
|
||||
<a href="#Import_using_the_XML_dump"><span class="tocnumber">1</span> <span class="toctext">Import using the XML dump</span></a>
|
||||
</li>
|
||||
<li class="toclevel-1 tocsection-2">
|
||||
<a href="#Import_using_the_JSON_dump"><span class="tocnumber">2</span> <span class="toctext">Import using the JSON dump</span></a>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
<p>
|
||||
XOWA can import Wikidata through the command-line.
|
||||
</p>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Import_using_the_XML_dump">Import using the XML dump</span>
|
||||
</h2>
|
||||
<p>
|
||||
XOWA can build Wikidata using the XML dump at dumps.wikimedia.org/wikidatawiki/. This import is basically the same as an import of any other wiki.
|
||||
</p>
|
||||
<p>
|
||||
The script for the XML import follows.
|
||||
</p>
|
||||
<pre class='code'>
// build wikidata database; this only needs to be done once, whenever wikidata is updated
add ('www.wikidata.org' , 'util.cleanup') {delete_all = 'y';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'pages-articles';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'categorylinks';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'page_props';}
add ('www.wikidata.org' , 'util.download') {dump_type = 'image';}
add ('www.wikidata.org' , 'text.init');
add ('www.wikidata.org' , 'text.page');
add ('www.wikidata.org' , 'text.cat.core');
add ('www.wikidata.org' , 'text.cat.link');
add ('www.wikidata.org' , 'text.cat.hidden');
add ('www.wikidata.org' , 'text.term');
add ('www.wikidata.org' , 'text.css');
add ('www.wikidata.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
</pre>
|
||||
<h2>
|
||||
<span class="mw-headline" id="Import_using_the_JSON_dump">Import using the JSON dump</span>
|
||||
</h2>
|
||||
<p>
|
||||
As of v2.6.3, XOWA also provides basic support for building wikidata from the JSON dump. This support was added for the following reasons:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
<b>Current delay in XML dumps</b>: When this support was added, the last good Wikidata XML dump was more than 2 months old due to problems with dump generation. See: <a href="https://phabricator.wikimedia.org/T98585" rel="nofollow" class="external free">https://phabricator.wikimedia.org/T98585</a>
|
||||
</li>
|
||||
<li>
|
||||
<b>JSON dumps recommended</b>: Wikidata seems to prefer using the JSON dump over the XML dump. See: <a href="http://www.wikidata.org/wiki/Wikidata:Database_download" rel="nofollow" class="external free">http://www.wikidata.org/wiki/Wikidata:Database_download</a>
|
||||
</li>
|
||||
<li>
|
||||
<b>JSON dumps are more frequent</b>: The JSON dumps have been generated regularly on a weekly basis. In contrast, the XML dumps take 3 - 4 weeks.
|
||||
</li>
|
||||
</ul>
|
||||
<p>
|
||||
Despite these advantages, the JSON dump has limitations.
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
<b>Non-JSON pages not available</b>: The JSON dump doesn't provide other pages, such as the Main Page or MediaWiki pages. Only pages in the main and property namespaces are available. This is by design. See: <a href="https://lists.wikimedia.org/pipermail/wikidata/2015-June/006441.html" rel="nofollow" class="external free">https://lists.wikimedia.org/pipermail/wikidata/2015-June/006441.html</a>
|
||||
</li>
|
||||
<li>
|
||||
<b>Page metadata not available</b>: Certain properties are not available, such as page_id and last_modified. XOWA provides substitutes for these values, but they will not match the Wikimedia version.
|
||||
</li>
|
||||
</ul>
|
||||
<p>
|
||||
The script for the JSON import follows.
|
||||
</p>
|
||||
<pre class='code'>
add ('www.wikidata.org' , 'util.cleanup') {delete_all = 'y';}
// TODO: add ('www.wikidata.org' , 'util.download') {dump_type = 'wikidata-json';}
add ('www.wikidata.org' , 'wbase.json_dump');
add ('www.wikidata.org' , 'text.term');
add ('www.wikidata.org' , 'text.css');
add ('www.wikidata.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz|*.json');}
</pre>
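<p>
The listing above contains only the commands themselves. As a sketch, assuming they are wrapped the same way as the full build script on the Dev/Command-line/Dumps page (the wrapper lines are taken from that page, not from this one), a complete script file might look like the following. Because the download command is still marked TODO, the JSON dump is assumed to have been fetched manually beforehand.
</p>
<pre class='code'>
// assumption: same wrapper as the script on Dev/Command-line/Dumps
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
app.bldr.cmds {
  add ('www.wikidata.org' , 'util.cleanup') {delete_all = 'y';}
  // TODO: add ('www.wikidata.org' , 'util.download') {dump_type = 'wikidata-json';}
  add ('www.wikidata.org' , 'wbase.json_dump');
  add ('www.wikidata.org' , 'text.term');
  add ('www.wikidata.org' , 'text.css');
  add ('www.wikidata.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz|*.json');}
}
app.bldr.run;
</pre>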
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
<div id="mw-head" class="noprint">
|
||||
<div id="left-navigation">
|
||||
<div id="p-namespaces" class="vectorTabs">
|
||||
<h3>Namespaces</h3>
|
||||
<ul>
|
||||
<li id="ca-nstab-main" class="selected"><span><a id="ca-nstab-main-href" href="index.html">Page</a></span></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div id='mw-panel' class='noprint'>
|
||||
<div id='p-logo'>
|
||||
<a style="background-image: url(http://xowa.org/xowa_logo.png);" href="index.html" title="Visit the main page"></a>
|
||||
</div>
|
||||
<div class="portal" id='xowa-portal-home'>
|
||||
<h3>XOWA</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/index.html" title='Visit the main page'>Main page</a></li>
|
||||
<li><a href="http://xowa.org/screenshots.html" title='See screenshots of XOWA'>Screenshots</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Download_XOWA.html" title='Download the XOWA application'>Download XOWA</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Dashboard/Image_databases.html" title='Download offline wikis and image databases'>Download wikis</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-stargin'>
|
||||
<h3>Getting started</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Setup/System_requirements.html" title="Get XOWA's system requirements">Requirements</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Setup/Installation.html" title='Get instructions for installing XOWA'>Installation</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/Simple_Wikipedia.html" title='Learn how to set up Simple Wikipedia'>Simple Wikipedia</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/English_Wikipedia.html" title='Learn how to set up English Wikipedia'>English Wikipedia</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/App/Import/Other_wikis.html" title='Learn how to set up other Wikipedias'>Other Wikipedias</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-help'>
|
||||
<h3>Help</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/About.html" title='Get more information about XOWA'>About</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Contents.html" title='View a list of help topics'>Contents</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Media.html" title='Read what others have written about XOWA'>Media</a></li>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Help/Feedback.html" title='Questions? Comments? Leave feedback for XOWA'>Feedback</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-blog'>
|
||||
<h3>Blog</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://xowa.org/wiki/home/page/Blog.html" title="Follow XOWA's development process">Current</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-links'>
|
||||
<h3>Links</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="http://dumps.wikimedia.org/backup-index.html" title="Get wiki database dumps directly from Wikimedia">Wikimedia dumps</a></li>
|
||||
<li><a href="https://archive.org/search.php?query=xowa" title="Search archive.org for XOWA files">XOWA @ archive.org</a></li>
|
||||
<li><a href="http://en.wikipedia.org" title="Visit Wikipedia (and compare to XOWA!)">English Wikipedia</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="portal" id='xowa-portal-donate'>
|
||||
<h3>Donate</h3>
|
||||
<div class="body">
|
||||
<ul>
|
||||
<li><a href="https://archive.org/donate/index.php" title="Support archive.org!">archive.org</a></li><!-- listed first due to recent fire damages: http://blog.archive.org/2013/11/06/scanning-center-fire-please-help-rebuild/ -->
|
||||
<li><a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector" title="Support Wikipedia!">Wikipedia</a></li>
|
||||
<!-- <li><a href="" title="Support XOWA! (but only after you've supported archive.org and Wikipedia)">XOWA</a></li> -->
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||