Dev/Command-line/Dumpsx

From XOWA: the free, open-source, offline wiki application

XOWA can also generate file-dumps and html-dumps.

Please note that this script is for power users. It is not meant for casual users.

Please read through these instructions carefully. If you fail to follow these instructions, you may end up downloading millions of images by accident, and have your IP address banned by Wikimedia.

Also, the script will change in the future, and without any warning. There is no backward compatibility. Although the XOWA databases have a fixed format, the scripts do not. If you discover that your script breaks, please refer to this page, contact me for assistance, or go through the code.

The html-dumps are officially experimental. They will eventually be hardened for forward-compatibility, but they are not yet ready for it.

Although the XOWA Android app works fine with the html-dumps for English Wikipedia, I still need to run the html-dump code through more wikis. There is a high probability that I may find something that causes me to change the html-dump format. When that happens, the old html-dumps will not work with the newest XOWA Android app.

Basically, you should generate these html-dumps only for personal use or testing. Please do not generate them for many wikis, or distribute them en masse, unless you are prepared to redo all your work.

1 Background
2 Overview
3 Requirements
4 gfs
5 Terms (file-dump mode only)
- 5.1 lnki
- 5.2 orig
- 5.3 xfer
- 5.4 fsdb
6 HTML dump
- 6.1 Plain-html databases
7 Command-line
8 Script

Background

XOWA generates three types of dumps:

text-dumps: These contain wikitext for a page. For example: [[Earth]]
html-dumps: These contain HTML for a page (compiled from its wikitext). For example: <a href="/wiki/Earth">Earth</a>
file-dumps: These contain files for a page. For example: the binary data for https://commons.wikimedia.org/wiki/File:Africa_and_Europe_from_a_Million_Miles_Away.png (aka: the Blue Marble).

Text-dumps are generated within the program through Import online and Import offline.

Html-dumps and file-dumps are only generated through a command-line script.

This page describes the process to generate html-dumps and file-dumps.

Overview

The dump script works in the following way:

It loads a page.
It converts the wikitext to HTML.
- For file-dump mode, it compiles a list of [[File]] links.
- For html-dump mode, it also saves the HTML into the XOWA html databases.
It repeats until there are no more pages.
For file-dump mode, it also performs the following additional steps:
- It analyzes the list of [[File]] links to come up with a unique list of thumbs.
- It downloads the thumbs and creates the XOWA file databases.
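
The loop above can be sketched in Python. This is only an illustration of the control flow, not XOWA's actual code: `render_html`, the `pages` mapping, the `FILE_LINK` regex and the in-memory `html_db` are all assumptions made for the example.

```python
import re

# Hypothetical pattern: matches [[File:Name.ext]] or [[File:Name.ext|args]]
FILE_LINK = re.compile(r"\[\[File:([^\]|]+)(?:\|([^\]]*))?\]\]")

def render_html(wikitext):
    # Stand-in for the real wikitext-to-HTML conversion (assumption)
    return "<p>" + wikitext + "</p>"

def run_dump(pages, file_mode=True, html_mode=True):
    html_db = {}     # stands in for the XOWA html databases
    file_links = []  # every [[File]] link seen, duplicates included
    for title, wikitext in pages.items():
        html = render_html(wikitext)
        if file_mode:
            file_links.extend(FILE_LINK.findall(wikitext))
        if html_mode:
            html_db[title] = html
    # After the last page: reduce the links to a unique list of thumbs
    thumbs = sorted(set(file_links)) if file_mode else []
    return html_db, thumbs
```

The point of the second phase is that the same file requested with the same arguments on many pages only needs to be downloaded once.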

The script for simple wikipedia is listed below.

You should also refer to Dev/Command-line for general instructions on running by command-line.

Requirements

commons.wikimedia.org (file-dump mode only)

You will need the latest version of commons.wikimedia.org. Note that if you have an older version, you will have missing images or wrong size information.

For example, if you have a commons.wikimedia.org from 2015-04-22 and are trying to import a 2015-05-17 English Wikipedia, then any new images added after 2015-04-22 will not be picked up.

www.wikidata.org

You also need to have the latest version of www.wikidata.org. Note that English Wikipedia and other wikis use Wikidata through the {{#property}} call or Module code. If you have an earlier version, then data will be missing or out of date.

Hardware

You should have a recent-generation machine with relatively high-performance hardware, especially if you're planning to run the script for English Wikipedia.

For context, here is my current machine setup for generating the image dumps:

Processor: 3.5 GHz with 8 MB L3 cache (Intel Core i7-4770K)
Memory: 16 GB DDR3 SDRAM, DDR3-1600 (PC3-12800)
Hard Drive: 1TB SSD drive (Samsung 850 EVO)
Operating System: openSUSE 13.2

(Note: The hardware was assembled in late 2013 for about 1,600 US dollars.)

For English Wikipedia, it takes about 70 hours for the entire process.

Internet-connectivity (file-dump mode only; optional)

You should have a broadband connection to the internet. The script will need to download dump files from Wikimedia, and some dump files (like English Wikipedia) will be in the tens of GB.

You can opt to download these files separately and place them in the appropriate location beforehand. However, the script below assumes that the machine is always online. If you are offline, you will need to comment out the "util.download" lines yourself.
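
For offline runs, the "util.download" lines can be disabled with a text editor, or with a small helper such as the following sketch. It is not part of XOWA; the function name is hypothetical, and it assumes each `add` statement sits on a single line, as in the script on this page.

```python
def comment_out_downloads(script_text):
    """Prefix every active 'util.download' line with a gfs // comment."""
    out = []
    for line in script_text.splitlines(keepends=True):
        stripped = line.lstrip()
        # Skip lines that are already commented out
        if "'util.download'" in line and not stripped.startswith("//"):
            indent = line[: len(line) - len(stripped)]
            line = indent + "// " + stripped
        out.append(line)
    return "".join(out)
```

Running the helper twice is harmless, because lines that already start with `//` are left alone. Work on a copy of the script so that the original can be restored.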

Pre-existing image databases for your wiki (file-dump mode only; optional)

XOWA will automatically re-use the images from existing image databases so that you do not have to redownload them. This is particularly useful for large wikis where redownloading millions of images would be unwanted.

It is strongly advised that you download the image database for your wiki. The full list is here: http://xowa.sourceforge.net/image_dbs.html (if no image database exists for your wiki, you can still run the script).

If you have v1 image databases, they should be placed in /xowa/file/wiki_domain-prv. For example, English Wikipedia should have /xowa/file/en.wikipedia.org-prv/fsdb.main/fsdb.bin.0000.sqlite3
If you have v2 image databases, they should be placed in /xowa/wiki/wiki_domain/prv. For example, English Wikipedia should have /xowa/wiki/en.wikipedia.org/prv/en.wikipedia.org-file-ns.000-db.001.xowa
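
The two layouts differ only in where the "prv" marker goes. As a hedged illustration, the helper below builds the expected directory for either version. The function name and the default `/xowa` root are assumptions; only the directory patterns come from the two examples above.

```python
from pathlib import PurePosixPath

def prv_image_dir(wiki_domain, version, xowa_root="/xowa"):
    """Return the directory where pre-existing image databases are expected."""
    root = PurePosixPath(xowa_root)
    if version == "v1":
        # v1: /xowa/file/<wiki_domain>-prv
        return root / "file" / (wiki_domain + "-prv")
    if version == "v2":
        # v2: /xowa/wiki/<wiki_domain>/prv
        return root / "wiki" / wiki_domain / "prv"
    raise ValueError("version must be 'v1' or 'v2'")
```

For example, `prv_image_dir("en.wikipedia.org", "v2")` gives `/xowa/wiki/en.wikipedia.org/prv`.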

gfs

The script is written in the gfs format. This is a custom scripting format unique to XOWA. It is similar to JSON, but also supports commenting.

Unfortunately the error-handling for gfs is quite minimal. When making changes, please do them in small steps and be prepared to revert to backups.

The following is a brief list of rules:

Comments are either single-line, running from "//" to the end of the line ("\n"), or multi-line, running from "/*" to "*/". For example: // single-line comment or /* multi-line comment */
Booleans are "y" and "n" (yes / no or true / false). For example: enabled = 'y';
Numbers are 32-bit integers and are not enclosed in quotes. For example, count = 10000;
Strings are surrounded by apostrophes (') or quotes ("). For example: key = 'val';
Statements are terminated by a semi-colon (;). For example: procedure1;
Statements can take arguments in parentheses. For example: procedure1('argument1', 'argument2', 'argument3');
Statements are grouped with curly braces ({}). For example: group {procedure1; procedure2; procedure3;}
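
To make the value rules concrete, here is a small Python sketch that formats Python values as gfs text. It is not an XOWA tool: `gfs_value` and `gfs_group` are hypothetical names, and because this page does not describe how to escape an apostrophe inside a string, the sketch simply rejects such strings.

```python
def gfs_value(value):
    """Format a Python value according to the gfs rules listed above."""
    # bool must be tested before int: in Python, bool is a subclass of int
    if isinstance(value, bool):
        return "'y'" if value else "'n'"
    if isinstance(value, int):
        if not -2**31 <= value <= 2**31 - 1:
            raise ValueError("gfs numbers are 32-bit integers")
        return str(value)  # numbers are not quoted
    text = str(value)
    if "'" in text:
        raise ValueError("escaping is not covered by this sketch")
    return "'" + text + "'"

def gfs_group(name, **props):
    """Build 'name {key = value; ...}' with every statement ending in ';'."""
    body = " ".join(f"{key} = {gfs_value(val)};" for key, val in props.items())
    return f"{name} {{{body}}}"
```

For example, `gfs_group("hdump_bldr", enabled=True, zip_tid_html="raw")` produces `hdump_bldr {enabled = 'y'; zip_tid_html = 'raw';}`.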

Terms (file-dump mode only)

lnki

A lnki is short for "link internal". It refers to all wikitext with the double bracket syntax: [[A]]. A more elaborate example for files would be [[File:A.png|thumb|200x300px|upright=.80]]. Note that the abbreviation was chosen to differentiate it from lnke, which is short for "link external". For the purposes of the script, all lnki data comes from the current wiki's data dump.
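
As a rough illustration of the data a file lnki carries, the following Python sketch splits the example above into its parts. It is a simplification, not XOWA's parser: the function name, the returned keys, and the small set of recognized arguments are assumptions, and nested links inside captions are not handled.

```python
import re

# Matches sizes such as 200px, x300px or 200x300px
SIZE = re.compile(r"^(\d*)(?:x(\d+))?px$")

def parse_file_lnki(lnki):
    """Split a [[File:...]] lnki into target, type, size and upright."""
    if not (lnki.startswith("[[") and lnki.endswith("]]")):
        raise ValueError("not a lnki")
    target, *args = lnki[2:-2].split("|")
    info = {"target": target, "type": None,
            "width": None, "height": None, "upright": None}
    for arg in args:
        size = SIZE.match(arg)
        if arg in ("thumb", "frame", "frameless"):
            info["type"] = arg
        elif size and (size.group(1) or size.group(2)):
            info["width"] = int(size.group(1)) if size.group(1) else None
            info["height"] = int(size.group(2)) if size.group(2) else None
        elif arg.startswith("upright="):
            info["upright"] = float(arg[len("upright="):])
    return info
```

Two lnkis that point at the same file with different sizes describe different thumbs, which is why the size arguments matter to the file-dump.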

orig

An orig is short for "original file". It refers to the original file metadata. For the purposes of this script, all orig data comes from commons.wikimedia.org.

xfer

An xfer is short for "transfer file". It refers to the actual file to be downloaded.

fsdb

The fsdb is short for "file system database". It refers to the internal table format of the XOWA image databases.

HTML dump

Plain-html databases

The script on this page generates pages that are gz-compressed and xowa.mediawiki-compressed. If you just want plain HTML pages to use in another application, you can substitute this command:

hdump_bldr {enabled = 'y'; zip_tid_html = 'raw'; hzip_enabled = 'n'; hzip_diff = 'n';}

After the build completes, you can open up any of the XOWA HTML databases and run the following SQL:

SELECT * FROM html LIMIT 10;
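
The same query can be run from another application. The sketch below uses Python's standard sqlite3 module and assumes that the XOWA HTML database opens as an ordinary SQLite file. The function name is hypothetical, and since the column layout of the html table is not documented on this page, the column names are read from the query result rather than hard-coded.

```python
import sqlite3

def peek_html(db_path, limit=10):
    """Return (column names, rows) for the first rows of the html table."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("SELECT * FROM html LIMIT ?", (limit,))
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()
```

The rows are readable as plain HTML only if the databases were built with zip_tid_html = 'raw' and hzip_enabled = 'n'.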

Command-line

Dump scripts need more memory than ordinary XOWA use. You should have at least 8 GB of memory, and preferably 16 GB. Use a command line like the following:

java -Xmx15000m -XX:+HeapDumpOnOutOfMemoryError -jar xowa_linux_64.jar --app_mode cmd --cmd_file make_wiki.gfs --show_license n --show_args n
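
If the dump is launched from another program, the command line above can be assembled as an argument list. This Python sketch only rebuilds the command shown above. The function and parameter names are assumptions, and the heap size should be adjusted to the memory actually available.

```python
def build_dump_command(jar="xowa_linux_64.jar", cmd_file="make_wiki.gfs",
                       heap_mb=15000):
    """Return the java command line for a dump run as a list of arguments."""
    return [
        "java",
        f"-Xmx{heap_mb}m",                  # maximum Java heap, in MB
        "-XX:+HeapDumpOnOutOfMemoryError",  # write a heap dump if memory runs out
        "-jar", jar,
        "--app_mode", "cmd",
        "--cmd_file", cmd_file,
        "--show_license", "n",
        "--show_args", "n",
    ]
```

The list can be passed to `subprocess.run(build_dump_command(), check=True)` from the directory that contains the jar and the gfs file.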

Script

app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
app.bldr.cmds {
  // build commons database; this only needs to be done once, whenever commons is updated
  add     ('commons.wikimedia.org' , 'util.cleanup')          {delete_all = 'y';}  
  add     ('commons.wikimedia.org' , 'util.download')         {dump_type = 'pages-articles';}
  add     ('commons.wikimedia.org' , 'util.download')         {dump_type = 'categorylinks';}
  add     ('commons.wikimedia.org' , 'util.download')         {dump_type = 'page_props';}
  add     ('commons.wikimedia.org' , 'util.download')         {dump_type = 'image';}
  add     ('commons.wikimedia.org' , 'text.init');
  add     ('commons.wikimedia.org' , 'text.page');
  add     ('commons.wikimedia.org' , 'text.cat.core');
  add     ('commons.wikimedia.org' , 'text.cat.link');
  add     ('commons.wikimedia.org' , 'text.cat.hidden');
  add     ('commons.wikimedia.org' , 'text.term');
  add     ('commons.wikimedia.org' , 'text.css');
  add     ('commons.wikimedia.org' , 'wiki.image');
  add     ('commons.wikimedia.org' , 'file.page_regy')        {build_commons = 'y';}
  add     ('commons.wikimedia.org' , 'wiki.page_dump.make');
  add     ('commons.wikimedia.org' , 'wiki.redirect')         {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
  add     ('commons.wikimedia.org' , 'util.cleanup')          {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}

  // build wikidata database; this only needs to be done once, whenever wikidata is updated
  add     ('www.wikidata.org'      , 'util.cleanup')          {delete_all = 'y';}
  add     ('www.wikidata.org'      , 'util.download')         {dump_type = 'pages-articles';}
  add     ('www.wikidata.org'      , 'util.download')         {dump_type = 'categorylinks';}
  add     ('www.wikidata.org'      , 'util.download')         {dump_type = 'page_props';}
  add     ('www.wikidata.org'      , 'util.download')         {dump_type = 'image';}
  add     ('www.wikidata.org'      , 'text.init');
  add     ('www.wikidata.org'      , 'text.page');
  add     ('www.wikidata.org'      , 'text.cat.core');
  add     ('www.wikidata.org'      , 'text.cat.link');
  add     ('www.wikidata.org'      , 'text.cat.hidden');
  add     ('www.wikidata.org'      , 'text.term');
  add     ('www.wikidata.org'      , 'text.css');
  add     ('www.wikidata.org'      , 'util.cleanup')          {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}

  // build simple.wikipedia.org
  // NOTE!: deletes all files in /xowa/wiki/simple.wikipedia.org
  add     ('simple.wikipedia.org'  , 'util.cleanup')          {delete_all = 'y';}
  
  // download wikitext dump from http://dumps.wikimedia.org/backup-index.html
  add     ('simple.wikipedia.org'  , 'util.download')         {dump_type = 'pages-articles';}

  // download category dump from http://dumps.wikimedia.org/backup-index.html
  add     ('simple.wikipedia.org'  , 'util.download')         {dump_type = 'categorylinks';}

  // download page_props dump from http://dumps.wikimedia.org/backup-index.html (needed for hidden categories)
  add     ('simple.wikipedia.org'  , 'util.download')         {dump_type = 'page_props';}

  // download image dump from http://dumps.wikimedia.org/backup-index.html
  add     ('simple.wikipedia.org'  , 'util.download')         {dump_type = 'image';}
  
  // initial step to create stub databases for wikitext
  add     ('simple.wikipedia.org'  , 'text.init');
  
  // calculate redirect_id for #REDIRECT pages. needed for html databases
  add     ('simple.wikipedia.org'  , 'text.page')             {redirect_id_enabled = 'y';}
  
  // generates title-search database
  add     ('simple.wikipedia.org'  , 'text.search');

  // generates desktop css
  add     ('simple.wikipedia.org'  , 'text.css');

  // generates main category database
  add     ('simple.wikipedia.org'  , 'text.cat.core');
  
  // generates category-to-page databases
  add     ('simple.wikipedia.org'  , 'text.cat.link');
  
  // identifies hidden categories
  add     ('simple.wikipedia.org'  , 'text.cat.hidden');
  
  // performs final steps for wikitext databases
  add     ('simple.wikipedia.org'  , 'text.term');
  
  // create local "page" tables in each "text" database for "lnki_temp"
  add     ('simple.wikipedia.org' , 'wiki.page_dump.make');
  
  // create a redirect table for pages in the File namespace
  add     ('simple.wikipedia.org' , 'wiki.redirect')         {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
  
  // create an "image" table to get the metadata for all files in the current wiki
  add     ('simple.wikipedia.org' , 'wiki.image');

  // NOTE!: deletes all downloaded bz2 / gz / xml / sql files
  add     ('simple.wikipedia.org' , 'util.cleanup')          {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
  
  // parse every page in the listed namespace and gather data on their lnkis.
  // this step will take the longest amount of time.
  add     ('simple.wikipedia.org' , 'file.lnki_temp') {
    // save data every # of pages
    commit_interval = 10000; 

    // print progress to command-line shell every # of pages
    progress_interval = 50;

    // free memory by flushing internal caches every # of pages
    cleanup_interval = 50;

    // specify # of pages to read into memory at a time, where # is in MB. For example, 25 means read approximately 25 MB of page text into memory
    select_size = 25;

    // namespaces to parse. See en.wikipedia.org/wiki/Help:Namespaces
    ns_ids = '0|4|14|100';

    // generate html-dump databases
    hdump_bldr {
      // enable / disable html-dump generation
      enabled = 'y';
      
      // 'raw'  : no compression; stores in plain text
      // 'gz'   : compresses to gz
      // 'bzip2': compresses to bz2
      zip_tid_html = 'gz';

      // 'y': does secondary mediawiki-specific compression to make databases even smaller (about 30%)
      // 'n': does not do secondary compression
      hzip_enabled = 'y';
      
      // post-processing check to make sure XOWA-compression format decompresses back to original format
      hzip_diff = 'y';
    }
  }
  
  // aggregate the lnkis
  add     ('simple.wikipedia.org' , 'file.lnki_regy');
  
  // generate orig metadata for files in the current wiki (for example, for pages in en.wikipedia.org/wiki/File:*)
  add     ('simple.wikipedia.org' , 'file.page_regy')        {build_commons = 'n';}
  
  // generate all orig metadata for all lnkis
  add     ('simple.wikipedia.org' , 'file.orig_regy');
  
  // generate list of files to download based on "orig_regy" and XOWA image code
  add     ('simple.wikipedia.org' , 'file.xfer_temp.thumb');
  
  // aggregate list one more time
  add     ('simple.wikipedia.org' , 'file.xfer_regy');

  // identify images that have already been downloaded
  add     ('simple.wikipedia.org' , 'file.xfer_regy_update');
  
  // download images. This step may also take a long time, depending on how many images are needed
  add     ('simple.wikipedia.org' , 'file.fsdb_make') {
    commit_interval = 1000; progress_interval = 200; select_interval = 10000;
    ns_ids = '0|4|14|100';
    
    // specify whether original wiki databases are v1 (.sqlite3) or v2 (.xowa)
    src_bin_mgr__fsdb_version = 'v2';
    
    // always redownload certain files
    src_bin_mgr__fsdb_skip_wkrs = 'page_gt_1|small_size';
    
    // allow downloads from wikimedia
    src_bin_mgr__wmf_enabled = 'y';
  }
  
  // generate registry of original metadata by file title
  add     ('simple.wikipedia.org' , 'file.orig_reg');
  
  // drop page_dump tables
  add     ('simple.wikipedia.org' , 'wiki.page_dump.drop');
}
app.bldr.run;