Options/Import
From XOWA: the free, open-source, offline wiki application
Wiki setup
Page storage format: [1]
Import process
Dump servers: [2]
Import bz2 by stdout: [3]
Import bz2 by stdout process: [4]
Custom wiki commands: [5]
Download xowa_common.css: [6]
Delete xml file after import: [7]
PageRank
PageRank iteration max: [8]
Database layout
Max file size for single text database: [9]
Max file size for single file database: [9]
Max file size for single html database: [9]
Decompression apps
Decompress bz2 file: [10]
Decompress zip file: [11]
Decompress gz file: [12]
Notes
^ Choose one of the following (default is gzip (.gz)):
- text: fastest for reading but has no compression. Simple Wikipedia will be 300 MB
- gzip: (default) fast for reading and has compression. Simple Wikipedia will be 100 MB
- bzip2: very slow for reading but has the best compression. Simple Wikipedia will be 85 MB (Note: the performance difference is very noticeable. Please try this with Simple Wikipedia first before using it on a large wiki.)
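For a rough feel of the trade-off, here is a minimal Java sketch (assuming Apache Commons Compress is available for the bzip2 stream; this is only an illustration, not XOWA code) that writes the same wikitext through each format and prints the resulting sizes:

import java.io.*;
import java.nio.file.*;
import java.util.zip.GZIPOutputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class StorageFormatDemo {
    public static void main(String[] args) throws IOException {
        byte[] wikitext = Files.readAllBytes(Paths.get(args[0])); // any sample wikitext/xml file

        // text: stored as-is, no compression
        long plain = wikitext.length;

        // gzip: fast to read back, moderate compression
        ByteArrayOutputStream gz = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(gz)) { out.write(wikitext); }

        // bzip2: slow to read back, best compression
        ByteArrayOutputStream bz = new ByteArrayOutputStream();
        try (OutputStream out = new BZip2CompressorOutputStream(bz)) { out.write(wikitext); }

        System.out.printf("text=%d bytes, gzip=%d bytes, bzip2=%d bytes%n", plain, gz.size(), bz.size());
    }
}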
^ Enter a list of server urls, separated by a comma and a newline.
- The default value is:
  http://dumps.wikimedia.your.org/, http://dumps.wikimedia.org/, http://wikipedia.c3sl.ufpr.br/, http://ftp.fi.muni.cz/pub/wikimedia/
- Note that servers are prioritized from left to right. In the default example, your.org will be tried first. If it is offline, then the next server -- dumps.wikimedia.org -- will be tried, etc.
- See App/Import/Download/Dump_servers for more info
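A minimal sketch of this left-to-right fallback, using the default server list above (the dump path and method names are illustrative, not XOWA's actual download code):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.*;

public class DumpServerFallback {
    // Default server list, tried in order from left to right
    static final String[] SERVERS = {
        "http://dumps.wikimedia.your.org/",
        "http://dumps.wikimedia.org/",
        "http://wikipedia.c3sl.ufpr.br/",
        "http://ftp.fi.muni.cz/pub/wikimedia/"
    };

    // Tries each server in priority order; returns after the first successful download
    static void download(String dumpPath, Path target) throws IOException {
        for (String server : SERVERS) {
            try (InputStream in = new URL(server + dumpPath).openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                return; // success: stop trying further mirrors
            } catch (IOException e) {
                System.err.println(server + " failed (" + e.getMessage() + "); trying next server");
            }
        }
        throw new IOException("all dump servers failed for " + dumpPath);
    }

    public static void main(String[] args) throws IOException {
        download("simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2",
                 Paths.get("simplewiki-latest-pages-articles.xml.bz2"));
    }
}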
^ Note: this option only applies if the "Custom wiki commands" option is wiki.download,wiki.import (wiki.unzip must be removed).
Select the method for importing a wiki dump bz2 file (default is checked):
- checked : import through a native process's stdout. This is faster, but may not work on all operating systems. A 95 MB file takes 85 seconds.
- unchecked: import through Apache Commons' Java bz2 compression library. This is slower, but works on all operating systems. A 95 MB file takes 215 seconds.
For example, to use lbzip2 (a parallel bzip2 decompressor) as the stdout process:
- install lbzip2 (Debian): sudo apt-get install lbzip2
- change "Import bz2 by stdout process" to lbzip2 with the arguments -dkc "~{src}"
Both read paths are illustrated in the sketch below.
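A minimal Java sketch of the two read paths, assuming Apache Commons Compress for the unchecked route and a configured command such as lbzip2 -dkc for the checked route (class and method names are illustrative, not XOWA's internals):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class Bz2ReadPaths {
    // "unchecked" path: decompress inside the JVM with Apache Commons Compress
    static InputStream openWithJavaLibrary(String bz2Path) throws IOException {
        return new BZip2CompressorInputStream(new FileInputStream(bz2Path));
    }

    // "checked" path: spawn a native decompressor and read its stdout
    // (equivalent to an "Import bz2 by stdout process" of: lbzip2 -dkc "~{src}")
    static InputStream openWithNativeProcess(String bz2Path) throws IOException {
        Process p = new ProcessBuilder("lbzip2", "-dkc", bz2Path).start();
        return p.getInputStream(); // the decompressed XML streams out of stdout
    }

    public static void main(String[] args) throws IOException {
        try (InputStream xml = openWithJavaLibrary(args[0])) {
            System.out.println("first byte: " + xml.read());
        }
    }
}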
^ Process used to decompress bz2 by stdout. Recommended: Operating System default
^ Select custom commands (default is wiki.download,wiki.unzip,wiki.import)
Short version:
- For fast imports, but high disk space usage, use wiki.download,wiki.unzip,wiki.import
- For slow imports, but low disk space usage, use wiki.download,wiki.import
The individual commands:
- wiki.download: downloads the wiki data dump from the dump server. A file will be generated at "/xowa/wiki/simple.wikipedia.org/simplewiki-latest-pages-articles.xml.bz2"
- wiki.unzip: unzips the xml file from the wiki data dump. A file will be created at "/xowa/wiki/simple.wikipedia.org/simplewiki-latest-pages-articles.xml" (assuming the corresponding .xml.bz2 exists). If this step is omitted, then XOWA will read directly from the .bz2 file. Although this uses less space (no .xml file to unzip), it is significantly slower. Also, due to a program limitation, the progress percentage will not be accurate; it may hover at 99.99% for several minutes.
- wiki.import: imports the xml file. A wiki will be imported from "/xowa/wiki/simple.wikipedia.org/simplewiki-latest-pages-articles.xml"
The two common combinations (see the sketch after this list):
- wiki.download,wiki.unzip,wiki.import (AKA: fastest): This is the default. It is the fastest to set up, but takes more space. For example, English Wikipedia will set up in 5 hours and requires at least 45 GB of temp space.
- wiki.download,wiki.import (AKA: smallest): This reads directly from the bz2 file. It uses the least disk space, but takes more time. For example, English Wikipedia will set up in 8 hours but will only use 5 GB of temp space.
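A minimal sketch of how the selected commands compose into a pipeline (a simplified illustration; the runner below is not XOWA's actual implementation):

import java.util.List;

public class ImportPipeline {
    // Runs the selected commands in order; "smallest" simply omits wiki.unzip,
    // so wiki.import reads straight from the .bz2 file instead of an unpacked .xml
    static void run(List<String> commands, String wiki) {
        for (String cmd : commands) {
            switch (cmd) {
                case "wiki.download" -> System.out.println("downloading " + wiki + " dump (.xml.bz2)");
                case "wiki.unzip"    -> System.out.println("unzipping dump to .xml (needs extra temp space)");
                case "wiki.import"   -> System.out.println("importing pages into the " + wiki + " databases");
                default              -> throw new IllegalArgumentException("unknown command: " + cmd);
            }
        }
    }

    public static void main(String[] args) {
        // fastest (default): more temp disk, less time
        run(List.of("wiki.download", "wiki.unzip", "wiki.import"), "simple.wikipedia.org");
        // smallest: less temp disk, more time
        run(List.of("wiki.download", "wiki.import"), "simple.wikipedia.org");
    }
}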
^ Affects the xowa_common.css in /xowa/user/anonymous/wiki/wiki_name/html/. Occurs when importing a wiki. (default is checked)
- checked : downloads xowa_common.css from the Wikimedia servers. Note that this stylesheet will be the latest copy, but it may cause unexpected formatting in XOWA.
- unchecked: (default) copies xowa_common.css from /xowa/bin/any/html/html/import/. Note that this stylesheet is the one XOWA is coded against. It is the most stable, but will not have the latest logo.
^ (Only relevant for wiki.unzip) Choose one of the following (default is checked):
- checked : (default) the .xml file is automatically deleted once the import process completes
- unchecked: the .xml file is untouched
^ Specify one of the following (default is 0):
- 0 : (default) page rank is disabled
- (number greater than 1): page rank will be calculated until it finishes or the maximum number of iterations is reached. For more info, see Help/Features/Search/Build
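For orientation, a minimal sketch of an iteration cap applied to the standard PageRank update (a generic illustration, not XOWA's search-build code; the damping factor and convergence threshold are assumed values):

import java.util.Arrays;

public class PageRankSketch {
    // links[i] = indices of the pages that page i links to
    static double[] pageRank(int[][] links, int iterationMax) {
        int n = links.length;
        double damping = 0.85, epsilon = 1e-6;            // assumed constants
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < iterationMax; iter++) { // stop at the configured cap...
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int i = 0; i < n; i++)
                for (int target : links[i])
                    next[target] += damping * rank[i] / links[i].length;

            double delta = 0;
            for (int i = 0; i < n; i++) delta += Math.abs(next[i] - rank[i]);
            rank = next;
            if (delta < epsilon) break;                   // ...or earlier, once the ranks settle
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] links = { {1, 2}, {2}, {0} };             // tiny 3-page example
        System.out.println(Arrays.toString(pageRank(links, 25)));
    }
}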
^ a b c Enter a number in MB to represent the cutoff for generating a set of page databases as one file or many files (default is 1500)
XOWA generates three types of page databases:
- text: These are wikitext databases and have entries like ''italics''. They have -text- in their file name.
- html: These are the html-dump databases and have entries like <i>italics</i>. They have -html- in their file name.
- file: These are image databases which have the raw binary images. They have -file- in their file name.
Different wikis will have different numbers of databases for a given set:
- For small wikis, XOWA generates one database for the entire wiki. For example, Simple Wikipedia will just have "simple.wikipedia.org-text.xowa". This way is preferred as it is simpler.
- For large wikis, XOWA generates many databases for the entire wiki. For example, English Wikipedia will have "en.wikipedia.org-text-ns.000.xowa", "en.wikipedia.org-text-ns.000-db.002.xowa", etc. This way is necessary because some file systems don't support large databases. For example, creating an "en.wikipedia.org-text.xowa" file would generate a 20 GB file. This 20 GB file will generally fail on flash drives (FAT32), as well as Android (the SQLite library allows 2 GB max).
These options can force XOWA to generate a wiki using either one database (Simple Wikipedia style) or many databases (English Wikipedia style). It does this by using a cutoff based on the size of the XML database dump. For example, 1500 means that a wiki with a dump file size of 1.5 GB or less will generate a single file, while any wiki with a dump file size larger than 1.5 GB will generate multiple files. A sketch of this decision follows this note.
- If you always want to generate a set with only one file, set the value to a large number like 999,999 (999 GB)
- If you always want to generate a set with multiple files, set the value to 0.
- Otherwise, set the value to a cutoff: wikis below that cutoff will be "single file"; wikis above it will be "multiple files".
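A minimal sketch of the cutoff rule (the 1500 MB default comes from this note; the dump sizes in the example are only illustrative):

public class DatabaseLayout {
    // Returns true when the wiki should be built as a single .xowa database,
    // i.e. when the XML dump is no larger than the configured cutoff (in MB)
    static boolean useSingleFile(long dumpSizeInMb, long cutoffInMb) {
        return dumpSizeInMb <= cutoffInMb;
    }

    public static void main(String[] args) {
        long cutoff = 1500;                                 // default: 1.5 GB
        System.out.println(useSingleFile(200, cutoff));     // small dump  -> true  (one file)
        System.out.println(useSingleFile(20_000, cutoff));  // large dump  -> false (many files)
    }
}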
^ Decompress bz2 file (needed for importing dumps). Recommended: 7-Zip
^ Decompress zip file (needed for importing dumps). Recommended: 7-Zip
^ Decompress gz file (needed for importing dumps). Recommended: 7-Zip