From bda10e766086a01a2d63a1d4f93e4a0e5eae9ebb Mon Sep 17 00:00:00 2001
From: gnosygnu
- The v1.9.2 release has several minor changes for English Wiktionary and Wikisource. Some were quite time-consuming, including one Luaj issue with .pairs(). I also decided to hold off on more HTML dump work, because I want to see how they perform in Android before I commit to the HTML dump format. Towards that end, I started working on the Android version this week, though so far most of the work is quite experimental.
+ The v1.9.2 release has several minor changes for English Wiktionary and Wikisource. Some were quite time-consuming, including one Luaj issue with .pairs(). I also decided to hold off on more HTML dump work, because I want to see how they perform in Android before I commit to the HTML dump format. Towards that end, I started working on the Android version this week, though so far most of the work is quite experimental.
v1.9.3 will involve more Android work along with HTML dumps. For wikis, I'm going to do Hungarian (rebuild) and Esperanto (new).
@@ -436,7 +436,7 @@
The v1.8.2 update is larger than usual. I had to regenerate the language files because MediaWiki v1.24 added "!" as a magic word and German Wikipedia dropped Template:!. Since this was a low-level change, it forced a rebuild of all the language files. I also took the opportunity to move the language files from
- In addition, I also worked on a number of performance enhancements for pages with heavy Scribunto usage. The worst is https://en.wiktionary.org/wiki/water . On XOWA this used to take 1.5 GB of memory and 50 seconds. Now it takes 200 MB of memory and 35 seconds. This is still bloated, but keep in mind the official site takes about 20-25 seconds[1]. Also, this heavy Scribunto usage only affects a small number of pages (and mostly on en.wiktionary). I'll try to add some more incremental improvements over the next few releases, but ultimately this may have to be resolved by offline HTML dumps.
+ In addition, I also worked on a number of performance enhancements for pages with heavy Scribunto usage. The worst is https://en.wiktionary.org/wiki/water . On XOWA this used to take 1.5 GB of memory and 50 seconds. Now it takes 200 MB of memory and 35 seconds. This is still bloated, but keep in mind the official site takes about 20-25 seconds[1]. Also, this heavy Scribunto usage only affects a small number of pages (and mostly on en.wiktionary). I'll try to add some more incremental improvements over the next few releases, but ultimately this may have to be resolved by offline HTML dumps.
Finally, I'm trying to upload the German wikis now, but the upload speed is horrendous and at 33 GB, it'll take several days. I had the same problem last Sunday, though this time it's much worse. If this doesn't resolve by tomorrow, I'll contact archive.org for help, but in the meantime, German wikis will be late.
@@ -450,7 +450,7 @@
^ If you want to reproduce this, try the following:
/xowa/user/anonymous/lang/xowa/
to /xowa/bin/any/xowa/cfg/lang/core/
. I also did the same for /xowa/user/anonymous/wiki/#cfg/
to /xowa/bin/any/xowa/cfg/wiki/core/
This item is self-explanatory. The XOWA Android app is getting more stable, so I felt it was time to document the generation of the HTML databases.
diff --git a/home/wiki/Blog/2016-02.html b/home/wiki/Blog/2016-02.html
index cac7f7918..63c69a6e5 100644
--- a/home/wiki/Blog/2016-02.html
+++ b/home/wiki/Blog/2016-02.html
@@ -182,7 +182,7 @@ Fix for English Wiktionary sections not expanding correctly
- This bug occurs when opening up any English Wiktionary page. Each page will have Translation tables with a "Hide" / "Show" button. The following occurred when viewing these pages (for example: https://en.wiktionary.org/wiki/green)
+ This bug occurs when opening up any English Wiktionary page. Each page will have Translation tables with a "Hide" / "Show" button. The following occurred when viewing these pages (for example: https://en.wiktionary.org/wiki/green)
@@ -824,7 +824,7 @@
@@ -1403,7 +1403,7 @@
@@ -3569,25 +3569,25 @@ Scribunto.Luaj: Handle string.match for empty strings and balanced regexes; EX:string.match("", "%b<>", ""). See: https://en.wikipedia.org/wiki/Woburn,_Massachusetts
@@ -3669,7 +3669,7 @@ Lang: Show thumbs on left for rtl languages. See: https://ar.wikipedia.org/wiki/منطقة_غويانا
@@ -1313,10 +1313,10 @@
@@ -2563,7 +2563,7 @@
@@ -2804,7 +2804,7 @@
diff --git a/home/wiki/Change_log/2016.html b/home/wiki/Change_log/2016.html
index c3042b27e..65a9fb315 100644
--- a/home/wiki/Change_log/2016.html
+++ b/home/wiki/Change_log/2016.html
@@ -988,7 +988,7 @@ Resolved by: Always reload page when going back / forward on wikinews (do not use cached html).
@@ -3497,7 +3497,7 @@ Example: {{:missing}} -> [[:missing]] x> [[Template:Missing]].
@@ -4029,7 +4029,7 @@ Resolved by: Implement basic functionality for {{#categorytree}}.
@@ -4814,7 +4814,7 @@ Resolved by: Change mediawiki.gadget.navframe.js to explicitly set style.display.
@@ -4966,13 +4966,13 @@
@@ -5051,7 +5051,7 @@
diff --git a/home/wiki/Change_log/2017.html b/home/wiki/Change_log/2017.html
index af7ae83e3..29057144b 100644
--- a/home/wiki/Change_log/2017.html
+++ b/home/wiki/Change_log/2017.html
@@ -594,7 +594,7 @@
diff --git a/home/wiki/Change_log/v3.6.4.1.html b/home/wiki/Change_log/v3.6.4.1.html
index 0a2dbf63f..ba6249713 100644
--- a/home/wiki/Change_log/v3.6.4.1.html
+++ b/home/wiki/Change_log/v3.6.4.1.html
@@ -184,7 +184,7 @@ Resolved by: Include "mediawiki.page.gallery.css" if page has gallery.
@@ -235,7 +235,7 @@ Example: {{:missing}} -> [[:missing]] x> [[Template:Missing]].
diff --git a/home/wiki/Change_log/v3.9.2.1.html b/home/wiki/Change_log/v3.9.2.1.html
index 4acc5c1b2..aa96549c6 100644
--- a/home/wiki/Change_log/v3.9.2.1.html
+++ b/home/wiki/Change_log/v3.9.2.1.html
@@ -279,7 +279,7 @@
- https://he.wikipedia.org
+ https://he.wikipedia.org
https://he.wiktionary.org
https://he.wikisource.org
https://he.wikivoyage.org
diff --git a/home/wiki/Dashboard/Import/Online.html b/home/wiki/Dashboard/Import/Online.html
index d1378e0f4..4cd132f24 100644
--- a/home/wiki/Dashboard/Import/Online.html
+++ b/home/wiki/Dashboard/Import/Online.html
@@ -145,7 +145,7 @@
download
- XOWA can generate two types of dumps: file-dumps and html-dumps
+ XOWA can make complete wikis which will have the following:
+
+
+ This process is run by a custom command-line make
script.
@@ -61,53 +72,74 @@
1 Overview
- The download-thumbs script downloads all thumbs for pages in a specific wiki. It works in the following way:
+ The make
script works in the following way:
cmd
C:\xowa
, run cd C:\xowa
+ make_xowa.gfs
with a text-editor.
+
+ java -jar C:\xowa\xowa_windows_64.jar --app_mode cmd --cmd_file C:\xowa\make_xowa.gfs --show_license n --show_args n
+
+ The make
script should be run in 3 parts:
+
make_commons
script: Builds commons.wikimedia.org which is needed to provide image metadata for the download
+ make_wikidata
script: Builds www.wikidata.org which is needed for data from {{#property}} calls or Module code.
+ make_wiki
script: Builds the actual wiki
+
+ Note that other wikis can re-use the same commons and wikidata. For example, if you want to build enwiki and dewiki, you only need to build make_commons
and make_wikidata
once.
+
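For illustration, here is a sketch of what that reuse looks like. This is hypothetical: it assumes a second wiki (dewiki) is built with exactly the same steps as the simple.wikipedia.org example in the make_wiki script below, and that the commons and wikidata databases from a previous run are already in place.

// hypothetical second-wiki build (sketch): only the make_wiki steps are repeated, with the domain swapped;
// make_commons and make_wikidata are NOT run again
app.bldr.cmds {
  add ('de.wikipedia.org' , 'util.cleanup')  {delete_all = 'y';}
  add ('de.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
  // ... the remaining make_wiki commands below, unchanged except for the domain ...
}
app.bldr.run;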
make_commons
+ make_xowa.gfs
+
+app.bldr.pause_at_end_('n');
+app.scripts.run_file_by_type('xowa_cfg_app');
+app.cfg.set_temp('app', 'xowa.app.web.enabled', 'y');
+app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.text', '0');
+app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.html', '0');
+app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.file', '0');
+app.bldr.cmds {
+  // build commons database; this only needs to be done once, whenever commons is updated
+  add ('commons.wikimedia.org' , 'util.cleanup') {delete_all = 'y';}
+  add ('commons.wikimedia.org' , 'util.download') {dump_type = 'pages-articles';}
+  add ('commons.wikimedia.org' , 'util.download') {dump_type = 'page_props';}
+  add ('commons.wikimedia.org' , 'util.download') {dump_type = 'image';}
+  add ('commons.wikimedia.org' , 'text.init');
+  add ('commons.wikimedia.org' , 'text.page');
+  add ('commons.wikimedia.org' , 'text.term');
+  add ('commons.wikimedia.org' , 'text.css');
+  add ('commons.wikimedia.org' , 'wiki.page_props');
+  add ('commons.wikimedia.org' , 'wiki.image');
+  add ('commons.wikimedia.org' , 'file.page_regy') {build_commons = 'y'}
+  add ('commons.wikimedia.org' , 'wiki.page_dump.make');
+  add ('commons.wikimedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
+  add ('commons.wikimedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
+}
+app.bldr.run;
+
make_wikidata
+ make_xowa.gfs
+
+app.bldr.pause_at_end_('n');
+app.scripts.run_file_by_type('xowa_cfg_app');
+app.cfg.set_temp('app', 'xowa.app.web.enabled', 'y');
+app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.text', '0');
+app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.html', '0');
+app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.file', '0');
+app.bldr.cmds {
+  // build wikidata database; this only needs to be done once, whenever wikidata is updated
+  add ('www.wikidata.org' , 'util.cleanup') {delete_all = 'y';}
+  add ('www.wikidata.org' , 'util.download') {dump_type = 'pages-articles';}
+  add ('www.wikidata.org' , 'util.download') {dump_type = 'categorylinks';}
+  add ('www.wikidata.org' , 'util.download') {dump_type = 'page_props';}
+  add ('www.wikidata.org' , 'util.download') {dump_type = 'image';}
+  add ('www.wikidata.org' , 'text.init');
+  add ('www.wikidata.org' , 'text.page');
+  add ('www.wikidata.org' , 'text.term');
+  add ('www.wikidata.org' , 'text.css');
+  add ('www.wikidata.org' , 'wiki.page_props');
+  add ('www.wikidata.org' , 'wiki.categorylinks');
+  add ('www.wikidata.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
+}
+app.bldr.run;
+
make_wiki
+ make_xowa.gfs
+
+app.bldr.pause_at_end_('n');
+app.scripts.run_file_by_type('xowa_cfg_app');
+app.cfg.set_temp('app', 'xowa.app.web.enabled', 'y');
+app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.text', '0');
+app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.html', '0');
+app.cfg.set_temp('app', 'xowa.bldr.db.layout_size.file', '0');
+app.bldr.cmds {
+  // build simple.wikipedia.org
+  add ('simple.wikipedia.org' , 'util.cleanup') {delete_all = 'y';}
+  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
+  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'categorylinks';}
+  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'page_props';}
+  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'image';}
+  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pagelinks';} // needed for sorting search results by PageRank
+  add ('simple.wikipedia.org' , 'util.download') {dump_type = 'imagelinks';}
+  add ('simple.wikipedia.org' , 'text.init');
+  add ('simple.wikipedia.org' , 'text.page') {
+    // calculate redirect_id for #REDIRECT pages. needed for html databases
+    redirect_id_enabled = 'y';
+  }
+  add ('simple.wikipedia.org' , 'text.search');
+
+  // upload desktop css
+  add ('simple.wikipedia.org' , 'text.css');
+
+  // upload mobile css
+  add ('simple.wikipedia.org' , 'text.css') {css_key = 'xowa.mobile'; /* css_dir = 'C:\xowa\user\anonymous\wiki\simple.wikipedia.org-mobile\html\'; */}
+
+  add ('simple.wikipedia.org' , 'text.term');
+
+  add ('simple.wikipedia.org' , 'wiki.page_props');
+  add ('simple.wikipedia.org' , 'wiki.categorylinks');
+
+  // create local "page" tables in each "text" database for "lnki_temp"
+  add ('simple.wikipedia.org' , 'wiki.page_dump.make');
+
+  // create a redirect table for pages in the File namespace
+  add ('simple.wikipedia.org' , 'wiki.redirect') {commit_interval = 1000; progress_interval = 100; cleanup_interval = 100;}
+
+  // create an "image" table to get the metadata for all files in the current wiki
+  add ('simple.wikipedia.org' , 'wiki.image');
+
+  // create an "imagelinks" table to find out which images are used for the wiki
+  add ('simple.wikipedia.org' , 'wiki.imagelinks');
+
+  // parse all page-to-page links
+  add ('simple.wikipedia.org' , 'wiki.page_link');
+
+  // calculate a score for each page using the page-to-page links
+  add ('simple.wikipedia.org' , 'search.page__page_score') {iteration_max = 100;}
+
+  // update link score statistics for the search tables
+  add ('simple.wikipedia.org' , 'search.link__link_score') {page_rank_enabled = 'y';}
+
+  // update word count statistics for the search_word table
+  add ('simple.wikipedia.org' , 'search.word__link_count');
+
+  // cleanup all downloaded files as well as temporary files
+  add ('simple.wikipedia.org' , 'util.cleanup') {delete_tmp = 'y'; delete_by_match('*.xml|*.sql|*.bz2|*.gz');}
+
+  // v2 html generator; allows for multi-threaded / multi-machine builds
+  add ('simple.wikipedia.org' , 'wiki.mass_parse.init') {cfg {ns_ids = '0|4|14|8';}}
+
+  // NOTE: must change manual_now
+  add ('simple.wikipedia.org' , 'wiki.mass_parse.exec') {
+    cfg {
+      num_wkrs = 8; load_all_templates = 'y'; cleanup_interval = 50; hzip_enabled = 'y'; hdiff_enabled = 'y'; manual_now = '2020-02-01 01:02:03';
+      load_all_imglinks = 'y';
+
+      // uncomment the following 3 lines if using the build script as a "worker" helping a "server"
+      // num_pages_in_pool = 32000;
+      // mgr_url = '\\server_machine_name\xowa\wiki\en.wikipedia.org\tmp\xomp\';
+      // wkr_machine_name = 'worker_machine_1'
+    }
+  }
+
+  // note that if multi-machine mode is enabled, all worker directories must be manually copied to the server directory (a build command will be added later)
+  add ('simple.wikipedia.org' , 'wiki.mass_parse.make');
+
+  // aggregate the lnkis
+  add ('simple.wikipedia.org' , 'file.lnki_regy');
+
+  // generate orig metadata for files in the current wiki (for example, for pages in en.wikipedia.org/wiki/File:*)
+  add ('simple.wikipedia.org' , 'file.page_regy') {build_commons = 'n';}
+
+  // generate all orig metadata for all lnkis
+  add ('simple.wikipedia.org' , 'file.orig_regy');
+
+  // generate list of files to download based on "orig_regy" and XOWA image code
+  add ('simple.wikipedia.org' , 'file.xfer_temp.thumb');
+
+  // aggregate list one more time
+  add ('simple.wikipedia.org' , 'file.xfer_regy');
+
+  // identify images that have already been downloaded
+  add ('simple.wikipedia.org' , 'file.xfer_regy_update');
+
+  // download images. This step may also take a long time, depending on how many images are needed
+  add ('simple.wikipedia.org' , 'file.fsdb_make') {
+    commit_interval = 1000; progress_interval = 200; select_interval = 10000;
+    ns_ids = '0|4|14';
+
+    // specify whether original wiki databases are v1 (.sqlite3) or v2 (.xowa)
+    src_bin_mgr__fsdb_version = 'v1';
+
+    // always redownload certain files
+    src_bin_mgr__fsdb_skip_wkrs = 'page_gt_1|small_size';
+
+    // allow downloads from wikimedia
+    src_bin_mgr__wmf_enabled = 'y';
+  }
+
+  // generate registry of original metadata by file title
+  add ('simple.wikipedia.org' , 'file.orig_reg');
+
+  // drop page_dump tables
+  add ('simple.wikipedia.org' , 'wiki.page_dump.drop');
+}
+app.bldr.run;
+
manual_now
above to match the first day of the current month. For example, if today is 2020-02-16
, change it to manual_now = '2020-02-01 01:02:03'
.
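To be explicit, the only value that needs editing is manual_now inside the wiki.mass_parse.exec command of the make_wiki script above; for a build started on 2020-02-16, that cfg block would read (all other settings unchanged):

// fragment of the wiki.mass_parse.exec command from the make_wiki script above
add ('simple.wikipedia.org' , 'wiki.mass_parse.exec') {
  cfg {
    num_wkrs = 8; load_all_templates = 'y'; cleanup_interval = 50; hzip_enabled = 'y'; hdiff_enabled = 'y';
    manual_now = '2020-02-01 01:02:03'; // first day of the month containing 2020-02-16
    load_all_imglinks = 'y';
  }
}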
+ - The script for simple wikipedia is listed below. -
- You will need the latest version of commons.wikimedia.org. Note that if you have an older version, you will have missing images or wrong size information. -
-- For example, if you have a commons.wikimedia.org from 2015-04-22 and are trying to import a 2015-05-17 English Wikipedia, then any new images added after 2015-04-22 will not be picked up. -
-- You also need to have the latest version of www.wikidata.org. Note that English Wikipedia and other wikis use Wikidata through the {{#property}} call or Module code. If you have an earlier version, then data will be missing or out of date. -
-
- You should have a recent-generation machine with relatively high-performance hardware, especially if you're planning to generate images for English Wikipedia.
+ You should have a recent-generation machine with relatively high-performance hardware, especially if you're planning to run the make
script for English Wikipedia.
For context, here is my current machine setup for generating the image dumps:
@@ -195,20 +496,20 @@
(Note: The hardware was assembled in late 2013.)
- For English Wikipedia, it still takes about 50 hours for the entire process.
+ For English Wikipedia, it takes about 50 hours for the entire process.
-- You should have a broadband connection to the internet. The script will need to download dump files from Wikimedia and some dump files (like English Wikipedia) will be in the 10s of GB.
+ You should have a broadband connection to the internet. The script will need to download dump files from Wikimedia and some dump files (like English Wikipedia) will be in the tens of GB.
You can opt to download these files separately and place them in the appropriate location beforehand. However, the script below assumes that the machine is always online. If you are offline, you will need to comment out the "util.download" lines yourself.
+
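As a sketch (assuming the dump files have already been placed in the appropriate location), the download steps can be disabled by commenting out the util.download lines of the make_wiki script above, since the gfs format supports comments (see the format notes below):

// offline variant (sketch): the util.download commands from the make_wiki script above, commented out
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pages-articles';}
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'categorylinks';}
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'page_props';}
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'image';}
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'pagelinks';}
// add ('simple.wikipedia.org' , 'util.download') {dump_type = 'imagelinks';}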
XOWA will automatically re-use the images from existing image databases so that you do not have to redownload them. This is particularly useful for large wikis where redownloading millions of images would be unwanted.
@@ -223,9 +524,9 @@
If you have v2 image databases, they should be placed in /xowa/wiki/wiki_domain/prv
. For example, English Wikipedia should have /xowa/wiki/en.wikipedia.org/prv/en.wikipedia.org-file-ns.000-db.001.xowa
-
The script is written in the gfs
format. This is a custom scripting format specific to XOWA. It is similar to JSON, but also supports commenting.
group {procedure1; procedure2; procedure3;}
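As a minimal sketch of the syntax (the names group, procedure1, procedure2, procedure3, and the key/value pair are placeholders, not real builder commands), braces nest like JSON objects, statements end with semicolons, and both line and block comments are allowed, as seen in the scripts above:

// placeholder names only; not actual XOWA commands
group {
  procedure1;                     // a bare statement
  procedure2 {key = 'value';}     // braces hold key-value pairs, like add (...) {dump_type = 'pages-articles';}
  /* block comments are also supported, as in the mobile text.css line of the make_wiki script */
  procedure3;
}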
-
- A lnki
is short for "link internal". It refers to all wikitext with the double bracket syntax: [[A]]. A more elaborate example for files would be [[File:A.png|thumb|200x300px|upright=.80]]. Note that the abbreviation was chosen to differentiate it from lnke
which is short for "link enternal". For the purposes of the script, all lnki data comes from the current wiki's data dump
+ A lnki
is short for "link internal". It refers to all wikitext with the double bracket syntax: [[A]]. A more elaborate example for files would be [[File:A.png|thumb|200x300px|upright=.80]]. Note that the abbreviation was chosen to differentiate it from lnke
which is short for "link enternal".
+ For the purposes of the script, all lnki data comes from the wikitext in the current wiki's data dump +
+orig
is short for "original file". It refers to the original file metadata. For the purposes of this script, all orig data comes from commons.wikimedia.org
-
+ An orig
is short for "original file". It refers to the original file metadata.
+
+ For the purposes of this script, all orig data comes from commons.wikimedia.org +
+xfer
is short for "transfer file". It refers to the actual file to be downloaded.
-
+ An xfer
is short for "transfer file". It refers to the actual file to be downloaded.
+
fsdb
is short for "file system database". It refers to the internal table format of the XOWA image databases.
-
+ An fsdb
is short for "file system database". It refers to the file as it is stored in the internal table format of the XOWA image databases.
+
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
@@ -485,9 +789,9 @@
app.bldr.cmds {
}
app.bldr.run;
-
app.bldr.pause_at_end_('n');
app.scripts.run_file_by_type('xowa_cfg_app');
@@ -621,6 +925,9 @@
app.bldr.run;
fails if "should be on right" is not right of "should not be left"
@@ -191,7 +191,7 @@
fails if "text does not line up on left"
@@ -218,7 +218,7 @@
fails if "text 2" is not directly underneath "text 1"
diff --git a/home/wiki/Help/Contents.html b/home/wiki/Help/Contents.html
index 3dfc7c5d6..b0f3dab45 100644
--- a/home/wiki/Help/Contents.html
+++ b/home/wiki/Help/Contents.html
@@ -184,7 +184,7 @@
Overview
+ anonymous_e30c10c2-8469-4106-80b1-2780107f4f3b
anonymous_d84359f8-890e-4d82-bff5-390c709fce05
anonymous_9a7ce759-7cdb-441d-8512-2f8d056ee952
anonymous
diff --git a/home/wiki/Help/Download_XOWA.html b/home/wiki/Help/Download_XOWA.html
index b8182b8a9..573c54b62 100644
--- a/home/wiki/Help/Download_XOWA.html
+++ b/home/wiki/Help/Download_XOWA.html
@@ -233,7 +233,7 @@
Resolved by: Redirect formatValue and formatValues to renderSnak and renderSnakValues.
diff --git a/home/wiki/Options.html b/home/wiki/Options.html index c8d169152..7e93f5f2a 100644 --- a/home/wiki/Options.html +++ b/home/wiki/Options.html @@ -59,7 +59,7 @@ - + @@ -67,7 +67,7 @@ - + @@ -108,41 +108,103 @@
Choose one of the following +
Choose one of the following:
Enter a minimum size for the cache to use (in MB)
+
This is an advanced configuration tweak. When the cache reaches its maximum size, it will delete files to free space. It will continue deleting files until the minimum size is reached.
+
For example: with a minimum size of 75 MB (the typical default) and a maximum size of, say, 100 MB, once the cache grows past 100 MB, files are deleted until the cache is back down to 75 MB.
+
+ Enter a font family name.
+
Enter a maximum size for the cache to use (in MB)
Enter a number representing a valid font size in pixels. +
Press to reduce the cache to the minimum now (typically 75 MB).
Press to clear the cache (reduces to 0 MB). +
+ Enter a format for embedding the custom font info in the web page.
+
Miscellaneous information about the cache