App/Xtn/Import/Dansguardian

From XOWA: the free, open-source, offline wiki application

XOWA can build custom wikis by excluding or including articles based on the words they contain. The filter uses the Dansguardian phraselist format. Dansguardian is an open-source (GPLv2) system for filtering web pages.

1 Options
2 Phraselists
- 2.1 Location
- 2.2 Format
3 Import process
4 Scoring
- 4.1 Basic
- 4.2 Multiplicity
5 Exclusion
6 Inclusion
7 Manual inclusion / exclusion
8 Other notes
- 8.1 Performance

Options

See Options/Import_Dansguardian

Phraselists

Location

Phraselist files are located at /xowa/bin/any/xowa/cfg/bldr/filter/wiki_name/dansguardian. For example, on a Windows system, a phraselist file for simple.wikipedia.org can be placed at C:\xowa\bin\any\xowa\cfg\bldr\filter\simple.wikipedia.org\dansguardian\phraselist1.txt
Phraselist files can be placed in sub-directories for grouping purposes. For example, files can be placed at C:\xowa\bin\any\xowa\cfg\bldr\filter\simple.wikipedia.org\dansguardian\group1\phraselist1.txt and C:\xowa\bin\any\xowa\cfg\bldr\filter\simple.wikipedia.org\dansguardian\group2\phraselist2.txt
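
As an illustration of the layout described above, the following Python sketch collects every phraselist file for a wiki, including files in sub-directories. The function name find_phraselists and the xowa_root parameter are hypothetical and not part of XOWA. The sketch also assumes that every file under the dansguardian directory is a phraselist.

```python
from pathlib import Path

def find_phraselists(xowa_root, wiki_name):
    """Return all files under <root>/bin/any/xowa/cfg/bldr/filter/<wiki>/dansguardian."""
    base = (Path(xowa_root) / "bin" / "any" / "xowa" / "cfg" / "bldr"
            / "filter" / wiki_name / "dansguardian")
    if not base.is_dir():
        # No filter directory for this wiki: nothing to load.
        return []
    # rglob descends into grouping sub-directories such as group1 and group2.
    return sorted(p for p in base.rglob("*") if p.is_file())
```

For example, find_phraselists("C:/xowa", "simple.wikipedia.org") would return the paths of phraselist1.txt and phraselist2.txt from the grouping example above, if those files exist.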

Format

Phraselist files are plain text files with the following format:

Each line is either a rule or a comment
Comments start with the hash sign (#). For example: # this is a comment
Each rule has one or more words enclosed in angle brackets (< >) and separated by commas. For example, < earth >,< mars >
Each rule ends with a score also enclosed in angle brackets. For example, <70>
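
The format rules above can be sketched as a small Python parser. This is only an illustration: parse_rule is a hypothetical name, and the sketch assumes that the spaces inside the angle brackets can be stripped, which may not match how XOWA treats them.

```python
import re

# Matches the text inside one pair of angle brackets.
TOKEN = re.compile(r"<([^<>]*)>")

def parse_rule(line):
    """Parse one phraselist line into (words, score); return None for comments and blank lines."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    tokens = [t.strip() for t in TOKEN.findall(text)]
    if len(tokens) < 2:
        raise ValueError(f"rule needs at least one word and a score: {line!r}")
    # The last bracketed token is the score; everything before it is a word.
    *words, score = tokens
    return words, int(score)
```

With this sketch, parse_rule("< earth >,< mars ><70>") returns (["earth", "mars"], 70), and a comment line returns None.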

Import process

Phraselists are applied during import. The following process occurs:

The import starts for a wiki.
All phraselists for the wiki are loaded into memory.
Each article's wikitext is analyzed against the phraselists to generate a score.
- The article's title is not analyzed. Article titles generally have only one or two words and are not useful for phraselist matching.
- The HTML is not analyzed, because analyzing it would slow down the import dramatically. For example, for English Wikipedia, analyzing wikitext lengthens the import only from 2 hours 40 minutes to 3 hours, whereas analyzing HTML would lengthen it to 70 hours.

The entire wikitext is scanned, including:

HTML tags: <img some attributes/>
URLs: https://upload.wikimedia.org
Comments

Scoring

Basic

A rule matches if the wikitext contains all of the words in the rule.

For example, suppose we want to build a wiki without any astronomy articles. We could use a phraselist like the following:

< planet ><50>
< earth >,< planet ><30>

Now consider these short sample articles:

An article with just the word "planet" would have a score of 50
An article with the words "earth planet something" would have a score of 80: 50 for matching the "planet" rule and 30 for matching the "earth", "planet" rule
An article with just the word "earth" would have a score of 0. The second rule only adds its 30 points when the word "planet" is also present
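
The three sample articles can be checked with a short Python sketch. It is not XOWA's implementation: basic_score and RULES are hypothetical names, matching is simplified to whole words split on whitespace, and repeated words are deliberately ignored here.

```python
# The two example rules as (words, score) pairs.
RULES = [
    (["planet"], 50),
    (["earth", "planet"], 30),
]

def basic_score(wikitext, rules=RULES):
    """Add the score of every rule whose words all appear in the text."""
    words = set(wikitext.lower().split())
    return sum(score for rule_words, score in rules
               if all(w in words for w in rule_words))

print(basic_score("planet"))                  # 50
print(basic_score("earth planet something"))  # 80
print(basic_score("earth"))                   # 0
```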

Multiplicity

Rule scores are multiplied by the number of times the rule matches.

For example, if an article has the text "planet planet planet" then its score would be 150, not 50, because it matches the "planet" rule 3 times
Similarly "earth planet something earth planet something earth planet something" would have a score of 240 because it matches "earth planet" 3 times (90) and "planet" 3 times (150)
However, "earth planet something earth planet something earth" only has a score of 160, because it matches the "earth planet" rule only 2 times (60) and the "planet" rule 2 times (100). The third "earth" has no "planet" to pair with.
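
One way to reproduce these numbers is to count each word and let a multi-word rule match as many times as its least frequent word occurs. That counting rule is an assumption that happens to fit the three examples above; XOWA's actual counting may differ. The name multiplied_score is hypothetical.

```python
from collections import Counter

def multiplied_score(wikitext, rules):
    """Score a text, counting each rule once per occurrence (assumed: min word count)."""
    counts = Counter(wikitext.lower().split())
    total = 0
    for rule_words, score in rules:
        # Assumption: a rule matches as often as its rarest word appears.
        matches = min(counts[w] for w in rule_words)
        total += matches * score
    return total

rules = [(["planet"], 50), (["earth", "planet"], 30)]
print(multiplied_score("planet planet planet", rules))  # 150
print(multiplied_score("earth planet something " * 3, rules))  # 240
print(multiplied_score("earth planet something earth planet something earth", rules))  # 160
```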

Exclusion

By default, any article that matches a rule (that is, has a score greater than 0) is excluded. The exclude score can be raised from 0 to a higher value, such as 100. See Options/Import_Dansguardian

Inclusion

The import filter can also be used to build content-specific wikis. For example, let's say you wanted to build a wiki that only includes articles with the words "planet" and "earth planet". The following can be done:

Use the same phraselists as above, but negate the numbers:

< planet ><-50>
< earth >,< planet ><-30>

Change the initial score from 0 to 50
Leave the exclude score at 0

When running the import, the following will happen:

An article that has the words "earth planet something" will have a score of -30: the initial score of 50 plus the rule score of -80. Because -30 is less than the exclude score of 0, it will not be excluded.
An article that has the words "a b c" will still have its initial score of 50. Because 50 is greater than the exclude score of 0, it will be excluded
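
The exclusion and inclusion cases both reduce to one comparison, sketched below in Python. The function and parameter names are hypothetical and do not correspond to XOWA option names. The sketch assumes an article whose total exactly equals the exclude score is kept, since the text only describes scores above and below it.

```python
def is_excluded(rule_score, initial_score=0, exclude_score=0):
    """Return True if initial score plus rule score is above the exclude score."""
    return initial_score + rule_score > exclude_score

# Default exclusion: any positive rule score excludes the article.
print(is_excluded(80))                    # True
# Inclusion setup: initial score 50, negated rules.
print(is_excluded(-80, initial_score=50)) # False (50 - 80 = -30)
print(is_excluded(0, initial_score=50))   # True  (no rule matched)
```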

Manual inclusion / exclusion

The filter process also provides a way to list articles that will always be included / excluded, regardless of rule score.

For example, to always exclude the articles Earth and Sun, create the following text file:

Earth
Sun

Save it to /xowa/bin/any/xowa/cfg/bldr/filter/wiki_name/xowa.title.exclude.txt. XOWA will read this list and always exclude the listed articles.

Similarly, to manually include an article, save the file to /xowa/bin/any/xowa/cfg/bldr/filter/wiki_name/xowa.title.include.txt
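
A manual list can be thought of as an override applied after scoring, as in the following Python sketch. The names load_titles and final_decision are hypothetical. The sketch assumes one title per line in UTF-8, and it assumes the include list wins if a title appears in both files; the source does not state the precedence.

```python
from pathlib import Path

def load_titles(path):
    """Read one article title per line; a missing file gives an empty set."""
    p = Path(path)
    if not p.is_file():
        return set()
    lines = p.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}

def final_decision(title, excluded_by_score, include_titles, exclude_titles):
    """Return True if the article is excluded after applying the manual lists."""
    if title in include_titles:   # assumed precedence: include list first
        return False
    if title in exclude_titles:
        return True
    return excluded_by_score
```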

Other notes

Performance

A phraselist file can have many rules. The number of rules does not significantly slow down the import filter. For example, if Simple Wikipedia imports in 3 minutes with 100 rules, the import should still take 3 minutes with 10,000 rules.
However, the number of rules does affect the amount of memory required. For example, 100 rules may take 1 MB, and 10,000 rules may take 10 MB.
