Wikipedia

Prerequisite

Please see Wikipedia dumps for the list of dumps that are required to generate all datasets.

Data organization

kgdata organizes Wikipedia data in a folder <wikipedia_dir>, e.g., /data/wikipedia/20200518. The dumps are stored in the subfolder <wikipedia_dir>/dumps (e.g., /data/wikipedia/20200518/dumps). Datasets produced by processing the dumps are stored in sibling folders of dumps. The list of folders can be found in kgdata.wikipedia.config.
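The layout described above can be sketched in a few lines of Python. This is only an illustration: the helper name wikipedia_layout is hypothetical and not part of kgdata, and the actual folder names for processed datasets are defined in kgdata.wikipedia.config, so only the dumps subfolder named in the text is shown.

```python
from pathlib import PurePosixPath


def wikipedia_layout(wikipedia_dir: str) -> dict:
    """Hypothetical helper (not part of kgdata): map a Wikipedia
    directory to the dumps subfolder described in the text."""
    root = PurePosixPath(wikipedia_dir)
    return {
        "root": root,
        # Raw dumps live in <wikipedia_dir>/dumps; processed datasets
        # are stored in sibling folders next to it.
        "dumps": root / "dumps",
    }


layout = wikipedia_layout("/data/wikipedia/20200518")
print(layout["dumps"])  # /data/wikipedia/20200518/dumps
```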

Wikipedia datasets

The list of available datasets can be found in kgdata.wikipedia.datasets.

Run the command python -m kgdata.wikipedia.datasets to generate the datasets.

$ python -m kgdata.wikipedia.datasets --help

Usage: python -m kgdata.wikipedia.datasets [OPTIONS]

Options:
--wp-dir TEXT       Wikipedia directory  [required]
--wd-dir TEXT       Wikidata directory
-d, --dataset TEXT  Dataset name  [required]
--help              Show this message and exit.

Examples:

python -m kgdata.wikipedia.datasets --wp-dir $WP_DIR -d relational_tables
python -m kgdata.wikipedia.datasets --wp-dir $WP_DIR --wd-dir $WD_DIR -d linked_relational_tables

Note

For the commands in the examples above to run correctly, replace $WP_DIR with the path to the Wikipedia directory and $WD_DIR with the path to the Wikidata directory, or set them as environment variables, e.g., export WP_DIR=/data/wikipedia/20220420
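If you prefer to launch the command from Python rather than from a shell, the invocation can be assembled as an argument list. The sketch below is a minimal illustration that uses only the options listed in the help output above; the function name build_command and the example paths are assumptions made for this example.

```python
import sys
from typing import List, Optional


def build_command(dataset: str, wp_dir: str, wd_dir: Optional[str] = None) -> List[str]:
    """Hypothetical helper: build the argument list for
    `python -m kgdata.wikipedia.datasets` from the documented options."""
    cmd = [sys.executable, "-m", "kgdata.wikipedia.datasets", "--wp-dir", wp_dir]
    if wd_dir is not None:
        # --wd-dir is optional and only passed when a Wikidata directory is given.
        cmd += ["--wd-dir", wd_dir]
    cmd += ["-d", dataset]
    return cmd


# Example paths are placeholders, not real locations.
cmd = build_command("linked_relational_tables", "/data/wikipedia/20220420", "/data/wikidata/20220420")
print(" ".join(cmd[1:]))
```

The resulting list could then be passed to subprocess.run(cmd, check=True), with the two directories read from os.environ if they were exported as WP_DIR and WD_DIR.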

Wikipedia dumps

We need the following dumps:

  1. Static HTML dumps: they only cover some namespaces. The namespace that you are likely to use is 0 (main articles). For example, enwiki-NS0-20220420-ENTERPRISE-HTML.json.tar.gz.

Below are some useful scripts to download the dumps. First, set the correct parameters:

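# Date of the dump run (YYYYMMDD)
export VERSION=20220420
# Replace <wikipedia_dir> with the path to your Wikipedia directory
export DIR=<wikipedia_dir>
mkdir -p "$DIR/dumps"
cd "$DIR/dumps"
wget "https://dumps.wikimedia.org/other/enterprise_html/runs/$VERSION/enwiki-NS0-$VERSION-ENTERPRISE-HTML.json.tar.gz"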
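The download URL follows a regular pattern, so it can also be built programmatically. The following is a small sketch of that pattern, assuming it holds for other wikis and namespaces as well; the function name and the wiki and namespace parameters are hypothetical generalizations of the single enwiki, namespace 0 example given above.

```python
def enterprise_html_dump_url(version: str, wiki: str = "enwiki", namespace: int = 0) -> str:
    """Hypothetical helper: build the URL of a static HTML dump
    following the pattern used in the wget command above."""
    filename = f"{wiki}-NS{namespace}-{version}-ENTERPRISE-HTML.json.tar.gz"
    return f"https://dumps.wikimedia.org/other/enterprise_html/runs/{version}/{filename}"


print(enterprise_html_dump_url("20220420"))
```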