Wikipedia
Prerequisite
Please see Wikipedia dumps for the list of dumps that are required to generate all datasets.
Data organization
kgdata organizes Wikipedia data in a folder <wikipedia_dir>, e.g., /data/wikipedia/20200518. The dumps are stored in the subfolder <wikipedia_dir>/dumps (e.g., /data/wikipedia/20200518/dumps). Datasets produced by processing the dumps are stored in sibling folders of dumps. The list of folders can be found in kgdata.wikipedia.config.
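As a rough illustration of this layout, the sketch below derives the dumps folder from a given <wikipedia_dir> using only the standard library. The helper name dumps_dir is hypothetical and not part of kgdata; the actual folder names for processed datasets are defined in kgdata.wikipedia.config and are not reproduced here.

```python
from pathlib import Path


def dumps_dir(wikipedia_dir: str) -> Path:
    """Return the `dumps` subfolder of a Wikipedia data directory.

    Hypothetical helper for illustration only; not part of kgdata.
    """
    return Path(wikipedia_dir) / "dumps"


if __name__ == "__main__":
    # Example directory taken from the text above.
    print(dumps_dir("/data/wikipedia/20200518"))
```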
Wikipedia datasets
The list of available datasets can be found in kgdata.wikipedia.datasets.
Run the command python -m kgdata.wikipedia.datasets to generate the datasets.
$ python -m kgdata.wikipedia.datasets --help
Usage: python -m kgdata.wikipedia.datasets [OPTIONS]
Options:
  --wp-dir TEXT       Wikipedia directory  [required]
  --wd-dir TEXT       Wikidata directory
  -d, --dataset TEXT  Dataset name  [required]
  --help              Show this message and exit.
Examples:
python -m kgdata.wikipedia.datasets --wp-dir $WP_DIR -d relational_tables
python -m kgdata.wikipedia.datasets --wp-dir $WP_DIR --wd-dir $WD_DIR -d linked_relational_tables
Note
For the commands in the examples above to run correctly, replace $WP_DIR with the path to the Wikipedia directory and $WD_DIR with the path to the Wikidata directory, or export them as environment variables beforehand, e.g., export WP_DIR=/data/wikipedia/20220420.
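The same invocations can also be scripted. The following is a minimal sketch that assembles the command line from the options shown in the help output above (--wp-dir, --wd-dir, -d). The function name build_command is an assumption made for illustration and is not part of kgdata.

```python
import shlex
import sys
from typing import List, Optional


def build_command(wp_dir: str, dataset: str, wd_dir: Optional[str] = None) -> List[str]:
    """Assemble the argument list for `python -m kgdata.wikipedia.datasets`.

    Hypothetical helper; it only uses the options listed in the --help output.
    """
    cmd = [sys.executable, "-m", "kgdata.wikipedia.datasets", "--wp-dir", wp_dir]
    if wd_dir is not None:
        # --wd-dir is not marked [required] in the help text, so add it only when given.
        cmd += ["--wd-dir", wd_dir]
    cmd += ["-d", dataset]
    return cmd


if __name__ == "__main__":
    cmd = build_command("/data/wikipedia/20220420", "relational_tables")
    print(shlex.join(cmd))
    # To actually run it (requires kgdata to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a directory path contains spaces.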
Wikipedia dumps
We need the following dumps:
Static HTML Dumps: each dump covers only certain namespaces. The namespace that you are most likely to use is 0 (main articles). For example,
enwiki-NS0-20220420-ENTERPRISE-HTML.json.tar.gz.
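The file name encodes the wiki, the namespace, and the dump date. As a sketch, the pattern below splits a name of this form into those three parts. The regular expression is inferred from the single example above and the function name parse_dump_name is hypothetical, so treat it as an assumption rather than an official naming specification.

```python
import re
from typing import NamedTuple


class DumpName(NamedTuple):
    wiki: str       # e.g. "enwiki"
    namespace: int  # e.g. 0 for main articles
    version: str    # dump date as YYYYMMDD


# Pattern inferred from the example file name; an assumption, not a spec.
_DUMP_RE = re.compile(
    r"^(?P<wiki>[a-z_]+)-NS(?P<ns>\d+)-(?P<version>\d{8})-ENTERPRISE-HTML\.json\.tar\.gz$"
)


def parse_dump_name(filename: str) -> DumpName:
    """Split an Enterprise HTML dump file name into its components."""
    m = _DUMP_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized dump file name: {filename!r}")
    return DumpName(m.group("wiki"), int(m.group("ns")), m.group("version"))


if __name__ == "__main__":
    print(parse_dump_name("enwiki-NS0-20220420-ENTERPRISE-HTML.json.tar.gz"))
```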
Below are some useful scripts to download the dumps. First, set the correct parameters:
export VERSION=20220420
export DIR=<wikipedia_dir>
mkdir -p "$DIR/dumps"
cd "$DIR/dumps"
wget "https://dumps.wikimedia.org/other/enterprise_html/runs/$VERSION/enwiki-NS0-$VERSION-ENTERPRISE-HTML.json.tar.gz"
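For scripted downloads, the same URL can be built in Python. This sketch mirrors the wget command above. dump_url is a hypothetical helper, and its wiki and namespace parameters generalize the command on the assumption that other dumps follow the same naming pattern.

```python
def dump_url(version: str, wiki: str = "enwiki", namespace: int = 0) -> str:
    """Build the download URL used by the wget command above.

    Hypothetical helper; values other than the defaults assume that other
    wikis and namespaces follow the same naming pattern.
    """
    base = "https://dumps.wikimedia.org/other/enterprise_html/runs"
    filename = f"{wiki}-NS{namespace}-{version}-ENTERPRISE-HTML.json.tar.gz"
    return f"{base}/{version}/{filename}"


if __name__ == "__main__":
    print(dump_url("20220420"))
```

The version string appears twice in the URL (once as the run folder and once in the file name), which is why the helper takes it as a single argument.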