Wikidata
Prerequisite
Please see Wikidata dumps for the list of dumps that are required to generate all datasets.
Data organization
kgdata organizes Wikidata's data in a folder <wikidata_dir>, e.g., /data/wikidata/20200518. The dumps are stored in the subfolder <wikidata_dir>/dumps (e.g., /data/wikidata/20200518/dumps). Datasets produced by processing the dumps are stored in sibling folders of dumps. The list of folders can be found in kgdata.wikidata.config.
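The layout described above can be sketched in a few lines of Python. This is only an illustration: the helper names dumps_dir and sibling_dir are hypothetical, and the folder name "some_dataset" is a placeholder, since the real folder names are defined in kgdata.wikidata.config.

```python
from pathlib import PurePosixPath


def dumps_dir(wikidata_dir: str) -> PurePosixPath:
    """Return the dumps subfolder, i.e. <wikidata_dir>/dumps."""
    return PurePosixPath(wikidata_dir) / "dumps"


def sibling_dir(wikidata_dir: str, name: str) -> PurePosixPath:
    """Return a folder that sits next to dumps inside <wikidata_dir>.

    `name` is a placeholder; the actual names come from kgdata.wikidata.config.
    """
    return PurePosixPath(wikidata_dir) / name


wikidata_dir = "/data/wikidata/20200518"
print(dumps_dir(wikidata_dir))                    # /data/wikidata/20200518/dumps
print(sibling_dir(wikidata_dir, "some_dataset"))  # /data/wikidata/20200518/some_dataset
```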
Wikidata datasets
The list of available datasets can be found in kgdata.wikidata.datasets.
Run the command python -m kgdata.wikidata.datasets to generate the datasets.
$ python -m kgdata.wikidata.datasets --help
Usage: python -m kgdata.wikidata.datasets [OPTIONS]
Options:
-s, --source TEXT Wikidata directory [required]
-d, --dataset TEXT Dataset name [required]
--dbpedia TEXT DBpedia directory. Only needed if building datasets that
require DBpedia data such as entity_wikilinks
--help Show this message and exit.
Examples:
python -m kgdata.wikidata.datasets -s $WD_DIR -d entities
python -m kgdata.wikidata.datasets -s $WD_DIR -d classes
python -m kgdata.wikidata.datasets -s $WD_DIR -d properties
Note
For the commands in the above examples to run correctly, replace $WD_DIR with the path to the Wikidata directory, or export it first, e.g., export WD_DIR=/data/wikidata/20211213
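When several datasets are needed, the example commands can be assembled from a script. The sketch below only builds and prints the command lines using the -s and -d options shown in the help output. The helper dataset_command is hypothetical, and the fallback path is the example value from the note, not a default of kgdata.

```python
import os
import shlex
from typing import List


def dataset_command(wd_dir: str, dataset: str) -> List[str]:
    """Build the argument list that generates one dataset."""
    return ["python", "-m", "kgdata.wikidata.datasets", "-s", wd_dir, "-d", dataset]


# Use WD_DIR if it is exported, otherwise fall back to the example path.
wd_dir = os.environ.get("WD_DIR", "/data/wikidata/20211213")

for dataset in ["entities", "classes", "properties"]:
    cmd = dataset_command(wd_dir, dataset)
    print(shlex.join(cmd))
    # To actually run the command: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems if the Wikidata directory contains spaces.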
Wikidata databases
The list of available databases can be found by running python -m kgdata.wikidata.
$ python -m kgdata.wikidata --help
Usage: python -m kgdata.wikidata [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
classes Wikidata classes
entities Wikidata entities
entity_labels Wikidata entity labels
entity_redirections Wikidata entity redirections
properties Wikidata properties
wp2wd Mapping from Wikipedia articles to Wikidata entities
Once the databases are built (see the commands below), the module
kgdata.wikidata.db provides functions that read them and return
dictionary-like objects. The main Wikidata models are defined in
kgdata.wikidata.models.wdentity, kgdata.wikidata.models.wdclass, and kgdata.wikidata.models.wdproperty.
Extract entities, entity labels, and entity redirections:
kgdata wikidata entities -d <wikidata_dir> -o <database_directory> -c
kgdata wikidata entity_labels -d <wikidata_dir> -o <database_directory> -c
kgdata wikidata entity_redirections -d <wikidata_dir> -o <database_directory> -c
Extract ontology:
kgdata wikidata classes -d <wikidata_dir> -o <database_directory> -c
kgdata wikidata properties -d <wikidata_dir> -o <database_directory> -c
For more commands, see scripts/build.sh. If the compaction step (compacting
RocksDB) takes too long, you can run the commands without the -c flag. If you
run directly from source, replace the kgdata command with
python -m kgdata.
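The variations described above (with or without -c, installed command or running from source) can be captured in one small helper. This is a sketch that only assembles the command line from the options shown in this section. The function db_command and its parameters are hypothetical and not part of kgdata, and the database directory in the example is a made-up path.

```python
import shlex
from typing import List


def db_command(
    name: str,
    wikidata_dir: str,
    db_dir: str,
    compact: bool = True,
    from_source: bool = False,
) -> List[str]:
    """Build the command that extracts one database.

    compact=False drops the -c flag (skips the compaction step).
    from_source=True uses `python -m kgdata` instead of the `kgdata` command.
    """
    prefix = ["python", "-m", "kgdata"] if from_source else ["kgdata"]
    cmd = prefix + ["wikidata", name, "-d", wikidata_dir, "-o", db_dir]
    if compact:
        cmd.append("-c")
    return cmd


for name in ["entities", "entity_labels", "entity_redirections", "classes", "properties"]:
    # "/data/wikidata/databases" is an illustrative output directory.
    print(shlex.join(db_command(name, "/data/wikidata/20200518", "/data/wikidata/databases")))
```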
Wikidata dumps
The dumps are available at dumps.wikimedia.org.
We need the following dumps:
entity dump (e.g., latest-all.json.bz2): needed to extract entities, classes and properties.
wikidatawiki-page.sql.gz and wikidatawiki-redirect.sql.gz: needed to resolve redirections of old entities.
Below are some useful scripts to download the dumps. First, set the correct parameters:
export VERSION=20200518
export DIR=<wikidata_dir>
mkdir -p $DIR/dumps
cd $DIR/dumps
wget https://dumps.wikimedia.org/wikidatawiki/entities/$VERSION/wikidata-$VERSION-all.json.bz2
wget https://dumps.wikimedia.org/wikidatawiki/$VERSION/wikidatawiki-$VERSION-page.sql.gz
wget https://dumps.wikimedia.org/wikidatawiki/$VERSION/wikidatawiki-$VERSION-redirect.sql.gz
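The same URLs can be generated from Python, which is convenient when scripting downloads for more than one dump version. The sketch below mirrors the URL patterns of the wget commands above and performs no network access. The function name dump_urls is hypothetical.

```python
from typing import List


def dump_urls(version: str) -> List[str]:
    """Return the URLs of the three required dumps for a given version."""
    base = "https://dumps.wikimedia.org/wikidatawiki"
    return [
        # Entity dump: entities, classes and properties.
        f"{base}/entities/{version}/wikidata-{version}-all.json.bz2",
        # Page and redirect tables: resolve redirections of old entities.
        f"{base}/{version}/wikidatawiki-{version}-page.sql.gz",
        f"{base}/{version}/wikidatawiki-{version}-redirect.sql.gz",
    ]


for url in dump_urls("20200518"):
    print(url)
```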