kgdata.spark#
Functions
|
|
|
Check if the result directory exists |
|
Make sure that RDDs contain unique records |
|
|
Get spark context |
|
|
|
|
Join two RDDs (left outer join) by non primary key in RDD1. |
|
Join two RDDs (left outer join) by non primary key in RDD1. |
|
- kgdata.spark.does_result_dir_exist(dpath: Union[str, Path], allow_override=True) bool[source]#
Check if the result directory exists
- kgdata.spark.ensure_unique_records(rdd, keyfn, print_error: bool = True)[source]#
Make sure that RDDs contain unique records
- kgdata.spark.left_outer_join(rdd1, rdd2, rdd1_keyfn: Callable[[R1], K1], rdd1_fk_fn: Callable[[R1], List[K2]], rdd2_keyfn: Callable[[R2], K2], join_fn: Callable[[R1, List[Tuple[K2, Optional[R2]]]], Optional[R1]], ser_fn: Callable[[R1], Union[str, bytes]], outfile: str, compression: bool = True)[source]#
Join two RDDs (left outer join) by non primary key in RDD1.
RDD1: contains records of (x, Y, x_data) where x is the id of the record, Y are list of ids of records in RDD2. RDD2: contains records of (y, y_data) where y is the id of the record.
- Parameters
rdd1 (RDD[R1]) – records of (x, Y, x_data) where x is the id of the record, Y are list of ids of records in RDD2.
rdd2 (RDD[R2]) – records of (y, y_data) where y is the id of the record.
rdd1_keyfn (Callable[[R1], K1]) – function that extract id of a record (x) of RDD1
rdd1_fk_fn (Callable[[R1], List[K2]]) – function that extract Y from a record of RDD1
rdd2_keyfn (Callable[[R2], K2]) – function that extract id of a record (y) of RDD2
rdd1_join_fn (Callable[[R1, List[Tuple[K2, Optional[R2]]]], Optional[None]]) – function that merge list of Y into record R1, if its return not None, we use that value
rdd1_serfn (Callable[[R1], Union[str, bytes]]) – function that serialize records of RDD1 to save to file
outfile (str) – output file
compression (bool, optional) – whether we should compress the result, by default True
join_fn (Callable[[R1, List[Tuple[K2, Optional[R2]]]], Optional[R1]]) –
- kgdata.spark.left_outer_join_broadcast(rdd1, rdd2, rdd1_fk_fn: Callable[[R1], List[K2]], rdd2_keyfn: Callable[[R2], K2], rdd1_join_fn: Callable[[R1, List[Tuple[K2, Optional[R2]]]], None], rdd1_serfn: Callable[[R1], Union[str, bytes]], outfile: str, compression: bool = True)[source]#
Join two RDDs (left outer join) by non primary key in RDD1. This join assumes that RDD2 can fit in memory, and takes the broadcast approach.
RDD1: contains records of (x, Y, x_data) where x is the id of the record, Y are list of ids of records in RDD2. RDD2: contains records of (y, y_data) where y is the id of the record.
- Parameters
rdd1 (RDD[R1]) – records of (x, Y, x_data) where x is the id of the record, Y are list of ids of records in RDD2.
rdd2 (RDD[R2]) – records of (y, y_data) where y is the id of the record.
rdd1_fk_fn (Callable[[R1], List[K2]]) – function that extract Y from a record of RDD1
rdd2_keyfn (Callable[[R2], K2]) – function that extract id of a record (y) of RDD2
rdd1_join_fn (Callable[[R1, List[Tuple[K2, Optional[R2]]]], None]) – function that merge list of Y into record R1
rdd1_serfn (Callable[[R1], Union[str, bytes]]) – function that serialize records of RDD1 to save to file
outfile (str) – output file
compression (bool, optional) – whether we should compress the result, by default True
- kgdata.spark.saveAsSingleTextFile(rdd, outfile: Union[str, Path], compressionCodecClass=None, shuffle=True)[source]#