A Spark-based data comparison tool that lets software development engineers compare data at scale across many pair combinations of data sources. Multiple execution modes in multiple environments let the user generate a diff report as a Java/Scala-friendly DataFrame or as a file for future use. Comes with out-of-the-box SparkFactory and SparkCompare tools.

Learn more »

View project on GitHub »

Latest Release 0.4.0 »

Flexible Execution Environment

Execute MegaSparkDiff locally or on Amazon EMR

Configurable Sources

Compare data sets coming from any two Spark-compatible sources

Run at scale

Execute on EMR to run comparisons at scale on massive data sets

Multiple Execution Modes

Execute MegaSparkDiff as either a Maven dependency or as a standalone comparator

Extra Standalone Functionality

Comes with out-of-the-box SparkFactory and SparkCompare