MegaSparkDiff is a Spark-based tool for comparing data at scale. It lets software development engineers compare any pair of supported data sources. Multiple execution modes in multiple environments let the user generate a diff report either as a Java/Scala-friendly DataFrame or as a file for later use. It ships with ready-to-use SparkFactory and SparkCompare tools.
Execute MegaSparkDiff locally or on EMR
Compare data sets coming from any two Spark-compatible sources
Execute on EMR to run comparisons at scale on massive data
Use MegaSparkDiff either as a Maven dependency or as a standalone comparator
Comes with out-of-the-box SparkFactory and SparkCompare
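
To make the idea of a diff report concrete, here is a minimal sketch in plain Java of what a two-sided comparison computes: the rows found only in the left source and the rows found only in the right source. This is not the MegaSparkDiff API. The class `DiffSketch`, the method `onlyIn`, and the sample rows are hypothetical names chosen for illustration, and the sketch runs on in-memory lists rather than on Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the MegaSparkDiff API.
class DiffSketch {

    // Returns the rows of `left` that have no matching row in `right`.
    // Duplicates are counted, so two copies on the left need two copies
    // on the right to be considered fully matched.
    static List<String> onlyIn(List<String> left, List<String> right) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String row : right) {
            remaining.merge(row, 1, Integer::sum);
        }
        List<String> result = new ArrayList<>();
        for (String row : left) {
            Integer count = remaining.get(row);
            if (count != null && count > 0) {
                remaining.put(row, count - 1); // matched: consume one copy
            } else {
                result.add(row);               // unmatched: goes into the diff
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample data: each row is a comma-separated record.
        List<String> left = Arrays.asList("1,apple", "2,banana", "3,cherry");
        List<String> right = Arrays.asList("1,apple", "2,blueberry", "3,cherry");

        System.out.println("left only:  " + onlyIn(left, right));  // [2,banana]
        System.out.println("right only: " + onlyIn(right, left));  // [2,blueberry]
    }
}
```

On Spark itself, a comparable duplicate-preserving difference between two Datasets can be obtained with `Dataset.exceptAll` (Spark 2.4 and later). Whether MegaSparkDiff uses that call internally is not stated here.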