object SparkFactory
Value Members
- var conf: SparkConf
-
def
flattenDataFrame(df: DataFrame, delimiter: String): DataFrame
Flattens a DataFrame into a single column that contains the entire original row, with fields separated by the given delimiter.
- df
a dataframe which contains one or more columns
- delimiter
a character which separates the source data
- returns
flattened dataframe
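A minimal sketch of the call, assuming an already initialized session; column names are hypothetical:

```scala
// Build a small two-column frame (names are illustrative).
val df = SparkFactory.sparkSession
  .createDataFrame(Seq((1, "a"), (2, "b")))
  .toDF("id", "value")

// Collapse each row into one delimited string column.
val flat = SparkFactory.flattenDataFrame(df, ",")
// Expected shape: a single column with rows like "1,a" and "2,b".
```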
- def initializeDataBricks(dataBricksSparkSession: SparkSession): Unit
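On Databricks the runtime already provides a SparkSession, and this method presumably registers that existing session with MegaSparkDiff rather than creating a new one (an assumption based on the signature). A sketch using the `spark` value a Databricks notebook supplies:

```scala
// `spark` is the SparkSession provided by the Databricks runtime.
SparkFactory.initializeDataBricks(spark)
```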
-
def
initializeSparkContext(): Unit
The initialize method creates the main Spark session which MegaSparkDiff uses. This needs to be called before any operation can be made using MegaSparkDiff. It creates a Spark application with the name "megasparkdiff" and enables Hive support by default.
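A minimal setup-and-teardown sketch; the import path is assumed from the MegaSparkDiff project layout:

```scala
import org.finra.msd.sparkfactory.SparkFactory

SparkFactory.initializeSparkContext() // creates the "megasparkdiff" app with Hive support
// ... run comparisons here ...
SparkFactory.stopSparkContext()       // tear the session down when finished
```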
-
def
initializeSparkLocalMode(numCores: String, logLevel: String, defaultPartitions: String): Unit
This method should be used to initialize MegaSparkDiff in local mode, in other words in any environment that is not EMR or EC2. Typical use cases are executing a diff on your laptop or workstation, or within a Jenkins build. A usage sketch follows the parameter list below.
- numCores
this parameter sets the number of cores Spark will use. For example, "local[1]" restricts Spark to a single core, while "local[*]" lets Spark detect and use all available cores.
- logLevel
the Spark log level to apply, for example "INFO" or "ERROR"
- defaultPartitions
the default number of partitions Spark should use when shuffling or parallelizing data
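A local-mode sketch with illustrative argument values:

```scala
// Use every available local core, keep logging quiet, default to 4 partitions.
SparkFactory.initializeSparkLocalMode("local[*]", "ERROR", "4")
```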
-
def
parallelizeCSVSource(filePath: String, tempViewName: String, schemaDef: Option[StructType] = Option.empty, delimiter: Option[String] = Option.apply(",")): AppleTable
This method creates an AppleTable for data in a CSV file.
- filePath
the relative path to the CSV file
- tempViewName
the name of the temporary view which gets created for source data
- schemaDef
optional schema definition for the data. If none is provided, the schema will be inferred by Spark
- delimiter
optional delimiter character for the data. If none is provided, comma (",") will be used as default
- returns
custom table containing the data to be compared
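A sketch with a hypothetical file path and view name, relying on the inferred schema and default comma delimiter:

```scala
val csvSource = SparkFactory.parallelizeCSVSource("data/left.csv", "left_view")
```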
-
def
parallelizeDynamoDBSource(tableName: String, tempViewName: String, firstLevelElementNames: Array[String], delimiter: Option[String] = Option.apply(","), selectColumns: Option[Array[String]] = Option.empty, filter: Option[String] = Option.empty, region: Option[String] = Option.empty, roleArn: Option[String] = Option.empty, readPartitions: Option[String] = Option.empty, maxPartitionBytes: Option[String] = Option.empty, defaultParallelism: Option[String] = Option.empty, targetCapacity: Option[String] = Option.empty, stronglyConsistentReads: Option[String] = Option.empty, bytesPerRCU: Option[String] = Option.empty, filterPushdown: Option[String] = Option.empty, throughput: Option[String] = Option.empty): AppleTable
This method will create an AppleTable for data in a DynamoDB table. The method uses the spark-dynamodb library, to which many of the parameters below are passed; their descriptions are therefore quoted from the spark-dynamodb README file.
- tableName
name of DynamoDB table
- tempViewName
temporary table name for source data
- firstLevelElementNames
names of the first level elements in the table
- delimiter
source data separation character
- selectColumns
list of columns to select from table
- filter
sql where condition
- region
region Spark parameter with the following description from the source README: "sets the region where the dynamodb table. Default is environment specific."
- roleArn
roleArn Spark parameter with the following description from the source README: "sets an IAM role to assume. This allows for access to a DynamoDB in a different account than the Spark cluster. Defaults to the standard role configuration."
- readPartitions
readPartitions Spark reader parameter with the following description from the source README: "number of partitions to split the initial RDD when loading the data into Spark. Defaults to the size of the DynamoDB table divided into chunks of maxPartitionBytes"
- maxPartitionBytes
maxPartitionBytes Spark reader parameter with the following description from the source README: "the maximum size of a single input partition. Default 128 MB"
- defaultParallelism
defaultParallelism Spark reader parameter with the following description from the source README: "the number of input partitions that can be read from DynamoDB simultaneously. Defaults to sparkContext.defaultParallelism"
- targetCapacity
targetCapacity Spark reader parameter with the following description from the source README: "fraction of provisioned read capacity on the table (or index) to consume for reading. Default 1 (i.e. 100% capacity)."
- stronglyConsistentReads
stronglyConsistentReads Spark reader parameter with the following description from the source README: "whether or not to use strongly consistent reads. Default false."
- bytesPerRCU
bytesPerRCU Spark reader parameter with the following description from the source README: "number of bytes that can be read per second with a single Read Capacity Unit. Default 4000 (4 KB). This value is multiplied by two when stronglyConsistentReads=false"
- filterPushdown
filterPushdown Spark reader parameter with the following description from the source README: "whether or not to use filter pushdown to DynamoDB on scan requests. Default true."
- throughput
throughput Spark reader parameter with the following description from the source README: "the desired read throughput to use. It overwrites any calculation used by the package. It is intended to be used with tables that are on-demand. Defaults to 100 for on-demand."
- returns
custom table containing the data to be compared
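A sketch with hypothetical table, view, and element names; every option left unset falls back to the defaults described above:

```scala
val dynamoSource = SparkFactory.parallelizeDynamoDBSource(
  "orders",                    // DynamoDB table name (hypothetical)
  "orders_view",               // temp view registered for the source data
  Array("orderId", "status"),  // first-level element names (hypothetical)
  region = Option("us-east-1") // override only the region; other options keep defaults
)
```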
-
def
parallelizeHiveSource(sqlText: String, tempViewName: String): AppleTable
This method will create an AppleTable from a query that retrieves data from a Hive table. It is assumed that Hive connectivity is already enabled in the environment from which this project is run.
- sqlText
a query to retrieve the data
- tempViewName
custom table name for source
- returns
custom table containing the data
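A sketch assuming the session was initialized with Hive support and that a table named `db.source_table` (hypothetical) exists:

```scala
val hiveSource = SparkFactory.parallelizeHiveSource(
  "select * from db.source_table", // query to retrieve the data
  "hive_view"                      // temp view name for the source
)
```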
-
def
parallelizeJDBCSource(driverClassName: String, jdbcUrl: String, username: String, password: String, sqlQuery: String, tempViewName: String): AppleTable
This method will create an AppleTable from a query that retrieves data from a database accessed through a JDBC connection; it passes in "," as the default delimiter.
- driverClassName
JDBC driver name
- jdbcUrl
JDBC URL
- username
Username for database connection
- password
Password for database connection
- sqlQuery
Query to retrieve the desired data from database
- tempViewName
temporary table name for source data
- returns
custom table containing the data to be compared
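A sketch against a hypothetical PostgreSQL database; the driver class, URL, credentials, and query are all placeholders:

```scala
val jdbcSource = SparkFactory.parallelizeJDBCSource(
  "org.postgresql.Driver",
  "jdbc:postgresql://dbhost:5432/mydb",
  "user",
  "secret",
  // Spark's JDBC reader generally expects a query wrapped as an aliased subquery.
  "(select id, name from customers) as t",
  "customers_view"
)
```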
-
def
parallelizeJDBCSource(driverClassName: String, jdbcUrl: String, username: String, password: String, sqlQuery: String, tempViewName: String, delimiter: Option[String], partitionColumn: String, lowerBound: String, upperBound: String, numPartitions: String): AppleTable
This method will create an AppleTable from a query that retrieves data from a database accessed through a JDBC connection, partitioning the read into concurrent queries over the given column.
- driverClassName
JDBC driver name
- jdbcUrl
JDBC URL
- username
Username for database connection
- password
Password for database connection
- sqlQuery
Query to retrieve the desired data from database
- tempViewName
temporary table name for source data
- delimiter
source data separation character
- partitionColumn
column to partition queries on
- lowerBound
lower bound of partition column values
- upperBound
upper bound of partition column values
- numPartitions
number of total queries to execute
- returns
custom table containing the data to be compared
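A sketch of the partitioned overload, splitting the read into four concurrent queries on a numeric column (all values hypothetical):

```scala
val partitionedSource = SparkFactory.parallelizeJDBCSource(
  "org.postgresql.Driver",
  "jdbc:postgresql://dbhost:5432/mydb",
  "user",
  "secret",
  "(select id, name from customers) as t",
  "customers_view",
  Option(","), // delimiter
  "id",        // partitionColumn: numeric column to split on
  "1",         // lowerBound of id values
  "1000000",   // upperBound of id values
  "4"          // numPartitions: four queries in total
)
```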
-
def
parallelizeJDBCSource(driverClassName: String, jdbcUrl: String, username: String, password: String, sqlQuery: String, tempViewName: String, delimiter: Option[String]): AppleTable
This method will create an AppleTable from a query that retrieves data from a database accessed through a JDBC connection, using the supplied delimiter.
- driverClassName
JDBC driver name
- jdbcUrl
JDBC URL
- username
Username for database connection
- password
Password for database connection
- sqlQuery
Query to retrieve the desired data from database
- tempViewName
temporary table name for source data
- delimiter
source data separation character
- returns
custom table containing the data to be compared
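The same pattern with a non-default delimiter, sketched with a pipe character (connection details remain placeholders):

```scala
val pipeSource = SparkFactory.parallelizeJDBCSource(
  "org.postgresql.Driver",
  "jdbc:postgresql://dbhost:5432/mydb",
  "user",
  "secret",
  "(select id, name from customers) as t",
  "customers_view",
  Option("|") // use "|" instead of the default ","
)
```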
-
def
parallelizeJSONSource(jsonFileLocation: String, tempViewName: String, firstLevelElementNames: Array[String], delimiter: Option[String] = Option.apply(",")): AppleTable
This method will create an AppleTable for data in a JSON file.
- jsonFileLocation
path of a json file containing the data to be compared
- tempViewName
temporary table name for source data
- firstLevelElementNames
Names of the first level elements in the file
- delimiter
source data separation character
- returns
custom table containing the data to be compared
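A sketch assuming a JSON file whose top-level fields match the listed element names (path and names hypothetical):

```scala
val jsonSource = SparkFactory.parallelizeJSONSource(
  "data/events.json",
  "events_view",
  Array("eventId", "timestamp", "payload")
)
```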
-
def
parallelizeTextFile(textFileLocation: String): DataFrame
This method will create a DataFrame with a single column called "field1" from a text file. The file can be on the local machine, in HDFS (with a URL like "hdfs://nn1home:8020/input/war-and-peace.txt"), or in S3 (with a URL like "s3n://myBucket/myFile1.log").
- textFileLocation
path of a flat file containing the data to be compared
- returns
a dataframe converted from a flat file
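A sketch with a hypothetical local path:

```scala
val textDf = SparkFactory.parallelizeTextFile("data/sample.txt")
// textDf has a single string column named "field1".
```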
-
def
parallelizeTextSource(textFileLocation: String, tempViewName: String): AppleTable
This method will create an AppleTable wrapping a DataFrame with a single column called "field1" from a text file. The file can be on the local machine, in HDFS (with a URL like "hdfs://nn1home:8020/input/war-and-peace.txt"), or in S3 (with a URL like "s3n://myBucket/myFile1.log").
- textFileLocation
path of a flat file containing the data to be compared
- tempViewName
temporary table name for source data
- returns
custom table containing the data to be compared
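The AppleTable variant of the same read, sketched with the HDFS URL style quoted above:

```scala
val textSource = SparkFactory.parallelizeTextSource(
  "hdfs://nn1home:8020/input/war-and-peace.txt",
  "text_view"
)
```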
-
def
simpleTableToSimpleJSONFormatTable(df: DataFrame): DataFrame
Transforms a single-level DataFrame into a simplified JSON format DataFrame.
- df
single level DataFrame
- returns
simplified JSON format DataFrame
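A sketch using the sparkImplicits object documented below to build the input frame (column names hypothetical):

```scala
import SparkFactory.sparkImplicits._

val flatDf = Seq((1, "a"), (2, "b")).toDF("id", "value")
val jsonDf = SparkFactory.simpleTableToSimpleJSONFormatTable(flatDf)
```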
- var sparkSession: SparkSession
-
def
stopSparkContext(): Unit
Terminates the current Spark session.
- object sparkImplicits extends SQLImplicits