apache-spark spark2 Read files sent with spark-submit by the driver
I am sending a Spark job to run on a remote cluster by running
spark-submit ... --deploy-mode cluster --files some.properties ...
I want to read the content of the some.properties file in the driver code, i.e. before creating the Spark context and launching RDD tasks. The file is copied to the remote driver, but not to the driver's working directory.
The ways around this problem that I know of are:
- Upload the file to HDFS
- Store the file in the app jar
Both are inconvenient since this file is frequently changed on the submitting dev machine.
Is there a way to read the file that was uploaded using the
--files flag during the driver code main method?
Yes, you can access files uploaded via the --files argument. Here is how I'm able to access files passed in via --files:
./bin/spark-submit \
  --class com.MyClass \
  --master yarn-cluster \
  --files /path/to/some/file.ext \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
  /path/to/app.jar file.ext
and in my Spark code:
import scala.io.Source

val filename = args(0)
val linecount = Source.fromFile(filename).getLines.size
I do believe these files are downloaded onto the workers in the same directory as the jar is placed, which is why simply passing the filename, and not the absolute path, to Source.fromFile works.
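If the plain filename ever fails to resolve (for instance, when the driver's working directory is not where YARN localized the file), a fallback through SparkFiles is an option. The helper below is my own sketch, not part of the answer above; it assumes pyspark is importable on the node where it runs.

```python
import os

def resolve_submitted_file(filename):
    """Locate a file shipped with spark-submit --files.

    Checks the current working directory first (YARN usually localizes
    --files next to the app jar), then falls back to SparkFiles.get,
    which knows the download directory Spark actually used.
    """
    if os.path.exists(filename):
        return os.path.abspath(filename)
    try:
        from pyspark import SparkFiles  # only available where Spark is installed
    except ImportError:
        raise FileNotFoundError(filename)
    return SparkFiles.get(filename)
```

With this, the driver can call resolve_submitted_file("file.ext") instead of hard-coding either layout.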
Here's a solution I developed in PySpark to load any local file from outside into your big-data platform.
# Load any local text file from the Spark driver and return an RDD
# (really useful in YARN mode to integrate new data on the fly).
# (See https://community.hortonworks.com/questions/38482/loading-local-file-to-apache-spark.html)
import subprocess

def parallelizeTextFileToRDD(sparkContext, localTextFilePath, splitChar):
    localTextFilePath = localTextFilePath.strip(' ')
    if localTextFilePath.startswith("file://"):
        localTextFilePath = localTextFilePath[7:]
    # check_output returns bytes; decode before splitting on a str separator
    data = subprocess.check_output("cat " + localTextFilePath, shell=True).decode()
    textRDD = sparkContext.parallelize(data.split(splitChar))
    return textRDD

# Usage example
myRDD = parallelizeTextFileToRDD(sc, '~/myTextFile.txt', '\n')  # load my local file as an RDD
myRDD.saveAsTextFile('/user/foo/myTextFile')                    # store my data to HDFS
The --files and --archives options support specifying file names with the # symbol, similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt: this will upload the file you have locally named localtest.txt into HDFS, but it will be linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.
This works for my Spark Streaming application in both yarn-client and yarn-cluster mode; maybe it can help you.
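A sketch of the renaming syntax described above (class name and paths are hypothetical):

```shell
# Ship the local file under a different name on the cluster
./bin/spark-submit \
  --class com.MyClass \
  --master yarn-cluster \
  --files /local/path/localtest.txt#appSees.txt \
  /path/to/app.jar

# Inside the application, open the file as "appSees.txt", not "localtest.txt"
```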
A way around the problem is to create a temporary SparkContext simply by calling SparkContext.getOrCreate(), and then read the file you passed in --files with the help of SparkFiles.get. Once you have read the file, retrieve all the necessary configuration into a SparkConf().
After that, call sc.stop(). This will destroy the existing SparkContext, and then in the next line you can simply initialize a new SparkContext with the necessary configuration, like this:
sc = SparkContext.getOrCreate(conf=conf)
You've got yourself a SparkContext with the desired settings.
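To make this flow concrete for the some.properties file from the question, the parsing step can be done in plain Python before rebuilding the context. This is only a sketch: load_properties is my own illustration (it handles simple key=value lines, not the full Java .properties escape rules), and the commented driver sequence assumes pyspark is available.

```python
def load_properties(path):
    """Parse simple key=value lines from a Java-style .properties file.

    Ignores blank lines and '#' comments; does not handle escapes or
    multi-line values.
    """
    props = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

# Hypothetical driver sequence (requires pyspark):
# from pyspark import SparkConf, SparkContext, SparkFiles
# sc = SparkContext.getOrCreate()                            # temporary context
# props = load_properties(SparkFiles.get("some.properties")) # read the shipped file
# sc.stop()                                                  # destroy the temporary context
# conf = SparkConf().setAll(props.items())
# sc = SparkContext.getOrCreate(conf=conf)                   # context with desired settings
```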