scala - How to convert rdd object to dataframe in spark




spark dataframe (7)

How can I convert an RDD ( org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] ) to a Dataframe org.apache.spark.sql.DataFrame . I converted a dataframe to rdd using .rdd . After processing it I want it back in dataframe. How can I do this ?


Assuming your RDD[row] is called rdd, you can use:

val sqlContext = new SQLContext(sc) 
import sqlContext.implicits._
rdd.toDF()

Here is a simple example of converting your List into Spark RDD and then converting that Spark RDD into Dataframe.

Please note that I have used Spark-shell's scala REPL to execute following code, Here sc is an instance of SparkContext which is implicitly available in Spark-shell. Hope it answer your question.

scala> val numList = List(1,2,3,4,5)
numList: List[Int] = List(1, 2, 3, 4, 5)

scala> val numRDD = sc.parallelize(numList)
numRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[80] at parallelize at <console>:28

scala> val numDF = numRDD.toDF
numDF: org.apache.spark.sql.DataFrame = [_1: int]

scala> numDF.show
+---+
| _1|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+

On newer versions of spark (2.0+)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val spark = SparkSession
  .builder()
  .getOrCreate()
import spark.implicits._

val dfSchema = Seq("col1", "col2", "col3")
rdd.toDF(dfSchema: _*)

Suppose you have a DataFrame and you want to do some modification on the fields data by converting it to RDD[Row] .

val aRdd = aDF.map(x=>Row(x.getAs[Long]("id"),x.getAs[List[String]]("role").head))

To convert back to DataFrame from RDD we need to define the structure type of the RDD .

If the datatype was Long then it will become as LongType in structure.

If String then StringType in structure.

val aStruct = new StructType(Array(StructField("id",LongType,nullable = true),StructField("role",StringType,nullable = true)))

Now you can convert the RDD to DataFrame using the createDataFrame method.

val aNamedDF = sqlContext.createDataFrame(aRdd,aStruct)

To convert an Array[Row] to DataFrame or Dataset, the following works elegantly:

Say, schema is the StructType for the row,then

val rows: Array[Row]=...
implicit val encoder = RowEncoder.apply(schema)
import spark.implicits._
rows.toDS

SqlContext has a number of createDataFrame methods that create a DataFrame given an RDD . I imagine one of these will work for your context.

For example:

def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame

Creates a DataFrame from an RDD containing Rows using the given schema.


One needs to create a schema, and attach it to the Rdd.

Assuming val spark is a product of a SparkSession.builder...

    import org.apache.spark._
    import org.apache.spark.sql._       
    import org.apache.spark.sql.types._

    /* Lets gin up some sample data:
     * As RDD's and dataframes can have columns of differing types, lets make our
     * sample data a three wide, two tall, rectangle of mixed types.
     * A column of Strings, a column of Longs, and a column of Doubules 
     */
    val arrayOfArrayOfAnys = Array.ofDim[Any](2,3)
    arrayOfArrayOfAnys(0)(0)="aString"
    arrayOfArrayOfAnys(0)(1)=0L
    arrayOfArrayOfAnys(0)(2)=3.14159
    arrayOfArrayOfAnys(1)(0)="bString"
    arrayOfArrayOfAnys(1)(1)=9876543210L
    arrayOfArrayOfAnys(1)(2)=2.71828

    /* The way to convert an anything which looks rectangular, 
     * (Array[Array[String]] or Array[Array[Any]] or Array[Row], ... ) into an RDD is to 
     * throw it into sparkContext.parallelize.
     * http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext shows
     * the parallelize definition as 
     *     def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
     * so in our case our ArrayOfArrayOfAnys is treated as a sequence of ArraysOfAnys.
     * Will leave the numSlices as the defaultParallelism, as I have no particular cause to change it. 
     */
    val rddOfArrayOfArrayOfAnys=spark.sparkContext.parallelize(arrayOfArrayOfAnys)

    /* We'll be using the sqlContext.createDataFrame to add a schema our RDD.
     * The RDD which goes into createDataFrame is an RDD[Row] which is not what we happen to have.
     * To convert anything one tall and several wide into a Row, one can use Row.fromSeq(thatThing.toSeq)
     * As we have an RDD[somethingWeDontWant], we can map each of the RDD rows into the desired Row type. 
     */     
    val rddOfRows=rddOfArrayOfArrayOfAnys.map(f=>
        Row.fromSeq(f.toSeq)
    )

    /* Now to construct our schema. This needs to be a StructType of 1 StructField per column in our dataframe.
     * https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructField shows the definition as
     *   case class StructField(name: String, dataType: DataType, nullable: Boolean = true, metadata: Metadata = Metadata.empty)
     * Will leave the two default values in place for each of the columns:
     *        nullability as true, 
     *        metadata as an empty Map[String,Any]
     *   
     */

    val schema = StructType(
        StructField("colOfStrings", StringType) ::
        StructField("colOfLongs"  , LongType  ) ::
        StructField("colOfDoubles", DoubleType) ::
        Nil
    )

    val df=spark.sqlContext.createDataFrame(rddOfRows,schema)
    /*
     *      +------------+----------+------------+
     *      |colOfStrings|colOfLongs|colOfDoubles|
     *      +------------+----------+------------+
     *      |     aString|         0|     3.14159|
     *      |     bString|9876543210|     2.71828|
     *      +------------+----------+------------+
    */ 
    df.show 

Same steps, but with fewer val declarations:

    val arrayOfArrayOfAnys=Array(
        Array("aString",0L         ,3.14159),
        Array("bString",9876543210L,2.71828)
    )

    val rddOfRows=spark.sparkContext.parallelize(arrayOfArrayOfAnys).map(f=>Row.fromSeq(f.toSeq))

    /* If one knows the datatypes, for instance from JDBC queries as to RDBC column metadata:
     * Consider constructing the schema from an Array[StructField].  This would allow looping over 
     * the columns, with a match statement applying the appropriate sql datatypes as the second
     *  StructField arguments.   
     */
    val sf=new Array[StructField](3)
    sf(0)=StructField("colOfStrings",StringType)
    sf(1)=StructField("colOfLongs"  ,LongType  )
    sf(2)=StructField("colOfDoubles",DoubleType)        
    val df=spark.sqlContext.createDataFrame(rddOfRows,StructType(sf.toList))
    df.show




rdd