Apache Spark RDD

Transformation:

An RDD can be transformed into another RDD. Operations such as map, filter and combineByKey are transformations: each one returns a new RDD instead of modifying the one it was called on.
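A minimal sketch of chaining two transformations, assuming Spark 1.4 or later is on the classpath. The name ctx and the sample numbers are illustrative and not part of the original demo; inside spark-shell the getOrCreate call simply returns the context the shell already provides as sc.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the running context (spark-shell's `sc`) or start a local one.
// `ctx` is an illustrative name.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("transformations-sketch").setMaster("local[*]"))

val numbers = ctx.parallelize(1 to 10)

// Each transformation returns a new RDD; `numbers` itself is unchanged.
val squares = numbers.map(n => n * n)
val evenSquares = squares.filter(_ % 2 == 0)

// collect() is an action, so this is the point where the work actually runs.
println(evenSquares.collect().mkString(", "))  // 4, 16, 36, 64, 100
```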

If several operations need the same data, you can ask Spark to keep that RDD in memory by calling cache() or persist(), so that it is not recomputed for every action.
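A small sketch of persisting an RDD that is used by two actions, assuming Spark 1.4 or later. The names ctx, words and lengths and the sample data are illustrative. For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuse the running context (spark-shell's `sc`) or start a local one.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("persist-sketch").setMaster("local[*]"))

val words = ctx.parallelize(Seq("spark", "rdd", "cache", "persist"))

// Mark the mapped RDD for in-memory storage. Nothing is computed yet.
val lengths = words.map(_.length).persist(StorageLevel.MEMORY_ONLY)

val total = lengths.reduce(_ + _)  // first action: computes and stores the partitions
val longest = lengths.max()        // second action: can reuse the stored partitions

println(s"total=$total longest=$longest")  // total=20 longest=7

// Release the stored partitions once they are no longer needed.
lengths.unpersist()
```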

Actions:

Actions compute a final result and return it to the driver program. first, collect, reduce and count are examples of actions.
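The four actions named above can be tried on a small in-memory RDD. This is a sketch assuming Spark 1.4 or later; ctx, rdd and the sample values are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the running context (spark-shell's `sc`) or start a local one.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("actions-sketch").setMaster("local[*]"))

val rdd = ctx.parallelize(Seq(3, 1, 4, 1, 5))

val head = rdd.first()       // 3: the first element
val n = rdd.count()          // 5: the number of elements, as a Long
val sum = rdd.reduce(_ + _)  // 14: combines all elements with the given function
val all = rdd.collect()      // Array(3, 1, 4, 1, 5): every element, copied to the driver

println(s"first=$head count=$n sum=$sum all=${all.mkString(",")}")
```

collect() copies the whole RDD to the driver, so it is best kept for small results such as this one.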

Lazy Evaluation:

Transformations are not executed when they are declared. Spark only records them, and runs them when an action is called.
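One way to observe this is to point textFile at a file that does not exist: declaring the RDD and mapping over it raise no error, and the failure only surfaces when an action forces evaluation. This is a sketch assuming Spark 1.4 or later; the path is deliberately hypothetical and ctx, missing, upper and attempt are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Reuse the running context (spark-shell's `sc`) or start a local one.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("lazy-sketch").setMaster("local[*]"))

// Hypothetical path, assumed not to exist on this machine.
val missing = ctx.textFile("/hypothetical/no-such-dir/no-such-file.txt")

// Still no error: the transformation is only recorded, not run.
val upper = missing.map(_.toUpperCase)

// count() is an action, so Spark now tries to read the file and fails.
val attempt = Try(upper.count())
println(attempt.isFailure)
```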

Pair RDD:

RDDs whose elements are key/value pairs are called Pair RDDs. They are very useful for aggregating or counting by key in parallel on the nodes of a cluster.

A Pair RDD can be created by calling map() with a function that emits key/value pairs.
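A short sketch of building a Pair RDD from a plain RDD, assuming Spark 1.4 or later. The word list and the names ctx, words, pairs and byLength are illustrative. keyBy is shown as a convenience that derives the key from each element.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the running context (spark-shell's `sc`) or start a local one.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

val words = ctx.parallelize(Seq("spark", "rdd", "spark", "pair"))

// map() emitting a tuple turns RDD[String] into RDD[(String, Int)].
val pairs = words.map(w => (w, 1))

// keyBy builds (key, element) pairs from a key function: RDD[(Int, String)].
val byLength = words.keyBy(_.length)

println(pairs.collect().mkString(" "))     // (spark,1) (rdd,1) (spark,1) (pair,1)
println(byLength.collect().mkString(" "))  // (5,spark) (3,rdd) (5,spark) (4,pair)
```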

Transformations on Pair RDDs:

reduceByKey(), groupByKey(), combineByKey(), mapValues(), flatMapValues(), keys() etc. are functions that operate on a single Pair RDD, whereas subtractByKey(), join() and cogroup() operate on two Pair RDDs.
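The difference between the two groups can be sketched on two small Pair RDDs, assuming Spark 1.4 or later. The data and the names ctx, edges and labels are illustrative. Results are sorted before printing because the order of elements after a shuffle is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the running context (spark-shell's `sc`) or start a local one.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-ops-sketch").setMaster("local[*]"))

val edges = ctx.parallelize(Seq((1, 12), (1, 13), (2, 23), (3, 15)))
val labels = ctx.parallelize(Seq((1, "a"), (3, "c")))

// Functions on a single Pair RDD.
val sums = edges.reduceByKey(_ + _)   // adds up the values of each key
val doubled = edges.mapValues(_ * 2)  // changes values, keeps keys

// Functions on two Pair RDDs.
val joined = edges.join(labels)              // keeps keys present in both
val unlabeled = edges.subtractByKey(labels)  // keeps keys absent from `labels`

println(sums.collect().sorted.mkString(" "))       // (1,25) (2,23) (3,15)
println(doubled.collect().sorted.mkString(" "))    // (1,24) (1,26) (2,46) (3,30)
println(joined.collect().sorted.mkString(" "))     // (1,(12,a)) (1,(13,a)) (3,(15,c))
println(unlabeled.collect().sorted.mkString(" "))  // (2,23)
```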

Demo:

Run the spark-shell command from the command line.

Then create an RDD from any text file.

Here media.txt contains a list of Instagram URLs.

RDD:

scala> val mediaRDD = sc.textFile("D:/instagram-scraper-master/media.txt")

mediaRDD: org.apache.spark.rdd.RDD[String] = D:/instagram-scraper-master/media.txt MapPartitionsRDD[1] at textFile at <console>:21

scala> mediaRDD.count

res0: Long = 1013

scala> mediaRDD.take(2).foreach(println)

https://instagram.fbom1-1.fna.fbcdn.net/t50.2886-16/14790206_177359509381923_7967834812834643968_n.mp4

https://instagram.fbom1-1.fna.fbcdn.net/t50.2886-16/14833228_1020652531380366_8548718479509815296_n.mp4

Pair RDD:

Node.txt: a network file in which each line holds a node id and one of its neighbors.

1 12

1 13

1 14

2 23

4 24

3 15

3 11

scala> val nodeRDD = sc.textFile("D:/Node.txt")

nodeRDD: org.apache.spark.rdd.RDD[String] = D:/Node.txt MapPartitionsRDD[5] at textFile at <console>:21

scala> val mapRDD = nodeRDD.map(_.split(" ")).map(v => (v(0).toInt, v(1).toInt))

mapRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[9] at map at <console>:23

scala> mapRDD.foreach(println)

(4,24)

(1,12)

(3,15)

(1,13)

(3,11)

(1,14)

(2,23)

scala> val result=mapRDD.countByKey()

result: scala.collection.Map[Int,Long] = Map(4 -> 1, 2 -> 1, 1 -> 3, 3 -> 2)

In this way we can apply the Pair RDD functions to a Pair RDD, which makes aggregations by key easy to perform.

In the next tutorial we’ll see all the RDD functions in detail.
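As one more aggregation on the same data, the neighbors of each node can be collected with groupByKey. This is a sketch assuming Spark 1.4 or later; it uses the Node.txt pairs inline through parallelize instead of reading the file, and the names ctx, nodePairs and neighbors are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the running context (spark-shell's `sc`) or start a local one.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("neighbors-sketch").setMaster("local[*]"))

// The (node, neighbor) pairs from Node.txt, written inline for this sketch.
val nodePairs = ctx.parallelize(Seq(
  (1, 12), (1, 13), (1, 14), (2, 23), (4, 24), (3, 15), (3, 11)))

// Group the neighbors of each node, sort them for a stable result,
// and bring the small result back to the driver as a Map.
val neighbors = nodePairs
  .groupByKey()
  .mapValues(_.toList.sorted)
  .collectAsMap()

neighbors.toSeq.sortBy(_._1).foreach { case (node, ns) =>
  println(s"$node -> ${ns.mkString(", ")}")
}
// 1 -> 12, 13, 14
// 2 -> 23
// 3 -> 11, 15
// 4 -> 24
```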
