Saturday 8 August 2015

Spark Quick Start: Interactive Analysis with the Spark Shell





  Running Spark 1.3 on Hadoop/YARN 2.4.0


Go to the Spark installation directory:

cd /usr/local/spark

Then run the following command to start the Spark shell:

./bin/spark-shell
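The command above starts the shell in local mode. Since this setup runs on Hadoop/YARN 2.4.0, the shell can also be attached to the YARN cluster; this is a minimal sketch, assuming HADOOP_CONF_DIR points at your Hadoop configuration directory:

./bin/spark-shell --master yarn-client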






Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:


scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
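The same sc.textFile call can also read data that already lives in HDFS, since this cluster runs on Hadoop. The path below is only an illustration, not an actual file on your cluster; substitute the real location:

scala> val hdfsFile = sc.textFile("hdfs:///user/hadoop/README.md")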
 
 
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:

scala> textFile.count() // Number of items in this RDD
res0: Long = 98

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
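count() and first() are not the only actions. For example, take(n) returns the first n items as an array; a quick sketch (the lines returned depend on the contents of README.md):

scala> textFile.take(3) // First three items in this RDD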
 
Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
 
We can chain together transformations and actions:

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
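Chaining also works with map and reduce. The example below, from the same quick-start guide referenced at the end of this post, finds the number of words in the longest line: map is a transformation that produces a new RDD of per-line word counts, and reduce is an action that collapses it to a single value.

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)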
 
Reference: http://spark.apache.org/docs/latest/quick-start.html
