Saturday 8 August 2015

Spark Quick Start: Interactive Analysis with the Spark Shell





  Running Spark 1.3 on Hadoop/YARN 2.4.0


Go to the Spark installation directory:

cd /usr/local/spark

Then run the following command to start the Spark shell:

./bin/spark-shell
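The command above starts the shell in local mode. Since this setup runs on Hadoop/YARN 2.4.0, the shell can also be attached to the YARN cluster; this is a minimal sketch, assuming HADOOP_CONF_DIR points at your Hadoop configuration directory:

./bin/spark-shell --master yarn-client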






Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:


scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
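The same sc.textFile call can also read data that already lives in HDFS, since this cluster runs on Hadoop. The path below is only an illustration, not an actual file on your cluster; substitute the real location:

scala> val hdfsFile = sc.textFile("hdfs:///user/hadoop/README.md")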
 
 
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:

scala> textFile.count() // Number of items in this RDD
res0: Long = 98

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
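count() and first() are not the only actions. For example, take(n) returns the first n items as an array; a quick sketch (the lines returned depend on the contents of README.md):

scala> textFile.take(3) // First three items in this RDD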
 
Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
 
We can chain together transformations and actions:

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
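Chaining also works with map and reduce. The example below, from the same quick-start guide referenced at the end of this post, finds the number of words in the longest line: map is a transformation that produces a new RDD of per-line word counts, and reduce is an action that collapses it to a single value.

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)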
 
Reference: http://spark.apache.org/docs/latest/quick-start.html
