Saturday, 15 August 2015

Self-Contained Java Applications on Apache Spark



Follow the steps below to build and run a self-contained Java application on Apache Spark.

1) Create a folder named "example-java-build" on your C: drive.

2) Create the following directory structure inside it:

./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java


SimpleApp.java
 


/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "./examples/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache(); // cached because the RDD is read twice below

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs +

 ", lines with b: " + numBs);
  }
}




Note: I have copied "README.md" into the examples folder of my Spark installation, so the relative path in the code ("./examples/README.md") resolves correctly when the job runs.
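As an aside, if you compile with Java 8 (by configuring the maven-compiler-plugin for source and target 1.8), the anonymous Function classes above can be written as lambdas, since Function has a single abstract method. A minimal sketch of the same two counts:

// Equivalent filters using Java 8 lambdas
long numAs = logData.filter(s -> s.contains("a")).count();
long numBs = logData.filter(s -> s.contains("b")).count();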



pom.xml

<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.4.1</version>
    </dependency>
  </dependencies>
</project>
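A couple of notes on this dependency: the _2.10 suffix on spark-core_2.10 is the Scala version Spark was built against, so it should match your Spark installation. Also, the default maven-jar-plugin build produces a thin jar that does not bundle dependencies, which is exactly what spark-submit expects here; if you later switch to a shaded or assembly jar, it is common practice to mark the Spark dependency as <scope>provided</scope> so Spark's own classes are not packaged into it.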


Now run the following Maven command from the example-java-build folder:

mvn package
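This compiles SimpleApp.java and builds the application jar at target\simple-project-1.0.jar, as the build log below shows.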

C:\Users\Vaquar khan\Desktop\Spark\example-java-build>mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Simple Project 1.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ simple-pro
[WARNING] Using platform encoding (Cp1252 actually) to copy filtered resources,
[INFO] skip non existing resourceDirectory C:\Users\Vaquar khan\Desktop\Spark\ex
[INFO]
[INFO] --- maven-compiler-plugin:2.5.1:compile (default-compile) @ simple-projec
[WARNING] File encoding has not been set, using platform encoding Cp1252, i.e. b
[INFO] Compiling 1 source file to C:\Users\Vaquar khan\Desktop\Spark\example-jav
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ si
[WARNING] Using platform encoding (Cp1252 actually) to copy filtered resources,
[INFO] skip non existing resourceDirectory C:\Users\Vaquar khan\Desktop\Spark\ex
[INFO]
[INFO] --- maven-compiler-plugin:2.5.1:testCompile (default-testCompile) @ simpl
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ simple-project ---
[INFO] No tests to run.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ simple-project ---
[INFO] Building jar: C:\Users\Vaquar khan\Desktop\Spark\example-java-build\targe
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6.939s
[INFO] Finished at: Sun Aug 16 02:27:41 IST 2015
[INFO] Final Memory: 29M/291M
[INFO] ------------------------------------------------------------------------


Then go to the target folder and copy simple-project-1.0.jar into the examples folder of your Spark installation.


Now open a terminal on your Ubuntu machine, change to your Spark home directory (cd /usr/local/spark), and run the following command:


./bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  ./examples/simple-project-1.0.jar

You should see the following output:

Lines with a: 60, lines with b: 29
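Notice that SimpleApp never calls setMaster: the master URL comes from spark-submit via --master local[4], which runs the job with four local worker threads. If you want to launch the class directly instead, for example from an IDE with the Spark jars on the classpath, a minimal sketch would set the master in code:

// For direct/IDE runs only; when using spark-submit, leave the master
// unset so the --master flag on the command line takes effect.
SparkConf conf = new SparkConf()
    .setAppName("Simple Application")
    .setMaster("local[4]"); // four local worker threads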



Console:
vaquarkhan@ubuntu:/usr/local/sparks$ ./bin/spark-submit   --class "SimpleApp"   --master local[4]   ./examples/simple-project-1.0.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/15 13:58:06 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.40.129 instead (on interface eth0)
15/08/15 13:58:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/08/15 13:58:07 INFO SecurityManager: Changing view acls to: vaquarkhan
15/08/15 13:58:07 INFO SecurityManager: Changing modify acls to: vaquarkhan
15/08/15 13:58:07 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vaquarkhan); users with modify permissions: Set(vaquarkhan)
15/08/15 13:58:08 INFO Slf4jLogger: Slf4jLogger started
15/08/15 13:58:08 INFO Remoting: Starting remoting
15/08/15 13:58:09 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ubuntu.local:38218]
15/08/15 13:58:09 INFO Utils: Successfully started service 'sparkDriver' on port 38218.
15/08/15 13:58:09 INFO SparkEnv: Registering MapOutputTracker
15/08/15 13:58:09 INFO SparkEnv: Registering BlockManagerMaster
15/08/15 13:58:09 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150815135809-809e
15/08/15 13:58:09 INFO MemoryStore: MemoryStore started with capacity 267.3 MB
15/08/15 13:58:10 INFO HttpFileServer: HTTP File server directory is /tmp/spark-0eda6393-c81b-4cec-b3db-82d861b61a23
15/08/15 13:58:10 INFO HttpServer: Starting HTTP Server
15/08/15 13:58:10 INFO Utils: Successfully started service 'HTTP file server' on port 54406.
15/08/15 13:58:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/08/15 13:58:20 INFO Utils: Successfully started service 'SparkUI' on port 4041.
15/08/15 13:58:20 INFO SparkUI: Started SparkUI at http://ubuntu.local:4041
15/08/15 13:58:20 INFO SparkContext: Added JAR file:/usr/local/sparks/./examples/simple-project-1.0.jar at http://192.168.40.129:54406/jars/simple-project-1.0.jar with timestamp 1439672300996
15/08/15 13:58:21 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@ubuntu.local:38218/user/HeartbeatReceiver
15/08/15 13:58:21 INFO NettyBlockTransferService: Server created on 54173
15/08/15 13:58:21 INFO BlockManagerMaster: Trying to register BlockManager
15/08/15 13:58:21 INFO BlockManagerMasterActor: Registering block manager localhost:54173 with 267.3 MB RAM, BlockManagerId(<driver>, localhost, 54173)
15/08/15 13:58:21 INFO BlockManagerMaster: Registered BlockManager
15/08/15 13:58:21 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
15/08/15 13:58:21 INFO MemoryStore: ensureFreeSpace(28887) called with curMem=0, maxMem=280248975
15/08/15 13:58:21 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 28.2 KB, free 267.2 MB)
15/08/15 13:58:22 INFO MemoryStore: ensureFreeSpace(4959) called with curMem=28887, maxMem=280248975
15/08/15 13:58:22 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.8 KB, free 267.2 MB)
15/08/15 13:58:22 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54173 (size: 4.8 KB, free: 267.3 MB)
15/08/15 13:58:22 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/08/15 13:58:22 INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.java:11
15/08/15 13:58:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/08/15 13:58:22 WARN LoadSnappy: Snappy native library not loaded
15/08/15 13:58:22 INFO FileInputFormat: Total input paths to process : 1
15/08/15 13:58:22 INFO SparkContext: Starting job: count at SimpleApp.java:15
15/08/15 13:58:22 INFO DAGScheduler: Got job 0 (count at SimpleApp.java:15) with 2 output partitions (allowLocal=false)
15/08/15 13:58:22 INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.java:15)
15/08/15 13:58:22 INFO DAGScheduler: Parents of final stage: List()
15/08/15 13:58:22 INFO DAGScheduler: Missing parents: List()
15/08/15 13:58:22 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.java:13), which has no missing parents
15/08/15 13:58:22 INFO MemoryStore: ensureFreeSpace(2928) called with curMem=33846, maxMem=280248975
15/08/15 13:58:22 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.9 KB, free 267.2 MB)
15/08/15 13:58:22 INFO MemoryStore: ensureFreeSpace(2100) called with curMem=36774, maxMem=280248975
15/08/15 13:58:22 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.1 KB, free 267.2 MB)
15/08/15 13:58:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54173 (size: 2.1 KB, free: 267.3 MB)
15/08/15 13:58:22 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/08/15 13:58:22 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/08/15 13:58:22 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.java:13)
15/08/15 13:58:22 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/08/15 13:58:23 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1371 bytes)
15/08/15 13:58:23 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1371 bytes)
15/08/15 13:58:23 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/08/15 13:58:23 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/08/15 13:58:23 INFO Executor: Fetching http://192.168.40.129:54406/jars/simple-project-1.0.jar with timestamp 1439672300996
15/08/15 13:58:23 INFO Utils: Fetching http://192.168.40.129:54406/jars/simple-project-1.0.jar to /tmp/fetchFileTemp4857343201243967252.tmp
15/08/15 13:58:23 INFO Executor: Adding file:/tmp/spark-42ce2234-3577-41fd-87a6-2e7d9bbeb7b0/simple-project-1.0.jar to class loader
15/08/15 13:58:23 INFO CacheManager: Partition rdd_1_1 not found, computing it
15/08/15 13:58:23 INFO CacheManager: Partition rdd_1_0 not found, computing it
15/08/15 13:58:23 INFO HadoopRDD: Input split: file:/usr/local/sparks/examples/README.md:0+1822
15/08/15 13:58:23 INFO HadoopRDD: Input split: file:/usr/local/sparks/examples/README.md:1822+1823
15/08/15 13:58:23 INFO MemoryStore: ensureFreeSpace(5080) called with curMem=38874, maxMem=280248975
15/08/15 13:58:23 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 5.0 KB, free 267.2 MB)
15/08/15 13:58:23 INFO MemoryStore: ensureFreeSpace(5776) called with curMem=43954, maxMem=280248975
15/08/15 13:58:23 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 5.6 KB, free 267.2 MB)
15/08/15 13:58:23 INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:54173 (size: 5.0 KB, free: 267.3 MB)
15/08/15 13:58:23 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54173 (size: 5.6 KB, free: 267.2 MB)
15/08/15 13:58:23 INFO BlockManagerMaster: Updated info of block rdd_1_1
15/08/15 13:58:23 INFO BlockManagerMaster: Updated info of block rdd_1_0
15/08/15 13:58:24 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2326 bytes result sent to driver
15/08/15 13:58:24 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2326 bytes result sent to driver
15/08/15 13:58:24 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1053 ms on localhost (1/2)
15/08/15 13:58:24 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1156 ms on localhost (2/2)
15/08/15 13:58:24 INFO DAGScheduler: Stage 0 (count at SimpleApp.java:15) finished in 1.185 s
15/08/15 13:58:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/08/15 13:58:24 INFO DAGScheduler: Job 0 finished: count at SimpleApp.java:15, took 1.790430 s
15/08/15 13:58:24 INFO SparkContext: Starting job: count at SimpleApp.java:19
15/08/15 13:58:24 INFO DAGScheduler: Got job 1 (count at SimpleApp.java:19) with 2 output partitions (allowLocal=false)
15/08/15 13:58:24 INFO DAGScheduler: Final stage: Stage 1(count at SimpleApp.java:19)
15/08/15 13:58:24 INFO DAGScheduler: Parents of final stage: List()
15/08/15 13:58:24 INFO DAGScheduler: Missing parents: List()
15/08/15 13:58:24 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.java:17), which has no missing parents
15/08/15 13:58:24 INFO MemoryStore: ensureFreeSpace(2928) called with curMem=49730, maxMem=280248975
15/08/15 13:58:24 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 267.2 MB)
15/08/15 13:58:24 INFO MemoryStore: ensureFreeSpace(2098) called with curMem=52658, maxMem=280248975
15/08/15 13:58:24 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.0 KB, free 267.2 MB)
15/08/15 13:58:24 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54173 (size: 2.0 KB, free: 267.2 MB)
15/08/15 13:58:24 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/08/15 13:58:24 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/08/15 13:58:24 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.java:17)
15/08/15 13:58:24 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/08/15 13:58:25 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1371 bytes)
15/08/15 13:58:25 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1371 bytes)
15/08/15 13:58:25 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
15/08/15 13:58:25 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
15/08/15 13:58:25 INFO BlockManager: Found block rdd_1_0 locally
15/08/15 13:58:25 INFO BlockManager: Found block rdd_1_1 locally
15/08/15 13:58:25 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1757 bytes result sent to driver
15/08/15 13:58:25 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1757 bytes result sent to driver
15/08/15 13:58:25 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 50 ms on localhost (1/2)
15/08/15 13:58:25 INFO DAGScheduler: Stage 1 (count at SimpleApp.java:19) finished in 0.036 s
15/08/15 13:58:25 INFO DAGScheduler: Job 1 finished: count at SimpleApp.java:19, took 0.546082 s
Lines with a: 60, lines with b: 29
15/08/15 13:58:25 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 55 ms on localhost (2/2)
15/08/15 13:58:25 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
vaquarkhan@ubuntu:/usr/local/sparks$ 



You can also see the job's stages and storage details in the Spark UI. As the log above shows, this run's UI started at http://ubuntu.local:4041, since port 4040 was already in use.