Apache Spark WordCount Scala Example

posted on Nov 20th, 2016

Apache Spark

Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

Prerequisites

1) A machine with Ubuntu 14.04 LTS operating system

2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)

3) Apache Spark 1.6.1 pre-installed (How to install Spark on Ubuntu 14.04)

Spark WordCount Scala Example

Step 1 - Change the directory to /usr/local/spark/sbin.

$ cd /usr/local/spark/sbin

Step 2 - Start the Spark standalone daemons (the Master and Workers).

$ ./start-all.sh

Step 3 - Verify that the daemons are running with jps (the Java Virtual Machine Process Status Tool). Note that jps only reports on JVMs for which it has access permissions.

$ jps
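
With Hadoop and the Spark daemons up, the listing should include a Master and a Worker process alongside the Hadoop daemons. An illustrative sketch, not captured output (process IDs will vary on your machine):

2401 NameNode
2563 DataNode
2788 ResourceManager
2950 NodeManager
3172 Master
3334 Worker
3401 Jps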


SparkWordCount.scala

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // "Word Count" is the application name. The master URL is deliberately
    // not hard-coded here; spark-submit supplies it (see Step 5), so the
    // same jar can run on the standalone cluster or on YARN.
    val conf = new SparkConf().setAppName("Word Count")
    val sc = new SparkContext(conf)

    // Create an input RDD by reading the text file (in.txt) through the Spark context.
    val inputFile = sc.textFile("/user/hduser/in.txt")

    // Transform the input RDD into the counts RDD: split each line into
    // words, map each word to the pair (word, 1), then sum the counts per word.
    val counts = inputFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // saveAsTextFile is an action: it triggers the computation on the RDD
    // and writes the result to the output directory.
    counts.saveAsTextFile("/user/hduser/outfile")
    println("OK")

    sc.stop()
  }
}
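
The post goes straight from the source file to packaging, so here is the missing compilation step as a minimal sketch. It assumes scalac (Scala 2.10.x, the version Spark 1.6.1 is built against) is on the PATH, and it reuses the assembly jar path from the installation above:

$ scalac -classpath /usr/local/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar SparkWordCount.scala

This produces the SparkWordCount*.class files packaged in Step 4.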

Step 4 - Package the compiled classes into a jar file. Spark's own jars do not need to be bundled into it; spark-submit puts them on the classpath at runtime.

$ jar -cvf /home/hduser/Desktop/1.6\ SPARK/SparkWordCountScala.jar SparkWordCount*.class
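
Optionally, confirm that the class files actually made it into the archive; jar -tvf lists a jar's contents:

$ jar -tvf /home/hduser/Desktop/1.6\ SPARK/SparkWordCountScala.jar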

Step 5 - Run the application with spark-submit. The command below submits to YARN; to run on the standalone cluster started in Step 2 instead, pass --master spark://127.0.0.1:7077.

$ spark-submit --class SparkWordCount --master yarn --deploy-mode cluster --executor-cores 1 --num-executors 1 /home/hduser/Desktop/1.6\ SPARK/SparkWordCountScala.jar
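
When the job finishes, the results are written to the output directory named in the code. Assuming the /user/hduser prefix refers to HDFS (as the Hadoop prerequisite suggests), you can inspect them like this; saveAsTextFile writes one part-xxxxx file per partition:

$ hdfs dfs -ls /user/hduser/outfile
$ hdfs dfs -cat /user/hduser/outfile/part-00000

Each output line is a (word,count) tuple.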
