Apache Spark Shell Scala Examples
Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
Prerequisites
1) A machine with the Ubuntu 14.04 LTS operating system
2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)
3) Apache Spark 1.6.1 pre-installed (How to install Spark on Ubuntu 14.04)
Spark Shell Scala Examples
Step 1 - Change the directory to /usr/local/spark/sbin.
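Assuming Spark is installed under /usr/local/spark (as in the linked installation guide), this is simply:

```shell
cd /usr/local/spark/sbin
```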
Step 2 - Start all spark daemons.
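The standalone daemons are started with the start-all.sh script in that directory:

```shell
# starts the Spark Master and Worker daemons of the standalone cluster
./start-all.sh
```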
Step 3 - Use the jps (Java Virtual Machine Process Status Tool) command to verify that the daemons are running. Note that jps reports only on JVMs for which it has access permissions.
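For example:

```shell
# lists the running JVM processes; with Hadoop and Spark both up you
# should see entries such as NameNode, DataNode, Master and Worker
jps
```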
Step 4 - Use the following command to open the Spark shell.
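A sketch, assuming the installation path above:

```shell
/usr/local/spark/bin/spark-shell
```

When the shell comes up it prints the Spark version banner and leaves you at a `scala>` prompt.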
Broadcast Variables. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Accumulators. Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums.
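A minimal sketch of both features in the Spark 1.6 shell; the sample values are arbitrary:

```scala
// Broadcast a read-only value once to every machine instead of
// shipping a copy of it with each task
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value // Array(1, 2, 3)

// Accumulator: tasks only add to it; the driver reads the result
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value // 10
```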
Read the JSON Document
1) By default, the SparkContext object is initialized with the name sc when the spark-shell starts.
2) Store the employee.json file in HDFS.
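One way to do this, assuming a /user/hadoop home directory in HDFS and a record layout inferred from the later steps (fields id, name, and age):

```shell
# employee.json is assumed to contain one JSON object per line, e.g.
# {"id":"1201", "name":"satish", "age":"25"}
hdfs dfs -put employee.json /user/hadoop/employee.json
```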
3) First, we have to read the JSON document and, based on it, generate a DataFrame named dfs.
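In the Spark 1.6 shell, sqlContext is available alongside sc; a sketch, with the path adjusted to wherever you stored the file:

```scala
// read the JSON document into a DataFrame
val dfs = sqlContext.read.json("employee.json")
```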
4) If you want to see the data in the DataFrame, then use this command.
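Continuing from the previous step:

```scala
// prints the DataFrame rows as a table
dfs.show()
```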
5) If you want to see the Structure (Schema) of the DataFrame, then use this command.
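That is:

```scala
// prints the inferred schema (field names and types)
dfs.printSchema()
```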
6) Use this command to fetch the name column from among the three columns of the DataFrame.
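A sketch:

```scala
// project only the name column and display it
dfs.select("name").show()
```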
7) Use the following command to find the employees whose age is greater than 23 (age > 23).
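For example:

```scala
// keep only the rows where age is greater than 23
dfs.filter(dfs("age") > 23).show()
```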
8) Use the following command to count the number of employees who are of the same age.
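A sketch using a group-by aggregation:

```scala
// group the rows by age and count the employees in each group
dfs.groupBy("age").count().show()
```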
Read the Text Document
1) Store the employee.txt file in HDFS.
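As with the JSON file, one way to do this, assuming a /user/hadoop home directory and a comma-separated layout inferred from the later steps:

```shell
# employee.txt is assumed to hold comma-separated id,name,age records, e.g.
# 1201,satish,25
hdfs dfs -put employee.txt /user/hadoop/employee.txt
```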
2) By default, the SparkContext object is initialized with the name sc when the spark-shell starts.
3) Import all the SQL functions used to implicitly convert an RDD to a DataFrame.
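In the Spark 1.6 shell this is:

```scala
// brings toDF() and related conversions into scope
import sqlContext.implicits._
```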
4) Define a schema for the employee record data using a case class.
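With the id, name, and age fields used throughout this example:

```scala
// one field per column of the employee record
case class Employee(id: Int, name: String, age: Int)
```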
5) Generate an RDD named empl by reading the data from employee.txt, then convert it into a DataFrame using the map functions.
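A sketch, assuming the comma-separated id,name,age layout described above:

```scala
val empl = sc.textFile("employee.txt")
  .map(_.split(","))                                        // split each line into fields
  .map(e => Employee(e(0).trim.toInt, e(1), e(2).trim.toInt)) // map fields onto the case class
  .toDF()                                                   // convert the RDD to a DataFrame
```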
6) Store the DataFrame data into a table named employee.
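In the Spark 1.6 API this is done by registering a temporary table:

```scala
// makes the DataFrame queryable by name in SQL statements
empl.registerTempTable("employee")
```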
7) Use the variable allrecords to capture all of the records' data.
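For example:

```scala
// select every row of the registered employee table
val allrecords = sqlContext.sql("SELECT * FROM employee")
```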
8) To display those records, call the show() method on it.
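That is:

```scala
allrecords.show()
```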
9) The variable agefilter stores the records of employees whose ages are between 20 and 35.
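A sketch:

```scala
// keep employees aged 20 to 35 inclusive
val agefilter = sqlContext.sql(
  "SELECT * FROM employee WHERE age >= 20 AND age <= 35")
```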
10) To see the result data of the agefilter DataFrame, use this command.
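That is:

```scala
agefilter.show()
```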
11) Use the following command to fetch the ID values from the agefilter result, using the field index.
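Since id is the first column, field index 0 selects it from each row:

```scala
// extract field 0 (the id) of every row and print the values on the driver
agefilter.map(t => "ID: " + t(0)).collect().foreach(println)
```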
Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext. Using HiveContext, you can create and find tables in the HiveMetaStore and write queries on it using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext. When not configured by the hive-site.xml, the context automatically creates a metastore called metastore_db and a folder called warehouse in the current directory.
1) Initialize the HiveContext in the Spark shell.
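A sketch; the variable name hiveContext is just the one used in this walkthrough:

```scala
// HiveContext wraps the existing SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
```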
2) Create a table named employee with the fields id, name, and age. Here, we are using the CREATE statement of HiveQL syntax.
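Continuing with the hiveContext variable from step 1 and the comma-separated record layout used earlier:

```scala
hiveContext.sql(
  """CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, age INT)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LINES TERMINATED BY '\n'""".stripMargin)
```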
3) Load the employee record data into the employee table.
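For example, loading from the local filesystem (adjust the path to wherever employee.txt lives):

```scala
hiveContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
```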
4) Now you can execute SQL queries against the table. Use the following command to fetch all records using a HiveQL SELECT query.
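A sketch:

```scala
// the query returns a DataFrame
val result = hiveContext.sql("SELECT id, name, age FROM employee")
```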
5) To display the record data, call the show() method on the result DataFrame.
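That is:

```scala
result.show()
```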
Labels: Spark Standalone Mode Installation, Spark Cluster Mode Installation, Spark With YARN Configuration, Spark WordCount Java Example, Spark submit-script Usage, Spark Shell Usage, Spark WordCount Scala Example