Getting Started with PySpark

Apache Spark is a powerful distributed computing framework commonly used for big data processing, ETL (Extract, Transform, Load), and building machine learning pipelines. It supports several programming languages, including Scala, Java, and Python, making it a versatile choice for data processing tasks. In this tutorial, we'll focus on installing Apache Spark on a macOS machine and running Spark jobs using PySpark, both with spark-submit and from a Jupyter Notebook.

Installing Apache Spark

Before diving into Spark, we need to ensure that we have all the necessary components installed.

1. Install Java

Java is a prerequisite for running Apache Spark; Spark 3.5 supports Java 8, 11, and 17. The old brew cask install java8 command no longer works with current Homebrew, so install a supported JDK instead, for example:

brew install openjdk@17

Homebrew's OpenJDK formulae are keg-only, so follow the caveats printed at the end of the installation (such as symlinking the JDK) to make the java command available on your PATH.

To verify the Java installation, run:

java -version

2. Install Command Line Tools

Ensure that you have Xcode Command Line Tools installed by running:

xcode-select --install

3. Install Scala

Spark itself is written in Scala, and having Scala installed is useful if you want to try the Scala shell (it isn't strictly required for PySpark). Install it using Homebrew:

brew install scala

To verify the Scala installation, run:

scala -version

4. Install Apache Spark Package

Now, let's install Apache Spark itself. With Homebrew, this is a breeze:

brew install apache-spark

This command installs Apache Spark along with its dependencies, including PySpark.
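
If you want to confirm that the PySpark sources are bundled with the Homebrew package, you can list the Python directory under Spark's libexec folder (the exact layout may vary between Homebrew releases):

ls "$(brew --prefix apache-spark)/libexec/python"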

5. Setting Environment Variables

To make Spark easily accessible from the command line, you'll want to add some environment variables. First, find out where Apache Spark is installed on your system:

brew info apache-spark

Assuming you're using the Zsh shell, add the following lines to your ~/.zshrc file (adjust the version, 3.5.0 here, to match the output of brew info):

export SPARK_HOME=/opt/homebrew/Cellar/apache-spark/3.5.0/libexec
export PATH=$PATH:$SPARK_HOME/sbin:$SPARK_HOME/bin
export PYTHONPATH=/opt/homebrew/Cellar/apache-spark/3.5.0/libexec/python

Save the file and run:

source ~/.zshrc

If you're using Bash, export the same variables in your ~/.bash_profile file instead.

After saving the file, run:

source ~/.bash_profile

SPARK_HOME points to the Spark installation directory. Adding $SPARK_HOME/bin and $SPARK_HOME/sbin to the PATH environment variable makes Spark's command-line scripts (such as spark-submit and start-all.sh) available from anywhere, and PYTHONPATH lets Python find the PySpark package bundled with the installation.

With these environment variables set, you can now access Spark commands from the terminal.
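
To confirm the PATH is set up correctly, open a new terminal (or the one where you sourced your profile) and print the Spark version:

spark-submit --version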

To start a local standalone Spark master and worker, which also serves the Master web UI, run:

cd $SPARK_HOME/sbin
./start-all.sh

This will launch the Spark Master UI at http://localhost:8080.

On macOS, the start-all.sh script uses SSH to launch the worker processes, so it needs to connect to localhost on port 22. To allow this, enable Remote Login under System Settings > Sharing.

To stop the master and worker, use:

./stop-all.sh

Running Spark Jobs

1. Spark Submit

Now that you have Apache Spark installed, let's run a simple Spark job using spark-submit. First, create a Python script called test.py:

from pyspark import SparkContext

# Create a SparkContext that runs locally, with the application name "PySpark Test"
sc = SparkContext("local", "PySpark Test")
print("Hello from Spark")
print("Spark Context >> ", sc)

# Release the context's resources once the job is done
sc.stop()

To execute this script using spark-submit, use the following command:

spark-submit test.py

Make sure you're in the same directory as the test.py file or provide the full path to the script.
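
spark-submit also accepts options that control where the job runs. For example, you can use all local CPU cores, or submit against the standalone master started earlier (the spark:// URL below assumes the default port 7077; use the exact URL shown at the top of the Master UI):

spark-submit --master "local[*]" test.py
spark-submit --master spark://localhost:7077 test.py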

2. PySpark with Jupyter Notebook

PySpark can also be integrated with Jupyter Notebook for interactive data analysis and exploration.

Install Required Packages

Before you can use PySpark in Jupyter Notebook, install the necessary packages:

pip install findspark notebook

Start Jupyter Notebook

Launch Jupyter Notebook by running:

jupyter-notebook

Initialize PySpark in Jupyter

In a Jupyter Notebook cell, import and initialize findspark:

import findspark
findspark.init()

This step helps locate the path to your Apache Spark installation and sets it up for your Jupyter session.
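
findspark.init() normally relies on the SPARK_HOME environment variable. If your Jupyter kernel doesn't inherit it (for example, when the notebook server was started outside the shell where you exported it), you can pass the path explicitly:

import findspark
findspark.init("/opt/homebrew/Cellar/apache-spark/3.5.0/libexec")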

Now, you can import PySpark and create a Spark session:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print("Spark Session >> ", spark)

You've successfully set up PySpark in Jupyter Notebook and can now process data interactively. Note: always call findspark.init() before importing pyspark. As a quick check, run a simple SQL query against the session:

spark_df = spark.sql("SELECT 'Hello from Spark' AS test_message")
spark_df.show()
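
From here, the full DataFrame API is available. As a minimal sketch (the column names and values below are just placeholders), you can build a small in-memory DataFrame and filter it:

# Create a DataFrame from a list of tuples with hypothetical columns
data = [("alice", 34), ("bob", 45), ("carol", 29)]
df = spark.createDataFrame(data, ["name", "age"])

# Keep only rows where age is greater than 30 and display the result
df.filter(df.age > 30).show()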

Extras

To simplify running PySpark within Jupyter Notebook, you can configure the pyspark command to launch a notebook session automatically. To do this, add the following environment variables to your ~/.zshrc or ~/.bash_profile:

export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8889'

Now, when you run the pyspark command in your terminal, it will start a Jupyter Notebook session with PySpark preconfigured.
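
One caveat worth noting (a common gotcha rather than documented behaviour): with PYSPARK_DRIVER_PYTHON set to jupyter, spark-submit may also try to launch a notebook for Python jobs. You can override the variable for a single command:

PYSPARK_DRIVER_PYTHON=python spark-submit test.py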


With these steps, you've set up Apache Spark and PySpark on your macOS machine and are ready to start working with distributed data processing and analysis. You can explore Spark's vast capabilities for big data processing and machine learning right from your local environment. Happy Sparking!