The Ultimate PySpark Cheat Sheet: A Data Engineer’s Best Friend

Pratik Barjatiya
3 min read · Apr 5, 2024

Are you a data engineer looking to master PySpark and streamline your data processing tasks? Look no further! In this comprehensive cheat sheet, we’ll cover everything you need to know to become a PySpark pro in no time. From basic operations to advanced techniques, consider this your go-to resource for all things PySpark.


What is PySpark?

PySpark is a powerful Python library for processing large-scale data in parallel and distributed computing environments. It provides an interface to Apache Spark, a fast and general-purpose cluster computing system, making it an essential tool for data engineers working with big data.

Getting Started: Basics of PySpark

Let’s kick things off with some basic PySpark operations:

Initializing SparkSession: To start using PySpark, you need to create a SparkSession, which serves as the entry point to Spark functionality.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Cheat Sheet") \
    .getOrCreate()
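If you are experimenting on a single machine, the builder also accepts a master URL and arbitrary configuration options. A minimal sketch, assuming local mode (the master URL and the shuffle-partitions value here are illustrative choices, not requirements):

spark = (
    SparkSession.builder
    .appName("PySpark Cheat Sheet")
    .master("local[*]")                           # run Spark locally on all available cores
    .config("spark.sql.shuffle.partitions", "8")  # smaller shuffle parallelism suits small local datasets
    .getOrCreate()
)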

Loading Data: PySpark supports various file formats like CSV, JSON, Parquet, etc. You can load data into a DataFrame using the read method.

df = spark.read.csv("data.csv", header=True, inferSchema=True)
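The same reader handles the other formats mentioned above; a quick sketch (the file names are placeholders):

df_json = spark.read.json("data.json")          # JSON, one record per line by default
df_parquet = spark.read.parquet("data.parquet") # Parquet carries its own schema
df_generic = spark.read.format("csv").option("header", "true").load("data.csv")  # generic format/option API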

Exploring Data: Once you have loaded the data, you can perform basic exploration using methods like show, printSchema, and describe.

df.show()
df.printSchema()
df.describe().show()
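A few other inspection calls worth keeping nearby, all standard DataFrame methods:

df.count()    # number of rows
df.columns    # list of column names
df.dtypes     # list of (column name, type) pairs
df.show(5, truncate=False)  # first 5 rows without truncating long values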

Advanced PySpark Techniques

Now that you’re familiar with the basics, let’s dive into some advanced PySpark techniques:

Data Manipulation: PySpark provides powerful functions for data manipulation, including select, filter, groupBy, orderBy, and more.

from pyspark.sql.functions import col, desc, count, sum, max, min, avg  # note: sum/max/min shadow Python built-ins here

df.select("col1", "col2").filter(col("col1") > 100).show()

df = orders.join(products, "order_id", "inner")  # join two DataFrames on a common key (inner, left, right, full, ...)
df.join(df2, "common_column").groupBy("some_column").count().orderBy(desc("count")).show()  # join, count per group, sort by frequency


# Group by a key and apply a single aggregation
df1 = df.groupBy("cust_id").agg(sum("amount").alias("bill"))

# Group by a key and apply several aggregations at once
df.groupBy("col1").agg(
    count("col2").alias("count"),
    sum("col2").alias("sum"),
    max("col2").alias("maximum"),
    min("col2").alias("minimum"),
    avg("col2").alias("average"),
).show()


df.drop("column_name1", "column_name2", "column_name3") #droping columns
df.drop(col("column_name")) #another way of dropping columns

df.createOrReplaceTempView("view_name")  # register the DataFrame as a temporary SQL view
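# Sketch of the follow-up (assumed view name "orders_view", columns cust_id/amount as above):
# once registered, the view can be queried with Spark SQL and the result is a regular DataFrame.
df.createOrReplaceTempView("orders_view")
spark.sql("SELECT cust_id, SUM(amount) AS bill FROM orders_view GROUP BY cust_id").show()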

df.orderBy(desc("column_name")).first()  # first row when sorted descending (e.g. highest salary)
df.orderBy(col("column_name").desc()).first()  # another way of getting the top record
df.orderBy(col("column_name").desc()).limit(5).show()  # top 5 records

# Filter rows on any column
df.filter(df.column_name == some_value).show()

# Select specific columns and apply filters
df.select("column1", "column2", "column3").where(col("column2") == "some_value").show()
df.select("column1").where(col("column1") > 100).show(5)
df.sort("column1").show()  # sort ascending by a column

# Rename an existing column
df = df.withColumnRenamed("existing_column_name", "new_column_name")
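To see how these pieces fit together, here is a small self-contained sketch; the data and column names are invented for illustration, but the chain of filter, groupBy, agg, and orderBy is exactly what the snippets above use:

from pyspark.sql.functions import col, desc, sum

sales = spark.createDataFrame(
    [("alice", "books", 120), ("bob", "books", 80), ("alice", "games", 200)],
    ["cust_id", "category", "amount"],
)

(sales
    .filter(col("amount") > 100)       # keep rows with amount over 100
    .groupBy("cust_id")                # group by customer
    .agg(sum("amount").alias("bill"))  # total amount per customer
    .orderBy(desc("bill"))             # largest bill first
    .show())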

User-Defined Functions (UDFs): You can define custom functions and apply them to DataFrame columns using UDFs.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_udf(value):
    return value.upper()

udf_my_udf = udf(my_udf, StringType())
df.withColumn("new_col", udf_my_udf("col")).show()
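Two practical notes: the same Python function can also be registered for use from SQL, and for something as simple as upper-casing, a built-in column function avoids the Python serialization overhead of a UDF. A short sketch (the registered name is arbitrary):

from pyspark.sql.functions import col, upper

spark.udf.register("my_udf_sql", my_udf, StringType())  # callable from Spark SQL as my_udf_sql(...)

df.withColumn("new_col", upper(col("col"))).show()  # built-in equivalent, no Python UDF round-trip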

Machine Learning: PySpark includes MLlib, a scalable machine learning library, for building and training machine learning models.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
df = assembler.transform(df)
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(df)
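After fitting, predictions come back as a DataFrame with a prediction column appended, and the learned parameters live on the model object. A short sketch (in practice you would score a held-out split rather than the training data):

predictions = model.transform(df)  # appends a "prediction" column
predictions.select("features", "label", "prediction").show(5)

print(model.coefficients, model.intercept)  # learned linear parameters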

Conclusion

Congratulations! You’ve now mastered the essentials of PySpark. With this cheat sheet by your side, you’ll be able to tackle even the most complex data engineering tasks with ease. Keep practicing and exploring the vast capabilities of PySpark, and you’ll soon become a PySpark expert!

So, what are you waiting for? Dive into PySpark and unleash the power of big data processing like never before.

Happy coding!

If you found this PySpark cheat sheet helpful, don’t forget to give it a like and clap! Your support motivates us to continue creating valuable content for data engineers like you.

Moreover, don’t miss out on future updates and new articles! Subscribe to our Medium publication to stay updated with the latest tips, tricks, and tutorials on PySpark and other data engineering topics.

And if you’re feeling extra generous and would like to support our work, consider buying us a coffee through Buy Me a Coffee. Your contribution helps us keep the lights on and fuels our passion for sharing knowledge with the community.

Thank you for your support, and happy data engineering!
