Awesome Spark
A curated list of awesome Apache Spark packages and resources.
Apache Spark is an open-source cluster-computing framework. Originally
developed at the University of California, Berkeley’s AMPLab, the Spark
codebase was later donated to the Apache Software Foundation, which has
maintained it since. Spark provides an interface for programming entire
clusters with implicit data parallelism and fault-tolerance
(Wikipedia 2017).
Users of Apache Spark may choose between the Python, R, Scala, and Java
programming languages to interface with the Apache Spark APIs.
Contents
Packages
Language Bindings
Notebooks and IDEs
-
almond
- A Scala kernel for Jupyter.
-
Apache Zeppelin
- Web-based notebook that enables interactive data analytics with
pluggable backends, integrated plotting, and extensive Spark support
out-of-the-box.
-
Polynote
- An IDE-inspired polyglot notebook. It supports mixing multiple
languages in one notebook and sharing data between them seamlessly. It
encourages reproducible notebooks with its immutable data model.
Originated at Netflix.
-
Spark Notebook
- Scalable and stable Scala- and Spark-focused notebook bridging the gap
between the JVM and data scientists (incl. extendable, type-safe, and
reactive charts).
-
sparkmagic
- Jupyter magics and kernels for interactively working with remote Spark
clusters through Livy, in Jupyter notebooks.
General Purpose Libraries
-
Succinct
- Support for efficient queries on compressed data.
-
itachi
- A library that brings useful functions from modern database management
systems to Apache Spark.
-
spark-daria
- A Scala library with essential Spark functions and extensions to make
you more productive.
-
quinn
- A native PySpark implementation of spark-daria.
-
Apache DataFu
- A library of general-purpose functions and UDFs.
SQL Data Sources
Spark SQL has several built-in data sources for files. These include
csv, json, parquet, orc, and avro. It also supports JDBC databases as
well as Apache Hive. Additional data sources can be added by including
the packages listed below, or by writing your own.
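As a rough illustration of how these data sources are selected, the sketch below wraps Spark's generic `DataFrameReader` (`spark.read.format(...).load(...)`). The helper name `read_source` and every path and option value in the comments are hypothetical and exist only for this example. The `spark` argument is assumed to be an existing `pyspark.sql.SparkSession`.

```python
# Short names of the file-based sources listed above. Note that avro ships as
# a separate spark-avro module that may need to be added to the application.
BUILT_IN_FILE_FORMATS = ("csv", "json", "parquet", "orc", "avro")


def read_source(spark, fmt, path=None, **options):
    """Load a DataFrame through the generic DataFrameReader.

    spark   -- assumed to be a pyspark.sql.SparkSession
    fmt     -- a built-in short name (e.g. "csv", "jdbc") or the format name
               registered by a third-party data source package
    path    -- file or directory to read; omit for sources such as JDBC
    options -- reader options passed through unchanged (e.g. header="true")
    """
    reader = spark.read.format(fmt)
    if options:
        reader = reader.options(**options)
    # Path-less sources (JDBC, for instance) are loaded without an argument.
    return reader.load(path) if path is not None else reader.load()


# Hypothetical usage, given a running SparkSession named `spark`:
#   people = read_source(spark, "csv", "/data/people.csv", header="true")
#   events = read_source(spark, "parquet", "/data/events.parquet")
#   orders = read_source(spark, "jdbc",
#                        url="jdbc:postgresql://db-host/shop",
#                        dbtable="orders")
```

A third-party package from the lists below would typically be used the same way, by passing the format name that package documents in place of a built-in short name.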
Storage
-
Delta Lake
- Storage layer with ACID transactions.
-
ADAM
- Set of tools designed to analyse genomics data.
-
Hail
- Genetic analysis framework.
GIS
-
Magellan
- Geospatial analytics using Spark.
-
GeoSpark
- Cluster computing system for processing large-scale spatial data.
Time Series Analytics
-
Spark-Timeseries
- Scala / Java / Python library for interacting with time series data on
Apache Spark.
-
flint
- A time series library for Apache Spark.
Graph Processing
-
Mazerunner
- Graph analytics platform on top of Neo4j and GraphX.
-
GraphFrames
- DataFrame-based graph API.
-
neo4j-spark-connector
- Bolt protocol-based Neo4j connector with RDD, DataFrame, and GraphX /
GraphFrames support.
-
SparklingGraph
- Library extending GraphX features with multiple functionalities useful
in graph analytics (measures, generators, link prediction etc.).
Machine Learning Extension
Middleware
-
Livy
- REST server with extensive language support (Python, R, Scala),
ability to maintain interactive sessions and object sharing.
-
spark-jobserver
- Simple Spark-as-a-service which supports object sharing using
so-called named objects. JVM only.
-
Mist
- Service for exposing Spark analytical jobs and machine learning models
as real-time, batch, or reactive web services.
-
Apache Toree
- IPython protocol-based middleware for interactive applications.
-
Kyuubi
- Improved implementation of Thrift JDBC/ODBC Server.
Monitoring
Utilities
-
silex
- Collection of tools varying from ML extensions to additional RDD
methods.
-
sparkly
- Helpers & syntactic sugar for PySpark.
-
pyspark-stubs
- Static type annotations for PySpark (obsolete since Spark 3.1; see
SPARK-32681).
-
Flintrock
- A command-line tool for launching Spark clusters on EC2.
-
Optimus
- Data Cleansing and Exploration utilities with the goal of simplifying
data cleaning.
Natural Language Processing
Streaming
-
Apache Bahir
- Collection of the streaming connectors excluded from Spark 2.0 (Akka,
MQTT, Twitter, ZeroMQ).
Interfaces
-
Apache Beam
- Unified data processing engine supporting both batch and streaming
applications. Apache Spark is one of the supported execution
environments.
-
Blaze
- Interface for querying larger-than-memory datasets using Pandas-like
syntax. It supports both Spark DataFrames and RDDs.
-
Koalas
- Pandas DataFrame API on top of Apache Spark.
Testing
-
deequ
- Deequ is a library built on top of Apache Spark for defining “unit
tests for data”, which measure data quality in large datasets.
-
spark-testing-base
- Collection of base test classes.
-
spark-fast-tests
- A lightweight and fast testing framework.
Web Archives
Workflow Management
Resources
Books
Papers
MOOCs
Workshops
Projects Using Spark
-
Oryx 2 - Lambda architecture platform built on Apache Spark and
Apache Kafka with specialization for real-time, large-scale machine
learning.
-
Photon ML - A
machine learning library supporting classical Generalized Mixed Model
and Generalized Additive Mixed Effect Model.
-
PredictionIO - Machine Learning
server for developers and data scientists to build and deploy predictive
applications in a fraction of the time.
-
Crossdata - Data
integration platform with extended DataSource API and multi-user
environment.
Blogs
-
Spark Technology Center - Great source of highly diverse posts related
to the Spark ecosystem, from practical advice to Spark committer
profiles.
Docker Images
Miscellaneous
References
Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.”
https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.
License
This work (Awesome Spark, by
https://github.com/awesome-spark/awesome-spark), identified by
Maciej Szymkiewicz, is free of known copyright restrictions.
Apache Spark, Spark, Apache, and the Spark logo are
trademarks of
The Apache Software Foundation. This
compilation is not endorsed by The Apache Software Foundation.
Inspired by
sindresorhus/awesome.