Spark: The Definitive Guide: Big Data Processing Made Simple Paperback – 9 March 2018
From the Publisher
Spark’s toolkit: all the components and libraries Spark offers to end users.
What Is Apache Spark?
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open source engine for this task, making it a standard tool for any developer or data scientist interested in big data. Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and to scale up to big data processing at an incredibly large scale.
Although the project has existed for multiple years (first as a research project started at UC Berkeley in 2009, then at the Apache Software Foundation since 2013), the open source community is continuing to build more powerful APIs and high-level libraries over Spark, so there is still a lot to write about the project. We decided to write this book for two reasons. First, we wanted to present the most comprehensive book on Apache Spark, covering all of the fundamental use cases with easy-to-run examples. Second, we especially wanted to explore the higher-level "structured" APIs that were finalized in Apache Spark 2.0 (namely DataFrames, Datasets, Spark SQL, and Structured Streaming), which older books on Spark don’t always include. We hope this book gives you a solid foundation to write modern Apache Spark applications using all the available tools in the project.
Who This Book Is For
We designed this book mainly for data scientists and data engineers looking to use Apache Spark. The two roles have slightly different needs, but in reality, most application development covers a bit of both, so we think the material will be useful in both cases. Specifically, in our minds, the data scientist workload focuses more on interactively querying data to answer questions and build statistical models, while the data engineer job focuses on writing maintainable, repeatable production applications, either to use the data scientist’s models in practice or just to prepare data for further analysis (e.g., building a data ingest pipeline). However, we often see with Spark that these roles blur. For instance, data scientists are able to package production applications without too much hassle, and data engineers use interactive analysis to understand and inspect their data to build and maintain pipelines.
While we tried to provide everything data scientists and engineers need to get started, there are some things we didn’t have space to focus on in this book. First, this book does not include in-depth introductions to some of the analytics techniques you can use in Apache Spark, such as machine learning. Instead, we show you how to invoke these techniques using libraries in Spark, assuming you already have a basic background in machine learning. Many full, standalone books exist to cover these techniques in formal detail, so we recommend starting with those if you want to learn about these areas. Second, this book focuses more on application development than on operations and administration (e.g., how to manage an Apache Spark cluster with dozens of users). Nonetheless, we have tried to include comprehensive material on monitoring, debugging, and configuration in Parts V and VI of the book to help engineers get their applications running efficiently and tackle day-to-day maintenance. Finally, this book places less emphasis on the older, lower-level APIs in Spark (specifically RDDs and DStreams) in order to introduce most of the concepts using the newer, higher-level structured APIs. Thus, the book may not be the best fit if you need to maintain an old RDD or DStream application, but it should be a great introduction to writing new applications.
About the Authors
Bill Chambers is a Product Manager at Databricks focusing on large-scale analytics, strong documentation, and collaboration across the organization to help customers succeed with Spark and Databricks. He has a Master's degree in Information Systems from the UC Berkeley School of Information, where he focused on data science.
Matei Zaharia is an assistant professor of computer science at Stanford University and Chief Technologist at Databricks. He started the Spark project at UC Berkeley in 2009, where he was a PhD student, and he continues to serve as its vice president at Apache. Matei also co-started the Apache Mesos project and is a committer on Apache Hadoop. Matei's research work was recognized through the 2014 ACM Doctoral Dissertation Award and the VMware Systems Research Award.
- Publisher: O'Reilly Media, Inc, USA; 1st edition (9 March 2018)
- Language: English
- Paperback: 606 pages
- ISBN-10: 1491912219
- ISBN-13: 978-1491912218
- Dimensions: 17.53 x 3.05 x 23.11 cm
- Best Sellers Rank: 136,354 in Books
Top reviews from other countries
I’m bookmarking virtually every 3rd page because there are such good examples.
Some spelling errors here and there, but well worth the money.