Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.
Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.
With this book, you’ll explore:
- How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure
- The choice between data joins in Core Spark and Spark SQL
- Techniques for getting the most out of standard RDD transformations
- How to work around performance issues in Spark’s key/value pair paradigm
- Writing high-performance Spark code without Scala or the JVM
- How to test for functionality and performance when applying suggested improvements
- Using Spark MLlib and Spark ML machine learning libraries
- Spark’s Streaming components and external community packages
If you're training a machine learning model but aren't sure how to put it into production, this book will get you there. Kubeflow provides a collection of cloud native tools for different stages of a model's lifecycle, from data exploration, feature preparation, and model training to model serving. This guide helps data scientists build production-grade machine learning implementations with Kubeflow and shows data engineers how to make models scalable and reliable.
Using examples throughout the book, authors Holden Karau, Trevor Grant, Ilan Filonenko, Richard Liu, and Boris Lublinsky explain how to use Kubeflow to train and serve your machine learning models on top of Kubernetes in the cloud or in a development environment on-premises.
- Understand Kubeflow's design, core components, and the problems it solves
- Understand the differences between Kubeflow on different cluster types
- Train models using Kubeflow with popular tools including Scikit-learn, TensorFlow, and Apache Spark
- Keep your model up to date with Kubeflow Pipelines
- Understand how to capture model training metadata
- Explore how to extend Kubeflow with additional open source tools
- Use hyperparameter tuning for training
- Learn how to serve your model in production
About This Book
- Develop a machine learning system with Spark’s MLlib and scalable algorithms
- Deploy Spark jobs to various clusters such as Mesos, EC2, Chef, YARN, EMR, and so on
- This is a step-by-step tutorial that unleashes the power of Spark and its latest features
Who This Book Is For
Fast Data Processing with Spark - Second Edition is for software developers who want to learn how to write distributed programs with Spark. It will help developers whose problems were too big to handle on a single computer. No previous experience with distributed programming is necessary. This book assumes knowledge of Java, Scala, or Python.
What You Will Learn
- Install and set up Spark on your cluster
- Prototype distributed applications with Spark's interactive shell
- Learn different ways to interact with Spark's distributed representation of data (RDDs)
- Query Spark with a SQL-like query syntax
- Effectively test your distributed software
- Recognize how Spark works with big data
- Implement machine learning systems with highly scalable algorithms
Spark is a framework for writing fast, distributed programs. Spark solves problems similar to those Hadoop MapReduce addresses, but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with Hadoop and its built-in tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (GraphX), and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big datasets.
Fast Data Processing with Spark - Second Edition covers how to write distributed programs with Spark. The book guides you through every step required to write effective distributed programs, from setting up your cluster and interactively exploring the API to developing analytics applications and tuning them for your purposes.