From news and speeches to informal chatter on social media, natural language is one of the richest and most underutilized sources of data. Not only does it come in a constant stream, always changing and adapting in context; it also contains information that is not conveyed by traditional data sources. The key to unlocking natural language is the creative application of text analytics. This practical book presents a data scientist’s approach to building language-aware products with applied machine learning.
You’ll learn robust, repeatable, and scalable techniques for text analysis with Python, including contextual and linguistic feature engineering, vectorization, classification, topic modeling, entity resolution, graph analysis, and visual steering. By the end of the book, you’ll be equipped with practical methods to solve any number of complex real-world problems.
- Preprocess and vectorize text into high-dimensional feature representations
- Perform document classification and topic modeling
- Steer the model selection process with visual diagnostics
- Extract key phrases, named entities, and graph structures to reason about data in text
- Build a dialog framework to enable chatbots and language-driven interaction
- Use Spark to scale processing power and neural networks to scale model complexity
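The first two bullets above — vectorizing text into high-dimensional features and classifying documents — can be sketched in a few lines with scikit-learn (the toolkit the book builds on). The tiny corpus, labels, and query below are invented for illustration, not drawn from the book:

```python
# Minimal text-classification sketch: turn a tiny corpus into sparse
# TF-IDF features, then fit a Naive Bayes classifier on them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

corpus = [
    "the market rallied on strong earnings",
    "the team won the championship game",
    "stocks fell amid rate fears",
    "the striker scored in extra time",
]
labels = ["finance", "sports", "finance", "sports"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)  # sparse, high-dimensional features

clf = MultinomialNB().fit(X, labels)
prediction = clf.predict(vectorizer.transform(["stocks rallied on earnings"]))
print(prediction[0])  # -> finance
```

In a real pipeline the vectorizer and classifier would be wrapped in a `Pipeline` and cross-validated; this sketch only shows the vectorize-then-classify shape the bullets describe.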
Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of the deployment, operations, or software development usually associated with distributed computing, you’ll focus on the particular analyses you can build, the data warehousing techniques Hadoop provides, and the higher-order data workflows this framework can produce.
Data scientists and analysts will learn how to perform a wide range of techniques, from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data systems available to build and empower data products that can handle—and actually require—huge amounts of data.
- Understand core concepts behind Hadoop and cluster computing
- Use design patterns and parallel analytical algorithms to create distributed data analysis jobs
- Learn about data management, mining, and warehousing in a distributed context using Apache Hive and HBase
- Use Sqoop and Apache Flume to ingest data from relational databases
- Program complex Hadoop and Spark applications with Apache Pig and Spark DataFrames
- Perform machine learning techniques such as classification, clustering, and collaborative filtering with Spark’s MLlib
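The MapReduce applications mentioned above all follow the same map/shuffle/reduce shape. That shape can be sketched in pure Python, with no cluster required; Hadoop Streaming or Spark would distribute these same three phases across many machines. The sample lines are invented:

```python
# Conceptual MapReduce word count in pure Python: the map, shuffle,
# and reduce phases that Hadoop would run across a cluster.
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, like a Hadoop mapper.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    # Group values by key, like the Hadoop shuffle/sort phase.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum each word's counts, like a Hadoop reducer.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insight", "data pipelines move data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["data"])  # -> 3
```

The value of the framework is that each phase is trivially parallel: mappers and reducers run independently on partitions of the data, which is what lets the same logic scale from this toy example to terabytes.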
- Learn how to tackle every step in the data science pipeline and use it to acquire, clean, analyze, and visualize data
- Get beyond the theory with real-world projects
- Expand your numerical programming skills through step-by-step code examples and learn more about the robust features of R and Python
Data's value has grown exponentially in the past decade, with 'Big Data' today being one of the biggest buzzwords in business and IT, and the data scientist hailed as 'the sexiest job of the 21st century'. Practical Data Science Cookbook helps you see beyond the hype and get past the theory by providing you with a hands-on exploration of data science. With a comprehensive range of recipes designed to help you learn fundamental data science tasks, you'll uncover practical steps to help you produce powerful insights into Big Data using R and Python.
Use this valuable data science book to discover tricks and techniques to get to grips with your data. Learn effective data visualization with an automobile fuel efficiency data project, analyze football statistics, learn how to create data simulations, and work with stock market data to learn data modelling. Find out how to produce sharp insights into social media data by following data science tutorials that demonstrate the best ways to tackle Twitter data, and uncover recipes that will help you dive in and explore Big Data through movie recommendation databases.
Practical Data Science Cookbook is your essential companion to the real-world challenges of working with data, created to give you a deeper insight into a world of Big Data that promises to keep growing.
What you will learn
- Follow the recipes in this essential data science cookbook to learn the fundamentals of data science and data analysis
- Put theory into practice with a selection of real-world Big Data projects
- Learn the data science pipeline and successfully structure your data science project
- Find out how to clean, munge, and manipulate data
- Learn different approaches to data modelling and how to determine the most appropriate for your data
- Learn numerical computing with NumPy and SciPy
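As a small taste of the numerical computing the last bullet refers to, the sketch below fits a straight line to noisy points with NumPy's `polyfit` (the book also covers SciPy; the synthetic data here is invented for illustration):

```python
# Numerical-computing sketch: generate noisy points around the line
# y = 2.5x + 1, then recover slope and intercept by least squares.
import numpy as np

rng = np.random.default_rng(seed=42)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")
```

With 50 points and modest noise, the recovered coefficients land close to the true values of 2.5 and 1.0, which is the kind of model-fitting exercise the data modelling recipes walk through.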
About the Authors
Tony Ojeda is the founder of District Data Labs, a cofounder of Data Community DC, and is actively involved in promoting data science education through both organizations.
Sean Patrick Murphy spent 15 years as a senior scientist at The Johns Hopkins University Applied Physics Laboratory, where he focused on machine learning, modeling and simulation, signal processing, and high-performance computing in the cloud. He now works as an advisor and data consultant for companies in SF, NY, and DC.
Benjamin Bengfort has worked in military, industry, and academia for the past 8 years. He is currently pursuing his PhD in Computer Science at the University of Maryland, College Park, researching Metacognition and Natural Language Processing.
Abhijit Dasgupta is a data consultant working in the greater DC-Maryland-Virginia area, with several years of experience in biomedical consulting, business analytics, bioinformatics, and bioengineering consulting.
Table of Contents
- Preparing Your Data Science Environment
- Driving Visual Analysis with Automobile Data (R)
- Simulating American Football Data (R)
- Modeling Stock Market Data (R)
- Visually Exploring Employment Data (R)
- Creating Ap