Hadoop Real World Solutions Cookbook

Brian Femiano

Language: English

Pages: 316

ASIN: B00AIVQE3I

Format: PDF / Kindle (mobi) / ePub


In Detail

This book helps developers become more comfortable and proficient with solving problems in the Hadoop space. Readers will become familiar with a wide variety of Hadoop-related tools and best practices for their implementation.

Hadoop Real World Solutions Cookbook will teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.

Hadoop Real World Solutions Cookbook provides in-depth explanations and code examples. Each chapter contains a set of recipes that pose, and then solve, technical challenges, and the recipes can be completed in any order. A recipe breaks a single problem down into discrete steps that are easy to follow. The book covers loading and unloading data to and from HDFS; graph analytics with Giraph; batch data analysis using Hive, Pig, and MapReduce; machine learning approaches with Mahout; debugging and troubleshooting MapReduce jobs; and columnar storage and retrieval of structured data using Apache Accumulo.

Hadoop Real World Solutions Cookbook will give readers the examples they need to apply Hadoop technology to their own problems.

Approach

Cookbook recipes demonstrate Hadoop in action and then explain the concepts behind the code.

Who this book is for

This book is ideal for developers who wish to have a better understanding of Hadoop application development and its associated tools, and for developers who understand Hadoop conceptually but want practical examples of real-world applications.

Sample excerpts

...C/C++ using Yum, run the following command as the root user from a bash shell:

    # yum install gcc gcc-c++ autoconf automake

To compile and install Protocol Buffers, type the following lines of code:

    $ cd /path/to/protobuf
    $ ./configure
    $ make
    $ make check
    # make install
    # ldconfig

How to do it...

1. Set up the directory structure:

    $ mkdir test-protobufs
    $ mkdir test-protobufs/src
    $ mkdir test-protobufs/src/proto
    $ mkdir test-protobufs/src/java
    $ cd test-protobufs/src/proto

2. Next, create the...
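
The excerpt breaks off before showing what step 2 creates. As a hedged sketch only, the following Java fragment illustrates where such a recipe generally heads once a .proto definition has been compiled with protoc --java_out: a generated message class is built, serialized, and parsed back. The WeblogRecord message, its fields, and the surrounding class are invented for illustration and are not the book's actual code.

    import com.google.protobuf.InvalidProtocolBufferException;

    // Hypothetical: assumes a message named WeblogRecord was defined in a
    // .proto file under src/proto and compiled into src/java with protoc.
    public class ProtoRoundTrip {
        public static void main(String[] args) throws InvalidProtocolBufferException {
            // Build an immutable message through the generated builder
            WeblogRecord record = WeblogRecord.newBuilder()
                    .setIp("10.10.1.1")
                    .setTimestamp(System.currentTimeMillis())
                    .build();

            // Serialize to the compact binary wire format, then parse it back
            byte[] serialized = record.toByteArray();
            WeblogRecord parsed = WeblogRecord.parseFrom(serialized);
            System.out.println(parsed.getIp()); // prints 10.10.1.1
        }
    }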

...desc option:

    ordered_weblogs = ORDER nobots BY timestamp DESC;

See also

The following recipes will use Apache Pig:

- Using Apache Pig to sessionize web server log data
- Using Python to extend Apache Pig functionality
- Using MapReduce and secondary sort to calculate page views

Using Apache Pig to sessionize web server log data

A session represents a user's continuous interaction with a website, and the user session ends when an arbitrary activity...
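
The sessionization excerpt is cut short, so here is a small, self-contained Java sketch of the underlying idea, independent of Pig; the 30-minute inactivity cutoff and the timestamp-list input format are assumptions made for illustration, not the book's implementation.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Sketch: split one user's timestamp-sorted hits into sessions whenever
    // the gap between consecutive hits exceeds an inactivity threshold.
    public class Sessionize {
        // Assumed 30-minute cutoff; the book's actual threshold may differ.
        static final long TIMEOUT_MS = 30L * 60L * 1000L;

        static List<List<Long>> sessionize(List<Long> sortedHits) {
            List<List<Long>> sessions = new ArrayList<>();
            List<Long> current = new ArrayList<>();
            Long previous = null;
            for (long ts : sortedHits) {
                if (previous != null && ts - previous > TIMEOUT_MS) {
                    sessions.add(current); // gap too large: close the session
                    current = new ArrayList<>();
                }
                current.add(ts);
                previous = ts;
            }
            if (!current.isEmpty()) {
                sessions.add(current);
            }
            return sessions;
        }

        public static void main(String[] args) {
            // Two bursts of activity separated by a 59-minute gap
            List<Long> hits = Arrays.asList(0L, 60_000L, 3_600_000L, 3_660_000L);
            System.out.println(sessionize(hits).size()); // prints 2
        }
    }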

    ...Text, IntWritable> {

        private static final int country_pos = 1;
        private static final Pattern pattern = Pattern.compile("\\t");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String country = pattern.split(value.toString())[country_pos];
            context.write(new Text(country), new IntWritable(1));
        }
    }

3. Create a reducer that sums all of the country counts, and writes the output to separate files using MultipleOutputs:

    public...
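
The reducer code is truncated after the public keyword. The following is a hedged reconstruction of what a reducer matching that description could look like; the class name and the choice of key.toString() as the base output path are assumptions, not the book's exact code.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Sketch: sum the per-country counts and route each country's total to
    // its own output file via MultipleOutputs. Intended as a static nested
    // class inside the same job class as the mapper; names are illustrative.
    public static class CountryCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, IntWritable>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable value : values) {
                total += value.get();
            }
            // Use the country code as the base output path, e.g. US-r-00000
            mos.write(key, new IntWritable(total), key.toString());
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close();
        }
    }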

...[1,1,1,1,1,1]} = six occurrences of the IP "10.10.1.1". The implementation of the reduce() method will sum the integers and arrive at the correct total, but there's nothing that requires the integer values to be limited to the number 1. We can use a combiner to process the intermediate key-value pairs as they are output from each mapper and help improve the data throughput in the shuffle phase. Since the combiner is applied against the local map output, we may see a performance improvement as the...
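
As a hedged illustration of how a combiner is typically wired into a job driver (the IpCountMapper and IpCountReducer class names are assumed stand-ins for the excerpt's mapper and reducer):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    // Driver fragment (belongs inside a main() or run() method): reuse the
    // sum reducer as the combiner so each mapper pre-aggregates its own
    // output before the shuffle phase.
    Job job = Job.getInstance(new Configuration(), "ip counts");
    job.setMapperClass(IpCountMapper.class);    // assumed mapper class
    job.setCombinerClass(IpCountReducer.class); // combiner runs on local map output
    job.setReducerClass(IpCountReducer.class);  // assumed reducer class
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

Reusing the reducer as the combiner is only safe because integer summing is associative and commutative; a combiner that computed, say, an average directly would change the result.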

HftpFileSystem is a read-only filesystem. The distcp command has to be run from the destination server:

    hadoop distcp hftp://namenodeA:port/data/weblogs hdfs://namenodeB/data/weblogs

In the preceding command, port is defined by the dfs.http.address property in the hdfs-site.xml configuration file.

Importing data from MySQL into HDFS using Sqoop

Sqoop is an Apache project that is part of the broader Hadoop ecosystem. In many ways Sqoop is similar to distcp (see the Moving data efficiently...
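
The Sqoop excerpt ends just as the import recipe begins. As a hedged sketch only, Sqoop 1 can also be driven programmatically through its Sqoop.runTool entry point; the connection string, table, and target directory below are invented for illustration:

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) {
            // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
            // Password handling is omitted; all names here are hypothetical.
            int exitCode = Sqoop.runTool(new String[] {
                "import",
                "--connect", "jdbc:mysql://dbhost/logs",
                "--username", "hdp_usr",
                "--table", "weblogs",
                "--target-dir", "/data/weblogs/import"
            });
            System.exit(exitCode);
        }
    }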
