Hadoop Streaming Made Simple using Joins and Keys with Python « All Things Hadoop
Views: 6154
Published: 2019-06-21


There are a lot of different ways to write MapReduce jobs!!!

I find streaming scripts a good way to interrogate data sets (especially ones I have not worked with yet or am creating from scratch), and I enjoy the lifecycle in which the initial exploration of the data sets leads to the construction of the finalized scripts for an entire job (or series of jobs, as is often the case).

When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer there is an awesome framework available, and Python programmers have a couple of libraries to choose from as well.

I like working under the hood myself and getting down and dirty with the data and here is how you can too.

Let's start by defining two simple sample data sets.

Data set 1:  countries.dat

name|key
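The sample rows themselves were lost from this copy; hypothetical rows in the name|key format (with a two-character country code as the key) might be:

```
United States|US
Canada|CA
United Kingdom|UK
Italy|IT
```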

 

 

Data set 2: customers.dat

name|type|country
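The customer rows were also lost; hypothetical rows in the name|type|country format might be:

```
Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
```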

 

The requirements: grouped by type of customer, you need to find out how many of each type there are in each country, with the country's name as listed in countries.dat in the final result (and not the two-character country code).

To do this you need to:

1) Join the data sets
2) Key on country
3) Count type of customer per country
4) Output the results

So first let's code up a quick mapper called smplMapper.py (you can decide if smpl is short for simple or sample).

Now in coding the mapper and reducer in Python the basics are explained nicely here, but I am going to dive a bit deeper to tackle our example with some more tactics.
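The original mapper listing was lost in extraction; below is a minimal Python 3 sketch of what the surrounding description implies it did: emit the country code as the key, with the country name (for countries.dat rows) or the customer type (for customers.dat rows) as the value. Because capitalized country names sort before lowercase customer types in byte order, the country name arrives first within each key.

```python
#!/usr/bin/env python3
# smplMapper.py -- a reconstruction sketch, not the author's original.
# Emits "<country-code>\t<value>", where the value is the country name
# for countries.dat rows and the customer type for customers.dat rows.
import sys

def map_line(line):
    fields = line.strip().split("|")
    if len(fields) == 2:                 # countries.dat: name|key
        name, key = fields
        return "%s\t%s" % (key, name)
    if len(fields) == 3:                 # customers.dat: name|type|country
        name, ctype, country = fields
        return "%s\t%s" % (country, ctype)
    return None                          # skip malformed lines

if __name__ == "__main__":
    for line in sys.stdin:
        mapped = map_line(line)
        if mapped is not None:
            print(mapped)
```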

 

 

Don’t forget:
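The snippet that belonged here was lost, but the classic streaming "don't forget" step is making the script executable, since Hadoop streaming fails at job time when it is not; a sketch:

```shell
# touch only makes this snippet runnable standalone; skip it if the file exists.
touch smplMapper.py
# Mark the mapper executable so both the shell and Hadoop streaming can run it.
chmod a+x smplMapper.py
```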

 

 

Great! We just took care of #1, but it's time to test and see what is going to the reducer.

From the command line run:
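The command itself was lost; a standard way to test a streaming mapper locally is to pipe the data through it and sort, mimicking the shuffle. LC_ALL=C forces the byte-order sort Hadoop's shuffle uses (in other locales, sort may be case-insensitive and break the country-name-first trick). This assumes smplMapper.py and the two .dat files are in the current directory:

```shell
cat countries.dat customers.dat | ./smplMapper.py | LC_ALL=C sort
```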

 

 

Which will result in:
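The captured output was lost; with hypothetical sample rows, the sorted mapper output would look something like this (note that within each country code the capitalized country name sorts, in byte order, before the lowercase customer types):

```
CA	Canada
CA	not so good
CA	valued
CA	valued
IT	Italy
UK	United Kingdom
UK	not so good
US	United States
US	not bad
US	not bad
```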


 

 

Notice how this is sorted: the country name comes first, with the people in that country after it (so we can grab the correct country name as we loop), and the customer types are also sorted within each country, so we can properly count the types per country. =8^)

Let us hold off on #2 for a moment (just hang in there, it will all come together soon, I promise) and get smplReducer.py working first.
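The reducer listing was also lost; a Python 3 sketch of the described logic follows. It relies on the shuffle contract: within each country key the country name arrives first, followed by sorted customer types, so the reducer can remember the name and count runs of equal types.

```python
#!/usr/bin/env python3
# smplReducer.py -- a reconstruction sketch, not the author's original.
import sys

def reduce_stream(lines):
    """Yield (country_name, customer_type, count) triples from sorted
    "key\tvalue" lines, where the first value per key is the country name."""
    cur_key = None        # current country code, e.g. "CA"
    cur_name = None       # resolved country name, e.g. "Canada"
    cur_type = None       # customer type currently being counted
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, value = line.split("\t", 1)
        if key != cur_key:                   # a new country starts
            if cur_type is not None:
                yield (cur_name, cur_type, count)
            cur_key, cur_name, cur_type, count = key, value, None, 0
            continue                         # first value is the country name
        if value != cur_type:                # a new customer type within country
            if cur_type is not None:
                yield (cur_name, cur_type, count)
            cur_type, count = value, 1
        else:
            count += 1
    if cur_type is not None:                 # flush the final run
        yield (cur_name, cur_type, count)

if __name__ == "__main__":
    for name, ctype, n in reduce_stream(sys.stdin):
        print("%s\t%s\t%d" % (name, ctype, n))
```

A country with no customers (its key carries only the country name) correctly produces no output.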

 

 

Don’t forget to make smplReducer.py executable as well (chmod a+x).

 

 

And then run:
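The original command was lost; the conventional local dry run pipes both files through the mapper, a byte-order sort standing in for the shuffle, and then the reducer:

```
cat countries.dat customers.dat | ./smplMapper.py | LC_ALL=C sort | ./smplReducer.py
```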

 

 

And voila!
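The captured result was lost; with hypothetical sample data the final counts might look like this (a country with no customers simply produces no lines):

```
Canada	not so good	1
Canada	valued	2
United Kingdom	not so good	1
United States	not bad	2
```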

 

 

So now #3 and #4 are done, but what about #2?

First put the files into Hadoop:
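The exact commands were lost; loading the two files into HDFS is typically done along these lines (the /data path is illustrative):

```shell
hadoop fs -mkdir /data
hadoop fs -put countries.dat /data/countries.dat
hadoop fs -put customers.dat /data/customers.dat
```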

 

 

And now run it like this (assuming you are running as the hadoop user from the bin directory):
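The original invocation was lost; based on the partitioner discussion below, it would have looked broadly like this (the jar path varies by distribution, and the exact paths here are assumptions):

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D stream.num.map.output.key.fields=2 \
  -D num.key.fields.for.partition=1 \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -file smplMapper.py  -mapper smplMapper.py \
  -file smplReducer.py -reducer smplReducer.py \
  -input /data/countries.dat -input /data/customers.dat \
  -output /data/output
```

The two -D options make the whole two-field map output the sort key while partitioning on field 1 only (the country code), which is exactly the trick that #2 relies on.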

 

 

Let us look at what we did:

 

 

Which results in:
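The job output itself was lost; reading it back from HDFS would, with the same hypothetical data, match the local run:

```
$ hadoop fs -cat /data/output/part-*
Canada	not so good	1
Canada	valued	2
United Kingdom	not so good	1
United States	not bad	2
```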

 

 

So #2 is the partitioner, KeyFieldBasedPartitioner (explained further here), which allows the key to be whatever set of columns you output (in our case the country), configurable by the command line options; the rest of the values are sorted within that key and sent to the reducer together, grouped by key.

And there you go … Simple Python Scripting Implementing Streaming in Hadoop.  

Grab the tar and give it a spin.

 

 


Reposted from: http://zpbfa.baihongyu.com/
