Implementing MapReduce in Python on Hadoop

Writing MapReduce code with mrjob

1.1.1 Installing mrjob


About mrjob

  • A Python framework built on top of the MapReduce programming interface (streaming) of Hadoop and EMR.
  • An mrjob program can be tested locally or deployed to run on a Hadoop cluster.

Installing mrjob

pip3 install mrjob

1.1.2 Word count with mrjob

# word_count.py @midi 2021-01-22

from mrjob.job import MRJob

class WordCount(MRJob):
    """Subclass MRJob and override the mapper and reducer methods."""

    # Each input line arrives as `line`; the input key (`_`) is unused
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # Pairs with the same key are routed to the same reducer call
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()
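Outside of mrjob, the mapper → shuffle → reducer pipeline that the class above relies on can be imitated in a few lines of plain Python (a minimal sketch for intuition only; in a real job, Hadoop performs the grouping across machines):

```python
from itertools import groupby

def mapper(line):
    # Emit (word, 1) for every word on the line, as WordCount.mapper does
    for word in line.split():
        yield word, 1

def shuffle_and_reduce(pairs):
    # Group pairs by key (the "shuffle"), then sum each group (the reducer)
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

lines = ["hello world", "hello mrjob"]
pairs = [kv for line in lines for kv in mapper(line)]
print(dict(shuffle_and_reduce(pairs)))  # {'hello': 2, 'mrjob': 1, 'world': 1}
```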

1.1.3 Running mrjob

python3 word_count.py input.txt

Ways to run an MRJob

1. The default is the -r inline runner (runs in a single Python process)

python3 word_count.py input.txt > output.txt
python3 word_count.py -r inline input.txt > output.txt

2. The -r local runner (runs locally, simulating some Hadoop features)

python3 word_count.py -r local input.txt > output.txt

3. Running on a Hadoop cluster with -r hadoop

python3 word_count.py -r hadoop hdfs://master/input/ -o hdfs://master/output/

An mrjob example

Find the n most frequent words in the input

input.txt

Almost every child will complain about their parents sometimes.
It is natural, because when people stay together for a long time, they will start to have argument.
But ignore about the unhappy time, our parents love us all the time.
No matter what happen to us, they will stand by our sides. We should be grateful to them and try to understand them.
# .py @midi 2021-01-22
from mrjob.job import MRJob
from mrjob.step import MRStep
import heapq

class TopNWords(MRJob):

    def mapper(self, _, line):
        if line.strip() != "":
            for word in line.strip().split():
                # Strip punctuation from both ends of each word
                word = word.strip(',.?!')
                yield word, 1

    # sum(counts) comes first in the tuple so that heapq below sorts by count
    def reducer_sum(self, word, counts):
        yield None, (sum(counts), word)

    # Use heapq to pick out the 5 largest (count, word) pairs
    def top_n_reducer(self, _, word_cnts):
        for cnt, word in heapq.nlargest(5, word_cnts):
            yield word, cnt

    # Override steps() to wire up the custom mapper and reducer methods
    def steps(self):
        # The two MRSteps run in the order listed
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer_sum),
            MRStep(reducer=self.top_n_reducer)
        ]


def main():
    TopNWords.run()


if __name__ == '__main__':
    main()
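Why does reducer_sum put sum(counts) first in the tuple? heapq.nlargest compares tuples element by element, so with the count first it ranks by frequency and only falls back to comparing the word on ties. A small sketch with made-up counts (not the actual output for input.txt):

```python
import heapq

# Hypothetical (count, word) pairs, as the second MRStep would receive them
word_cnts = [(4, 'to'), (3, 'the'), (2, 'will'), (2, 'our'), (2, 'they'),
             (1, 'child'), (1, 'complain')]

# Tuples compare count-first, so this picks the 5 most frequent words
top5 = heapq.nlargest(5, word_cnts)
print(top5)  # [(4, 'to'), (3, 'the'), (2, 'will'), (2, 'they'), (2, 'our')]
```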

Running locally (screenshot omitted)

Running on Hadoop (screenshot omitted)

Final result (screenshot omitted)

Tips:
Of course, you don't have to use the mrjob framework; writing mapper.py and reducer.py by hand also works. The command below runs them on the cluster via Hadoop Streaming:

hadoop jar /home/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.6.jar \
    -file ./mapper.py -mapper ./mapper.py \
    -file ./reducer.py -reducer ./reducer.py \
    -input /user/root/word -output /output/word
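For reference, a hand-written pair might look roughly like this (a sketch under the usual Streaming contract: the mapper reads raw lines from stdin, the reducer reads key-sorted tab-separated lines; these are illustrative, not the author's actual files):

```python
from itertools import groupby

# mapper.py would contain just this logic, looping over sys.stdin:
def mapper(lines):
    # Emit "word<TAB>1" for every word, like WordCount.mapper
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# reducer.py: Hadoop Streaming sorts mapper output by key before the
# reduce phase, so lines with the same word arrive adjacent to each other
def reducer(lines):
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# In the real scripts, each file would end with something like:
#   import sys
#   for out in mapper(sys.stdin):  # or reducer(sys.stdin)
#       print(out)
```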