Dumbo: Hadoop Streaming made elegant and easy
Dumbo is a Python module that lets you easily write and run Hadoop Streaming programs. It is named after Disney's flying circus elephant: the Hadoop logo is an elephant, and Python was named after the BBC series "Monty Python's Flying Circus".
Example programs
wordcount.py
    def mapper(key,value):
        for word in value.split(): yield word,1

    def reducer(key,values):
        yield key,sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper,reducer)
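To see what the mapper and reducer above compute without a Hadoop cluster, the map, sort, and reduce phases can be imitated in memory. The sketch below is not part of Dumbo and does not show how Dumbo works internally: `run_locally` is a hypothetical helper, and the integer keys in the sample input are placeholders, since the word count mapper ignores its key.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # Sum the counts collected for a single word.
    yield key, sum(values)


def run_locally(records, mapper, reducer):
    """Hypothetical helper: map, sort by key, then reduce, all in memory."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Sorting by key stands in for the shuffle/sort step between map and reduce.
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        # Hand the reducer one key plus an iterator over that key's values.
        results.extend(reducer(key, (value for _, value in group)))
    return results


if __name__ == "__main__":
    # Placeholder keys; only the line text matters for word counting.
    lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
    for word, count in run_locally(lines, mapper, reducer):
        print(word, count)
```

The point of the sketch is the contract: the reducer is called once per distinct key and receives an iterator over every value the mappers emitted for that key.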
greplogs.py
    def mapper(key,value):
        if value.find("playground.last.fm") >= 0: yield key,value

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper)
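Because greplogs.py passes no reducer, it acts as a plain filter over the input records. A minimal way to check such a mapper locally, assuming each record is a (key, line) pair, is sketched below. `run_map_only` and the sample log lines are hypothetical, and the mapper uses Python's `in` operator, which is equivalent to the `find(...) >= 0` test above.

```python
def mapper(key, value):
    # Keep only the lines that mention the host of interest.
    if "playground.last.fm" in value:
        yield key, value


def run_map_only(records, mapper):
    """Hypothetical helper: apply the mapper to each record, with no reduce phase."""
    return [pair for key, value in records for pair in mapper(key, value)]


if __name__ == "__main__":
    # Made-up log lines for illustration only; the keys are placeholders.
    logs = [
        (0, "GET http://playground.last.fm/ 200"),
        (1, "GET http://www.last.fm/ 200"),
    ]
    for key, line in run_map_only(logs, mapper):
        print(key, line)
```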
There are more example programs available here.
Documentation
Please refer to this wiki for documentation.
Feedback
If you have any problems, questions, feature requests etc., feel free to e-mail klaas@last.fm.