Dumbo: Hadoop Streaming made elegant and easy

Dumbo is a Python module that lets you easily write and run Hadoop Streaming programs. It is named after Disney's flying circus elephant, since Hadoop's logo is an elephant and Python was named after the BBC series "Monty Python's Flying Circus".

Example programs

wordcount.py

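def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the input value.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts emitted for each word.
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)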
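
To see what the mapper and reducer above compute without a Hadoop cluster, the following is a minimal local sketch of the map, sort, and reduce steps in plain Python. The helper run_local and the sample input are hypothetical and not part of Dumbo. The sketch assumes that each input record is a (key, value) pair whose value is one line of text, and it uses the line index as the key for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    for word in value.split():
        yield word, 1


def reducer(key, values):
    yield key, sum(values)


def run_local(records, mapper, reducer):
    """Hypothetical helper (not part of Dumbo): map, sort by key, then reduce."""
    # Map phase: apply the mapper to every (key, value) record.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Shuffle phase: sorting brings identical keys next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per distinct key.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reducer(key, values))
    return results


# Invented sample input; the key is assumed to be the line index.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
records = list(enumerate(lines))
counts = run_local(records, mapper, reducer)
print(counts)
```

The sketch only mirrors the data flow. It makes no claim about how Dumbo or Hadoop schedule, partition, or serialize the data.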

greplogs.py

def mapper(key, value):
    # Pass through only the records that mention playground.last.fm.
    if "playground.last.fm" in value:
        yield key, value

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper)
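
Since greplogs.py has no reducer, its output is simply whatever the mapper yields. As a rough illustration, the same filter can be applied to a few records in plain Python. The log lines below are invented sample data, and using the line index as the key is an assumption made for this sketch.

```python
def mapper(key, value):
    # Same substring test as the greplogs.py mapper.
    if "playground.last.fm" in value:
        yield key, value


# Invented sample data; keys are assumed to be line indices.
log_lines = [
    "GET http://playground.last.fm/ 200",
    "GET http://www.last.fm/ 200",
    "GET http://playground.last.fm/demo 404",
]
matches = [
    pair
    for key, value in enumerate(log_lines)
    for pair in mapper(key, value)
]
print(matches)
```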

There are more example programs available here.

Documentation

Please refer to this wiki for documentation.

Feedback

If you have any problems, questions, or feature requests, feel free to e-mail klaas@last.fm.