Dumbo: Hadoop Streaming made elegant and easy

Dumbo is a Python module that lets you easily write and run Hadoop Streaming programs. It is named after Disney's flying circus elephant, since Hadoop's logo is an elephant and Python was named after the BBC series "Monty Python's Flying Circus".

Installation

Just run these commands:

wget http://github.com/klbostee/dumbo/tarball/master
tar zxvf klbostee-dumbo*
cd klbostee-dumbo*
sudo python setup.py install

Example programs

wordcount.py

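def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the input line.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts collected for each word.
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)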

greplogs.py

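def mapper(key, value):
    # Emit only the lines that mention the host name.
    # The trailing comma makes the yielded item a one-element tuple.
    if "playground.last.fm" in value:
        yield value,

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper)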

There are more example programs available here.

Running programs

Local run:

python program.py map < input.txt | LC_ALL=C sort | python program.py red > output.txt

Distributed run on Hadoop:

python -m dumbo program.py <options>

with <options> at least including:

Have a look at this page for more possible options.
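
To get a feel for what the local map | sort | red pipeline does, the sketch below imitates it in a single Python process, without Dumbo or Hadoop. It reuses the wordcount.py mapper and reducer; run_locally is a hypothetical helper written for this illustration, not part of Dumbo, and passing the line index as the mapper key is an assumption made here for simplicity.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # Same logic as wordcount.py: emit (word, 1) for every word in the line.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Same logic as wordcount.py: sum the counts for one word.
    yield key, sum(values)

def run_locally(lines, mapper, reducer):
    """Hypothetical helper (not part of Dumbo): map, sort, then reduce."""
    # Map step: feed each line to the mapper (key = line index, an assumption).
    mapped = []
    for index, line in enumerate(lines):
        mapped.extend(mapper(index, line))
    # Sort step: bring identical keys next to each other, like `sort` does.
    mapped.sort(key=itemgetter(0))
    # Reduce step: call the reducer once per key with that key's values.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results

counts = run_locally(["a b a", "b c"], mapper, reducer)
print(counts)
```

The sort in the middle is what guarantees that the reducer sees all values for a key together, which is why the shell pipeline places sort between the map and red stages.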

Documentation

Please refer to this wiki for documentation.

Feedback

If you have any problems, questions, or feature requests, feel free to e-mail klaas@last.fm.