Reproducible Experiment to Compare Apache Spark and Apache Flink batch processing

Jan 31, 2016

How do you reproduce distributed experiments? Have you ever experienced the pain of installing and configuring all the software needed to run one?


If you are struggling with that problem, Karamel comes to the rescue.

Karamel is an orchestration engine that helps you design and run distributed experiments on cloud platforms such as Amazon EC2, Google Compute Engine, and OpenStack. It provides a convenient GUI for designing reproducible experiments, as well as a Domain Specific Language (DSL) for declaring the dependencies and software tools required to set up and run them. This is a first step toward making painful distributed experiments as easy as running them locally with a single click. Imagine how convenient it would be for a researcher to tune a few knobs of an experiment and rerun it at a completely different scale.

Watch these two clips and experience it yourself.

We recently made Dongwong’s Apache Flink vs Apache Spark experiment reproducible on Amazon EC2. To reproduce the same experiment with different configurations, follow the steps mentioned here.

The following presentation covers the results we obtained.

Apache Flink vs Apache Spark - Reproducible experiments on cloud. from Shelan Perera

System-level performance results of the experiments:

200GB workload

400GB workload

600GB workload

You can find the full report of the project here.


Jim Dowling

Kamal Hakimzadeh

Ashansa Perera
