Software Components

Free and open source software components of the BigFoot stack.

An important goal of BigFoot is to contribute to the Free / Open Source Software community by releasing patches and updates to existing projects, or by developing new projects. To achieve this goal, we use the following methodology: our experimental software is released publicly on our BitBucket repository. Stable versions are released on GitHub and, when relevant, proposed for inclusion in the upstream repositories of existing open source projects.

Listed below are the software components developed in the different work packages of the BigFoot project, which correspond to the different layers of the BigFoot software stack.

Note that these components are assembled in BigFoot virtual machine images, which come in several variants depending on your use cases. Visit the download page to learn more about how to use BigFoot in your organization!

BigFoot Applications: algorithms and benchmarking tools

These are the software libraries that address the application domains of the BigFoot project, namely ICT Security and Smart Grids.

Major releases

These are the two main "products" sitting on top of the BigFoot stack.

Minor releases

These are algorithms to complement the "big data" analytics that are typically done in the two application domains of the project.

  • TreeLib: parallel decision trees and random forests for Spark. Available on GitHub.
  • Hadoop implementation of KNN graph building algorithms. Available on GitHub.
  • K-means clustering on Spark. Available on GitHub.
  • Datascience: course material for Hadoop MapReduce and Pig laboratories. Available on GitHub.

Benchmarking Tools

These are the tools we used to benchmark the software components of the BigFoot software stack.

  • OSMeF: OpenStack Measurement Framework. Available on GitHub.
  • Hadoop Log Tools: A set of tools to analyse Hadoop logs. Available on GitHub.
  • SWIM: Statistical Workload Injector for MapReduce (originally from UC Berkeley). Available on GitHub.
  • Custom SSB-QGEN: Query generator for the data analytics Star Schema Benchmark. Available on BitBucket.

BigFoot Engines: Batch and Interactive Data Processing

These are the software components that are at the heart of the BigFoot software stack. Note that BigFoot is fully compatible and integrated with existing parallel processing engines, such as Apache Hadoop.

Batch engines

The major focus of these components is on optimizing the performance of legacy Apache Hadoop systems. Although these software components are bound to Apache Hadoop, the underlying ideas, which are discussed at length in our academic research, can be adapted to recent, cutting-edge systems such as Apache YARN and Apache Spark.

Depending on the virtual machine images that your application use cases require, the BigFoot software stack can instantiate a variety of parallel processing engines in addition to the ones originally supported by BigFoot. In that case, some of the optimizations presented here will not be activated.

  • HFSP: the Hadoop Fair Sojourn Protocol scheduler. Available on GitHub.
  • Hadoop with Suspension: A patched version of Hadoop, supporting suspend/resume primitives. Available on BitBucket.
  • Work-sharing ROLLUP operator for Apache Pig. Available on GitHub.
  • Schedsim: a simulator for size-based scheduling with job size estimation error. Available on BitBucket.
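
To give a feel for what "size-based scheduling with job size estimation error" means, here is a minimal sketch in Python. It is not taken from Schedsim or HFSP: it assumes a single server, all jobs present at time 0, no preemption, and a log-normal multiplicative error on the size estimates. The function names and the error model are illustrative assumptions only.

```python
import random


def mean_sojourn_time(true_sizes, order):
    """Mean completion time when jobs, all present at time 0,
    run one at a time on a single server in the given order."""
    clock = 0.0
    total = 0.0
    for job in order:
        clock += true_sizes[job]   # the job finishes after its true size
        total += clock
    return total / len(true_sizes)


def schedule_by_estimate(estimates):
    """Return job indices sorted by estimated size, smallest first."""
    return sorted(range(len(estimates)), key=lambda j: estimates[j])


def noisy_estimates(true_sizes, sigma, rng):
    """Hypothetical error model: each estimate is the true size
    multiplied by a log-normal factor with parameter sigma."""
    return [s * rng.lognormvariate(0.0, sigma) for s in true_sizes]


if __name__ == "__main__":
    rng = random.Random(42)
    sizes = [rng.expovariate(1.0) for _ in range(1000)]

    fifo = mean_sojourn_time(sizes, list(range(len(sizes))))
    exact = mean_sojourn_time(sizes, schedule_by_estimate(sizes))
    print(f"FIFO order:           {fifo:.1f}")
    print(f"Exact sizes:          {exact:.1f}")

    # Larger sigma means worse estimates and a schedule further from optimal.
    for sigma in (0.25, 1.0, 2.0):
        est = noisy_estimates(sizes, sigma, rng)
        noisy = mean_sojourn_time(sizes, schedule_by_estimate(est))
        print(f"Estimates, sigma={sigma}: {noisy:.1f}")
```

In this simplified setting, ordering by exact size minimizes the mean sojourn time, so the figures obtained with noisy estimates show how much of that benefit survives as the estimation error grows.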

Interactive engines

The software packages developed in this context are not open source yet. Please refer to the document section to learn about our scientific achievements, and feel free to contact the project coordinator directly to learn more!

BigFoot Storage: components for large-scale data management

These are the software components that are used to manage data in BigFoot. A major achievement is the DiNoDB interactive data management software. The open source software license for DiNoDB is currently pending. Please refer to the document section to learn about our scientific achievements, and feel free to contact the project coordinator directly to learn more!

Analytics-as-a-Service: Virtualized Big Data Analytics

This software package includes our contribution to the OpenStack community. It features a new type of service, which we call Analytics-as-a-Service. The whole BigFoot software stack depends on it, and it is used to instantiate virtual machine images that include the components necessary for your application use cases.

Major releases

  • OpenStack Sahara: this component aims to provide users with a simple means of provisioning a data-intensive cluster (Hadoop, Spark) by specifying a few parameters, such as software versions, cluster topology and node hardware details. Available on GitHub.
  • Sahara BigFoot image elements: this component aims to automate the construction of custom virtual machine images that include components of the BigFoot software stack. Available on GitHub.
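
As an illustration of the kind of parameters mentioned above (software versions, cluster topology, node hardware details), here is a minimal sketch of a cluster description in plain Python. It does not use the Sahara API: the class names, field names and values are hypothetical and only show what such a specification might contain.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeGroup:
    """Hypothetical description of one group of identical nodes."""
    name: str      # role of the group, e.g. "master" or "worker"
    count: int     # number of virtual machines in the group
    vcpus: int     # hardware details of each node
    ram_mb: int
    disk_gb: int


@dataclass
class ClusterSpec:
    """Hypothetical cluster description: engine, version and topology."""
    engine: str                      # e.g. "hadoop" or "spark"
    engine_version: str              # placeholder version string
    node_groups: List[NodeGroup] = field(default_factory=list)

    def total_nodes(self) -> int:
        return sum(group.count for group in self.node_groups)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none)."""
        errors = []
        if not self.node_groups:
            errors.append("at least one node group is required")
        for group in self.node_groups:
            if group.count < 1:
                errors.append(f"node group '{group.name}' has no nodes")
        return errors


if __name__ == "__main__":
    spec = ClusterSpec(
        engine="spark",
        engine_version="x.y.z",  # placeholder, not a recommended version
        node_groups=[
            NodeGroup("master", count=1, vcpus=4, ram_mb=8192, disk_gb=40),
            NodeGroup("worker", count=4, vcpus=8, ram_mb=16384, disk_gb=80),
        ],
    )
    print(spec.total_nodes(), spec.validate())
```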

Minor releases and tools

  • PyOstack: Python bindings for working with OpenStack. Available on GitHub.
  • OpenStackDNS: a tool designed to integrate virtual machines managed by OpenStack into an existing PowerDNS system. Available on GitHub.
  • LogStash: Python logging handler for Logstash. Available on GitHub.
  • OpenStack docs: How to reproduce the BigFoot infrastructure. Available on BitBucket.
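
To illustrate what a Python logging handler for Logstash can look like, here is a minimal sketch built only on the Python standard library. It is not the code of the component listed above: the class name, the JSON field names, the host and the port are assumptions, and it presumes a Logstash UDP input configured to decode JSON.

```python
import json
import logging
import logging.handlers
from datetime import datetime, timezone


class JsonDatagramHandler(logging.handlers.DatagramHandler):
    """Send each log record as one JSON document over UDP.

    DatagramHandler normally pickles records; overriding makePickle
    lets us ship JSON instead. The field names are assumptions.
    """

    def makePickle(self, record):
        payload = {
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "message": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
        }
        return json.dumps(payload).encode("utf-8")


if __name__ == "__main__":
    logger = logging.getLogger("bigfoot.example")
    logger.setLevel(logging.INFO)
    # Hypothetical endpoint: a Logstash UDP input listening on port 5959.
    logger.addHandler(JsonDatagramHandler("localhost", 5959))
    logger.info("cluster %s is ready", "demo")
```

UDP keeps the handler non-blocking for the application, at the cost of possibly losing messages; a TCP-based handler would be the alternative when delivery matters more than latency.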