Overview of the BigFoot architecture.
This document describes the BigFoot system architecture, both from a physical perspective and from a functional perspective.
The physical perspective serves as the basis for a full-fledged deployment of the BigFoot software stack: it describes the hardware and software components required to run a private instance of the stack. Note, however, that this document is not intended as a step-by-step installation guide for the various software components. The functional perspective describes, at a high level, the functions executed by the BigFoot stack, including the deployment and orchestration of virtual machine instances, together with Analytics-as-a-Service to store and process data.
Deploying the BigFoot stack assumes the availability of a physical cluster of server-grade machines interconnected by a local area network and, optionally, an array of storage devices to host “cold” data. This assumption can be revised for multi-site deployments, but this document focuses on a single-site infrastructure. Each machine requires the Linux operating system, an off-the-shelf hypervisor, and a set of services installed as part of the “cloud operating system”. The figure below illustrates the baseline BigFoot architecture.
BigFoot fully embraces open-source software and, at the lowest layer, is based on (and contributes to) OpenStack. Starting from the “Icehouse” release of OpenStack, the BigFoot components for the automatic and self-tuned deployment of analytics services are bundled with it, and require no manual intervention beyond the default OpenStack installation procedure. The bare-bone BigFoot architecture requires the following OpenStack services: Keystone (which handles user authentication), Cinder (which handles block storage for virtual machines), Nova (which handles virtual machine management and allocation), Neutron (which handles virtual networks), and Sahara (which offers Analytics-as-a-Service, and constitutes one of the BigFoot contributions, in the context of WP5). Additional services that are compatible with the BigFoot stack include Swift (which handles object storage) and monitoring services.
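As an illustration of the service requirements listed above, the following minimal sketch checks whether a given installation provides every service the bare-bone architecture needs. The service names come from the text; the function name and the idea of passing in a plain list of installed service names are assumptions made for this example, and the code does not query a real OpenStack deployment.

```python
# Services required by the bare-bone BigFoot architecture, as listed in the text.
REQUIRED_SERVICES = frozenset({"keystone", "cinder", "nova", "neutron", "sahara"})

# Services the text names as compatible but not mandatory.
OPTIONAL_SERVICES = frozenset({"swift"})


def missing_services(installed):
    """Return the required services absent from `installed`, sorted by name.

    `installed` is any iterable of service names (hypothetical input; in
    practice it would come from the cloud's own service listing).
    """
    installed_lower = {name.lower() for name in installed}
    return sorted(REQUIRED_SERVICES - installed_lower)


if __name__ == "__main__":
    # Example: an installation that lacks Cinder and Sahara.
    print(missing_services(["Keystone", "Nova", "Neutron", "Swift"]))
```

An empty result means the installation covers the minimum set; optional services such as Swift are ignored by the check.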
Once OpenStack is properly installed and configured, the BigFoot software stack is essentially ready to operate. All other layers of the BigFoot stack are bundled into virtual machine (VM) images that can be downloaded or created on demand (through a command-line utility that is part of the Sahara component). VM images instantiate storage and analytics engines (which are built in WP3 and WP4); these engines can be composed according to application-level requirements. For example, VM images can be dedicated to stock Apache Hadoop or Apache Spark clusters, or to BigFoot-enabled variants of such clusters, which improve on the state of the art by offering better performance or by using fewer physical resources. VM images can also instantiate individual BigFoot layers: for example, it is possible to deploy NoDB or DiNoDB alone, thus offering in-situ interactive processing of raw data. Finally, VM images can be pre-configured to expose BigFoot-enabled high-level languages and scalable machine-learning libraries (developed in WP3 and WP2 respectively).
BigFoot services, labeled “Big Data Applications” in the figure above, can be deployed either from the command line or from Horizon, the web interface that is part of OpenStack.
Once user credentials are provided, each “tenant” gains access to an isolated virtual private cloud: within this logical space, users can deploy analytics services and any additional software their application requires. For example, users can deploy an Apache Spark cluster bundled with the TreeLib scalable machine-learning library (developed in WP2), an Apache HDFS storage layer using the Sahara component (developed in WP5, and integrated by default in OpenStack), and additionally instantiate a number of virtual machines dedicated to data ingestion using Apache Flume. Although data ingestion is not explicitly covered by the BigFoot project, the flexible BigFoot architecture accommodates such external services without any additional requirements.
In practice, to deploy a virtual private data storage and processing cluster, end users follow a one-click procedure that deploys virtual storage and analysis engines on demand, in one of two modes of operation: static or ephemeral. In static mode, the BigFoot Analytics-as-a-Service component deploys a permanent compute cluster with a dedicated storage layer. In ephemeral mode, users specify where the data to be processed resides and upload an archive containing the data analytics algorithm. A virtual cluster is then instantiated, and it is terminated upon completion of the analytics task.
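The difference between the two modes can be sketched as a small data model. This is only an illustration under stated assumptions: the class, field, and function names are hypothetical and do not correspond to the Sahara API, and the example data location and archive name are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    STATIC = "static"        # permanent cluster with a dedicated storage layer
    EPHEMERAL = "ephemeral"  # cluster lives only for the duration of one task


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical description of what a user submits in each mode."""
    mode: Mode
    data_location: Optional[str] = None  # where the data to process resides
    job_archive: Optional[str] = None    # archive with the analytics algorithm


def validate(request: ClusterRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is complete."""
    problems = []
    if request.mode is Mode.EPHEMERAL:
        if not request.data_location:
            problems.append("ephemeral mode needs the location of the input data")
        if not request.job_archive:
            problems.append("ephemeral mode needs an archive with the algorithm")
    return problems


def terminates_on_completion(request: ClusterRequest) -> bool:
    """Only ephemeral clusters are torn down when the analytics task finishes."""
    return request.mode is Mode.EPHEMERAL


if __name__ == "__main__":
    # Placeholder values, not real paths.
    job = ClusterRequest(Mode.EPHEMERAL, "example-data-location", "job-archive.tar.gz")
    print(validate(job), terminates_on_completion(job))
```

A static request needs no further input in this model, whereas an ephemeral request is incomplete until both the data location and the algorithm archive are supplied.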