Building a non-distributed big data computing solution on Amazon Web Services

A prototype solution for applications that have been architected to run on a single computing node

Big data is best processed with distributed computing frameworks (i.e. utilising a cluster of computing nodes) using tools such as Hadoop MapReduce and MPI. However, many applications have traditionally been architected to run on a single computing node only, and therefore cannot leverage the scalability of cloud computing to process data in a distributed fashion.

Nevertheless, such applications can, to an extent, still leverage the cloud to process and manage big data in a cost-effective manner. This article provides a prototype solution for such applications, built on the Amazon Web Services (AWS) cloud computing platform. The solution can potentially be replicated on other public and private cloud computing platforms as well.



The solution leverages Amazon's public cloud platform to store, manage, and process data cost-effectively by encapsulating datasets as containers and storing them in a repository built on inexpensive, highly available object storage (AWS S3). The repository is inherently integrated with inexpensive, highly durable archival storage (AWS Glacier).

The datasets can be processed by creating an on-demand processing infrastructure composed of storage-optimised virtual machines (AWS EC2), on-demand high-performance block-storage (AWS EBS volumes), and ephemeral high-performance local block-storage (Instance Store).

The dataset containers can be retrieved on demand from the repository, stored on the high-performance block-storage, processed, and eventually stored back in the repository. The on-demand processing infrastructure also has access to other on-demand storage options, such as an SQL RDBMS (AWS RDS), a NoSQL database (AWS DynamoDB), and object storage (AWS S3).

Common on-premises data challenges:

  • Data Protection
    • Expensive.
    • Poorly implemented.
  • Data Sharing & Dissemination
    • Ad-hoc solutions.
    • High duplication.
  • Data Access & Analytics
    • Limited access interfaces due to being tightly coupled with the storage and compute resources.
    • No central access logs, and hence no visibility into what data is being accessed, by whom, from where, or how often.
  • Data Security
    • No access audit.
    • Complicated and inefficient access control mechanisms.
  • Data Recovery
    • Long Recovery Time Objective (RTO).
    • Long Recovery Point Objective (RPO).
    • Unreliable.
  • Data Types
    • Small structured/semi-structured data.
    • Large unstructured data.
  • Data Lifecycle Control
    • Absent or poorly implemented.
    • Orphaned datasets.
  • Data Governance
    • Policies are hard to enforce due to fragmented and decentralised storage solutions.
  • Metadata Management
    • Absent or not authoritative.
  • Data Centre Infrastructure
    • Expensive equipment.
    • High operational costs.
    • Slow provisioning.
    • Under- or over-utilised.
    • One-size-fits-all approach.
    • Unreliable.
    • Inconsistent performance.
    • Lack of trained people.
    • No scalability.
    • No colocation between the compute infrastructure and the storage infrastructure.
    • No Hierarchical Storage Management (HSM).


Assumptions

  • The processing application is not scalable and does not support distributed computing.
  • There is access to a local Amazon Web Services (AWS) Region.
  • No regulations or policies prevent storing data off-premises or out of state.
  • Dataset sizes are 1-10 TB.
  • Datasets do not need to be processed constantly.
  • Datasets are not expected to be processed immediately when needed.
  • AWS services can be managed via the AWS SDK, the AWS CLI, or the AWS Management Console.
  • AWS Identity and Access Management (IAM) is configured.
  • AWS Virtual Private Cloud (VPC) is configured.
  • AWS service functionality and limits are as described at the time of writing (November 2014).


Constraints

  • Maximum size of an AWS EBS volume is 1 TB.
  • Maximum throughput of a single AWS EBS volume is 128 MB/s.
  • Maximum aggregate throughput from the Compute Machine to the Processing Storage ranges from 62 MB/s up to 800 MB/s, depending on the Compute Machine type and the number of striped AWS EBS volumes.
  • The throughput between the Compute Machine and the Datasets Repository depends on the Compute Machine type; with optimal types it can reach 200-300 MB/s for download and upload.
  • Maximum object size on AWS S3 is 5 TB. Hence, a Dataset Container larger than 5 TB must be split into multiple files before it is stored in the Datasets Repository.
  • The AWS Import/Export (disk shipping) service is not available in all Amazon AWS Regions.
  • The AWS Glacier service is not available in all Amazon AWS Regions.

Amazon AWS Cloud Solution

Dataset On-premises Preparation

Encapsulate datasets as containers (e.g. virtual filesystem, tarball). Ideally, compress them, and optionally encrypt them (I'll call these “Datasets Containers”).
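
A minimal sketch of this step, assuming Python's standard library is available on the staging host; the directory and output paths below are placeholders, and encryption (e.g. with GPG) could be applied to the resulting file as an extra step:

```python
# Package a dataset directory as a compressed tarball -- a "Dataset Container".
# Paths are illustrative placeholders only.
import tarfile

def build_container(dataset_dir: str, container_path: str) -> None:
    """Encapsulate dataset_dir as a gzip-compressed tarball."""
    with tarfile.open(container_path, "w:gz") as tar:
        tar.add(dataset_dir, arcname=".")

if __name__ == "__main__":
    build_container("/data/project-x/dataset-001", "/staging/dataset-001.tar.gz")
```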

Datasets Repository

Create an AWS S3 bucket in a local AWS Region to act as a repository (I'll call this the “Datasets Repository”).

  1. Set write-and-read-permissions (upload/list) for the Datasets Repository.
  2. Enable access logging. 
  3. Optionally enable versioning.
  4. Define a lifecycle policy for the Datasets Containers (a provisioning sketch follows this list), choosing one of:
    • Auto-archive into the Archival Storage (AWS Glacier); or
    • Auto-archive only previous versions into the Archival Storage (if versioning is enabled); or
    • Auto-archive into the Archival Storage, then delete from the Datasets Repository after a number of days; or
    • Auto-delete from the Datasets Repository after a number of days.
  5. Define tags for the Datasets Repository (e.g. Cost Allocation Centre, Department, Project).
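
The sketch below shows how the repository bucket, access logging, lifecycle rule, and tags might be provisioned, assuming the Python AWS SDK (boto3). The bucket names, region, tag values, and the 90-day archival window are placeholder assumptions, and the logging target bucket is assumed to already exist with log-delivery permissions.

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")

# 1. Create the Datasets Repository bucket in a local AWS Region.
s3.create_bucket(
    Bucket="datasets-repository",
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-2"},
)

# 2. Enable access logging to a separate, pre-existing log bucket.
s3.put_bucket_logging(
    Bucket="datasets-repository",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "datasets-repository-logs",
            "TargetPrefix": "access/",
        }
    },
)

# 4. Lifecycle policy: auto-archive containers to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="datasets-repository",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-containers",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)

# 5. Tag the bucket for cost allocation.
s3.put_bucket_tagging(
    Bucket="datasets-repository",
    Tagging={"TagSet": [{"Key": "Project", "Value": "project-x"}]},
)
```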

Datasets Repository Population

  1. Store the Datasets Containers in the Datasets Repository (an upload sketch follows this list) via one of:
    • Direct upload through the Internet
    • AWS Import/Export (disks shipping)
    • AWS Direct Connect (direct connection to a local AWS facility)
  2. Split the Dataset Container into multiple files if it is larger than 5 TB.
  3. Optionally enable Server Side Encryption for Datasets Containers.
  4. Optionally reduce the redundancy-level for unimportant Datasets Containers to reduce storage costs.
  5. Set read-permissions (download) for Datasets Containers.
  6. Set checksum values on the Datasets Containers to ensure data integrity.
  7. Optionally set other metadata values for the Datasets Containers (e.g. data category, dataset custodian, container type).
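
As an illustrative sketch of the direct-upload path (again assuming boto3; the bucket name, object key, and metadata values are placeholder assumptions), a Dataset Container could be uploaded with a checksum, descriptive metadata, and optional server-side encryption:

```python
import hashlib
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")
container = "/staging/dataset-001.tar.gz"

# Compute a checksum locally so integrity can be verified later (step 6).
md5 = hashlib.md5()
with open(container, "rb") as f:
    for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
        md5.update(chunk)

# upload_file performs a managed multipart upload, which suits large containers.
s3.upload_file(
    container,
    "datasets-repository",
    "project-x/dataset-001.tar.gz",
    ExtraArgs={
        "Metadata": {                           # step 7: optional metadata
            "checksum-md5": md5.hexdigest(),    # step 6: integrity check value
            "data-category": "genomics",
            "dataset-custodian": "team-a",
            "container-type": "tar.gz",
        },
        "ServerSideEncryption": "AES256",       # step 3: optional encryption at rest
    },
)
```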

Compute Images Repository

Optionally create application-specific Compute Images by customising the standard AWS AMI images or those available in the AWS Marketplace. The images are stored on inexpensive object-storage (AWS S3).
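
A minimal sketch of baking such an image, assuming boto3 and an already-customised, running EC2 instance; the instance ID and image name are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Register a reusable Compute Image (AMI) from a customised instance.
response = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="dataset-processing-image-v1",
    Description="Compute Image with the processing application pre-installed",
)
print("Registered AMI:", response["ImageId"])
```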

Metadata Repository

Optionally provision a NoSQL DB (AWS DynamoDB) or an SQL RDBMS (AWS RDS) to act as a Metadata Repository by populating it with the metadata of the Datasets Containers. This repository can be used as a search engine to efficiently locate Datasets Containers. Moreover, it can be leveraged as a tool to enable data governance, and be a rich information source for Datasets analytics.
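
A hedged sketch of the DynamoDB option, assuming boto3; the table name, key schema, throughput values, and item attributes are all placeholder assumptions:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="ap-southeast-2")

# One-off: create a table keyed by the container's S3 object key.
table = dynamodb.create_table(
    TableName="dataset-containers",
    KeySchema=[{"AttributeName": "container_key", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "container_key", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()

# Register a Dataset Container's metadata so it can be searched later.
table.put_item(Item={
    "container_key": "project-x/dataset-001.tar.gz",
    "data_category": "genomics",
    "dataset_custodian": "team-a",
    "container_type": "tar.gz",
    "size_bytes": 2500000000000,
})
```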

On-demand Processing Infrastructure

  1. Provision a storage-optimised instance (AWS EC2) in a local AWS Region on-demand, hereinafter referred to as the “Compute Machine” (a provisioning sketch follows this list).
  2. Allocate multiple high-performance block-storage volumes (AWS EBS) to the Compute Machine on-demand.
  3. Combine the high-performance block-storage volumes into a single volume with data striped across all of them, hereinafter referred to as the “Processing Storage”.
  4. The Processing Storage can be configured as a regular filesystem or as backend storage for an SQL RDBMS.
  5. Retrieve a Dataset Container from the Datasets Repository, decapsulate it, and store it on the Processing Storage.
  6. Process the dataset, and optionally leverage other on-demand AWS storage services such as:
    • NoSQL DB (AWS DynamoDB)
    • Object Storage (AWS S3)
  7. Leverage the ephemeral local block-storage that is associated with the Compute Machine as a Scratch Storage.
  8. Re-capsulate the dataset as a Dataset Container, and store it back in the Datasets Repository when processing is complete.
  9. Terminate the Compute Machine to save computing costs.
  10. Delete the Processing Storage to save storage costs.
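
The sketch below covers steps 1 and 2, assuming boto3; the AMI ID, subnet, instance type, device names, and volume sizes are placeholders. Striping the attached volumes into a single RAID-0 device (e.g. with mdadm) and creating the filesystem happen inside the instance and are not shown.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# 1. Launch a storage-optimised Compute Machine.
instance = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="i2.4xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",
)["Instances"][0]
instance_id = instance["InstanceId"]
az = instance["Placement"]["AvailabilityZone"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# 2. Allocate several EBS volumes in the same Availability Zone and attach them;
#    once attached, they can be striped into the Processing Storage (step 3).
device_names = ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi"]
for device in device_names:
    volume = ec2.create_volume(AvailabilityZone=az, Size=1000, VolumeType="gp2")
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
    ec2.attach_volume(VolumeId=volume["VolumeId"], InstanceId=instance_id, Device=device)

# 9-10. When processing is finished, terminate the instance and delete the volumes
#       (ec2.terminate_instances / ec2.delete_volume) to stop incurring costs.
```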


How the solution addresses the challenges

  • Data Protection
    • Datasets Containers are automatically and synchronously stored across multiple devices and multiple facilities within a selected geographical region.
    • Datasets Containers can be protected from accidental deletion by enabling Multi-Factor Authentication (MFA) Delete.
    • Datasets Containers can be versioned, and all versions are accessible simultaneously.
    • Datasets Containers can be managed through lifecycle policies.
    • Datasets Containers are 99.999999999% durable, and 99.99% available over a given year.
    • The archived Datasets Containers are stored in multiple facilities and on multiple devices within each facility. The archiving service (Amazon AWS Glacier) returns SUCCESS only after synchronously storing data across multiple facilities. It also performs regular, systematic data integrity checks and is built to be automatically self-healing.
    • The Processing Storage is replicated across multiple servers.
    • The Processing Storage can be snapshotted on-demand, and snapshots will be stored on inexpensive object-storage (AWS S3).
  • Data Sharing & Dissemination
    • Datasets Containers can be shared and published directly via the Internet.
  • Data Access & Analytics
    • The Datasets Containers are portable: they can be accessed and processed by Compute Machines running different Compute Images on AWS, and they can be moved back on-premises or to other cloud computing platforms via the Internet, the AWS Import/Export service, or AWS Direct Connect.
    • Access logs can be enabled at the Datasets Repository level.
    • The Datasets Repository access logs, together with the Metadata Repository, provide a rich source of information for dataset analytics and detailed visibility into what data is being accessed, by whom, from where, and how often.
  • Data Security
    • Datasets Repository Access logs can be used for auditing purposes.
    • Different Access Control mechanisms for the Datasets Repository:
      • AWS Identity and Access Management (IAM) policies
      • Access Control Lists (ACLs)
      • Bucket policies
      • Query string authentication
    • Optional encryption at rest for Datasets Containers in the Datasets Repository.
    • Optional encryption at rest for the Processing Storage.
    • Datasets Containers can be encrypted.
    • Archived Datasets are encrypted at rest.
    • Datasets Containers can be downloaded and uploaded via HTTPS (HTTP over SSL)
    • The Cloud API calls can be logged for all cloud services at no additional charge by enabling AWS CloudTrail. 
    • Datasets Repository access logs and Cloud API call logs can be used to enable security analysis, change tracking, and compliance auditing.
    • Amazon AWS data centres utilise state-of-the-art electronic surveillance and multi-factor access control systems. Data centres are staffed 24x7 by trained security guards, and access is authorised strictly on a least-privilege basis.
    • Amazon AWS has achieved ISO 27001 certification and has been validated as a Level 1 service provider under the Payment Card Industry (PCI) Data Security Standard (DSS).
  • Data Recovery
    • Datasets Containers can be archived and restored seamlessly, as the Datasets Repository (AWS S3) is inherently integrated with the Archival Storage (AWS Glacier); a restore sketch follows this list.
    • Datasets Containers are highly available, and highly durable. However, different data protection solutions (i.e. AWS S3 Versioning, AWS EBS Snapshotting) can be implemented at different layers to shorten RPO and RTO. 
  • Data Types
    • Different storage options available on-demand to store and manage different types of data:
      • NoSQL DB (AWS DynamoDB)
      • RDBMS (AWS RDS)
      • Object-Storage (AWS S3)
      • Block-Storage (AWS EBS, AWS EC2 Instance Store)
  • Data Lifecycle Control
    • Datasets Containers can be managed through data lifecycle policies:
      • Auto-archive into the Archival Storage (AWS Glacier); or
      • Auto-archive only previous versions into the Archival Storage (if versioning is enabled); or
      • Auto-archive into the Archival Storage, then delete from the Datasets Repository after a number of days; or
      • Auto-delete from the Datasets Repository after a number of days.
  • Data Governance
    • The Datasets Repository, the Metadata Repository, the Datasets Repository access logs, and the Cloud API call logs can be leveraged as efficient tools to enable effective data governance.
  • Metadata Management
    • The Metadata Repository can be leveraged as an efficient tool to enable effective metadata management.
  • Data Centre Infrastructure
    • Global infrastructure available on-demand.
    • Self-service and fast provisioning.
    • Pay as you go.
    • Scalable and elastic.
    • Flexible and diverse infrastructure services available on-demand.
    • Unified computing.
    • Low TCO.
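
As a hedged illustration of the Data Recovery item above (assuming boto3; the bucket, key, and 7-day restore window are placeholders), an archived Dataset Container could be restored from Glacier back into the Datasets Repository like this:

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")

# Request an asynchronous restore of the archived container; once complete,
# a temporary copy becomes downloadable from S3 for the requested number of days.
s3.restore_object(
    Bucket="datasets-repository",
    Key="project-x/dataset-001.tar.gz",
    RestoreRequest={"Days": 7},
)

# Check the object's restore status until the temporary copy is ready.
head = s3.head_object(Bucket="datasets-repository", Key="project-x/dataset-001.tar.gz")
print(head.get("Restore", "restore not yet requested"))
```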


The author makes no warranties regarding the comprehensiveness, completeness, accuracy or currency of the information provided. Under no circumstances shall the author be liable for material or non-material damages arising from the use or disuse of the provided information, or from the use of faulty or incomplete information, unless intent or gross negligence on the author's part is proven. The author expressly reserves the right to modify, amend or delete parts of the pages or the overall offering without explicit announcement, or to temporarily or permanently cease publication.

More information

For more information about implementing the solution on Amazon AWS or other public/private cloud platforms or to tailor it to meet specific organisational business requirements, please contact Abraham Alawi (aalawi at

Abraham Alawi is a solutions architect and DevOps engineer who has worked across a number of prominent Australian enterprises.
