Architecting your big data process on AWS

We explore the key decision points when setting up AWS environments for big data

Hidden in the vast realms of data flowing into your organization are keys to unlock business success. Big data is full of the kinds of valuable information that could give your business a competitive edge, but harnessing it so it reveals those secrets to you is a massive challenge. The public cloud offers the scale of computing power you need to gather, store and analyze big data effectively. We guide you on structuring your big data architecture on Amazon Web Services (AWS) for optimum results. 

Big data challenges the public cloud solves

Historically, the demands of big data meant only enterprises with the resources to fund almost limitless computing power could afford to harness it. The advent of cloud computing and the availability of on-demand computing resources and services changed all that. Users can engage virtually infinite resources, use them only for as long as they need them and pay solely for the resources and services they use.

As the cloud has evolved, it has given customers increasing autonomy to focus on developing their application code and analytics queries rather than ensuring capacity. In the early cloud years, customers spun up instances on virtual machines and installed applications running their code. Then, cloud providers started offering managed services, taking responsibility for more of the software stack. Now, serverless computing frees up the time developers would otherwise spend provisioning servers and allows them to focus on tasks with greater business value. 

As cloud technology continues to advance, organizations of almost any size that leverage it correctly can access the power of big data technologies.

Key layers in your big data architecture

The sheer volume, variety and velocity of the data you’re dealing with mean you need a robust, flexible architecture capable of collecting, storing and processing that data, often in real or near real time. Businesses need to evolve their technology stack to handle this volume and variety, and to implement infrastructure capable of doing that work at the required speed.

To manage the spectrum of tasks an effective big data program demands, you will need a multilayered architecture to handle data storage, processing and consumption. It must facilitate multidirectional flows, as data may be stored before and after analysis.

Storage layer

This layer is where the data is stored and converted into a format that allows it to be cataloged and analyzed. Compliance regulations and governance policies will determine how certain types of data are stored. However, the way you store the data should not dictate the way you process it and vice versa. 

Data access and governance

Given the enormous volumes of data flowing into your storage layer and the new data assets and versions that data transformation, data processing and analytics will create, you need an effective data governance process to help you track it all. A key component of data governance is the data catalog, which combines metadata with specialized data management and search tools to provide an interface to query your data assets and serve as a single source of truth. The AWS Glue Data Catalog serves as a central metastore for batch processing jobs, regardless of the AWS analytic service used for processing.  
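For illustration, here is a minimal boto3 sketch that lists the tables registered in a Glue Data Catalog database; the database name my_data_lake is a placeholder for your own catalog, and AWS credentials are assumed to be configured in the environment.

    import boto3

    glue = boto3.client("glue")

    # "my_data_lake" is a hypothetical catalog database; replace it with your own.
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName="my_data_lake"):
        for table in page["TableList"]:
            location = table.get("StorageDescriptor", {}).get("Location", "n/a")
            print(f'{table["Name"]}: {location}')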

Data from batch processing operations is usually stored in a data lake, which can take large volumes of files in different formats. With AWS Lake Formation, a service that simplifies and centralizes access management, the AWS Glue Data Catalog provides access control for Amazon S3 data lakes across the most widely used AWS analytics services, including Amazon Redshift (via Amazon Redshift Spectrum), Amazon Athena, AWS Glue ETL and Amazon EMR (for Spark-based notebooks).

Object storage

Object storage such as Amazon S3 is ideal for data lakes because it allows you to store all types of files without the need for predefined schemas or limits on data volumes. It is natively supported by big data frameworks such as Spark, Hive and Presto, and it offers 99.999999999% object durability across multiple Availability Zones. 

You will need to segment your data lake into landing, raw, trusted and curated zones to store data according to its state of consumption readiness. Data in the data lake is usually ingested and stored without any prior schema definition to reduce the time required for ingestion and preparation before data can be examined.
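As a rough sketch (the bucket name and partition keys are hypothetical), a zoned layout might look like this, with the S3 prefix encoding both the zone and the partitioning scheme:

    s3://example-data-lake/landing/sales/ingest_date=2023-06-01/orders.csv
    s3://example-data-lake/raw/sales/year=2023/month=06/day=01/orders.parquet
    s3://example-data-lake/trusted/sales/year=2023/month=06/day=01/orders_cleaned.parquet
    s3://example-data-lake/curated/daily_sales/year=2023/month=06/summary.parquet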

Stream storage

Real-time data streams or events can be stored using a stream storage product such as Amazon Kinesis. With Amazon Kinesis Data Streams, consumers can read directly from the stream for real-time analytics, but customers who want to store the data for future analysis can use Amazon Kinesis Data Firehose to deliver the data to a target (data lake, data warehouses or analytics services) and perform the analysis later.
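As a simple producer-side sketch (the stream name clickstream-events and the event itself are placeholders), an application can push records into a stream with a few lines of boto3; the partition key controls how records are routed to shards:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # Hypothetical event and stream name; Data must be bytes.
    event = {"user_id": "42", "action": "add_to_cart", "ts": "2023-06-01T12:00:00Z"}
    kinesis.put_record(
        StreamName="clickstream-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],
    )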

You can use AWS Glue crawlers to discover new datasets or partitions added from the stream. A crawler can scan multiple data stores in a single run, extracting metadata to populate the AWS Glue Data Catalog with tables. Extract, transform and load (ETL) jobs you define in AWS Glue read from and write to the data stores identified in the source and target Data Catalog tables.
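The snippet below is a hedged example of triggering a crawler run yourself (the crawler name is hypothetical and assumes the crawler and its IAM role already exist); it starts the crawler and polls until it returns to the READY state:

    import time
    import boto3

    glue = boto3.client("glue")
    CRAWLER_NAME = "sales-stream-crawler"  # hypothetical crawler created beforehand

    glue.start_crawler(Name=CRAWLER_NAME)

    # Poll until the crawler finishes and returns to the READY state.
    while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
        time.sleep(30)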

Analysis layer

Depending on the context, you can extract business value from your big data using various types of analytics, including batch, interactive, stream or predictive. 

Batch analytics involves processing data in time intervals from minutes to days for applications such as daily or weekly sales reports. Amazon EMR is a comprehensive cloud big-data solution that you can use to perform batch analytics with a data processing framework like Apache Spark. 
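A minimal PySpark sketch of such a batch job (the S3 paths and column names are illustrative) might aggregate daily sales from the raw zone and write the result back to the curated zone:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-sales-report").getOrCreate()

    # Illustrative paths and columns; adjust to your own data lake layout.
    orders = spark.read.parquet("s3://example-data-lake/raw/sales/")

    daily_sales = (
        orders
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("total_sales"), F.count("*").alias("order_count"))
    )

    daily_sales.write.mode("overwrite").parquet("s3://example-data-lake/curated/daily_sales/")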

Interactive data analytics uses a combination of distributed database systems and rendering capabilities to optimize the analytical potential of business intelligence (BI) technologies. It applies to situations where you want to get answers from the system in seconds, such as self-service dashboards. Again, you can use Amazon EMR, this time with Spark or the SQL query engine Presto. For large, structured datasets, Amazon Redshift works well. Amazon Athena works for unstructured, semi-structured and structured data stored in Amazon S3.
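For example, an ad hoc Athena query can be submitted from Python with boto3 (the database, table and result location below are placeholders):

    import boto3

    athena = boto3.client("athena")

    # Placeholder database, table and S3 output location.
    response = athena.start_query_execution(
        QueryString=(
            "SELECT order_date, SUM(amount) AS total_sales "
            "FROM sales GROUP BY order_date ORDER BY order_date DESC LIMIT 10"
        ),
        QueryExecutionContext={"Database": "my_data_lake"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    print(response["QueryExecutionId"])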

Streaming analytics is used for applications requiring real-time data, such as fraud alerts. You can build a near real-time analytics pipeline using Amazon EMR with Spark Streaming or Amazon Kinesis Data Analytics.
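As one hedged option, a Spark Structured Streaming job running on EMR can process files as they arrive in the landing zone in near real time (the schema and paths are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("near-real-time-sales").getOrCreate()

    # Streaming file sources need an explicit schema; this one is illustrative.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_ts", StringType()),
    ])

    events = spark.readStream.schema(schema).json("s3://example-data-lake/landing/events/")

    # Rolling per-minute revenue, emitted to the console for demonstration.
    per_minute = (
        events
        .withColumn("event_time", F.to_timestamp("event_ts"))
        .groupBy(F.window("event_time", "1 minute"))
        .agg(F.sum("amount").alias("revenue"))
    )

    query = per_minute.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()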

Predictive analytics relies on machine learning to forecast future behavior based on a user’s purchase history, search history, demographics, ratings and other categories. Amazon SageMaker is a good solution for predictive analytics because it offers a central location for performing all of your machine learning tasks, providing fully managed infrastructure, tools and workflows for building, training and deploying your machine learning models.
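As a rough sketch using the SageMaker Python SDK (the IAM role, S3 locations and hyperparameters are placeholders), you could train the built-in XGBoost algorithm on curated data stored in S3:

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/ExampleSageMakerRole"  # placeholder IAM role

    # Built-in XGBoost container image for the current region.
    container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

    estimator = Estimator(
        image_uri=container,
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://example-data-lake/models/",  # placeholder output location
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

    # Placeholder training data in CSV format (label in the first column).
    estimator.fit({"train": TrainingInput("s3://example-data-lake/curated/training/",
                                          content_type="text/csv")})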

Consumption layer

The consumption layer is where your organization works with the data using analytics engines, data queries, AI and machine learning applications and data visualization to extract valuable business information from large volumes of data. Users generally fall into two categories: 

Business users want to make sense of the data using visualization applications such as Tableau or a fully managed BI tool like Amazon QuickSight. They can also use the open-source user interface Kibana to visualize data from Elasticsearch. 

The second category of users is data scientists, who want to access an endpoint for statistical analysis, using a tool like RStudio, for example. They can also use a JDBC driver to connect to Amazon Athena or Amazon Redshift and query the data directly.
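In Python, a similar direct query can be issued without managing a JDBC driver, for example via the Amazon Redshift Data API through boto3 (the cluster identifier, database, user and table are placeholders):

    import time
    import boto3

    redshift_data = boto3.client("redshift-data")

    # Placeholder cluster, database, user and query.
    response = redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="analytics",
        DbUser="analyst",
        Sql="SELECT order_date, total_sales FROM daily_sales ORDER BY order_date DESC LIMIT 10",
    )

    # The Data API is asynchronous, so poll until the statement completes.
    status = ""
    while status not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)
        status = redshift_data.describe_statement(Id=response["Id"])["Status"]

    if status == "FINISHED":
        for row in redshift_data.get_statement_result(Id=response["Id"])["Records"]:
            print(row)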

Big data architecture best practices

Although every use case is different, certain practices are more likely to deliver successful results when architecting your big data process in the public cloud. 

  • Focus on the business value you want to extract from your big data program. Once you have a detailed view of the business objectives your big data initiatives should help you achieve, use that view to inform agile delivery of the technologies you will need to implement them. 
  • Decouple systems to ensure new tools and technologies can be integrated without major disruption. Rather than relying on big, monolithic applications, separate them into smaller systems so that you can iterate on each subsystem independently and evolve the architecture over time.
  • Adopt a holistic view when building your architecture, approaching it as an agile program to accommodate your strategic vision but incorporating templates that will make it scalable.
  • Ensure you have a comprehensive, trusted data governance program in place to keep your data secure. 
  • Use the right tool for the job: Consider data structure, latency requirements, throughput and access patterns. Of these, data structure and access patterns are the most important. 
  • Don’t try to reinvent the wheel: Leverage managed and serverless services to take advantage of the engineering expertise and best practices that have been invested in these technologies. Managed and serverless services are scalable, elastic, available, reliable and secure and require little or no admin. 
  • Be conscious of cost. Big data does not have to mean big costs.

DoiT’s big data architecture process

DoiT has deep expertise and official partner competencies for data and analytics with AWS. We help customers address questions around both architecture and operations so they can meet their goals faster, with less risk and friction.

We start the process by looking at the customer’s business model, the products and services they offer, their team structure, release strategy and operations before homing in on their data needs, resources and goals. These are some of the questions we might ask: 

  • Do you already have a big data solution? 
  • If so, is it on-premises or already in the cloud?
  • What are the main applications and consumers? BI reporting, ML, etc.
  • What are the data sources (producers)? Think of volume, speed and data structure.
  • Describe the data stages from data retrieval and processing to presentation.
  • How is sensitive data handled? Which regulations are you required to follow?
  • How are your teams structured, both business and tech?
  • What methodology do you use for project management?
  • How experienced are your tech team members in AWS?
  • What are your pain points?
  • Which use cases do you want to cover?
  • What are your priorities and expectations? 

The answers to these questions will determine the appropriate approach, which could be one of the following:

  1. A Migration Readiness Assessment (MRA): We use this for customers planning to migrate to AWS. It involves a deep dive based on an extended questionnaire (80 questions) that collects facts and observations from both the customer and the interviewer to define possible next steps. We then create a full report and share it with the customer, evaluating their cloud maturity and what they need to do to implement a successful migration. This helps define the migration paths, timelines, resources, asset inventory/dependencies and technical documentation we will use. The MRA can also be used to request free credits from AWS.
  2. A Well-Architected Review (WAR): This is useful for customers who have already onboarded and need their current status assessed, with a view to identifying and prioritizing actions to correct any drift. The WAR uses an evaluation framework developed by AWS and adopted by the industry, which is based on six pillars: operational excellence, security, reliability, performance efficiency, cost optimization and sustainability. Credits-based funding of up to $5k is also available for remediating production environments.
  3. Training: DoiT’s customer enablement includes customer training on specific AWS services. For example, Immersion Days involve deep dives that deliver not just conceptual knowledge but hands-on experience.
  4. Prototyping (proof of concept): DoiT supports customers when evaluating a solution by defining success criteria based on KPIs and guiding them through the technical implementation, with weekly cadence sessions to remove doubts or obstacles and provide advice on optimizations. When the prototyping is finished, we measure the results against the KPIs to determine fit, lessons learned and next steps.

Next steps

If you are interested in harnessing your data for the immense business value it can deliver, talk to DoiT about architecting your big data process on AWS. 
