Architectural Patterns to Build End-to-End Data Driven ...

27
Architectural Patterns to Build End-to-End Data Driven Applications on AWS AWS Whitepaper

Transcript of Architectural Patterns to Build End-to-End Data Driven ...

Architectural Patterns toBuild End-to-End Data

Driven Applications on AWSAWS Whitepaper

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Architectural Patterns to Build End-to-End Data Driven Applications onAWS: AWS WhitepaperCopyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is notAmazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages ordiscredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who mayor may not be affiliated with, connected to, or sponsored by Amazon.

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Table of ContentsAbstract and introduction .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i

Abstract ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1Are you Well-Architected? .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1Introduction .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

AWS for data .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3Modern data architecture .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3Purpose-built databases .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4Purpose-built data analytics services .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5ML and AI services .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Data driven architectural patterns .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Build Customer 360 architecture on AWS ..... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Build event driven architectures with IOT sensor data .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Build personalized recommendations architecture on AWS ..... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Build near real-time customer engagement on AWS ..... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13Build fraud detection architectures on AWS ..... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

Considerations for building data driven applications on AWS ..... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16Conclusion .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19Contributors ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20Further reading .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21Document revisions .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22Notices .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23AWS glossary .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

iii

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Abstract

Architectural Patterns to Build End-to-End Data Driven Applications onAWS

Publication date: August 3, 2022 (Document revisions (p. 22))

AbstractAccording to Forbes, 85% of businesses want to be data driven, but only 37% have been successful. Mostof the organizations that try to use data to modernize and innovate struggle with finding the provenarchitectural patterns that customers have implemented to build data-driven applications. Successfulcustomers use multiple Amazon Web Services (AWS) services between event-driven Internet of Things(IoT) to collect and manage billions of devices and purpose-built databases. This allows them to savecosts, grow and innovate faster, and use data analytics to get the fastest insights on all their data andartificial intelligence/machine learning (AI/ML) to help them innovate for the future.

In this whitepaper, we present some commonly used data-driven applications and proven architecturalpatterns based on successful customer implementation. This enables customers who are looking to buildthose data driven applications to accelerate time to solution.

Are you Well-Architected?The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you makewhen building systems in the cloud. The six pillars of the Framework allow you to learn architectural bestpractices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems.Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you canreview your workloads against these best practices by answering a set of questions for each pillar.

In the Data Analytics Lens, we describe a collection of customer-proven best practices for designing well-architected analytics workloads.

For more expert guidance and best practices for your cloud architecture—reference architecturedeployments, diagrams, and whitepapers—refer to the AWS Architecture Center.

IntroductionThere are more than 200 services provided by AWS. Customers can use this diverse set of choices toselect the correct tool for the right job, but sometimes it can be confusing to map to different usecases. There are some services with overlapping capabilities, and customers often struggle to finda proven pattern when they’re developing their data-driven applications. Some customers take theapproach of doing multiple proof of concepts (POCs), which is time consuming, and might still failto instill confidence in whether a set of services is a proven pattern based on other AWS customerimplementations.

1

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Introduction

This whitepaper brings together the most commonly used data-driven applications and architecturalpatterns that other AWS customers have proven to be successful in their implementations. Thesearchitecture reference patterns provide guidance to quickly select a proven architectural pattern andfurther modify it to meet your application needs. In some cases, these architecture patterns can be usedwithout modification, thereby minimizing the need to POC extensively, and accelerate time to solution.

The architecture reference patterns covered in this whitepaper also provide thought leadership for youto create a future state strategy to modernize your data-driven applications. It helps you look throughthe lens of how various AWS data services, such as AWS IOT for event-driven IOT data collection, managebillions of devices and purpose-built databases to save costs, modernize databases for the cloud, andinnovate faster. Use analytics to get fastest insights on all your data, from big data processing, datawarehousing to visualization, and AI/ML, to build and operationalize ML and AI applications to innovatefor the future.

2

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Modern data architecture

AWS for dataThere are three different stages to building a modern data strategy. These stages are not sequential, andyou can start at any stage, depending on your data journey.

• Modernize — You can modernize your data infrastructure with the most scalable, trusted, and securecloud provider.

• Unify — You can unify your data by putting it to work in the best data lakes and purpose-built datastores.

• Innovate — You can create new experiences and reimagine old processes with AI and ML.

These three stages in turn lead to the development of a modern data strategy on AWS, as illustratedin the following figure. This strategy gives you the best of both data lakes and purpose-built stores. Itenables you to store any amount of data at a low cost, and in open-standards data formats. This helpsyou avoid the risks of getting locked into a proprietary format. It helps you break down data silos andempower your teams to run analytics or ML at scale, using your preferred tools and techniques.

Modern data strategy on AWS

Modern data architectureAWS modern data architecture connects your data lake, your data warehouse, and all other purpose-builtstores into a coherent whole. The following figure depicts a modern data architecture on AWS.

3

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Purpose-built databases

Modern data architecture on AWS

Key:

1. Store data in a purpose-built database that can support a modern application and its differentfeatures. It no longer needs to be a single type of relational database, but could be a NoSQLdatabase, cache store, or anything else which works for the application. The focus is on applicationempowerment, not analytics.

2. Use the data lake, which uses a storage service such as Amazon Simple Storage Service (Amazon S3),to store data from all those purpose-built databases and native open-file formats. Open formats givemore power to organizations to use the data however they need to.

3. Once the data lake is hydrated with data, you can build analytics of any kind easily, and useany technology customers want to use for their use cases. This can range from traditional datawarehousing and batch reporting to more near real-time alerting and reporting. This can even be ad-hoc querying of data, or more advanced ML-based analytics use cases. Data is now stored in a layerthat’s more open and provides more freedom of analytics choices.

4. ML and AI are also critical for a modern data strategy, to help you predict what will happen inthe future, and to build intelligence into your systems and applications. Customers have variedneeds when it comes to AI/ML, and require a broad set of AI/ML services. This is true for expert MLpractitioners, for everyday developers, and for customers who don’t want to build and train models.

5. Finally, data governance is a critical element to combining and sharing data from different sources soyou can put data in the hands of people at all levels of your organization. There are data residencycompliance needs which come with globalization. There is need for granular control over who haspermission to access data, and standard security needs such as encryption, auditability, monitoring,alerting, and so on.

Purpose-built databasesOrganizations are breaking free from legacy databases. They are moving to fully managed databases andanalytics, modernizing data warehouse databases, and building modern applications with purpose-builtdatabases to get more out of their data. Organizations want to move away from old-guard databasesbecause they are expensive, proprietary, lock you in, offer punitive licensing terms, and come with

4

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Purpose-built data analytics services

frequent audits from their vendors. Organizations that are moving to open-source databases are lookingfor the performance of commercial-grade databases with the pricing, freedom, and flexibility of open-source databases.

AWS provides purpose-built database services to overcome these challenges with on-premises databasesolutions. The most straightforward and simple solution for many organizations that are strugglingto maintain their own relational databases at scale is a move to a managed database service, suchas Amazon Relational Database Service (Amazon RDS). In most cases, these customers can migrateworkloads and applications to a managed service without needing to re-architect their applications, andtheir teams can continue to use the same database skill sets. Organizations can lift and shift their self-managed databases such as Oracle, SQL Server, MySQL, PostgreSQL, and MariaDB to Amazon RDS. Fororganizations that are looking for better performance and availability, they can move their relationaldatabases to MySQL and PostgreSQL to Amazon Aurora and get three to five times better throughput.

Many organizations also use non-relational databases such as MongoDB and Redis as document and in-memory databases for use cases such as content management, personalization, mobile apps, catalogs,and near real-time use cases such as caching, gaming leaderboards, and session stores. The moststraightforward and simple solution for many organizations who are struggling to maintain their ownnon-relational databases at scale is a move to a managed database service (for example, moving self-managed MongoDB databases to Amazon DocumentDB or moving self-managed in-memory databasessuch as Redis and ElastiCache to Amazon ElastiCache). In most cases, these organizations can migrateworkloads and applications to a managed service without needing to re-architect their applications, andtheir teams can continue to use the same database skill sets.

Purpose-built database services on AWS

Purpose-built data analytics servicesPurpose-built data analytics services enable you to use the right tool for the right job, and provides theagility to iteratively and incrementally build out the modern data architecture. You gain the flexibility toevolve the data lake house to meet current and future needs as you add new data sources, discover newuse cases and their requirements, and develop newer analytics methods.

AWS provides a portfolio of purpose-built data analytics services, which include AWS Glue, AmazonEMR, Amazon Athena, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK),

5

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

ML and AI services

Amazon OpenSearch Service, Amazon Redshift, and Amazon QuickSight for various unique analyticsuse cases. These are managed services, so organizations don’t need to worry about administration taskssuch as software provisioning, configuration, backups, patching, and recovery. Instead, organizations canfocus on getting insights from their data and creating a business value, not managing the infrastructure.These purpose-built data analytics services are optimized for specific use cases, because one sizesolution doesn’t fit for all use cases and leads to compromises in analytics. The purpose-built-analyticscapabilities are illustrated in the following figure.

Purpose-built data analytics services on AWS

ML and AI servicesIn the past, ML technology was limited to a few major tech companies and hardcore academicresearchers, but things began to change when cloud computing entered into the mainstream. Computepower and data became more available, and ML is now making an impact across almost every industry,including finance, retail, fashion, real estate, healthcare, and more. ML is moving from the peripheral to acore part of most every business and industry. In time, virtually every application will be infused with MLand AI.

AWS provides the broadest and deepest set ML and AI services, in three different levels, for builders of alllevels of expertise in their ML journey:

• For expert practitioners, AWS supports all the major ML frameworks ,including TensorFlow, MXNet,PyTorch, Caffe 2, and so on. We offer the highest performance instances for ML training in the cloudwith Amazon EC2 P4d instances, powered by the latest NVIDIA A100 Tensor Core GPUs, and coupledwith first-in-the-cloud 400 Gbps instance networking. P4d instances are deployed in hyperscaleclusters, called EC2 UltraClusters, offering supercomputer-class performance for the most complex MLtraining jobs. For inference, which represents 90% of the cost of ML, we provide the lowest cost for MLinference in the cloud with Amazon EC2 Inf1 instances, powered by AWS Inferentia chips.

6

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

ML and AI services

• For data scientists and ML developers, AWS provides Amazon SageMaker. SageMaker was builtfrom the ground up to simplify the process of ML with tools for every step of the ML development,including labeling, data preparation, feature engineering, statistical bias detection, auto-ML, training,tuning, hosting, explainability, monitoring, and workflows.

• For developers and business users, AWS provides pre-trained AI services that provide ready-madeintelligence for applications and workflows, and end-to-end solutions that can solve business needsright out of the box using AutoML technology. These services address common use cases such aspersonalized recommendations, contact center intelligence, document processing, intelligent search,business metrics analysis, and more. AWS also provides industry-specific AI services for both industrialand healthcare industries.

AWS ML and AI services

7

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Build Customer 360 architecture on AWS

Data driven architectural patternsMany data-driven organizations seek the truth by treating data like an organizational asset, no longerthe property of individual departments. They set up systems to collect, store, organize, and processvaluable data and make it available in a secure way, to the people and applications that need it. Theyalso use technologies such as ML to unlock new value from their data such as improving operationalefficiency, optimizing processes, developing new products and revenue streams, and building bettercustomer experiences.

We depict the five most commonly seen architecture patterns on AWS, that cover several use cases forvarious different industries and customer sizes:

• Customer 360 architecture• Event-driven architecture with IOT data• Personalized architecture recommendations• Near real-time customer engagement• Data anomaly and fraud detection

Build Customer 360 architecture on AWSAs customer expectations continue to rise, organizations face a gap between their need to use their datato meet customer expectations, and their current data management practices’ ability to deliver. So howcan companies close this gap? They must work toward taking back ownership of their data to harness thepower of their customer and prospect information, to compete and win on customer experience.

Customer 360 describes the trend of building a complete and accurate picture of every customer’sstructured and unstructured data across an organization. Customer 360 is not a single solution, nor asingle service. It is the aggregation of all customer data into a single unified location, that can be queriedand used to generate insights for improving the overall customer experience.

Some of the common use cases and solutions that fit under these Customer 360 architecture patternsare as follows:

• Marketing and demand generation• Enrich customer profiles and identify the best prospects to drive customer acquisition.• Enable affinity marketing to achieve greater wallet share.• Personalize offers and website experiences based on historical customer data and micro-targeted

segments to increase conversion rates.• Monitor campaign effectiveness and key performance indicators to improve the return on

investment (ROI) of marketing efforts.• Predict consumer churn so you can take proactive actions to drive retention and repeat purchase.

• Sales growth• Gain insights to identify opportunities, predict customer intent, and foster stronger relationships.• Foster relationships with a complete view of a customer’s interactions to better understand the

health of the relationship.• Increase customer lifetime value with data-driven, next best action suggestions, and intelligent

cross-sell and upsell recommendations.• Deliver consistent cross-channel experiences. enabling an online-offline feedback loop.

8

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Build Customer 360 architecture on AWS

• Customer service

• Use holistic customer profiles to accelerate issue resolution and improve customer satisfaction.

• Detect anomalies and strengthen trust.

• Implement self-service tools and chatbots that allow customers to resolve issues themselves.

• Deliver personalized loyalty offers to increase customer lifetime values and reduce churn.

• Generate customer lifetime value and next best action recommendations.

The following diagram illustrates customer 360 architecture on AWS.

Customer 360 architecture on AWS

The steps that data follows through the architecture are as follows:

1. Data ingestion — Establishing customer 360 data by consolidating your disparate data sources in aschemeless ingestion approach, with the guiding principle of:

• Getting the right data ingested as close to the source system as possible from past data.

• Predicting future data and persisting them in object storage such as Amazon S3 or capturing themas streams in near real-time using:

• SAP connectors.

• AppFlow services that intuitively connect you to your Salesforce account.

• Google analytics services and data movement service to connect to JDBC/ODBC sources to capturethe data in incremental fashion with zero coding.

2. Building a unified customer profile — You need to extract and link elements from each customerrecord, creating a customer 360° dataset that will serve as a single source of truth for your customerdata hub, with a data model to build your customers context on their journey. Using Amazon Neptune,you can create a near real-time, persistent, and precise view of the customer and their journey in agraph-based storage system. Use Neptune to store entities that provide context through an associatedand predefined set of rules, allowing relationship traversals viewed in a connected fashion.

This 360° knowledge graph serves the potential information to uncover hidden customer segments,and deepens customer hierarchy, provides cluster analysis resulting in the creation of customerpersonas, and evaluates the size of each segment. The unique customer journey is preserved bywriting reusable Gremlin/SPARQL libraries that are automated through a serverless data lakeframework.

9

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Build event driven architectures with IOT sensor data

3. Intelligence layer — Offloading this journey into analytical store such as Amazon Redshift or S3enables your data scientists and data analysts to refine the ontologies and shorten the graph pathwith access to raw information already preserved in S3 using AWS Glue DataBrew and Amazon Athenaserverless.

4. Activation layer — This refinement to customer ontology represents connectedness betweencustomer 360° data using AWS lake house architecture. It enriches customer profiles withrecommendations, predictions using AI/ML to test the customer journey hypothesis, and createsthe next best action APIs by sensing and responding to signals through the customer lifecycleusing Amazon personalize and pinpoint. Actions offered via next-best APIs are now integrated andpresented across distribution systems and channels for resilient and optimized personal experience.

Build event driven architectures with IOT sensordata

Organizations create value by making decisions from their data in near real-time. Some of the commonuse cases and solutions that fit under this event driven architecture pattern include:

• Industrial IOT use cases to monitor industrial equipment quality, to determine actions such asadjusting machine settings, using different sources of raw materials, or doing additional workertraining that will improve the quality of the factory output.

• Medical device data collection for personalized patient health monitoring, adverse event prediction,and avoidance.

• Connected vehicle use cases such as voice interaction, navigation, location-based services, remotevehicle diagnostics and predictive maintenance, media streaming, and vehicle safety that includescomputing within vehicles and in-the-cloud, near real-time predictive analytics.

• Sustainability and waste reduction solutions on AWS can provide access to dashboards, monitoringsystems, data collection, and summarization tools that use ML algorithms that meet sustainabilitygoals. Meeting sustainability goals is paramount to many use customers in travel and hospitalityindustries.

The following diagram illustrates event-driven IOT sensor data for near real-time predictive analytics.

Derive near real-time predictive analytics from IOT data

10

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Build personalized recommendations architecture on AWS

The steps that data follows through the architecture are as follows:

1. Data originates in IOT devices such as medical devices, car sensors, industrial IOT sensors, and so on.This telemetry data is collected close to the devices using AWS IoT Greengrass, which enables cloudcapabilities to local devices.

2. Data is then ingested into the cloud using edge-to-cloud interface services such as AWS IOT Core,which is a managed cloud platform that lets connected devices easily and securely interact with cloudapplications and other devices, or AWS IOT SiteWise, which is a managed service that lets you collect,model, analyze, and visualize data from industrial equipment at scale.

3. Stream data ingested into the cloud gets transformed in near-real time using Amazon KinesisData Analytics, which offers an easy way to transform and analyze streaming data in near real-time with Apache Flink and Apache Beam frameworks. Stream data often needs to be enrichedusing lookup data which is hosted in a data warehouse. Amazon Redshift uses customer integrationor stream aggregates (for example, one minute or five minutes) from Kinesis Data Analytics. TheFlink application gets written into Amazon Redshift for further business intelligence (BI) reportingdownstream.

4. Amazon SageMaker is a fully ML learning service. Once the ML model is trained and deployedin SageMaker, inferences are invoked in micro-batch using AWS Lambda. Inferenced data is sentto Amazon OpenSearch Service to create personalized monitoring dashboards using AmazonOpenSearch Service Dashboards.

5. The data lake stores telemetry data for future batch analytics. The data is micro-batch streamed intothe S3 data lake using Amazon Kinesis Data Firehose, which is a fully managed service for deliveringnear real-time streaming data to destinations such as S3, Amazon Redshift, Amazon OpenSearchService, Splunk, and any custom HTTP endpoints or owned by supported third-party service providers,including Datadog, Dynatrace, LogicMonitor, MongoDB, New Relic, and Sumo Logic.

Build personalized recommendations architectureon AWS

Consumers are increasingly engaging with businesses through digital surfaces and touchpoints. Everytouchpoint is an opportunity for businesses to give their consumers an experience tailored to theirpreferences and needs. In fact, consumers now expect personalized experiences from businesses.Market research tells us 63% of consumers see personalization as the standard level of service. Genericrecommendations no longer meet customer expectations.

Some of the common use cases and solutions that fit under these personalized recommendationsarchitecture patterns are as follows:

• Personalization experience — Personalize users’ homepages with product or contentrecommendations based on their unique preferences and behavior. Personalize push notifications andmarketing emails with individualized product, content, and promotional recommendations to helpusers find fresh, new products and content based on their unique tastes and preferences.

• Personalization in retail — Improving the customer experience by providing productrecommendations based on their shopping history. Recommend similar items on product details pagesto help users easily find what they are looking for.

• Personalization in media and entertainment — Deliver highly relevant, individualized contentrecommendations for videos, music, and e-books. Create personalized content carousels for every userbased on their content consumption history.

The following diagram illustrates the system for building personalized recommendations on AWS.

11

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Build personalized recommendations architecture on AWS

Build real-time recommendations on AWS

The steps through the architecture are as follows:

1. Data preparation — Collect user interaction data such as item views, and item clicks. This plays apivotal role in building a personalized recommendation. Once data is collected, upload your userinteraction data into Amazon S3, then perform data cleaning using AWS Glue DataBrew to train themodel in Amazon Personalize for real-time recommendations.

2. Train the model with Amazon Personalize — The data we use for modeling on Amazon Personalizeconsists of three types:

• The activity of your users, also known as events — Examples include items your users click on,purchasing, and watching. The events you choose to send Amazon Personalize are dependent onyour business domain. This dataset has the strongest signal for personalization, and is the only onerequired for personalization.

• Details about your items, such as price point, category information, genre, and so on. Essentially,information in your catalog. This dataset is optional, but very useful for scenarios such asrecommending new items. Personalize also enables customers to unlock the information trappedin their product descriptions, reviews, item synopses, or other unstructured text using state-of-the-art natural language processing (NLP) techniques. Amazon Personalize automatically extracts keyinformation about the items in your catalog to use when generating recommendations for yourusers.

• Details about the users, such as their location, age, and so on.

3. Get real time recommendations — Once you have the data, you can, in just a few clicks, get a custom,private, personalization model trained and hosted for you. You can then vend recommendations foryour users through an API exposed through Amazon API Gateway.

12

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Build near real-time customer engagement on AWS

Build near real-time customer engagement onAWS

This architecture is about maximizing customer engagement by delivering targeted messaging to yourusers in near real-time. For a business, engaging with customers is always an important day-to-dayaspect, and those communications need to be clear, concise, and well-stated. There is even more of animpact if that messaging is also targeted and customized for the customer. Using this architecture, youcan interact with that customer in near real-time and, utilizing advanced ML, access targeted messagingdirectly applicable to the customer. It enables the collection and analysis of campaign data, and theability to use that data to create customized models within Amazon SageMaker. This allows you to beable to target a specific customer, or group of customers, and address their needs on a near real-timebasis. This pattern can help you develop a production level SageMaker model, and a feedback loop thatcan unlock value from marketing campaign data.

Some of the common use cases and solutions that would work with this architecture are as follows:

• Churn modeling – Using all the relevant data stored within your data lake, you have the ability tocreate a churn model. This is an AI/ML algorithm using a known set of historical data to learn if similarcustomers will be more likely to leave in the future. This information can then be used to engage withthe customer and save their business.

• Recommendation modeling via customer segmentation – Not all customers are alike, and theywon’t all respond to the same type of marketing campaign. But by creating multiple personas andassigning the customer base to those personas, a marketing team can create personalized messagingor promotions for specific groups.

• Marketing efficiency – Collecting streaming event data such as opens and clicks, and allowing analysisof an ongoing campaign. Stream this data directly into your data lake, and augment this data withother data located there

• Customer engagement – Use transactional messaging to communicate with customers based uponinformation that has just happened.

Near real-time customer engagement architecture on AWS

13

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS WhitepaperBuild fraud detection architectures on AWS

The steps in this architecture are as follows:

1. Initialize Pinpoint and create a marketing project — You first need to configure your project to addyour users and their contact info, such as email addresses. At this time, you also need to configureyour metrics collection to ensure that you can capture your customer interactions to Amazon S3.

2. Near real-time data ingestion — Grab data from Amazon Pinpoint in near real-time through AmazonKinesis Data Firehose. Optionally, you can change this to an Amazon Kinesis Data Stream for near real-time use cases. This data is collected into S3, and used to both train the ML model and to analyzeactivity in your Amazon QuickSight dashboard.

3. SageMaker model implementation — Using a combination of Amazon Pinpoint engagement datawith other customer demographic data, you can train a model that predicts the likelihood of customerchurn (or segmentation, or other customer modeling insight). This is done in an iterative manner untilyou validate that your model is effective. Once that’s done, you can set up a SageMaker endpoint tohost this model and run inference against. This is done in a batch manner. It exports the results toAmazon Pinpoint and S3 for consumption.

4. Data consumption with Athena and QuickSight — View and analyze all of the data being collectedfrom the Amazon Pinpoint engagement, and combine it with other data facts from your data lake withAmazon Athena. You can explore this data and run ad-hoc SQL queries to gain direct insight directlyfrom S3 storage. Combine this with Amazon QuickSight to visualize these insights and share themwith others in the organization.

Build fraud detection architectures on AWSThe ability to move fast and serve customers in near real-time is becoming necessary for businessesto stop fraud in its tracks. Fraud represents a significant problem for businesses, and by addressingthis challenge as soon as it occurs, the ability to recover from and reduce damages is greatly increased.Using this architecture, you can perform fraud detection in near real-time, and build fraud detectionvisualization dashboards for further analysis.

The AWS Solutions Implementation, Fraud Detection Using Machine Learning, enables you to runautomated transaction processing. This can be on an example dataset or your own dataset. The includedML model detects potentially fraudulent activity, and flags that activity for review. The followingdiagram presents an architecture you can automatically deploy using the solution’s implementationguide, and accompanying AWS CloudFormation template.

Some of the common use cases and solutions that work with this architecture are as follows:

• Fraud detection – Organizations with online businesses have to be on guard constantly for fraudulentactivity, such as fake accounts or payments made with stolen credit cards.

• Near real-time analytics – Use this architecture to understand the fraudulent activities data that youhave streaming.

14

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS WhitepaperBuild fraud detection architectures on AWS

Fraud detection architecture on AWS

The steps through the architecture are as follows:

1. Develop a fraud prediction machine learning model — The AWS CloudFormation Template deploysan example dataset of credit card transactions contained in an S3 bucket. An Amazon SageMakernotebook instance with different ML models is trained on the dataset.

2. Perform fraud prediction — The solution also deploys an AWS Lambda function that processestransactions from the example dataset. It invokes the two SageMaker endpoints that assign anomalyscores and classification scores to incoming data points. An Amazon API Gateway REST API initiatespredictions using signed HTTP requests. An Amazon Kinesis Data Firehose delivery stream loads theprocessed transactions into another Amazon S3 bucket for storage. The solution also provides anexample of how to invoke the prediction REST API as part of the Amazon SageMaker notebook.

3. Analyze fraud transactions — Once the transactions have been loaded into S3, you can use analyticstools and services for visualization, reporting, ad-hoc queries, and more detailed analysis.

15

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Considerations for building datadriven applications on AWS

When you are building data driven applications on AWS, work backwards from what are important toyour business application (for example, SLA’s, cost, performance, consumer patterns) so you do not over-engineer nor select a service because it’s the new shiny tool. Once you identify the key requirements,then leverage the reference patterns outlined to pick the right services that fits your use case. Some keyconsiderations for building data driven applications are listed below.

• User personas — You must first identify user personas who are the correct stakeholders for yourapplication, such as data engineers/developers, data analysts, BI teams, data scientists, externalentities, and so on, and learn how they like to use the application.

• Data sources — Based on the application, there can be one or more diverse sets of data sources thatmay host the data required for your application. Treat data like an organizational asset, meaning it isno longer the property of individual departments. Break down data silos by thinking for the future,and make data available to the entire organization. Create a trusted data platform that can driveactionable insights with increased agility, improved customer experience, and increased operationalefficiencies.

• Data ingestion — Bring your data into the storage layer using AWS purpose-built services. They arespecially designed to ingest data from a wide variety of data sources at any velocity, to support uniquedata sources and data types. These purpose-built services can deliver data directly into data lakes anddata warehouse storage layers, as illustrated.

• Real-time data ingestion — You can use either Amazon Kinesis Data Streams or Amazon ManagedStreaming for Apache Kafka (Amazon MSK) to ingest near real-time data such as application logs,website clickstreams, and IoT telemetry data. If you have a streaming use case and you want to use anAWS native, fully managed service, consider Amazon Kinesis Data Streams. If open-source technologiesare critical for your data processing strategy, you are familiar with Apache Kafka, you intend to havemultiple consumers per stream, and you are looking for near real-time latency less than 200ms, werecommended you choose Amazon MSK rather than Amazon Kinesis.

• Batch data ingestion — You can ingest operational database data into a storage layer using AWSDatabase Migration Service (AWS DMS) and AWS Lake Formation blueprints. AWS DMS can be usedto bring commercial databases such as Oracle and Microsoft SQL Server, open-source databasessuch as MySQL and PostgreSQL, Amazon databases such as Amazon RDS and Amazon Aurora, datawarehouses such as Teradata, and No-SQL databases such as MongoDB into the storage layer. AWSLake Formation blueprints workflows make data ingestion simple by generating AWS Glue crawlers,jobs, and triggers that discover and ingest database data into a data lake. The blueprints are designedto support both one-time database data migration into data lakes, or incremental data updates intodata lakes.

• File shares — One of the most commonly used methods to move data within and outside of theorganization is by using files. A widely used pattern to move files is AWS DataSync, which makes itsimple and fast to move large amounts of files from Network File System (NFS) shares, Server MessageBlock (SMB) shares, and Hadoop Distributed File Systems (HDFS) into Amazon S3 data lakes.

• Software as a Service (SaaS) applications — Using Amazon AppFlow makes it easy to ingest SaaSapplications data into Lake House storage layer. It is a fully managed service, and it can connect tovarious SaaS applications such as Salesforce, ServiceNow, Datadog, Dynatrace, Zendesk, TrendMicro,Veeva, Snowflake, and Google Analytics. Amazon AppFlow then ingests SaaS applications data directlyinto the data lake, or staging tables in an Amazon Redshift data warehouse.

16

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

• Partner data feeds — AWS Transfer Family is a serverless service that provides secure file transferprotocol (FTP) endpoints and integrates with Amazon S3. It stores partner data feeds as S3 objects inthe landing zone of the data lake.

• Third-party data sources — AWS Data Exchange has hundreds of commercial data products acrossvarious industries such as financial services, healthcare, retail, media, and entertainment. It includeshundreds of free data sets collected from popular public sources. You can ingest third-party data setsinto an S3 data lake landing zone by subscribing to third-party data products. Amazon Data Exchangehas native integration with Amazon Redshift, where you can share your Amazon Redshift databasedata sets as a data producer.

• Custom data sources — AWS Glue custom connectors can discover and integrate with a variety ofdata sources, such as SaaS applications and custom data sources. You can search and select connectorsfrom the AWS Marketplace, and begin your data ingestion workflow in minutes to bring custom datasources’ data into an S3 data lake.

• Query federation — You can use Amazon Redshift materialized views or Amazon Aurora queryfederation with scheduled refresh to minimize extract, transform, load (ETL) wherever it’s applicable,depending on your use case.

• Data storage — The storage layer consists of Amazon S3 and Amazon Redshift, besides the purpose-built databases outlined earlier in this whitepaper such as Amazon RDS and Amazon DynamoDB. Theyprovide an integrated storage layer for the data platform. You can ingest data from various sourcesinto data lake raw zones in S3. Next, the processing layer transforms the data by applying schema,partition, and stores into cleaned or transformed zones in S3. Then the processing layer curates atransformed dataset by modeling it and joining it with other datasets, and stores it in a curated layer.The datasets from the curated layer in S3 are partially or fully loaded into Amazon Redshift datawarehouse storage for the use cases that need very low latency, or need to run complex SQL queries.

• Data catalog — The data catalog layer is responsible for storing business and technical metadata fromthe storage layer. As you build out your modern data architectures on AWS, you will start ingestinghundreds to thousands of data sets from a variety of sources into data lakes and data warehouses. Youwill need a central data catalog for all datasets in the storage layer in a single place, making it easilyavailable for retrieving information such as data location, format, and columns schema of data sets.AWS AWS Glue Data Catalog provides a central catalog to store metadata for all datasets hosted inthe storage layer. Also, AWS Glue Data Catalog provides APIs to enable metadata management usingcustom scripts and third-party products.

• Data processing — Data processing pipelines can be multi-step, or scheduled on a regular interval.You can also invoke data processing pipelines based on event triggers. AWS Glue and AWS StepFunctions provide all required components to build, orchestrate, and run pipelines that can scale toprocess huge data volumes for various use cases. You can build multi-step workflows using AWS Glueand Step Functions that can catalog, transform, and enrich the datasets, and load the data sets intoraw, transformed, and curated zones in the Lake House storage layer. If you are processing for largedata sets, use Amazon EMR for data processing. If your data processing is based on Apache Sparkand you are looking for a serverless option, use AWS Glue instead of Amazon EMR for data sets lessthan 5TB. You can also use Amazon EventBridge for building event-driven data pipelines, and AmazonAthena for lightweight ETL.

• Data consumption — Data consumption layer is responsible for enabling user personas for purpose-built analytics. For example, you can use Amazon Athena for ad-hoc interactive queries, Amazon EMRfor big data processing and ETL, Amazon Redshift for modern data warehouses, Amazon OpenSearchService for operational analytics, Amazon Kinesis Analytics for near real-time analytics, and AmazonQuickSight for BI reports and dashboards.

• Data governance — You need to consider how to enforce data security and data governance in yourdata lake architecture. AWS Lake Formation integrates with AWS AWS Glue Data Catalog, and providesfine-grained access control so you can have a unified data governance across various AWS purpose-built analytics services.

• Data prediction — You can use multiple options for your data prediction solutions. You can useAmazon SageMaker-built algorithms for ML, or you can bring your algorithm into SageMaker for

17

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

training and deployment of your ML model for data prediction. You can also use Amazon Web ServicesAI services to solve many business problems. Use Amazon Kendra for intelligent search, AmazonFraud Detector for fraud detection, Amazon Personalize for personalized recommendations, AmazonForecast for time-series forecasting, and so on. These services are developed, trained, and continuouslyoptimized by Amazon data scientists, so there is no ML knowledge or experience required of the enduser. These services can simply be accessed by the end user by using an API call. You can also considerML built-in integrations for various data services such as Aurora ML, Neptune ML, Amazon Redshift ML,Athena ML, QuickSight ML, and Amazon OpenSearch Service ML for your data prediction solutions.

18

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

ConclusionAWS provides various purpose-built data services to build data driven architecture for any use case. Eachof these services has its own characteristics in terms of scalability, latency, and cost. Most importantly,they are deeply integrated with one another, making them easy to use. The reference architecturesoutlined in this whitepaper give you a great start, and accelerate time to solution. We encourage you tomake your architecture extensible by using new data services as they become available.

19

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Contributors• Harsha Tadiparthi, Analytics Specialist Principal Solutions Architect, Amazon Web Services.• Raghavarao Sodabathina, Enterprise Solutions Architect, Amazon Web Services

20

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Further readingFor additional information, refer to:

• Analytics on AWS• Machine Learning on AWS• Databases on AWS• Harness the power of your data with AWS Analytics (AWS blog post)• Derive Insights from AWS Modern Data (AWS whitepaper)• Design a data mesh architecture using AWS Lake Formation and AWS Glue (AWS blog post)

21

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

Document revisionsTo be notified about updates to this whitepaper, subscribe to the RSS feed.

update-history-change update-history-description update-history-date

Initial publication (p. 22) Whitepaper published. August 3, 2022

22

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

NoticesCustomers are responsible for making their own independent assessment of the information in thisdocument. This document: (a) is for informational purposes only, (b) represents current AWS productofferings and practices, which are subject to change without notice, and (c) does not create anycommitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or servicesare provided “as is” without warranties, representations, or conditions of any kind, whether express orimplied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements,and this document is not part of, nor does it modify, any agreement between AWS and its customers.

© 2022 Amazon Web Services, Inc. or its affiliates. All rights reserved.

23

Architectural Patterns to Build End-to-End DataDriven Applications on AWS AWS Whitepaper

AWS glossaryFor the latest AWS terminology, see the AWS glossary in the AWS General Reference.

24