How to Design AWS Data Integration: Configuration Patterns and Consideration Steps Architects Should Understand

The optimal method for data integration on AWS varies greatly depending on the requirements (frequency, latency, data volume, and security).
This article gives an overview, from the initial consideration steps to typical integration patterns, data lake design, and account separation. The bottom line: successful designs do not start from a service name. They start by defining SLOs and operational requirements (re-execution, auditing, cost), and then fix the data's destination and the separation of responsibilities.

Things to decide when designing AWS data integration

The quality of a design depends on whether every key issue that must be decided up front has been identified.

First, define the success conditions numerically: the upper limit of delay, the acceptable level of data loss, the time available for reprocessing, and the granularity users need. If these are vague, requirements expand later and the configuration becomes complicated.
Next, define the data boundaries and the division of responsibility. Operation stays stable when the provider delivers correct data in a defined format, the platform handles delivery, storage, and re-execution, and the user owns the logic applied to the converted data.
Finally, fix security and cost as constraints. Deciding encryption, access scope, retention period, and deletion requirements first keeps decisions on S3 design, AWS KMS keys, and network routing consistent.
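
As a minimal sketch of what "numerical success conditions" can look like, the Python snippet below records an SLO and checks one observed run against it. The class names, field names, and all values are hypothetical illustrations, not part of any AWS service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrationSlo:
    max_delay_seconds: int      # upper limit of end-to-end delay
    max_loss_ratio: float       # acceptable fraction of lost records
    max_reprocess_seconds: int  # time available for reprocessing


@dataclass(frozen=True)
class Observation:
    delay_seconds: int
    records_sent: int
    records_received: int
    reprocess_seconds: int


def violations(slo, obs):
    """Return the names of the success conditions that the run violated."""
    found = []
    if obs.delay_seconds > slo.max_delay_seconds:
        found.append("delay")
    lost = max(obs.records_sent - obs.records_received, 0)
    loss_ratio = lost / obs.records_sent if obs.records_sent else 0.0
    if loss_ratio > slo.max_loss_ratio:
        found.append("loss")
    if obs.reprocess_seconds > slo.max_reprocess_seconds:
        found.append("reprocess")
    return found


# Hypothetical targets: 1 hour delay, no loss, 4 hours for reprocessing.
slo = IntegrationSlo(max_delay_seconds=3600, max_loss_ratio=0.0,
                     max_reprocess_seconds=4 * 3600)
obs = Observation(delay_seconds=5400, records_sent=1000,
                  records_received=1000, reprocess_seconds=1800)
print(violations(slo, obs))  # ['delay']
```

Writing the targets down as data makes it harder for requirements to expand silently later: any change to a threshold becomes a visible diff.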

Steps for considering data integration

The consideration proceeds in the following order: requirements → method → destination → operation.
This order works because selecting the method or service first anchors the design to what is "possible," and requirements that surface later become hard to implement properly. In particular, adding delay, ordering, reprocessing, or auditing requirements afterward means rebuilding the data model and storage structure as well.

Requirements definition (data type, frequency, delay, volume)

First, classify the target data. The storage format depends on whether the data is structured, semi-structured, or unstructured, and the required strength of encryption and access control depends on whether it includes sensitive information.
Set the update frequency and acceptable latency as ranges. Streaming suits latencies on the order of seconds, while batch suits hourly or daily intervals. Latency lower than the business requires only increases cost and operational complexity.
Estimate the data volume for both normal and peak periods, and determine the RPO/RTO, retention period, audit log, and deletion requirements.

Selection of integration method (batch, streaming, event)

Integration methods are divided into batch, streaming, and event-driven. Batch is easy to re-execute but has high latency; streaming has low latency but requires careful operational design; event-driven is loosely coupled and tolerant of change.
For delivery guarantees, most configurations assume at-least-once delivery, so the design accepts duplicates and ensures idempotency. Rather than aiming for something equivalent to exactly-once, a data model that aggregates correctly even when duplicates arrive is more robust in long-term operation.
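
A data model that tolerates duplicates can be sketched in Python as follows. This assumes each event carries a unique identifier; the field names (event_id, store, amount) and the sample events are hypothetical.

```python
def aggregate_sales(events):
    """Sum amounts per store, counting each event_id only once."""
    seen = set()
    totals = {}
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate delivery: skip instead of double counting
        seen.add(event["event_id"])
        totals[event["store"]] = totals.get(event["store"], 0) + event["amount"]
    return totals


# At-least-once delivery: event "e2" arrives twice.
events = [
    {"event_id": "e1", "store": "tokyo", "amount": 100},
    {"event_id": "e2", "store": "tokyo", "amount": 200},
    {"event_id": "e2", "store": "tokyo", "amount": 200},
    {"event_id": "e3", "store": "osaka", "amount": 50},
]
print(aggregate_sales(events))  # {'tokyo': 300, 'osaka': 50}
```

Because the result depends only on the set of distinct events, replaying the same input after a failure yields the same totals.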

Designing the destination for data and schema management

Decide the destination by working backward from where the data will be used. For analysis and machine learning, use an S3 data lake; for complex aggregation, use a data warehouse; for reference from applications, use an application database.
In a data lake, define layers (zones) of data. Separating data into raw (data as received), cleansed (processed data), and curated (aggregated/domain model) lets you rebuild from the raw data after a failure.
Deciding up front how schema changes are handled (maintaining compatibility or moving to a new version) lets you estimate the modification cost for users.
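
One way to make a "maintain compatibility" rule checkable is to compare two schema versions mechanically. The Python sketch below uses a hypothetical dictionary representation of a schema and one possible set of rules (removing a field, changing a type, or adding a required field counts as breaking); real schema tooling may define compatibility differently.

```python
def breaking_changes(old, new):
    """List changes in `new` that would break readers written against `old`."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(
                f"type changed: {name} {spec['type']} -> {new[name]['type']}"
            )
    for name, spec in new.items():
        # Records written before the change would not contain this field.
        if name not in old and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems


v1 = {
    "order_id": {"type": "string", "required": True},
    "amount": {"type": "int", "required": True},
}
# Adding an optional field is treated as compatible.
v2 = dict(v1, coupon={"type": "string", "required": False})
# Changing a type is treated as breaking.
v3 = dict(v1, amount={"type": "string", "required": True})

print(breaking_changes(v1, v2))  # []
print(breaking_changes(v1, v3))  # ['type changed: amount int -> string']
```

A non-empty result is a signal to issue a new version and to estimate the modification cost for users before rollout.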

Infrastructure Design: Storage and Networking

Once the data integration method has been decided, the next step is infrastructure design. Storage and networks are not merely components; they are the layers where non-functional requirements such as latency, availability, security, and cost become real-world constraints. Decisions made here greatly affect the operational burden and the difficulty of disaster recovery. This chapter summarizes the key points of storage design, centered on S3, and network design, including on-premises connections.

Storage design (primarily Amazon S3)

Amazon S3 is the core of data integration. It offers high durability and virtually unlimited capacity, lets S3 events trigger processing, and connects directly to AWS analytics services.
Partition and prefix design significantly affects operation and performance. Partitioning by system name and date (dt=YYYY-MM-DD) simplifies re-execution and extraction.
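
The partitioning idea above can be sketched as a small key builder. The zone/system=.../dt=... layout shown here is one hypothetical convention, and the system name is an example; the point is that a re-execution for a date range maps to a predictable list of prefixes.

```python
from datetime import date, timedelta


def partition_prefix(zone, system, day):
    """Build an S3 key prefix partitioned by system name and date."""
    return f"{zone}/system={system}/dt={day.isoformat()}/"


def prefixes_for_rerun(zone, system, start, end):
    """List the prefixes to reprocess for an inclusive date range."""
    days = (end - start).days
    return [
        partition_prefix(zone, system, start + timedelta(days=i))
        for i in range(days + 1)
    ]


for prefix in prefixes_for_rerun("raw", "orders", date(2024, 1, 30), date(2024, 2, 1)):
    print(prefix)
# raw/system=orders/dt=2024-01-30/
# raw/system=orders/dt=2024-01-31/
# raw/system=orders/dt=2024-02-01/
```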

For security, the baseline set is encryption (SSE-S3 or SSE-KMS), versioning, and access logging. Automate the lifecycle, from transitioning to S3 Standard-IA or S3 Glacier as access frequency drops to deletion when the retention period expires.
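
The lifecycle automation described above can be expressed as configuration data. The sketch below builds a rule as a plain Python dictionary in the shape that, to my understanding, the S3 lifecycle configuration API expects (for example via boto3's put_bucket_lifecycle_configuration); it does not call AWS. The prefix and day counts are hypothetical and should come from your retention requirements.

```python
import json


def lifecycle_rule(prefix, ia_after_days, glacier_after_days, expire_after_days):
    """Build one lifecycle rule: Standard-IA, then Glacier, then expiration."""
    if not (ia_after_days < glacier_after_days < expire_after_days):
        raise ValueError("days must increase: IA < Glacier < expiration")
    return {
        "ID": f"tiering-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


# Hypothetical retention: IA after 30 days, Glacier after 90, delete after 365.
config = {"Rules": [lifecycle_rule("raw/", 30, 90, 365)]}
print(json.dumps(config, indent=2))
```

Keeping the rule in code or a template, rather than setting it by hand in the console, makes the retention decision reviewable together with the rest of the design.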

Network design (on-premises integration)

For on-premises connections, you can choose between AWS Site-to-Site VPN (quick to get started but dependent on internet quality) and AWS Direct Connect (stable bandwidth but higher initial cost).
For both options, design routing, path control, name resolution, and reachability monitoring as a set. Also confirm the requirements for encryption in transit, and keep application-layer encryption and TLS termination points consistent.

Because data transfer across Availability Zones, across Regions, and to the internet tends to be costly, visualize the data flow and make the billing points explicit.

Three standard patterns for AWS data integration

While data integration on AWS may look like a combination of individual services, it actually converges on a few representative configuration patterns. Understanding these patterns removes the need to design from scratch and lets you focus on where your requirements differ from the pattern. This chapter organizes the design considerations for three classic patterns: file transfer, database integration, and event/stream integration.

File transfer (S3 + transfer)

This is the most straightforward method for transferring files using S3 as the destination. It allows for easy tracking of history and is robust for auditing and reprocessing.

For the transfer method, you can choose from AWS DataSync (on-premises synchronization), AWS Transfer Family (SFTP), or AWS Storage Gateway (on-premises gateway).
Detect arrival with S3 events, and start processing only after a completion marker file confirms that all files have arrived and the consistency checks have passed.
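
The "wait for the marker, then verify" rule can be sketched as a pure function. This assumes the sender provides a list of expected file names and writes a marker object last; the marker name _SUCCESS, the prefix, and the file names are hypothetical.

```python
def ready_to_process(arrived_keys, prefix, expected_files, marker="_SUCCESS"):
    """Return (ready, problems) for one delivery under `prefix`."""
    arrived = set(arrived_keys)
    if prefix + marker not in arrived:
        return False, ["marker not yet arrived"]
    # The marker exists, so every expected file must already be present.
    missing = sorted(name for name in expected_files if prefix + name not in arrived)
    return (not missing), missing


prefix = "raw/system=orders/dt=2024-02-01/"
expected = ["part-1.csv", "part-2.csv"]

partial = [prefix + "part-1.csv"]
print(ready_to_process(partial, prefix, expected))
# (False, ['marker not yet arrived'])

complete = [prefix + "part-1.csv", prefix + "part-2.csv", prefix + "_SUCCESS"]
print(ready_to_process(complete, prefix, expected))
# (True, [])
```

In practice a function like this would run in whatever is triggered by the S3 event for the marker object, so that data files arriving one by one do not start processing early.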

DB integration (CDC/replication)

Use AWS DMS to capture CDC data and synchronize it in near real time. Key design considerations are schema change tracking, consistency, and the load on the source database.
For analytical purposes, it is easier to scale if you store the change history in S3 and access it from a data warehouse. Define in advance the procedure for switching from initial load to continuous synchronization, the operational rules for schema changes, and the metrics for detecting missing data.
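
To illustrate why storing the change history is useful, the sketch below rebuilds the current table state from an initial snapshot plus an ordered list of change records. The record shape and the I/U/D operation codes are assumptions made for this example; check the actual output format of your CDC tool before relying on them.

```python
def apply_changes(snapshot, changes, key="id"):
    """Replay ordered CDC records on top of an initial load."""
    state = {row[key]: dict(row) for row in snapshot}
    for change in changes:
        op, row = change["op"], change["row"]
        if op in ("I", "U"):
            # Treat insert and update as an upsert so a replay is idempotent.
            state[row[key]] = dict(row)
        elif op == "D":
            state.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


snapshot = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
changes = [
    {"op": "U", "row": {"id": 1, "status": "shipped"}},
    {"op": "I", "row": {"id": 3, "status": "new"}},
    {"op": "D", "row": {"id": 2}},
]
print(apply_changes(snapshot, changes))
# {1: {'id': 1, 'status': 'shipped'}, 3: {'id': 3, 'status': 'new'}}
```

Comparing the row count of the replayed state with a count taken on the source is one simple candidate for a missing-data metric.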

Event stream integration (queue/stream)

This is a loosely coupled, asynchronous integration method. Assign roles to the services: Amazon SQS (processing buffer), Amazon SNS (fan-out to multiple destinations), Amazon EventBridge (routing), and Amazon Kinesis/Amazon MSK (high throughput and ordering).

The core of the design is retries, dead-letter queues (DLQs), idempotency, and event schema management. Assume duplicates will occur, and design the receiver so that handling the same event multiple times does not corrupt the results.
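
The interplay of retries, a DLQ, and an idempotent receiver can be sketched with an in-memory queue. This is a simplified model, not the behavior of Amazon SQS itself (where a redrive policy with a maximum receive count plays the role of max_attempts); the message shape and handler are hypothetical.

```python
from collections import deque


def drain(messages, handler, max_attempts=3):
    """Process messages, retrying failures and returning the dead letters."""
    dead_letters = []
    pending = deque((message, 0) for message in messages)
    while pending:
        message, attempts = pending.popleft()
        try:
            handler(message)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Stop retrying and keep the message for investigation.
                dead_letters.append(
                    {"message": message, "error": str(exc), "attempts": attempts}
                )
            else:
                pending.append((message, attempts))
    return dead_letters


processed = {}


def handler(message):
    if message["body"] is None:
        raise ValueError("empty body")
    # Keyed write: handling the same message twice leaves the same result.
    processed[message["id"]] = message["body"]


messages = [
    {"id": "m1", "body": "a"},
    {"id": "m2", "body": None},  # always fails, ends up in the DLQ
    {"id": "m1", "body": "a"},   # duplicate delivery
]
dead = drain(messages, handler)
print(processed)               # {'m1': 'a'}
print(len(dead), dead[0]["attempts"])  # 1 3
```

The poison message is isolated after three attempts instead of blocking the queue, and the duplicate of m1 does not change the stored result.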

Design Perspectives for AWS-Wide Integration and Hybrid Integration

The patterns described so far are powerful configurations that stay entirely within AWS. In actual business operations, however, connections to on-premises systems, other clouds, and SaaS are often required, and network constraints, protocol differences, and audit requirements make the design considerably more complex.

The key in this process is not to consolidate everything on the AWS side, but to clearly define where the responsibility for integration lies. One approach is to handle data storage and analysis on AWS, while separating integration layers such as delivery control, re-execution management, and protocol conversion onto a separate platform.
For example, when using an iPaaS integration platform like HULFT Square, one option is to keep AWS simple as a data utilization platform while unifying the handling of connections with external systems and reprocessing control.

In actual design, the line drawn between "how much to implement using AWS services" and "whether to leave it to the integration platform" significantly impacts operational burden, change tolerance, and ease of auditing. This is especially true in environments with multiple systems and protocols; building the system without clearly defined roles can easily lead to operational complexity later on.

What's important isn't the service name but the design philosophy: which responsibilities AWS will handle and where the responsibility of the integration platform begins.

iPaaS-based data integration platform HULFT Square

HULFT Square is a Japanese iPaaS (cloud-based data integration platform) that supports "data preparation for data utilization" and "data integration that connects business systems." It enables smooth data integration between a wide variety of systems, including various cloud services and on-premise systems.

Data Lake Design and Governance

Since the ultimate goal of data integration is utilization, designing the data lake, which is the endpoint, is unavoidable. However, simply consolidating data into S3 is not enough; it becomes a "usable platform" only when zone design, metadata management, access control, and audit trail management are all included. This chapter organizes the design principles and governance concepts that prevent the data lake from becoming a "swamp."

Data Lake and Zone Design

A data lake aggregates diverse data, enabling analysis, machine learning, and cross-sectional use, but if data is placed haphazardly, it becomes a "swamp."
Zone design is key. Dividing the lake into stages of raw (record of receipt, immutable), cleansed (after quality checks), and curated (aggregated for the intended use) lets you start over from the raw stage if a problem occurs.
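
The value of an immutable raw zone is that the later zones are pure functions of it. The sketch below models the three zones in memory; the record fields and cleansing rules are hypothetical, and in a real lake each step would read from and write to its own S3 prefix.

```python
def cleanse(raw_records):
    """Raw -> cleansed: normalize values and drop records that fail checks."""
    cleansed = []
    for record in raw_records:
        try:
            cleansed.append(
                {"store": record["store"].strip().lower(),
                 "amount": int(record["amount"])}
            )
        except (KeyError, ValueError, AttributeError, TypeError):
            continue  # the rejected record still exists untouched in raw
    return cleansed


def curate(cleansed_records):
    """Cleansed -> curated: aggregate for the intended use."""
    totals = {}
    for record in cleansed_records:
        totals[record["store"]] = totals.get(record["store"], 0) + record["amount"]
    return totals


# The raw zone is kept as received; a tuple stands in for immutability here.
RAW = (
    {"store": " Tokyo ", "amount": "100"},
    {"store": "tokyo", "amount": "oops"},  # fails the quality check
    {"store": "Osaka", "amount": "50"},
)

print(curate(cleanse(RAW)))  # {'tokyo': 100, 'osaka': 50}
```

If a cleansing rule turns out to be wrong, fixing the function and rerunning it over the raw zone regenerates both downstream zones.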

Catalog and permission management

Manage metadata in AWS Glue Data Catalog so that Amazon Athena, AWS Glue, and Amazon Redshift Spectrum can all refer to the same definition. For raw data, automatic detection is practical, while for curated data, a fixed definition is more practical.

AWS Lake Formation lets you design table-, column-, and row-level permission controls and share data securely across accounts. Use AWS CloudTrail for the audit trail, keep permissions to the minimum necessary and time-limited, and review them regularly.
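
The combination of column-level and time-limited permissions can be reasoned about with a small model. The sketch below is a conceptual illustration only and does not use or mirror the AWS Lake Formation API; the Grant class, principals, and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Grant:
    principal: str
    table: str
    columns: frozenset
    expires_at: datetime


def visible_columns(grants, principal, table, now):
    """Columns the principal may read on the table at time `now`."""
    allowed = set()
    for grant in grants:
        if grant.principal == principal and grant.table == table and now < grant.expires_at:
            allowed |= grant.columns
    return allowed


def expired(grants, now):
    """Grants to revoke or re-approve during a periodic review."""
    return [grant for grant in grants if grant.expires_at <= now]


utc = timezone.utc
grants = [
    Grant("analytics", "orders", frozenset({"order_id", "amount"}),
          datetime(2024, 7, 1, tzinfo=utc)),
    Grant("analytics", "orders", frozenset({"email"}),
          datetime(2024, 5, 1, tzinfo=utc)),  # temporary access, now lapsed
]
now = datetime(2024, 6, 1, tzinfo=utc)

print(sorted(visible_columns(grants, "analytics", "orders", now)))
# ['amount', 'order_id']
print(len(expired(grants, now)))  # 1
```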

Account separation and operational design

Account separation is designed to improve not only security but also operational and cost transparency.
A typical approach is a combination of environment separation (development/testing/production) and responsibility separation (infrastructure/workload). The data lake is placed in a dedicated account, and user departments access only the necessary data across accounts.

In terms of operation, first determine how to detect delays, missing data, duplicates, and format errors, and clearly define who is notified and how far automatic recovery goes upon detection. Recovery is faster if retries use exponential backoff, a DLQ (dead-letter queue) is configured, and re-executions are performed per time partition.
Identify cost-increasing factors such as transfer fees, S3 requests, and stream throughput, and set alert thresholds.
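
The retry schedule mentioned above can be sketched as follows. This uses exponential backoff with "full jitter" (a random delay between zero and the exponential ceiling), which is one common variant; the base, cap, and attempt count are hypothetical values to tune per workload.

```python
import random


def backoff_ceilings(attempts, base=1.0, cap=60.0):
    """Upper bound of the wait before each retry: base * 2^n, limited by cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Randomized waits (full jitter) so that retries do not synchronize."""
    rng = rng or random.Random()
    return [rng.uniform(0, ceiling) for ceiling in backoff_ceilings(attempts, base, cap)]


print(backoff_ceilings(5, base=1.0, cap=10.0))  # [1.0, 2.0, 4.0, 8.0, 10.0]

# The actual delays vary per run; each stays at or below its ceiling.
for delay in backoff_delays(5, base=1.0, cap=10.0):
    print(f"wait {delay:.2f}s before the next retry")
```

The cap bounds the worst-case wait, which matters when the time available for reprocessing is itself one of the success conditions.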

Summary

AWS data integration design is less likely to go off course if you proceed in the following flow: requirements definition → method selection → destination determination → operation design. Instead of starting from a service name, it is more efficient to apply patterns while keeping a record of the rationale for each decision.
As a next step, run a small proof of concept (PoC) with just one target dataset. Verifying not only normal operation but also the detection and recovery of delays and data loss, how costs grow, and the procedure for granting permissions will significantly reduce operational risk after the move to production.

About the author

Affiliation: Marketing Department

Yoko Tsushima

After joining Appresso (now Saison Technology), Tsushima worked as a technical sales representative, in charge of technical sales, training, and technical events. After leaving the company to return to their hometown, Tsushima rejoined in April 2023 under the remote work system. Following a period in the product planning department, Tsushima is currently in charge of creating digital content in the marketing department.
(Affiliations are as of the time of publication)