AWS S3 // Setting up two-way multi-region replication

Arnaud
7 min read · Oct 3, 2019

A few years ago, we developed our first API product based on AWS services. The main role of this API is to store and retrieve files from… wait for it… AWS S3, one of the first and key services of Amazon Web Services!

This famous component, with eleven 9’s of durability and 99.99% availability (Standard class), can nevertheless suffer outages, like the one in February 2017, as you can read here, here or here (… and on many other websites).

After this event, we decided to set up a disaster recovery plan for our application, by implementing a multi-region architecture and improving resilience.

“Let’s make it region-fail-proof!”

We then worked on how it could be done, component by component, based on active-active principles: we wanted to offer the service globally, with the lowest possible latency.

Route 53? Native. The service is designed for this; you just need to set up latency-based routing rules.
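
Latency-based routing can also be described in Terraform. Here is a minimal, hypothetical sketch (the domain name, variables and endpoints are placeholders, not part of our actual setup):

variable "hosted_zone_id" {}
variable "eu_endpoint" {}
variable "ap_endpoint" {}

# One record per region; Route 53 answers with the lowest-latency one
resource "aws_route53_record" "api-eu" {
  zone_id        = "${var.hosted_zone_id}"
  name           = "api.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["${var.eu_endpoint}"]
  set_identifier = "eu-west-1"

  latency_routing_policy {
    region = "eu-west-1"
  }
}

resource "aws_route53_record" "api-ap" {
  zone_id        = "${var.hosted_zone_id}"
  name           = "api.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["${var.ap_endpoint}"]
  set_identifier = "ap-southeast-1"

  latency_routing_policy {
    region = "ap-southeast-1"
  }
}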

AWS EC2 components? No problem! Machines can also be built in another region.

AWS DynamoDB tables? Hmm… We need to rebuild them as Global Tables, but that is fine, we can do it.

AWS S3? Well… Replication does exist and works fine, but… can we replicate in both directions without falling into the replication-loop-of-the-death?

Will it create a “replication-loop-of-the-death”?

Good question! If you post a document to Bucket A (in region 1), it will be replicated to Bucket B (in region 2). But then? Will it be replicated back to Bucket A, then to Bucket B again, and so on, in an infinite loop?

This is the starting point of the more technical part: how to create two-way replication rules between two S3 buckets, and what the limits of this setup are.

Let’s start our journey!

First of all, having two buckets replicating to each other will not drag you into the replication-loop-of-the-death: S3 does not re-replicate objects that were themselves created by replication, so a replicated object stops there.

How did I figure it out? Well… I can be an adventurer, or some would say crazy, so… I basically tested it!

And guess what? The replication is working like a charm, exactly as expected!

Setting it up through the console is quite straightforward (just remember that replication requires versioning to be enabled on both buckets).

First, set up replication from bucket A to bucket B

Replication setting from bucket A (Ireland) to bucket B (Singapore)

… then set up replication from bucket B to bucket A

Replication setting from bucket B (Singapore) to bucket A (Ireland)

I told you it was easy, I did not lie!

So what? End of story?

Nope. This was a one-shot creation. How can I now industrialize it, following best practices?

The second step of the journey brings me to Terraform to build the foundations.

It is a best practice in cloud architecture to define your infrastructure as code, so that you can reproduce it for any kind of environment (development, qualification, integration, production, etc.), or rebuild everything after a massive crash (fingers crossed).

Terraform is one of the best tools to do that nowadays.

So, the Terraform files I first wrote (split by service) created:

  • Two buckets in different regions (using multiple AWS provider configurations to do so),
  • One role and one policy to be used for the A-to-B replication,
  • One role and one policy to be used for the B-to-A replication,
  • One replication rule from Bucket A to Bucket B.

(don’t worry, the full source code is open source, and I share the repo link later in this article; a simplified sketch of this first version follows)
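
To give an idea of what that first version looks like, here is a simplified, hypothetical sketch (bucket names are made up, the IAM policy granting the replication permissions is omitted for brevity, and the actual resource names in the repository may differ):

provider "aws" {
  alias  = "ireland"
  region = "eu-west-1"
}

provider "aws" {
  alias  = "singapore"
  region = "ap-southeast-1"
}

# Role that S3 will assume to replicate objects from A to B
# (the policy with s3:ReplicateObject, s3:GetReplicationConfiguration, etc. is omitted here)
resource "aws_iam_role" "replication-way1" {
  name = "s3-replication-a-to-b"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "s3.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
POLICY
}

# Destination bucket (B), in Singapore, versioning enabled (required for replication)
resource "aws_s3_bucket" "s3-2" {
  provider = "aws.singapore"
  bucket   = "my-bucket-b-example"

  versioning {
    enabled = true
  }
}

# Source bucket (A), in Ireland, with the native A-to-B replication rule
resource "aws_s3_bucket" "s3-1" {
  provider = "aws.ireland"
  bucket   = "my-bucket-a-example"

  versioning {
    enabled = true
  }

  replication_configuration {
    role = "${aws_iam_role.replication-way1.arn}"

    rules {
      id     = "a-to-b"
      prefix = ""
      status = "Enabled"

      destination {
        bucket        = "${aws_s3_bucket.s3-2.arn}"
        storage_class = "STANDARD"
      }
    }
  }
}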

Until that point, everything went well. But then came the time to add the replication in the other direction, from B to A.

I first tried to copy/paste the replication part, adjusting the role and policy names, but basically I was asking Terraform: “please create resource A based on resource B’s ARN, and create resource B based on resource A’s ARN”. Like the famous chicken-and-egg paradox… That led me to a magnificent cycle error:

Error configuring: 1 error(s) occurred:
* Cycle: aws_s3_bucket.s3-2, aws_s3_bucket.s3-1

Well… That makes sense, but I needed a way to work around it. I didn’t want to encapsulate my Terraform call inside a script. I wanted the Terraform scripts to be fully autonomous.

Step three of this journey was then to introduce into my Terraform files the concepts of provisioner and null_resource (from the null provider).

The provisioner, part of the first S3 bucket creation script, creates a local temporary JSON file that includes the bucket’s ARN (to be applied later to Bucket B) amongst the other replication rule attributes:

provisioner "local-exec" {command = <<CMD
echo '{"Role": "${aws_iam_role.replication-way2.arn}","Rules":[{"DeleteMarkerReplication": { "Status": "Disabled" },"Status":"Enabled","Priority":1,"Filter":{"Prefix":""},"Destination":{"Bucket":"${self.arn}"}}]}' > ${var.tempFilename}
CMD
}

This JSON file will be used to create the replication rule through the s3api.

Here comes the null_resource, from the null provider. This resource does not create any real infrastructure; it only simulates a resource, which lets you attach provisioners and dependencies to it.

So, in my provider declaration file, I added:

provider "null" {}

… and then, in the Terraform script, I have this resource declaration:

resource "null_resource" "s3bucket" {  depends_on = ["aws_s3_bucket.s3-2","aws_s3_bucket.s3-1"]  provisioner "local-exec" {
command = "sleep 5 && aws s3api put-bucket-replication --bucket=${aws_s3_bucket.s3-2.bucket} --replication-configuration=file://${var.tempFilename}"
}
}

This code waits for both S3 buckets to be created (the depends_on keyword) and then runs locally the AWS CLI command that sets the replication rule on Bucket B (the “aws s3api put-bucket-replication” command)!

And to clean up all of this, I also added another null_resource resource:

resource "null_resource" "cleanup" {
depends_on = ["null_resource.s3bucket"]
provisioner "local-exec" {
command = "rm ${var.tempFilename}"
}
}

And … voilà!

Our buckets and their replication rules are set up!

The fourth step was to test this architecture and identify its possible limits.

In our use case, even if files took a few seconds to replicate, it was acceptable. But I can easily imagine cases where this normal latency would be a real issue.

So, I designed a quick test to be run in an AWS Lambda function:

  1. Post a file to bucket A and note the timestamp when it’s done
  2. Try to get the file (HEAD only, to save bandwidth) from bucket B
  3. Repeat quickly until you find it and note the timestamp when found… or give up after a certain number of attempts
  4. Calculate the time elapsed between the POST and the GET
  5. Repeat with files of different sizes.

The Lambda code is written in Node.js 10.x. It is part of the repository too, and it is added to your infrastructure by the same Terraform script.

This part of the script is more basic: it creates a role and a policy for the function, uploads the function code, and passes the buckets’ ARNs to it through environment variables.
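
For illustration, here is a simplified, hypothetical sketch of that Lambda part (the function name, handler, zip filename and environment variable names are assumptions; the repository may use different ones, and the policy attachment granting S3 access is omitted):

# Execution role for the test function
resource "aws_iam_role" "lambda-test" {
  name = "s3-replication-test-lambda"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "lambda.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
POLICY
}

# The Node.js test function, fed with both bucket names through environment variables
resource "aws_lambda_function" "replication-test" {
  function_name = "s3-replication-latency-test"
  filename      = "lambda.zip"
  handler       = "index.handler"
  runtime       = "nodejs10.x"
  role          = "${aws_iam_role.lambda-test.arn}"
  timeout       = 300

  environment {
    variables = {
      BUCKET_A = "${aws_s3_bucket.s3-1.bucket}"
      BUCKET_B = "${aws_s3_bucket.s3-2.bucket}"
    }
  }
}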

Note: the tests could be more relevant and handle larger files, but the way I designed them is not appropriate for files larger than 8 or 16 MB. Feel free to improve them or run your own… and share back the results!

The last step of our journey is to analyze the test results.

Here is the kind of result I had, for a replication set between Ireland (eu-west-1) and Singapore (ap-southeast-1):

{
  "AtoB": [
    {
      "file_size_in_bytes": 131072,
      "post_date": "2019-10-02T13:49:41.033Z",
      "get_date": "2019-10-02T13:49:45.292Z",
      "duration_in_ms": 4259
    },
    {
      "file_size_in_bytes": 262144,
      "post_date": "2019-10-02T13:49:45.656Z",
      "get_date": "2019-10-02T13:49:48.121Z",
      "duration_in_ms": 2465
    }
  ],
  "BtoA": [
    {
      "file_size_in_bytes": 131072,
      "post_date": "2019-10-02T13:50:15.790Z",
      "get_date": "2019-10-02T13:50:34.142Z",
      "duration_in_ms": 18352
    },
    {
      "file_size_in_bytes": 262144,
      "post_date": "2019-10-02T13:50:35.715Z",
      "get_date": "2019-10-02T13:50:59.042Z",
      "duration_in_ms": 23327
    }
  ]
}

This is only an extract of the actual results. The tests I ran used six files of various sizes (from 128 KB to 4 MB) and showed an average replication time of 4.7 seconds from Ireland to Singapore…

… while the other direction showed an average replication time of 11.1 seconds.

Replication time chart (duration in milliseconds)

As you can see on the diagram, latency does not seem directly linked to file size (at least for the tested sizes) and is much higher from Singapore to Ireland than the other way round.

It is not a problem for our project, but it is worth keeping in mind for other use cases!

TL;DR

Setting up two-way replication between S3 buckets can be done with Terraform scripts. The source code of my investigations can be found on GitLab: https://gitlab.com/arnaduga/s3_sync_tests

Keep in mind that replication can take several seconds. Is that acceptable for your use case? If not… maybe you should consider an active/passive architecture instead.

The source code is not perfect, so feel free to improve it and share back your contribution!

Enjoy!

Arnaud

Freelance AWS Solution Architect and API enthusiast