This script facilitates bidirectional synchronization between a local file
system path and an Amazon S3 storage bucket by wrapping the (unidirectional)
aws s3 sync CLI. It is functionally similar to the following, with added
logic for event triggering, distributed use, and automated Amazon CloudFront
edge cache invalidations:
aws s3 sync <LocalPath> <S3Uri>
aws s3 sync <S3Uri> <LocalPath>
This script was created to solve the problem of utilizing S3 storage concurrently within clusters of ephemeral AWS ECS Fargate container instances with software that expects and depends on a traditional (persistent) file system for application data. Prior to writing this script, multiple alternatives were considered for this use case including:
- s3fs-fuse - A FUSE-based file system backed by Amazon S3. Unsuitable because it requires privileged mode, which is not supported by ECS Fargate containers. Additionally, in preliminary testing, performance was poor even in privileged mode and files were often corrupted on busy file systems.
- Peer-to-peer file synchronization tools - Syncthing and Resilio Sync. Unsuitable because they require at least one peer to be active at all times, and because of the complexity of provisioning/deprovisioning nodes within an often fast-changing containerized environment.
- Client/server file synchronization tools - Nextcloud and ownCloud. Unsuitable because they depend on maintaining a dedicated server (which is what we're trying to get away from with AWS Fargate), as well as the complexity of registering/deregistering new clients within a fast-changing containerized environment.
- File hosting and collaboration services - Dropbox and Google Drive, for example. While these services could work, they are not designed with this use case in mind, the costs would be too high, and orchestrating the authorization/deauthorization of container instances could be challenging.
This script uses the following options and arguments. Some may also be set using environment variables (uppercase/underscored values). The script does not daemonize - it will run continually until terminated.
s3-sync2 [options] <LocalPath> <S3Uri>
<LocalPath> = LOCAL_PATH
<S3Uri> = S3_URI
[--cf-dist-id | -c] = CF_DISTRIBUTION_ID
[--cf-inval-paths] = CF_INVALIDATION_PATHS
[--debug] = DEBUG
[--dfs | -d] = DFS
[--dfs-lock-timeout | -t] = DFS_LOCK_TIMEOUT
[--dfs-lock-wait | -w] = DFS_LOCK_WAIT
[--init-sync-down | -i]
[--init-sync-up | -u]
[--max-failures | -x]
[--md5-skip-path | -s]
[--only-down]
[--only-up]
[--poll | -p] = POLL_INTERVAL
[--sync-opt-down-*]
[--sync-opt-up-*]
All standard aws s3 sync CLI options are supported, in addition to the
following s3-sync2-specific options. As with aws s3 sync, the only options
required by this script are <LocalPath> and <S3Uri> (which may be specified
interchangeably).
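For illustration, an invocation might look like the following (the bucket name, local path, and CloudFront distribution ID are placeholders); the second form shows the same configuration expressed through the environment variables listed above, assuming they can stand in for the corresponding arguments and options:

```sh
# Sync /srv/app-data with s3://mybucket/app-data every 30 seconds, pulling
# the remote copy down once at startup and invalidating the CloudFront edge
# cache when local changes are pushed up.
s3-sync2 --init-sync-down --poll 30 --cf-dist-id E2EXAMPLE123 \
    /srv/app-data s3://mybucket/app-data

# Equivalent configuration via environment variables (uppercase/underscored):
LOCAL_PATH=/srv/app-data \
S3_URI=s3://mybucket/app-data \
POLL_INTERVAL=30 \
CF_DISTRIBUTION_ID=E2EXAMPLE123 \
s3-sync2 --init-sync-down
```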
<LocalPath>
Local directory to synchronize - e.g. /path/to/local/dir
<S3Uri>
Remote S3 URI - e.g. s3://mybucket/remote/dir
--cf-dist-id | -c
ID of a CloudFront distribution to trigger edge cache invalidations on when
local changes occur.
--cf-inval-paths
Value for the aws cloudfront create-invalidation --paths argument. Default is
invalidation of all cached objects: /*
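When a distribution ID is supplied, the invalidation the script triggers is presumably equivalent to running something like the following AWS CLI command after local changes are synced up (the distribution ID is a placeholder, and the path list mirrors the /* default):

```sh
# Invalidate all cached objects for the distribution; a narrower list of
# paths can be passed via --cf-inval-paths instead of the /* default.
aws cloudfront create-invalidation \
    --distribution-id E2EXAMPLE123 \
    --paths "/*"
```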
--debug
Debug output level - one of ERROR (default), WARN, DEBUG or NONE
--dfs | -d
Run as a quasi distributed file system wherein multiple nodes can run this
script concurrently. When enabled, an additional distributed locking step is
required when synchronizing from <LocalPath> to <S3Uri>. To do so, S3's
read-after-write consistency model is leveraged in conjunction with an object
PUT operation where the object contains a unique identifier for the node
acquiring the lock (see the sketch following the lock options below). This
script has not been tested, nor is it recommended, for high IO/concurrency
distributed environments.
--dfs-lock-timeout | -t
the maximum time (secs) another node may hold a distributed lock before it is
considered stale and force released. Default is 60 (1 minute)
--dfs-lock-wait | -w
the maximum time (secs) to wait to acquire a distributed
lock before exiting with an error. Default is 180 (3 minutes)
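Purely as an illustration of the locking approach described under --dfs (not the script's actual implementation), acquiring and releasing a lock around an upward sync might look like the sketch below; the lock object key and the hostname-based node identifier are assumptions made for the example:

```sh
#!/usr/bin/env bash
# Illustrative sketch only - key name and node ID scheme are hypothetical.
BUCKET=mybucket
LOCK_KEY=remote/dir/.s3-sync2.lock
NODE_ID="$(hostname)-$$"

# Attempt to acquire the lock: write this node's ID to the lock object, then
# read it back and confirm it is ours (leveraging S3's read-after-write
# consistency in conjunction with the object PUT).
acquire_lock() {
  echo "$NODE_ID" | aws s3 cp - "s3://$BUCKET/$LOCK_KEY" &&
    [ "$(aws s3 cp "s3://$BUCKET/$LOCK_KEY" -)" = "$NODE_ID" ]
}

# Release the lock once the upward sync completes.
release_lock() {
  aws s3 rm "s3://$BUCKET/$LOCK_KEY"
}

if acquire_lock; then
  aws s3 sync /path/to/local/dir "s3://$BUCKET/remote/dir"
  release_lock
fi
```

Per the options above, acquisition in the real script is bounded by --dfs-lock-wait, and a lock held longer than --dfs-lock-timeout is treated as stale and force released.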
--init-sync-down | -i
if set, aws s3 sync <S3Uri> <LocalPath> will be invoked when the script starts
--init-sync-up | -u
if set, aws s3 sync <LocalPath> <S3Uri> will be invoked when the script starts
--max-failures | -x
max synchronization failures before exiting (0 for infinite). Default is 3
--md5-skip-path | -s
by default, every file in <LocalPath> is used to generate the md5 checksums
that determine when contents have changed. The script does not translate
--include/--exclude sync arguments to local file paths. Use this option to
alter that behavior by specifying one or more paths in <LocalPath> to exclude
from the checksum. Do not repeat this option - if multiple paths should be
excluded, use pipes (|) to separate each. Each designated path should be a
child of <LocalPath>. Only directories may be specified and they should not
include the trailing slash
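For a sense of the change detection this option influences (an approximation, not the script's exact commands), a combined checksum of <LocalPath> that skips two excluded subdirectories could be computed along these lines; the cache and tmp directory names are placeholders:

```sh
# Rough equivalent of --md5-skip-path "cache|tmp": checksum every file under
# the local path except those two subtrees, then reduce the sorted list of
# per-file checksums to a single fingerprint. A fingerprint that differs
# between polls indicates local changes.
find /path/to/local/dir \
    -path /path/to/local/dir/cache -prune -o \
    -path /path/to/local/dir/tmp -prune -o \
    -type f -exec md5sum {} + | sort -k 2 | md5sum
```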
--only-down
Only synchronize from <S3Uri> to <LocalPath>
--only-up
Only synchronize from <LocalPath> to <S3Uri>
--poll | -p
frequency in seconds to check for both local and remote changes and trigger
the necessary synchronization - default is 30. Must be between 0 and 3600.
If 0, the script will exit immediately after option validation and initial
synchronization
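Based on the behavior described above, a one-shot run that simply seeds a fresh container and exits could presumably be expressed as follows (paths are placeholders):

```sh
# Pull the remote copy down once, perform no further polling, then exit.
s3-sync2 --init-sync-down --only-down --poll 0 /srv/app-data s3://mybucket/app-data
```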
--sync-opt-down-*
an aws s3 sync option that should only be applied when syncing down from
<S3Uri> to <LocalPath>. For example, to apply the --delete flag only in this
direction, set this option as --sync-opt-down-delete
--sync-opt-up-*
same as above, but for syncing up from <LocalPath> to <S3Uri>
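To illustrate how these directional options map onto the underlying CLI (paths and bucket are placeholders), an invocation like the first command below would presumably translate into the two sync commands that follow, with --delete applied only on the downward pass:

```sh
s3-sync2 --sync-opt-down-delete /srv/app-data s3://mybucket/app-data

# Downward sync, when remote changes are detected:
aws s3 sync s3://mybucket/app-data /srv/app-data --delete

# Upward sync, when local changes are detected:
aws s3 sync /srv/app-data s3://mybucket/app-data
```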
To avoid bloated container images and complex setup/configurations, this script intentionally utilizes minimal dependencies.
- AWS Command Line Interface - this script uses aws s3 and aws cloudfront (cloudfront is only used if the associated options are set). The AWS CLI must be installed and supplied with the AWS IAM credentials/permissions required by these commands and the corresponding S3 storage bucket.
- md5sum | md5 - this script uses md5sum (or md5) to determine when <LocalPath> changes have occurred.
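As a quick pre-flight check in a container image (illustrative only), the presence of both dependencies and working AWS credentials can be confirmed with standard commands:

```sh
# Verify the AWS CLI is installed and its credentials resolve to an identity.
command -v aws && aws sts get-caller-identity
# Verify an MD5 utility is available (md5sum on Linux, md5 on macOS/BSD).
command -v md5sum || command -v md5
```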