Using Cloud Storage
Introduction
Using some form of cloud storage has become a common use case, with services provided by external parties such as Amazon S3, Google Cloud Storage or Azure Storage.
Letting our command-line tools read from cloud storage is as simple as providing the URLs to the content as input. Authentication can be added as outlined in Authenticate requests to AWS S3.
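For example, Unified Packager can read its source directly from a bucket over HTTPS. A minimal sketch, assuming a placeholder bucket and object that are either publicly readable or covered by the authentication set up above:
#!/bin/bash
# Read the source MP4 straight from an S3 bucket over HTTPS
mp4split -o test.ismv https://mybucket.s3.amazonaws.com/test.mp4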
However, writing to cloud storage in an optimal way is a little more complex. This guide covers writing directly to AWS S3, Google Cloud Storage and Azure Blob Storage. The approach described here pipes the output of our command-line tools to the native tools provided by these platforms, which guarantees the best interoperability and brings a number of other benefits:
Orthogonal across cloud vendors (it works the same everywhere)
Agnostic to the chosen architecture (it works anywhere, from on-prem hardware to cloud to hybrid/multi-cloud)
Streaming: no arbitrary file size limits, as you would hit when for instance using a single PUT
Scalable: processing entities can be added or removed based on load
Separation of concerns: each cloud vendor supports their own tooling (and can be asked about it)
Failsafe: error situations can be detected and handled
Standard practice: no need to implement any proprietary protocols ourselves
Amazon S3 (Simple Storage Service)
To write the output of mp4split directly to S3 we use the AWS CLI.
Linux distributions typically have this available as a prebuilt package, although these are often outdated versions. Updating to the latest recommended AWS CLI version 2 is relatively simple, as shown on the installation page.
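As a sketch, on a Linux x86_64 host the update boils down to downloading and running the official installer (the download URL is the one published by AWS; adjust for other platforms):
#!/bin/bash
# Download and install AWS CLI version 2 (Linux x86_64 build)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip
unzip awscliv2.zip
sudo ./aws/install
aws --version    # verify the installation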
To manipulate resources on S3, you use the aws s3 sub-command, which in turn has many sub-commands. To download or upload files, you use the aws s3 cp sub-command.
Why we use Amazon S3 Multipart Uploads
We prefer to upload in "streaming" mode: our tools produce chunks of data, and these chunks are uploaded as they become available. This means that no temporary files are required, and data that has already been uploaded can be discarded on the client side.
Amazon S3 supports this use case through their so-called Multipart Upload method. This is a proprietary protocol designed by Amazon, broadly consisting of three steps (sketched below using the low-level aws s3api sub-commands):
Initiating the multipart upload, via a special HTTP POST request.
Uploading the actual data in multiple parts, via separate HTTP PUT requests.
Completing the multipart upload, via another special HTTP POST request.
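The aws s3 cp command used in the next section performs all of these steps automatically. Purely for illustration, they roughly map onto the following aws s3api sub-commands; the bucket, key, upload ID and part files are placeholders:
#!/bin/bash
# 1. Initiate the multipart upload; the response contains an UploadId
aws s3api create-multipart-upload --bucket mybucket --key test.ismv
# 2. Upload each part separately, noting the ETag returned for every part
aws s3api upload-part --bucket mybucket --key test.ismv \
    --part-number 1 --upload-id "$UPLOAD_ID" --body part1
# 3. Complete the upload by sending the list of part numbers and ETags
aws s3api complete-multipart-upload --bucket mybucket --key test.ismv \
    --upload-id "$UPLOAD_ID" --multipart-upload file://parts.json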
How to write directly to Amazon S3
To accept input from another program, the aws s3 cp command accepts - as the local file argument, to signify the standard input. Our CLI tools in turn accept stdout: as their output, optionally followed by a file extension, to write their output on the standard output.
For example, to convert an MP4 file "test.mp4" to fragmented MP4 using Unified Packager, and simultaneously upload it to an S3 bucket named "mybucket", you run:
#!/bin/bash
mp4split -o stdout:.ismv test.mp4 | aws s3 cp - s3://mybucket/test.ismv
Unified Packager will write its output in 4MB chunks, and the AWS CLI will read as much as it needs to perform its multipart uploads, if necessary. (By default, the AWS CLI S3 sub-command uses 8MB chunks.)
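If needed, this chunk size can be tuned via the AWS CLI's S3 configuration; the 16MB value below is just an illustration:
#!/bin/bash
# Raise the multipart chunk size used by the aws s3 sub-commands
aws configure set default.s3.multipart_chunksize 16MB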
Error handling while writing to Amazon S3
Both our CLI tools and the AWS CLI can encounter errors during their operation. For example, our tools could find a problem in the input file(s) and be unable to continue processing, or the AWS CLI could encounter a connection failure.
Typically, if our CLI tools encounter a fatal error, they print an error message and exit with a non-zero exit code. The AWS CLI will only notice that its standard input reaches end-of-file (EOF), and cannot distinguish between a successful and a failed run. The uploaded file will therefore likely be cut short and should not be used.
When using shell pipelines to connect commands, the return status of the whole pipeline is normally the exit status of the last command. Therefore, when running mp4split | aws s3 cp, only a failure of the AWS CLI can be detected.
As this is a fundamental design problem in shell pipelines, most shells, such as bash and zsh, offer a pipefail option. If this option is enabled, the pipeline's return status becomes the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully.
We can use this to detect failures during uploads and, in case of such a failure, remove the partially uploaded file. For example:
#!/bin/bash
set -o pipefail; \
mp4split -o stdout:.ismv test.mp4 | \
aws s3 cp - s3://mybucket/test.ismv || \
aws s3 rm s3://mybucket/test.ismv
In this example, if either mp4split throws an error while processing "test.mp4", or aws s3 cp throws an error while uploading the result, the aws s3 rm command after the || will be run, deleting the partial output on S3.
Another possible approach is to use the bash-specific internal array variable PIPESTATUS, which contains the exit status values from the processes in the most-recently-executed foreground pipeline. For example:
#!/bin/bash
mp4split -o stdout:.ismv test.mp4 | aws s3 cp - s3://mybucket/test.ismv
if test ${PIPESTATUS[0]} != 0 -o ${PIPESTATUS[1]} != 0; then
aws s3 rm s3://mybucket/test.ismv
fi
Azure Blob Storage
Microsoft's Azure Storage has multiple features, one of which is Azure Blob Storage. Although this service is comparable to Amazon S3 and Google Cloud Storage, it is slightly different in that it distinguishes between several types of 'blobs', namely:
Block blobs: similar to S3 objects, being individual 'files'.
Append blobs: optimized for append operations, such as logs.
Page blobs: optimized for random read/write access, such as virtual machine disks, or database engines.
For our use cases, only the 'block blobs' are relevant and we will write to them using the AzCopy tool, azcopy. This is written in Go and distributed as a single executable file that can be downloaded from the AzCopy download page. Put the executable in any directory in your PATH, and it is ready to run.
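As a sketch, on a Linux x86_64 host the installation could look as follows (the download link is Microsoft's redirect for the Linux x64 build and the install location is an assumption; adjust both for your platform):
#!/bin/bash
# Download the Linux x64 build of AzCopy and put it on the PATH
curl -L https://aka.ms/downloadazcopy-v10-linux -o azcopy.tar.gz
tar -xzf azcopy.tar.gz --strip-components=1
sudo mv azcopy /usr/local/bin/
azcopy --version    # verify the installation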
After installing AzCopy and logging into Azure (using azcopy login), the azcopy tool can be used to list, download, upload and otherwise manage Azure Storage blobs.
Note
When running azcopy login, make sure to use the --tenant-id option, otherwise it might choose the wrong tenant and show an incomprehensible (and non-actionable) error message. See the Azure documentation on how to find the correct tenant ID.
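For example, with the tenant ID as a placeholder:
azcopy login --tenant-id "<your-tenant-id>"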
How to write directly to Azure Blob Storage
To accept input from another program and upload it to Azure Blob Storage, the azcopy cp command uses the --from-to PipeBlob option. Our CLI tools in turn accept stdout: as their output, optionally followed by a file extension, to write their output on the standard output.
For example, to convert an MP4 file "test.mp4" to fragmented MP4 using Unified Packager, and simultaneously upload it to an Azure Storage container named "mycontainer", under storage account "myaccount", you run:
#!/bin/bash
mp4split -o stdout:.ismv test.mp4 | azcopy cp --from-to PipeBlob \
https://myaccount.blob.core.windows.net/mycontainer/test.ismv
Unified Packager will write its output in 4MB chunks, and azcopy will upload the data to Azure Storage.
Error handling while writing to Azure Blob Storage
Similar to the error handling method used for uploading to Amazon S3 and Google
Cloud Storage, the set -o pipefail
shell command can be used to detect that
errors occurred in any of the commands used in a shell pipeline. For example:
#!/bin/bash
set -o pipefail; \
mp4split -o stdout:.ismv test.mp4 | \
azcopy cp --from-to PipeBlob \
https://myaccount.blob.core.windows.net/mycontainer/test.ismv || \
azcopy rm https://myaccount.blob.core.windows.net/mycontainer/test.ismv
In this example, if either mp4split throws an error while processing "test.mp4", or azcopy cp throws an error while uploading the result, the azcopy rm command after the || will be run, deleting the partial output on Azure Blob Storage.
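As with Amazon S3, the bash-specific PIPESTATUS array can be used instead of pipefail. A minimal sketch, reusing the placeholder account, container and file names from above:
#!/bin/bash
mp4split -o stdout:.ismv test.mp4 | azcopy cp --from-to PipeBlob \
https://myaccount.blob.core.windows.net/mycontainer/test.ismv
# Check the exit status of both sides of the pipeline; on any failure,
# remove the partially uploaded blob.
if test ${PIPESTATUS[0]} != 0 -o ${PIPESTATUS[1]} != 0; then
azcopy rm https://myaccount.blob.core.windows.net/mycontainer/test.ismv
fi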
Google Cloud Storage
Google's Cloud Storage service is fairly similar to Amazon S3, using the same terminology, such as "buckets". As with Amazon S3, the most straightforward way of uploading data to Google Cloud Storage is the gsutil tool, contained in the Google Cloud SDK. This can be obtained from the Google Cloud SDK site, which has installation instructions for several Linux distributions (both apt-based and rpm-based ones), macOS and Windows.
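As a sketch, one quick way to install the SDK on Linux or macOS is Google's interactive installer (an assumption; the apt or rpm packages from the installation instructions work equally well):
#!/bin/bash
# Run Google's interactive Cloud SDK installer, then restart the shell
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gsutil version    # verify that gsutil is available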
After installing the Google Cloud SDK and authorizing it (using gcloud init or gcloud auth login), the gsutil tool can be used to list, download, upload and otherwise manage Google Cloud Storage files.
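For example, after authorizing, a quick listing confirms that access works (the project and bucket names are placeholders):
#!/bin/bash
gcloud auth login                      # opens a browser to authorize
gcloud config set project my-project   # select the project that owns the bucket
gsutil ls gs://mybucket                # list the bucket to verify access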
How to write directly to Google Cloud Storage
To accept input from another program, the gsutil cp command accepts - as the local file argument, to signify the standard input. Our CLI tools in turn accept stdout: as their output, optionally followed by a file extension, to write their output on the standard output.
For example, to convert an MP4 file "test.mp4" to fragmented MP4 using Unified Packager, and simultaneously upload it to a Google Cloud Storage bucket named "mybucket", you run:
#!/bin/bash
mp4split -o stdout:.ismv test.mp4 | gsutil cp - gs://mybucket/test.ismv
Unified Packager will write its output in 4MB chunks, and gsutil will read as much as it needs to perform its multiple chunk uploads, if necessary.
Error handling while writing to Google Cloud Storage
Similar to the error handling method used for uploading to S3, the set -o
pipefail
shell command can be used to detect that errors occurred in any of
the commands used in a shell pipeline. For example:
#!/bin/bash
set -o pipefail; \
mp4split -o stdout:.ismv test.mp4 | \
gsutil cp - gs://mybucket/test.ismv || \
gsutil rm gs://mybucket/test.ismv
In this example, if either mp4split throws an error while processing "test.mp4", or gsutil cp throws an error while uploading the result, the gsutil rm command after the || will be run, deleting the partial output on Google Cloud Storage.