Improving Throughput to Cloud Storage
One of my ex-colleagues from my startup days recently asked me how to move large amounts of data to cloud storage. The context was backing up a few hundred TBs from Big Data systems to cloud storage. While talking to him, I decided to put some of my thoughts down. The suggestions here are generic and apply to any cloud storage vendor: AWS/GCS/Azure/OpenStack/… (I don’t reference Oracle Cloud Storage here. A lot of my learning came from my previous experiences in startups, which did not use Oracle Cloud Storage at the time.)
Handling high latency
Cloud storage vendors provide high bandwidth and near-infinite horizontally scalable storage, but with significant latency. The latency comes primarily from the physical distance between the customer site and the cloud storage endpoint, and from the backend architecture of the storage service. The traditional ways of handling high-latency LFN (long fat network) links are WAN optimization or FAST TCP methods. These require custom network software at both ends of the connection. In cloud storage scenarios, the user controls only one end of the connection, so these techniques can’t be applied.
For a cloud storage service, a better alternative is to fill the pipe using multiple TCP connections and upload multiple objects in parallel. To fill a 1-10Gbps pipe, we would need anywhere from 40 to 2048 TCP connections, which means that many object uploads in parallel.
The ideal number of parallel TCP connections varies with latency and available bandwidth. If the link is shared with other services, the ideal connection count may also vary over time, based on link usage.
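As a rough sketch of the idea (assuming a boto3-style S3 client; the bucket and file names below are placeholders), parallel uploads can be driven from a simple thread pool, with the worker count tuned to the link:

from concurrent.futures import ThreadPoolExecutor
import boto3

BUCKET = "backup-bucket"            # hypothetical bucket name
PARALLEL_CONNECTIONS = 64           # tune for latency and available bandwidth

def upload_one(path):
    # A client per thread keeps each upload on its own connection pool.
    s3 = boto3.client("s3")
    s3.upload_file(path, BUCKET, path.lstrip("/"))
    return path

files = ["/backups/part_%03d" % i for i in range(512)]     # placeholder file list
with ThreadPoolExecutor(max_workers=PARALLEL_CONNECTIONS) as pool:
    for done in pool.map(upload_one, files):
        print("uploaded", done)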
Maximum single object size limit
Cloud storage also has a limit on the maximum size of a single object, typically 5-10GB, though some vendors now allow larger sizes (exception: Azure has a block blob size limit of 256MB). For large files, such as system backups, the application has to split the file into multiple objects and keep track of them in some metadata. Apps can construct their own metadata format. This is favored in scenarios where (a) different large files share some common segments, or (b) multiple versions of a large file need to be stored. The cloud can also be used to store the metadata, using the multipart upload or compose REST APIs. These cloud APIs effectively stitch the given objects into a single downloadable file.
Example
For example, to upload a 200GB file, the application can split it into 512 objects of roughly 400MB each and upload them in parallel. This provides higher bandwidth utilization and can fill a 10Gbps pipe. Also, if an object upload fails (which does happen), only that 400MB object needs to be re-uploaded.
Original file to back up: Bigdata.backup (size 200GB)
The file is broken into 512 objects, each roughly 400MB in size. The objects are named
‘Bigdata.backup/segment_1’, ‘Bigdata.backup/segment_2’, ‘Bigdata.backup/segment_3’, .. ‘Bigdata.backup/segment_512’
Upload all the objects in parallel, opening up 512 TCP connections.
Once uploaded, create a metadata file in the cloud store named ‘Bigdata.backup’ which contains the following content:
<CompleteMultipartUpload>
  <Part>
    <PartNumber>Bigdata.backup/segment_1</PartNumber>
    <ETag>"md5 checksum of this segment"</ETag>
  </Part>
  <Part>
    <PartNumber>Bigdata.backup/segment_2</PartNumber>
    <ETag>"md5 checksum of this segment"</ETag>
  </Part>
  ...
  ...
  <Part>
    <PartNumber>Bigdata.backup/segment_512</PartNumber>
    <ETag>"md5 checksum of this segment"</ETag>
  </Part>
</CompleteMultipartUpload>
This metadata file tells the cloud storage that Bigdata.backup is composed of 512 objects. When a download is requested for the file ‘Bigdata.backup’, the cloud will serve the data from the objects listed in this metadata file.
This functionality is provided by most cloud vendors' tooling. For more features, such as resuming uploads, cleaning up objects after a partial upload failure, compression, segmented download, or auto-detection of the maximum number of parallel segments to upload based on current bandwidth, applications may have to extend the tooling or write their own tools.
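As a rough end-to-end sketch of the 200GB example (assuming an S3-style multipart upload via boto3; bucket name, key, segment size and worker count are all placeholders to tune):

from concurrent.futures import ThreadPoolExecutor
import os
import boto3

BUCKET, KEY, PATH = "backup-bucket", "Bigdata.backup", "Bigdata.backup"
SEGMENT_SIZE = 400 * 1024 * 1024           # ~400MB per part

s3 = boto3.client("s3")
upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]

def upload_part(part_number):
    # Each worker reads only its own slice, so memory use is bounded by
    # the number of workers, not the file size.
    with open(PATH, "rb") as f:
        f.seek((part_number - 1) * SEGMENT_SIZE)
        data = f.read(SEGMENT_SIZE)
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                          PartNumber=part_number, Body=data)
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

num_parts = -(-os.path.getsize(PATH) // SEGMENT_SIZE)      # ceiling division
with ThreadPoolExecutor(max_workers=16) as pool:
    parts = list(pool.map(upload_part, range(1, num_parts + 1)))

# The completion call plays the role of the metadata file above: it tells the
# store which parts, in which order, make up the downloadable object.
s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                             MultipartUpload={"Parts": parts})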
Note on some specific providers
Azure allows a maximum object size of only 256MB, so in the above example the object size will need to be changed. All the objects except the last one should be of the same size; the last object can be the same size or smaller.
GCS does not allow more than 32 objects to be combined in a single call to its compose API, so with GCS we would need to use a larger object size. The GCS model is to support a small number of larger objects. For GCS, if you have a large number of objects, it may be preferable for the application to maintain the metadata in a local or cloud-based DB.
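If larger segments are not an option, another way to stay under the 32-object limit is to compose in two levels: groups of up to 32 segments into intermediate objects, then the intermediates into the final object. A rough sketch, assuming the google-cloud-storage Python library and placeholder names:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("backup-bucket")                     # hypothetical bucket

segment_names = ["Bigdata.backup/segment_%d" % i for i in range(1, 513)]
intermediates = []
for group_no, start in enumerate(range(0, len(segment_names), 32)):
    sources = [bucket.blob(n) for n in segment_names[start:start + 32]]
    inter = bucket.blob("Bigdata.backup/intermediate_%d" % group_no)
    inter.compose(sources)
    intermediates.append(inter)

# 512 segments -> 16 intermediates, which fit into a single final compose call.
bucket.blob("Bigdata.backup").compose(intermediates)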
Improving speed and maximizing efficiency of Load Balancers
When trying to get speeds above 1Gbps, the cloud storage endpoint may become another bottleneck. The cloud storage endpoint is typically front-ended by a load balancer, which has limited bandwidth.
To scale horizontally, cloud storage vendors use multiple load balancers and rotate between them using a time-based DNS round-robin method. For example, with S3, the DNS server changes the load balancer every few seconds.
To get higher bandwidth, applications have to hit more than one load balancer. One option is to query and cache the last few load balancer IPs and spread traffic across all of them. A better alternative is to start object uploads in a staged manner. In the above example, out of the 512 objects to be uploaded, one option is to start the upload of 128 objects every 30 seconds. Since the TTL for S3 DNS entries is only a few seconds, this forces a fresh DNS lookup of the storage endpoint every 30 seconds and naturally spreads the load across multiple load balancers.
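As a minimal sketch of this staging (the endpoint name, wave size and interval are illustrative only; the upload body is a placeholder):

import socket
import time
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "storage.example-cloud.com"     # hypothetical storage endpoint
WAVE_SIZE, WAVE_INTERVAL = 128, 30         # objects per wave, seconds between waves

def upload_object(name):
    # Placeholder for the real upload; the fresh lookup here shows which
    # load balancer IP this wave would land on.
    ip = socket.gethostbyname(ENDPOINT)
    print("uploading", name, "via load balancer", ip)

objects = ["Bigdata.backup/segment_%d" % i for i in range(1, 513)]
with ThreadPoolExecutor(max_workers=512) as pool:
    for start in range(0, len(objects), WAVE_SIZE):
        for name in objects[start:start + WAVE_SIZE]:
            pool.submit(upload_object, name)
        time.sleep(WAVE_INTERVAL)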
Importance of naming and organizing files
Cloud storage supports filenames but not a directory structure. Typically, vendors use some form of hashing on the filename (full path) and metadata to arrive at a target storage server. The OpenStack Swift architecture is a good, though somewhat outdated, basic reference.
The uploaded object is broken into chunks and spread across storage servers with some form of replication or erasure coding. For the client, this translates to fast read and write access, but poor performance for listing objects (especially if there are a large number of objects in the bucket). For this reason, it is better for applications to avoid listing buckets. If the application requires listing functionality, it is better to maintain metadata in a local DB, a cloud DB, or separate metadata files. These metadata files can also be periodically backed up to cloud storage.
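A minimal sketch of the local-DB approach, using SQLite (the table layout and names are illustrative only):

import sqlite3

db = sqlite3.connect("backup_metadata.db")
db.execute("""CREATE TABLE IF NOT EXISTS segments (
                  logical_file TEXT,
                  segment_name TEXT,
                  segment_no   INTEGER,
                  md5          TEXT)""")

def record_segment(logical_file, segment_name, segment_no, md5):
    db.execute("INSERT INTO segments VALUES (?, ?, ?, ?)",
               (logical_file, segment_name, segment_no, md5))
    db.commit()

def list_segments(logical_file):
    # Replaces a bucket LIST call with a cheap local query.
    rows = db.execute("SELECT segment_name FROM segments "
                      "WHERE logical_file = ? ORDER BY segment_no",
                      (logical_file,))
    return [row[0] for row in rows]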
Distributing files across multiple directories in a local filesystem improves performance, thanks to directory-specific records (dentries). In cloud storage, which has no directory concept, this distribution gives little to no performance improvement. However, it is still advisable to use a directory-like structure for human readability and for better prefix/suffix pattern-based listing. Some vendors suggest adding a few random characters to filenames for better hashing in the backend.
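One common way to do this (sketched here; the 4-character prefix length is an arbitrary choice) is to prepend a short hash of the name, so that keys no longer share one long common prefix while the readable, directory-like name is preserved:

import hashlib

def spread_key(logical_name):
    prefix = hashlib.md5(logical_name.encode()).hexdigest()[:4]
    return prefix + "/" + logical_name

print(spread_key("Bigdata.backup/segment_1"))   # -> "<4 hex chars>/Bigdata.backup/segment_1"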
TCP Congestion control and testing
Enabling TCP window scaling on the client machines can improve performance. For TCP congestion control, it is better to switch from CUBIC to BBR where it is supported. Running a layer-4 traceroute to the cloud storage endpoint on TCP port 443 can help identify whether any hops in the path are slow.
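On Linux clients, the current settings can be checked by reading the relevant /proc entries, for example:

def read_sysctl(path):
    with open(path) as f:
        return f.read().strip()

# e.g. "cubic" or "bbr"
print("congestion control:", read_sysctl("/proc/sys/net/ipv4/tcp_congestion_control"))
# "1" means window scaling is enabled
print("window scaling:", read_sysctl("/proc/sys/net/ipv4/tcp_window_scaling"))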
Most cloud storage vendors provide some sort of network interconnect, which can provide a faster and less noisy direct link.
Finally, test the network by uploading and downloading sample data of at least 1TB. Run this test a few times, at different times of the week, to ensure that the pipe is available at all times (especially if the pipe is shared by multiple services).
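A simple way to record these runs is to wrap the upload in a timer (sketch only; upload_one is the hypothetical helper from the first example above):

import time

def measure_upload(path, size_bytes):
    start = time.monotonic()
    upload_one(path)                        # hypothetical upload helper
    elapsed = time.monotonic() - start
    print("%s: %.2f Gbps over %.0f s" % (path, size_bytes * 8 / elapsed / 1e9, elapsed))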