According to the listing documentation it is possible to treat a large navigate number of keys as though they were hierarchial. I am planning to store a large number of keys (let's say a few hundred million), distributed over a sensible-sized 'hierarchy'.
What is the performance of using a prefix and delimiter? Does it require a full enumeration of keys at the S3 end, and therefore be an O(n) operation? I have no idea whether keys are stored in a big hash table, or whether they have indexing data structures, or if they're stored in a tree or what.
I want to avoid the situation where I have a very large number of keys and navigating the 'hierarchy' suddenly becomes difficult.
So if I have the following keys:
abc/def/ghi/0abc/def/ghi/1abc/def/ghi/...abc/def/ghi/100,000,000,000Will it affect the speed of the query Delimiter='/, Prefix='abc/def'?
Although S3 bucket names are globally unique, each bucket is stored in a Region that you select when you create the bucket. To optimize performance, we recommend that you access the bucket from Amazon EC2 instances in the same AWS Region when possible. This helps reduce network latency and data transfer costs.
You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an Amazon S3 bucket. There are no limits to the number of prefixes that you can have in your bucket.
A delimiter is a character you use to group keys. Encoding type used by Amazon S3 to encode object keys in the response. The owner field is not present in listV2 by default, if you want to return owner field with each key in the result then set the fetch owner field to true.
A key prefix is a string of characters that can be the complete path in front of the object name (including the bucket name). For example, if an object (123. txt) is stored as BucketName/Project/WordFiles/123. txt, the prefix might be “BucketName/Project/WordFiles/123.
Aside from the Request Rate and Performance Considerations document that Sandeep referenced (which is not applicable to your use case), AWS hasn't publicized very much regarding S3 performance. It's probably private intellectual property. So I doubt you'll find very much information unless you can get it somehow from AWS directly.
However, some things to keep in mind:
Based on all of the above, chances are that it's much better than an order O(n) algorithm when you retrieve listing of keys. I think you are safe to use prefixes and delimiters for your hierarchy.
As long as you are not using a continuous sequence (such as date 2016-13-08, 2016-13-09 and so on) in the prefix you shouldn't face any problem. If your keys are auto-generated as a continuous sequence then prepend a randomly generated hash key to the keys (aidk-2016-13-08, ujlk-2016-13-09). The amazon documentation says:
Amazon S3 maintains an index of object key names in each AWS region. Object keys are stored in UTF-8 binary ordering across multiple partitions in the index. The key name dictates which partition the key is stored in. Using a sequential prefix, such as timestamp or an alphabetical sequence, increases the likelihood that Amazon S3 will target a specific partition for a large number of your keys, overwhelming the I/O capacity of the partition. If you introduce some randomness in your key name prefixes, the key names, and therefore the I/O load, will be distributed across more than one partition.
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With