Distributed Hash Generation Algorithm in Python - Twitter Snowflake Approach
Into The Background...
In any software development lifecycle, unique IDs play a major role in manipulating and showcasing data. A lot depends on the uniqueness of each data segment associated with user transactions.
There are several ways to generate unique IDs, but when it comes to generating hashes at a rate of approximately 10,000 hashes per second in a distributed fashion, it becomes a computational challenge.
The Twitter Snowflake approach is a widely used concept in recent advancements in distributed algorithms. Several other approaches can be used for hash generation, which will be covered in a future article!
Some Knowledge on Approach...
To enable distributed hash generation, this algorithm employs the principle of "divide and conquer" on a 64-bit hash. The algorithm uses five different parameters to establish uniqueness while generating the hash and ensuring there are no collisions.
The five parameters are as follows based on Figure 2 - 1:
Maximum Timestamp represented in 41 bits: (2 ^ 41 - 1) ms ~ 69 years
To backtrace, we can add the total milliseconds of the current time in UTC to Twitter's epoch, i.e., 1288834974657, which represents November 4, 2010, at 1:42:54 UTC.
Each machine in the system involved in generating hashes can run the following templates to generate a hash:
Recommended by LinkedIn
Let's Code...
While writing the code for generating a hash, if you don't have a distributed architecture or the resources to implement one, you can set the default bits of 'Datacenter ID' as '00000' and 'Machine ID' as '00000'. This scenario is also reflected in the following code:
from datetime import datetime, timedelta
class GenerateHash:
def __init__(self):
self.sequence_timestamp = datetime.utcnow()
self.sequence_number = 0
# Setting default epoch to 04 November 2010 at 1:42:54 UTC Time
self.time_ref = datetime(2010, 11, 4, 1, 42,54)
def generate(self):
## Sign bit: Will be useful for future references
sign_bit = '0'
## Datacenter bit (Default Datacenter: '00000')
datacenter_bit = '0' * 5
## Machine bit (Default Machine: '00000')
mac_bit = '0' * 5
## Timestamp bit
# Getting Current Time in utc
time_now = datetime.utcnow()
# Time Difference since default epoch
time_diff = time_now - self.time_ref
# Total time_diff in miliseconds
ms = int(time_diff.total_seconds() * 1000)
# Miliseconds conversion in binary to 41 bits
ms_bin = format(ms, '41b')
timestamp_bit = ms_bin
## Sequence bit
# Time difference since last request on this machine
time_difference = time_now - self.sequence_timestamp
# If time_difference > 1 ms: reset the sequence number
if time_difference > timedelta(milliseconds=1):
self.sequence_number = 0
# Set the sequence timestamp to current requests timestamp
self.sequence_timestamp = time_now
# Joining all strings
# Also formatting self.sequence_number int -> binary of twelve bytes
final_bin = sign_bit + timestamp_bit + datacenter_bit + mac_bit + format(self.sequence_number, '12b')
# Replacing all spaces with zero as bit
final_bin = final_bin.replace(' ', '0')
# Incrementing the sequence_number
self.sequence_number += 1
# Converting final binary to decimal and decimal to hex value
hex_val = hex(int(final_bin, 2))
# Avoiding extracting string 0x.. generated due to hex()
final_hex = hex_val[2:]
return final_hex
hash_generator = GenerateHash()
new_hash = hash_generator.generate()
print(new_hash)
Here we have used the 'datetime' library to support time operations since we use universal time to employ machines/servers globally.
The Benefits, Concerns, and more...
Benefits First :)
Some Concerns :(
Something more to focus on...
Thank you so much for getting this far!
Very helpful!
Amazing!