Cryptographic Hash

What is a hash function?

A hash function is a function that takes an input of any length and returns a fixed size output.

h: \{0,1\}^* \rightarrow \{0,1\}^n

e.g.

Example of a hash parameter and output, with n=8.

Input Output
01010101110110101 10101010
01 00101010
1011101101010101010101010101010101010101010101 10101010

So, the input can be of any length, but the output is always the same length.

Non cryptographic hash

Let’s create a simple hash function

def hash(num: int) -> int:
    return num % 1000
print(hash(123983))
print(hash(123))
print(hash(293939393939393122))
983
123
122

Regardless how long the input is, the output is always a number [0, 1000), which fulfills the definition of a hash function

h: \{0,1\}^* \rightarrow \{0,1\}^n

Collision

A collision happen when two different inputs produce the same output.

print(hash(123983))
print(hash(983))
983
983

Good hash function should have a low probability of collision. But - by definition - it is impossible to have a hash function that never collide (unless the output is the same size as the input)

One Way

A hash function is one way if it is easy to compute the output from the input, but it is hard to compute the input from the output.

The following is not a one way hash function:

def hash_function(input):
    return upper(input)

Well, it’s pretty obvious that the input is lower(output), so it’s not one way.

Another non one way hash function is: encryption algorithm. If you know the key, you can easily decrypt the output.

Hash Applications

Password Storage

username password
alice 123456
bob rahasia
charlie Password!

Oh, no! Don’t store password in plain text, do you? Why not?

  • The database can be stolen, the whole world will know the passwords
  • We don’t want the system administrator to know the password, do we?

So, how to store it securely while still being able to authenticate the user?

Use hash!

username hash(password)
alice 8237492392332892302183243
bob 2837498234792109348923582
charlie 490447284045892492384232

Upon login, the system will hash the password and compare it with the stored hash.

def check_password(username, password) -> bool:
    stored_hash = get_stored_hash_from_db(username)
    return stored_hash == hash(password)

It’s not secure yet, we will discuss later why.

Q: Why not just encrypt the password?

A: Well, we can, but then we need to decrypt it to compare it with the input. So, the system administrator can still know the password.

Checksum

We can also use hash to check the integrity of a file, i.e. to check whether the file is corrupted/altered or not.

Let’s say the file contains the following text:

Hello! My name is Alice. I am a student studying computer science. I love cryptography!

We can calculate the hash of the file and store it in a separate file, e.g. checksum.txt

23885728283

If someone alter the file, e.g. by adding a period at the end of the file, the hash will be different.

When we want to check the integrity of the file, we can calculate the hash of the file and compare it with the stored hash.

def check_integrity(filename) -> bool:
    stored_hash = get_stored_hash_from_file(filename)
    return stored_hash == hash(filename)

Many websites provide the hash of the file so that you can check the integrity of the file you download, e.g. OpenOffice

But is it secure? What if someone alter the file and also alter the stored hash? Hash is not secure if the attacker can alter both the file and the stored hash.

We could use a digital signature to tackle this problem, but it’s a topic for another day (or another course).

Attacks on Hash Functions

Preimage Attack

Given a hash h and a hash function H, it should be difficult to find a message m such that H(m) = h.

Let’s say we have the following simple hash function

def hash(s: str) -> int:
    return sum([ord(c) for c in s]) % 1000

secret = "Password"
hashed = hash(secret)
print(hashed)
851

Given that we know that 851, can we find m?

Yes*, we can try all possible values of m until we find the one that produces 983.

common_passwords = ["secret", "1234", "password", "Password"]

for guess in common_passwords:
    if hash(guess) == hashed:
        print("Password cracked! It was:", guess)
        break
Password cracked! It was: Password

But, what if we don’t the password is not in the common password list?

We can find another message m' that produces the same hash h (i.e. find collision).

Look at the previous hash function

def hash(s: str) -> int:
    return sum([ord(c) for c in s]) % 1000

Can we find a string s that produces the same hash as 851?

# draw ascii table of 'a' to 'z'
for i in range(26):
    print(chr(ord('a') + i) + " -> " + str(ord('a') + i), end=" | ")
a -> 97 | b -> 98 | c -> 99 | d -> 100 | e -> 101 | f -> 102 | g -> 103 | h -> 104 | i -> 105 | j -> 106 | k -> 107 | l -> 108 | m -> 109 | n -> 110 | o -> 111 | p -> 112 | q -> 113 | r -> 114 | s -> 115 | t -> 116 | u -> 117 | v -> 118 | w -> 119 | x -> 120 | y -> 121 | z -> 122 | 

Let’s analyze, what characters should be chosen so that the sum of their ASCII values is equal to 851?

hash("zzzzzzw")
851

Bingo! We found a collision!

Since collision will happens, to prevent pre-image attack, we need to make sure that it’s computationally infeasible to mount a pre-image attack.

For a n bit hash, the complexity of pre-image attack is O(2^n) (brute force). Modern hash function has a hash size of 256 bits, which means that the complexity of pre-image attack is O(2^{256}).

Rainbow Table

O(2^n) seems to be a big number, but it’s not impossible to mount a pre-image attack.

Let’s use the following hash function

import hashlib

def h(s: str) -> int:
    # Encode the string into bytes
    text_bytes = s.encode('utf-8')
    hash_object = hashlib.md5(text_bytes)

    # Get the hexadecimal representation of the MD5 hash
    md5_hash = hash_object.hexdigest()

    return md5_hash

And we have the following users table:

users = []

users.append(("Alice", h("password123")))
users.append(("Bob", h("123456")))
users.append(("Charlie", h("qwerty")))
users.append(("Diana", h("secret")))
users.append(("Eve", h("aman")))
users.append(("Frank", h("iloveyou")))

users
[('Alice', '482c811da5d5b4bc6d497ffa98491e38'),
 ('Bob', 'e10adc3949ba59abbe56e057f20f883e'),
 ('Charlie', 'd8578edf8458ce06fbc5bb76a58c5ca4'),
 ('Diana', '5ebe2294ecd0e0f08eab7690d2a6ee69'),
 ('Eve', 'ccda1683d8c97f8f2dff2ea7d649b42c'),
 ('Frank', 'f25a2fc72690b780b2a14e140ef6a9e0')]

Can we crack the password of Eve?

Using mere brute force, we need to try all possible passwords. Let’s try

def crack(hashed_password) -> str:
    # it surely can be made recursive, but for simplicity we will use 4 nested loops
    for c1 in range(26):
        for c2 in range(26):
            for c3 in range(26):
                for c4 in range(26):
                    guess = chr(ord('a') + c1) + chr(ord('a') + c2) + chr(ord('a') + c3) + chr(ord('a') + c4)
                    hashed_guess = h(guess)
                    # print(guess, hashed_guess)

                    if hashed_password == hashed_guess:
                        return guess
    return None

hashed_password = users[4][1]
crack(hashed_password)
'aman'

Wow! We found the password of Eve!

The complexity is $O(26^4), which is not that big. That’s why we are encouraged to use a long password and use alphanumeric characters.

number_of_characters = [4, 8, 10]
character_sets_size = [26, 52, 62]

for i in range(len(number_of_characters)):
    print("Number of characters:", number_of_characters[i])
    print("Character set size:", character_sets_size[i])
    print("Number of possible passwords:", character_sets_size[i] ** number_of_characters[i])
    print("")
Number of characters: 4
Character set size: 26
Number of possible passwords: 456976

Number of characters: 8
Character set size: 52
Number of possible passwords: 53459728531456

Number of characters: 10
Character set size: 62
Number of possible passwords: 839299365868340224

Ok, bruteforce attack seems infeasible, but what if we just try common/leaked passwords? People reuse password afterall.

known_password = [
    "password", "secure", "secret", "qwerty", "123456", "12345678", "12345", "iloveyou", "princess", "1234567",
]

We can pre-calculate the hash of common passwords and store it in a table, this is called rainbow table.

for password in known_password:
    print(password, h(password))
password 5f4dcc3b5aa765d61d8327deb882cf99
secure 1c0b76fce779f78f51be339c49445c49
secret 5ebe2294ecd0e0f08eab7690d2a6ee69
qwerty d8578edf8458ce06fbc5bb76a58c5ca4
123456 e10adc3949ba59abbe56e057f20f883e
12345678 25d55ad283aa400af464c76d713c07ad
12345 827ccb0eea8a706c4c34a16891f84e7b
iloveyou f25a2fc72690b780b2a14e140ef6a9e0
princess 8afa847f50a716e64932d995c8e7435a
1234567 fcea920f7412b5da7be0cf42b8c93759

Given the following users table, can you crack the password of Diana?

users
[('Alice', '482c811da5d5b4bc6d497ffa98491e38'),
 ('Bob', 'e10adc3949ba59abbe56e057f20f883e'),
 ('Charlie', 'd8578edf8458ce06fbc5bb76a58c5ca4'),
 ('Diana', '5ebe2294ecd0e0f08eab7690d2a6ee69'),
 ('Eve', 'ccda1683d8c97f8f2dff2ea7d649b42c'),
 ('Frank', 'f25a2fc72690b780b2a14e140ef6a9e0')]

Yes, just look up the rainbow table!

Another problem with simply hashing password is that, if two users have the same password, they will have the same hash

users.append(("Grace", h("qwerty")))
users
[('Alice', '482c811da5d5b4bc6d497ffa98491e38'),
 ('Bob', 'e10adc3949ba59abbe56e057f20f883e'),
 ('Charlie', 'd8578edf8458ce06fbc5bb76a58c5ca4'),
 ('Diana', '5ebe2294ecd0e0f08eab7690d2a6ee69'),
 ('Eve', 'ccda1683d8c97f8f2dff2ea7d649b42c'),
 ('Frank', 'f25a2fc72690b780b2a14e140ef6a9e0'),
 ('Grace', 'd8578edf8458ce06fbc5bb76a58c5ca4')]

Ooho, “Charlie” and “Grace” have the same password!

Salt

The solution is to add a random string to the password before hashing it. This random string is called salt.

So, rather than just hash(password), we do hash(salt + password)

salt = "supersecretsalt"
users_with_salt = []

users_with_salt.append(("Alice", h("password123" + salt)))
users_with_salt.append(("Bob", h("123456" + salt)))
users_with_salt.append(("Charlie", h("qwerty" + salt)))
users_with_salt.append(("Diana", h("secret" + salt)))
users_with_salt.append(("Eve", h("aman" + salt)))
users_with_salt.append(("Frank", h("iloveyou" + salt)))
users_with_salt.append(("Grace", h("qwerty" + salt)))
users_with_salt
[('Alice', 'bef36a6b8abe65a1ca0017fb29ce389a'),
 ('Bob', '792a39b2d91e301c841b0b4e6dc11d06'),
 ('Charlie', 'db899b76a6609a8aac3dbd60a1962236'),
 ('Diana', '41ade3cb6140859235528f6b1c507850'),
 ('Eve', 'b42d96585e5c0a13eac7438a813d32bf'),
 ('Frank', '760f87a08cbe8f2316d283b41f9caf99'),
 ('Grace', 'db899b76a6609a8aac3dbd60a1962236')]

Now, we can’t mount a rainbow table attack anymore, since we don’t know the salt.

But what if we know the salt? Can we still mount a rainbow table attack?

Yes!

Just pre-calculate the hash of common passwords with the salt.

Different salt for each user

We missed one thing, the salt is the same for all users. So, if two users have the same password, they will have the same hash. And it is also susceptible to rainbow table attack.

The solution is to use different salt for each user.

users_with_salt = []

salt = "supersecretsalt"
users_with_salt.append(("Alice", h("password123" + salt), salt))

salt = "2839sdfsjkdfkl"
users_with_salt.append(("Bob", h("123456" + salt), salt))

salt = "sdfjksdfjksdf"
users_with_salt.append(("Charlie", h("qwerty" + salt), salt))

salt = "iou23u4iousod"
users_with_salt.append(("Diana", h("secret" + salt), salt))

salt = "23894798sdjkfskl"
users_with_salt.append(("Eve", h("aman" + salt), salt))

salt = "89273klscdnkljkls"
users_with_salt.append(("Frank", h("iloveyou" + salt), salt))

salt = "i2349iusdjfksf"
users_with_salt.append(("Grace", h("qwerty" + salt), salt))

users_with_salt
[('Alice', 'bef36a6b8abe65a1ca0017fb29ce389a', 'supersecretsalt'),
 ('Bob', '155727303933b14ead7a48751f073166', '2839sdfsjkdfkl'),
 ('Charlie', '753d4d9815f30a4ed36b89ed1876565c', 'sdfjksdfjksdf'),
 ('Diana', '1dff7cfa2c56261f3e52e1120b2b564b', 'iou23u4iousod'),
 ('Eve', '6107eef0d46f7b3db6ae22e3ca45f4db', '23894798sdjkfskl'),
 ('Frank', '38a2d0871c9e925e7f6f03408fe665df', '89273klscdnkljkls'),
 ('Grace', 'f751c0d1057e2bbb30174091d69cd9f7', 'i2349iusdjfksf')]

Rainbow table becomes not effective anymore, since we need to pre-calculate the hash of common passwords with each salt.

Bonus: Grace and Bob now have different hash!

BCrypt

BCrypt is a password hashing function that is designed to be slow and expensive to compute. It is designed to be slow to prevent brute force attack.

It’s used by Devise, a popular authentication library for Ruby on Rails.

!pip install bcrypt

Use Cases

Hashing Password

import bcrypt

# Generate a salt
salt = bcrypt.gensalt()

# Hash a password for the first time
password = b"super secret password"
hashed = bcrypt.hashpw(password, salt)

# Print the hashed password
print(hashed)
b'$2b$12$9wUZrRjuimKbqdz.uAR0UuDl5pCDLvbvqyZNXxEPoeqnGiM837HNG'

The hashed contains the salt and the hash of the password. This hashed value is stored in the database.

username hashed
alice $2b$12$9wUZrRjuimKbqdz.uAR0UuDl5pCDLvbvqyZNXxEPoeqnGiM837HNG

Verifying Password

# retrieved from user input (in login form)
password = b"super secret password"

# retrieved from database
hashed = b'$2b$12$9wUZrRjuimKbqdz.uAR0UuDl5pCDLvbvqyZNXxEPoeqnGiM837HNG'

if bcrypt.checkpw(password, hashed):
    print("Login successful.")
else:
    print("Incorrect password.")
Login successful.