An associative array is an abstract data type used to store key-value pairs. Different languages have different data types for it:
- In Python — Dictionary
- In Ruby — Hash
- In Lua — Table
- In JavaScript — Object
- In Elixir and Java — Map
Associative arrays are popular in applied programming. They are convenient for representing compound data that contain many different parameters.
In a usual indexed array, we arrange values by indexes, which means that it can be put into memory as is. Things work differently with associative arrays. They do not have indexes to determine the order — so there's no easy way to get to the values. Associative arrays often use a hash table — a specific data structure to implement them.
A hash table allows you to organize the data of an associative array for convenient storage. A hash table uses an indexed array and a function to hash the keys. That said, a hash table isn't just a way to place data in memory; it involves logic.
In this lesson, we will discuss hash tables and learn how associative arrays work internally. You'll understand how many different processes go on in a program when we run this kind of simple code:
data = {}
data['key'] = 'value'
How do hash functions work
Any operation inside a hash table starts with converting the key somehow into an index of a regular array. To obtain an index from a key, you need to perform two actions:
- Find the hash, that is, hash the key
- Convert a key to an index — for instance, through the remainder of a division
Hashing is an operation that converts any input data into a string or a number of a fixed length. The function implementing the conversion algorithm is called a hash function. The result of hashing is called a hash or hash sum.
The most famous are CRC32, MD5, SHA, and many other types of hashing:
# Python has the zlib library, which contains the crc32 hashing algorithm
# This algorithm is convenient for clarity
import zlib
# Any data we want to hash is represented as a byte string
data = b'Hello, world!'
hash = zlib.crc32(data)
# hash is always the same for the same data
print(hash) # => 3957769958
We often encounter hashing in development. For example, the commit id in git 0481e0692e2501192d67d7da506c6e70ba41e913
is a hash resulting from hashing commit data.
When hashing, you first need to get a hash. It can then be converted into an array index — for example, to calculate the remainder of a division: