/*

A simple hash table.

Notes Last Updated: 11 May 2022

# Overview

This hash table stores all entries in a contiguous array, for good performance when
looking things up. (Some tables work by storing linked lists of entries, but this can
lead to more cache misses.)

When storing a value, we map its hash to a slot index; if that slot is free, we
put the key and value there; otherwise we keep probing subsequent slots until we
find an empty one (see Walk_Table below for the probing scheme). Because the table
can never be full, we are guaranteed to find a slot eventually.

When looking up a value, we perform this same process to find the correct slot.

We use hash values to indicate whether slots are empty or removed, because we can't
really know that based on keys or values, given that keys and values can be of arbitrary
type.

A hash of 0 means that slot is not used, so new values can be put there. A hash of 1
means that slot was once valid, but has since been removed. A hash of 2 or higher
(`FIRST_VALID_HASH`) means this is a currently live entry.

Whenever we hash a key, if the result is less than 2, we just add 2 to it
to put it into the valid range. This means that two values out of our 32-bit hash range
have a doubled probability of collision, but that is a small price to pay.

You can iterate over the table straightforwardly:

    for table {
        key, value := it_index, it;
    }

*/
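The sentinel scheme above (0 = never occupied, 1 = removed, 2+ = live) and the probe loop can be sketched in C. This is a conceptual toy with integer keys, a fixed-size table, and plain linear probing, not the module itself; `hash_int`, `put`, and `find` are names made up for this illustration:

```c
#include <assert.h>
#include <stdint.h>

#define NEVER_OCCUPIED_HASH 0u
#define REMOVED_HASH        1u
#define FIRST_VALID_HASH    2u
#define N 16   /* table size; must be a power of two */

typedef struct { uint32_t hash; int key; int value; } Entry;
static Entry slots[N];  /* zero-initialized: every slot starts as NEVER_OCCUPIED */

/* Hash the key, bumping results 0 and 1 into the valid range as described above. */
static uint32_t hash_int(int k) {
    uint32_t h = (uint32_t)k * 2654435761u;  /* toy multiplicative hash */
    return h < FIRST_VALID_HASH ? h + FIRST_VALID_HASH : h;
}

static void put(int key, int value) {
    uint32_t h = hash_int(key), i = h & (N - 1);
    /* Stop at the first slot that is empty (0) or removed (1); linear probing here,
       where the Jai module uses triangular-number probing. */
    while (slots[i].hash >= FIRST_VALID_HASH) i = (i + 1) & (N - 1);
    slots[i].hash = h; slots[i].key = key; slots[i].value = value;
}

static int *find(int key) {
    uint32_t h = hash_int(key), i = h & (N - 1);
    while (slots[i].hash != NEVER_OCCUPIED_HASH) {  /* a never-used slot ends the probe chain */
        if (slots[i].hash == h && slots[i].key == key) return &slots[i].value;
        i = (i + 1) & (N - 1);  /* removed slots do NOT stop the search */
    }
    return 0;
}
```

Note that `find` keeps probing past `REMOVED_HASH` slots: that is the whole reason tombstones exist, since a removed entry may sit in the middle of another key's probe chain.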


/*

By way of explaining why it's all good and okay for this hash table to be so simple,
we'll quote Charles Bloom, who said in 2017:

-----
The last time I looked at hashing, my main conclusions were :

1. A properly loaded (eg. not too full (eg. less than 70% or so)) hash
table satisfies almost all queries in the first slot.

(I think any statistics or testing of open-address hashing at more than 70%
full is not helpful.)

2. Because of #1, what you do after the first probe is almost
irrelevant. The only exception to this is if you have some bad degeneracy
in your hash function or data, in which case anything you can do to break
that degeneracy is a useful difference. The simplest (and therefore best)
seems to be quadratic probing.

3. Because of #1, storing your data compactly, like {Hash,Key,Data} is
best, because you always take exactly one cache miss per query. There are
some advantages to doing Hash-Hash-Hash, Key-Key-Key, but doing that
naively means a much higher constant # of cache misses. (I've seen a lot
of hash table implementations that use multiple arrays;
eg. khash and google's, and that's just totally wrong.)

4. Because of #1, any prefetching was a lose. After you compute the hash,
you need that first line immediately, so there's no time for a prefetch.
And then you need the second line so rarely that a prefetch doesn't help.

5. Because of #1, Cuckoo hashing is strictly worse, since it needs 2 cache
lines instead of 1.

6. Because of #1, the hash function is at least as important as the lookup
method. The hash function probably takes more clocks.


The only time to ever look at hash table speed is if you are memory
constrained. If you're not, then hey just make it bigger and it also gets
faster. So given that you are memory constrained, you always have to trade
off against things like making the key/data smaller or omitting one or two
of hash/key/data. eg. if hash/key/data can be 8 bytes or 12 bytes instead
of 16, you can run the table much less full in the same memory.

*/

/*
This hash table started life as some code in C by Sean Barrett, but it's
been changed so much that it's probably no longer recognizable.

For some evidence that this table is reasonably fast, even for large table sizes,
see jai/examples/hash_table_test.jai. We maybe should put that test here,
but I didn't want to bloat the file. There is also some additional
documentation there about how we handle collisions in this table,
and why.
*/

#module_parameters (COUNT_COLLISIONS := false)();

// A hash table that maps keys to values using 32-bit hashes.
Table :: struct (Key_Type: Type, Value_Type: Type,
                 given_hash_function: (Key_Type) -> u32 = null,
                 given_compare_function: (Key_Type, Key_Type) -> bool = null,
                 LOAD_FACTOR_PERCENT: u32 = 70,
                 REFILL_REMOVED := true
                ) {
    // If you leave compare_function null, the table will use
    // operator ==.

    count: int;         // The number of valid items in the table.

    allocated: int;     // The number of slots for which we have allocated memory.
    slots_filled: int;  // The number of slots that can't be used for new items (either currently valid items or items that were removed).

    allocator: Allocator;

    Entry :: struct {
        hash:  u32;
        key:   Key_Type;
        value: Value_Type;
    }

    entries: [] Entry;

    // We either use the hash function that was passed to us, or,
    // if it was null, we use a default one.
    #if given_hash_function {
        hash_function :: given_hash_function;
    } else {
        hash_function :: x => Hash.get_hash(x);
    }

    // Same situation with the compare function.
    #if given_compare_function {
        compare_function :: given_compare_function;
    } else {
        compare_function :: (a: Key_Type, b: Key_Type) -> bool { return a == b; };
    }

    #if COUNT_COLLISIONS {
        add_collisions:  s64;
        find_collisions: s64;
    }

    SIZE_MIN :: 32;
}

for_expansion :: (table: *$T/Table, body: Code, flags: For_Flags) #expand {
    #assert(!(flags & .REVERSE));  // We don't handle the reverse flag.

    for * entry, i: table.entries {
        if entry.hash < FIRST_VALID_HASH continue;

        #if flags & .POINTER {
            `it := *entry.value;
        } else {
            `it := entry.value;
        }

        `it_index := entry.key;

        #insert (remove={entry.hash=REMOVED_HASH; table.count-=1;}) body;
    }
}


// Initializes a `Table` and allocates the given number of slots.
// You don't have to call init; if you don't, you'll get
// an initial table of size `SIZE_MIN`.

init :: (using table: *Table, slots_to_allocate: s64 = 0) {
    Basic.remember_allocators(table);

    resize(table, slots_to_allocate);
}

resize :: (using table: *Table, slots_to_allocate: s64 = 0) {
    if slots_to_allocate == 0  slots_to_allocate = SIZE_MIN;
    n := next_power_of_two(slots_to_allocate);
    table.allocated = n;

    // Right now we align to 8 bytes. Should we handle arbitrary alignment of both
    // keys and values? Maybe, but that would complicate the table,
    // so let's not worry about it now unless we need it.

    // We used to create entries[] uninitialized for performance reasons,
    // but this led to problems when people would, for example, try to assign
    // hash tables containing strings to global data; the compiler would
    // try to copy the uninitialized strings and crash (because it has no way
    // to know they are uninitialized). For now we will go back to zero-initializing
    // by putting 'true' here, but I don't love this. -jblow, 24 November 2024.
    entries = Basic.NewArray(n, Entry, true,, table.allocator);
    for * entries  it.hash = 0;
}

// Frees the memory allocated by the table.
deinit :: (table: *Table) {
    Basic.free(table.entries.data,, table.allocator);
}

// This reset keeps the current number of allocated slots;
// it just clears occupancy.
// @@ IC: I like this behavior, but array_reset frees memory and that is confusing. See array_reset for more discussion.
table_reset :: (using table: *Table) {
    count = 0;
    slots_filled = 0;
    for * entries  it.hash = 0;
}

//
// Walk_Table is a macro that factors out the details of traversing the hash table,
// so that they can be re-used by multiple routines without having to keep these details
// in sync. We expect the calling site to define:
//
//     table: Table;
//     key:   table.Key_Type;
//
// We write back into the caller:
//
//     hash:  u32;
//     index: u32;
//     table_while_loop: iterator variable you can use to 'continue', etc.
//
// Walk_Table is intended for our own internal use. User-level traversal should
// probably use the for_expansion.
//
Walk_Table :: (code: Code) #expand {
    mask := cast,trunc(u32)(`table.allocated - 1);

    `hash := `table.hash_function(`key);
    if hash < FIRST_VALID_HASH  hash += FIRST_VALID_HASH;

    `index := hash & mask;

    // We use the 'triangular numbers' version of quadratic probing.
    // It yields more collisions per insertion than double hashing
    // (double hashing gives us 0.74, triangular numbers give us 0.84)
    // but the collisions are cheaper because the initial probes
    // are cache-coherent, unlike with double hashing. At large table sizes
    // this ends up mattering substantially. (examples/hash_table_test.jai,
    // at the 128 million entry table size, takes 15.5 seconds with triangular
    // numbers, 17.8 seconds with double hashing.)

    // Triangular numbers are guaranteed to hit every entry in the table:
    // https://fgiesen.wordpress.com/2015/02/22/triangular-numbers-mod-2n/

    probe_increment : u32 = 1;

    // If table.entries[index].hash is zero, that slot is unoccupied, so we are done walking.
    // If it's nonzero, we do whatever the caller wants to do (for example, check whether
    // it is the entry we are looking for).
    while `table_while_loop := `table.entries[index].hash {
        #insert code;  // Do whatever the caller wants to do for this slot index.

        // Since table.allocated is always a power of two, the & wraps our index within
        // the table. (Unsigned overflow of index + probe_increment is harmless here:
        // reducing mod 2^32 and then mod a smaller power of two lands on the same slot.)
        index = (index + probe_increment) & mask;
        probe_increment += 1;
    }
}
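The claim that this probe sequence covers the whole table is easy to check empirically. Below is a standalone C sketch, separate from the module (the helper name is ours): probing with increments 1, 2, 3, ... makes the k-th offset from the home slot the triangular number k*(k+1)/2, and for a power-of-two table that sequence lands on every slot exactly once in n steps.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if n probe steps with increments 1, 2, 3, ... touch all n slots.
   n must be a power of two, n <= 1024 for this demonstration. */
static int visits_every_slot(uint32_t n) {
    unsigned char seen[1024];
    uint32_t mask = n - 1, index = 0, inc = 1, step, i;

    memset(seen, 0, sizeof(seen));
    for (step = 0; step < n; step++) {
        seen[index] = 1;
        index = (index + inc) & mask;  /* same wrap as in Walk_Table */
        inc += 1;
    }
    for (i = 0; i < n; i++) if (!seen[i]) return 0;
    return 1;
}
```

This only holds because the table size is a power of two; for general moduli the triangular numbers revisit slots before covering them all, which is why `resize` rounds up with `next_power_of_two`.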


// table_ensure_space
//
// Call this if you want to be able to add 'items' number of things to the table without the table
// resizing in the middle. Useful in cases where you want to add multiple items to a table inside
// one block of code, while keeping pointers to those items that must remain valid.
table_ensure_space :: (table: *Table, items: s64) {
    if (table.slots_filled + items)*100 >= table.allocated*table.LOAD_FACTOR_PERCENT  expand(table);
}
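The condition here is the same division-free load-factor test used in table_add. Isolated as a tiny C sketch (the function name is ours, for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* "would (filled + items) / allocated reach percent / 100?"
   rearranged without division as "(filled + items) * 100 >= allocated * percent". */
static int would_exceed_load(int64_t slots_filled, int64_t items,
                             int64_t allocated, int64_t percent) {
    return (slots_filled + items) * 100 >= allocated * percent;
}
```

For example, with 32 slots at a 70% load factor, expansion triggers once an add would bring the table to 23 filled slots (about 72%), but not at 22 (68.75%).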


// table_add
//
// Adds the given key-value pair to the table; returns a pointer to the inserted value.
// If you add a key twice, the table will not currently notice that this has happened,
// so you'll just get the first one. It's unclear whether we should assert on this;
// it sort of depends on how slow you think key comparison might be. Maybe that should
// be a compile-time parameter to the hash table? -jblow, 8 September 2020
table_add :: (table: *Table, key: table.Key_Type, value: table.Value_Type) -> *table.Value_Type {
    #assert table.LOAD_FACTOR_PERCENT < 100;  // A 100% full table will infinite loop (and you will want to be substantially smaller than this for reasonable performance).

    // Without dividing, we want to test:
    //     (filled / allocated >= 70/100)
    // Therefore, we say
    //     (filled * 100 >= allocated * 70)
    //
    // The + 1 is here to handle the weird case when the table size is 1 and you add the first item...
    // If we did not account for the item being added, we could fill the table completely,
    // causing an infinite loop on find.
    if (table.slots_filled + 1)*100 >= table.allocated*table.LOAD_FACTOR_PERCENT  expand(table);

    Basic.assert(table.slots_filled < table.allocated);

    Walk_Table(#code {
        #if table.REFILL_REMOVED {
            if table.entries[index].hash == REMOVED_HASH {
                table.slots_filled -= 1;  // 1 will get re-added below, for a total increment of 0.
                break;
            }
        }

        #if COUNT_COLLISIONS  table.add_collisions += 1;
    });

    // Walk_Table walked us to an unused entry, so add our new data into this slot.

    table.count        += 1;
    table.slots_filled += 1;

    entry := *table.entries[index];
    entry.hash  = hash;
    entry.key   = key;
    entry.value = value;

    return *entry.value;
}

// table_set
//
// Adds or replaces the given key-value pair.
table_set :: (table: *Table, key: table.Key_Type, value: table.Value_Type) -> *table.Value_Type {
    value_ptr := table_find_pointer(table, key);
    if value_ptr {
        value_ptr.* = value;
        return value_ptr;
    } else {
        return table_add(table, key, value);
    }
}

// table_contains
//
// Returns whether the table contains key. Useful when the table is being used as a set.
table_contains :: (table: *Table, key: table.Key_Type) -> bool {
    return table_find_pointer(table, key) != null;
}


// Look up the given key and return a pointer to the corresponding value.
// If multiple values are added with the same key, we return the first match.
// If you think you might have key collisions, use table_find_multiple.

// We now take 'table' by pointer so that if COUNT_COLLISIONS, we can modify
// the collision count, which is stored on the table. If we stored it elsewhere
// this would no longer be necessary (but might be slower?) - 8 July 2022
table_find_pointer :: (table: *Table, key: table.Key_Type) -> *table.Value_Type {
    if !table.allocated  return null;

    Walk_Table(#code {
        entry := *table.entries[index];
        if entry.hash == hash {
            if inline table.compare_function(entry.key, key)  return *entry.value;
        }

        #if COUNT_COLLISIONS  table.find_collisions += 1;
    });

    return null;
}

// Look up the given key and return the corresponding value.
//
// You need to pay attention to 'success' because if it's false,
// the return value will be uninitialized. Thus the `#must`.
//
// See the note on table_find_pointer about collisions and about
// why 'table' is being passed by pointer.
table_find :: (table: *Table, key: table.Key_Type) -> (table.Value_Type, success: bool #must) {
    pointer := inline table_find_pointer(table, key);
    if pointer  return pointer.*, true;

    dummy: table.Value_Type = ---;
    return dummy, false;
}

// Warning: table_find_multiple has not yet been tested.
table_find_multiple :: (table: *Table, key: table.Key_Type) -> [] table.Value_Type /* Allocated with the temporary allocator! */ {
    if !table.allocated  return .[];

    results: [..] table.Value_Type;
    results.allocator = Basic.temporary_allocator;

    Walk_Table(#code {
        entry := *table.entries[index];
        if entry.hash == hash {
            if inline table.compare_function(entry.key, key) {
                Basic.array_add(*results, entry.value);
            } else {
                #if COUNT_COLLISIONS  table.find_collisions += 1;
            }
        } else {
            #if COUNT_COLLISIONS  table.find_collisions += 1;
        }
    });

    return results;
}


//
// End of deprecated routines.
//

// Removes the first entry at the given key. Returns false if the key was not found.
table_remove :: (table: *Table, key: table.Key_Type) -> (success: bool, value: table.Value_Type) {
    if !table.allocated {
        dummy: table.Value_Type;
        return false, dummy;  // Empty table means not found!
    }

    Walk_Table(#code {
        entry := *table.entries[index];

        if (entry.hash == hash) && inline table.compare_function(entry.key, key) {
            entry.hash = REMOVED_HASH;  // No valid entry will ever hash to REMOVED_HASH.
            table.count -= 1;
            return true, entry.value;
        }
    });

    dummy: table.Value_Type;
    return false, dummy;
}


// find_or_add is kind of like table_set, but used when you
// just want a pointer to the value, which you can fill in.
find_or_add :: (table: *Table, key: table.Key_Type) -> (entry: *table.Value_Type, newly_added: bool) {
    value := table_find_pointer(table, key);
    if value  return value, false;

    new_value: table.Value_Type;
    value = table_add(table, key, new_value);
    return value, true;
}


// We expose these reserved hash values just in case you want
// to write your own hash table iterator.
NEVER_OCCUPIED_HASH :: 0;
REMOVED_HASH        :: 1;
FIRST_VALID_HASH    :: 2;

// A beta user requested being able to call expand(), so, here:
expand :: (table: *Table) {
    old_entries := table.entries;

    // If we were adding and removing lots of stuff from the table,
    // we might have lots of slots filled with REMOVED_HASH, so,
    // in that case, don't grow!
    new_allocated: s64 = ---;

    // The *2 asks: if we double the live count, are we still small enough to fit into
    // the current memory? The reason we do this test is, if we removed a bunch of
    // elements, we may be full of REMOVED_HASH markers, and if we just rehash at the
    // same size, we can get rid of those. An alternate (simpler?) version might be to
    // check table.count vs table.slots_filled. Note that this becomes less necessary
    // if REFILL_REMOVED is enabled.
    if (table.count * 2 + 1) * 100 < table.allocated * table.LOAD_FACTOR_PERCENT {
        // Just go with the current size, but clean out the removals.
        new_allocated = table.allocated;
    } else {
        // We need to go deeper!
        new_allocated = table.allocated * 2;
    }

    if new_allocated < table.SIZE_MIN  new_allocated = table.SIZE_MIN;

    resize(table, new_allocated);

    table.count        = 0;  // count and slots_filled will be incremented by table_add.
    table.slots_filled = 0;

    for * entry, index: old_entries {
        // Note that if we removed some stuff, we will over-allocate the next table.
        // Maybe we should count the number of clobbers and subtract that. I dunno.
        if entry.hash >= FIRST_VALID_HASH  table_add(table, entry.key, entry.value);
    }

    Basic.free(old_entries.data,, table.allocator);
}

next_power_of_two :: inline (x: int) -> int {
    // @Speed: This could certainly be faster.
    // There is probably something in Hacker's Delight for this.
    Basic.assert(x != 0);
    p := 1;
    while x > p  p += p;
    return p;
}
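For the Hacker's Delight direction mentioned in the comment above, here is a C sketch of the standard bit-smearing version (the name is ours, not the module's): subtract one, OR the high bit rightward until all lower bits are set, then add one. It matches the loop version for x >= 1, assuming the result fits in 63 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next power of two without a loop over values:
   smear the highest set bit of (x - 1) into all lower positions, then add 1. */
static int64_t next_power_of_two_fast(int64_t x) {
    uint64_t p = (uint64_t)(x - 1);
    p |= p >> 1;  p |= p >> 2;  p |= p >> 4;
    p |= p >> 8;  p |= p >> 16; p |= p >> 32;
    return (int64_t)(p + 1);
}
```

For x = 1 this computes (0 smeared) + 1 = 1, and for any x that is already a power of two it returns x itself, same as the `while x > p  p += p` loop.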

#scope_file

Hash  :: #import "Hash";
Basic :: #import "Basic";