Guest User

Untitled

a guest
Nov 22nd, 2017
91
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 2.29 KB | None | 0 0
  1. # feature hashing in vw
  2.  
  3. ## base hash function
  4.  
  5. Murmur32 hash is implemented in uniform_hash in hash.cc.
  6. It takes a string and a seed and return uint64_t.
  7.  
  8. There are two hash modes specified by --hash option.
  9.  
  10. Strings mode (hashstring in parse_primitives.cc)
  11. use parsed uint64_t value plus seed for integer string(matched by [0-9]+) and murmur32 for other type.
  12. ```
  13. uint64_t hash(string s, uint64_t seed) =
  14. if s match [0-9]+
  15. return uint64_t(s)+seed
  16. else:
  17. return murmur32(s, seed)
  18. ```
  19.  
  20. All mode (hashall in parse_primitives.cc) use murmur32.
  21.  
  22. ```
  23. hash(str, seed) = murmur32(str, seed)
  24. ```
  25.  
  26. ## hash for constant term
  27.  
  28. ```C++
  29. const uint64_t constant = 11650396; // defined in constant.h
  30. uint64_t constant_hash = 11650396 % (1<<b);
  31. ```
  32.  
  33. ## index for feature without namesapce
  34. ```C++
  35. uint64_t feature_hash = hash(feature_name, 0)
  36. uint64_t feature_index = feature_hash % (1<<b)
  37. ```
  38. where 0 is seed, b is num of bits used.
  39.  
  40. ## hash for feature with name space
  41.  
  42. ```C++
  43. uint64_t namespace_hash = hash(namespace, 0)
  44. uint64_t feature_hash = hash(feature_name, namespace_hash)
  45. uint64_t feature_index = feature_hash % (1<<b)
  46. ```
  47. ## hash for quadratic feature
  48.  
  49. from interactions.h
  50.  
  51. given hashes(indices) h1, h2 of two features
  52. ```C++
  53. const uint32_t FNV_prime = 16777619; //defined in constant.h
  54. uint64_t h = (h1*FNV_prime) ^ h2 // ^ is xor
  55. uint64_t i = h % (1<<b)
  56. ```
  57.  
  58. Whether h1, h2 are feature hashes(before mode) or indices (after mod) will give the same result.
  59.  
  60. ## example
  61.  
  62. sample:
  63. ```
  64. 1 | 123456789 has_car |gender 男 |interest 篮球 |age 42
  65. ```
  66.  
  67. index calculation(strings mode, b=18):
  68. ```
  69. constant -> 11650396 % 1<<18 = 116060
  70.  
  71. 123456789 -> 123456789 % 1<<18 = 249109
  72.  
  73. hashstring(hash_car, 0) = 2086702313
  74. has_car -> 2086702313 % (1<<18) = 36073
  75.  
  76. hashstring(gender, 0) = 2678092942
  77. hashstring(男, 2678092942) = 430992541
  78. gender^男 -> 430992541 % 1<<18 = 27805
  79.  
  80. hashstring(interest, 0) = 570245443
  81. hashstring(篮球, 570245443) = 2287792271
  82. interest^篮球 -> 2287792271 % 1<<18 = 61583
  83.  
  84. hashstring(age, 0) -> 717653329
  85. hashstring(42, 717653329) -> 717653329+42 = 717653371
  86. age^42 -> 717653371 % 1<<18 = 165243
  87.  
  88. gender^男*interest^篮球 -> ((27805 * 16777619) ^ 61583) % 1<<18 = 134056
  89. gender^男*gender^男 -> ((27805*16777619)^27805) % 1<<18 = 169914
  90. interest^篮球*interest^篮球 -> ((61583*16777619)^61583) % 1<<18 = 147858
  91. ```
Add Comment
Please, Sign In to add comment