Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- # feature hashing in vw
- ## base hash function
- Murmur32 hash is implemented in uniform_hash in hash.cc.
- It takes a string and a seed and return uint64_t.
- There are two hash modes specified by --hash option.
- Strings mode (hashstring in parse_primitives.cc)
- use parsed uint64_t value plus seed for integer string(matched by [0-9]+) and murmur32 for other type.
- ```
- uint64_t hash(string s, uint64_t seed) =
- if s match [0-9]+
- return uint64_t(s)+seed
- else:
- return murmur32(s, seed)
- ```
- All mode (hashall in parse_primitives.cc) use murmur32.
- ```
- hash(str, seed) = murmur32(str, seed)
- ```
- ## hash for constant term
- ```C++
- const uint64_t constant = 11650396; // defined in constant.h
- uint64_t constant_hash = 11650396 % (1<<b);
- ```
- ## index for feature without namesapce
- ```C++
- uint64_t feature_hash = hash(feature_name, 0)
- uint64_t feature_index = feature_hash % (1<<b)
- ```
- where 0 is seed, b is num of bits used.
- ## hash for feature with name space
- ```C++
- uint64_t namespace_hash = hash(namespace, 0)
- uint64_t feature_hash = hash(feature_name, namespace_hash)
- uint64_t feature_index = feature_hash % (1<<b)
- ```
- ## hash for quadratic feature
- from interactions.h
- given hashes(indices) h1, h2 of two features
- ```C++
- const uint32_t FNV_prime = 16777619; //defined in constant.h
- uint64_t h = (h1*FNV_prime) ^ h2 // ^ is xor
- uint64_t i = h % (1<<b)
- ```
- Whether h1, h2 are feature hashes(before mode) or indices (after mod) will give the same result.
- ## example
- sample:
- ```
- 1 | 123456789 has_car |gender 男 |interest 篮球 |age 42
- ```
- index calculation(strings mode, b=18):
- ```
- constant -> 11650396 % 1<<18 = 116060
- 123456789 -> 123456789 % 1<<18 = 249109
- hashstring(hash_car, 0) = 2086702313
- has_car -> 2086702313 % (1<<18) = 36073
- hashstring(gender, 0) = 2678092942
- hashstring(男, 2678092942) = 430992541
- gender^男 -> 430992541 % 1<<18 = 27805
- hashstring(interest, 0) = 570245443
- hashstring(篮球, 570245443) = 2287792271
- interest^篮球 -> 2287792271 % 1<<18 = 61583
- hashstring(age, 0) -> 717653329
- hashstring(42, 717653329) -> 717653329+42 = 717653371
- age^42 -> 717653371 % 1<<18 = 165243
- gender^男*interest^篮球 -> ((27805 * 16777619) ^ 61583) % 1<<18 = 134056
- gender^男*gender^男 -> ((27805*16777619)^27805) % 1<<18 = 169914
- interest^篮球*interest^篮球 -> ((61583*16777619)^61583) % 1<<18 = 147858
- ```
Add Comment
Please, Sign In to add comment