Hello Benoit,

Thank you very much for your input! I really like your suggestion about the more efficient approach and I will add it to the code :D
I'm now finishing a final run on the webspam dataset and will send a link tonight with all the results combined in some form.

I do have a question, though: I got a bit confused about the hashing process and would like to clarify a few things.

For n-grams/tokens we had the following pseudo-algorithm:

for f in tokens {
    h_idx = Hash( f, seed) % target_dim;
    vec[h_idx]++;
}

Now, for numerical features, are we supposed to do the following instead? (I ask because I have still been following the procedure above.)

for ( i=0; i<data.size(); i++) {
    h_idx = Hash( i, seed:i ) % target_dim;
    vec[h_idx] += data[i];    // instead of +1
}

If so, would the quadratic features then take the following form?

for ( i=0; i<data.size(); i++) {
    for ( j = i; j < data.size(); j++ )  {
        h_idx = Hash( i * data.size() + j, seed: i * data.size() + j ) % target_dim;
        vec[h_idx] += data[i] * data[j];
    }
}
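
To make the three cases concrete, here is a minimal Python sketch of how I currently understand them. It is only my reading of the procedure, not a confirmed implementation: seeded_hash is a stand-in I made up (BLAKE2b over the seed and the key) for whatever Hash we actually use, and the per-feature seeds simply mirror the pseudocode above.

```python
import hashlib


def seeded_hash(key, seed):
    # Hypothetical stand-in for Hash(key, seed): a deterministic 64-bit value.
    data = f"{seed}:{key}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def hash_tokens(tokens, target_dim, seed=0):
    # Token / n-gram case: each occurrence adds 1 to its bucket.
    vec = [0] * target_dim
    for f in tokens:
        vec[seeded_hash(f, seed) % target_dim] += 1
    return vec


def hash_numeric(data, target_dim):
    # Numerical case as I read it: the feature index is hashed
    # (seeded by the index itself) and the feature value is added.
    vec = [0.0] * target_dim
    for i, x in enumerate(data):
        vec[seeded_hash(i, seed=i) % target_dim] += x
    return vec


def hash_quadratic(data, target_dim):
    # Quadratic case as I read it: every pair (i, j) with j >= i gets the
    # flat index i * n + j, and the product of the two values is added.
    n = len(data)
    vec = [0.0] * target_dim
    for i in range(n):
        for j in range(i, n):
            pair = i * n + j
            vec[seeded_hash(pair, seed=pair) % target_dim] += data[i] * data[j]
    return vec
```

For example, with data = [1.0, 2.0, 3.0] the entries of hash_quadratic(data, 8) sum to 25.0, which is the sum of data[i] * data[j] over all j >= i, regardless of which buckets the pairs land in.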

I hope it's not too confusing.
Looking forward to your reply.

Vangelis