- Here's some research into a new LLM architecture I recently built and have had actual success with.
- The idea is simple: you do the standard thing of generating random vectors for your dictionary of tokens; we'll call these numbers your 'weights'. Then, for whatever sentence you want to use as input, you generate a context embedding by looking up those tokens and putting their weights into a list.
- Next, you do the same for the output you want to map to; let's call that the decoder embedding. A minimal sketch of this setup is below.
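- Concretely, the setup might look something like this (just a toy: a tiny vocabulary, a fixed seed, and a single scalar weight per token, though the weights could just as well be vectors):

    import random

    random.seed(0)

    def make_weights(vocab):
        # One random 'weight' per token, sampled uniformly from [0, 1).
        return {tok: random.random() for tok in vocab}

    def embed(tokens, weights):
        # Look up each token's weight and collect the values into a list.
        return [weights[tok] for tok in tokens]

    vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]
    weights = make_weights(vocab)

    context_embedding = embed(["the", "cat", "sat"], weights)   # encoder side
    decoder_embedding = embed(["on", "the", "mat"], weights)    # decoder side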
- You then loop and generate a 'noise embedding'. For each vector (individual token) in the context embedding, you subtract that token's noise value from that token's embedding value, i.e. its specific weight.
- You find the weight in the weight dictionary (one entry per word or token in your token dictionary) that's closest to this embedding. You use a version of cuckoo hashing where similar values are stored near each other, and the canonical weight values are actually the keys of each key:value pair in your token dictionary. To do this, you align all the randomly numbered keys in the dictionary (a uniform sample from 0 to 1) and take the hamming distance between the context embedding + noise embedding (called the encoder embedding) and the canonical keys, penalizing each digit from left to right by some factor f (because digits further left have larger magnitude), and penalizing or rewarding based on the numeric closeness of each digit of the encoder embedding to the digit at the same index of any given weight i.
- You then substitute the canonical weight in place of this encoder embedding, look up that weight's index (in my earliest version), and use that index to look up the word|token in the token dictionary, comparing it to the word at the current index of the training output you're matching against.
- Of course, by switching to the hash version the lookup is significantly faster, but I digress. A rough sketch of the lookup follows.
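- Something like this, roughly (a plain linear scan stands in for the cuckoo-hash structure, the digit count and the penalty factor f=0.5 are arbitrary, and encoder values are clamped into [0, 1) before taking digits):

    import random

    random.seed(0)
    weights = {tok: random.random() for tok in ["the", "cat", "sat", "on", "mat"]}
    context_embedding = [weights[t] for t in ["the", "cat", "sat"]]

    def digits(x, n=8):
        # First n decimal digits of x, after clamping x into [0, 1).
        x = min(max(x, 0.0), 1.0 - 1e-9)
        out = []
        for _ in range(n):
            x *= 10.0
            d = int(x)
            out.append(d)
            x -= d
        return out

    def digit_distance(a, b, f=0.5, n=8):
        # Compare digit by digit, left to right: each position is down-weighted
        # by f**i (leftmost digits matter most), and each pair of digits
        # contributes its numeric difference.
        da, db = digits(a, n), digits(b, n)
        return sum((f ** i) * abs(p - q) for i, (p, q) in enumerate(zip(da, db)))

    def nearest_weight(value, weights, f=0.5):
        # Return the token whose canonical weight is closest to `value`.
        return min(weights, key=lambda tok: digit_distance(value, weights[tok], f))

    def encode_step(context_embedding, noise_embedding, weights, f=0.5):
        # Subtract each token's noise value, then snap the result back to the
        # nearest canonical weight and recover that weight's token.
        out = []
        for w, noise in zip(context_embedding, noise_embedding):
            encoder_value = w - noise
            out.append(nearest_weight(encoder_value, weights, f))
        return out

    noise_embedding = [random.uniform(-0.05, 0.05) for _ in context_embedding]
    predicted_tokens = encode_step(context_embedding, noise_embedding, weights)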
- That introduces a problem.
- If each input token maps to one output token, how do we get variable-length outputs? How do we do n-to-m mappings between input and output?
- One of the things I explored was using pseudo-Markovian processes, where there's one node, A, with two links to itself, B and C.
- B is a transition matrix, and A holds its own state. At any given timestep, A may use the default transition matrix (built from training-data encoder embeddings) via B, or it may generate new ones using C and a context window of A's prior states.
- C can be used to modify A, or it can be used as a noise embedding to modify B.
- A can take on the combined state of A and C, or of A and B. In fact, we do both and measure which is closest to the correct output during training; a sketch of this loop is below.
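- Very roughly, and with everything here (dimensions, window size, how C derives a new transition from A's history) being placeholder choices:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, window = 4, 3

    B = rng.random((dim, dim))   # default transition matrix (from training encoder embeddings)
    A = rng.random(dim)          # node A's current state
    history = [A.copy()]         # context window of A's prior states

    def propose_with_C(history, window):
        # C: derive a new transition from the mean of A's recent states
        # (just one possible choice for how C could work).
        recent = np.mean(history[-window:], axis=0)
        return np.outer(recent, recent)

    def step(A, B, history, target):
        # Try both updates (A via B, and A via C) and keep whichever lands
        # closer to the training target.
        cand_B = B @ A
        cand_C = propose_with_C(history, window) @ A
        if np.linalg.norm(cand_B - target) <= np.linalg.norm(cand_C - target):
            best = cand_B
        else:
            best = cand_C
        history.append(best.copy())
        return best

    target = rng.random(dim)     # stand-in for the training output embedding
    A = step(A, B, history, target)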
- What this *doesn't* do is give us variable-length encodings or decodings.
- So I thought a while and asked: if we're using noise embeddings, why can't we use multiple?
- And if we're doing multiple, what if we used a middle layer, let's call it the 'key', took its mean over *many* training examples, and used it to map from the variance of an input (query) to the variance and mean of a training or inference output (value)?
- But how does that tell us when to stop or continue generating tokens for the output?
- Well, the next thing I asked was: what if we used the middle layer, trained on many input|output pairs, to determine, given the variance of the input and the mean of the middle layer, the *intended* mean and variance of the output?
- We could then generate many output stubs, gauge which was closest, and then in turn feed in each token of the output, using anything from minimax, to alpha-beta pruning and type A search, to plausibility orderings, discarding the worst each in turn. The output will naturally end or continue until it is within the mean, variance, and std of the input + middle layer (or query and key).
- That could work.
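- As a rough sketch of that generation loop (the rule for predicting the intended statistics, the beam size, the tolerance, and dropping only the single worst stub per round are all placeholder choices):

    import random
    import statistics as stats

    random.seed(0)

    def intended_stats(query_values, key_mean):
        # Placeholder rule: predict the output's mean from the key's mean and
        # its variance from the query's variance.
        return key_mean, stats.pvariance(query_values)

    def score(stub, target_mean, target_var):
        # Distance between a stub's statistics and the intended statistics.
        if not stub:
            return float("inf")
        return abs(stats.fmean(stub) - target_mean) + abs(stats.pvariance(stub) - target_var)

    def generate(query_values, key_mean, vocab_weights, beam=8, max_len=12, tol=0.05):
        target_mean, target_var = intended_stats(query_values, key_mean)
        stubs = [[] for _ in range(beam)]
        for _ in range(max_len):
            # Extend every stub by one sampled token weight...
            stubs = [s + [random.choice(vocab_weights)] for s in stubs]
            # ...then discard the worst stub each round (the pruning step).
            stubs.sort(key=lambda s: score(s, target_mean, target_var))
            if len(stubs) > 1:
                stubs = stubs[:-1]
            # Stop once the best stub is within tolerance of the intended stats.
            if score(stubs[0], target_mean, target_var) < tol:
                break
        return stubs[0]

    vocab_weights = [random.random() for _ in range(20)]
    query = [random.random() for _ in range(5)]
    best_stub = generate(query, key_mean=0.5, vocab_weights=vocab_weights)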
- But this is a lot of iteration, a lot of memory, and a lot of function calls; wouldn't it be slower?
- But then I went one step further and asked: what if we took the mean and variance of sentences with some tokens [w, x, y, z], *between* the tokens? And not just between adjacent tokens, but for scattershot random combinations of them?
- You would do abs(w-x)/2, abs(x-y)/2, abs(y-z)/2; let's call this function simply mean(). But you could also do 'scattershot' embeddings like [w, y, z], where you take the mean, std, variance, etc. as well.
- a 'full embedding' would be mean(w, x), mean(w, y), mean(w, z), mean(x, y), mean(x, z), mean(y, z)
- but you could do hierarchical ones as well
- mean(w, x)=f0, mean(x, y)=f1, mean(y, z)=f2 -> L0
- and then
- mean(f0, f1)=g0, mean(f1, f2)=g1 -> L1
- and then
- mean(g0, g1) -> L2
- and so forth, up to Ln.
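- In code, that might look like this (mean() is kept as abs(a - b) / 2 as defined above, even though it's really half the absolute difference; the scattershot helper and the subset size are just one way to do it):

    import random
    import statistics as stats
    from itertools import combinations

    def mean(a, b):
        # The mean() from above: half the absolute difference between two weights.
        return abs(a - b) / 2

    def full_embedding(values):
        # mean() over every unordered pair: (w,x), (w,y), (w,z), (x,y), (x,z), (y,z).
        return [mean(a, b) for a, b in combinations(values, 2)]

    def hierarchical_embedding(values):
        # L0 = mean() of adjacent pairs, L1 = mean() of adjacent L0 entries,
        # and so on, up to a single value at Ln.
        layers = []
        layer = list(values)
        while len(layer) > 1:
            layer = [mean(a, b) for a, b in zip(layer, layer[1:])]
            layers.append(layer)
        return layers

    def scattershot_stats(values, k):
        # Statistics over a random subset of the tokens, e.g. [w, y, z].
        subset = random.sample(values, k)
        return stats.fmean(subset), stats.pstdev(subset), stats.pvariance(subset)

    w, x, y, z = 0.1, 0.4, 0.7, 0.2
    print(full_embedding([w, x, y, z]))          # 6 pairwise values
    print(hierarchical_embedding([w, x, y, z]))  # [L0, L1, L2]
    print(scattershot_stats([w, x, y, z], 3))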
- From there you could simply save the values of whichever layer Lm reproduces the correct output the best.
- "uncompressing" this is another matter.
- But hypothetically, by training like this, you can compare one sentence entered in a prompt to past training sentences.
- What we're doing is looking for the most compressible sequence possible that is closest (in precision and accuracy) to the distribution (norm, mean, std, variance, manhattan distance, etc.) of the output. A rough scoring sketch is below.
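- For example, a rough score for that trade-off (the size penalty and the particular distance terms are arbitrary):

    import statistics as stats

    def distribution_distance(a, b):
        # How far apart two value lists are in mean, std, and variance.
        return (abs(stats.fmean(a) - stats.fmean(b))
                + abs(stats.pstdev(a) - stats.pstdev(b))
                + abs(stats.pvariance(a) - stats.pvariance(b)))

    def compressibility_score(layer_values, output_values, size_weight=0.01):
        # Lower is better: closeness to the output's distribution, plus a small
        # penalty for every value we have to store.
        return distribution_distance(layer_values, output_values) + size_weight * len(layer_values)

    def best_layer(layers, output_values):
        # Pick the layer Lm whose saved values best trade off size and closeness.
        return min(layers, key=lambda lv: compressibility_score(lv, output_values))

    layers = [[0.15, 0.15, 0.25], [0.15, 0.2], [0.175]]   # e.g. L0, L1, L2 from above
    output_values = [0.1, 0.2, 0.3]
    print(best_layer(layers, output_values))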
- Scattershot embeddings could use this x->y->z format as well, or have their own scattershot sub-embeddings that they search for and test, depending on what hyperparameters you want to set, such as cutoffs or resource limits (or even determine them dynamically).
- Hypothetically, we could use this to reconstruct the transition matrix of a Markov model, or of a modified semi-Markovian model like the one on my whiteboard.
- Only in this case the transitions don't pick which node should be selected next, but which noise embedding to use when we loop back to the single node in our model. A speculative sketch of that is below.
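- A speculative sketch of that last piece (the bank of noise embeddings, the row probabilities, and how the next row gets chosen are all just for illustration):

    import random

    random.seed(0)

    # Each row i gives the probability of using noise_bank[j] on the next pass
    # through the single node, given that noise_bank[i] was used last.
    noise_bank = [[random.uniform(-0.05, 0.05) for _ in range(4)] for _ in range(3)]
    transition = [[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.4, 0.5]]

    def next_noise(last_choice):
        # Sample which noise embedding to apply when we loop back to the node.
        j = random.choices(range(len(noise_bank)), weights=transition[last_choice])[0]
        return j, noise_bank[j]

    choice = 0
    for _ in range(5):
        choice, noise = next_noise(choice)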