- As for the first question, it's mainly an issue of price and of context window size. When converted to plaintext and trimmed down, the full documentation for Ithkuil is about 600k tokens, so you'd pay for 600k input tokens on every single question (this is all done over the API, not through a web interface with a fixed monthly price). The top performer, Opus 4, has a context window of only 200k tokens, so you can't even ask Opus 4 questions this way. For a model with a bigger context window, like o3, we can work out the price: each question uses 600k input tokens, and o3's price per million input tokens is $1 (normally $2, halved by input caching), so each question costs $0.60. There are 301 questions, so running the benchmark once would cost a bare minimum of $180.60, for this one model alone, before it even starts to answer. (See the cost sketch after this list.)
- As for web searches, all the models I tested *can* be given this ability, but over the API, web search is explicitly opt-in and costs extra money every time it's used. I didn't enable it for any of the models tested.
- Even if it had been feasible to feed in the full language documentation for every question, I would've chosen not to. That would basically reduce the task to the "needle in a haystack" problem, which is already well researched in this field and which language models are known to handle: https://arxiv.org/abs/2406.11230 (an older paper; tl;dr is that the tested models performed quite strongly, and modern models do even better). It would essentially trivialize the test.
- The benchmark was created by having o3-high do two passes over each individual section of the documentation (sketched in code after this list). In the first pass, it's asked to generate questions, listing 1 correct and 3 incorrect answers for each. In the second pass, it's shown its previously written questions alongside the same section of documentation, without being told it was the one who wrote them, and asked to verify that the questions make sense. This method of benchmark generation isn't perfect and *can* still let hallucinated questions and answers into the dataset, but the models' performance on this benchmark scales in line with their scaling on other hard benchmarks. That leads me to believe the questions are at least *mostly* valid, which, frankly, is probably a better result than I would've achieved writing the questions myself. That all being said, the questions and answers are all publicly available, and each one lists the section of the docs it was written from, so you're welcome to look them over and let me know if any are totally wrong.
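
For the pricing point in the first bullet, here's a minimal sketch of the arithmetic. The token count, prices, and question count are the figures quoted above; the function itself is just illustrative, not anything I actually ran.

def full_context_cost(tokens_per_question: int,
                      price_per_million_input: float,
                      num_questions: int) -> float:
    """Input-token cost of sending the whole documentation with every question."""
    return tokens_per_question / 1_000_000 * price_per_million_input * num_questions

# o3: $2 per 1M input tokens, halved to $1 with input caching.
cost = full_context_cost(
    tokens_per_question=600_000,   # full Ithkuil docs, trimmed to plaintext
    price_per_million_input=1.00,  # o3 cached-input rate
    num_questions=301,
)
print(f"${cost:.2f}")  # -> $180.60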
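
And here's a rough sketch of the two-pass generation described in the last bullet, assuming the OpenAI Python SDK. The prompts, helper names, and JSON shape are my own illustrative guesses, not the actual script used to build the benchmark.

import json
from openai import OpenAI

client = OpenAI()

def ask_o3(prompt: str) -> str:
    # Single call to o3 with high reasoning effort (assumed parameter value).
    resp = client.chat.completions.create(
        model="o3",
        reasoning_effort="high",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def generate_questions(section_text: str) -> list[dict]:
    """Pass 1: draft multiple-choice questions (1 correct + 3 incorrect answers each)."""
    prompt = (
        "Based only on the documentation section below, write multiple-choice "
        "questions. For each, give 1 correct answer and 3 incorrect answers, "
        "as a JSON list of {question, correct, incorrect}.\n\n" + section_text
    )
    return json.loads(ask_o3(prompt))

def verify_questions(section_text: str, questions: list[dict]) -> list[dict]:
    """Pass 2: re-check the drafts against the same section, without saying who wrote them."""
    prompt = (
        "Here is a documentation section and a set of multiple-choice questions "
        "someone wrote about it. Keep only the questions whose correct answer is "
        "actually supported by the section; return them in the same JSON format.\n\n"
        f"SECTION:\n{section_text}\n\nQUESTIONS:\n{json.dumps(questions)}"
    )
    return json.loads(ask_o3(prompt))

# benchmark = [q for sec in sections for q in verify_questions(sec, generate_questions(sec))]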