RC Week 8: Life happens, and databases are hard
Saturday, November 12, 2022
I'm two-thirds of the way done with my RC batch now. Eight weeks down, four weeks to go.
The last two weeks have been difficult for me because of life happening. Week 7 was hard because I had some travel to help my parents, and that just takes me out of my routine and is generally stressful. It was good, and I am very glad that I had the opportunity to help. But it was still a lot, and it made it hard to make significant progress on my project. And week 8 was hard because I finally caught the cold that's been going through my household. My brain was just... fuzzy this week.
In that vein, I spent the last two weeks mostly focused on figuring out how to build an inverted index over all the unique positions in a collection of chess games I have. To make it concrete, this is 3.8 million games and about 240 million unique positions (north of 300 million before dedup). I ran in circles, and I think it was a combination of:
- Life happened so I wasn't at my best
- Databases are hard
I'm not upset, though! By trying a lot of dead-ends, I got to understand the problem space more deeply. For example, I learned that:
- Chess positions cannot be represented in a fixed-size struct of less than ~29 bytes1 (and that is a maybe)
- Rust's HashMap implementation is based on a hashmap created at Google (video explaining how it works)
- Data-oriented design gives many practical benefits in structuring data to reduce overhead, such as storing a struct of lists instead of a list of structs
- 64-bit hashes can handle a lot of elements (4 billion-ish?) before you expect to see your first collision
So after these last two weeks, I've finally gotten the index built! And it saves it out to the disk, which I can load back in to quickly find the games which contain a given position. (How quickly is yet to be seen—I'll benchmark it, and have a few ideas for improvements if it's slower than necessary.)
Now I'm moving forward on building an application on top of this index. I'm going to first make an opening tree explorer, where you can click through from the beginning of a game and see how many games occurred with that position, the results from there, and drill into a (partial) list of the actual games containing that position. This will require building out a basic frontend (entirely HTML/CSS for now! I don't think this needs much, if any, JS), and it will also require adding some additional basic indexes, like bitmaps of game results.
Next week, I'm hoping to have something that I can demo! It will be rough-and-ready, but it'll be the start, and then I can spend a few more weeks adding in more interesting query support and more filters on the games. Long-term, I think that isabella-db can fill a gap in chess tooling today by making it possible to query for really interesting sequences of positions in games, like where sacrifices occur or where tactics are available. (This will likely also require integrating with an engine!)
I want to get more folks involved in this project, and the sooner the better. If you're interested in being an alpha user or helping with the queries and indexing, please reach out to me by email (or on Zulip, if you're a Recurser). I'm excited to see what I learn via this project for the rest of my batch, and where it goes after that!
See you all next week!
This blog post used to say that it couldn't be less than 36 bytes. I think that might be true if you use 1 byte per piece and then bit-pack the rest of the information, but a fellow Recurser and I worked out that it can indeed be smaller. Right now speculatively we think it can get down to 29 bytes, but I'm not about to write an implementation to prove it.