So here we are now. It took a long time, but I’m back. And this project is alive and well!
So what’s the status of this project, you might ask? Well, I have downloaded 90% of all the files I could find at the links on Lars Balzer’s huge list of chess game websites. All the files I have collected and converted so far are what will be in the first release of the CGR database. For details on what’s going to be in those files, check the FAQ page!
As I write this, I have been processing all the raw PGN files with PGN-Extract for 9 hours now. Right before this, I had to decompress a zillion files in every file compression format that ever existed on this Earth! It was a nightmare!
The first pass (which I am doing right now) splits all files into 500 smaller files by determining the ECO code (A00 to E99) of each game’s opening. The data I have to process consists of 206 GB (yes, 206 gigabytes!!) of PGN files (17,251 files). My machine is currently processing 6.5 million games an hour, and preliminary estimates tell me the first pass should take almost 32 hours! Roughly speaking, so far every gigabyte averages 8.3 million games…
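For readers who want to picture what splitting by ECO code means, here is a minimal Python sketch. It is not what PGN-Extract does internally: PGN-Extract classifies the opening from the moves, while this sketch assumes every game already carries an `[ECO "..."]` tag and simply reads it. The function names `split_games` and `group_by_eco` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: the ECO code is already present as a tag, e.g. [ECO "B12"].
ECO_TAG = re.compile(r'^\[ECO\s+"([A-E]\d{2})', re.MULTILINE)


def split_games(pgn_text):
    """Yield each game as a string; a new game starts at an [Event tag."""
    current = []
    for line in pgn_text.splitlines():
        if line.startswith("[Event ") and current:
            game = "\n".join(current).strip()
            if game:
                yield game + "\n"
            current = []
        current.append(line)
    game = "\n".join(current).strip()
    if game:
        yield game + "\n"


def group_by_eco(pgn_text):
    """Return a dict mapping ECO code (or 'unknown') to a list of games."""
    buckets = defaultdict(list)
    for game in split_games(pgn_text):
        match = ECO_TAG.search(game)
        code = match.group(1) if match else "unknown"
        buckets[code].append(game)
    return dict(buckets)
```

Each key of the returned dictionary would correspond to one of the 500 output files (A00.pgn, A01.pgn, and so on). With 206 GB of input, a real run would stream games to disk rather than hold them in memory as this sketch does.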
The second pass will eliminate duplicates in each of those 500 files. Some further splitting could also be necessary afterwards, since some files (A00, for instance) are already at 9 GB after only 9 hours and cannot be loaded into Scid vs PC (one of the other tools I will use)!
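To give an idea of how duplicate elimination can work in principle, here is a small Python sketch. It is an assumption-laden illustration, not the algorithm used by PGN-Extract or Scid vs PC: it treats two games as duplicates when their main-line moves are identical once comments, NAGs, move numbers and the result are stripped. The helper names `movetext_key` and `drop_duplicates` are hypothetical, and variations in parentheses are not handled.

```python
import hashlib
import re


def movetext_key(game):
    """Hash the bare move sequence of one PGN game (simplified)."""
    # Keep only the movetext: drop tag lines such as [White "..."].
    lines = [l for l in game.splitlines() if not l.startswith("[")]
    text = " ".join(lines)
    text = re.sub(r"\{[^}]*\}", " ", text)  # remove {brace comments}
    text = re.sub(r"\$\d+", " ", text)      # remove NAGs such as $1
    # Remove the game result at the end of the movetext.
    text = re.sub(r"(1-0|0-1|1/2-1/2|\*)\s*$", " ", text.strip())
    text = re.sub(r"\d+\.+", " ", text)     # remove "1." and "1..."
    moves = " ".join(text.split())
    return hashlib.sha1(moves.encode("utf-8")).hexdigest()


def drop_duplicates(games):
    """Keep the first occurrence of each distinct move sequence."""
    seen = set()
    unique = []
    for game in games:
        key = movetext_key(game)
        if key not in seen:
            seen.add(key)
            unique.append(game)
    return unique
```

Comparing moves only is deliberately crude: two different games that happen to share the exact same moves (short draws, for example) would be merged, so a real tool would also look at players, dates and other tags.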
The third pass will correct names and some PGN tags, eliminate duplicates (yes, a second run is necessary since Scid vs PC is better at this than PGN-Extract!), filter out some more garbage, filter based on Elo, etc.
The fourth pass will aggregate those 500 files into something more manageable (I was thinking of ending up with 10 files, e.g. A00-49.pgn, A50-99.pgn, B00-49.pgn, etc.). I still have to work out how to split the result so that no file is too big and there are not too many files. The goal is to be able to load every file in Scid vs PC without busting the 16 million games limit.
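The ten-file idea can be sketched in a few lines of Python. This is only an illustration of the naming scheme mentioned above plus a check against the 16 million games limit. The per-ECO game counts passed in are hypothetical inputs, and `bucket_name`, `bucket_totals` and `buckets_over_limit` are names invented for this example.

```python
import re
from collections import Counter

# Limit quoted in the post for loading a file into Scid vs PC.
SCID_GAME_LIMIT = 16_000_000


def bucket_name(eco_code):
    """Map an ECO code such as 'B12' to its half-letter file name."""
    match = re.fullmatch(r"([A-E])(\d{2})", eco_code)
    if match is None:
        raise ValueError(f"not an ECO code: {eco_code!r}")
    letter, number = match.group(1), int(match.group(2))
    return f"{letter}00-49.pgn" if number < 50 else f"{letter}50-99.pgn"


def bucket_totals(games_per_eco):
    """Sum hypothetical per-ECO game counts into the ten target files."""
    totals = Counter()
    for code, count in games_per_eco.items():
        totals[bucket_name(code)] += count
    return dict(totals)


def buckets_over_limit(games_per_eco, limit=SCID_GAME_LIMIT):
    """List the target files that would exceed the game limit."""
    totals = bucket_totals(games_per_eco)
    return sorted(name for name, count in totals.items() if count > limit)
```

Any file name returned by `buckets_over_limit` would need a finer split than the two-halves-per-letter scheme.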
I will keep you posted as things progress…