After waiting for months, I've finally been given access to the .com TLD zone file! (And the .net, but who cares about .net, right?)
So what is a zone file, you ask? Basically, this file keeps track of how to access every .com domain name in the entire world. Well, technically not all of them, just the ones that have name servers associated with them, but practically all of them. So how many .com's are there in the zone file? Over 88 million! Holy cow that is a lot of domains!
So the first problem is how to use the zone file. It is a 6.5 GB text file, so you can't just open it up and say "hey, is jacoballred.com taken?" and expect a quick reply. On top of that, it isn't even designed to give you a list of taken domain names; that's just a happy side effect of keeping track of how to access all the .com's in the world.
My solution was to preprocess the data using a few Linux utilities, then load it into MySQL.
The zone file looks a little like this:
NS E.GTLD-SERVERS.NET.
NS M.GTLD-SERVERS.NET.
$TTL 172800
ENERCONTECHNOLOGIES NS NS1.BIZ.RR
ENERCONTECHNOLOGIES NS NS2.BIZ.RR
SELF-DRIVE-CAR-RENTAL NS NS3.IZP
None of the domains in the file have .com on the end, and each one lists a nameserver after it. There are also lines I don't care about, like the zone's own nameserver records at the top, and other random markers in the file ($TTL). All I want are the domain names, so I use Linux to strip out the stuff I don't want:
sed -e '/^[^A-Z0-9]/d' -e '/^$/d' -e 's/ .*$//' -e '/[^A-Z0-9-]/d' com.zone \
| sort -u \
| awk -F "" '{close(f);f=$1}{print > "com.zone.split."f}'
Wow, doesn't that look fun? So let's go over it.
That first line uses sed to read the zone file (com.zone) and remove every line that doesn't start with A-Z or 0-9 (the only valid characters for the first character of a .com domain). Then it removes blank lines, strips everything after the first word on each line (getting rid of the nameserver that follows the domain name), and finally deletes any line containing characters that aren't allowed in a domain (anything other than A-Z, 0-9, or a dash). This leaves a list of JUST the domain names (without the .com), but it has duplicates and isn't in any particular order.
The next line sorts the list of domains and removes duplicates.
The last line uses awk to split the list of domains into 36 separate files, one for each starting character (A-Z, 0-9). This isn’t technically needed but makes things more convenient.
My server is pretty wussy (1 GB of RAM), so I'm preprocessing on my much faster desktop at home, which has 8 GB. I kick off that command, and 20 minutes later I have files ready to be loaded into MySQL.
My table structure is pretty basic. I have one table for each letter/number (for performance), each with a numeric primary key and a varchar for the domain name. So I run this for each letter and number:
DROP TABLE IF EXISTS `com_zone`.`com_A`;
CREATE TABLE `com_zone`.`com_A` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `name` VARCHAR(255) NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;
Next I use LOAD DATA INFILE to quickly pull in the data:
LOAD DATA INFILE '/path/to/file/com.zone.split.A' INTO TABLE `com_A` (name);
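If you don't feel like running those statements by hand 36 times, a little shell loop can generate them for you. Here's a minimal sketch, assuming the mysql command-line client can log in without prompting and that the MySQL user has the FILE privilege needed for LOAD DATA INFILE (the /path/to/file part is the same placeholder as above):
# loop over every valid first character of a .com domain
for c in {A..Z} {0..9}; do
  mysql com_zone -e "
    DROP TABLE IF EXISTS \`com_${c}\`;
    CREATE TABLE \`com_${c}\` (
      \`id\` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      \`name\` VARCHAR(255) NOT NULL
    ) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;
    LOAD DATA INFILE '/path/to/file/com.zone.split.${c}' INTO TABLE \`com_${c}\` (name);"
done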
This step took about 5 minutes total for all the processed files. It is super super fast, but we still have one step left. Without an index on the name field, queries are really slow (about 2 seconds for a single domain). So we add an index to each table:
ALTER TABLE `com_A` ADD UNIQUE `name` (`name`(255));
This step was painfully slow, about 40 minutes, but once it was done I could do pretty much any query in a fraction of a second.
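For example, remember the "hey, is jacoballred.com taken?" question from the beginning? That's now just a lookup in the table for the domain's first letter. A quick sketch (keep in mind the names are stored uppercase, without the .com):
-- 1 means the domain is in the zone file; 0 means it is almost certainly available
SELECT COUNT(*) FROM `com_zone`.`com_J` WHERE `name` = 'JACOBALLRED';
A count of 0 isn't a 100% guarantee, since registered domains without nameservers aren't in the file, but it's close enough for my purposes.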
The final step was to turn off MySQL on my desktop, copy the MyISAM files to my server, then restart MySQL on the server so it could use them. Woot! I now have nearly every .com in the world on my server, ready to tell any web app I want whether a domain is available, with a high degree of confidence. I have a couple really fun webpages in the works that will use this.
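This copy trick works because MyISAM tables are self-contained files (.frm, .MYD, .MYI) that can be moved between servers. A rough sketch, assuming the default /var/lib/mysql data directory on both machines (the service name, paths, and user@myserver hostname are placeholders that will vary with your setup):
# on the desktop: stop MySQL so the table files aren't being written
sudo service mysql stop
# copy the com_zone database's table files over (user@myserver is a placeholder)
scp /var/lib/mysql/com_zone/com_*.* user@myserver:/tmp/
# on the server: stop MySQL, move the files into place, fix ownership, restart
sudo service mysql stop
sudo mv /tmp/com_*.* /var/lib/mysql/com_zone/
sudo chown mysql:mysql /var/lib/mysql/com_zone/com_*.*
sudo service mysql start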
Well, that was a bit of a ramble, but it should be enough to get someone else in my position on the road to domain goodness!
Update 11/6/2013: A visitor asked me to create a command to parse the .INFO zone file. It's the same pipeline with one extra sed expression at the front to strip the trailing .INFO. from each record. Here you go:
sed -e 's/\.INFO\.//g' -e '/^[^A-Z0-9]/d' -e '/^$/d' -e 's/ .*$//' -e '/[^A-Z0-9-]/d' info.zone \
| sort -u \
| awk -F "" '{close(f);f=$1}{print > "info.zone.split."f}'