In this section I illustrates a typical use case for ECN-Spider. I will highlight how the various scripts that make up ECN-Spider work together.
First, I obtain a CSV data file with a list of domain names and traffic rank information that I would like to test. To use Alexa’s list of the top 1 million domains, I did this:
ecn$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip ecn$ unzip ./top-1m.csv.zip
The list has the following format:
1,google.com 2,facebook.com 3,youtube.com 4,yahoo.com 5,baidu.com 6,wikipedia.org 7,qq.com 8,twitter.com 9,linkedin.com 10,taobao.com
A records consists of a rank and a domain name. The rank is only used for the analysis at the very end, and is stored together with the domain name through the processing in all scripts.
For most tests, I choose not to use the entire domain name list. Using the script new_subset.py, I can extract a shorter list of two parts: the first n unique domains, and m randomly selected unique domains from the remainder.:
ecn$ python ./new_subset.py 50000 50000 ./top-1m.csv ./subset.csv
Note that this script should always be used (even when using the complete input list and not a subset), since this script not only does subset selection: it also does some clean-up and other minor manipulation of the list. If this script is not used, the analysis at the end may produce incorrect results.
The main testing script ecn_spider.py expects an input file with domain names and IP addresses they resolve to. The script resolution.py takes an input file and runs address resolution on the domain names therein:
ecn$ python ./resolution.py --workers 10 --www preferred ./subset.csv ./resolved.csv
With input files like Alexa’s top 1M list, resolved.csv will now contain many duplicate IP addresses, due to many popular websites being hosted on CDNs that share an IP address between multiple sites. The script unique.py ensures that both the IPv4 and IPv6 addresses of the resolved domain names are unique. Non-unique IP addresses may lead to erroneous results in the analysis.
ecn$ python ./unique.py ./resolved.csv ./input.csv
The list now has the following format:
1,www.google.com,18.104.22.168,2a00:1450:400a:804::1013 2,www.facebook.com,22.214.171.124,2a03:2880:f01b:1:face:b00c:0:1 3,www.youtube.com,126.96.36.199,2a00:1450:400a:804::1002 4,www.yahoo.com,188.8.131.52,2a00:1288:f006:1fe::3001 5,www.baidu.com,184.108.40.206, 6,www.wikipedia.org,220.127.116.11,2620:0:862:ed1a::1 7,www.qq.com,18.104.22.168, 8,www.twitter.com,22.214.171.124, 9,www.linkedin.com,126.96.36.199, 10,www.taobao.com,188.8.131.52,
Note that in this particular example, the option --www preferred for the resolution script has led to most domains in input.csv to now have a prepended www..
Now that the input file has been prepared, I can run ecn_spider. Before I start ECN-Spider, I run tcpdump as root in a separate shell, to capture all TCP packet headers for later analysis:
root$ tcpdump -ni eth0 -w ./ecn_spider.pcap -s 128
ecn$ python ./ecn_spider.py --verbosity INFO --workers 64 --timeout 4 ./input.csv ./retry.csv ./ecn-spider.csv ./ecn-spider.log
The rate at which ECN-Spider tests domains varies greatly with the number of worker threads used for testing. This number can be adjusted with the command line option --workers. Of course, the rate also depends on the the round-trip time to the tested domains and the value of the --timeout option.
To find the optimal number of workers, the script simple_bench.sh can be used.