Welcome to ECN-Spider’s documentation!

Contents:

Setting up ECN-Spider on a Machine

This section gives an overview of compiling the current Python version from source, and of setting up an unprivileged account with exactly the right permissions to modify the kernel’s ECN-related behavior (which normally only root can do).

The following instructions have been tested on Ubuntu 14.04 LTS. Ubuntu 14.04 ships with Python 3.4 by default, but for demonstration purposes Python 3.4 is compiled from source here.

Setting up a User Account

To run ECN-Spider, I create a separate user account.

root$ adduser ecn --disabled-password

Since I am only accessing this account by su, I will not allow password logins.

To give the user ecn the privileges to change the ECN behavior, the configuration file for sudo has to be adjusted. The configuration file is edited with the visudo program:

root$ visudo

The following listing shows the complete configuration file (with some comments removed) after editing:

Defaults        env_reset
Defaults        mail_badpass
Defaults        secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# User privilege specification
root    ALL=(ALL:ALL) ALL
ecn ALL=NOPASSWD: /sbin/sysctl -w net.ipv4.tcp_ecn=[0-2]

# Members of the admin group may gain root privileges
%admin ALL=(ALL) ALL

# Allow members of group sudo to execute any command
%sudo   ALL=(ALL:ALL) ALL

The line starting with ecn... was added to the existing configuration. Now, the user ecn can change the necessary settings:

root$ su - ecn
ecn$ /sbin/sysctl net.ipv4.tcp_ecn
net.ipv4.tcp_ecn = 2
ecn$ sudo /sbin/sysctl -w net.ipv4.tcp_ecn=0
net.ipv4.tcp_ecn = 0

Note that changing this setting using sysctl affects all TCP connections created with the kernel’s network stack.
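
The sysctl calls shown above could be issued from Python roughly as follows. This is only a sketch: the function names sysctl_set_command, set_ecn and get_ecn are made up for illustration and are not taken from ECN-Spider. It assumes the sudoers rule from this section is in place.

```python
import subprocess

SYSCTL = "/sbin/sysctl"        # same path as in the sudoers rule
ECN_KEY = "net.ipv4.tcp_ecn"


def sysctl_set_command(value):
    """Build the exact command line that the sudoers rule permits."""
    if value not in (0, 1, 2):
        raise ValueError("tcp_ecn must be 0, 1 or 2, got %r" % (value,))
    return ["sudo", SYSCTL, "-w", "%s=%d" % (ECN_KEY, value)]


def set_ecn(value):
    """Change the kernel's ECN mode; raises CalledProcessError on failure."""
    subprocess.check_call(sysctl_set_command(value))


def get_ecn():
    """Read the current ECN mode (reading does not need sudo)."""
    out = subprocess.check_output([SYSCTL, "-n", ECN_KEY])
    return int(out.decode().strip())
```

Because the sudoers rule matches the full command line, the arguments have to be passed exactly in this form; any other sysctl key would prompt for a password.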

Setting up Python

ECN-Spider requires Python 3.4. Since this version is not yet packaged for many Linux distributions, I compile it from source. Compiling Python from source also provides matching versions of the pyvenv and pip utilities. The latter is required to install ECN-Spider’s dependencies.

First, I download Python’s source code and unpack it:

ecn$ wget https://www.python.org/ftp/python/3.4.1/Python-3.4.1.tar.xz
ecn$ tar xf Python-3.4.1.tar.xz

Some of Python’s optional dependencies should be installed:

root$ apt-get install build-essential libbz2-dev libsqlite3-dev libreadline-dev zlib1g-dev libncurses5-dev libssl-dev libgdbm-dev liblzma-dev tk-dev

Now, Python can be compiled:

ecn$ cd Python-3.4.1
ecn$ ./configure --prefix=/home/ecn/bin/Python-3.4.1
ecn$ make
ecn$ make test
ecn$ make install
ecn$ cd ~

Setting up the Environment for ECN-Spider

I run ECN-Spider in a virtual environment. It is set up as follows:

ecn$ bin/Python-3.4.1/bin/pyvenv ~/ecnsenv
ecn$ source ecnsenv/bin/activate
(ecnsenv) ecn$ python --version
Python 3.4.1

ECN-Spider has a few dependencies that need to be installed as well. This can be done using the Python package manager pip:

(ecnsenv) ecn$ pip install psutil dnspython3

Note that dnspython is not the same thing as dnspython3.

How To Use ECN-Spider

In this section I illustrate a typical use case for ECN-Spider. I will highlight how the various scripts that make up ECN-Spider work together.

Getting The Input Ready

First, I obtain a CSV data file with a list of domain names and traffic rank information that I would like to test. To use Alexa’s list of the top 1 million domains, I did this:

ecn$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
ecn$ unzip ./top-1m.csv.zip

The list has the following format:

1,google.com
2,facebook.com
3,youtube.com
4,yahoo.com
5,baidu.com
6,wikipedia.org
7,qq.com
8,twitter.com
9,linkedin.com
10,taobao.com

A record consists of a rank and a domain name. The rank is only used for the analysis at the very end, and is stored together with the domain name throughout the processing in all scripts.

For most tests, I choose not to use the entire domain name list. Using the script new_subset.py, I can extract a shorter list made up of two parts: the first n unique domains, and m randomly selected unique domains from the remainder:

ecn$ python ./new_subset.py 50000 50000 ./top-1m.csv ./subset.csv

Note that this script should always be used, even when working with the complete input list rather than a subset. Besides selecting the subset, it also cleans up the list and applies other minor manipulations. If this script is not used, the analysis at the end may produce incorrect results.
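
The selection rule described above (the first n unique domains plus m random unique domains from the remainder) can be sketched as follows. This is not the code of new_subset.py: the function name select_subset and the seed parameter are hypothetical, and the additional clean-up performed by the real script is not reproduced here.

```python
import random


def select_subset(records, n, m, seed=None):
    """Return the first n unique domains plus m random ones from the rest.

    records is an iterable of (rank, domain) pairs in rank order.
    """
    seen = set()
    unique = []
    for rank, domain in records:
        if domain not in seen:          # keep only the first occurrence
            seen.add(domain)
            unique.append((rank, domain))
    head, rest = unique[:n], unique[n:]
    rng = random.Random(seed)           # seed only makes runs repeatable
    tail = rng.sample(rest, min(m, len(rest)))
    tail.sort(key=lambda rec: int(rec[0]))  # restore rank order
    return head + tail
```

Sorting the random part by rank is a design choice of this sketch so that the output stays in the same order as the input list.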

The main testing script ecn_spider.py expects an input file with domain names and IP addresses they resolve to. The script resolution.py takes an input file and runs address resolution on the domain names therein:

ecn$ python ./resolution.py --workers 10 --www preferred ./subset.csv ./resolved.csv

With input files like Alexa’s top 1M list, resolved.csv will now contain many duplicate IP addresses, due to many popular websites being hosted on CDNs that share an IP address between multiple sites. The script unique.py ensures that both the IPv4 and IPv6 addresses of the resolved domain names are unique. Non-unique IP addresses may lead to erroneous results in the analysis.

ecn$ python ./unique.py ./resolved.csv ./input.csv

The list now has the following format:

1,www.google.com,173.194.40.52,2a00:1450:400a:804::1013
2,www.facebook.com,31.13.91.2,2a03:2880:f01b:1:face:b00c:0:1
3,www.youtube.com,173.194.40.32,2a00:1450:400a:804::1002
4,www.yahoo.com,46.228.47.115,2a00:1288:f006:1fe::3001
5,www.baidu.com,180.76.3.151,
6,www.wikipedia.org,91.198.174.192,2620:0:862:ed1a::1
7,www.qq.com,80.239.148.10,
8,www.twitter.com,199.16.156.38,
9,www.linkedin.com,108.174.2.129,
10,www.taobao.com,195.27.31.241,

Note that in this particular example, the option --www preferred of the resolution script has caused most domains in input.csv to be prefixed with www.

Running The Test

Now that the input file has been prepared, I can run ecn_spider. Before I start ECN-Spider, I run tcpdump as root in a separate shell, to capture all TCP packet headers for later analysis:

root$ tcpdump -ni eth0 -w ./ecn_spider.pcap -s 128

And now:

ecn$ python ./ecn_spider.py --verbosity INFO --workers 64 --timeout 4 ./input.csv ./retry.csv ./ecn-spider.csv ./ecn-spider.log

This run creates three output files:
retry.csv:
This file is used as the input file for later runs of ecn_spider and contains only the IP addresses that had problems during this test run.
ecn-spider.csv:
This file contains the collected test data used for further analysis.
ecn-spider.log:
This file contains human-readable log data useful for debugging. It is not needed for normal use of the tools of ECN-Spider.

Benchmarking the --workers parameter

The rate at which ECN-Spider tests domains varies greatly with the number of worker threads used for testing. This number can be adjusted with the command line option --workers. The rate also depends on the round-trip time to the tested domains and on the value of the --timeout option.

To find the optimal number of workers, the script simple_bench.sh can be used.

Subset

DNS Bulk Lookup Utility

Resolution: Resolve a large number of domains to IPv4 and IPv6 addresses.

Copyright 2014 Damiano Boppart

This file is part of ECN-Spider.

resolution.Q_SIZE = 100

Maximum domain queue size

resolution.SLEEP = None

Time to sleep before each resolution, for crude rate-limiting.

resolution.TIMEOUT = None

The timeout for DNS resolution.

resolution.WWW = None

The value of the --www command line option.

resolution.arguments(argv)

Parse the command-line arguments.

Parameters: argv – The command line.
Returns: The return value of argparse.ArgumentParser.parse_args.

resolution.csv_gen(skip=0, count=0, *args, **kwargs)

A wrapper around csv.reader(), that makes it a generator.

csv_gen() does not return entire records, instead it extracts one particular field from a record.

Parameters:
  • *args – Arguments passed to csv.reader().
  • **kwargs – Keyword arguments passed to csv.reader().
Returns:

One field from each record on each call to next().

resolution.main(argv)

Method to be called when run from the command line.

resolution.resolve(domain, query='A')

Resolve a domain name to IP address(es).

Parameters:
  • domain (str) – The domain to be resolved.
  • query (str) – The query type. May be either ‘A’ or ‘AAAA’.
Returns:

A list of IP addresses as strings.

Throws:

Instances of dns.exception

resolution.resolve_both(domain)

Helper function to handle_domain.

Unique

Unique: Ensure all IP addresses are unique in the output of Resolution.

Copyright 2014 Damiano Boppart

This file is part of ECN-Spider.

unique.arguments(argv)

Parse the command-line arguments.

Parameters: argv – The command line.
Returns: The return value of argparse.ArgumentParser.parse_args.

unique.get_input(file_)

Read input CSV file and return it as a DataFrame.

Parameters: file_ – The filename.
Returns: The DataFrame.

unique.main(argv)

Method to be called when run from the command line.

unique.unique_col(df, col_name)

Remove duplicate records based on the values of only one column of the DataFrame.

Parameters:
  • df – The DataFrame.
  • col_name – The column name.
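
The idea behind unique_col() can be illustrated without a DataFrame. The sketch below works on plain lists of CSV fields. The function name unique_by_field is hypothetical, and the handling of empty fields (they are always kept) is an assumption of this sketch, not something the documentation of unique.py states.

```python
def unique_by_field(rows, index):
    """Drop rows whose value at position index has been seen before.

    Assumption: rows with an empty value in that field are always kept,
    since an empty field means "no address" rather than a shared address.
    """
    seen = set()
    result = []
    for row in rows:
        value = row[index]
        if value == "":
            result.append(row)
        elif value not in seen:
            seen.add(value)
            result.append(row)
    return result


rows = [
    ["1", "a.example", "192.0.2.1", "2001:db8::1"],
    ["2", "b.example", "192.0.2.1", ""],
    ["3", "c.example", "192.0.2.2", ""],
    ["4", "d.example", "192.0.2.3", "2001:db8::1"],
]
# Deduplicate on the IPv4 column first, then on the IPv6 column.
deduplicated = unique_by_field(unique_by_field(rows, 2), 3)
```

Here the second row is dropped because of its repeated IPv4 address and the fourth because of its repeated IPv6 address.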

ECN_Spider

ECN-Spider: Crawl web pages to test the Internet’s support of ECN.

Setting up ECN-Spider:

This application requires root privileges to change some settings using sysctl. The following steps grant ECN-Spider the minimal rights to make it work:

1) Install the application ‘sudo’.
2) Add the following rule to the sudoers file, adjusting ‘username’ as necessary: username ALL=NOPASSWD: /sbin/sysctl -w net.ipv4.tcp_ecn=[0-2]

Of course, if your setup allows caching the password for sudo, that works too. Apart from startup, consecutive calls to sudo should never be more than a user-defined timeout value (plus a small constant) apart, typically on the order of 10 seconds, so your cached password should not expire.

Copyright 2014 Damiano Boppart

This file is part of ECN-Spider.

ecn_spider.ARGS = None

argparse configuration

class ecn_spider.BigPer

A thread-safe class that allows the calculation of percentiles of an internal list of values that can be continually added to.

append(value)

Add a new value to the internal list of values.

length

Calculate the length of the list of values.

percentile_left(p=50)

Calculate the (possibly rounded down) pth percentile of the internal list of values.

Returns: Always returns a value from the list.
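
A class with the interface described for BigPer could look roughly like the sketch below. It is not the ECN-Spider implementation: the class name Percentiles, the use of a sorted list, the exact index formula, and exposing length as a property are all assumptions made for illustration.

```python
import bisect
import threading


class Percentiles:
    """Thread-safe, growing collection of values with a percentile query."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = []               # kept sorted at all times

    def append(self, value):
        with self._lock:
            bisect.insort(self._values, value)

    @property
    def length(self):
        with self._lock:
            return len(self._values)

    def percentile_left(self, p=50):
        """Return the p-th percentile, rounding the index down."""
        with self._lock:
            if not self._values:
                raise ValueError("no values recorded yet")
            index = int((len(self._values) - 1) * p / 100)
            return self._values[index]
```

Rounding the index down instead of interpolating is what guarantees that the result is always an element of the list.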

ecn_spider.DLOGGER = None

DataLogger instance shared between all threads

class ecn_spider.DataLogger(file_name)

A logger that outputs CSV.

This logger generates its messages using Python’s CSV module, and has no fancy log string formatting. Only the content of the iterable passed to writerow() will be written to the logfile.

writerow(data, lvl=10)

Produce one logfile record.

Parameters:
  • data – An iterable of fields. This will be converted to a CSV row, and then written to file.
  • lvl – Logging level. See documentation of the logging module for information on levels.
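
One way to build such a CSV logger on top of the standard logging and csv modules is sketched below. The class name CsvLogger, the close() method and the logger naming scheme are assumptions for illustration; the actual DataLogger may be structured differently.

```python
import csv
import io
import logging


class CsvLogger:
    """Write iterables as CSV rows through the logging module."""

    def __init__(self, file_name):
        self._logger = logging.getLogger("csvlogger." + file_name)
        self._logger.setLevel(logging.DEBUG)
        self._logger.propagate = False      # keep rows out of other logs
        self._handler = logging.FileHandler(file_name, mode="w")
        # Emit the message only: no timestamp, level or logger name.
        self._handler.setFormatter(logging.Formatter("%(message)s"))
        self._logger.addHandler(self._handler)

    def writerow(self, data, lvl=logging.DEBUG):
        """Format data as one CSV row and log it at level lvl."""
        buf = io.StringIO()
        csv.writer(buf).writerow(data)
        # The file handler appends its own newline, so drop the csv one.
        self._logger.log(lvl, buf.getvalue().rstrip("\r\n"))

    def close(self):
        self._logger.removeHandler(self._handler)
        self._handler.close()
```

Going through a logging handler means the handler's internal lock serializes writes, so several threads can share one instance.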

ecn_spider.E = {'refused': 'Connection refused', 'noroute': 'No route to host', 'unreach': 'Network is unreachable', 'success': 'success', 'perm': 'Permission denied', 'timeout': 'socket.timeout', 'invalid': 'Invalid argument'}

Error strings used by ecn_spider

class ecn_spider.Job

Type of elements in job queue

domain

Alias for field number 1

ip

Alias for field number 2

rank

Alias for field number 0

ecn_spider.Q_SIZE = 100

Maximum job queue size

ecn_spider.RETRY_LOGGER = None

DataLogger instance shared for writing the retry data file

ecn_spider.RUN = False

Signal end to master and worker threads

class ecn_spider.Record

Type used to parse the input CSV file into

domain

Alias for field number 1

ipv4

Alias for field number 2

ipv6

Alias for field number 3

rank

Alias for field number 0
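
The field numbers listed for Job and Record correspond to namedtuple definitions like the ones below. The field order follows the aliases documented above. The helper jobs_from_record is hypothetical and only illustrates how one input record could be turned into one job per address.

```python
import csv
from collections import namedtuple

# Field order matches the "Alias for field number ..." entries above.
Record = namedtuple("Record", ["rank", "domain", "ipv4", "ipv6"])
Job = namedtuple("Job", ["rank", "domain", "ip"])


def jobs_from_record(record):
    """Yield one Job per non-empty address of a Record (illustrative)."""
    for ip in (record.ipv4, record.ipv6):
        if ip:
            yield Job(record.rank, record.domain, ip)


line = "1,www.google.com,173.194.40.52,2a00:1450:400a:804::1013"
record = Record(*next(csv.reader([line])))
jobs = list(jobs_from_record(record))
```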

ecn_spider.START_TIME = None

Start time. Used to calculate runtime.

class ecn_spider.SemaphoreN(value)

An extension to the standard library’s BoundedSemaphore that provides functions to handle n tokens at once.

acquire_n(value=1, blocking=True, timeout=None)

Acquire value number of tokens at once.

The parameters blocking and timeout have the same semantics as BoundedSemaphore.

Returns: The same value as the last call to BoundedSemaphore’s acquire() if acquire() were called value times instead of the call to this method.

empty()

Acquire all tokens of the semaphore.

release_n(value=1)

Release value number of tokens at once.

Returns: The same value as the last call to BoundedSemaphore’s release() if release() were called value times instead of the call to this method.
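
The documented behavior of SemaphoreN can be approximated by a small subclass of threading.BoundedSemaphore. The sketch below uses the hypothetical class name MultiSemaphore and simply loops over the single-token methods; it is an illustration of the interface, not the ECN-Spider source. Note that in this sketch the timeout applies to each token separately.

```python
import threading


class MultiSemaphore(threading.BoundedSemaphore):
    """BoundedSemaphore with helpers that move several tokens at once."""

    def acquire_n(self, value=1, blocking=True, timeout=None):
        result = True
        for _ in range(value):
            result = self.acquire(blocking, timeout)
        return result                   # result of the last acquire()

    def release_n(self, value=1):
        result = None
        for _ in range(value):
            result = self.release()
        return result                   # result of the last release()

    def empty(self):
        # Take tokens without blocking until none are left.
        while self.acquire(blocking=False):
            pass
```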

class ecn_spider.SharedCounter(initial_value=0)

A counter object that can be shared by multiple threads. Based on: http://chimera.labs.oreilly.com/books/1230000000393/ch12.html#_problem_200

decr(delta=1)

Decrement the counter with locking

incr(delta=1)

Increment the counter with locking

value

Get the value of the counter.
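
A lock-protected counter with this interface can be written in a few lines. The sketch below uses the hypothetical name LockedCounter and exposes value as a property, which is an assumption about how the attribute is implemented.

```python
import threading


class LockedCounter:
    """A counter that can safely be shared between threads."""

    def __init__(self, initial_value=0):
        self._value = initial_value
        self._lock = threading.Lock()

    def incr(self, delta=1):
        with self._lock:                # read-modify-write under the lock
            self._value += delta

    def decr(self, delta=1):
        with self._lock:
            self._value -= delta

    @property
    def value(self):
        with self._lock:
            return self._value
```

Without the lock, two threads could read the same old value and one of the updates would be lost.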

ecn_spider.USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0'

User agent string used for HTTP requests

ecn_spider.VERBOSITY = 100

Print message about processing speed every VERBOSITY jobs.

ecn_spider.arguments(argv)

Parse the command-line arguments.

Parameters: argv – The command line.
Returns: The return value of argparse.ArgumentParser.parse_args.

ecn_spider.check_ecn()

Test that all the things that are done with sysctl work properly.

Returns: If this function returns without raising an exception, then everything is in working order.

ecn_spider.count = None

Shared counter instance to keep track of completed jobs.

ecn_spider.disable_ecn()

Wrapper for set_ecn() to disable ECN.

ecn_spider.domain_reader(max_lines, *args, **kwargs)

A wrapper around csv.reader() that makes it a generator. Reads records from the input file, and returns them as the namedtuple Record.

Parameters:
  • *args – Arguments passed to csv.reader().
  • **kwargs – Keyword arguments passed to csv.reader().
Returns:

One record in the form of namedtuple Record on each call to next()

ecn_spider.enable_ecn()

Wrapper for set_ecn() to enable ECN.

ecn_spider.filler(file_name, queue_)

Fill a queue with jobs from the input file.

Parameters:
  • file_name – Input file with jobs.
  • queue_ – Job queue to fill.

ecn_spider.get_ecn_linux()

Use sysctl to get the kernel’s ECN behavior on Linux.

Raises: subprocess.CalledProcessError when the command fails.

ecn_spider.get_ecn_mac()

Use sysctl to get the kernel’s ECN behavior on Mac OS X.

Raises: subprocess.CalledProcessError when the command fails.

ecn_spider.limited_reader(max_lines=0, *args, **kwargs)

A wrapper around csv.reader(), that returns only the first max_lines lines.

Parameters:
  • max_lines (int) – The maximum number of lines to return. All, if set to 0.
  • *args – Arguments passed to csv.reader().
  • **kwargs – Keyword arguments passed to csv.reader().
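
A wrapper with the behavior described for limited_reader() can be built from csv.reader() and itertools.islice(). The sketch below uses the hypothetical name first_rows to make clear that it is an illustration and not the ECN-Spider source.

```python
import csv
import io
from itertools import islice


def first_rows(max_lines=0, *args, **kwargs):
    """Return an iterator over at most max_lines CSV rows (0 means all)."""
    reader = csv.reader(*args, **kwargs)
    if max_lines == 0:
        return reader
    return islice(reader, max_lines)


data = io.StringIO("1,google.com\n2,facebook.com\n3,youtube.com\n")
rows = list(first_rows(2, data))
```

Since islice() is lazy, the rest of the file is never read when only the first few lines are requested.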

ecn_spider.main(argv)

Method to be called when run from the command line.

ecn_spider.make_get(client, domain, note)

Make an HTTP GET request and return the important bits of information as a dictionary.

Parameters:
  • client – The instance of http.client.HTTPConnection for making the request with.
  • domain – The value of the Host field of the GET request.
  • note – The string ‘eoff’ or ‘eon’. Used as part of the keys in the returned dictionary.

ecn_spider.master(num_workers, ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy)

Master thread for controlling the kernel’s ECN behavior.

This thread synchronizes with the worker threads using the following semaphores:

ecn_on
Master signals the workers that ECN has just been turned on.
ecn_on_rdy
Worker signals the master that ECN may be turned on now.
ecn_off
Master signals the workers that ECN has just been turned off.
ecn_off_rdy
Worker signals the master that ECN may be turned off now.

The four semaphores must have been created before this thread is started, and their values must have been set to zero, i.e. acquiring a token is not possible.

Parameters:
  • num_workers (int) – Number of worker threads (that perform HTTP requests)
  • ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy (SemaphoreN) – The semaphores described above.
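
The handshake between the master and the workers can be illustrated with a reduced model. The sketch below is an interpretation of the semaphore descriptions above, not the ECN-Spider code: it uses plain threading.Semaphore objects instead of SemaphoreN, replaces the sysctl calls and the connection attempts with log entries, and stops after a fixed number of rounds, which is an assumption made so that the example terminates.

```python
import threading


def master(num_workers, rounds, ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy, log):
    for _ in range(rounds):
        for _ in range(num_workers):    # wait until every worker is ready
            ecn_off_rdy.acquire()
        log.append("master: ECN off")   # real code would call sysctl here
        for _ in range(num_workers):
            ecn_off.release()
        for _ in range(num_workers):
            ecn_on_rdy.acquire()
        log.append("master: ECN on")
        for _ in range(num_workers):
            ecn_on.release()


def worker(rounds, ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy, log):
    for _ in range(rounds):
        ecn_off_rdy.release()           # "ECN may be turned off now"
        ecn_off.acquire()               # wait for "ECN has been turned off"
        log.append("worker: connect without ECN")
        ecn_on_rdy.release()            # "ECN may be turned on now"
        ecn_on.acquire()                # wait for "ECN has been turned on"
        log.append("worker: connect with ECN")


def run_demo(num_workers=3, rounds=2):
    # Order: ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy; all start at zero.
    sems = tuple(threading.Semaphore(0) for _ in range(4))
    log = []
    threads = [threading.Thread(target=worker, args=(rounds,) + sems + (log,))
               for _ in range(num_workers)]
    threads.append(threading.Thread(
        target=master, args=(num_workers, rounds) + sems + (log,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because the master collects one token from every worker before it changes the setting, no worker can still be connecting with the old ECN mode when the kernel setting changes.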

ecn_spider.print_platform()

Print information about the platform.

ecn_spider.reporter(queue_)

Periodically report on the length of the job queue.

ecn_spider.retry_count = None

Shared counter instance for keeping track of number of jobs to be retried.

ecn_spider.set_ecn_linux(value)

Use sysctl to set the kernel’s ECN behavior on Linux.

This is the equivalent of calling sudo /sbin/sysctl -w "net.ipv4.tcp_ecn=$MODE" in a shell.

Raises: subprocess.CalledProcessError when the command fails.

ecn_spider.set_ecn_mac(value)

Use sysctl to set the kernel’s ECN behavior on Mac OS X.

Raises: subprocess.CalledProcessError when the command fails.

ecn_spider.set_up_logging(logfile, verbosity)

Configure logging.

Parameters:
  • logfile (file) – Filename of logfile.
  • verbosity (verbosity) – Stdout logging verbosity.

ecn_spider.setup_socket(ip, timeout)

Open a socket using an instance of http.client.HTTPConnection.

Parameters:
  • ip – IP address
  • timeout – Timeout for socket operations
Returns:

A tuple of: Error message or None, an instance of http.client.HTTPConnection.

ecn_spider.worker(queue_, timeout, ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy)

Worker thread for crawling websites with and without ECN.

This thread synchronizes with the master thread using the semaphores described in the documentation of master().

The four semaphores must have been created before this thread is started, and their values must have been set to zero, i.e. acquiring a token is not possible.

Parameters:
  • queue_ (Queue) – A job queue with elements of type Job.
  • timeout (int) – Timeout for socket operations.
  • ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy (SemaphoreN) – The semaphores referenced above.

ecn_spider.worker_no_ecn(queue_, timeout)

Worker thread for crawling websites without an ECN cycle.

Parameters:
  • queue_ (Queue) – A job queue with elements of type Job.
  • timeout (int) – Timeout for socket operations.

CSV Input File Format

ECN-Spider’s CSV input file contains domain names and IP addresses to which connection attempts are made.

Each record in the file has the following format:

name,ipv4,ipv6

The fields have the following meanings:
name:
A domain name. This will be used as is as the value of the Host header field in HTTP requests. This field must not be empty.
ipv4:
The IPv4 address that is a DNS A record of name. This field may be empty.
ipv6:
The IPv6 address that is a DNS AAAA record of name. This field may be empty.

This is a sample input file snippet:

www.mail.ru,94.100.180.70,
www.ask.com,184.29.106.11,
www.google.it,173.194.43.31,2607:f8b0:4006:802::1018
www.tmall.com,220.181.113.241,
www.sina.com.cn,58.63.236.31,
www.google.fr,173.194.43.23,2607:f8b0:4006:802::1017
www.example.com,,

CSV Output File Format

ECN-Spider’s CSV output file contains information on all the TCP connections and the HTTP traffic it generates. The file format is designed with ease of parsing in mind, and not optimized for minimal file size.

Each record in the file has the following format:

time,ip,domain,ecn_mode,record_type,data

The fields have the following meanings:
time:
A timestamp of when the log record was created. In seconds since Unix epoch.
ip:
The IP address of the web server connected to.
domain:
The domain name of the web server connected to.
ecn_mode:
on if this TCP connection uses ECN, off otherwise.
record_type:

What event this log message represents. This also defines the meaning of the data field.

This field has one of the following values:
PRE_CONN:
Immediately before the opening of a TCP connection.
POST_CONN:
Immediately after opening a TCP connection.
PRE_REQ:
Immediately before making an HTTP request.
POST_REQ:
Immediately after having parsed the response to an HTTP request.
REQ_HDR:
The headers of an HTTP response.
data:
Additional data. The meaning of this field depends on the record_type and is defined as follows:
PRE_CONN (Nothing):
Always None.
POST_CONN (Port):
The local port of the open TCP connection. 0 if the connection could not be established.
PRE_REQ (is_dummy):
True, if no actual request will be made (because the connection could not be established), False otherwise.
POST_REQ (Status Code):
The status code of the HTTP response. 0 if no response was made, 418 if the request failed, or the response could not be parsed.
REQ_HDR (Headers):
The headers of an HTTP response.

Each test of a single domain by ECN-Spider generates exactly the following pattern of records in the given order (only record types listed, for display purposes):

PRE_CONN
POST_CONN
PRE_CONN
POST_CONN
PRE_REQ
POST_REQ
REQ_HDR
PRE_REQ
POST_REQ
REQ_HDR

Records of type REQ_HDR are optional: whether they appear depends on the configuration of ECN-Spider.

This is a sample output file snippet:

1401961474.3344474,66.211.160.88,ebay.com,off,PRE_CONN,
1401961474.3347988,206.190.36.45,yahoo.com,off,PRE_CONN,
1401961474.335094,173.194.40.87,google.co.jp,off,PRE_CONN,
1401961474.335769,[2a00:1450:4001:c02::bf],blogspot.com,off,POST_CONN,0
1401961474.3749285,162.243.54.31,fc2.com,on,POST_REQ,200
1401961474.375057,162.243.54.31,fc2.com,on,REQ_HDR,"[('Accept-Ranges', 'bytes'), ('Content-Type', 'text/html'), ('Date', 'Thu, 05 Jun 2014 09:44:01 GMT'), ('ETag', '""683d3b-8818-4fb13875fde40""'), ('Last-Modified', 'Thu, 05 Jun 2014 09:40:01 GMT'), ('Server', 'nginx/1.1.19'), ('Vary', 'Accept-Encoding'), ('Content-Length', '34840'), ('Connection', 'Close')]"
1401961474.375366,54.200.228.182,fc2.com,off,PRE_REQ,False
1401961474.4061024,206.190.36.45,yahoo.com,off,POST_CONN,47885
1401961474.4142444,66.211.160.88,ebay.com,off,POST_CONN,55109
1401961474.4262393,173.194.70.191,blogspot.com,off,POST_CONN,49240
1401961474.4331276,173.194.40.87,google.co.jp,off,POST_CONN,45960
1401961474.4698431,162.243.54.31,fc2.com,off,POST_REQ,200
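
Since the output is plain CSV, it can be read with the standard csv module. The following sketch parses records of the format described above into named tuples. The names LogRecord and parse_output are chosen for illustration only, and the two sample lines are taken from the snippet above.

```python
import csv
from collections import namedtuple

LogRecord = namedtuple(
    "LogRecord", ["time", "ip", "domain", "ecn_mode", "record_type", "data"])


def parse_output(lines):
    """Yield one LogRecord per CSV line; time is converted to float."""
    for row in csv.reader(lines):
        time, ip, domain, ecn_mode, record_type, data = row
        yield LogRecord(float(time), ip, domain, ecn_mode, record_type, data)


sample = [
    "1401961474.4061024,206.190.36.45,yahoo.com,off,POST_CONN,47885",
    "1401961474.3749285,162.243.54.31,fc2.com,on,POST_REQ,200",
]
records = list(parse_output(sample))
# Local port per domain, taken from the POST_CONN records.
ports = {r.domain: int(r.data) for r in records
         if r.record_type == "POST_CONN"}
```

The data field is left as a string here because its meaning depends on record_type; callers convert it as needed, as done for the port numbers.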

Analysis

FIXME.

Simple-Bench

A benchmarking utility to determine the ideal number of workers for running tests with ecn_spider.py.
