ECN_Spider

ECN-Spider: Crawl web pages to test the Internet’s support of ECN.

Setting up ECN-Spider:

This application requires root privileges to change some settings using sysctl. The following steps grant ECN-Spider the minimal rights to make it work:

1) Install the application ‘sudo’.
2) Add the following rule to the sudoers file, adjusting ‘username’ as necessary:

username ALL=NOPASSWD: /sbin/sysctl -w net.ipv4.tcp_ecn=[0-2]

Of course, if your setup allows caching the sudo password, that works too. Apart from during startup, consecutive calls to sudo should never be more than the user-defined timeout value (plus a small constant) apart, typically on the order of 10 seconds, so your cached password should not expire.

Copyright 2014 Damiano Boppart

This file is part of ECN-Spider.

ecn_spider.ARGS = None

argparse configuration

class ecn_spider.BigPer[source]

A thread-safe class that allows the calculation of percentiles of an internal list of values that can be continually added to.

append(value)[source]

Add a new value to the internal list of values.

length[source]

Calculate the length of the list of values.

percentile_left(p=50)[source]

Calculate the (possibly rounded down) pth percentile of the internal list of values.

Returns: Always returns a value from the list.
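
As an illustration only, a class with this behavior could be sketched as follows. The class name PercentileList, the lock-protected list, and the index formula (rounding the position down so that the result is always an element of the list) are assumptions, not the ECN-Spider source.

```python
import threading


class PercentileList:
    """Hypothetical stand-in for BigPer; not the ECN-Spider implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = []

    def append(self, value):
        """Add a value; the lock keeps concurrent appends consistent."""
        with self._lock:
            self._values.append(value)

    @property
    def length(self):
        with self._lock:
            return len(self._values)

    def percentile_left(self, p=50):
        """Return the p-th percentile, rounding the position down."""
        with self._lock:
            if not self._values:
                raise ValueError("no values recorded")
            ordered = sorted(self._values)
        # Assumed formula: floor(p / 100 * (n - 1)), so no interpolation occurs.
        index = int(p * (len(ordered) - 1) // 100)
        return ordered[index]
```

Because the index is rounded down rather than interpolated, the result is always one of the stored values, which matches the "Returns" note above.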
ecn_spider.DLOGGER = None

DataLogger instance shared between all threads

class ecn_spider.DataLogger(file_name)[source]

A logger that outputs CSV.

This logger generates its messages using Python’s CSV module, and has no fancy log string formatting. Only the content of the iterable passed to writerow() will be written to the logfile.

writerow(data, lvl=10)[source]

Produce one logfile record.

Parameters:
  • data – An iterable of fields. This will be converted to a CSV row, and then written to file.
  • lvl – Logging level. See documentation of the logging module for information on levels.
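
A minimal sketch of such a logger, assuming it combines the csv and logging modules from the standard library. The class name CsvLogger, the logger naming scheme, and the close() helper are hypothetical; only writerow(data, lvl=10) mirrors the documented signature.

```python
import csv
import io
import logging


class CsvLogger:
    """Hypothetical stand-in for DataLogger."""

    def __init__(self, file_name):
        self._logger = logging.getLogger("csv_sketch." + file_name)
        self._logger.setLevel(logging.DEBUG)
        self._logger.propagate = False  # keep rows out of the root logger
        self._handler = logging.FileHandler(file_name, mode="w")
        # Emit the bare message only: no timestamp, level, or logger name.
        self._handler.setFormatter(logging.Formatter("%(message)s"))
        self._logger.addHandler(self._handler)

    def writerow(self, data, lvl=logging.DEBUG):
        """Format one iterable as a CSV row and log it at level lvl."""
        buffer = io.StringIO()
        csv.writer(buffer).writerow(data)
        # The handler adds its own newline, so drop the one from csv.
        self._logger.log(lvl, buffer.getvalue().rstrip("\r\n"))

    def close(self):
        self._logger.removeHandler(self._handler)
        self._handler.close()
```

The default lvl=10 corresponds to logging.DEBUG.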
ecn_spider.E = {'refused': 'Connection refused', 'noroute': 'No route to host', 'unreach': 'Network is unreachable', 'success': 'success', 'perm': 'Permission denied', 'timeout': 'socket.timeout', 'invalid': 'Invalid argument'}

Error strings used by ecn_spider

class ecn_spider.Job

Type of elements in job queue

domain

Alias for field number 1

ip

Alias for field number 2

rank

Alias for field number 0

ecn_spider.Q_SIZE = 100

Maximum job queue size

ecn_spider.RETRY_LOGGER = None

DataLogger instance shared for writing the retry data file

ecn_spider.RUN = False

Signal end to master and worker threads

class ecn_spider.Record

Type into which the records of the input CSV file are parsed

domain

Alias for field number 1

ipv4

Alias for field number 2

ipv6

Alias for field number 3

rank

Alias for field number 0
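
The field numbers listed for Job and Record imply the following namedtuple layouts. This is a sketch reconstructed from the "Alias for field number N" entries; the example values and the idea of deriving one job per non-empty address are assumptions for illustration.

```python
from collections import namedtuple

# Field order follows the documented field numbers.
Job = namedtuple("Job", ["rank", "domain", "ip"])
Record = namedtuple("Record", ["rank", "domain", "ipv4", "ipv6"])

# Hypothetical example values (192.0.2.1 is a documentation address).
record = Record(rank=1, domain="www.example.com", ipv4="192.0.2.1", ipv6="")

# Assumption: one job is created for each non-empty address of a record.
jobs = [Job(record.rank, record.domain, ip)
        for ip in (record.ipv4, record.ipv6) if ip]
```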

ecn_spider.START_TIME = None

Start time. Used to calculate runtime.

class ecn_spider.SemaphoreN(value)[source]

An extension to the standard library’s BoundedSemaphore that provides functions to handle n tokens at once.

acquire_n(value=1, blocking=True, timeout=None)[source]

Acquire value number of tokens at once.

The parameters blocking and timeout have the same semantics as BoundedSemaphore.

Returns: The same value as the last call to BoundedSemaphore’s acquire() if acquire() were called value times instead of the call to this method.
empty()[source]

Acquire all tokens of the semaphore.

release_n(value=1)[source]

Release value number of tokens at once.

Returns: The same value as the last call to BoundedSemaphore’s release() if release() were called value times instead of the call to this method.
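
One way to obtain the described behavior is to subclass threading.BoundedSemaphore and loop over the single-token methods. The sketch below does that; the class name SemaphoreNSketch and the non-blocking drain in empty() are assumptions rather than the actual implementation.

```python
import threading


class SemaphoreNSketch(threading.BoundedSemaphore):
    """Hypothetical stand-in for SemaphoreN."""

    def acquire_n(self, value=1, blocking=True, timeout=None):
        """Call acquire() value times and return the last result."""
        result = True
        for _ in range(value):
            result = self.acquire(blocking, timeout)
        return result

    def release_n(self, value=1):
        """Call release() value times and return the last result."""
        result = None
        for _ in range(value):
            result = self.release()
        return result

    def empty(self):
        """Take tokens without blocking until none are left."""
        while self.acquire(blocking=False):
            pass
```

Creating the semaphore with its maximum value and then calling empty() gives a semaphore whose value is zero, which is the starting state that master() and worker() expect.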
class ecn_spider.SharedCounter(initial_value=0)[source]

A counter object that can be shared by multiple threads. Based on: http://chimera.labs.oreilly.com/books/1230000000393/ch12.html#_problem_200

decr(delta=1)[source]

Decrement the counter with locking

incr(delta=1)[source]

Increment the counter with locking

value[source]

Get the value of the counter.
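
A lock-protected counter of this kind can be sketched as below. The class name CounterSketch is hypothetical; the method names follow the documentation above, and reading the value under the lock is an assumption.

```python
import threading


class CounterSketch:
    """Hypothetical stand-in for SharedCounter."""

    def __init__(self, initial_value=0):
        self._value = initial_value
        self._lock = threading.Lock()

    def incr(self, delta=1):
        """Increment the counter while holding the lock."""
        with self._lock:
            self._value += delta

    def decr(self, delta=1):
        """Decrement the counter while holding the lock."""
        with self._lock:
            self._value -= delta

    @property
    def value(self):
        with self._lock:
            return self._value


# Usage: four threads incrementing the same counter.
counter = CounterSketch()

def bump():
    for _ in range(1000):
        counter.incr()

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, the read-modify-write in incr() could interleave between threads and lose updates.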

ecn_spider.USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0'

User agent string used for HTTP requests

ecn_spider.VERBOSITY = 100

Print message about processing speed every VERBOSITY jobs.

ecn_spider.arguments(argv)[source]

Parse the command-line arguments.

Parameters: argv – The command line.
Returns: The return value of argparse.ArgumentParser.parse_args.
ecn_spider.check_ecn()[source]

Test that all the things that are done with sysctl work properly.

Returns: If this function returns without raising an exception, then everything is in working order.
ecn_spider.count = None

Shared counter instance to keep track of completed jobs.

ecn_spider.disable_ecn()[source]

Wrapper for set_ecn() to disable ECN.

ecn_spider.domain_reader(max_lines, *args, **kwargs)[source]

A wrapper around csv.reader() that makes it a generator. Reads records from the input file and yields them as the namedtuple Record.

Parameters:
  • *args – Arguments passed to csv.reader().
  • **kwargs – Keyword arguments passed to csv.reader().
Returns:

One record in the form of namedtuple Record on each call to next()

ecn_spider.enable_ecn()[source]

Wrapper for set_ecn() to enable ECN.

ecn_spider.filler(file_name, queue_)[source]

Fill a queue with jobs from the input file.

Parameters:
  • file_name – Input file with jobs.
  • queue_ – Job queue to fill.
ecn_spider.get_ecn_linux()[source]

Use sysctl to get the kernel’s ECN behavior on Linux.

Raises: subprocess.CalledProcessError when the command fails.
ecn_spider.get_ecn_mac()[source]

Use sysctl to get the kernel’s ECN behavior on Mac OS X.

Raises: subprocess.CalledProcessError when the command fails.
ecn_spider.limited_reader(max_lines=0, *args, **kwargs)[source]

A wrapper around csv.reader() that returns only the first max_lines lines.

Parameters:
  • max_lines (int) – The maximum number of lines to return. All, if set to 0.
  • *args – Arguments passed to csv.reader().
  • **kwargs – Keyword arguments passed to csv.reader().
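
The described behavior can be approximated with itertools.islice. This is a sketch under that assumption; the function name limited_reader_sketch is hypothetical and the original may be implemented differently.

```python
import csv
import itertools


def limited_reader_sketch(max_lines=0, *args, **kwargs):
    """Return at most max_lines rows from csv.reader(); all rows if 0."""
    reader = csv.reader(*args, **kwargs)
    if max_lines == 0:
        return reader
    return itertools.islice(reader, max_lines)


# csv.reader() accepts any iterable of strings, so a list works for a demo.
lines = ["a,1", "b,2", "c,3"]
first_two = list(limited_reader_sketch(2, lines))
everything = list(limited_reader_sketch(0, lines))
```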
ecn_spider.main(argv)[source]

Method to be called when run from the command line.

ecn_spider.make_get(client, domain, note)[source]

Make an HTTP GET request and return the important bits of information as a dictionary.

Parameters:
  • client – The instance of http.client.HTTPConnection for making the request with.
  • domain – The value of the Host field of the GET request.
  • note – The string ‘eoff’ or ‘eon’. Used as part of the keys in the returned dictionary.
ecn_spider.master(num_workers, ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy)[source]

Master thread for controlling the kernel’s ECN behavior.

This thread synchronizes with the worker threads using the following semaphores:

ecn_on
Master signals the workers that ECN has just been turned on.
ecn_on_rdy
Worker signals the master that ECN may be turned on now.
ecn_off
Master signals the workers that ECN has just been turned off.
ecn_off_rdy
Worker signals the master that ECN may be turned off now.

The four semaphores must have been created before this thread is started, and their values must have been set to zero, i.e. acquiring a token is not possible.

Parameters:
  • num_workers (int) – Number of worker threads (that perform HTTP requests)
  • ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy (SemaphoreN) – The semaphores described above.
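
The handshake between master and workers can be illustrated with plain threading.Semaphore objects. The sketch below is an assumption-laden simplification: the function names, the rounds parameter, the log list, and the order of the two phases (off first, then on) are all hypothetical, and no sysctl call or HTTP request is made.

```python
import threading


def master_sketch(num_workers, ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy,
                  rounds, log):
    """Switch a pretend ECN setting once every worker has agreed to it."""
    for _round in range(rounds):
        for _ in range(num_workers):
            ecn_off_rdy.acquire()   # wait until each worker allows "off"
        log.append("off")           # stand-in for disabling ECN
        for _ in range(num_workers):
            ecn_off.release()       # tell each worker that ECN is now off
        for _ in range(num_workers):
            ecn_on_rdy.acquire()    # wait until each worker allows "on"
        log.append("on")            # stand-in for enabling ECN
        for _ in range(num_workers):
            ecn_on.release()        # tell each worker that ECN is now on


def worker_sketch(ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy, rounds):
    for _round in range(rounds):
        ecn_off_rdy.release()       # "ECN may be turned off now"
        ecn_off.acquire()           # wait for the master to do so
        # ... a connection without ECN would be opened here ...
        ecn_on_rdy.release()        # "ECN may be turned on now"
        ecn_on.acquire()            # wait for the master to do so
        # ... a connection with ECN would be opened here ...


def run_demo(num_workers=3, rounds=2):
    # All four semaphores start at zero, as the documentation requires.
    sems = [threading.Semaphore(0) for _ in range(4)]
    log = []
    threads = [threading.Thread(
        target=master_sketch, args=(num_workers, *sems, rounds, log),
        daemon=True)]
    threads += [threading.Thread(
        target=worker_sketch, args=(*sems, rounds), daemon=True)
        for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return log, all(not t.is_alive() for t in threads)
```

Since the setting is system-wide, the master changes it only after collecting one "ready" token from every worker, so no worker is in the middle of a connection attempt when the mode changes.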
ecn_spider.print_platform()[source]

Print information about the platform.

ecn_spider.reporter(queue_)[source]

Periodically report on the length of the job queue.

ecn_spider.retry_count = None

Shared counter instance for keeping track of number of jobs to be retried.

ecn_spider.set_ecn_linux(value)[source]

Use sysctl to set the kernel’s ECN behavior on Linux.

This is the equivalent of calling “sudo /sbin/sysctl -w net.ipv4.tcp_ecn=$MODE” in a shell.

Raises: subprocess.CalledProcessError when the command fails.
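
A sketch of how such a call might be issued from Python, assuming subprocess.check_call is used (it raises subprocess.CalledProcessError on a non-zero exit status). The function names are hypothetical, and the accepted range 0 to 2 is taken from the sudoers rule in the setup instructions.

```python
import subprocess


def build_set_ecn_command(value):
    """Build the argument list for the sysctl call; value must be 0, 1 or 2."""
    if value not in (0, 1, 2):
        raise ValueError("tcp_ecn must be 0, 1 or 2")
    return ["sudo", "/sbin/sysctl", "-w", "net.ipv4.tcp_ecn={}".format(value)]


def set_ecn_linux_sketch(value):
    """Run the command; needs the sudoers rule from the setup section."""
    subprocess.check_call(build_set_ecn_command(value))
```

Passing the command as a list avoids shell quoting issues and matches the form of the command allowed by the sudoers rule.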
ecn_spider.set_ecn_mac(value)[source]

Use sysctl to set the kernel’s ECN behavior on Mac OS X.

Raises: subprocess.CalledProcessError when the command fails.
ecn_spider.set_up_logging(logfile, verbosity)[source]

Configure logging.

Parameters:
  • logfile (file) – Filename of logfile.
  • verbosity (verbosity) – Stdout logging verbosity.
ecn_spider.setup_socket(ip, timeout)[source]

Open a socket using an instance of http.client.HTTPConnection.

Parameters:
  • ip – IP address
  • timeout – Timeout for socket operations
Returns:

A tuple of: Error message or None, an instance of http.client.HTTPConnection.

ecn_spider.worker(queue_, timeout, ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy)[source]

Worker thread for crawling websites with and without ECN.

This thread synchronizes with the master thread using the semaphores described in the documentation of master().

The four semaphores must have been created before this thread is started, and their values must have been set to zero, i.e. acquiring a token is not possible.

Parameters:
  • queue_ (Queue) – A job queue with elements of type Job.
  • timeout (int) – Timeout for socket operations.
  • ecn_on, ecn_on_rdy, ecn_off, ecn_off_rdy (SemaphoreN) – The semaphores referenced above.
ecn_spider.worker_no_ecn(queue_, timeout)[source]

Worker thread for crawling websites without an ECN cycle.

Parameters:
  • queue_ (Queue) – A job queue with elements of type Job.
  • timeout (int) – Timeout for socket operations.

CSV Input File Format

ECN-Spider’s CSV input file contains domain names and IP addresses to which connection attempts are made.

Each record in the file has the following format:

name,ipv4,ipv6
The fields have the following meanings:
name:
A domain name. This will be used as-is as the value of the Host header field in HTTP requests. This field must not be empty.
ipv4:
The IPv4 address that is a DNS A record of name. This field may be empty.
ipv6:
The IPv6 address that is a DNS AAAA record of name. This field may be empty.

This is a sample input file snippet:

www.mail.ru,94.100.180.70,
www.ask.com,184.29.106.11,
www.google.it,173.194.43.31,2607:f8b0:4006:802::1018
www.tmall.com,220.181.113.241,
www.sina.com.cn,58.63.236.31,
www.google.fr,173.194.43.23,2607:f8b0:4006:802::1017
www.example.com,,
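
Reading this format takes only a few lines with the csv module. The sketch below parses three of the sample lines; the function name read_targets and the choice to map empty address fields to None are assumptions for illustration.

```python
import csv
import io

SAMPLE = """www.mail.ru,94.100.180.70,
www.google.it,173.194.43.31,2607:f8b0:4006:802::1018
www.example.com,,
"""


def read_targets(stream):
    """Yield (name, ipv4, ipv6) tuples; empty addresses become None."""
    for name, ipv4, ipv6 in csv.reader(stream):
        if not name:
            raise ValueError("the name field must not be empty")
        yield name, ipv4 or None, ipv6 or None


targets = list(read_targets(io.StringIO(SAMPLE)))
```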

CSV Output File Format

ECN-Spider’s CSV output file contains information on all the TCP connections and the HTTP traffic it generates. The file format is designed with ease of parsing in mind, and not optimized for minimal file size.

Each record in the file has the following format:

time,ip,domain,ecn_mode,record_type,data
The fields have the following meanings:
time:
A timestamp of when the log record was created, in seconds since the Unix epoch.
ip:
The IP address of the web server connected to.
domain:
The domain name of the web server connected to.
ecn_mode:
on if this TCP connection uses ECN, off otherwise.
record_type:

What event this log message represents. This also defines the meaning of the data field.

This field has one of the following values:
PRE_CONN:
Immediately before the opening of a TCP connection.
POST_CONN:
Immediately after opening a TCP connection.
PRE_REQ:
Immediately before making an HTTP request.
POST_REQ:
Immediately after having parsed the response to an HTTP request.
REQ_HDR:
The headers of an HTTP response.
data:
Additional data. The meaning of this field depends on the record_type and is defined as follows:
PRE_CONN (Nothing):
Always None.
POST_CONN (Port):
The local port of the open TCP connection. 0 if the connection could not be established.
PRE_REQ (is_dummy):
True, if no actual request will be made (because the connection could not be established), False otherwise.
POST_REQ (Status Code):
The status code of the HTTP response. 0 if no response was made; 418 if the request failed or the response could not be parsed.
REQ_HDR (Headers):
The headers of an HTTP response.

Each test of a single domain by ECN-Spider generates exactly the following pattern of records in the given order (only record types listed, for display purposes):

PRE_CONN
POST_CONN
PRE_CONN
POST_CONN
PRE_REQ
POST_REQ
REQ_HDR
PRE_REQ
POST_REQ
REQ_HDR

REQ_HDR records are optional: whether they appear depends on the configuration of ECN-Spider.

This is a sample output file snippet:

1401961474.3344474,66.211.160.88,ebay.com,off,PRE_CONN,
1401961474.3347988,206.190.36.45,yahoo.com,off,PRE_CONN,
1401961474.335094,173.194.40.87,google.co.jp,off,PRE_CONN,
1401961474.335769,[2a00:1450:4001:c02::bf],blogspot.com,off,POST_CONN,0
1401961474.3749285,162.243.54.31,fc2.com,on,POST_REQ,200
1401961474.375057,162.243.54.31,fc2.com,on,REQ_HDR,"[('Accept-Ranges', 'bytes'), ('Content-Type', 'text/html'), ('Date', 'Thu, 05 Jun 2014 09:44:01 GMT'), ('ETag', '""683d3b-8818-4fb13875fde40""'), ('Last-Modified', 'Thu, 05 Jun 2014 09:40:01 GMT'), ('Server', 'nginx/1.1.19'), ('Vary', 'Accept-Encoding'), ('Content-Length', '34840'), ('Connection', 'Close')]"
1401961474.375366,54.200.228.182,fc2.com,off,PRE_REQ,False
1401961474.4061024,206.190.36.45,yahoo.com,off,POST_CONN,47885
1401961474.4142444,66.211.160.88,ebay.com,off,POST_CONN,55109
1401961474.4262393,173.194.70.191,blogspot.com,off,POST_CONN,49240
1401961474.4331276,173.194.40.87,google.co.jp,off,POST_CONN,45960
1401961474.4698431,162.243.54.31,fc2.com,off,POST_REQ,200
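
The following sketch parses a few records of this format and converts the data field according to record_type. It assumes, based on the sample above, that REQ_HDR data is the Python repr of a list of (name, value) tuples and can therefore be read with ast.literal_eval. The function name parse_record and the shortened header list are for illustration only.

```python
import ast
import csv
import io

SAMPLE = '''1401961474.335769,[2a00:1450:4001:c02::bf],blogspot.com,off,POST_CONN,0
1401961474.3749285,162.243.54.31,fc2.com,on,POST_REQ,200
1401961474.375057,162.243.54.31,fc2.com,on,REQ_HDR,"[('Content-Type', 'text/html'), ('ETag', '""683d3b-8818-4fb13875fde40""')]"
1401961474.375366,54.200.228.182,fc2.com,off,PRE_REQ,False
'''


def parse_record(row):
    """Turn one CSV row into a dict, decoding data by record type."""
    time, ip, domain, ecn_mode, record_type, data = row
    if record_type in ("POST_CONN", "POST_REQ"):
        data = int(data)                # local port or HTTP status code
    elif record_type == "PRE_REQ":
        data = (data == "True")         # is_dummy flag
    elif record_type == "REQ_HDR":
        data = ast.literal_eval(data)   # assumed: repr of a list of tuples
    return {"time": float(time), "ip": ip, "domain": domain,
            "ecn_mode": ecn_mode, "record_type": record_type, "data": data}


records = [parse_record(row) for row in csv.reader(io.StringIO(SAMPLE))]
```

The csv module undoes the doubled quotes inside the quoted REQ_HDR field, so the ETag value comes back with its original double quotes.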
