SCRAPER-HELPER

 

NAME

scraper-helper - capture HTTP and HTTPS connections

SYNOPSIS

scraper-helper [ options ] capture-dir

scraper-helper --replay trace-file capture-dir

scraper-helper [ options ] --make-root-cert  

DESCRIPTION

Scraper-helper is an HTTP proxy that writes the HTTP and HTTPS requests passing through it to a log file.

The invocation:

scraper-helper -l 3128 capture-dir

starts an HTTP proxy listening on port 3128. All data passing through it is written to capture-dir/trace, and the bodies of the requests and responses are written to individual files in capture-dir. The directory capture-dir will be created if it doesn't exist. The logged data is decoded, decompressed and, in the case of HTTPS, decrypted before being written to the log.

To decrypt HTTPS, scraper-helper must perform a man-in-the-middle attack, meaning it impersonates a Certificate Authority by generating a temporary root certificate and using it to create fake HTTPS site certificates as needed. However, since your browser doesn't trust the temporary root certificate, it will generate a warning for each fake certificate it sees. To silence the warnings, pre-generate the root certificate using --make-root-cert, load it into your browser's certificate store, and mark it as allowed to sign web site certificates. Pass it to scraper-helper using the --cert option on future runs.

Typically the options are read from the configuration file rather than being supplied on the command line as they are in the examples above.

 

COMMAND LINE

-c root-cert.pem, --cert root-cert.pem
root-cert.pem is the X509 root certificate used to sign the site certificates that perform the man-in-the-middle attack on HTTPS. If this option is not given, the certificate in ~/.config/scraper-helper/root-cert.pem is used; if that file doesn't exist either, a temporary certificate is generated. This file can also contain the private key. Passing '-' forces a temporary root certificate to be used.
-k root-key.pem, --key root-key.pem
root-key.pem is the private key used to sign root-cert.pem and all generated site certificates. If this option is not given, the key in ~/.config/scraper-helper/root-key.pem is used; if that file doesn't exist either, a temporary key is generated. Only needed if the key isn't present in root-cert.pem.
-l [ip-addr:]port, --listen [ip-addr:]port
Listen for HTTP proxy connections on port. If ip-addr is given, listen only on that IP address; otherwise listen for connections on all IP addresses. Overrides the setting in ~/.config/scraper-helper/scraper-helper.cfg.

capture-dir
All logged data is written into this directory. See the FILES section.

 

REPLAY AN EARLIER CAPTURE

The second form (--replay trace-file) replays a trace file recorded by an earlier run of scraper-helper. Rather than listening as a proxy for requests to send to the server, the HTTP requests recorded in trace-file are sent. The result of the replay session is recorded and written to capture-dir in exactly the same way as a normal run. Scraper-helper uses the time stamps recorded in the trace file to keep the interval between requests close to what happened when the recording was made. The upstream proxies in use must be identical to those used when recording, otherwise the replay won't work.
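The pacing behaviour can be sketched as follows; the function names and timestamp handling here are illustrative assumptions, not scraper-helper's actual internals:

```python
# Sketch of replay pacing: wait between requests so the gaps match the
# gaps recorded in the trace file. Timestamps are assumed to be seconds.
import time

def replay_delays(timestamps):
    """Given recorded request timestamps, yield the delay to wait before
    sending each request after the first."""
    for earlier, later in zip(timestamps, timestamps[1:]):
        yield max(0.0, later - earlier)

def paced_replay(timestamps, send):
    """Send request 0 immediately, then pace the rest like the recording."""
    send(0)
    for i, delay in enumerate(replay_delays(timestamps), start=1):
        time.sleep(delay)
        send(i)
```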

Even identical HTTP requests will generate different responses, the two most common causes being session cookies and authentication. Scraper-helper has primitive support for differing cookies during replay: it records incoming cookies, and in requests replaces strings of the form <<{Cookie=NAME=}>> with the recorded cookie value.
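The substitution described above might look like the following sketch; the regular expression and the dict-based cookie store are assumptions for illustration, not scraper-helper's real implementation:

```python
# Sketch of <<{Cookie=NAME=}>> substitution: markers in a replayed
# request are replaced with the cookie values most recently recorded
# from the server's Set-Cookie headers (held here in a plain dict).
import re

_COOKIE_REF = re.compile(r"<<\{Cookie=([^=]+)=\}>>")

def substitute_cookies(request_text, recorded):
    """Replace every <<{Cookie=NAME=}>> marker with recorded[NAME]."""
    return _COOKIE_REF.sub(lambda m: recorded[m.group(1)], request_text)
```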

Recording a browser run, obtaining a successful playback, then experimentally removing and/or editing parts of the trace is a useful technique for discovering what is necessary and what isn't. For example, you can discover whether timing matters by removing the time stamp lines from the trace you recorded. One way to reduce the size of the trace is to do a first run so the browser caches what it can, close the tab and clear all site data (eg, cookies), then record the second run.

 

CREATING A ROOT CERTIFICATE

The third form of the command (--make-root-cert) creates a new X509 CA certificate, writing it to the files determined by --cert and --key. The certificate should then be installed into your browser as a Certificate Authority trusted to sign web site certificates.
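A self-signed CA certificate of this kind can be minted with the third-party Python cryptography package; this is a sketch of the equivalent operation, not necessarily how scraper-helper itself generates its certificates:

```python
# Mint a self-signed X509 CA certificate, roughly what --make-root-cert
# produces. Uses the third-party "cryptography" package; an illustration
# only, not scraper-helper's own code.
import datetime

from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa

def make_root_cert(common_name="scraper-helper root", days=365):
    """Return (cert_pem, key_pem) for a new self-signed CA certificate."""
    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, common_name)])
    now = datetime.datetime.now(datetime.timezone.utc)
    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)  # self-signed: issuer == subject
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(days=days))
        # Mark it as a CA so browsers accept the site certificates it signs.
        .add_extension(x509.BasicConstraints(ca=True, path_length=None),
                       critical=True)
        .sign(key, hashes.SHA256())
    )
    cert_pem = cert.public_bytes(serialization.Encoding.PEM)
    key_pem = key.private_bytes(
        serialization.Encoding.PEM,
        serialization.PrivateFormat.TraditionalOpenSSL,
        serialization.NoEncryption(),
    )
    return cert_pem, key_pem
```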

Warning: once installed, the certificate can be used to spy on your SSL connections, so always generate a new one and do not share it with anyone else.

Scraper-helper forces the browser to use weak encryption for the SSL connections between them (but not to the outside web), so a tool like wireshark(1) can decrypt those connections if given root-key.pem. This is useful for looking at the raw (compressed and chunked) HTTP bodies, as scraper-helper logs the decompressed and unchunked version.

 

FILES

capture-dir/root-key.pem
Written at the start of every logging run. Contains the private key for all X509 server certificates used to encrypt HTTPS requests between the browser and scraper-helper.py, and is also the key to the root certificate. Tools like wireshark(1) can decrypt the HTTPS requests if given this key. Purged at the start of the next run.
capture-dir/root-cert.pem
Written at the start of every logging run. Contains the root certificate used to sign all generated site certificates. Purged at the start of the next run.
capture-dir/trace
All events and every byte sent and received are logged to this file in the order they happened. Each line starts with a 4 digit connection ID, which starts at 0 and is incremented for each new connection. Data passing through the connection is decrypted, decompressed and then logged as quoted strings with non-ASCII characters escaped using the Python conventions. Comments on the connection carrying them aren't quoted.
capture-dir/0000.0000-site-uri.ext
Files with names of this format contain the bodies of the HTTP requests and responses, decrypted and decompressed. The first four characters of the file name are the connection ID, and the file name also appears in the capture-dir/trace file under the request that created it. The extension is derived from the MIME type. Purged at the start of the next run.
capture-dir/0000.0000-site-scraper-helper.py-mitm-cert.pem
The cert used to impersonate site. The first four characters of the file name are the connection ID. Purged at the start of the next run.
~/.config/scraper-helper/root-cert.pem
The root certificate used if --cert is not supplied on the command line.
~/.config/scraper-helper/root-key.pem
The private key used if --key is not supplied on the command line.
~/.config/scraper-helper/scraper-helper.cfg
Scraper-helper reads its options from this text file at startup. Only one line is recognised:

listen = [ip-addr:]port
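Parsing that single line might look like the following sketch; the exact rules (whitespace handling, error behaviour) are assumptions, not scraper-helper's documented behaviour:

```python
# Sketch of parsing the one recognised configuration line,
# "listen = [ip-addr:]port", into an (address, port) pair.
def parse_listen(line):
    """Return (ip_addr or None, port) for a 'listen = ...' line."""
    key, _, value = line.partition("=")
    if key.strip() != "listen":
        raise ValueError("unrecognised option: %r" % line)
    value = value.strip()
    addr, sep, port = value.rpartition(":")
    if not sep:            # bare port: listen on all IP addresses
        return None, int(value)
    return addr, int(port)
```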

 

AUTHOR

Russell Stuart, <russell-debian@stuart.id.au>.