SCRAPER-HELPER
NAME
scraper-helper - capture http and https connections
SYNOPSIS
scraper-helper
[ options ]
capture-dir
scraper-helper --replay
trace-file
capture-dir
scraper-helper
[ options ]
--make-root-cert
DESCRIPTION
Scraper-helper
is an http proxy that writes the http and https requests
passing through it to a log file.
The invocation:
-
scraper-helper -l 3128 capture-dir
starts an http proxy listening on port 3128.
All data passing through it is written to
capture-dir/trace,
and the bodies of the requests and responses are written
to individual files in
capture-dir.
The directory
capture-dir
will be created if it doesn't exist.
The logged data is decoded, decompressed and,
in the case of https, decrypted before being written to the log.
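With the proxy from the invocation above running, a client can be pointed at it.
A minimal sketch using Python's urllib (localhost and port 3128 are just the
values from the example invocation; adjust to match your --listen setting):

```python
import urllib.request

# Route both http and https requests through the capture proxy.
proxy = urllib.request.ProxyHandler({
    "http": "http://localhost:3128",
    "https": "http://localhost:3128",
})
opener = urllib.request.build_opener(proxy)

# opener.open("https://example.com/") would now pass through the
# proxy and be logged to capture-dir/trace.
```

For https the client must also trust the proxy's root certificate,
as described below.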
To decrypt https
scraper-helper
must perform a man-in-the-middle attack,
meaning it impersonates a Certificate Authority
by generating a temporary root certificate and
using it to create fake https site certificates as needed.
However, since your browser doesn't trust
the temporary root certificate, it will generate a warning
for each fake site certificate it sees.
To silence the warnings, pre-generate the root certificate using
--make-root-cert,
then load it into your browser's certificate store
and mark it as allowed to sign web site certificates.
Pass it to
scraper-helper
using the
--cert
option on future runs.
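Putting those steps together as command lines
(the file names here are examples only):

```shell
# One-off: create a reusable root certificate and key.
scraper-helper --make-root-cert --cert root-cert.pem --key root-key.pem

# Install root-cert.pem in the browser's certificate store, then
# pass the same certificate and key on every later capture run:
scraper-helper --cert root-cert.pem --key root-key.pem -l 3128 capture-dir
```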
Typically the options are read from the configuration file
rather than being supplied on the command line as was done here.
COMMAND LINE
- -c root-cert.pem, --cert root-cert.pem
-
root-cert.pem
is the X509 root certificate used to sign site certificates
to perform the man-in-the-middle attack on HTTPS.
If this option is not given, the certificate in
~/.config/scraper-helper/root-cert.pem
is used; if that file doesn't exist either, a temporary one is generated.
This file can also contain the private key.
'-'
forces a temporary root certificate to be used.
- -k root-key.pem, --key root-key.pem
-
root-key.pem
is the private key used to sign
root-cert.pem
and all generated site certificates.
If this option is not given, the key in
~/.config/scraper-helper/root-key.pem
is used; if that file doesn't exist either, a temporary one is generated.
Only needed if it isn't present in
root-cert.pem.
- -l [ip-addr:]port, --listen [ip-addr:]port
-
Listen for HTTP proxy connections on
port.
If
ip-addr
is given listen only on that IP address,
otherwise listen for connections on all IP addresses.
Overrides the setting in
~/.config/scraper-helper/scraper-helper.cfg.
- capture-dir
-
All logged data is written into this directory.
See the
FILES
section.
REPLAY AN EARLIER CAPTURE
The second form
(--replay
trace-file)
replays a trace file recorded by an earlier run of
scraper-helper.
Rather than listening as a proxy for requests to send
to the server,
the http requests recorded in
trace-file
are sent.
The result of the replay session is recorded and written to
capture-dir
in exactly the same way as a normal run.
Scraper-helper
uses the time stamps recorded in the trace file
to keep the interval between requests close to
what happened when the recording was made.
The upstream proxies in use must be identical to those used
when recording, otherwise the replay won't work.
Even identical http requests will generate different responses,
the two most common causes being session cookies and authentication.
Scraper-helper
has primitive support for different cookies during replay.
It records incoming cookies,
and in requests replaces strings of the form
<<{Cookie=NAME=}>>
with the recorded cookie value.
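The substitution can be sketched as follows (an illustration only, not the
actual implementation; the placeholder syntax is the one shown above):

```python
import re

def substitute_cookies(request_text, recorded_cookies):
    """Replace <<{Cookie=NAME=}>> placeholders with recorded values."""
    def lookup(match):
        name = match.group(1)
        # Leave the placeholder untouched if no cookie was recorded.
        return recorded_cookies.get(name, match.group(0))
    return re.sub(r"<<\{Cookie=([^=]+)=\}>>", lookup, request_text)

# A session id captured during recording is spliced into a replayed request:
request = "Cookie: sessionid=<<{Cookie=sessionid=}>>"
print(substitute_cookies(request, {"sessionid": "abc123"}))
# -> Cookie: sessionid=abc123
```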
Recording a browser run and getting a successful playback,
then experimentally removing and/or editing parts of the trace,
is a useful technique for discovering what is necessary and what isn't.
For example, you can discover whether timing matters
by removing the time stamp lines from the trace you recorded.
A way of reducing the size of the trace
is to do a run so the browser caches what it can,
close the tab and clear all site data (eg, cookies),
then record the second run.
CREATING A ROOT CERTIFICATE
The third form
(--make-root-cert)
of the command creates a new X509 CA certificate,
writing it to the files determined by
--cert
and
--key.
Root-cert.pem
should then be installed into your browser as a Certificate Authority
that is trusted to sign web site certificates.
Warning: once installed,
root-cert.pem
and its private key can be used to spy on your SSL connections,
so always generate a new one and do not share it with anyone else.
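For illustration, a root certificate and key of the same general kind can
also be produced by hand with openssl(1); this is only a sketch, and
--make-root-cert remains the supported way:

```shell
# Create an unencrypted private key and a self-signed CA certificate,
# roughly what --make-root-cert produces (the subject name is arbitrary).
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout root-key.pem -out root-cert.pem \
    -days 30 -subj "/CN=scraper-helper capture CA"
```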
Scraper-helper
forces the browser to use weak encryption for SSL connections
between the browser and itself (but not to the outside web)
so a tool like wireshark(1) can decrypt the SSL connections
if given
root-key.pem.
This is useful for looking at the raw (compressed and chunked)
HTTP bodies as
scraper-helper
logs the decompressed and unchunked version.
FILES
- capture-dir/root-key.pem
-
Written at the start of every logging run.
Contains the key for all X509 server certs
used to encrypt https requests between the browser and
scraper-helper.py.
It is also the key to the root certificate.
Tools like wireshark(1) can decrypt the https requests if given this key.
Purged at the start of the next run.
- capture-dir/root-cert.pem
-
Written at the start of every logging run.
Contains the root certificate used to sign all generated site certificates.
Purged at the start of the next run.
- capture-dir/trace
-
All events, and every byte sent and received, are logged to this file
in the order they happened.
Each line starts with a 4 digit connection ID
which starts at 0 and is incremented for each new connection.
Data passing through the connection is decrypted, decompressed
and then logged as quoted strings
with non-ASCII characters escaped using Python conventions.
Comments about the connection carrying them aren't quoted.
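Given that description, a data line can be picked apart with standard
Python tools. This is a sketch only: the 4 digit ID prefix and the
Python-style quoting are described above, but the exact field separator
is an assumption and the sample line is invented:

```python
import ast

# Hypothetical trace line: connection ID, then the logged data as a
# Python-quoted string.
line = '0003 "GET / HTTP/1.1\\r\\n"'

conn_id, _, payload = line.partition(" ")
data = ast.literal_eval(payload)  # undo the Python-style escaping
print(conn_id, repr(data))
```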
- capture-dir/0000.0000-site-uri.ext
-
Files with names of this format
are the bodies of the http requests and responses,
decrypted and decompressed.
The file name also appears in the
capture-dir/trace
file under the request that created it.
The first four characters of the file name are the connection ID.
The extension is derived from the MIME type.
Purged at the start of the next run.
- capture-dir/0000.0000-site-scraper-helper.py-mitm-cert.pem
-
The cert used to impersonate
site.
The first four characters of the file name are the connection ID.
Purged at the start of the next run.
- ~/.config/scraper-helper/root-cert.pem
-
The root certificate used if
--cert
is not supplied on the command line.
- ~/.config/scraper-helper/root-key.pem
-
The private key used if
--key
is not supplied on the command line.
- ~/.config/scraper-helper/scraper-helper.cfg
-
Scraper-helper
reads its options from this text file at startup.
Only one setting is recognised:
-
listen = [ip-addr:]port
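For example, to listen only on the loopback interface
(the ip-addr: part may be omitted):

```
listen = 127.0.0.1:3128
```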
AUTHOR
Russell Stuart, <russell-debian@stuart.id.au>.