Wherein we debug what we started earlier.

If you ran the example at the end of the tap.pl post, you would have seen a blank screen. Excitement.

But why?

Let’s debug.

Logs:

tap.pl creates a log file, default name tap.log. [Not a creative name; in programming, creativity is less of a virtue than predictability.]

In tap.log you’ll see something like:

Waiting
2012/08/18 13:39:58
Connection accepted from 127.0.0.1:51441
Connecting to benaveling.wordpress.com:80
Connection made
2012/08/18 13:39:58 >>> {{{GET / HTTP/1.1
Host: localhost:1080
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

}}} >>> (290 bytes).
2012/08/18 13:39:59 <<< {{{HTTP/1.1 200 OK
Server: nginx
Date: Sat, 18 Aug 2012 03:39:42 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Vary: Cookie
Content-Encoding: gzip

}}} <<< (224 bytes).
2012/08/18 13:39:59 <<< {{{14
^_^@^@^@^@^@^@^C^C^@^@^@^@^@^@^@^@^@
0

}}} <<< (31 bytes).
2012/08/18 13:39:59 >>> {{{GET /favicon.ico HTTP/1.1
Host: localhost:1080
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

}}} >>> (301 bytes).
2012/08/18 13:40:00 <<< {{{HTTP/1.1 200 OK
Server: nginx
Date: Sat, 18 Aug 2012 03:39:43 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Vary: Cookie
X-nc: HIT luv 48
Content-Encoding: gzip

14
}}} <<< (246 bytes).
2012/08/18 13:40:00 <<< {{{^_^@^@^@^@^@^@^C^C^@^@^@^@^@^@^@^@^@
0

}}} <<< (27 bytes).
Client disconnected

The first 5 lines are clear enough. A connection is accepted from the client (the browser), and a new connection is made to the server (wordpress). After that, messages go back and forth, and finally, the client disconnects.

Each message between the two is prefixed with “>>>” or “<<<” to indicate if it is from client to server or from server to client. The message text is wrapped with “{{{” and “}}}” and the size of the message is printed after the message.
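If you want a quick summary of a log rather than reading all of it, the trailer after each message is easy to pick out with a regular expression. The following is a minimal sketch, not part of tap.pl: the function name bytes_by_direction is made up for illustration, the sample log is abbreviated (so the byte counts are the ones from the log above, not the size of the shortened text), and it assumes no message body happens to contain something that looks like a trailer.

```perl
use strict;
use warnings;

# Add up the sizes logged in each direction, using the
# "}}} >>> (N bytes)." or "}}} <<< (N bytes)." trailer that follows each message.
sub bytes_by_direction {
    my ($log_text) = @_;
    my %total = ('>>>' => 0, '<<<' => 0);
    while ($log_text =~ /\}\}\} (>>>|<<<) \((\d+) bytes\)\.$/mg) {
        $total{$1} += $2;
    }
    return %total;
}

# Abbreviated sample in the format shown above
my $sample = <<'END';
2012/08/18 13:39:58 >>> {{{GET / HTTP/1.1
Host: localhost:1080

}}} >>> (290 bytes).
2012/08/18 13:39:59 <<< {{{HTTP/1.1 200 OK
Server: nginx

}}} <<< (224 bytes).
2012/08/18 13:39:59 <<< {{{14
...
0

}}} <<< (31 bytes).
END

my %total = bytes_by_direction($sample);
print "client to server: $total{'>>>'} bytes\n";
print "server to client: $total{'<<<'} bytes\n";
```

To run it over a real log, read tap.log into a string and pass that in instead of $sample.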

Looking at the individual messages, we can see that my browser sent a GET and received two responses – one of which is in binary and may look quite different in your editor. And then, much the same again, 1 GET with two responses (strictly speaking, they represent one response that is delivered in two parts). And then, nothing.
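The "14" and "0" lines around the binary data come from the "Transfer-Encoding: chunked" header: each chunk is sent as its size in hexadecimal, the data, and a line break, and a chunk of size 0 ends the body. Hex 14 is 20, so the whole body is 20 bytes. Here is a minimal sketch of decoding such a body. The function name decode_chunked is made up for illustration, and the 20 bytes are my reconstruction from the control characters in the log plus the standard gzip header, so treat them as an assumption; they are what an empty gzip stream looks like, which would fit with the blank page.

```perl
use strict;
use warnings;

# Decode an HTTP/1.1 chunked body. Each chunk is "<hex size>\r\n<data>\r\n",
# and a chunk of size 0 marks the end of the body.
sub decode_chunked {
    my ($raw) = @_;
    my $body = '';
    while ($raw =~ s/\A([0-9A-Fa-f]+)[^\r\n]*\r\n//) {
        my $size = hex($1);
        last if $size == 0;
        $body .= substr($raw, 0, $size);
        substr($raw, 0, $size + 2) = '';   # drop the data and the CRLF after it
    }
    return $body;
}

# Assumed reconstruction of the logged payload: a 20 byte empty gzip stream
my $empty_gzip = "\x1f\x8b\x08" . ("\x00" x 6) . "\x03\x03\x00" . ("\x00" x 8);

# The third message in the log: one chunk of hex 14 bytes, then the end marker
my $message = "14\r\n" . $empty_gzip . "\r\n" . "0\r\n\r\n";
my $payload = decode_chunked($message);

printf "message is %d bytes, payload is %d bytes\n",
    length($message), length($payload);
```

The message length agrees with the "(31 bytes)" in the log. The second time around, the "14" line arrived together with the headers, which is why that binary message is logged as 27 bytes.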

Why? What should we do next?

What I did was vary, very very slightly, the url I was pointing the browser at. Instead of localhost:1080, I tried 127.0.0.1:1080.

I wasn’t expecting it to make a difference, but it did. Try it for yourself. Instead of a blank page, it comes back with “127.wordpress.com doesn’t exist”.

So, what happened there?

The browser still connects to tap.pl. Tap.pl still connects to benaveling.wordpress.com.

The difference is in the messages the browser sends.

This was what was sent the first time:

GET / HTTP/1.1
Host: localhost:1080
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

(Your browser is likely to send something a little bit different, but probably not completely different.)

The second time, the message was:

GET / HTTP/1.1
Host: 127.0.0.1:1080
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Connection: keep-alive

The critical difference is in the second line.

That line tells the server what host the browser thinks it is connecting to. You’d be forgiven for thinking that the server knows its own name. In practice, that isn’t always the case.

WordPress don’t have one server per blog. They have many servers, each of which serves many blogs. To know which blog is being accessed, the server reads the Host line from the HTTP request headers.
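To make that concrete, here is a rough sketch of how a server that hosts many sites might pick one based on the Host line. This is an illustration only, not how WordPress actually does it: the lookup table and the names host_of and blog_for are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical table: the blogs served by one shared server, keyed by host name
my %blog_for_host = (
    'benaveling.wordpress.com' => 'benaveling',
    'example.wordpress.com'    => 'example',
);

# Pull the host name (without any port) out of a raw HTTP request
sub host_of {
    my ($request) = @_;
    return $request =~ /^Host:[ \t]*([^\s:]+)(?::\d+)?[ \t]*\r?$/mi ? lc($1) : undef;
}

# Decide which blog a request is for; undef if the host is not one we serve
sub blog_for {
    my ($request) = @_;
    my $host = host_of($request);
    return undef unless defined $host;
    return $blog_for_host{$host};
}

for my $host_line ('benaveling.wordpress.com', 'localhost:1080') {
    my $request = "GET / HTTP/1.1\r\nHost: $host_line\r\n\r\n";
    my $blog    = blog_for($request);
    print "Host $host_line: ", (defined $blog ? "serve blog $blog" : 'no such blog'), "\n";
}
```

A request that arrives with "Host: localhost:1080" gives this toy server nothing to match against, which is the situation tap.pl puts the real server in.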

And although tap.pl sends to wordpress exactly what was sent to it, what was sent to tap.pl wasn’t what would be sent if the browser were connected directly – the Host line is different. That often doesn’t matter, but in this case it does. It prevents wordpress from knowing which blog we want to read.

Fortunately, there is an easy way to repair the situation.

Usage:

tap_wp.pl remote_ip remote_port [local_port [log_file]]

remote_ip, remote_port: the host and port to connect to. No defaults provided.
local_port: the port to listen to. default: remote_port
log_file: name of log file. default: tap.log

Code for tap_wp.pl:

#! /usr/local/bin/perl -w
#
# (C)opyright Ben Aveling 2012, except for the bits that are taken from the 'Blue Camel book' (See http://c2.com/cgi/wiki?DefinitivePerlBooks)
# 
# This script may be reproduced freely. 
# 
# #################
# General Behaviour
# #################
#
# This program takes a dump copy of a tcp/ip session
#
# It keeps a socket open, and any time that socket is connected to,
# another connection is opened to the remote host and port given on the
# command line.  Any messages from either are passed to the other socket,
# and a copy is kept in the log file
#
# This is a copy of tap.pl, plus extra magic to persuade wordpress to respond properly
#
# #######
# History 
# #######
# 2012.08.18 First published version
#
# #####
# Usage
# #####
#
my $usage = "
  tap_wp.pl remote_ip remote_port [local_port [log_file]]

	remote_ip,remote_port: the host and port to connect to
	local_port: the port to listen to. default: remote_port
	log_file: name of log file. default: tap.log

  eg, to watch what happens between your browser and wordpress, run the following then point your browser at http://localhost:1080
	tap_wp.pl benaveling.wordpress.com 80 1080
";
# #######################################################################

#
# initialisation
#

# Tells perl to complain if it sees any code that looks dodgy
use strict;

# Load the libraries we need
require 5.002;
use Socket;
use FileHandle;

# Unbuffer standard output
$|=1;

# We will be spawning child processes. We don't want to create 'zombies' and we don't want to have to 'wait' for our children, so we 'ignore' them.  This next line does that.
$SIG{CHLD} = "IGNORE";

# parse parameters

my $remote_ip_address = shift or die $usage;
my $remote_port_num = shift or die $usage;
my $local_port_num = shift || $remote_port_num;
my $log_file = shift || "tap.log";

# open tcp/ip socket - see blue camel book pg 349

my $protocol = getprotobyname('tcp');
socket(LISTEN, PF_INET, SOCK_STREAM, $protocol)
  or die "Can't create socket: $!";
bind(LISTEN, sockaddr_in($local_port_num, INADDR_ANY))
  or die "Can't bind socket: $!";
listen(LISTEN,1)
  or die "Can't listen to socket: $!";

# log file

if( open(LOGFILE, ">>$log_file") )
{
  warn "Logging to $log_file\n";
}
else
{
  warn "Can't open $log_file: $!";
}
binmode(LOGFILE);
select(LOGFILE);$|=1;select(STDOUT);

echo("Waiting\n");

#
# Main Loop
#

my $client_paddr;

# loop forever. when a new connection arrives, spawn a child to handle it then go back to waiting
while(1)
{
  # Accept a new connection - see blue camel book
  $client_paddr = accept(CLIENT, LISTEN);
  select(CLIENT);$|=1;select(STDOUT);
  binmode(CLIENT);
  # call fork to start a new process. Fork is called once, but returns twice, returning different values to the parent and the child. The child process drops out of the while loop.
  last if ! fork();
  # The parent process closes CLIENT, since CLIENT is for the exclusive use of the child. It then goes back to the start of the while(1) loop
  close CLIENT;
}

# from here on is all the child process

# the child process closes LISTEN because LISTEN is only for the use of the parent process
close LISTEN;

# Report to the user that a new connection has been accepted
my ($client_port, $client_iaddr) = sockaddr_in( $client_paddr );
echo(mydate(),"\nConnection accepted from ", inet_ntoa($client_iaddr), ":$client_port\n");

# Chain to wherever - see blue camel book
socket(SERVER, PF_INET, SOCK_STREAM, $protocol)
  or die "Can't create socket: $!";
my $remote_ip_aton = inet_aton( $remote_ip_address )
  or die "Can't resolve $remote_ip_address";
my $remote_port_address = sockaddr_in($remote_port_num, $remote_ip_aton )
  or die "Can't get port address: $!";
echo("Connecting to $remote_ip_address\:$remote_port_num\n");
connect(SERVER, $remote_port_address)
  or die "Can't connect to socket: $!";
select(SERVER);$|=1;select(STDOUT);
binmode(SERVER);

echo("Connection made\n");

# use $rin as a 'bit-array' - see blue camel book
my $rin = "";

# Set one bit in $rin for each filehandle we want to listen on
vec($rin, fileno(CLIENT), 1) = 1;
vec($rin, fileno(SERVER), 1) = 1;

while( 1 )
{
  # wait 0.1 seconds, to potentially allow the system to rejoin fragmented messages
  select ( undef, undef, undef, .1 );

  # check for incoming messages from client or server - see blue camel book
  my $rout = $rin;
  select( $rout, "", "", undef ) ;

  # message from client?
  if( vec($rout,fileno(CLIENT),1) )
  {
    # read from CLIENT
    sysread(CLIENT,$_,100000);

    # (0 length message means connection closed)
    if(length($_) == 0)
    { 
      echo("Client disconnected\n");
      close SERVER;
      last ;
    }
    # write to SERVER (and log file). But first, change the Host to what the server expects it to be.
    s/Host: localhost:\d+/Host: $remote_ip_address:$remote_port_num/;
    print LOGFILE (mydate()," >>> {{{", $_, "}}} >>> (",length($_) ," bytes).\n");
    print (mydate()," >>> ",length($_) ," bytes.\n");
    print(SERVER $_) || die;
  }

  # message from server?
  if( vec($rout,fileno(SERVER),1) )
  {
    # Read from SERVER
    sysread(SERVER,$_,100000);

    # (0 length message means connection closed)
    if(length($_) == 0)
    {
      echo("Server disconnected\n");
      close CLIENT;
      last;
    }
    # else, write to CLIENT (and log file)
    print LOGFILE (mydate()," <<< {{{",$_,"}}} <<< (",length($_) ," bytes).\n");
    print (mydate()," <<< ",length($_) ," bytes.\n");
    print(CLIENT $_) || die;
  }
}

echo(mydate()," Disconnected\n\n");

close CLIENT ;
close SERVER ;

#######
# Subs
#######
sub echo
{
  print @_;
  print LOGFILE @_;
}

sub mydate
{
  my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime();
  # month is returned in the range 0 to 11, where 0=January, 11=December. year is returned in years since 1900.
  return sprintf "%04d/%02d/%02d %02d:%02d:%02d" ,$year+1900,$mon+1,$mday,$hour,$min,$sec;
}

Commentary:

If you compare the two scripts, tap.pl and tap_wp.pl, you’ll see that tap_wp.pl has one extra line:

s/Host: localhost:\d+/Host: $remote_ip_address:$remote_port_num/;

What this line does is change the Host header line, from what the browser sends when connecting to localhost into what the browser would send if connecting directly to the host that tap_wp.pl is connecting to – in our case, benaveling.wordpress.com.
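Note that the substitution only matches "Host: localhost:", so pointing the browser at 127.0.0.1:1080 would still send the wrong Host through. A possible variant, shown here as a standalone sketch rather than as a change to tap_wp.pl, replaces whatever Host line the browser sent. The function name rewrite_host is made up for illustration, and it assumes the whole Host line arrives in a single read.

```perl
use strict;
use warnings;

# Variant of the substitution: replace the Host line whatever it contains,
# so that localhost:1080 and 127.0.0.1:1080 are treated the same way.
sub rewrite_host {
    my ($request, $remote_host, $remote_port) = @_;
    $request =~ s/^Host:[^\r\n]*/Host: $remote_host:$remote_port/mi;
    return $request;
}

for my $sent ('localhost:1080', '127.0.0.1:1080') {
    my $request = "GET / HTTP/1.1\r\nHost: $sent\r\nConnection: keep-alive\r\n\r\n";
    my ($host_line) =
        rewrite_host($request, 'benaveling.wordpress.com', 80) =~ /^(Host:[^\r\n]*)/m;
    print "'Host: $sent' becomes '$host_line'\n";
}
```

Only the first Host line in the data is changed and everything else is passed through untouched, which keeps the tap as transparent as it can be.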

To watch what now happens:

  1. download tap_wp.pl
  2. stop the existing tap.pl script, if necessary
  3. run:
    tap_wp.pl benaveling.wordpress.com 80 1080
  4. point your browser at http://localhost:1080/2012/08/18/tap_wp-pl

This time, it serves up the expected web page, exactly as wordpress created it.

Of course, using wp.com to demonstrate this concept is less elegant than having a test server under our own control, which will be the subject of the next post.

Exercise for the Reader:

You’ll note that if you display http://localhost:1080/2012/08/18/tap_wp-pl, and then click on any of the links on it, your browser will successfully follow the link and directly access the pages from wordpress, bypassing tap_wp.

If you want to have your browser access those pages via tap_wp, you’ll need to add a line or two of code to tap_wp so that when it returns a page to the browser it tweaks the urls on the page, much as we did for the Host: line that the client sends to the server.
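As a starting point, the tweak itself might look something like the sketch below. The function name rewrite_links is made up for illustration, and the sketch only covers the string substitution. It assumes the page text arrives uncompressed, which is not the case in the logs above (the responses are gzipped), and changing the length of the text would also mean adjusting the chunk sizes or Content-Length, which this does not attempt.

```perl
use strict;
use warnings;

# Point absolute links to the remote host back at the local tap.
# Only http:// links to exactly $remote_host (with or without :80) are changed.
sub rewrite_links {
    my ($html, $remote_host, $local_port) = @_;
    $html =~ s{http://\Q$remote_host\E(?::80)?(?=[/"'\s]|\z)}{http://localhost:$local_port}g;
    return $html;
}

my $page = '<a href="http://benaveling.wordpress.com/2012/08/18/tap_wp-pl">tap_wp</a>'
         . ' <a href="http://c2.com/cgi/wiki?DefinitivePerlBooks">books</a>';

print rewrite_links($page, 'benaveling.wordpress.com', 1080), "\n";
```

Links to other sites are left alone, so they would still go direct.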
