Welcome to the DjaoDjin Blog!

A place to share experiences in building Software-as-a-Service.

Denying comment spam bots

by Sebastien Mirolo on Sat, 23 Apr 2011

It is kind of fun to look through your application logs and find traces of a hacker trying to break in. It might even be intellectually stimulating to play this game of hide and seek with another human being. Unfortunately most malicious attempts hitting your server will come from bots. Those don't get discouraged. Those don't change tactics. They keep trying to brute-force passwords, even when you only allow private key login in your ssh daemon. They keep trying to access PHP scripts, even when you do not have any PHP stack running on your web server. Worse, if you allow people to leave comments on your web site, you are almost guaranteed to attract spam bots that will waste precious bandwidth and mess up the statistics you use to learn about your audience.

You could hire an army of private investigators traveling around the world to unplug those bot machines. It might actually be a very cool job (I would definitely apply for it). You might even think of pulling off a good scenario à la "Blade Runner" to sell to Hollywood to offset the investigation cost.

For practical matters (or just because you are not huge on adventure around the world), you might want to try setting up iptables, fail2ban and spamassassin. The idea is to use spamassassin to categorize comments as spam or not, use fail2ban to dynamically insert rules into the firewall when an IP address clearly generates too much spam, and of course use iptables to prevent those machines from reaching your application stack.

iptables

Iptables comes pre-installed on all official Ubuntu distributions but unfortunately it does not come with logging enabled by default. Since we want to verify that our setup works and drops packets from banned addresses, the first thing to do is enable iptables logging. I also enjoyed reading Linux Firewalls Using iptables for generic information.


$ ls /etc/iptables.*
/etc/iptables.conf
$ grep -r 'iptables.conf' /etc
/etc/network/if-up.d/load-iptables:iptables-restore < /etc/iptables.conf
$ diff -U 1 /etc/iptables.conf.prev /etc/iptables.conf
--- /etc/iptables.conf.prev	2011-03-15 23:48:03.000000000 +0000
+++ /etc/iptables.conf	2011-03-16 00:27:30.000000000 +0000
@@ -3,2 +3,3 @@
 :FORWARD DROP [0:0]
+:LOGNDROP - [0:0]
 :OUTPUT DROP [0:0]
@@ -14,2 +15,7 @@
 -A OUTPUT -m state --state NEW,RELATED,ESTABLISHED -j ACCEPT 
+-A INPUT -j LOGNDROP
+-A LOGNDROP -p tcp -m limit --limit 5/min -j LOG\
  --log-prefix "Denied TCP: " --log-level 7
+-A LOGNDROP -p udp -m limit --limit 5/min -j LOG\
  --log-prefix "Denied UDP: " --log-level 7
+-A LOGNDROP -p icmp -m limit --limit 5/min -j LOG\
  --log-prefix "Denied ICMP: " --log-level 7
+-A LOGNDROP -j DROP
 COMMIT
$ iptables -N LOGNDROP
$ iptables -A INPUT -j LOGNDROP
$ iptables -A LOGNDROP -p tcp -m limit --limit 5/min -j LOG\
  --log-prefix "Denied TCP: " --log-level 7
$ iptables -A LOGNDROP -p udp -m limit --limit 5/min -j LOG\
  --log-prefix "Denied UDP: " --log-level 7
$ iptables -A LOGNDROP -p icmp -m limit --limit 5/min -j LOG\
  --log-prefix "Denied ICMP: " --log-level 7
$ iptables -A LOGNDROP -j DROP
$ iptables -L
Chain INPUT (policy DROP)
...
LOGNDROP   all  --  anywhere             anywhere            
...
Chain LOGNDROP (1 references)
target     prot opt source               destination         
LOG        tcp  --  anywhere             anywhere\
            limit: avg 5/min burst 5 LOG level debug prefix `Denied TCP: ' 
LOG        udp  --  anywhere             anywhere\
            limit: avg 5/min burst 5 LOG level debug prefix `Denied UDP: ' 
LOG        icmp --  anywhere             anywhere\
            limit: avg 5/min burst 5 LOG level debug prefix `Denied ICMP: ' 
DROP       all  --  anywhere             anywhere            
...
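
The three LOG rules above differ only by protocol, so they can also be generated in a loop. The following is a minimal sketch that prints the commands instead of running them; the function name logndrop_rules is made up for this example. Once the output looks right, it could be piped to sh as root.

```shell
#!/bin/sh
# Print (do not run) the iptables commands that build the LOGNDROP chain.
# logndrop_rules is a hypothetical helper name, not part of iptables.
logndrop_rules() {
    echo "iptables -N LOGNDROP"
    echo "iptables -A INPUT -j LOGNDROP"
    for proto in tcp udp icmp; do
        # The log prefix uses the upper-case protocol name, e.g. "Denied TCP: ".
        label=$(printf '%s' "$proto" | tr '[:lower:]' '[:upper:]')
        echo "iptables -A LOGNDROP -p $proto -m limit --limit 5/min -j LOG --log-prefix \"Denied $label: \" --log-level 7"
    done
    # Whatever reaches the end of the chain is dropped.
    echo "iptables -A LOGNDROP -j DROP"
}

logndrop_rules
```

Printing first and executing later keeps a typo from locking you out of a remote machine.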

From now on, iptables will use syslog to log dropped packets. Since iptables actually updates the firewall rules inside the kernel, its messages are kernel messages, so we first figure out how syslog is configured by looking for kern in /etc/syslog.conf.


$ grep kern /etc/syslog.conf
kern.*				-/var/log/kern.log

OK, so all kernel messages are going into the /var/log/kern.log file. We will later look there to correlate fail2ban banned IPs with packets dropped by iptables.
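
To get a quick feel for who is being dropped, the "Denied" lines can be counted per source address. This is only a sketch: the function name denied_sources is invented here, and it relies on the SRC=a.b.c.d field that the kernel LOG target writes on each line.

```shell
#!/bin/sh
# Count packets logged by the LOGNDROP chain, grouped by source address.
# Reads kernel log lines on stdin; denied_sources is a hypothetical helper name.
denied_sources() {
    awk '
        /Denied (TCP|UDP|ICMP): / {
            # The kernel LOG target writes the source address as SRC=a.b.c.d
            for (i = 1; i <= NF; i++)
                if ($i ~ /^SRC=/) count[substr($i, 5)]++
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn
}

# Usage: denied_sources < /var/log/kern.log
```

Keep in mind that the rules above are rate-limited to 5 log entries per minute, so the counts are a lower bound on the packets actually dropped.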

spamassassin

Spamassassin is a very well regarded spam filter for e-mails. We plan to route all comments to the web site through spamassassin as well. There does not seem to be any reason to think comment spam is any different from e-mail spam, and relying on a single spam filter daemon reduces complexity and maintenance cost.


$ aptitude install spamassassin
$ useradd -m -s /bin/false spamassassin
$ diff -U 1 /etc/postfix/master.cf.prev /etc/postfix/master.cf
--- postfix/master.cf.prev	2011-03-16 01:00:46.000000000 +0000
+++ /etc/postfix/master.cf	2011-03-12 22:36:43.000000000 +0000
@@ -10,3 +10,3 @@
 # ==========================================================================
-smtp      inet  n       -       -       -       -       smtpd 
-submission inet n       -       -       -       -       smtpd
+smtp      inet  n       -       -       -       -       smtpd 
+       -o content_filter=spamassassin
+submission inet n       -       -       -       -       smtpd
+       -o content_filter=spamassassin
#  -o smtpd_tls_security_level=encrypt
@@ -81,2 +81,4 @@
   ${nexthop} ${user}
+spamassassin unix  -       n       n       -       -       pipe
+   user=spamassassin argv=/usr/bin/spamc -e /usr/sbin/sendmail -oi\
  -f ${sender} ${recipient}

Postfix is a very versatile Mail Transfer Agent (MTA) that can be configured in many different ways to achieve similar results. Documentation related to spamassassin and filtering that is worth reading includes Integrating SpamAssassin with Postfix, Postfix Virtual Domain Hosting Howto and Postfix After-Queue Content Filter.

We created a special user account for spamassassin and used the content_filter= method on both smtp (for incoming e-mail from outside the network) and submission (for local e-mails). As described earlier, the semilla web application submits comments as e-mails through a local account on the mail server. I would have preferred to put the spamassassin filter later, i.e. just before delivery to the local agent, but I haven't managed to do that successfully yet. Right now, spamassassin also scans all outgoing e-mails (content_filter on the submission agent).

At this point, we can see in /var/log/mail.log that messages are filtered through spamassassin. A little bit of testing can be done by sending something like the following e-mail:


$ echo "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X"\
  | sendmail info
$ tail -f /var/log/mail.log
Mar 16 01:19:51 hostname spamd[26879]: spamd: identified spam\
  (1000.0/5.0) for spamassassin:1003 in 0.3 seconds, 1491 bytes. 
Mar 16 01:19:51 hostname spamd[26879]: spamd: result: Y 1000 \
  - GTUBE,HTML_MESSAGE scantime=0.3,size=1491,user=spamassassin,uid=1003,\
  required_score=5.0,rhost=ip6-localhost,raddr=127.0.0.1,rport=55454,mid=\
  <COL117-W56538D5CED9DE43AC69E3BA6CE0@phx.gbl>,autolearn=no 
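
Once messages flow through spamd, a rough tally of verdicts can be pulled out of mail.log. The sketch below assumes spamd logs non-spam as "clean message" (that is what the versions I have seen do, but it is not shown in the log excerpt above); spam_summary is a made-up name.

```shell
#!/bin/sh
# Summarize spamd verdicts found in mail.log lines read from stdin.
# Assumption: spamd logs "identified spam" or "clean message", each
# followed by "(score/threshold)". spam_summary is a hypothetical helper.
spam_summary() {
    awk '
        /spamd: identified spam \(/ { spam++ }
        /spamd: clean message \(/   { clean++ }
        END { printf "spam=%d clean=%d\n", spam, clean }
    '
}

# Usage: spam_summary < /var/log/mail.log
```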

fail2ban

We will now get fail2ban to dynamically insert rules for bots trying to break into ssh or obviously referencing pages that do not exist on the web site (such as PHP scripts).


$ aptitude install fail2ban
$ diff -U 3 /etc/fail2ban/jail.conf.prev /etc/fail2ban/jail.conf 
--- /etc/fail2ban/jail.conf.prev 2011-03-10 15:49:53.000000000 +0000
+++ /etc/fail2ban/jail.conf	2011-03-12 23:23:34.000000000 +0000
@@ -133,7 +133,7 @@
 
 [apache]
 
-enabled = false
+enabled = true
 port	= http,https
 filter	= apache-auth
 logpath = /var/log/apache*/*error.log
@@ -151,7 +151,7 @@
 
 [apache-noscript]
 
-enabled = false
+enabled = true
 port    = http,https
 filter  = apache-noscript
 logpath = /var/log/apache*/*error.log
@@ -159,7 +159,7 @@
 
 [apache-overflows]
 
-enabled = false
+enabled = true
 port    = http,https
 filter  = apache-overflows
 logpath = /var/log/apache*/*error.log

$ /etc/init.d/fail2ban restart
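
Before restarting, it can be handy to double-check which jails are actually switched on. Here is a small sketch that reads a jail file on stdin and prints the sections containing "enabled = true"; enabled_jails is a name I made up, it is not a fail2ban command.

```shell
#!/bin/sh
# List the jails that have "enabled = true" in a fail2ban jail file read
# from stdin. enabled_jails is a hypothetical helper, not a fail2ban command.
enabled_jails() {
    awk '
        # Remember the current [section] name without its brackets.
        /^\[.*\]$/ { jail = substr($0, 2, length($0) - 2); next }
        # Print the section when it is explicitly enabled.
        /^enabled[ \t]*=[ \t]*true[ \t]*$/ { print jail }
    '
}

# Usage: enabled_jails < /etc/fail2ban/jail.conf
```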

At this point, if you do see errors like "fail2ban.server : ERROR Unexpected communication error" in /var/log/fail2ban.log, you will need to apply the following patch to /usr/bin/fail2ban-server.


$ diff -U 1 /usr/bin/fail2ban-server.prev /usr/bin/fail2ban-server
--- fail2ban-server	2011-03-16 00:48:29.000000000 +0000
+++ /usr/bin/fail2ban-server	2011-03-15 19:55:13.000000000 +0000
@@ -1,2 +1,2 @@
-#!/usr/bin/python
+#!/usr/bin/python2.5
 # This file is part of Fail2Ban.

$ /etc/init.d/fail2ban restart

We now want to add a new jail in fail2ban for hosts that are identified as sending spam. However, if we look in /var/log/mail.log for spamd messages, we can see that no IP address is associated with the originator of a mail identified as spam. A little patch to /usr/sbin/spamd that prints the first IP address found in the "Received" header fields of a mail will do. At the same time, I modified the semilla web application to send mail with a specially crafted "Received" header containing the REMOTE_ADDR environment variable.


$ diff -u spamd.org /usr/sbin/spamd
--- spamd.org	2011-04-21 23:35:10.000000000 +0000
+++ /usr/sbin/spamd	2011-04-22 00:11:17.000000000 +0000
@@ -1593,7 +1593,10 @@
 
   my $scantime = sprintf( "%.1f", time - $start_time );
 
-  info("spamd: $was_it_spam ($msg_score/$msg_threshold) for\
  $current_user:$> in"
+  my @from_addrs = $mail->get_pristine_header("Received");
+  join("\n",@from_addrs) =~ m/(\[\d+\.\d+\.\d+\.\d+\])/;
+  my $from_addr = $1;
+  info("spamd: $was_it_spam ($msg_score/$msg_threshold) from\
  $from_addr for $current_user:$> in"
        . " $scantime seconds, $actual_length bytes." );
 
   # add a summary "result:" line, based on mass-check format

The spamd related lines in /var/log/mail.log thus now look like:


Apr 22 21:20:23 hostname spamd[17844]: spamd: identified spam\
   (999.0/5.0) from [remoteaddr] for spamassassin:1003\
   in 0.2 seconds, 2152 bytes.
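
On the sending side, the only requirement of the patched spamd is a "Received" header with a bracketed IPv4 address. The following sketch shows one way a comment could be wrapped; the helper name comment_as_mail, the "web-comment" label and the exact header layout are assumptions for illustration, not the format semilla actually uses.

```shell
#!/bin/sh
# Wrap a comment into a minimal e-mail whose Received header carries the
# client address in brackets, which is what the patched spamd looks for.
# comment_as_mail, the "web-comment" label and the header layout are
# assumptions for illustration; semilla's real format is not shown here.
comment_as_mail() {
    remote_addr=$1
    recipient=$2
    printf 'Received: from web-comment ([%s])\n' "$remote_addr"
    printf 'To: %s\n' "$recipient"
    printf 'Subject: comment\n'
    printf '\n'
    cat    # the comment body is read from stdin
}

# Usage (REMOTE_ADDR is set by the web server for CGI-style applications):
#   echo "Nice post" | comment_as_mail "$REMOTE_ADDR" info | sendmail -oi info
```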

It is now trivial to add the following filter in /etc/fail2ban/filter.d/spamassassin.conf


[Definition]
failregex = spamd: identified spam .* from [[]<HOST>[]]

ignoreregex = 

and the following jail in /etc/fail2ban/jail.conf


[spamassassin]
enabled  = true
port     = http,https,smtp,ssmtp
filter   = spamassassin
logpath  = /var/log/mail.log

The fail2ban-regex script is very convenient for checking that your filter expression does what you expect. Later, while the system is up and running, you can use fail2ban-client to check the status of the jail.


$ fail2ban-regex "Apr 22 21:20:23 hostname spamd[17844]: \
  spamd: identified spam (999.0/5.0) from [remoteaddr] for\
  spamassassin:1003 in 0.2 seconds, 2152 bytes." \
  "spamd: identified spam .* from [[]<HOST>[]]"
$ sudo fail2ban-client status spamassassin
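
When fail2ban is not installed on the machine at hand, the expression can still be sanity-checked with sed. fail2ban filters mark the offending address with the <HOST> tag; the sketch below substitutes a plain dotted IPv4 pattern for it, which is narrower than what fail2ban itself accepts. The name banned_hosts is made up.

```shell
#!/bin/sh
# Approximate what fail2ban does with a failregex: replace the <HOST> tag by
# an address pattern, then match log lines. Here <HOST> is narrowed to a
# dotted IPv4 address, which is simpler than fail2ban's own expansion.
# banned_hosts is a hypothetical helper name.
failregex='spamd: identified spam .* from [[]<HOST>[]]'
ipv4='([0-9]{1,3}\.){3}[0-9]{1,3}'

banned_hosts() {
    # Split the expression around the <HOST> tag.
    prefix=${failregex%%'<HOST>'*}
    suffix=${failregex#*'<HOST>'}
    # Print the captured address for every matching line read from stdin.
    sed -nE "s/.*${prefix}(${ipv4})${suffix}.*/\1/p"
}

# Usage: banned_hosts < /var/log/mail.log | sort -u
```

The [[] and []] bracket expressions match a literal "[" and "]" here just as they do in the filter.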

lire

At this point, iptables, spamassassin and fail2ban are configured to ban spam bots from hitting our application stack. That is all great, but without statistics and reports there is no easy way to find out how effective the solution is. So I started to investigate log reporting tools. Lire seemed the most promising, so I started there. Since lire is present in the Ubuntu repository, it is a breeze to install.


$ sudo aptitude install lire

lr_log2report seems to be the major command to generate reports.


$ lr_log2report --help dlf-converters
...
iptables         Iptables firewall log
...
postfix          postfix log file
...
spamassassin     spamassassin log file
...

If you run into the following error while running your first report, you will have to apply a little patch to /usr/share/perl5/Lire/DlfStore.pm


$ lr_log2report postfix /var/log/mail.log 
Parsing log file using postfix DLF Converter...
lr_log2report: ERROR store doesn't contain a 'lire_import_log'\
   stream at /usr/share/perl5/Lire/DlfConverterProcess.pm line 170
$ diff -u DlfStore.pm.prev /usr/share/perl5/Lire/DlfStore.pm
 sub dlf_streams {
     my $self = $_[0];
 
     my @streams = ();
-    my $sth = $self->{'_dbh'}->table_info( "", "", "dlf_%", "TABLE" );
-    $sth->execute();
-    while ( my $table_info = $sth->fetchrow_hashref() ) {
-        next unless $table_info->{'TABLE_NAME'} =~ /^dlf_(.*)/;
-        next if $table_info->{'TABLE_NAME'} =~ /_links$/;
-        push @streams, $1;
-    }
-    $sth->finish();
+ # JB : table_info seems to fail
+    my @table_list = $self->{'_dbh'}->tables;
+    foreach my $table ( @table_list) {
+        next unless $table =~ /dlf_(.*)"/;
+ 	 next if $table =~ /_links"?$/;
+ 	 push @streams, $1;
+     }
      return @streams;
 }
$ lr_log2report iptables /var/log/kern.log
$ lr_log2report postfix /var/log/mail.log
$ lr_log2report spamassassin /var/log/mail.log
$ lr_log2report combined /var/log/apache2/domainname-access.log

We also want to add a converter for fail2ban logs so that we can correlate fail2ban actions with packets dropped by iptables. Since fail2ban adds rules into the firewall through iptables, we will base its lire schema on the firewall schema (/usr/share/lire/schemas/firewall.xml). We then add a Fail2BanConverter.pm perl script based on one of the existing converters (for example /usr/share/perl5/Lire/Firewall/IpfilterDlfConverter.pm) and a fail2ban_init file to load our converter into the lire executable. The relevant lines are:


$ cat /usr/share/perl5/Lire/Firewall/Fail2BanConverter.pm
...
sub process_log_line {
    my ( $self, $process, $line ) = @_;
    use Time::Local;
    chomp $line;
    # Fields: date, time, logger name, level, [jail], action, source address
    my ( $date, $time, $name, $warning, $jail, $action, $source )
        = split / /, $line, 7;
    if ( !defined $action || $action ne 'Ban' ) {
        $process->ignore_log_line( $line, "not a Ban record" );
        return;
    }
    my ( $year, $month, $day, $hours, $min, $sec ) = "$date $time"
        =~ /^(\d\d\d\d)-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d),\d\d\d$/;
    if ( !defined $sec ) {
        $process->error( "unrecognized timestamp", $line );
        return;
    }
    # timelocal expects a zero-based month (0..11), hence $month - 1.
    # Replace 'timelocal' with 'timegm' if your input date is GMT/UTC.
    my $timestamp
        = timelocal( $sec, $min, $hours, $day, $month - 1, $year );
    my $dlf_rec = {};
    $dlf_rec->{time} = $timestamp;
    $dlf_rec->{action} = "denied";
    $dlf_rec->{protocol} = "TCP";
    $dlf_rec->{rule} = $jail;
    $dlf_rec->{from_ip} = $source;
    $dlf_rec->{count} = 1;
    $process->write_dlf( "firewall", $dlf_rec );
    return;
}
...
$ cat /etc/lire/plugins/fail2ban_init
use Lire::PluginManager;
use Fail2BanConverter;

Lire::PluginManager->register_plugin(
            Fail2BanConverter->new() );

$ lr_log2report fail2ban /var/log/fail2ban.log
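
For a quick look without going through lire, the same Ban records can be listed straight from the log. This sketch assumes the space-separated layout the Perl converter relies on (date, time, logger, level, [jail], action, address); banned_by_jail is an invented name.

```shell
#!/bin/sh
# List "jail address" pairs for every Ban record in fail2ban.log lines read
# from stdin, using the same space-separated fields as the Perl converter
# (date, time, logger, level, [jail], action, address).
# banned_by_jail is a hypothetical helper name.
banned_by_jail() {
    awk '$6 == "Ban" {
        jail = $5
        gsub(/^\[|\]$/, "", jail)   # strip the surrounding brackets
        print jail, $7
    }' | sort -u
}

# Usage: banned_by_jail < /var/log/fail2ban.log
```

The addresses it prints are the ones to look for in the SRC= fields of the "Denied" lines in /var/log/kern.log.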

On Ubuntu, "aptitude install lire" will set up the appropriate cron jobs to send e-mail reports by running /usr/sbin/lr_vendor_cron.


$ find /etc -name '*lire*'
$ cat /etc/cron.weekly/lire
LIREUSER='lire' /usr/sbin/lr_vendor_cron weekly
$ less /usr/sbin/lr_vendor_cron
...
for d in /etc/sysconfig/lire.d /etc/default/lire.d
do
    test -d $d && CONFDIR=$d && break
done
...
for f in $CONFDIR/*.cfg
do
...

So we will add a few more .cfg files for spamassassin, iptables and fail2ban.

Be aware that testing /usr/sbin/lr_vendor_cron from the command line is a little tricky. You will most likely run into cryptic su errors because of the following line in /usr/sbin/lr_vendor_cron.


eval "$filter" < $logfile | \
    su - $LIREUSER -c \
    "lr_log2mail -s '$rotateperiod $service report from $logfile'\
  $extraopts $service root" 2>&1 | logger -p $PRIORITY -t lire

Conclusion

Voila, we are now running spamassassin on all comments posted through the web interface. Traffic from remote machines dynamically identified as spam originators is actively dropped before reaching our application stack. One last word: we had to set up a second aliases database so that the comment archiver writes files as the correct owner.