[Perl Memo] Spam Detection with the Algorithm::NaiveBayes Module

Posted by kumacchi on November 12, 2010

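Building on the article above, I put together a spam-filtering routine.

Preparation

First, create a folder named spam and a folder named ham.

Put the text of a few dozen spam mails (Shift_JIS) in the spam folder.
Put the text of a few dozen ordinary mails (Shift_JIS) in the ham folder.

Script that creates the text file holding the spam data (spam.txt)
spam.pl (UTF-8)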

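use strict;
use warnings;
use utf8;
use Encode qw/from_to decode_utf8 encode_utf8 encode decode/;
use Encode::Guess qw/ascii euc-jp 7bit-jis cp932/;
use Text::Kakasi;

my %ha = ();    # word => number of occurrences across all spam mails
opendir(DIR, './spam') or die "cannot open ./spam: $!";
while (my $file = readdir(DIR)) {
    next if ($file eq '.');
    next if ($file eq '..');
    print "file = [$file]\n";

    # Slurp the whole mail at once
    $/ = undef;
    open(FILE, "./spam/$file") or die "cannot open ./spam/$file: $!";
    my $text = <FILE>;
    close(FILE);
    $/ = "\n";

    # Guess the source encoding and decode to Perl's internal string format
    $text = decode("Guess", $text);

    $text =~ s/\x0a//g;
    $text =~ s/\x0d//g;
    # Full-width space to ASCII space (presumed intent; the pattern was lost in the listing)
    $text =~ s/\x{3000}/ /g;

    # Hand cp932 bytes to Kakasi; -w makes it insert spaces between words
    $text = encode('cp932', $text);

    my $res = Text::Kakasi::getopt_argv('-w');
    my $str = Text::Kakasi::do_kakasi($text);

    my @a = split(/ /, $str);
    foreach my $key (@a) {
        my $utf8 = decode('cp932', $key);
        next if (length($utf8) < 2);     # skip single characters
        next if (length($utf8) > 30);    # skip overly long tokens
        $ha{$utf8}++;
    }
}
closedir(DIR);

# Write "word<TAB>count" lines, most frequent word first
open(FILE, ">spam.txt") or die "cannot open spam.txt: $!";
binmode FILE, ':utf8';
foreach my $key (sort { $ha{$b} <=> $ha{$a} } keys %ha) {
    print FILE join("\t", $key, $ha{$key}), "\n";
}
close(FILE);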

Output

F:\kumacchi\MyProgram\perl\sample\Bayes>perl spam.pl
file = [Export000000.eml]
file = [Export000001.eml]
file = [Export000002.eml]
file = [Export000003.eml]
file = [Export000004.eml]
file = [Export000005.eml]
file = [Export000006.eml]
file = [Export000007.eml]
file = [Export000008.eml]
file = [Export000009.eml]
file = [Export000010.eml]
file = [Export000011.eml]
file = [Export000012.eml]
file = [s1.txt]

F:\kumacchi\MyProgram\perl\sample\Bayes>

Excerpt from the generated spam data file. Each line holds a word and its occurrence count, separated by a TAB.

spam.txt

メール    13
！！    13
2010/11/11),    13
-*-    12
男性    12
yahoo.co.jp    12
人妻    12
to    12
融資    12
and    11
登録    10
you    10
下さい    10
Thu,    10
など    10
相手    10

(rest omitted)
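The counting step behind this file can be sketched without Kakasi. This is only an illustration: `count_tokens` is a name I made up, the input is assumed to be already split into words by spaces, and the sample strings are invented. The 2-to-30 character filter is the same one spam.pl uses.

```perl
use strict;
use warnings;
use utf8;

# Count tokens in already-segmented text, mirroring the length filter
# used in spam.pl (keep tokens of 2 to 30 characters).
sub count_tokens {
    my ($segmented, $counts) = @_;
    for my $token (split / +/, $segmented) {
        my $len = length $token;
        next if $len < 2 || $len > 30;
        $counts->{$token}++;
    }
    return $counts;
}

# Hypothetical sample input: two "mails" already split into words.
my %counts;
count_tokens('メール 登録 下さい ！！ a', \%counts);
count_tokens('メール 融資 登録', \%counts);

binmode STDOUT, ':encoding(UTF-8)';
# Most frequent first; ties broken by string order for stable output.
for my $word (sort { $counts{$b} <=> $counts{$a} || $a cmp $b } keys %counts) {
    print join("\t", $word, $counts{$word}), "\n";
}
```

Because of `use utf8`, `length` counts characters rather than bytes, so a two-character token like ！！ passes the filter while the single letter `a` is dropped.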

ham.txt is created the same way with the following script.

ham.pl

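use strict;
use warnings;
use utf8;
use Encode qw/from_to decode_utf8 encode_utf8 encode decode/;
use Encode::Guess qw/ascii euc-jp 7bit-jis cp932/;
use Text::Kakasi;

my %ha = ();    # word => number of occurrences across all ham mails
opendir(DIR, './ham') or die "cannot open ./ham: $!";
while (my $file = readdir(DIR)) {
    next if ($file eq '.');
    next if ($file eq '..');
    print "file = [$file]\n";

    # Slurp the whole mail at once
    $/ = undef;
    open(FILE, "./ham/$file") or die "cannot open ./ham/$file: $!";
    my $text = <FILE>;
    close(FILE);
    $/ = "\n";

    # Guess the source encoding and decode to Perl's internal string format
    $text = decode("Guess", $text);

    $text =~ s/\x0a//g;
    $text =~ s/\x0d//g;
    # Full-width space to ASCII space (presumed intent; the pattern was lost in the listing)
    $text =~ s/\x{3000}/ /g;

    # Hand cp932 bytes to Kakasi; -w makes it insert spaces between words
    $text = encode('cp932', $text);

    my $res = Text::Kakasi::getopt_argv('-w');
    my $str = Text::Kakasi::do_kakasi($text);

    my @a = split(/ /, $str);
    foreach my $key (@a) {
        my $utf8 = decode('cp932', $key);
        next if (length($utf8) < 2);     # skip single characters
        next if (length($utf8) > 30);    # skip overly long tokens
        $ha{$utf8}++;
    }
}
closedir(DIR);

# Write "word<TAB>count" lines, most frequent word first
open(FILE, ">ham.txt") or die "cannot open ham.txt: $!";
binmode FILE, ":utf8";
foreach my $key (sort { $ha{$b} <=> $ha{$a} } keys %ha) {
    print FILE join("\t", $key, $ha{$key}), "\n";
}
close(FILE);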

Output

F:\kumacchi\MyProgram\perl\sample\Bayes>ham.pl
file = [Export000000.eml]
file = [h1.txt]
file = [mail_ham1.txt]

F:\kumacchi\MyProgram\perl\sample\Bayes>

Excerpt from the resulting ham.txt

ham.txt

XP    182
製品    129
.NET    124
Windows    109
キャンペーン    107
Office    106
マイクロソフト    102
Microsoft    97
情報    90
2002    82
Visual    77
ユーザー    76
2001    73
登録    72
電子メール    69

(rest omitted)

Spam detection program

This script classifies a mail using the data created above.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Algorithm::NaiveBayes;
use Encode qw/from_to encode decode decode_utf8/;
use Encode::Guess qw/ascii euc-jp 7bit-jis cp932/;
use Text::Kakasi;
use Data::Dumper;
{
    # Override Data::Dumper's quoting so Japanese keys are dumped unescaped
    no warnings 'all';
    package Data::Dumper;
    sub qquote { return shift; }
}
$Data::Dumper::Useperl = 1;

my $nb = Algorithm::NaiveBayes->new;

# One training instance per label, built from the aggregated word counts
my %ham = &getHash("ham.txt");
$nb->add_instance(attributes => {%ham}, label => 'ハム');

my %spam = &getHash("spam.txt");
$nb->add_instance(attributes => {%spam}, label => 'スパム');

$nb->train;

# Classify the mail given as the first command-line argument
my %mail = &getTexthash($ARGV[0]);
my $result = $nb->predict(attributes => {%mail});

my $dump = Dumper($result);

binmode STDOUT, ':encoding(cp932)';
print $dump;

# Print the labels in descending order of score
my $cnt = 0;
foreach my $key (sort { ${$result}{$b} <=> ${$result}{$a} } keys %{$result}) {
    $cnt++;
    print "$cnt $key = ${$result}{$key}\n";
}

#====================================================================
#
#====================================================================
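sub getHash {
    my $file = shift;

    my %hash = ();

    # Decode as UTF-8 so the keys compare equal to the decoded tokens
    # produced by getTexthash (raw bytes would never match Japanese words)
    open(FILE, '<:encoding(UTF-8)', $file) or die "cannot open $file: $!";
    while (<FILE>) {
        chomp;
        my ($key, $num) = split(/\t/);
        $hash{$key} = $num;
    }
    close(FILE);

    %hash;
}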

#====================================================================
#
#====================================================================
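sub getTexthash {
    my $file = shift;

    print $file, "\n";

    my %hash = ();

    # Slurp the whole mail at once
    $/ = undef;
    open(FILE, $file) or die "cannot open $file: $!";
    my $text = <FILE>;
    close(FILE);
    $/ = "\n";

    # Guess the source encoding and decode to Perl's internal string format
    $text = decode("Guess", $text);

    $text =~ s/\x0a//g;
    $text =~ s/\x0d//g;
    # Full-width space to ASCII space (presumed intent; the pattern was lost in the listing)
    $text =~ s/\x{3000}/ /g;

    # Hand cp932 bytes to Kakasi; -w makes it insert spaces between words
    $text = encode('cp932', $text);

    my $res = Text::Kakasi::getopt_argv('-w');
    my $str = Text::Kakasi::do_kakasi($text);

    my @a = split(/ /, $str);
    foreach my $key (@a) {
        my $utf8 = decode('cp932', $key);
        next if (length($utf8) < 2);     # skip single characters
        next if (length($utf8) > 30);    # skip overly long tokens
        $hash{$utf8}++;
    }
    %hash;
}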
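To make clearer what training and prediction do with those word counts, here is a rough core-Perl sketch of a multinomial naive Bayes scorer. It is not Algorithm::NaiveBayes's actual implementation: `train_nb` and `predict_nb` are names I made up, the add-one smoothing and equal priors are my assumptions, the counts are toy data, and the scores are raw log values rather than the module's rescaled ones.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Train: turn per-label word counts into smoothed log-probabilities.
sub train_nb {
    my (%counts_by_label) = @_;
    my %vocab;
    for my $counts (values %counts_by_label) {
        $vocab{$_} = 1 for keys %$counts;
    }
    my $vocab_size = keys %vocab;

    my %model;
    for my $label (keys %counts_by_label) {
        my $counts = $counts_by_label{$label};
        my $denom  = sum(0, values %$counts) + $vocab_size;
        my %log_prob;
        # Add-one smoothing so that no word ever gets probability 0
        $log_prob{$_} = log(($counts->{$_} + 1) / $denom) for keys %$counts;
        $model{$label} = { log_prob => \%log_prob, log_unseen => log(1 / $denom) };
    }
    return \%model;
}

# Predict: for each label, sum the log-probabilities of the mail's words.
sub predict_nb {
    my ($model, $mail_counts) = @_;
    my %score;
    for my $label (keys %$model) {
        my $m = $model->{$label};
        $score{$label} = 0;
        for my $word (keys %$mail_counts) {
            my $lp = exists $m->{log_prob}{$word} ? $m->{log_prob}{$word} : $m->{log_unseen};
            $score{$label} += $mail_counts->{$word} * $lp;
        }
    }
    return \%score;
}

# Toy counts (hypothetical, not the real spam.txt / ham.txt contents).
my $model = train_nb(
    spam => { loan => 12, register => 10, mail => 13 },
    ham  => { windows => 109, office => 106, register => 72 },
);
my $score = predict_nb($model, { loan => 2, mail => 1 });
my ($best) = sort { $score->{$b} <=> $score->{$a} } keys %$score;
print "best label: $best\n";
```

Working in log space avoids underflow when many small probabilities are multiplied, which is also why a losing label can end up with a score as tiny as the one in the spam run.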

Running it on an arbitrary spam mail file gives:

F:\kumacchi\MyProgram\perl\sample\Bayes>perl bayes03.pl mail_spam02.txt
mail_spam02.txt
$VAR1 = {
          スパム => 1,
          ハム => '4.75135128115554e-038'
        };
1 スパム = 1
2 ハム = 4.75135128115554e-038

F:\kumacchi\MyProgram\perl\sample\Bayes>

Running it on an arbitrary ordinary mail gives:

F:\kumacchi\MyProgram\perl\sample\Bayes>perl bayes03.pl mail_ham.txt
mail_ham.txt
$VAR1 = {
          スパム => '0.136732677614051',
          ハム => '0.990607982439316'
        };
1 ハム = 0.990607982439316
2 スパム = 0.136732677614051

F:\kumacchi\MyProgram\perl\sample\Bayes>

Both mails are classified correctly (スパム is the spam label, ハム is the ham label). The returned values lie between 0 and 1.

'4.75135128115554e-038' may look like a large value at first glance, but the trailing 'e-038' is exponent notation, so it means 4.75135128115554 × 10^-38, a vanishingly small value.
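A quick check of those numbers, using only the values printed in the two runs above. The variable names are mine. The second half is an observation rather than something I confirmed in the module source: in the ham run the two scores add up to more than 1, while their squares add up to 1 at the printed precision, so the scores look like a vector normalized to unit length rather than probabilities.

```perl
use strict;
use warnings;

# Score of the ham label in the spam run, exactly as printed.
my $tiny = '4.75135128115554e-038';
printf "%.3e\n", $tiny;                      # Perl reads the exponent form as a number
print "smaller than 1e-30\n" if $tiny < 1e-30;

# The two scores from the ham run.
my @ham_run = (0.136732677614051, 0.990607982439316);
my $sum         = $ham_run[0] + $ham_run[1];
my $sum_squares = $ham_run[0]**2 + $ham_run[1]**2;
printf "sum = %.6f, sum of squares = %.6f\n", $sum, $sum_squares;
```

In practice this means the scores are best compared against each other for the same mail, not read as "99% ham".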

I'd like to try building this into a bulletin board or a blog.


Tags: perl, perl memo, Bayesian theory