-*- mode: outline; coding: euc-japan -*-
spamfilter.el -- Bayesian filter library  Susumu Ota (ccbcc@black.livedoor.com)


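* Overview

  This is a SPAM filtering library that implements Paul Graham's
  recently popular "A Plan for Spam" and "Better Bayesian Filtering"
  in Emacs Lisp. It is intended for use from mail readers that run on
  Emacs. The author uses it with Wanderlust.

  Many Bayesian filters that handle Japanese already exist (see the
  reference URLs below). Compared with them, this library has the
  following features: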

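  - It is implemented in Emacs Lisp only, so it needs no other
    software (procmail, ChaSen, kakasi, and so on).
  - Mail can be processed interactively.
  - The corpus can be trained incrementally.
  - Three methods of splitting Japanese text into words are available.
  - Wanderlust and Mew are supported so far.
  - It can also be used with Navi2ch.

  For the algorithm, see the reference URLs below.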
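
  As a rough illustration of what incremental training of a corpus
  can look like, the sketch below keeps one occurrence count per token
  and bumps the counts whenever a message is registered. The names
  my-make-corpus and my-register-tokens are hypothetical and are not
  part of spamfilter.el, whose actual corpus representation may
  differ.

```elisp
;; Hypothetical helpers for illustration; not the spamfilter.el API.
(defun my-make-corpus ()
  "Return an empty corpus: a hash table mapping token -> count."
  (make-hash-table :test 'equal))

(defun my-register-tokens (tokens corpus)
  "Add one occurrence of each token in TOKENS to CORPUS and return it."
  (dolist (token tokens)
    (puthash token (1+ (gethash token corpus 0)) corpus))
  corpus)
```

  Registering another message later only touches the counts of the
  tokens it contains, so nothing has to be recomputed from scratch.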

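  Splitting Japanese text into tokens is not as simple as it is for
  English, so you can choose among the following three methods:

  1. ChaSen
     A Japanese morphological analysis system.
  2. bigram
     Cuts the Japanese parts of the text into 2-character pieces (the
     method used by scbayes).
  3. block
     Cuts the text into blocks of kanji, hiragana, and katakana (the
     method used by Mozilla).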
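
  The bigram and block methods are simple enough to sketch in a few
  lines. The following is only an illustration under stated
  assumptions: my-bigram-tokens and my-block-tokens are hypothetical
  names (not the library's jtoken-* functions), the bigrams are
  assumed to overlap, the whole string is treated as Japanese, and the
  character ranges rely on the Unicode ordering of a recent Emacs.

```elisp
;; Hypothetical sketches; the real jtoken-* functions may differ.
(defun my-bigram-tokens (string)
  "Return the list of overlapping 2-character substrings of STRING."
  (let ((tokens '())
        (i 0))
    (while (< (1+ i) (length string))
      (push (substring string i (+ i 2)) tokens)
      (setq i (1+ i)))
    (nreverse tokens)))

(defun my-block-tokens (string)
  "Split STRING into maximal runs of kanji, hiragana, or katakana.
Characters outside those three classes are skipped."
  (let ((tokens '())
        (start 0))
    (while (string-match "[一-龠]+\\|[ぁ-ん]+\\|[ァ-ヶー]+" string start)
      (push (match-string 0 string) tokens)
      (setq start (match-end 0)))
    (nreverse tokens)))
```

  For the string "迷惑メールを送る", the bigram sketch produces many
  short overlapping tokens, while the block sketch produces a few
  longer ones, which matches the difference in corpus size between the
  two methods.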

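  In a simple test in the author's environment, using the same data
  for each method (about 250 spam and about 1000 nonspam messages):

  1. ChaSen
     Slow. Detection accuracy is high even with a small corpus. The
     number of words registered in the corpus is the smallest.
  2. bigram
     Fast. Detection accuracy is good once the corpus grows large.
     The number of words registered in the corpus is the largest.
  3. block
     Fast. Accuracy is somewhat lower. The number of words registered
     in the corpus falls between the other two.

  With any of the three methods, the results were:

  - spam judged as nonspam (missed spam): a few %
  - nonspam judged as spam (false positive): 0 %

  The author recommends bigram.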


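* Preparation

** [Optional] Installing ChaSen

   ChaSen can be used to tokenize Japanese. The author has confirmed
   that chasen-2.2.9 + ipadic-2.4.4 works. Both should build with
   ./configure; make; make install. For details, see the ChaSen home
   page (http://chasen.aist-nara.ac.jp/index.html.ja).

   Using ChaSen should improve accuracy, but it is considerably
   slower. With bigram, a corpus above a certain size is enough for
   reliable judgments, so ChaSen is not really necessary.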


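** [Optional] Converting the corpus file

   The corpus file format changed in version 0.11. If you were using
   a 0.10 corpus file, be sure to convert it with the bundled
   convert_corpus_0.10_to_0.11.pl. The conversion procedure is as
   follows: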

     % cp -p ~/.spamfilter ~/.spamfilter.org
     % perl convert_corpus_0.10_to_0.11.pl ~/.spamfilter > ~/.spamfilter.new
     % mv ~/.spamfilter.new ~/.spamfilter

   If you were using 0.9, convert the file with
   convert_corpus_0.9_to_0.10.pl first, then carry out the steps
   above.


** Installation

   A simple Makefile is included. After editing the settings at the
   top of the Makefile as appropriate, run

     % make
     % make install

   to install.


** ~/.emacs settings

   Add the following to ~/.emacs. Autoload support is planned.

(add-to-list 'load-path "~/elisp")
(require 'spamfilter)


* Usage

** Using with Wanderlust

   The author's test environment is:

   1. Mac OS X 10.2.8, Emacs 21.3.50.1 of 2003-06-01,
      wanderlust-2.10.0 (apel-10.4, flim-1.14.4, semi-1.14.5)

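   The CVS HEAD of Wanderlust appears to support spamfilter.el as
   well, with an implementation based on generic functions. For
   details, see Wanderlust ML messages 12382 and 12407:

   http://lists.airs.net/wl/archive/200310/msg00074.html
   http://lists.airs.net/wl/archive/200310/msg00099.html

   If you use that support, the settings below are unnecessary.

   Otherwise, write the following in ~/.wl or ~/.emacs: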

(require 'spamfilter-wl)
(add-hook 'wl-hook
          #'(lambda ()
              ;; When using ChaSen
              ;; (unless chasen-process
              ;;   (chasen-async-open))
              (spamf-load-corpus "~/.spamfilter")))
(add-hook 'kill-emacs-hook ; 'wl-exit-hook also works
          #'(lambda ()
              ;; When using ChaSen
              ;; (chasen-async-close)
              (spamf-save-corpus "~/.spamfilter")))
;; Name of the SPAM folder
(setq spamf-wl-spam-folder-name "+spam")
;; Folders to ignore (no corpus registration) when a refile is executed
(setq spamf-wl-ignore-register-folder-names '("+trash"))


;;; Choosing the tokenizer

;; When using ChaSen
; (setq spamf-file-for-each-function   #'chasen-file-for-each)
; (setq spamf-buffer-for-each-function #'chasen-buffer-for-each)
; (setq spamf-string-for-each-function #'chasen-string-for-each)
; (setq spamf-tokenize-file-function   #'chasen-tokenize-file)
; (setq spamf-tokenize-buffer-function #'chasen-tokenize-buffer)
; (setq spamf-tokenize-string-function #'chasen-tokenize-string)
;; Texts larger than chasen-process-send-string-limit use a synchronous process
; (setq chasen-process-send-string-limit (* 1024 2))

;; When using bigram
; (setq spamf-file-for-each-function   #'jtoken-bigram-file-for-each)
; (setq spamf-buffer-for-each-function #'jtoken-bigram-buffer-for-each)
; (setq spamf-string-for-each-function #'jtoken-bigram-string-for-each)
; (setq spamf-tokenize-file-function   #'jtoken-bigram-tokenize-file)
; (setq spamf-tokenize-buffer-function #'jtoken-bigram-tokenize-buffer)
; (setq spamf-tokenize-string-function #'jtoken-bigram-tokenize-string)

;; When using block
; (setq spamf-file-for-each-function   #'jtoken-block-file-for-each)
; (setq spamf-buffer-for-each-function #'jtoken-block-buffer-for-each)
; (setq spamf-string-for-each-function #'jtoken-block-string-for-each)
; (setq spamf-tokenize-file-function   #'jtoken-block-tokenize-file)
; (setq spamf-tokenize-buffer-function #'jtoken-block-tokenize-buffer)
; (setq spamf-tokenize-string-function #'jtoken-block-tokenize-string)


   Uncomment the lines for the tokenizer method you want to use (the
   default is bigram).

   If you use elmo-split, add the following as well:

(setq elmo-split-folder "+inbox")
(setq elmo-split-rule
      '(((spamfilter) "+spam") ; SPAM goes to `+spam'
        (t "+inbox")))         ; everything else goes to `+inbox'

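   With the settings above, running `o' or `C-o' in summary mode
   judges whether each message is SPAM and puts a refile mark for
   "+spam" on messages judged to be SPAM.

   Note that the rules in wl-refile-rule-alist take precedence, so
   only messages that wl-refile-rule-alist leaves unmarked are
   examined by spamfilter.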


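   Also, when you execute the refile with `x', the mails being refiled
   are analyzed and the corpus is trained on them.

   The spamfilter feature can be switched off and on with

   M-x spamf-wl-disable-spamfilter
   M-x spamf-wl-enable-spamfilter

   When using it for the first time, it is a good idea to bulk-register
   the mails in existing folders into the corpus from summary mode with

   M-x spamf-wl-register-good-folder
   M-x spamf-wl-register-spam-folder

   (be prepared for this to take quite a while), and then save the
   corpus to ~/.spamfilter with

   M-x spamf-save-corpus

   You can also sort messages with elmo-split. The corpus is never
   perfect, however, so it is probably best to stick to trying it out
   as a rehearsal with `C-u M-x elmo-split'.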


** Using with Mew

   The author's test environment is:

   1. Mac OS X 10.2.8, Emacs 21.3.50.1 of 2003-06-01, mew-3.2

   Write the following in ~/.mew.el or ~/.emacs:

(require 'spamfilter-mew)
(add-hook 'mew-init-hook
          #'(lambda ()
              ;; When using ChaSen
              ;; (unless chasen-process
              ;;   (chasen-async-open))
              (spamf-load-corpus "~/.spamfilter")))
(add-hook 'kill-emacs-hook ; 'mew-quit-hook also works
          #'(lambda ()
              ;; When using ChaSen
              ;; (chasen-async-close)
              (spamf-save-corpus "~/.spamfilter")))
;; Name of the SPAM folder
(setq spamf-mew-spam-folder-name "+spam")
;; Folders to ignore (no corpus registration) when a refile is executed
(setq spamf-mew-ignore-register-folder-names '("+trash"))


;;; Choosing the tokenizer

;; When using ChaSen
; (setq spamf-file-for-each-function   #'chasen-file-for-each)
; (setq spamf-buffer-for-each-function #'chasen-buffer-for-each)
; (setq spamf-string-for-each-function #'chasen-string-for-each)
; (setq spamf-tokenize-file-function   #'chasen-tokenize-file)
; (setq spamf-tokenize-buffer-function #'chasen-tokenize-buffer)
; (setq spamf-tokenize-string-function #'chasen-tokenize-string)
;; Texts larger than chasen-process-send-string-limit use a synchronous process
; (setq chasen-process-send-string-limit (* 1024 2))

;; When using bigram
; (setq spamf-file-for-each-function   #'jtoken-bigram-file-for-each)
; (setq spamf-buffer-for-each-function #'jtoken-bigram-buffer-for-each)
; (setq spamf-string-for-each-function #'jtoken-bigram-string-for-each)
; (setq spamf-tokenize-file-function   #'jtoken-bigram-tokenize-file)
; (setq spamf-tokenize-buffer-function #'jtoken-bigram-tokenize-buffer)
; (setq spamf-tokenize-string-function #'jtoken-bigram-tokenize-string)

;; When using block
; (setq spamf-file-for-each-function   #'jtoken-block-file-for-each)
; (setq spamf-buffer-for-each-function #'jtoken-block-buffer-for-each)
; (setq spamf-string-for-each-function #'jtoken-block-string-for-each)
; (setq spamf-tokenize-file-function   #'jtoken-block-tokenize-file)
; (setq spamf-tokenize-buffer-function #'jtoken-block-tokenize-buffer)
; (setq spamf-tokenize-string-function #'jtoken-block-tokenize-string)


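   Uncomment the lines for the tokenizer method you want to use (the
   default is bigram).

   With the settings above, running `o' or `M-o' in summary mode
   judges whether each message is SPAM and puts a refile mark for
   "+spam" on messages judged to be SPAM.

   Note that the rules in mew-refile-guess-alist take precedence, so
   only messages that mew-refile-guess-alist leaves unmarked are
   examined by spamfilter.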


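   Also, when you execute the refile with `x', the mails being refiled
   are analyzed and the corpus is trained on them.

   The spamfilter feature can be switched off and on with

   M-x spamf-mew-disable-spamfilter
   M-x spamf-mew-enable-spamfilter

   When using it for the first time, it is a good idea to bulk-register
   the mails in existing folders into the corpus from summary mode with

   M-x spamf-mew-register-good-folder
   M-x spamf-mew-register-spam-folder

   (be prepared for this to take quite a while), and then save the
   corpus to ~/.spamfilter with

   M-x spamf-save-corpus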


** Using with Navi2ch

   The poster of http://pc.2ch.net/test/read.cgi/unix/1065246418/38
   made spamfilter.el usable from Navi2ch
   (http://navi2ch.sourceforge.net/). For details, see the Navi2ch
   thread on the 2ch UNIX board and

   http://cvs.sourceforge.net/viewcvs.py/navi2ch/navi2ch/contrib/navi2ch-spamfilter.el


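** General use

   M-x spamf-register-good-directory
   M-x spamf-register-good-file
   M-x spamf-register-good-buffer

   These register good text (text that is not SPAM) in the corpus.
   spamf-register-good-directory recursively registers the files in a
   directory.
   spamf-register-good-file registers a single file.
   spamf-register-good-buffer registers the contents of the current
   buffer.

   Likewise,

   M-x spamf-register-spam-directory
   M-x spamf-register-spam-file
   M-x spamf-register-spam-buffer

   register SPAM in the corpus.

   These functions neither interpret the charset nor process the
   content-transfer-encoding, so encoded text cannot be registered
   correctly. They will also go ahead and try to register a message
   even if it has attachments and contains binary data.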
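
   One way to work around this limitation is to decode a message body
   yourself before registering it. The sketch below is an assumption
   about how that could be done with functions that ship with Emacs
   (base64-decode-string, quoted-printable-decode-string from qp.el,
   and decode-coding-string). The helper name my-decode-body is
   hypothetical and is not part of spamfilter.el.

```elisp
(require 'qp)  ; provides `quoted-printable-decode-string'

;; Hypothetical helper for illustration; not the spamfilter.el API.
(defun my-decode-body (body encoding charset)
  "Return BODY decoded into a readable string.
BODY holds the raw body text of a message.  ENCODING is its
Content-Transfer-Encoding (\"base64\", \"quoted-printable\", or
anything else for no transfer encoding).  CHARSET is an Emacs
coding system such as `iso-2022-jp' or `utf-8'."
  (let ((bytes (cond ((string= encoding "base64")
                      (base64-decode-string body))
                     ((string= encoding "quoted-printable")
                      (quoted-printable-decode-string body))
                     (t body))))
    (decode-coding-string bytes charset)))
```

   The decoded string could then be inserted into a temporary buffer
   and registered from there with M-x spamf-register-good-buffer or
   M-x spamf-register-spam-buffer.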

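   After registering the corpus with the steps above, running

   M-x spamf-spamness

   judges the SPAM degree (spamness) of the current buffer's contents
   and shows the value, together with the words it was based on, in a
   buffer named `*spamfilter-log*'. A value of 0.9 or higher is a
   rough indication of SPAM.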
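
   To give a feeling for where such a value comes from, here is a
   minimal sketch of the scoring described in Paul Graham's "A Plan
   for Spam": a per-token probability computed from corpus counts, and
   a combined score over the 15 tokens farthest from 0.5. The function
   names are hypothetical, and the actual code in spamfilter.el may
   differ in its details.

```elisp
;; Hypothetical sketch of Graham's formulas; not the spamfilter.el API.
(defun my-token-probability (good bad ngood nbad)
  "Return the spam probability of one token, or nil if it is too rare.
GOOD and BAD are the token's counts in the good and spam corpora;
NGOOD and NBAD are the numbers of good and spam messages."
  (let ((g (* 2 good))  ; good counts are doubled to avoid false positives
        (b bad))
    (unless (< (+ g b) 5)
      (let ((pg (min 1.0 (/ (float g) ngood)))
            (pb (min 1.0 (/ (float b) nbad))))
        (max 0.01 (min 0.99 (/ pb (+ pg pb))))))))

(defun my-spamness (probs)
  "Combine the 15 values of PROBS farthest from 0.5 into one score."
  (let* ((sorted (sort (copy-sequence probs)
                       (lambda (a b)
                         (> (abs (- a 0.5)) (abs (- b 0.5))))))
         (top (butlast sorted (- (length sorted) 15)))
         (prod 1.0)
         (inv 1.0))
    (dolist (p top)
      (setq prod (* prod p)
            inv (* inv (- 1.0 p))))
    (/ prod (+ prod inv))))
```

   A message whose tokens are mostly near 0.99 ends up very close to
   1.0, while a mix of neutral tokens stays around 0.5, which is why a
   threshold such as 0.9 separates the two cases well.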

   In addition,

   M-x spamf-save-corpus
   M-x spamf-load-corpus

   save the corpus to a file and load it from a file. To read the
   corpus file when Emacs starts, have ~/.emacs call
   spamf-load-corpus as follows:

(spamf-load-corpus "~/.spamfilter")


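* Tips for improving accuracy

  These are tips the author has picked up from using the Bayesian
  filter for nearly half a year (with a corpus of about 2000 nonspam
  and 1000 spam messages).

** Leave mail that rules can classify to the rules, and keep it out of the Bayesian filter.

   Most mail from friends, work contacts, mailing lists, and mail
   magazines can be classified with a rule-based filter (for
   Wanderlust, wl-refile-rule-alist). Mail whose subject carries the
   "未承諾広告" (unsolicited advertisement) label can likewise be
   discarded by a rule. Since such mail can be classified without the
   Bayesian filter, accuracy seems to improve when it is not
   registered in the corpus.

   The settings look like this. Specify the folders to exclude in
   spamf-wl-ignore-register-folder-names.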

(setq wl-refile-rule-alist
      '(("From"
         ("@mailmagazine\\.co\\.jp" . "+mailmagazine")
         ("otomodati@foo\\.ne\\.jp" . "+friend")
         ("okyaku@bar\\.co\\.jp" . "+work"))
        (("To" "Cc")
         ("ml@baz\\.org" . "+ml"))
        ("Subject"
         ("未承諾広告" . "+trash"))))

(setq spamf-wl-ignore-register-folder-names
      '("+trash" "+mailmagazine" "+friend" "+work" "+ml"))

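   In the author's case, the rules sort roughly 90% of all mail. Only
   the remaining 10% is left to the Bayesian filter.

   If SPAM starts to appear among mail that the rules used to
   classify (for example, when SPAM gets posted to a mailing list),
   you will have to change the rules so that the Bayesian filter
   processes that mail.

   The above is what worked in the author's environment; it may not
   work as well in yours.

   If you know other tips for improving detection accuracy, please
   let the author know.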


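* Notes

  The code is based on bayesian-filter.scm from scbayes, written by
  Shiro Kawai (shiro@acm.org):
  http://www.shiro.dreamhost.com/scheme/wiliki/wiliki.cgi?Gauche%3ASpamFilter&l=jp

  The code was written in a short time, so the documentation and
  naming conventions are rough.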

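  When the text passed to ChaSen is larger than
  chasen-process-send-string-limit, the text is first saved to a file
  and ChaSen is run as a synchronous process, which can take extra
  time. In the author's environment (Meadow-1.10), sending a large
  text with process-send-string over an asynchronous process
  connection seems to freeze Emacs, hence this implementation for
  now.

  The author does not use Mew regularly, so the Mew support may be
  less than ideal.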

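  The corpus file format changed in 0.10. Convert corpus files from
  0.9 and earlier with convert_corpus_0.9_to_0.10.pl.

  The corpus file format changed again in 0.11. Convert corpus files
  from 0.10 and earlier with convert_corpus_0.10_to_0.11.pl.

  A script that shrinks the corpus file (strip_corpus.pl) is included.
  For usage, see the comments in its source.

  There is no warranty. For the license, see the source.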


* Reference URLs

** A Plan for Spam
   http://www.paulgraham.com/spam.html

** A Plan for Spam (Japanese translation)
   http://www.shiro.dreamhost.com/scheme/trans/spam-j.html

** Better Bayesian Filtering
   http://www.paulgraham.com/better.html

** Better Bayesian Filtering (Japanese translation)
   http://www.shiro.dreamhost.com/scheme/trans/better-j.html

** Filters That Fight Back
   http://www.paulgraham.com/ffb.html

** Filters That Fight Back (Japanese translation)
   http://www.shiro.dreamhost.com/scheme/trans/ffb-j.html

** Gauche:SpamFilter
   http://www.shiro.dreamhost.com/scheme/wiliki/wiliki.cgi?Gauche%3ASpamFilter&l=jp

** bsfilter / bayesian spam filter / ベイジアン スパム フィルタ
   http://www.h2.dion.ne.jp/~nabeken/bsfilter/index.html

** bogofilter + kakasi
   http://www.ono.org/software/bogofilter/

** bogofilter.el
   http://www.teikan.net/hideki/bogofilter/index.ja.html

** POPFile
   http://popfile.sourceforge.net/

** Mozilla
   http://jt.mozilla.gr.jp/

** Beyond these, the links reachable from A Plan for Spam lead to plenty of good material.


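* Change history

  2004/01/22  1.0 --> 1.1
  Added a word deletion API (spamf-delete-*) (spamfilter.el)

  2003/12/10  0.12 --> 1.0
  Inlined functions (japanse-tokenizer.el, spamfilter.el)

  2003/11/5  0.11 --> 0.12
  Use a private obarray for intern (spamfilter.el) thx to
  NIIMI Satoshi

  2003/10/19  0.10 --> 0.11
  Fixed the parts that depended on cl at run time (spamfilter.el,
  japanse-tokenizer.el) thx to NIIMI Satoshi

  2003/10/12  0.9 --> 0.10
  Support for multiple corpora (spamfilter.el)
  Fixed a bug where a probability became negative (spamfilter.el)
  thx to http://pc.2ch.net/test/read.cgi/unix/1065246418/55

  2003/06/26  0.8 --> 0.9
  Mew support (spamfilter-mew.el)

  2003/06/23  0.7 --> 0.8
  Changed to make a backup of the corpus file (spamfilter.el)
  Code cleanup (*.el)

  2003/04/14  0.6 --> 0.7
  bigram support (japanse-tokenizer.el)

  2003/04/13  0.5 --> 0.6
  Fixed a bug in japanse-tokenizer.el that caused an error on certain
  mail

  2003/04/12  0.4 --> 0.5
  Added japanse-tokenizer.el (a Japanese tokenizer)
  Switch ChaSen between synchronous and asynchronous processes
  depending on the size of the text
  Changed corpus words from strings to symbols
  Fixed a bug that prevented ~/.spamfilter from being loaded
  thx to MIYOSHI Masanori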


Susumu Ota (ccbcc@black.livedoor.com)
$Id: README,v 1.23 2004/01/22 07:36:23 ota Exp $
