最近做的一公司笔试题里的一道题:
背景:搜索引擎会根据用户搜索的关键字提供对应的广告,一般是通过统计学习实现(不限方法)。
脚本要求:附件articles.tar.bz2中的文本文件已经分好类了,请从每个类别中随机挑选90%文件做为训练集,然后将剩余文件分类并输出分类的正确率。
常用的文本分类有基于统计的方法(bayes)、决策树、神经网络等,实际使用时往往要以具体情况判断或组装。其中,朴素贝叶斯算法是这类应用中的重要一支,由于其实现简单、运算速度快、在良好训练的前提下往往有较高的精度而被广泛使用。
之前用 java 写过,这次要 python 实现,稍改了下,而且由于样本是英文的,中文分词也省下了,不过这样看起来更清晰一点 –//
【chap1】bayes 简述
Pr(A|B) = Pr(B|A)Pr(A)/Pr(B)
如果你看到一个人总是做一些好事,则那个人多半会是一个好人。这就是说,当你不能准确知悉一个事物的本质时,你可以依靠与事物特定本质相关的事件出现的多少去判断其本质属性的概率。用数学语言表达就是:支持某项属性的事件发生得愈多,则该属性成立的可能性就愈大。
贝叶斯法则又被称为贝叶斯定理、贝叶斯规则,指的是当分析样本大到接近总体数时,样本中事件发生的概率将接近于总体中事件发生的概率。是概率统计中的应用所观察到的现象对有关概率分布的主观判断(即先验概率)进行修正的标准方法。
关于基础理论就不多赘述了,标记几个链接,google 也可以得到数以顿记的结果。
基于朴素贝叶斯分类器的文本分类算法(上)
基于朴素贝叶斯分类器的文本分类算法(下)
朴素bayes公式分类器
bayes factor
经典算法之朴素贝叶斯分类
【chap2】demo 实现
tokenizer.py
#stem 使用了 stemming 组件(http://pypi.python.org/pypi/stemming/1.0)
#easy_install stemming
#coding:utf-8
import re
from stemming.porter2 import stem
class tokenizer:
""" 切词器(EN) """
def __init__(self):
self.__stopWords = set(["0", "1", "2", "3", "4", "5", "6", "7",
"8", "9", "about", "after", "all", "also",
"an", "and", "another", "any", "are", "as",
"at", "be", "because", "been", "before",
"being", "between", "both", "but", "by",
"came", "can", "come", "could", "did", "do",
"does", "each", "else", "for", "from", "get",
"got", "has", "had", "he", "have", "her",
"here", "him", "himself", "his", "how", "if",
"in", "into", "is", "it", "its", "just", "like",
"make", "many", "me", "might", "more", "most",
"much", "must", "my", "never", "now", "of",
"on", "only", "or", "other", "our", "out",
"over", "re", "said", "same", "see", "should",
"since", "so", "some", "still", "such", "take",
"than", "that", "the", "their", "them", "then",
"there", "these", "they", "this", "those",
"through", "to", "too", "under", "up", "use",
"very", "want", "was", "way", "we", "well",
"were", "what", "when", "where", "which", "while",
"who", "will", "with", "would", "you", "your",
"a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
"k", "l", "m", "n", "o", "p", "q", "r", "s","t",
"u", "v", "w", "x", "y", "z", ""])
def parse(self, text):
""" 获取词干/清理停用词
@param text 待处理文本
@return 有效文本
"""
sentence = []
for word in re.split("\W", stem(text).lower()):
if word not in self.__stopWords:
sentence.append(word)
return sentence
bayesclassifier.py
#coding:utf-8
import os
import math
import tokenizer
class bayesClassifier:
""" 贝叶斯分类器 """
__trainingdata = {} # 存储训练数据
__tokenizer = tokenizer.tokenizer() # 切词器
def __count(self, cls, lst):
""" 计算指定文本隶属指定分类的概率
@param cls 假定分类
@param lst 待分析文档切词列表
@return 结果预测
"""
total = len(self.__trainingdata[cls])
res = 1
for s in lst:
if s not in self.__trainingdata[cls]:
# 避免乘积累积降低精度,改用对数相加
# 规避语料未经"训练"的情况,也就是p(di/m)=0,做拉普拉斯校准,即令每一个切词基数+1
res += math.log(float(1) / total)
#res *= float(1) / total
else:
res += math.log(float(self.__trainingdata[cls][s] + 1) / total)
#res *= float(self.__trainingdata[cls][s] + 1) / total
return res
def train(self, cls, text):
""" 样本训练
@param cls 样本所属分类
@param text 样本内容
"""
if cls not in self.__trainingdata:
self.__trainingdata[cls] = {}
# 统计各分类单词出现次数
for word in self.__tokenizer.parse(text):
if word not in self.__trainingdata[cls]:
self.__trainingdata[cls][word] = 1
else:
self.__trainingdata[cls][word] += 1
def classify(self, text):
""" 获取分类结果
@param text 待测数据
"""
t = -1 << 32
res = "unkown"
for cls in self.__trainingdata.keys():
c = self.__count(cls, self.__tokenizer.parse(text))
if c > t:
res = cls
t = c
return res
StartUp.py
#!/usr/bin/env python
#coding:utf-8
"""
使用朴素贝叶斯算法对邮件归类 demo
@auth g 2010-05
"""
import os
import bayesClassifier
# 样本数据路经
articles = r"./articles"
# 分类器
classfier = bayesClassifier.bayesClassifier()
if __name__ == "__main__":
# 训练
print "#train.."
for cls in os.listdir(articles):
for f in os.listdir(os.path.join(articles, cls)):
# 抽取文件名末位非0的文本为训练数据
if int(f.rstrip(".txt")) % 10 != 0:
classfier.train(cls, open(os.path.join(articles, cls, f)).read())
# 测试
print "#test.."
# 标记测试总数/失败数
totality = 0
failed = 0
for cls in os.listdir(articles):
for f in os.listdir(os.path.join(articles, cls)):
# 抽取文件名末位非0的文本为训练数据
if int(f.rstrip(".txt")) % 10 == 0:
expect = cls
effect = classfier.classify(open(os.path.join(articles, cls, f)).read())
if effect != expect:
failed += 1
totality += 1
print "%s\t@%s" % ("".join([cls, "/", f]), effect)
# 结论
print "#mark:"
print "total %d tested." % totality
print "failed: %d" % failed
print "hit rate: %f" % (1 - float(failed) / totality)
实际测试中,取测试样本中末位不为0的文本为训练素材,其余为测试数据,正确率接近97%,表现出了较好的精度。
#mark:
total 394 tested.
failed: 12
hit rate: 0.969543
附样本数据:articles