Python核心技术与实战12 实现一个搜索引擎

2019-06-05

To be practice!

基本概念

一个搜索引擎由搜索器、索引器、检索器和用户接口四个部分组成。
搜索器，就是我们常提到的爬虫（scrawler），它能在互联网上大量爬取各类网站的内容，送给索引器。索引器拿到网页和内容后，会对内容进行处理，形成索引（index），存储于内部的数据库等待检索。
用户接口是指网页和App前端界面，例如百度和谷歌的搜索页面。用户通过用户接口，向搜索引擎发出询问（query），询问解析后送达检索器；检索器高效检索后，再将结果返回给用户。

先定义 SearchEngineBase 基类：

class SearchEngineBase(object):
    def __init__(self):
        pass

    def add_corpus(self, file_path):
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    def search(self, query):
        raise Exception('search not implemented.')

def main(search_engine):
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input()
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)

SearchEngineBase可以被继承，继承的类分别代表不同的算法引擎。每一个引擎都应该实现process_corpus()和search()两个函数，对应我们刚刚提到的索引器和检索器。main()函数提供搜索器和用户接口，于是一个简单的包装界面就有了。
具体看这段代码：
- add_corpus() 函数负责读取文件内容，将文件路径作为ID，连同内容一起送到process_corpus中。
- process_corpus 需要对内容进行处理，然后文件路径为ID，将处理后的内容存下来。处理后的内容，就叫做索引（index）。
- search 则给定一个询问，处理询问，再通过索引检索，然后返回。

然后定义一个基本可以工作的搜索引擎：

class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = {}

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text

    def search(self, query):
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:
                results.append(id)
        return results

search_engine = SimpleEngine()
main(search_engine)


########## 输出 ##########

simple
found 0 result(s):
little
found 2 result(s):
1.txt
2.txt

SimpleEngine 实现了一个继承SearchEngineBase的子类，继承并实现了process_corpus和search 接口，同时，也顺手继承了add_corpus函数（当然你想重写也是可行的），因此我们可以在main()函数中直接调取。
在新的构造函数中，se1f.__id_to_texts={}初始化了自己的私有变量，也就是这个用来存储文件名到文件内容的字典。
process_corpus()函数则非常直白地将文件内容插入到字典中。这里注意，ID需要是唯一的，不然相同ID的新内容会覆盖掉旧的内容。
search 直接枚举字典，从中找到要搜索的字符串。如果能够找到，则将ID放到结果列表中，最后返回。

Bag of Words and Inverted Index

先实现一个名叫Bag of Words的搜索模型:

import re

class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = {}

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        results = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                results.append(id)
        return results
    
    @staticmethod
    def query_match(query_words, words):
        for query_word in query_words:
            if query_word not in words:
                return False
        return True

    @staticmethod
    def parse_text_to_words(text):
        # 使用正则表达式去除标点符号和换行符
        text = re.sub(r'[^\w ]', ' ', text)
        # 转为小写
        text = text.lower()
        # 生成所有单词的列表
        word_list = text.split(' ')
        # 去除空白单词
        word_list = filter(None, word_list)
        # 返回单词的 set
        return set(word_list)

search_engine = BOWEngine()
main(search_engine)


########## 输出 ##########

i have a dream
found 3 result(s):
1.txt
2.txt
3.txt
freedom children
found 1 result(s):
5.txt

先来理解一个概念，BOW Model，即Bag of Words Model，中文叫做词袋模型。这是NLP领域最常见最简单的模型之一。
假设一个文本，不考虑语法、句法、段落，也不考虑词汇出现的顺序，只将这个文本看成这些词汇的集合。于是相应的，我们把id to texts 替换成id_to_words，这样就只需要存这些单词，而不是全部文章，也不需要考虑顺序。
process_corpus()函数调用类静态函数 parse_text_to_words，将文章打碎形成词袋，放入set之后再放到字典中。
search()函数则稍微复杂一些。这里我们假设，想得到的结果，是所有的搜索关键词都要出现在同一篇文章中。那么，我们需要同样打碎query得到一个set，然后把set中的每一个词，和我们的索引中每一篇文章进行核对，看一下要找的词是否在其中。而这个过程由静态函数query_match负责。

优化

遍历代价太大，而其实每次查询的query单词量不会很多；其次，词袋模型并不考虑单词间的顺序，但有些人希望单词按顺序出现，或者希望搜索的单词在文中离得近一些，这种情况下词袋模型就无能为力了。

import re

class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = {}

    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            self.inverted_index[word].append(id)

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)
        
        # 如果某一个查询单词的倒序索引为空，我们就立刻返回
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []
        
        result = []
        while True:
            
            # 首先，获得当前状态下所有倒序索引的 index
            current_ids = []
            
            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word]
                
                # 已经遍历到了某一个倒序索引的末尾，结束 search
                if current_index >= len(current_inverted_list):
                    return result

                current_ids.append(current_inverted_list[current_index])

            # 然后，如果 current_ids 的所有元素都一样，那么表明这个单词在这个元素对应的文档中都出现了
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue
            
            # 如果不是，我们就把最小的元素加一
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        # 使用正则表达式去除标点符号和换行符
        text = re.sub(r'[^\w ]', ' ', text)
        # 转为小写
        text = text.lower()
        # 生成所有单词的列表
        word_list = text.split(' ')
        # 去除空白单词
        word_list = filter(None, word_list)
        # 返回单词的 set
        return set(word_list)

search_engine = BOWInvertedIndexEngine()
main(search_engine)


########## 输出 ##########


little
found 2 result(s):
1.txt
2.txt
little vicious
found 1 result(s):
2.txt

新模型继续使用之前的接口，仍然只在init()、pr ocess_corpus()和search()三个函数进行修改。
开头的Inverted Index，即倒序索引，是非常有名的搜索引擎方法: 也就是说这次反过来，我们保留的是word->id的字典。于是情况就豁然开朗了，在search时，我们只需要把想要的query_word的几个倒序索引单独拎出来，然后从这几个列表中找共有的元素，那些共有的元素，即ID，就是我们想要的查询结果。这样，我们就避免了将所有的index过一遍的尴尬,这解决了遍历的问题
process_corpus 建立倒序索引。注意，这里的代码都是非常精简的。在工业界领域，需要一个unique ID生成器，来对每一篇文章标记上不同的ID，倒序索引也应该按照这个uniqueid来进行排序。
search()函数，会根据 query_words 拿到所有的倒序索引，如果拿不到，就表示有的query word不存在于任何文章中，直接返回空；拿到之后，运行一个“合并K个有序数组”的算法，从中拿到我们想要的ID，并返回。注意，这里用到的算法并不是最优的，最优的写法需要用最小堆来存储index。这是一道有名的leetcode hard题，有兴趣请参考：
关于实现搜索单词按顺序出现，或者搜索的单词在文中离得近一些的情况，我们需要在Inverted Index上，对于每篇文章也保留单词的位置信息，这样一来，在合并操作的时候处理一下就可以了。

LRU和多重继承

import pylru

class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)
    
    def has(self, key):
        return key in self.cache
    
    def get(self, key):
        return self.cache[key]
    
    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)
    
    def search(self, query):
        if self.has(query):
            print('cache hit!')
            return self.get(query)
        
        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)
        
        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)

########## 输出 ##########

little
found 2 result(s):
1.txt
2.txt
little
cache hit!
found 2 result(s):
1.txt
2.txt

LRUCache定义了一个缓存类，你可以通过继承这个类来调用其方法。LRU缓存是一种很经典的缓存（同时，LRU的实现也是硅谷大厂常考的算法面试题，这里为了简单，我直接使用pylru这个包），它符合自然界的局部性原理，可以保留最近使用过的对象，而逐渐淘汰掉很久没有被用过的对象。
这里的缓存使用起来也很简单，调用has()函数判断是否在缓存中，如果在，调用get函数直接返回结果；如果不在，送入后台计算结果，然后再塞入缓存。
我们可以看到，BOWInvertedlndexEngineWithCache类，多重继承了两个类。首先，你需要注意的是构造函数。多重继承有两种初始化方法:
- 第一种方法，用下面这行代码，直接初始化该类的第一个父类：不过使用这种方法时，要求继承链的最顶层父类必须要继承 object。
  1
  super(BOWInvertedIndexEngineWithCache, self).__init__()
- 第二种方法，对于多重继承，如果有多个构造函数需要调用，我们必须用传统的方法LRUCach e._init（self）。
注意，search()函数被子类BOWInvertedlndexEngineWithCache 再次重载，但是我还需要调用BOWInvertedlndexEngine的search()函数，这时该怎么办呢？请看下面这行代码：这样我们可以强行调用被覆益的父类的函数
1
super(BOWInvertedIndexEngineWithCache, self).search(query)

关于私有化：
Python并没有真正的私有化支持，但可用下划线得到伪私有：

_xxx “单下划线 “ 开始的成员变量叫做保护变量，意思是只有类对象和子类对象自己能访问到这些变量，需通过类提供的接口进行访问；
__xxx 类中的私有变量/方法名，只有类对象自己能访问，连子类对象也不能访问到这个数据。
__xxx__ 魔法函数，前后均有一个“双下划线” 代表python里特殊方法专用的标识，如 init() 代表类的构造函数。
如果想要强行访问父类的私有类型，做法是 self._ParentClass__var，这是非常不推荐的 hacky method。以下是示范代码：
1
2
3
4
5
6
7
8
class A:
__a = 1

class B(A):
pass

b = B()
print(b._A__a)

Reference:
https://time.geekbang.org/column/article/98998