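Note: this article is based on Lucene 8.2.0.
Search is the fundamental reason for using Lucene, and this article introduces the commonly used queries it provides. The examples query the index of four short poems built in the earlier article "Lucene Series (2): Code Practice", so you may want to read that article and build the index first. In Lucene, the Term is the basic unit of a query, and the parent class of all queries is org.apache.lucene.search.Query. This article covers the main Query subclasses.
DisjunctionMaxQuery is mainly used to control scoring, and SpanQuery stands for a whole family of queries with many implementations. Neither is used very often, so they are left for a separate article. The complete code for all examples in this article is available here.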
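TermQuery is the most basic and most commonly used query; the corresponding class is org.apache.lucene.search.TermQuery. Its job is simple: find the documents that contain a given term.
Consider the following code: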
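import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

/**
 * Query Demo.
 *
 * @author NiYanchun
 **/
public class QueryDemo {

    /**
     * The field to search.
     */
    private static final String SEARCH_FIELD = "contents";

    public static void main(String[] args) throws Exception {
        // Directory where the index is stored
        final String indexPath = "indices/poems-index";
        // Open the index; try-with-resources closes the reader when done
        try (IndexReader indexReader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
            IndexSearcher searcher = new IndexSearcher(indexReader);

            // TermQuery
            termQueryDemo(searcher);
        }
    }

    private static void termQueryDemo(IndexSearcher searcher) throws IOException {
        System.out.println("TermQuery, search for 'death':");
        TermQuery termQuery = new TermQuery(new Term(SEARCH_FIELD, "death"));

        resultPrint(searcher, termQuery);
    }

    // Runs the query and prints the top 10 hits with doc ID, score and file path
    private static void resultPrint(IndexSearcher searcher, Query query) throws IOException {
        TopDocs topDocs = searcher.search(query, 10);
        if (topDocs.totalHits.value == 0) {
            System.out.println("not found!\n");
            return;
        }

        ScoreDoc[] hits = topDocs.scoreDocs;

        System.out.println(topDocs.totalHits.value + " result(s) matched: ");
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            System.out.println("doc=" + hit.doc + " score=" + hit.score + " file: " + doc.get("path"));
        }
        System.out.println();
    }
}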
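The code above opens the index and then runs a term query for every document containing the keyword death. To make printing easier, the results are printed by a helper function, resultPrint. The poem On Death contains the keyword death, so the program prints: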
TermQuery, search for 'death':
1 result(s) matched: 
doc=3 score=0.6199532 file: data/poems/OnDeath.txt
The later examples build on this code structure.
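BooleanQuery combines several queries with AND/OR-style logic and supports nesting. It currently supports four logical relations, defined by BooleanClause.Occur: MUST (the clause must match), SHOULD (the clause may match), MUST_NOT (the clause must not match), and FILTER (the clause must match, like MUST, but does not contribute to the score).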
Usage is straightforward. The following code uses BooleanQuery to find documents whose contents field contains love but not seek:
private static void booleanQueryDemo(IndexSearcher searcher) throws IOException {
    System.out.println("BooleanQuery, must contain 'love' but absolutely not 'seek': ");
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    builder.add(new TermQuery(new Term(SEARCH_FIELD, "love")), BooleanClause.Occur.MUST);
    builder.add(new TermQuery(new Term(SEARCH_FIELD, "seek")), BooleanClause.Occur.MUST_NOT);
    BooleanQuery booleanQuery = builder.build();

    resultPrint(searcher, booleanQuery);
}
Both Love's Secret and Freedom and Love contain the word love, but the former also contains seek, so the final search result is Freedom and Love.
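The MUST / MUST_NOT combination can be pictured as plain set arithmetic over document IDs. The sketch below is only a toy model, not how Lucene evaluates a BooleanQuery internally; the class and method names are made up for illustration, and the doc IDs (1 for Freedom and Love, 2 for Love's Secret) are taken from the program output shown elsewhere in this article.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy model of BooleanQuery: each term maps to the set of doc IDs that contain it. */
class BooleanLogicSketch {

    /** Docs in the MUST set that are not in the MUST_NOT set. */
    static Set<Integer> mustAndMustNot(Set<Integer> must, Set<Integer> mustNot) {
        Set<Integer> result = new TreeSet<>(must);
        result.removeAll(mustNot);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical posting sets, with doc IDs as printed by the article's demos
        Set<Integer> love = Set.of(1, 2); // Freedom and Love, Love's Secret
        Set<Integer> seek = Set.of(2);    // Love's Secret
        System.out.println(mustAndMustNot(love, seek)); // [1]
    }
}
```

Only doc 1 (Freedom and Love) survives, which matches the result described above.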
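PhraseQuery searches for a sequence of terms, for example "hello world", which is a sequence of two terms. Two points matter for phrase queries: they need term positions, so they only work if positions were stored at index time, and they take a slop parameter, which is defined in terms of edit distance.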
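Edit distance describes how similar two strings are (a word is just a special kind of string). It has several definitions; the most common are Levenshtein distance and its extension, Damerau–Levenshtein distance, and these are the two that Lucene uses. Levenshtein distance defines edit distance as follows: if at least n single-symbol operations, each an insertion, a deletion, or a substitution, are needed to make two strings equal, then the distance between the two strings is n. Three points are worth noting here:
A few examples:
Damerau–Levenshtein distance extends Levenshtein distance with one more operation, transposition: swapping two adjacent symbols counts as a single operation, that is, a distance of 1. So under Levenshtein distance, cat and cta are at distance 2, but under Damerau–Levenshtein distance they are at distance 1. Likewise, with words as symbols, "good boy" and "boy good" are at distance 1.
That is edit distance. PhraseQuery uses Levenshtein distance, and its default slop is 0, which means only exactly matching term sequences are retrieved. Consider this example: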
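To make the two definitions concrete, here is a minimal textbook sketch in plain Java. It is not Lucene's implementation; the class and method names are invented for illustration, and the transposition variant is the restricted "optimal string alignment" form, which is enough for the examples in the text. Symbols are passed as string arrays so the same code works for characters and for words.

```java
/** Textbook edit distances over arbitrary symbols (characters or words). */
class EditDistanceSketch {

    /** Levenshtein distance: insertions, deletions and substitutions only. */
    static int levenshtein(String[] a, String[] b) {
        return distance(a, b, false);
    }

    /** Same, but swapping two adjacent symbols also counts as one edit. */
    static int withTranspositions(String[] a, String[] b) {
        return distance(a, b, true);
    }

    private static int distance(String[] a, String[] b, boolean allowSwap) {
        // d[i][j] = distance between the first i symbols of a and the first j of b
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;
        for (int j = 0; j <= b.length; j++) d[0][j] = j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                if (allowSwap && i > 1 && j > 1
                        && a[i - 1].equals(b[j - 2]) && a[i - 2].equals(b[j - 1])) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length][b.length];
    }

    /** Splits a non-empty string into one-character symbols. */
    static String[] chars(String s) { return s.split(""); }

    /** Splits a string into word symbols on single spaces. */
    static String[] words(String s) { return s.split(" "); }

    public static void main(String[] args) {
        System.out.println(levenshtein(chars("cat"), chars("cta")));                   // 2
        System.out.println(withTranspositions(chars("cat"), chars("cta")));            // 1
        System.out.println(withTranspositions(words("good boy"), words("boy good"))); // 1
    }
}
```

The only difference between the two variants is the extra `allowSwap` branch, which is exactly the transposition rule described above.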
private static void phraseQueryDemo(IndexSearcher searcher) throws IOException {
    System.out.println("\nPhraseQuery, search 'love that'");

    // No slop is set, so the default of 0 applies: exact phrase only
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    builder.add(new Term(SEARCH_FIELD, "love"));
    builder.add(new Term(SEARCH_FIELD, "that"));
    PhraseQuery phraseQuery = builder.build();

    resultPrint(searcher, phraseQuery);
}

// Output
PhraseQuery, search 'love that'
1 result(s) matched: 
doc=2 score=0.7089927 file: data/poems/Love'sSecret.txt
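Love's Secret contains the line "Love that never told shall be", which matches "love that". We can also change the slop so that every document whose edit distance from the search sequence is less than or equal to slop is retrieved; the smaller the distance, the higher the document scores. Consider this example: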
private static void phraseQueryWithSlopDemo(IndexSearcher searcher) throws IOException {
    System.out.println("PhraseQuery with slop: 'love <slop> never'");
    // slop = 1: allows one position of movement between "love" and "never"
    PhraseQuery phraseQueryWithSlop = new PhraseQuery(1, SEARCH_FIELD, "love", "never");

    resultPrint(searcher, phraseQueryWithSlop);
}

// Output
PhraseQuery with slop: 'love <slop> never'
1 result(s) matched: 
doc=2 score=0.43595996 file: data/poems/Love'sSecret.txt
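One way to reason about how much slop a phrase needs is to compare where each query term sits in the document with where it would sit in an exact match. The sketch below is a simplified model based on my understanding of sloppy phrase matching, not Lucene's actual scorer: it assumes one fixed occurrence of each term and ignores repeated terms. The class and method names are hypothetical, and the positions assume the line "Love that never told shall be" is tokenized as love=0, that=1, never=2.

```java
/** Simplified model of how far a document's terms are from an exact phrase. */
class PhraseSlopSketch {

    /**
     * docPositions[i] is the document position of the i-th query term.
     * Returns the smallest slop for which this particular placement would match.
     */
    static int requiredSlop(int[] docPositions) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < docPositions.length; i++) {
            int shift = docPositions[i] - i; // offset relative to an exact phrase
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }

    public static void main(String[] args) {
        System.out.println(requiredSlop(new int[] {0, 1})); // "love that":  0
        System.out.println(requiredSlop(new int[] {0, 2})); // "love never": 1
        System.out.println(requiredSlop(new int[] {1, 0})); // reversed pair: 2
    }
}
```

Under this model "love never" needs a slop of 1, which agrees with the query above, and swapping two adjacent words needs a slop of 2 rather than 1.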
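Neither the official documentation nor other material online says much about MultiPhraseQuery, but what it does is simple, and an example makes it clear. Suppose we provide two arrays of terms, ["love", "hate"] and ["him", "her"], and pass them to MultiPhraseQuery. It will then search for the combinations "love him", "love her", "hate him", and "hate her", each of which is effectively a PhraseQuery as described above. MultiPhraseQuery can of course take more than two arrays as well.
As the example shows, PhraseQuery is really just a special case of MultiPhraseQuery: if every array passed to MultiPhraseQuery contains a single term, it degenerates into a PhraseQuery. Within MultiPhraseQuery, the elements of one array are matched with OR semantics, that is, those terms share the same position. Recall from an earlier article that placing several terms at the same position enables synonym search. Indeed, MultiPhraseQuery is mainly used for synonym queries in practice. For a query such as "我爱土豆" ("I love potatoes"), you could pass these two arrays to MultiPhraseQuery: ["喜欢" (like), "爱" (love)] and ["土豆", "马铃薯", "洋芋"] (three words for potato), so that the results are more complete.
Finally, an example: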
private static void multiPhraseQueryDemo(IndexSearcher searcher) throws IOException {
    System.out.println("MultiPhraseQuery:");

    // On Death contains the line: I know not what into my ear
    // Fog contains the line: It sits looking over harbor and city
    // The query below matches 4 cases: "know harbor", "know not", "over harbor", "over not"
    MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
    Term[] termArray1 = new Term[2];
    termArray1[0] = new Term(SEARCH_FIELD, "know");
    termArray1[1] = new Term(SEARCH_FIELD, "over");
    Term[] termArray2 = new Term[2];
    termArray2[0] = new Term(SEARCH_FIELD, "harbor");
    termArray2[1] = new Term(SEARCH_FIELD, "not");
    builder.add(termArray1);
    builder.add(termArray2);
    MultiPhraseQuery multiPhraseQuery = builder.build();

    resultPrint(searcher, multiPhraseQuery);
}

// Output
MultiPhraseQuery:
2 result(s) matched: 
doc=0 score=2.7032354 file: data/poems/Fog.txt
doc=3 score=2.4798129 file: data/poems/OnDeath.txt
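The set of phrases a MultiPhraseQuery covers is the Cartesian product of its per-position term arrays. The following sketch simply enumerates that product in plain Java. It only illustrates the semantics and says nothing about how Lucene executes the query; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates every phrase formed by picking one term per position. */
class MultiPhraseExpansionSketch {

    static List<String> expand(List<List<String>> positions) {
        List<String> phrases = new ArrayList<>();
        phrases.add(""); // start with a single empty phrase
        for (List<String> alternatives : positions) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String term : alternatives) {
                    next.add(prefix.isEmpty() ? term : prefix + " " + term);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of(
                List.of("know", "over"),
                List.of("harbor", "not"))));
        // [know harbor, know not, over harbor, over not]
    }
}
```

With one term per position the product has a single element, which is the PhraseQuery special case mentioned earlier.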
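These three queries, PrefixQuery, WildcardQuery, and RegexpQuery, provide pattern-based matching:
Note that WildcardQuery and RegexpQuery perform somewhat worse, because they may have to scan a large part of the index to find matches. Patterns that begin with a wildcard are strongly discouraged. "Worse" here is relative to other queries: in a rough test of mine on two ES nodes (16 cores and 32 GB each) with fairly short documents, queries over fewer than ten million documents still returned within milliseconds. Finally, a few usage examples: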
private static void prefixQueryDemo(IndexSearcher searcher) throws IOException {
    System.out.println("PrefixQuery, search terms begin with 'co'");
    PrefixQuery prefixQuery = new PrefixQuery(new Term(SEARCH_FIELD, "co"));

    resultPrint(searcher, prefixQuery);
}

private static void wildcardQueryDemo(IndexSearcher searcher) throws IOException {
    System.out.println("WildcardQuery, search terms 'har*'");
    WildcardQuery wildcardQuery = new WildcardQuery(new Term(SEARCH_FIELD, "har*"));

    resultPrint(searcher, wildcardQuery);
}

private static void regexpQueryDemo(IndexSearcher searcher) throws IOException {
    // The printed pattern now matches the one actually passed to RegexpQuery
    System.out.println("RegexpQuery, search regexp 'l[ai].*'");
    RegexpQuery regexpQuery = new RegexpQuery(new Term(SEARCH_FIELD, "l[ai].*"));

    resultPrint(searcher, regexpQuery);
}

// Output
PrefixQuery, search terms begin with 'co'
2 result(s) matched: 
doc=0 score=1.0 file: data/poems/Fog.txt
doc=2 score=1.0 file: data/poems/Love'sSecret.txt

WildcardQuery, search terms 'har*'
1 result(s) matched: 
doc=0 score=1.0 file: data/poems/Fog.txt

RegexpQuery, search regexp 'l[ai].*'
2 result(s) matched: 
doc=0 score=1.0 file: data/poems/Fog.txt
doc=3 score=1.0 file: data/poems/OnDeath.txt
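A rough intuition for why a leading wildcard is expensive: when terms are kept in sorted order, a known prefix lets the search jump straight to the right place and stop early, while a pattern with an unknown beginning gives the sort order nothing to work with. The sketch below uses a `TreeSet` as a stand-in for a sorted term list. That is an assumption made purely for illustration (Lucene's real term dictionary is a different structure), and the class, method names and term list are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy sorted term list showing prefix seek versus full scan. */
class TermScanSketch {

    /** Prefix lookup: seek to the prefix in sorted order, stop at the first non-match. */
    static List<String> byPrefix(TreeSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order guarantees no later term can match
            }
            hits.add(term);
        }
        return hits;
    }

    /** Suffix lookup (like "*ing"): sorted order does not help, every term is checked. */
    static List<String> bySuffix(TreeSet<String> terms, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.endsWith(suffix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A small hand-picked term list, not the real index contents
        TreeSet<String> terms = new TreeSet<>(
                List.of("city", "comes", "fog", "harbor", "looking", "love", "sits"));
        System.out.println(byPrefix(terms, "co"));  // [comes]
        System.out.println(bySuffix(terms, "ing")); // [looking]
    }
}
```

`byPrefix` touches only the matching terms plus one more, whereas `bySuffix` has to visit all seven.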
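Like PhraseQuery, FuzzyQuery matches on the edit distance introduced above. The difference is that in PhraseQuery the search input is a sequence of terms, so one symbol in the edit-distance definition is a word, whereas the search input of FuzzyQuery is a single term, so a symbol is a character. There are also a few points to note when using it:
Finally, an example: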
private static void fuzzyQueryDemo(IndexSearcher searcher) throws IOException {
    System.out.println("FuzzyQuery, search 'remembre'");
    // "remember" is deliberately misspelled as "remembre"; 1 is the maximum edit distance
    FuzzyQuery fuzzyQuery = new FuzzyQuery(new Term(SEARCH_FIELD, "remembre"), 1);

    resultPrint(searcher, fuzzyQuery);
}

// Output
FuzzyQuery, search 'remembre'
1 result(s) matched: 
doc=1 score=0.4473783 file: data/poems/FreedomAndLove.txt
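When introducing Field earlier, we covered several commonly used numeric Field types: IntPoint, LongPoint, FloatPoint, and DoublePoint. PointRangeQuery is the Query that provides range search over such numeric data. Both its function and its principle are simple, so let's go straight to a complete example: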
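import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Point Query Demo.
 *
 * @author NiYanchun
 **/
public class PointQueryDemo {

    public static void main(String[] args) throws Exception {
        // Directory where the index is stored
        final String indexPath = "indices/point-index";
        Directory indexDir = FSDirectory.open(Paths.get(indexPath));
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(indexDir, iwc);

        // Add 10 documents; each has one field named "field" whose value is a number from 0 to 9
        for (int i = 0; i < 10; i++) {
            Document doc = new Document();
            Field pointField = new IntPoint("field", i);
            doc.add(pointField);
            writer.addDocument(doc);
        }
        writer.close();

        // Search
        IndexReader indexReader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
        IndexSearcher searcher = new IndexSearcher(indexReader);

        // Find documents whose "field" value lies in the inclusive range [5, 8]
        Query query = IntPoint.newRangeQuery("field", 5, 8);
        TopDocs topDocs = searcher.search(query, 10);

        if (topDocs.totalHits.value == 0) {
            System.out.println("not found!");
            return;
        }

        ScoreDoc[] hits = topDocs.scoreDocs;

        System.out.println(topDocs.totalHits.value + " result(s) matched: ");
        for (ScoreDoc hit : hits) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}

// Output
4 result(s) matched: 
doc=5 score=1.0
doc=6 score=1.0
doc=7 score=1.0
doc=8 score=1.0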
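The complete code is available here.
TermRangeQuery is similar to PointRangeQuery, except that it compares strings rather than numbers. The comparison is based on the method org.apache.lucene.util.BytesRef.compareTo(BytesRef other). Here is an example: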
private static void termRangeQueryDemo(IndexSearcher searcher) throws IOException {
    System.out.println("TermRangeQuery, search term between 'loa' and 'lov'");
    // The trailing true and false mean: loa <= term to match < lov
    TermRangeQuery termRangeQuery = new TermRangeQuery(SEARCH_FIELD, new BytesRef("loa"), new BytesRef("lov"), true, false);

    resultPrint(searcher, termRangeQuery);
}

// Output
TermRangeQuery, search term between 'loa' and 'lov'
1 result(s) matched: 
doc=0 score=1.0 file: data/poems/Fog.txt    // The term 'looking' in Fog satisfies the condition
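The range check can be reproduced outside Lucene to see why 'looking' qualifies and 'love' does not. The sketch below compares UTF-8 bytes in unsigned order, which is how I understand BytesRef.compareTo to behave. It is a stand-alone approximation with hypothetical class and method names, not Lucene code, and it needs Java 9 or later for `Arrays.compareUnsigned`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Stand-alone approximation of a byte-wise term range check. */
class TermRangeSketch {

    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        byte[] t = term.getBytes(StandardCharsets.UTF_8);
        // Compare as unsigned bytes, shorter array first when one is a prefix of the other
        int cmpLower = Arrays.compareUnsigned(t, lower.getBytes(StandardCharsets.UTF_8));
        int cmpUpper = Arrays.compareUnsigned(t, upper.getBytes(StandardCharsets.UTF_8));
        boolean aboveLower = includeLower ? cmpLower >= 0 : cmpLower > 0;
        boolean belowUpper = includeUpper ? cmpUpper <= 0 : cmpUpper < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        System.out.println(inRange("looking", "loa", "lov", true, false)); // true
        System.out.println(inRange("love", "loa", "lov", true, false));    // false
    }
}
```

'love' falls outside the range because "lov" is a proper prefix of it, so it sorts after the exclusive upper bound.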
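ConstantScoreQuery is simple: it wraps another query and replaces the scores in that query's results with a constant (1.0 by default). The example at the end of the FuzzyQuery section above returned a result with score=0.4473783; now let's wrap it in a ConstantScoreQuery and see the effect: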
private static void constantScoreQueryDemo(IndexSearcher searcher) throws IOException {
    System.out.println("ConstantScoreQuery:");
    ConstantScoreQuery constantScoreQuery = new ConstantScoreQuery(
            new FuzzyQuery(new Term(SEARCH_FIELD, "remembre"), 1));

    resultPrint(searcher, constantScoreQuery);
}

// Output
ConstantScoreQuery:
1 result(s) matched: 
doc=1 score=1.0 file: data/poems/FreedomAndLove.txt
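One more point to note: leaving scores aside, a ConstantScoreQuery wrapping a Filter and a BooleanQuery wrapping a Filter return the same documents. In BooleanQuery, the FILTER relation behaves like MUST but does not compute a score, whereas the whole purpose of ConstantScoreQuery is to assign one. So the two match the same documents, but the results of ConstantScoreQuery with a Filter carry a score, while the results of BooleanQuery with a Filter do not (the score field is 0).
MatchAllDocsQuery is simple: it matches every document. It is useful when there is no specific search condition and you only want to preview some of the data. Here is an example: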
private static void matchAllDocsQueryDemo(IndexSearcher searcher) throws IOException {
    System.out.println("MatchAllDocsQueryDemo:");
    MatchAllDocsQuery matchAllDocsQuery = new MatchAllDocsQuery();

    resultPrint(searcher, matchAllDocsQuery);
}

// Output
MatchAllDocsQueryDemo:
4 result(s) matched: 
doc=0 score=1.0 file: data/poems/Fog.txt
doc=1 score=1.0 file: data/poems/FreedomAndLove.txt
doc=2 score=1.0 file: data/poems/Love'sSecret.txt
doc=3 score=1.0 file: data/poems/OnDeath.txt
Source: NYC's Blog. Please credit the source when reposting!
Source URL: https://niyanchun.com/lucene-learning-8.html