Connecting to ES from Java and Using the IK Analyzer for Chinese Word Segmentation, with Notes on Elasticsearch's Analyzers and the Official Docs

Author: admin  Category: ELK  Published: 2019-04-26 00:42

After setting up the full ELK environment, it turned out to be quite pleasant to use. Recently a new requirement came up: use ES's analyzers to segment a piece of Chinese text. I assumed such a sophisticated component would ship with a decent Chinese analyzer, but unfortunately the built-in support is poor; you need to install the elasticsearch-analysis-ik plugin to segment Chinese effectively. Let's walk through the process.

1. ES ships with many built-in analyzers, which generally work well for Western languages. The default is the Standard Analyzer. The official list:

Standard Analyzer
The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
Simple Analyzer
The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.
Whitespace Analyzer
The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.
Stop Analyzer
The stop analyzer is like the simple analyzer, but also supports removal of stop words.
Keyword Analyzer
The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term.
Pattern Analyzer
The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words.
Language Analyzers
Elasticsearch provides many language-specific analyzers like english or french.
Fingerprint Analyzer
The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.

Let's look at the effect in the Kibana console:

// request
GET _analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}

// response
{
  "tokens": [
    {
      "token": "this",    //分词结果
      "start_offset": 0,  //原串起始位置
      "end_offset": 4,   //原串结束位置
      "type": "word",    //类型
      "position": 0        //出现位置
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "test",
      "start_offset": 10,
      "end_offset": 14,
      "type": "word",
      "position": 3
    }
  ]
}

More examples are in the official docs: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-analyzers.html

2. Chinese segmentation with the Standard Analyzer

GET /testbbspost/_analyze
{
  "analyzer": "standard",  //选择一个分词器
  "text": "这是一个好的现象么"
}

The result is poor: every Chinese character becomes its own single-character token:

{
  "tokens": [
    {
      "token": "这",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "一",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "个",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "好",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "的",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "现",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "象",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "么",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    }
  ]
}

3. Download and install the IK analyzer

Download it from https://github.com/medcl/elasticsearch-analysis-ik/releases, choosing the release that matches your ES version.

Unzip it into the elasticsearch/plugins/ directory and restart ES. (On 6.x a restart is enough; on other versions you may need to install it with the elasticsearch-plugin command instead; check the docs for your version.)
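As an alternative to unzipping by hand, the plugin can also be installed straight from the release URL; a minimal sketch, assuming ES 6.3.0 (substitute the version matching your cluster):

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip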

4. Check the IK segmentation

The IK plugin provides two analyzers, ik_smart and ik_max_word (the original ik analyzer is removed in newer plugin versions). ik_max_word exhaustively emits every possible word, producing fine-grained, overlapping tokens; ik_smart returns the coarsest-grained split that best matches the original meaning.

GET /testbbspost/_analyze
{
  "analyzer": "ik_smart",  //ik_smart(匹配原则) ,ik_max_word(穷举)
  "text": "这是一个好的现象么"
}

The result, now with proper word-level tokens:

{
  "tokens": [
    {
      "token": "这是",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "一个",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "好",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 2
    },
    {
      "token": "的",
      "start_offset": 5,
      "end_offset": 6,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "现象",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "么",
      "start_offset": 8,
      "end_offset": 9,
      "type": "CN_CHAR",
      "position": 5
    }
  ]
}
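For comparison, the same text can be run through ik_max_word, which returns overlapping fine-grained tokens; in the section 5 output below you can see 一个 kept as a word and additionally split into 一 (TYPE_CNUM) and 个 (COUNT):

GET /testbbspost/_analyze
{
  "analyzer": "ik_max_word",
  "text": "这是一个好的现象么"
}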

5. Calling ES's IK analysis from a Java client

In 6.x the RestHighLevelClient does not expose the analyze API, so we implement it with the low-level client instead. (TransportClient does support it, but it is deprecated and will be removed in 8.x, so it is not recommended.)

The code:

// Java implementation
package com.xxx.service.impl;

import java.io.IOException;

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

/**
 * @author chenhailong
 * @date 2019-04-25 10:09:07
 */
public class EsTest {

  public static void main(String[] args) {
    new EsTest().testAnalysis();
  }

  public void testAnalysis() {
    RestClientBuilder builder = RestClient.builder(new HttpHost("127.0.0.1", 9200));
    RestHighLevelClient client = new RestHighLevelClient(builder);
    String text = "这是一个号的现象么!";

    // The 6.x high-level client has no analyze API, so build the _analyze
    // request by hand and send it through the low-level client.
    Request request = new Request("GET", "/_analyze");

    JSONObject entity = new JSONObject();
    entity.put("analyzer", "ik_max_word"); // or "ik_smart" for coarse-grained tokens
    entity.put("text", text);
    request.setJsonEntity(entity.toJSONString());

    try {
      Response response = client.getLowLevelClient().performRequest(request);
      // Parse the raw JSON response and join the tokens with spaces
      JSONObject tokens = JSONObject.parseObject(EntityUtils.toString(response.getEntity()));
      JSONArray arrays = tokens.getJSONArray("tokens");
      StringBuilder result = new StringBuilder();
      for (int i = 0; i < arrays.size(); i++) {
        JSONObject obj = JSON.parseObject(arrays.getString(i));
        result.append(obj.getString("token")).append(" ");
      }
      System.out.println(result);
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      try {
        client.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
}

The program prints the space-joined tokens: 这是 一个 一 个 号 的 现象 么. The raw _analyze response that it parses looks like this:

{"tokens":[{"token":"这是","start_offset":0,"end_offset":2,"type":"CN_WORD","position":0},
{"token":"一个","start_offset":2,"end_offset":4,"type":"CN_WORD","position":1},
{"token":"一","start_offset":2,"end_offset":3,"type":"TYPE_CNUM","position":2},
{"token":"个","start_offset":3,"end_offset":4,"type":"COUNT","position":3},
{"token":"号","start_offset":4,"end_offset":5,"type":"CN_CHAR","position":4},
{"token":"的","start_offset":5,"end_offset":6,"type":"CN_CHAR","position":5},
{"token":"现象","start_offset":6,"end_offset":8,"type":"CN_WORD","position":6},
{"token":"么","start_offset":8,"end_offset":9,"type":"CN_CHAR","position":7}]}

 

6. The above only segments a standalone piece of text. How do we handle Chinese fields in an ES index?

There are actually two approaches: first, specify the analyzer for the Chinese field when creating the index or index template; second, specify the analyzer for the field when querying the index. The official example makes this clear:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": {  //自定义分词器,可重新定义分词器属性等
          "type": "custom",
          "tokenizer": "standard",
          "filter": [   
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded" 
      }
    }
  }
}

GET my_index/_analyze 
{
  "analyzer": "std_folded", 
  "text":     "Is this déjà vu?"
}

GET my_index/_analyze 
{
  "field": "my_text", 
  "text":  "Is this déjà vu?"
}

a. When the index is created, a custom analyzer std_folded is defined and assigned to the my_text field in the mapping.

b. To use a custom analyzer with _analyze, the request must target the index that defines it (GET my_index/_analyze rather than GET _analyze).

c. You can reference the analyzer directly by name, or indirectly through a field that is mapped to it; both work, as the two GET requests above show.
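Applying the same pattern to IK, here is a minimal sketch of an index whose Chinese field is indexed with ik_max_word and searched with ik_smart (the index and field names are made up; the analyzer combination follows the IK README, and on 6.x you would nest properties under a mapping type such as _doc):

PUT my_cn_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

Indexing with ik_max_word maximizes the terms available for matching, while searching with ik_smart keeps queries from matching on overly fine-grained fragments.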

