Fixing randomly garbled Chinese characters when parsing gzip-encoded HTML with Jsoup

Author: admin  Category: Scrapy  Published: 2019-10-08 16:06  Views: 146

Problem description

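When scraping pages with jsoup and reading elements from the Document, a few characters in the Chinese text come out garbled at random. The position of the garbled characters changes from run to run, for example: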

/禁毒办/艾���病署

Troubleshooting process

  1. At first I assumed this was a simple charset problem, so I checked the page source. Its declared charset is utf-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Then I checked the charset used in my own code, which is also UTF-8 (this is the earlier version of the code):

public static String convertStreamToString(InputStream is) throws IOException {
    if (is == null) {
        return "";
    }
    StringBuilder sb = new StringBuilder();
    // Decode the input stream as UTF-8. try-with-resources closes the reader
    // and the underlying stream even if reading fails.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line terminator, so lines are joined without newlines
            sb.append(line);
        }
    }
    return sb.toString();
}

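  2. Next I tried reading into a byte array instead of using a buffered reader. This made no difference either, so the way the stream is read is not the cause.
StringBuilder sb = new StringBuilder();
ByteArrayOutputStream output = new ByteArrayOutputStream();
byte[] buffer = new byte[4096];
int n = 0;
while (-1 != (n = is.read(buffer))) {
    output.write(buffer, 0, n);
}
// Decode all collected bytes in one go, with an explicit charset
// rather than the platform default
sb.append(output.toString("UTF-8"));
System.out.println("---" + sb.toString());
output.close();
return sb.toString();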

  3. Then I changed the way jsoup fetches the page. The problem still reproduced.
// This simple setting did not have the desired effect
Document doc = null;
try {
    doc = Jsoup.connect(list.get(k).getUrl())
            .header("Accept-Encoding", "gzip, deflate")
            .get();
    System.out.println("=----" + doc);
} catch (IOException e1) {
    e1.printStackTrace();
}

// jsoup's built-in parse method, which accepts a charset name
String url = list.get(k).getUrl();
Document doc = null;
try {
    doc = Jsoup.parse(new URL(url).openStream(), "utf-8", url);
    System.out.println(doc);
} catch (MalformedURLException e1) {
    e1.printStackTrace();
} catch (IOException e1) {
    e1.printStackTrace();
}

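  4. Since the usual fixes did not work, I opened the browser's developer tools and looked at the response headers. That is where I noticed gzip.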
Cache-Control: private
Content-Encoding: gzip
Content-Length: 259
Content-Type: text/html
Cteonnt-Length: 602
Date: Tue, 08 Oct 2019 06:45:31 GMT
Server: Microsoft-IIS/7.5
Set-Cookie: ASPSESSIONIDQCRTCTQS=PHKNGGBCABJPEGHOHFKHNFBN; path=/
Set-Cookie: citrix_ns_id_x.org_%2F_wat=AAAAAAWKpGbTjswS8c7UnnpvIBEyDbarJlhEw0u8MO67R0dbDZrfQw8XIsv_35qBIP0Tb8WGOnS_XiAfH6SdS-3LyjPlu4kcDYFyIkmWeBvpf4bh-w==&AAAAAAWWhwK1EH07U7YJOVIPRufnOMNDerwqHRsjrYidqxqev4PxoiC-enkxHlWrepwPQIxpkLatgw142D66QjmWYOBNCFwo90yuOARzMERXIhhvyQ==&AAAAAAVtUSOqaAr24-UNw1K9LVmicoNEZOXgqb_Uivxc4zKLmxCnbsZhS49EHy2r5BAVk5VfHepRqS3-agRzoS-_fM9zaZHqCWV1PeSSSRR_etIXmg==&AAAAAAXD0S-jMpbJMv1-P0WSAu9ZWofbOqs7FEutuBbPFNrqCGwWx6BYiU3bovvo1wSOr09HncXVrn5ecIzFIsRCljavKQlgwUQlr1RvY7aRM2s_Bw==&AAAAAAW7Su_ULYyWviFh0moyJgil5haVGMRAVGhPBsT_p5yqM8yn2PnzFikewK-HRUG4tvNOFOdw6QITVU7dW2jTwzK01knebHfjmjndHdAdNjsrxQ==&; Domain=.un.org; Path=/; HttpOnly
Strict-Transport-Security: max-age=7776000

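Anyone familiar with SEO may know that one way to optimise a site is to check whether gzip is enabled (gzip is a lossless compression algorithm that can noticeably speed up page loads). The Java web projects I had worked on before mostly dealt with GBK or UTF-8, so how should gzip be handled? Note that gzip is a compression applied to the response body (Content-Encoding), not a character encoding like GBK or UTF-8: the body has to be decompressed first and only then decoded as text. The JDK provides the GZIPInputStream and GZIPOutputStream classes for this in java.util.zip (they ship with the JDK, not with commons-io).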
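
To make the distinction between compression and character encoding concrete, here is a minimal sketch using only the JDK. The class name `GzipRoundTrip` and its methods are made up for this illustration, and the sample string (written with Unicode escapes, it reads /禁毒办/艾滋病署) is my assumption about what the ungarbled text looks like. The text is encoded to UTF-8 bytes, gzip-compressed, decompressed, and decoded as UTF-8 again.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTrip {

    // Encode the text as UTF-8, then gzip the bytes (roughly what a server
    // does when it answers with Content-Encoding: gzip).
    static byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Gunzip first, collect every byte, and only then decode as UTF-8.
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample text, written with Unicode escapes to keep the source ASCII-only
        String original = "/\u7981\u6BD2\u529E/\u827E\u6ECB\u75C5\u7F72";
        String restored = decompress(compress(original));
        System.out.println(original.equals(restored));
    }
}
```

The charset only matters at the two ends (`getBytes` and `new String`); the gzip layer in between just moves bytes.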


  5. Trying to read the stream through a GZIPInputStream

BufferedInputStream bufIn = new BufferedInputStream(is);
GZIPInputStream gzip = new GZIPInputStream(bufIn);
// IOUtils comes from Apache Commons IO; decode with an explicit charset
String s = new String(IOUtils.toByteArray(gzip), "UTF-8");
System.out.println(s);

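Parsing with the code above throws the error below, so I was probably using the object incorrectly.
I did not track down the exact cause. GZIPInputStream throws this exception when the stream does not start with the gzip magic number, so one likely explanation (not verified here) is that the bytes reaching it were not gzip-compressed, for example because the request did not ask for gzip or the body had already been decompressed.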

java.util.zip.ZipException: Not in GZIP format
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at com.chl.controller.UnController.convertStreamToByte(UnController.java:231)
    at com.chl.controller.UnController.getUrlDoc(UnController.java:176)
    at com.chl.controller.UnController.attackByDetailURL(UnController.java:66)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:189)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:102)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:800)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1038)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1005)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:897)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:634)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:882)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:99)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:92)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:93)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:200)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:834)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1415)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
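
One way to avoid `Not in GZIP format` is to look at the first two bytes before wrapping the stream. The sketch below is only an illustration under that assumption: `GzipSniffer`, `hasGzipMagic` and `maybeGunzip` are hypothetical names, not part of jsoup or the original code. It wraps the stream in a GZIPInputStream only when the gzip magic number (0x1f 0x8b) is present.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

class GzipSniffer {

    // True if the data starts with the two-byte gzip magic number 0x1f 0x8b.
    static boolean hasGzipMagic(byte[] head) {
        return head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    // Peeks at the first two bytes, pushes them back, and only wraps the
    // stream in a GZIPInputStream when it really looks like gzip data.
    static InputStream maybeGunzip(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 2);
        byte[] head = new byte[2];
        int count = 0;
        while (count < 2) {
            int n = pushback.read(head, count, 2 - count);
            if (n == -1) {
                break; // stream shorter than two bytes
            }
            count += n;
        }
        if (count > 0) {
            pushback.unread(head, 0, count);
        }
        boolean gzip = count == 2 && hasGzipMagic(head);
        return gzip ? new GZIPInputStream(pushback) : pushback;
    }

    public static void main(String[] args) throws IOException {
        // Plain (uncompressed) HTML passes through unchanged instead of throwing
        byte[] plain = "<html></html>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = maybeGunzip(new ByteArrayInputStream(plain))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```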

  6. In the end I used jsoup's connect method, which retrieved the Chinese text correctly.

public static void main(String[] args) {
    String url = "https://xxx.xx.x.shtml#2";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
                .ignoreContentType(true)
                .ignoreHttpErrors(true)
                .timeout(1000 * 30)
                .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36")
                .header("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
                .header("accept-encoding", "gzip, deflate, br")
                .header("accept-language", "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7")
                // .data(paramMap)
                .post();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // doc is still null if the request failed
    if (doc != null) {
        System.out.println(doc.toString());
    }
}

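Unexpectedly, the garbled text came back in later testing. After searching online, I found this answer the most convincing: https://zhidao.baidu.com/question/713139047397853805.html
In the end I read the stream into a byte array, byte[] buffer = new byte[8192];, with a larger buffer so that characters are less likely to be cut off at a buffer boundary. For now the output looks normal.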
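
The truncation idea can be reproduced in isolation. The sketch below is my own illustration, not the code from the linked answer: `ChunkedDecoding` and its methods are hypothetical names, and the sample string (escaped in the source, it reads 艾滋病署) is assumed. Each of these characters takes three bytes in UTF-8, so decoding every 4-byte chunk separately splits characters across chunk boundaries and yields U+FFFD replacement characters. Collecting all bytes and decoding once avoids the problem at any buffer size, whereas a larger buffer only makes a split less frequent.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ChunkedDecoding {

    // Fragile: every chunk is decoded on its own, so a multi-byte character
    // that straddles a chunk boundary becomes replacement characters (U+FFFD).
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Robust: chunks are only copied as raw bytes; decoding happens once,
    // after the whole content has been collected.
    static String decodeOnce(byte[] data, int chunkSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            out.write(data, off, len);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample text, escaped to keep the source ASCII-only
        String text = "\u827E\u6ECB\u75C5\u7F72";
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("per chunk intact: " + text.equals(decodePerChunk(data, 4)));
        System.out.println("decode once intact: " + text.equals(decodeOnce(data, 4)));
    }
}
```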


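All of the attempts above still showed garbled text to some degree. I finally found the real cause: it is an encoding/decoding problem in the Eclipse console itself, which can be configured by following the tutorials available online. If the stream content is written straight to a file and the file is inspected instead, no garbled text appears.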
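
A quick way to separate a console display problem from a real decoding bug is to bypass the console's default charset. The following is a minimal sketch under that assumption: `OutputCheck` and `roundTripThroughFile` are hypothetical names, and the sample string (escaped in the source, it reads 艾滋病署) is assumed. It prints through a PrintStream that is explicitly set to UTF-8 and also writes the text to a temporary file as UTF-8, then reads it back.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OutputCheck {

    // Writes the text to a temporary file as UTF-8 and reads it back,
    // so the result does not depend on how a console renders it.
    static String roundTripThroughFile(String text) {
        try {
            Path file = Files.createTempFile("jsoup-output", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumed sample text, escaped to keep the source ASCII-only
        String text = "\u827E\u6ECB\u75C5\u7F72";

        // Print with an explicit charset instead of the platform default
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);

        // If this prints true, the string itself is intact and any garbling
        // seen on screen comes from the console settings.
        System.out.println(text.equals(roundTripThroughFile(text)));
    }
}
```

The console still has to be configured to display UTF-8 for the first line to look right; the file round trip is the part that does not depend on it.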


