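Handling random garbled Chinese characters on gzip-encoded pages when parsing HTML with Jsoup
Problem description
When scraping web pages with jsoup and reading elements from the Document, a few characters of the Chinese content come out garbled at random. The position of the garbled characters also differs from run to run, for example: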
/禁毒办/艾���病署
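Troubleshooting process
- At first I assumed it was a plain character-encoding problem, so I checked the page source. Its declared charset is utf-8: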
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Then I checked the encoding used in my own code, which was also utf-8 [link to the earlier code]:
public static String convertStreamToString(InputStream is) throws IOException {
    if (is == null) {
        return "";
    }
    StringBuilder sb = new StringBuilder();
    // Decode the input stream as UTF-8; try-with-resources closes the reader
    // and the underlying stream exactly once
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line terminator, so add it back
            sb.append(line).append('\n');
        }
    }
    return sb.toString();
}
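- Next I tried replacing the buffered reader with a byte array. This made no difference either, so the way the stream is read is not the cause.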
ByteArrayOutputStream output = new ByteArrayOutputStream();
byte[] buffer = new byte[4096];
int n;
while ((n = is.read(buffer)) != -1) {
    output.write(buffer, 0, n);
}
// Decode all bytes in one go with an explicit charset rather than the platform default
String result = output.toString("UTF-8");
System.out.println("---" + result);
return result;
- Then I changed the way jsoup fetches the page. The problem still reproduced.
// This simple setup did not help
Document doc = null;
try {
doc = Jsoup.connect(list.get(k).getUrl())
.header("Accept-Encoding", "gzip, deflate").get();
System.out.println("=----" + doc);
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
// jsoup's built-in parse method, which lets you specify the charset
String url = list.get(k).getUrl();
Document doc = null;
try {
doc = Jsoup.parse(new URL(url).openStream(), "utf-8", url);
System.out.println(doc);
} catch (MalformedURLException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
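- Since the usual fixes did not work, I opened the browser's developer tools and looked at the response headers. That is where gzip showed up.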
Cache-Control: private
Content-Encoding: gzip
Content-Length: 259
Content-Type: text/html
Cteonnt-Length: 602
Date: Tue, 08 Oct 2019 06:45:31 GMT
Server: Microsoft-IIS/7.5
Set-Cookie: ASPSESSIONIDQCRTCTQS=PHKNGGBCABJPEGHOHFKHNFBN; path=/
Set-Cookie: citrix_ns_id_x.org_%2F_wat=AAAAAAWKpGbTjswS8c7UnnpvIBEyDbarJlhEw0u8MO67R0dbDZrfQw8XIsv_35qBIP0Tb8WGOnS_XiAfH6SdS-3LyjPlu4kcDYFyIkmWeBvpf4bh-w==&AAAAAAWWhwK1EH07U7YJOVIPRufnOMNDerwqHRsjrYidqxqev4PxoiC-enkxHlWrepwPQIxpkLatgw142D66QjmWYOBNCFwo90yuOARzMERXIhhvyQ==&AAAAAAVtUSOqaAr24-UNw1K9LVmicoNEZOXgqb_Uivxc4zKLmxCnbsZhS49EHy2r5BAVk5VfHepRqS3-agRzoS-_fM9zaZHqCWV1PeSSSRR_etIXmg==&AAAAAAXD0S-jMpbJMv1-P0WSAu9ZWofbOqs7FEutuBbPFNrqCGwWx6BYiU3bovvo1wSOr09HncXVrn5ecIzFIsRCljavKQlgwUQlr1RvY7aRM2s_Bw==&AAAAAAW7Su_ULYyWviFh0moyJgil5haVGMRAVGhPBsT_p5yqM8yn2PnzFikewK-HRUG4tvNOFOdw6QITVU7dW2jTwzK01knebHfjmjndHdAdNjsrxQ==&; Domain=.un.org; Path=/; HttpOnly
Strict-Transport-Security: max-age=7776000
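Anyone familiar with SEO probably knows that one way to optimize a site is to check whether it has gzip enabled (a lossless compression format that can greatly speed up page loads). The Java web projects I had worked on before only dealt with character encodings such as GBK and UTF-8. gzip is not a character encoding, though. It is a compression layer declared in Content-Encoding, so the body has to be decompressed first and only then decoded as UTF-8. For decompression the JDK provides the GZIPInputStream and GZIPOutputStream classes in java.util.zip (not in commons-io, which supplies the IOUtils helper used below).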
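To make the two layers concrete, here is a minimal sketch that compresses UTF-8 text with gzip and then reverses the steps in the right order. It uses only JDK classes; the class name `GzipRoundTrip` and the sample string are made up for illustration and are not part of the original project.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
class GzipRoundTrip {

    // Encode the text as UTF-8 bytes, then gzip-compress those bytes.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // Reverse order: decompress first, then decode the resulting bytes as UTF-8.
    static String decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String original = "中文内容";
        byte[] compressed = compress(original);
        // Prints whether the round trip returned the original text.
        System.out.println(decompress(compressed).equals(original));
    }
}
```

Decoding the compressed bytes directly as UTF-8, without decompressing first, would produce garbage for the whole body rather than for a few characters, which is one hint that gzip alone does not explain the symptom described here.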
5. Trying to read the stream through a GZIPInputStream
BufferedInputStream bufIn = new BufferedInputStream(is);
GZIPInputStream gzip = new GZIPInputStream(bufIn);
String s = new String(IOUtils.toByteArray(gzip));
System.out.println(s);
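Parsing with the code above fails with the error below, so the object is presumably being used incorrectly.
I did not track down the exact cause. The exception itself only says that the first bytes of the stream were not a gzip header, so the stream handed to GZIPInputStream was most likely not gzip-compressed data at that point.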
java.util.zip.ZipException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
at com.chl.controller.UnController.convertStreamToByte(UnController.java:231)
at com.chl.controller.UnController.getUrlDoc(UnController.java:176)
at com.chl.controller.UnController.attackByDetailURL(UnController.java:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:189)
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138)
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:102)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:800)
at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1038)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1005)
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:897)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:634)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:882)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:99)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:92)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:93)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:200)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:834)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1415)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
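The GZIPInputStream constructor reads the stream header immediately and throws "Not in GZIP format" when the first two bytes are not the gzip magic number 0x1f 0x8b. One possible explanation, not verified against the original site, is that the body was not compressed at that point, for example because the request did not advertise gzip support or because the stream had already been decompressed. A defensive sketch under that assumption peeks at the first two bytes and wraps the stream only when they match. The class name `MaybeGzip` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, for illustration only.
class MaybeGzip {

    // Returns a decompressing stream if the data starts with the gzip magic
    // bytes 0x1f 0x8b; otherwise returns the data unchanged.
    static InputStream unwrap(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        byte[] head = new byte[2];
        int read = 0;
        while (read < 2) {
            int n = pb.read(head, read, 2 - read);
            if (n == -1) {
                break; // stream shorter than two bytes
            }
            read += n;
        }
        if (read > 0) {
            pb.unread(head, 0, read); // put the peeked bytes back
        }
        boolean gzip = read == 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
        return gzip ? new GZIPInputStream(pb) : pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "<html></html>".getBytes(StandardCharsets.UTF_8);
        InputStream in = unwrap(new ByteArrayInputStream(plain));
        // Plain input is passed through, so no ZipException is thrown here.
        System.out.println(in instanceof GZIPInputStream);
    }
}
```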
6. In the end I used jsoup's connect method, which returned the Chinese text correctly.
public static void main(String[] args) {
String url = "https://xxx.xx.x.shtml#2";
Document doc = null;
try {
doc = Jsoup.connect(url)
.ignoreContentType(true)
.ignoreHttpErrors(true)
.timeout(1000 * 30)
.userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36")
.header("accept","text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
.header("accept-encoding", "gzip, deflate") // "br" removed as a precaution: a Brotli-compressed response may not be decoded by jsoup
.header("accept-language", "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7")
// .data(paramMap)
.post();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println(doc.toString());
}
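Unexpectedly, the garbled characters came back in later testing. After searching online, I found this explanation the most convincing: https://zhidao.baidu.com/question/713139047397853805.html
In the end I read the stream content into a byte array, byte[] buffer = new byte[8192];, using a larger array so that the content is not cut off in the middle. For now the output looks correct.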
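One way truncation can produce exactly this symptom is decoding each buffer chunk separately: a Chinese character takes three bytes in UTF-8, and if a chunk ends in the middle of one, those bytes decode to replacement characters. Network reads return varying numbers of bytes, which would also explain why the position changes between runs. The sketch below contrasts the two approaches. It is an illustration of the general pitfall, not a reconstruction of the original code, and the class name `ChunkDecoding` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class ChunkDecoding {

    // Problematic: decodes every chunk on its own, so a character whose bytes
    // straddle two chunks is turned into replacement characters (U+FFFD).
    static String decodePerChunk(InputStream in, int bufferSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buffer = new byte[bufferSize];
        int n;
        while ((n = in.read(buffer)) != -1) {
            sb.append(new String(buffer, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Safer: collect all bytes first and decode once at the end.
    static String decodeOnce(InputStream in, int bufferSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Four characters, three UTF-8 bytes each; a 4-byte buffer splits them.
        byte[] bytes = "中文内容".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodePerChunk(new ByteArrayInputStream(bytes), 4));
        System.out.println(decodeOnce(new ByteArrayInputStream(bytes), 4));
    }
}
```

With per-chunk decoding, a larger buffer only makes a split less likely. Accumulating the bytes and decoding once avoids the problem regardless of buffer size.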
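All of the attempts above still showed garbled characters to some degree. I finally found the real cause: the encoding/decoding performed by the Eclipse console itself. The console encoding can be changed by following the tutorials available online. If the stream content is written directly to a file and the file is inspected instead, no garbled characters appear.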
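A quick way to separate "the data is wrong" from "the console displays it wrong" is to write the text to a file as UTF-8 and read it back, bypassing the console entirely. The sketch below does that with JDK classes only. The class name `OutputCheck` and the temp-file name are hypothetical, and the second half assumes the console has been configured to display UTF-8.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class OutputCheck {

    // Write the text to a file as UTF-8 so it can be inspected without the console.
    static void writeUtf8(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back, decoding it as UTF-8.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String text = "中文内容";
        Path file = Files.createTempFile("jsoup-output", ".txt");
        writeUtf8(file, text);
        // If this prints true, the data is intact and any garbling seen on
        // screen comes from how the console decodes the output.
        System.out.println(readUtf8(file).equals(text));

        // Emit UTF-8 bytes regardless of the platform default charset.
        // Whether they display correctly still depends on the console setting.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```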