Preserving line breaks when converting HTML to plain text with jsoup

2020-05-17 22:49:51

When I convert HTML to text with the following code:

import org.jsoup.Jsoup;

public class NewClass {
    // Parses the HTML and returns its text content with all tags removed
    public String noTags(String str) {
        return Jsoup.parse(str).text();
    }

    public static void main(String[] args) {
        String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
            "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

        NewClass text = new NewClass();
        System.out.println(text.noTags(strings));
    }
}

Output:

hello world yo googlez

But I want this output instead:

hello world
yo googlez

Method 1

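public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // Disable pretty-printing so html() preserves line breaks and spacing
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // Mark each br and p tag with a literal backslash-n placeholder
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // Turn the placeholders into real newline characters
    String s = document.html().replaceAll("\\\\n", "\n");
    // Strip all remaining tags. In jsoup 1.14.2 and later, Whitelist is named Safelist.
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}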

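It satisfies the following requirements:

If the original HTML contains newlines (\n), they are preserved.
If the original HTML contains br or p tags, they are converted to newlines (\n).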
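
The escaping in Method 1 is easy to misread, so here is a minimal sketch of just that step using only the JDK (no jsoup). The class name MarkerDemo and the helper restoreNewlines are hypothetical names introduced for illustration. It shows that the Java literal "\\n" is a two-character placeholder (a backslash followed by n), and that replaceAll("\\\\n", "\n") swaps that placeholder for a real newline.

```java
public class MarkerDemo {
    // The Java literal "\\n" is two characters: a backslash followed by 'n'.
    public static final String MARKER = "\\n";

    /** Replaces each backslash-n placeholder with a real newline character. */
    public static String restoreNewlines(String s) {
        // The regex \\n (written "\\\\n" in Java source) matches a literal
        // backslash followed by 'n'; the replacement "\n" is a real newline.
        return s.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        String marked = "hello world" + MARKER + "yo googlez";
        System.out.println(marked);                  // one line, placeholder still visible
        System.out.println(restoreNewlines(marked)); // two lines
    }
}
```

One consequence of this approach: any backslash-n sequence that already appears as visible text in the input is converted to a newline as well.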

Method 2

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

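This uses the following method:

public static String clean(String bodyHtml,
String baseUri,
Whitelist whitelist,
Document.OutputSettings outputSettings)

Passing Whitelist.none() ensures that all HTML is removed.

Passing new OutputSettings().prettyPrint(false) ensures that the output is not reformatted and that line breaks already present in the input are preserved.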

Method 3

public static String cleanPreserveLineBreaks(String bodyHtml) {

    // Get pretty-printed HTML that keeps only the br and p tags
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // Get plain text with the line breaks preserved by disabling prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}
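
The methods above can leave leading blank lines or runs of blank lines in the result (Method 1, for example, inserts two newlines at the start of every p tag). If the goal is exactly one line per block, as in the desired output, an optional post-processing step could look like the sketch below. It uses only the JDK, and the class name LineBreakTidy and method tidy are hypothetical names, not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;

public class LineBreakTidy {
    /** Trims each line, drops empty lines, and joins the rest with a single '\n'. */
    public static String tidy(String text) {
        if (text == null) {
            return null;
        }
        List<String> kept = new ArrayList<>();
        // \R matches any line break sequence, including \r\n
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        // Input shaped like text with extra blank lines around each block
        System.out.println(tidy("\n\nhello world\n\n\nyo googlez\n"));
    }
}
```

This discards intentional blank lines too, so it only fits cases where paragraph spacing does not need to survive.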

程序王