How To Unescape HTML in Java

I was writing an HTML unescape algorithm in Java today, and the version below is what I came up with. It has a problem: in some corner cases it eats a space or a few characters. Can you figure out what the problem is?

I wrote the class so that you can compile and run it from the command prompt and see the output right away. With a little trial and error you should be able to track down the issue.

package com.salesforce.test;

import java.util.*;

/**
 * To compile: javac -d . StringUtils.java
 * To run: java com.salesforce.test.StringUtils
 *
 * @author ashik
 */
public class StringUtils {
  private StringUtils() {}

  /**
   * the characters to unescape
   */
  public static HashMap<String, String> htmlEntities = new HashMap<String, String>();

  static {
    htmlEntities.put("&lt;","<"); htmlEntities.put("&#60;","<"); htmlEntities.put("&gt;",">");
    htmlEntities.put("&apos;","\'"); htmlEntities.put("&#39;","\'");
    htmlEntities.put("&quot;","\""); htmlEntities.put("&#34;","\"");
    htmlEntities.put("&amp;","&"); htmlEntities.put("&#38;","&");
    htmlEntities.put("&nbsp;"," "); htmlEntities.put("&#160;"," ");
  }

  public static final String unescapeHTML(String source, int start){
     int i,j;
     i = source.indexOf("&", start);
     if (i > -1) {
        j = source.indexOf(";" ,i);
        if (j > i) {
           String entityToLookFor = source.substring(i , j + 1);
           System.out.println("source = " + source + ", entityToLookFor = " + entityToLookFor + ", i = " + i + " and j = " + j);
           if(entityToLookFor.lastIndexOf("&") > entityToLookFor.indexOf("&")) {
              i = entityToLookFor.lastIndexOf("&") + 1;
              System.out.println("Before source = " + source + ", entityToLookFor = " + entityToLookFor + ", i = " + i + " and j = " + j);
              entityToLookFor = entityToLookFor.substring(entityToLookFor.lastIndexOf("&"));
              System.out.println("After  source = " + source + ", entityToLookFor = " + entityToLookFor + ", i = " + i + " and j = " + j);
           }
           String value = (String)htmlEntities.get(entityToLookFor);
           if (value != null) {
             source = new StringBuffer().append(source.substring(0 , i)).append(value).append(source.substring(j + 1)).toString();
             return unescapeHTML(source, i + 1); // recursive call
           }
         }
     }
     return source;
  }

  public static void main(String args[]) throws Exception {
      // to see accented characters on the console
      java.io.PrintStream ps = new java.io.PrintStream(System.out, true, "Cp850");
      ps.println("Finally: Ashik's Quote <Test Ok = " + unescapeHTML("Ashik&apos;s Quote &lt;Test Ok", 0));
      ps.println("-----------");
      ps.println("Finally: M& M > 5 = " + unescapeHTML("M& M &gt; 5", 0));
      ps.println("-----------");
      ps.println("Finally: M & M > 5 = " + unescapeHTML("M & M &gt; 5", 0));
      ps.println("-----------");
      ps.println("Finally: M &M > 5 = " + unescapeHTML("M &M &gt; 5", 0));
      ps.println("-----------");
      ps.println("Finally: M& M> 5 = " + unescapeHTML("M& M&gt; 5", 0));
      ps.println("-----------");
      ps.println("Also: \n-->" + unescapeHTML("Apos\'trophie &amp; &quot;quote&quot; is &lt;present&gt;", 0));
      ps.println("-----------");
      ps.println("Also: \n-->" + unescapeHTML("Please check for empty space in Order Review tab.", 0));
      ps.println("-----------");
      ps.println("Also: \n-->" + unescapeHTML("Also check for Billing Information &amp; Subscription Information &amp; Order Review tabs.", 0));
      ps.println("-----------");
      ps.println("Also: \n-->" + unescapeHTML("It&apos;s difficult to check today&apos;s capitalization strategy for all tabs.", 0));
      ps.println("-----------");
      ps.println("Also: \n-->" + unescapeHTML("Remember that 12 is &gt; 9 is &gt; 4", 0));
      ps.println("-----------");
      ps.println("Also: \n-->" + unescapeHTML("Similarly 8 is &lt; 10 is &lt; 15", 0));
      ps.println("-----------");
      ps.println("Also: \n-->" + unescapeHTML("Shakespeare said,&quot;To be or not to be that is the question.&quot;", 0));
      ps.println("-----------");
      ps.println("Also: \n-->" + unescapeHTML("Think &amp; think about New York&apos;s &lt;best pizza&gt; as the \"ultimate\" pizza!", 0));
      ps.println("-----------");
      ps.println("Also: \n-->" + unescapeHTML("Apos&apos;trophie &amp; &quot;quote&quot; is &lt;present&gt;", 0));
      ps.println("-----------");
  }
}
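
For comparison while you experiment, here is a minimal sketch of one possible iterative alternative. It is not the class above and not a complete HTML unescaper: the class name HtmlUnescapeSketch, the method name unescape, and the small entity table (including mapping &nbsp; to a plain space) are assumptions made for illustration. It walks the input once, keeps every index relative to the full source string, and copies an unrecognized "&" through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the original StringUtils class. */
final class HtmlUnescapeSketch {

    // Small, assumed entity table; a real unescaper needs many more entries.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&amp;", "&");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&apos;", "'");
        ENTITIES.put("&#39;", "'");
        ENTITIES.put("&nbsp;", " "); // assumption: plain space rather than U+00A0
    }

    private HtmlUnescapeSketch() {}

    /** Replaces known entities in a single left-to-right pass. */
    static String unescape(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int pos = 0;
        while (pos < source.length()) {
            int amp = source.indexOf('&', pos);
            if (amp < 0) {
                break; // no more candidates; the tail is appended after the loop
            }
            out.append(source, pos, amp); // copy the text before '&' untouched
            int semi = source.indexOf(';', amp);
            String value = (semi < 0) ? null : ENTITIES.get(source.substring(amp, semi + 1));
            if (value != null) {
                out.append(value); // known entity: emit its character
                pos = semi + 1;
            } else {
                out.append('&');   // bare or unknown '&': keep it and move on
                pos = amp + 1;
            }
        }
        out.append(source, pos, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("M & M &gt; 5"));
    }
}
```

Because the output is built in a separate StringBuilder, already decoded text is never scanned again, so an input such as "&amp;lt;" is decoded only one level. For anything beyond experimenting, a maintained library is still the safer choice.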

4 thoughts on “How To Unescape HTML in Java”

Shabbir, March 19, 2010, 3:51 pm

Use Apache Commons StringEscapeUtils.unescapeHtml:

http://commons.apache.org/lang/api/org/apache/commons/lang/StringEscapeUtils.html

Shabbir (Nitol), March 19, 2010, 3:52 pm

http://commons.apache.org/lang/api/org/apache/commons/lang/StringEscapeUtils.html

ashikuzzaman, March 19, 2010, 3:58 pm

Thanks, Nitol. Yes, Apache open source libraries are the answer for small algorithms like this. I tried to write it myself just for fun, to play around with some data structure and algorithm problems. For production use, though, I should rely on libraries like this.

Dave, August 13, 2010, 11:11 am

You can use \\$ in the replacement string and it will pass correctly.

Published by ashikuzzaman
