
Extracting a value from an HTML tag attribute in Java

Tags: java, regex

I have recently taken on a community challenge and I am trying to extract the value of the 'rel' attribute in the following line:

<td><a title='Visit Personal Stats Page for ijackk' href='personal.php?name=ijackk&amp;clan=ph_chat_ftw' class='rsn' rel='ijackk' style='color: #FFFFFF;'>ijackk</a></td>

The reason for this is that the challenge requires me to extract the names of multiple users from a member list (a list of people with attributes relating to their accounts). I don't HAVE to use regular expressions, but I feel that they would be the best fit. I have seen the classic post on why regular expressions are bad for parsing HTML, but I have also seen posts saying that using them for something like this isn't a bad thing.

The following is what I have done thus far:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {

    public static void main(String[] arguments) {
        new Parser().parse();
    }

    public void parse() {
        try {
            URL url = new URL("http://www.runehead.com/clans/ml.php?clan=ph_chat_ftw");
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            StringBuilder stringBuilder = new StringBuilder();
            // Keep only the lines that contain a member link.
            while ((line = bufferedReader.readLine()) != null) {
                if (line.contains("Visit")) {
                    stringBuilder.append(line).append("\n");
                    System.out.println(line);
                }
            }
            Matcher matcher = Pattern.compile("\\?rel='([A-Za-z0-9_]*)'").matcher(stringBuilder.toString());
            while (matcher.find()) {
                System.out.println("matched: " + matcher.group(1));
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

The program prints a line like the one shown above for every name on the list. The matcher does not find anything, though. Could I get some help please?


1 Answer

Your pattern begins with \\?, so it only matches when a literal ? comes immediately before rel. In your HTML, rel follows a space, so nothing matches. Drop the \\? and keep the single quotes that the markup uses:

Pattern.compile("rel='([A-Za-z0-9_]*)'")
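
A minimal sketch for checking this against the sample line from the question, without hitting the network. The class name RelExtractor and its extract helper are hypothetical names made up for illustration. The second pattern also accepts double quotes, which is an assumption about what the page might emit rather than something the question shows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelExtractor {

    // Pattern from the question: demands a literal '?' directly before "rel".
    static final Pattern ORIGINAL = Pattern.compile("\\?rel='([A-Za-z0-9_]*)'");

    // Assumed fix: no leading '?', and either quote style around the value.
    static final Pattern FIXED = Pattern.compile("\\brel=['\"]([A-Za-z0-9_]*)['\"]");

    // Collects capture group 1 of every match of the pattern in the input.
    static List<String> extract(Pattern pattern, String html) {
        List<String> values = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // The sample line from the question, verbatim.
        String sample = "<td><a title='Visit Personal Stats Page for ijackk' "
                + "href='personal.php?name=ijackk&amp;clan=ph_chat_ftw' "
                + "class='rsn' rel='ijackk' style='color: #FFFFFF;'>ijackk</a></td>";

        System.out.println("original: " + extract(ORIGINAL, sample)); // original: []
        System.out.println("fixed: " + extract(FIXED, sample));       // fixed: [ijackk]
    }
}
```

The only ? in the sample line sits inside the href value, followed by name=, which is why the original pattern never finds a match.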

That certainly works, but as others have said, you're better off using a proper HTML parser. Here's a jsoup example:

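import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch the page (get() throws IOException), then select every <a> with a rel attribute.
Document doc = Jsoup.connect(
    "http://www.runehead.com/clans/ml.php?clan=ph_chat_ftw").get();
Elements users = doc.select("a[rel]");
for (Element user : users) {
    System.out.println(user.attr("rel"));
}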

That's much cleaner (and safer (and more flexible (and maintainable))) than your regex approach.

answered Nov 19 '25 21:11 by Wayne


