I have recently taken on a community challenge, and I am trying to extract the value of the rel attribute from the following line:
<td><a title='Visit Personal Stats Page for ijackk' href='personal.php?name=ijackk&clan=ph_chat_ftw' class='rsn' rel='ijackk' style='color: #FFFFFF;'>ijackk</a></td>
The challenge requires me to extract the names of multiple users from a memberlist (a list of people with attributes relating to their accounts). I don't HAVE to use regular expressions, but I feel they would be the best fit. I have seen the classic post on why regular expressions are bad for parsing HTML, but I have also seen posts saying that using them for something like this isn't a bad thing.
The following is what I have done thus far:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {
    public static void main(String[] arguments) {
        new Parser().parse();
    }

    public void parse() {
        try {
            URL url = new URL("http://www.runehead.com/clans/ml.php?clan=ph_chat_ftw");
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            StringBuilder stringBuilder = new StringBuilder();
            while ((line = bufferedReader.readLine()) != null) {
                if (line.contains("Visit")) {
                    stringBuilder.append(line).append("\n");
                    System.out.println(line);
                }
            }
            Matcher matcher = Pattern.compile("\\?rel='([A-Za-z0-9_]*)'").matcher(stringBuilder.toString());
            while (matcher.find()) {
                System.out.println("matched: " + matcher.group(1));
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The program prints lines like the one shown above, one for each member, so the filtering works. The matcher never finds anything, though. Could I get some help, please?
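Your pattern starts with \\?, which matches a literal question mark, so it only matches when a ? sits immediately before rel=. In the markup, rel is preceded by a space, so nothing is found. Drop the \\? and, since the page might quote attributes with either ' or ", allow both:
Pattern.compile("rel=['\"]([A-Za-z0-9_]*)['\"]")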
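If you want to try the regex route without hitting the site, here is a minimal, self-contained sketch that runs against the sample line from the question. The class name RelExtractor and the method extractRel are made up for illustration, and this variant uses a backreference so that the closing quote must match the opening one, which moves the name into group 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original code.
class RelExtractor {
    // Group 1 captures the opening quote (' or "), group 2 the attribute value.
    // \1 requires the closing quote to be the same character as the opening one.
    private static final Pattern REL =
            Pattern.compile("\\brel=(['\"])([A-Za-z0-9_]*)\\1");

    // Returns the value of every rel attribute found in the given text.
    static List<String> extractRel(String html) {
        List<String> names = new ArrayList<>();
        Matcher matcher = REL.matcher(html);
        while (matcher.find()) {
            names.add(matcher.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        // The sample line from the question.
        String line = "<td><a title='Visit Personal Stats Page for ijackk' "
                + "href='personal.php?name=ijackk&clan=ph_chat_ftw' class='rsn' "
                + "rel='ijackk' style='color: #FFFFFF;'>ijackk</a></td>";
        System.out.println(extractRel(line)); // prints [ijackk]
    }
}
```

The character class only allows letters, digits and underscores, so a name containing anything else (a space or a hyphen, say) would be skipped; widen it if the memberlist permits such names.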
That certainly works, but as others have said, you're better off using a proper HTML parser. Here's a jsoup example:
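import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch and parse the page; get() throws IOException.
Document doc = Jsoup.connect(
        "http://www.runehead.com/clans/ml.php?clan=ph_chat_ftw").get();
// Select every <a> element that has a rel attribute.
Elements users = doc.select("a[rel]");
for (Element user : users) {
    System.out.println(user.attr("rel"));
}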
That's much cleaner (and safer (and more flexible (and maintainable))) than your regex approach.