I have a long string containing a mix of words and characters.
<h4> <a href="/forum?id=SyBPtQfAZ"> Improving Discriminator-Generator Balance in Generative Adversarial Networks </a> <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a> </h4>
I need to extract only the title:
Improving Discriminator-Generator Balance in Generative Adversarial Networks
I know R has the ability to extract words between 2 characters, such as:
sub(">.*<", "", my_string)
But this obviously won't work here, as the string contains many such characters.
You should probably be using an HTML parser here. That said, the following one-liner with gsub might work:
gsub(".*?<a href=[^>]*>\\s*(.*?)\\s*</a>.*", "\\1", input)
I say "might" because this makes several assumptions, including that the title anchor is the first anchor tag in the string and that it contains no nested tags. In practice, an HTML/XML parser gives you greater control.
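For the parser route, here is a minimal sketch using the xml2 package. The choice of xml2 is my assumption (the package has to be installed separately, and other parsers such as rvest would also do), and the variable name `input` and the XPath expression are only illustrative. It assumes the title is the first anchor inside the `<h4>` element, as in the sample string.

```r
# Sketch: extract the title with an HTML parser instead of a regex.
# Assumes the xml2 package is available: install.packages("xml2")
library(xml2)

input <- '<h4> <a href="/forum?id=SyBPtQfAZ"> Improving Discriminator-Generator Balance in Generative Adversarial Networks </a> <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a> </h4>'

# read_html() accepts a literal HTML string and wraps it in a full document.
doc <- read_html(input)

# Assumption: the title is the first <a> element inside the <h4>.
title_node <- xml_find_first(doc, "//h4/a[1]")

# trim = TRUE strips the leading and trailing whitespace around the text.
title <- xml_text(title_node, trim = TRUE)
print(title)
```

Unlike the regex, this does not depend on the exact attribute order or spacing inside the tags. If the PDF link could ever come first, an XPath such as `//h4/a[not(@class='pdf-link')]` may be a safer selector.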