Get full paper content from PubMed via API and list of IDs

I'm hoping to query the PubMed API based on a list of paper IDs and return the title, abstract and content.

So far I have been able to get the first two (title and abstract) with the following:

from metapub import PubMedFetcher

pmids = [2020202, 1745076, 2768771, 8277124, 4031339]
fetch = PubMedFetcher()

title_list = []
abstract_list = []
for pmid in pmids:
    article = fetch.article_by_pmid(pmid)
    abstract = article.abstract  # str
    abstract_list.append(abstract)
    title = article.title  # str
    title_list.append(title)

Or I can get the full paper content, but then the query is based on keywords rather than IDs:

# Assumed import: the PubMed client used here appears to be the one from pymed
from pymed import PubMed

email = '[email protected]'
pubmed = PubMed(tool="PubMedSearcher", email=email)

## PUT YOUR SEARCH TERM HERE ##
search_term = "test"
results = pubmed.query(search_term, max_results=300)
articleList = []
articleInfo = []

for article in results:
    # Each result is either a PubMedBookArticle or a PubMedArticle;
    # convert it to a dictionary with the built-in toDict() method.
    articleDict = article.toDict()
    articleList.append(articleDict)

# Build a list of dict records holding the article details fetched from the PubMed API
for article in articleList:
    # Sometimes article['pubmed_id'] contains several newline-separated IDs;
    # the first one is the article's own PubMed ID.
    pubmedId = article['pubmed_id'].partition('\n')[0]
    # Append article info to the list; .get() returns None for any field
    # that a given record type does not provide.
    articleInfo.append({'pubmed_id': pubmedId,
                        'title': article.get('title'),
                        'keywords': article.get('keywords'),
                        'journal': article.get('journal'),
                        'abstract': article.get('abstract'),
                        'conclusions': article.get('conclusions'),
                        'methods': article.get('methods'),
                        'results': article.get('results'),
                        'copyrights': article.get('copyrights'),
                        'doi': article.get('doi'),
                        'publication_date': article.get('publication_date'),
                        'authors': article.get('authors')})
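
To make the ID clean-up step concrete, here is a small standalone sketch of what the `partition('\n')[0]` call does. The helper name `first_pubmed_id` and the sample strings are made up for illustration and are not part of any library:

```python
def first_pubmed_id(raw_id: str) -> str:
    """Return the first ID from a newline-separated pubmed_id string."""
    # partition() splits at the first newline only; element 0 is the text before it.
    return raw_id.partition("\n")[0].strip()


# Hypothetical inputs: one multi-ID string and one plain ID.
print(first_pubmed_id("2020202\n1745076\n2768771"))  # 2020202
print(first_pubmed_id("8277124"))                    # 8277124
```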

msa asked Sep 05 '25 07:09

1 Answer

Metapub contains that functionality built in; it's called "FindIt":

from metapub import FindIt
import requests

pmids = [2020202, 1745076, 2768771, 8277124, 4031339]

for pmid in pmids:
    src = FindIt(pmid)

    # src.pma contains the PubMedArticle
    print(src.pma.title)
    print(src.pma.abstract)

    # URL, if available, will be fulltext PDF
    if src.url:
        # insert your downloader of choice here, e.g.:
        response = requests.get(src.url)
    else:
        # if no URL, reason is one of "PAYWALL", "TXERROR", or "NOFORMAT"
        print(src.reason)

    # use a PDF reader to extract the fulltext from here.

Now -- if you're lucky, the paper fulltext might be in plaintext on PubMedCentral. However, not all papers can be found in PMC, and only a certain subset of papers are available programmatically (the "Open Access Subset") due to inane publisher nonsense, so at this point Your Mileage May Vary.

Meaning: some articles can be downloaded as plain text (as XML) in an above-board manner (read: according to the NIH's wishes), but far from all of them.

This is why metapub's "FindIt" simply tries to find a PDF. PDF is, unfortunately, the only unifying standard across the myriad different ways you'll find papers published on the internet.
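
For the "downloader of choice" step, a minimal sketch using only the standard library might look like the following. The helper names (`looks_like_pdf`, `download_pdf`) and the User-Agent string are hypothetical, and this is just one way to do it. The magic-byte check is there because a publisher URL can hand back an HTML landing page instead of an actual PDF.

```python
import urllib.request
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header bytes."""
    return data.startswith(PDF_MAGIC)


def download_pdf(url: str, dest: Path, timeout: float = 30.0) -> bool:
    """Fetch url and write it to dest only if the body looks like a PDF."""
    # Hypothetical User-Agent; some servers reject requests without one.
    request = urllib.request.Request(url, headers={"User-Agent": "fulltext-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    if not looks_like_pdf(data):
        return False  # probably an HTML page or an error document
    dest.write_bytes(data)
    return True
```

Inside the loop above you would call something like `download_pdf(src.url, Path(f"{pmid}.pdf"))` and treat a `False` return the same way as a missing URL.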

If you're happy enough with the open-access subset, though, the canonical way to use it is to follow the instructions on the Open Access Subset page.
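
As a rough illustration of the XML route (not a replacement for the official instructions), the sketch below builds an E-utilities EFetch URL for a PMC article and pulls paragraph text out of JATS-style XML. The function names are hypothetical, the sample document is made up, and it assumes the article's XML actually includes a `<body>`, which is not guaranteed for every PMC record.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pmc_efetch_url(pmcid: str) -> str:
    """Build an EFetch URL for one PMC article (numeric part of the PMCID)."""
    numeric_id = pmcid.upper().removeprefix("PMC")  # requires Python 3.9+
    return f"{EFETCH}?{urlencode({'db': 'pmc', 'id': numeric_id})}"


def body_paragraphs(xml_text: str) -> list[str]:
    """Return the text of every <p> found under <body> in a JATS document."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iterfind(".//body//p"):
        # itertext() flattens inline markup such as <italic>; split/join
        # collapses the line breaks and indentation left over from the XML.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Made-up miniature document standing in for a real EFetch response.
sample = """<article><front><article-meta><title-group>
<article-title>T</article-title></title-group></article-meta></front>
<body><sec><title>Intro</title><p>First <italic>point</italic>.</p><p>Second
   point.</p></sec></body></article>"""

print(pmc_efetch_url("PMC212403"))
print(body_paragraphs(sample))
```

Fetching the URL and passing the response text to `body_paragraphs` would give you the body as a list of strings; articles outside the open-access subset may come back without usable body text.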

Statement of affiliation: I am the main contributor and maintainer of metapub.

nthmost answered Sep 07 '25 19:09