I'm using the Python wikipedia library to obtain the list of categories of a page. I saw it's a wrapper around the MediaWiki API.
Anyway, I'm wondering how to generalize the categories to macro categories, like these Main topic classifications.
For example, if I search for the page Hamburger, there is a category called German-American cuisine, but I would like to get its super category, like Food and drink. How can I do that?
import wikipedia
page = wikipedia.page("Hamburger")
print(page.categories)
# how to filter only important categories?
>>>['All articles with specifically marked weasel-worded phrases', 'All articles with unsourced statements', 'American sandwiches', 'Articles with hAudio microformats', 'Articles with short description', 'Articles with specifically marked weasel-worded phrases from May 2015', 'Articles with unsourced statements from May 2017', 'CS1: Julian–Gregorian uncertainty', 'Commons category link is on Wikidata', 'Culture in Hamburg', 'Fast food', 'German-American cuisine', 'German cuisine', 'German sandwiches', 'Hamburgers (food)', 'Hot sandwiches', 'National dishes', 'Short description is different from Wikidata', 'Spoken articles', 'Use mdy dates from October 2020', 'Webarchive template wayback links', 'Wikipedia articles with BNF identifiers', 'Wikipedia articles with GND identifiers', 'Wikipedia articles with LCCN identifiers', 'Wikipedia articles with NARA identifiers', 'Wikipedia indefinitely move-protected pages', 'Wikipedia pages semi-protected against vandalism']
I didn't find an API to traverse the hierarchical tree of Wikipedia categories.
I accept both Python and API request solutions. Thank you.
EDIT: I have found the API categorytree, which seems to do something similar to what I need.

Anyway, I didn't find a way to pass the options parameter as described in the documentation. I think the options can be those listed at this link, like mode=parents, but I can't figure out how to insert this parameter into the HTTP URL, since it must be a JSON object, as stated in the documentation. I was trying this: https://en.wikipedia.org/w/api.php?action=categorytree&category=Category:Biscuits&format=json. How do I insert the options field?
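My best guess (not confirmed) is that the JSON object simply has to be URL-encoded into the query string, something like:
https://en.wikipedia.org/w/api.php?action=categorytree&category=Category:Biscuits&options=%7B%22mode%22%3A%22parents%22%7D&format=json
where options=%7B%22mode%22%3A%22parents%22%7D is the URL-encoded form of options={"mode":"parents"}, but I'm not sure this is the intended way.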
This is a very hard task, since Wikipedia's category graph is a mess (technically speaking :-)). Indeed, in a tree you would expect to get to the root node in logarithmic time. But this is not a tree, since any node can have multiple parents!
Furthermore, I think it can't be accomplished using categories alone because, as you can see in the example, you are very likely to get unexpected results. Anyway, I tried to produce something similar to what you asked for.
Explanation of the code below:
Starting from a given page, you are likely to reach more than one target category, so I organized the result as a dictionary that tells you how many times each target category was reached.
As you may imagine, the response is not immediate, so this algorithm should be run offline. It can be improved in many ways (see below).
import requests
import time
import wikipedia

def get_categories(title):
    # Fetch the set of categories of a page (or category page),
    # retrying after a pause if the connection drops.
    try:
        return set(wikipedia.page(title, auto_suggest=False).categories)
    except requests.exceptions.ConnectionError:
        time.sleep(10)
        return get_categories(title)

start_page = "Hamburger"
target_categories = {"Academic disciplines", "Business", "Concepts", "Culture", "Economy", "Education", "Energy", "Engineering", "Entertainment", "Entities", "Ethics", "Events", "Food and drink", "Geography", "Government", "Health", "History", "Human nature", "Humanities", "Knowledge", "Language", "Law", "Life", "Mass media", "Mathematics", "Military", "Music", "Nature", "Objects", "Organizations", "People", "Philosophy", "Policy", "Politics", "Religion", "Science and technology", "Society", "Sports", "Universe", "World"}

result_categories = {c: 0 for c in target_categories}  # dictionary target category -> number of paths
cached_categories = set()  # already-visited categories, monotonically increasing

backlog = get_categories(start_page)
cached_categories.update(backlog)

while len(backlog) != 0:
    print("\nBacklog size: %d" % len(backlog))
    cat = backlog.pop()  # pick a category, removing it from the backlog
    print("Visiting category: " + cat)
    try:
        for parent in get_categories("Category:" + cat):
            if parent in target_categories:
                print("Found target category: " + parent)
                result_categories[parent] += 1
            elif parent not in cached_categories:
                backlog.add(parent)
                cached_categories.add(parent)
    except KeyError:
        pass  # the current category page may not have a "categories" attribute

result_categories = {k: v for (k, v) in result_categories.items() if v > 0}  # filter out never-reached categories

print("\nVisited categories: %d" % len(cached_categories))
print("Result: " + str(result_categories))
In your example, the script would visit 12176 categories (!) and would return the following result:
{'Education': 21, 'Society': 40, 'Knowledge': 17, 'Entities': 4, 'People': 21, 'Health': 25, 'Mass media': 25, 'Philosophy': 17, 'Events': 17, 'Music': 18, 'History': 21, 'Sports': 6, 'Geography': 18, 'Life': 13, 'Government': 36, 'Food and drink': 12, 'Organizations': 16, 'Religion': 23, 'Language': 15, 'Engineering': 7, 'Law': 25, 'World': 13, 'Military': 18, 'Science and technology': 8, 'Politics': 24, 'Business': 15, 'Objects': 3, 'Entertainment': 15, 'Nature': 12, 'Ethics': 12, 'Culture': 29, 'Human nature': 3, 'Energy': 13, 'Concepts': 7, 'Universe': 2, 'Academic disciplines': 23, 'Humanities': 25, 'Policy': 14, 'Economy': 17, 'Mathematics': 10}
As you may notice, the "Food and drink" category has been reached only 12 times, while, for instance, "Society" has been reached 40 times. This tells us a lot about how weird Wikipedia's category graph is.
There are many possible improvements for optimizing or approximating this algorithm. The first that comes to my mind: if you get a result like {"Food and drink": 37, "Economy": 4}, you may want to keep only "Food and drink" as the label. For doing this you can, for instance, take the target category that was reached through the most paths (see the sketch below).
Something a bit different you can do is get the machine-predicted article topic, with a query like https://ores.wikimedia.org/v3/scores/enwiki/?models=articletopic&revids=1000459607
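For example, with the requests library (this assumes the usual shape of an ORES v3 response; 1000459607 is the revision ID from the URL above):

import requests

resp = requests.get(
    "https://ores.wikimedia.org/v3/scores/enwiki/",
    params={"models": "articletopic", "revids": "1000459607"},
)
# Assumed response shape: {"enwiki": {"scores": {"<revid>": {"articletopic": {"score": {...}}}}}}
score = resp.json()["enwiki"]["scores"]["1000459607"]["articletopic"]["score"]
print(score["prediction"])  # e.g. a list of topic labels such as ['Culture.Food and drink']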