
Get more general Category from the Category of a Wikipedia page

I'm using the Python wikipedia library to obtain the list of categories of a page. I saw it's a wrapper around the MediaWiki API.

Anyway, I'm wondering how to generalize the categories to macro categories, like these Main topic classifications.

For example, if I search the page Hamburger, there is a category called German-American cuisine, but I would like to get its supercategory, like Food and drink. How can I do that?

import wikipedia
page = wikipedia.page("Hamburger")
print(page.categories)
# how to filter only important categories?
>>>['All articles with specifically marked weasel-worded phrases', 'All articles with unsourced statements', 'American sandwiches', 'Articles with hAudio microformats', 'Articles with short description', 'Articles with specifically marked weasel-worded phrases from May 2015', 'Articles with unsourced statements from May 2017', 'CS1: Julian–Gregorian uncertainty', 'Commons category link is on Wikidata', 'Culture in Hamburg', 'Fast food', 'German-American cuisine', 'German cuisine', 'German sandwiches', 'Hamburgers (food)', 'Hot sandwiches', 'National dishes', 'Short description is different from Wikidata', 'Spoken articles', 'Use mdy dates from October 2020', 'Webarchive template wayback links', 'Wikipedia articles with BNF identifiers', 'Wikipedia articles with GND identifiers', 'Wikipedia articles with LCCN identifiers', 'Wikipedia articles with NARA identifiers', 'Wikipedia indefinitely move-protected pages', 'Wikipedia pages semi-protected against vandalism']

I didn't find an API to traverse the hierarchical tree of Wikipedia categories.

I accept both Python and API-request solutions. Thank you.
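As a side note on filtering out the maintenance entries: the MediaWiki API can hide them by itself, via the clshow=!hidden option of the prop=categories module. A minimal sketch that only builds the request URL (the page title is the one from the question; fetching and parsing is left to requests):

```python
import urllib.parse

# Ask the MediaWiki API for only the *visible* categories of a page.
# clshow=!hidden skips hidden maintenance/tracking categories such as
# "All articles with unsourced statements".
params = {
    "action": "query",
    "titles": "Hamburger",
    "prop": "categories",
    "clshow": "!hidden",   # hide maintenance categories
    "cllimit": "max",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
print(url)
```

Fetching this URL returns JSON whose query → pages → (pageid) → categories list then contains only the "real" content categories.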

EDIT: I have found the categorytree API, which seems to do something similar to what I need.


Anyway, I didn't find a way to pass the options parameter as described in the documentation. I think the options can be those listed in this link, like mode=parents, but I can't figure out how to include this parameter in the HTTP URL, because it must be a JSON object, as the documentation says. I was trying this: https://en.wikipedia.org/w/api.php?action=categorytree&category=Category:Biscuits&format=json. How do I add the options field?
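For what it's worth, a sketch of building such a URL: since options must be a JSON object, it can be serialized with json.dumps and then URL-encoded like any other parameter (that mode=parents is accepted here is an assumption based on the CategoryTree extension's documented options):

```python
import json
import urllib.parse

# The categorytree API expects `options` to be a JSON-encoded string,
# so serialize it first, then URL-encode it with the other parameters.
options = json.dumps({"mode": "parents"})   # assumed option, per CategoryTree docs
params = {
    "action": "categorytree",
    "category": "Category:Biscuits",
    "options": options,
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
print(url)
```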

Paolo Magnani asked Nov 02 '25 18:11


2 Answers

This is a very hard task, since Wikipedia's category graph is a mess (technically speaking :-)). In a tree you would expect to reach the root node in logarithmic time, but this is not a tree: any node can have multiple parents!

Furthermore, I don't think it can be accomplished using categories alone because, as you can see in the example, you are very likely to get unexpected results. Anyway, I tried to reproduce something similar to what you asked.

Explanation of the code below:

  • Start from a source page (the hardcoded one is "Hamburger");
  • Walk back, recursively visiting all the parent categories;
  • Cache every category met along the way, to avoid visiting the same category twice (this also solves the cycle problem);
  • Cut the current branch if you find a target category;
  • Stop when the backlog is empty.

Starting from a given page you are likely to reach more than one target category, so I organized the result as a dictionary that tells you how many times each target category was met.

As you may imagine, the response is not immediate, so this algorithm should be run offline. It can be improved in many ways (see below).

The code

import requests
import time
import wikipedia

def get_categories(title):
    try:
        return set(wikipedia.page(title, auto_suggest=False).categories)
    except requests.exceptions.ConnectionError:
        time.sleep(10)      # back off, then retry on connection errors
        return get_categories(title)

start_page = "Hamburger"
target_categories = {"Academic disciplines", "Business", "Concepts", "Culture", "Economy", "Education", "Energy", "Engineering", "Entertainment", "Entities", "Ethics", "Events", "Food and drink", "Geography", "Government", "Health", "History", "Human nature", "Humanities", "Knowledge", "Language", "Law", "Life", "Mass media", "Mathematics", "Military", "Music", "Nature", "Objects", "Organizations", "People", "Philosophy", "Policy", "Politics", "Religion", "Science and technology", "Society", "Sports", "Universe", "World"}
result_categories = {c:0 for c in target_categories}    # dictionary target category -> number of paths
cached_categories = set()       # monotonically increasing
backlog = get_categories(start_page)
cached_categories.update(backlog)
while backlog :
    print("\nBacklog size: %d" % len(backlog))
    cat = backlog.pop()         # pick a category removing it from backlog
    print("Visiting category: " + cat)
    try:
        for parent in get_categories("Category:" + cat) :
            if parent in target_categories :
                print("Found target category: " + parent)
                result_categories[parent] += 1
            elif parent not in cached_categories :
                backlog.add(parent)
                cached_categories.add(parent)
    except KeyError: pass       # current cat may not have "categories" attribute
result_categories = {k:v for (k,v) in result_categories.items() if v>0} # filter not-found categories
print("\nVisited categories: %d" % len(cached_categories))
print("Result: " + str(result_categories))

Results for your example

In your example, the script would visit 12176 categories (!) and would return the following result:

{'Education': 21, 'Society': 40, 'Knowledge': 17, 'Entities': 4, 'People': 21, 'Health': 25, 'Mass media': 25, 'Philosophy': 17, 'Events': 17, 'Music': 18, 'History': 21, 'Sports': 6, 'Geography': 18, 'Life': 13, 'Government': 36, 'Food and drink': 12, 'Organizations': 16, 'Religion': 23, 'Language': 15, 'Engineering': 7, 'Law': 25, 'World': 13, 'Military': 18, 'Science and technology': 8, 'Politics': 24, 'Business': 15, 'Objects': 3, 'Entertainment': 15, 'Nature': 12, 'Ethics': 12, 'Culture': 29, 'Human nature': 3, 'Energy': 13, 'Concepts': 7, 'Universe': 2, 'Academic disciplines': 23, 'Humanities': 25, 'Policy': 14, 'Economy': 17, 'Mathematics': 10}

As you may notice, the "Food and drink" category has been reached only 12 times while, for instance, "Society" has been reached 40 times. This tells us a lot about how weird Wikipedia's category graph is.

Possible improvements

There are many ways to optimize or approximate this algorithm. The first ones that come to mind:

  • Consider keeping track of the path length, and assume that the target category with the shortest path is the most relevant one.
  • Reduce the execution time:
    • You can reduce the number of steps by stopping the script after the first occurrence of a target category (or at the N-th occurrence).
    • If you run this algorithm starting from multiple articles, you can keep in memory the information that associates each category you meet with the target categories it eventually leads to. For example, after your "Hamburger" run you will know that starting from "Category:Fast food" you will get to "Category:Economy", and this can be precious information. This will be expensive in terms of space, but will eventually help you reduce the execution time.
  • Use as labels only the most frequent target categories. E.g. if your result is {"Food and drinks" : 37, "Economy" : 4}, you may want to keep only "Food and drinks" as the label. To do this you can:
    • take the N most frequent target categories;
    • take the most relevant fraction (e.g. the first half, or third, or fourth);
    • take the categories which occur at least N% of the time w.r.t. the most frequent one;
    • use more sophisticated statistical tests to analyze the statistical significance of the frequencies.
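The first improvement can be sketched with a breadth-first search that records the depth at which each target category is first reached; the shallowest hit is then the candidate label. A toy parent-graph (hypothetical, hardcoded) stands in for get_categories() so the idea is visible without network calls:

```python
from collections import deque

# Hypothetical miniature category graph: node -> list of parent categories.
parents = {
    "Hamburger": ["Fast food", "German-American cuisine"],
    "Fast food": ["Food and drink"],
    "German-American cuisine": ["German cuisine"],
    "German cuisine": ["Culture"],
}
targets = {"Food and drink", "Culture"}

def nearest_targets(start):
    """BFS over parent categories, recording the shortest path to each target."""
    seen, queue, found = {start}, deque([(start, 0)]), {}
    while queue:
        node, depth = queue.popleft()
        for parent in parents.get(node, []):
            if parent in targets:
                found.setdefault(parent, depth + 1)   # first hit = shortest path
            elif parent not in seen:                  # cut branch at targets
                seen.add(parent)
                queue.append((parent, depth + 1))
    return found

print(nearest_targets("Hamburger"))   # {'Food and drink': 2, 'Culture': 3}
```

Here "Food and drink" wins with the shortest path (length 2), matching the intuition that it is the most relevant label for "Hamburger".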
horcrux answered Nov 04 '25 06:11


Something a bit different you can do is get the machine-predicted article topic, with a query like https://ores.wikimedia.org/v3/scores/enwiki/?models=articletopic&revids=1000459607
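A sketch of reading such a response: the nested shape below (wiki → scores → revid → model → score) is my assumption of the ORES response layout, and the sample values are made up, so check a real response before relying on it:

```python
# Hypothetical sample of an ORES articletopic response (assumed shape/values).
sample = {
    "enwiki": {"scores": {"1000459607": {"articletopic": {"score": {
        "prediction": ["Culture.Food and drink"],
        "probability": {"Culture.Food and drink": 0.97},
    }}}}}
}

def topic_predictions(resp, wiki, revid):
    """Extract the predicted topic labels for one revision."""
    score = resp[wiki]["scores"][str(revid)]["articletopic"]["score"]
    return score["prediction"]

print(topic_predictions(sample, "enwiki", 1000459607))   # ['Culture.Food and drink']
```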

Tgr answered Nov 04 '25 08:11


