
Search Wikipedia, get the first paragraph of the first entry found in all available languages using C#?

Suppose I have a list of sights in one language and want to enrich it with some data from Wikipedia.

For example, the city is Munich and it has the following attractions:

  • Frauenkirche
  • Marienplatz
  • Karlsplatz

I need to perform the following:

  1. Send a query to Wikipedia in the given language (in this case German, since the German wiki is more likely to have a corresponding article).
  2. Once the article is found, get its page title and first 2-3 paragraphs.
  3. Strip any wiki markup and keep only the text.
  4. It would be nice to have the text of this article, along with the title, in the original ("de") and in some other languages.

I tried LinqToWiki from the NuGet repository, but I can't get this scenario to work. Here is my code, which just times out:

var enwiki = new Wiki("LinqToWiki.Samples", "en.wikipedia.org", "/w/api.php");
var result = enwiki.Query.allpages()
              .Pages
              .Select(
                  page =>
                  new
                  {
                      Title = page.info.title,
                      Text = page.revisions()
                                 .Where(r => r.section == "0")
                                 .Select(r => r.value)
                  }
              );
Alexander Galkin asked Nov 19 '25 21:11


1 Answer

If you know the titles of the articles in question, you can do something like:

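// Same Wiki setup as in the question.
var wiki = new Wiki("LinqToWiki.Samples", "en.wikipedia.org", "/w/api.php");

// Query only the known titles instead of enumerating all pages.
var titles = wiki.CreateTitlesSource(
    "Munich Frauenkirche", "Marienplatz", "Karlsplatz (Stachus)");
var pages =
    titles.Select(
        page => new
        {
            Title = page.info.title,
            // Section 0 is the lead section; r.parse requests parsed HTML.
            Text = page.revisions()
                       .Where(r => r.section == "0" && r.parse)
                       .Select(r => r.value)
                       .FirstOrDefault(),
            LangLinks = page.langlinks().ToEnumerable()
        }).ToEnumerable();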

LangLinks will contain the titles of the article in other languages.

Text will contain HTML of the first section. If you think wikitext would be better, you could get that instead by removing && r.parse.
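If the HTML is kept, it still has to be reduced to text. Below is a rough sketch of one way to do that with regular expressions; the HtmlText class and ToPlainText method are hypothetical names, not part of LinqToWiki. Regex-based stripping is crude (it keeps text from infoboxes and reference markers, for example), so a real HTML parser may be the better choice for anything beyond a quick experiment.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of LinqToWiki.
public static class HtmlText
{
    public static string ToPlainText(string html)
    {
        // Drop <style> and <script> elements together with their contents.
        string s = Regex.Replace(html, @"<(style|script)\b[^>]*>.*?</\1>", "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Turn block-level tags into a space so neighbouring blocks do not fuse.
        s = Regex.Replace(s, @"</?(p|div|br|li|h[1-6]|tr|td|th)\b[^>]*>", " ",
            RegexOptions.IgnoreCase);

        // Remove every remaining (inline) tag without leaving a gap.
        s = Regex.Replace(s, @"<[^>]+>", "");

        // Decode entities such as &amp;.
        s = WebUtility.HtmlDecode(s);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(s, @"\s+", " ").Trim();
    }
}
```

With the query above, this would be applied to each result, for example HtmlText.ToPlainText(page.Text).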

There is also the extracts module, which seems to support returning actual plain text, but LinqToWiki does not currently support that module.
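Since LinqToWiki does not cover that module, one option is to call the MediaWiki API directly. The following is a minimal sketch, assuming .NET with System.Text.Json available; the WikiExtracts class and its BuildUrl, ParseExtracts and FetchAsync methods are hypothetical names chosen for illustration. The query parameters (prop=extracts, exintro, explaintext, exlimit) belong to the TextExtracts extension, and formatversion=2 is requested so that query.pages comes back as an array.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper, not part of LinqToWiki.
public static class WikiExtracts
{
    // Builds a query URL for the extracts module on the given language wiki.
    // exintro = lead section only, explaintext = plain text instead of HTML.
    public static string BuildUrl(string lang, IEnumerable<string> titles)
    {
        string joined = string.Join("|", titles);
        return $"https://{lang}.wikipedia.org/w/api.php"
             + "?action=query&format=json&formatversion=2&redirects=1"
             + "&prop=extracts&exintro=1&explaintext=1&exlimit=max"
             + "&titles=" + Uri.EscapeDataString(joined);
    }

    // Maps page title -> plain-text extract; missing pages are skipped.
    public static Dictionary<string, string> ParseExtracts(string json)
    {
        var result = new Dictionary<string, string>();
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement pages = doc.RootElement.GetProperty("query").GetProperty("pages");
        foreach (JsonElement page in pages.EnumerateArray())
        {
            if (page.TryGetProperty("extract", out JsonElement extract))
            {
                string title = page.GetProperty("title").GetString() ?? "";
                result[title] = extract.GetString() ?? "";
            }
        }
        return result;
    }

    // The caller should set a descriptive User-Agent header on the client,
    // because Wikimedia may reject requests that do not send one.
    public static async Task<Dictionary<string, string>> FetchAsync(
        HttpClient client, string lang, params string[] titles)
    {
        string json = await client.GetStringAsync(BuildUrl(lang, titles));
        return ParseExtracts(json);
    }
}
```

A call such as await WikiExtracts.FetchAsync(client, "de", "Marienplatz") would then return the German lead text. Adding langlinks to the prop parameter in the same request is one possible way to fetch the titles in other languages as well, at the cost of extending the parsing.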

svick answered Nov 21 '25 10:11