Is there a clean wikipedia API just for retrieve content summary?


Answers

There is actually a very nice prop called extracts that can be used with queries designed specifically for this purpose. Extracts allow you to get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text in the zeroth section (no additional assets like images or infoboxes). You can also retrieve extracts with finer granularity such as by a certain number of characters (exchars) or by a certain number of sentences(exsentences)

Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Please note that if you want the first paragraph specifically you still need to do some additionally parsing as suggested in the chosen answer. The difference here is that the response returned by this query is shorter than some of the other api queries suggested because you don't have additional assets such as images in the api response to parse.

Question

I need just to retrieve first paragraph of a Wikipedia page. Content must be html formated, ready to be displayed on my websites (so NO BBCODE, or WIKIPEDIA special CODE!)




If you are just looking for the text which you can then split up but don't want to use the API take a look at en.wikipedia.org/w/index.php?title=Elephant&action=raw




My approach was as follows (in PHP):

$url = "whatever_you_need"

$html = file_get_contents('https://en.wikipedia.org/w/api.php?action=opensearch&search='.$url);
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');

$utf8html might need further cleaning, but that's basically it.




You can also get content such as the first pagagraph via DBPedia which takes Wikipedia content and creates structured information from it (RDF) and makes this available via an API. The DBPedia API is a SPARQL one (RDF-based) but it outputs JSON and it is pretty easy to wrap.

As an example here's a super simple JS library named WikipediaJS that can extract structured content including a summary first paragraph: http://okfnlabs.org/wikipediajs/

You can read more about it in this blog post: http://okfnlabs.org/blog/2012/09/10/wikipediajs-a-javascript-library-for-accessing-wikipedia-article-information.html

The JS library code can be found here: https://github.com/okfn/wikipediajs/blob/master/wikipedia.js




Yes, there is. For example, if you wanted to get the content of the first section of the article , use a query like this:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Stack%20Overflow&rvprop=content&rvsection=0&rvparse

The parts mean this:

  • format=xml: Return the result formatter as XML. Other options (like JSON) are available. This does not affect the format of the page content itself, only the enclosing data format.

  • action=query&prop=revisions: Get information about the revisions of the page. Since we don't specify which revision, the latest one is used.

  • titles=Stack%20Overflow: Get information about the page . It's possible to get the text of more pages in one go, if you separate their names by |.

  • rvprop=content: Return the content (or text) of the revision.

  • rvsection=0: Return only content from section 0.

  • rvparse: Return the content parsed as HTML.

Keep in mind that this returns the whole first section including things like hatnotes (“For other uses …”), infoboxes or images.

There are several libraries available for various languages that make working with API easier, it may be better for you if you used one of them.




Since 2017 Wikipedia provides a REST API with better caching. In the documentation you can find the following API which perfectly fits your use case. (as it is used by the new Page Previews feature)

https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow returns the following data which can be used to display a summery with a small thumbnail:

{
  "title": "",
  "displaytitle": "",
  "pageid": 21721040,
  "extract": " is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. [...]",
  "extract_html": "<p><b></b> is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. [...]",
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/en/thumb/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png/320px-Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 320,
    "height": 149
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/en/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 462,
    "height": 215
  },
  "lang": "en",
  "dir": "ltr",
  "timestamp": "2018-01-30T09:21:21Z",
  "description": "website hosting questions and answers on a wide range of topics in computer programming"
}

By default, it follows redirects (so that /api/rest_v1/page/summary/ also works), but this can be disabled with ?redirect=false