Is there a clean Wikipedia API just for retrieving a content summary?


There is actually a very nice prop called extracts that can be used with queries designed specifically for this purpose. Extracts allow you to get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text in the zeroth section (no additional assets like images or infoboxes). You can also retrieve extracts with finer granularity, such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).
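For illustration, a small helper that builds such a query might look like this (a sketch: the endpoint is the standard English Wikipedia api.php, and the explaintext parameter and the page title are illustrative additions, not part of the answer above):

```php
<?php
// Sketch: build an extracts query. exintro is the parameter described above;
// explaintext (plain-text output) and the example title are assumptions.
function buildExtractsUrl(string $title): string
{
    $params = [
        'action'      => 'query',
        'prop'        => 'extracts',
        'exintro'     => 1,   // zeroth section only
        'explaintext' => 1,   // plain text rather than HTML
        'format'      => 'json',
        'titles'      => $title,
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

echo buildExtractsUrl('Stack Overflow');
```

You could swap exintro for exchars or exsentences in the parameter array to get the finer-grained extracts mentioned above.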

Here is a sample query and the API sandbox to experiment more with this query.

Please note that if you want the first paragraph specifically, you still need to do some additional parsing, as suggested in the chosen answer. The difference here is that the response returned by this query is shorter than some of the other API queries suggested, because you don't have additional assets such as images in the API response to parse.


I just need to retrieve the first paragraph of a Wikipedia page. The content must be HTML-formatted, ready to be displayed on my website (so no BBCode or special Wikipedia markup!)

How to get Wikipedia content using Wikipedia's API?

See this section on the MediaWiki docs.

These are the key parameters.


rvsection=0 specifies that only the lead section should be returned.

See this example.

To get the HTML, you can similarly use action=parse.

Note that you'll have to strip out any templates or infoboxes.
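A sketch of such a revisions query, assuming the usual parameters for fetching page wikitext (rvprop=content is a standard revisions-module parameter; the page title is an example):

```php
<?php
// Sketch: fetch only the lead section's wikitext via the revisions module.
// rvsection=0 is the parameter discussed above; the title is an example.
function leadSectionQueryUrl(string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action'    => 'query',
        'prop'      => 'revisions',
        'rvprop'    => 'content',
        'rvsection' => 0,   // lead section only
        'format'    => 'json',
        'titles'    => $title,
    ]);
}

echo leadSectionQueryUrl('Pizza');
```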

I do it this way:

The response you get is an array with the data, easy to parse:

    "Bees are flying insects closely related to wasps and ants, known for their role in pollination and, in the case of the best-known bee species, the European honey bee, for producing honey and beeswax."

To get just the first paragraph, limit=1 is what you need.
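The query itself did not survive in this answer, but the response shape of an extracts query is standard (query.pages.&lt;pageid&gt;.extract), so parsing the decoded data might look like this (a sketch; the sample response below is abbreviated and invented for illustration):

```php
<?php
// Sketch: pull the extract string out of a decoded extracts response.
function firstExtract(array $response): ?string
{
    foreach ($response['query']['pages'] as $page) {
        if (isset($page['extract'])) {
            return $page['extract'];
        }
    }
    return null;
}

// Abbreviated sample of a decoded JSON response (illustrative data):
$sample = [
    'query' => [
        'pages' => [
            '21776332' => [
                'title'   => 'Bee',
                'extract' => 'Bees are flying insects closely related to wasps and ants...',
            ],
        ],
    ],
];

echo firstExtract($sample);
```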

You probably need to urlencode the parameters that you are passing in the query string; here, at least "Main Page" requires encoding. Without this encoding, I get a 400 error too.

If you try this, it should work better (note the space is replaced by %20):

$str = file_get_contents($url);

With this, I'm getting the content of the page.

A solution is to use urlencode, so you don't have to encode yourself:

$url='' . urlencode('Main Page');
$str = file_get_contents($url);
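For instance (a sketch: the original answer's full URL was not preserved, so the endpoint and the action/format parameters here are illustrative):

```php
<?php
// Sketch: urlencode() the title so spaces and special characters are safe.
// Note that urlencode() encodes a space as '+', which api.php accepts;
// use rawurlencode() if you need '%20' specifically.
function apiUrlFor(string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?action=parse&format=json&page='
        . urlencode($title);
}

echo apiUrlFor('Main Page'); // the space becomes '+'
```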

Parse first paragraph from Wikipedia article?

I'd definitely say you're looking for this.

If you want to retrieve everything in the first section (not just the first paragraph):

// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in json format
// prop=text: return the text content of the article
// section=0: top content of the page

$url = '';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "TestScript"); // required by the server; use YOUR user agent with YOUR contact information, otherwise your IP might get blocked
$c = curl_exec($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // the main text content of the query (parsed HTML)

// pattern matching every paragraph
$pattern = '#<p>(.*?)</p>#s';
if (preg_match_all($pattern, $content, $matches)) {
    // print $matches[0][0]; // content of the first paragraph (including the wrapping <p> tag)
    print strip_tags(implode("\n\n", $matches[1])); // content of all matched paragraphs, without the HTML tags
}
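If you do want only the first paragraph rather than everything in the first section, a small variation on the regex above does it (the sample HTML here is made up for illustration):

```php
<?php
// Sketch: keep only the first <p>...</p> from the parsed HTML.
function firstParagraph(string $html): ?string
{
    if (preg_match('#<p>(.*?)</p>#s', $html, $m)) {
        return strip_tags($m[1]);
    }
    return null;
}

echo firstParagraph('<p>First <b>para</b>.</p><p>Second.</p>'); // "First para."
```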