online - Scrape and generate RSS feed






I know it's not exactly what you're asking, but do you know about [Yahoo! Pipes](http://pipes.yahoo.com/pipes/)?

I use Simple HTML DOM to scrape a page for the latest news, and then generate an RSS feed using this PHP class.

This is what I have now:

<?php

 // This is a minimum example of using the class
 include("FeedWriter.php");
 include('simple_html_dom.php');

 $html = file_get_html('http://www.website.com');

 // Collect the scraped articles before building the feed
 $articles = array();
 foreach($html->find('td[width="380"] p table') as $article) {
     $item = array();
     $item['title'] = $article->find('span.title', 0)->innertext;
     $item['description'] = $article->find('.ingress', 0)->innertext;
     $item['link'] = $article->find('.lesMer', 0)->href;
     $item['pubDate'] = $article->find('span.presseDato', 0)->plaintext;
     $articles[] = $item;
 }


//Creating an instance of FeedWriter class. 
$TestFeed = new FeedWriter(RSS2);


 //Use wrapper functions for common channel elements

 $TestFeed->setTitle('Testing & Checking the RSS writer class');
 $TestFeed->setLink('http://www.ajaxray.com/projects/rss');
 $TestFeed->setDescription('This is test of creating a RSS 2.0 feed Universal Feed Writer');

  //Image title and link must match with the 'title' and 'link' channel elements for valid RSS 2.0

  $TestFeed->setImage('Testing the RSS writer class','http://www.ajaxray.com/projects/rss','http://www.rightbrainsolution.com/images/logo.gif');


foreach($articles as $row) {

    //Create an empty FeedItem
    $newItem = $TestFeed->createNewItem();

    //Add elements to the feed item    
    $newItem->setTitle($row['title']);
    $newItem->setLink($row['link']);
    $newItem->setDate($row['pubDate']);
    $newItem->setDescription($row['description']);

    //Now add the feed item
    $TestFeed->addItem($newItem);
}

  //OK. Everything is done. Now generate the feed.
  $TestFeed->genarateFeed();

?>

How can I make this code simpler? Right now there are two foreach statements. How can I combine them?

Because the scraped news is in Norwegian, I need to apply html_entity_decode() to the title. I've tried it here, but I couldn't get it to work:

foreach($html->find('td[width="380"] p table') as $article) {
$item['title'] = html_entity_decode($article->find('span.title', 0)->innertext, ENT_NOQUOTES, 'UTF-8');
$item['description'] = "<img src='" . $article->find('img[width="100"]', 0)->src . "'><p>" . $article->find('.ingress', 0)->innertext . "</p>";    
$item['link'] = $article->find('.lesMer', 0)->href;     
$item['pubDate'] = unix2rssdate(strtotime($article->find('span.presseDato', 0)->plaintext));
$articles[] = $item;
} 

Thanks :)


It seems that you loop through the $html to build an array of articles, then loop through that array to add each one to a feed. You can skip a whole loop by adding items to the feed as they're found. To do this, you'll need to move your FeedWriter constructor up a bit in the execution flow.

I'd also add a couple of functions to help with readability, which may help maintainability in the long run. Encapsulating your feed creation, item modification, etc. should make it easier if you ever need to plug in a different provider class for the feed, change the parsing rules, and so on. Further improvements can be made to the code below (for example, html_entity_decode is on a separate line from the $item['title'] assignment), but you get the general idea.

What is the issue you're having with html_entity_decode? Do you have a sample input/output?

<?php

 // This is a minimum example of using the class
 include("FeedWriter.php");
 include('simple_html_dom.php');

 // Create new instance of a feed
 $TestFeed = create_new_feed();

 $html = file_get_html('http://www.website.com');

 // Loop through html pulling feed items out
 foreach($html->find('td[width="380"] p table') as $article) 
 {
    // Get a parsed item
    $item = get_item_from_article($article);

    // Get the item formatted for feed
    $formatted_item = create_feed_item($TestFeed, $item);

    //Now add the feed item
    $TestFeed->addItem($formatted_item);
 }

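 //OK. Everything is done. Now generate the feed.
 //The question's code calls this method genarateFeed(); use whichever spelling your FeedWriter.php defines.
 $TestFeed->genarateFeed();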


// HELPER FUNCTIONS

/**
 * Create new feed - encapsulated in method here to allow
 * for change in feed class etc
 */
function create_new_feed()
{
     //Creating an instance of FeedWriter class. 
     $TestFeed = new FeedWriter(RSS2);

     //Use wrapper functions for common channel elements
     $TestFeed->setTitle('Testing & Checking the RSS writer class');
     $TestFeed->setLink('http://www.ajaxray.com/projects/rss');
     $TestFeed->setDescription('This is test of creating a RSS 2.0 feed Universal Feed Writer');

     //Image title and link must match with the 'title' and 'link' channel elements for valid RSS 2.0
     $TestFeed->setImage('Testing the RSS writer class','http://www.ajaxray.com/projects/rss','http://www.rightbrainsolution.com/images/logo.gif');

     return $TestFeed;
}


/**
 * Take in html article segment, and convert to usable $item
 */
function get_item_from_article($article)
{
    $item['title'] = $article->find('span.title', 0)->innertext;
    $item['title'] = html_entity_decode($item['title'], ENT_NOQUOTES, 'UTF-8');

    $item['description'] = $article->find('.ingress', 0)->innertext;
    $item['link'] = $article->find('.lesMer', 0)->href;     
    $item['pubDate'] = $article->find('span.presseDato', 0)->plaintext;     

    return $item;
}


/**
 * Given an $item with feed data, create a
 * feed item
 */
function create_feed_item($TestFeed, $item)
{
    //Create an empty FeedItem
    $newItem = $TestFeed->createNewItem();

    //Add elements to the feed item    
    $newItem->setTitle($item['title']);
    $newItem->setLink($item['link']);
    $newItem->setDate($item['pubDate']);
    $newItem->setDescription($item['description']);

    return $newItem;
}
?>
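
On the html_entity_decode question, it may help to test the decoding on its own, away from the scraper. The sketch below is only that: decode_title() and to_rss_date() are hypothetical helper names, and the dd.mm.yyyy date format is an assumption about what span.presseDato contains, so adjust it to the real page. It uses the built-in DATE_RSS constant in place of the unix2rssdate() call from the question.

```php
<?php
// Hypothetical helpers for illustration; not part of FeedWriter or Simple HTML DOM.

/**
 * Decode HTML entities in a scraped title to UTF-8 text.
 * ENT_NOQUOTES leaves &quot; and &#039; encoded, as in the question's call.
 */
function decode_title($raw)
{
    return trim(html_entity_decode($raw, ENT_NOQUOTES, 'UTF-8'));
}

/**
 * Convert a scraped date string to the RFC 822 style used by <pubDate>.
 * ASSUMPTION: the page prints dates as dd.mm.yyyy; change $format otherwise.
 * The leading "!" resets the time fields to midnight instead of "now".
 */
function to_rss_date($raw, $format = '!d.m.Y')
{
    $date = DateTime::createFromFormat($format, trim($raw), new DateTimeZone('UTC'));
    return $date === false ? null : $date->format(DATE_RSS);
}

echo decode_title('Bl&aring;b&aelig;r &amp; fl&oslash;te'), "\n"; // Blåbær & fløte
echo to_rss_date('12.03.2010'), "\n"; // Fri, 12 Mar 2010 00:00:00 +0000
```

If the decoded title comes out right here but still looks wrong in a feed reader, the mismatch may be between this UTF-8 output and the encoding the feed declares, so that is worth checking next.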

If the response to a HEAD request has no Last-Modified line, you could also check it for the presence and value of ETag and Content-Length lines. If these differ from the values you stored on the previous check, the content has likely changed. You could add any other response header lines that would indicate a change.
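
As a rough sketch of that header comparison: the function names below are hypothetical, and treating a difference in either ETag or Content-Length as a change is one possible reading of the rule, not the only one. fetch_head_headers() relies on the stream-context argument of get_headers(), which needs PHP 7.1 or later.

```php
<?php
// Hypothetical helpers for illustration.

/** Lower-case header names and collapse repeated headers to the last value. */
function normalise_headers(array $headers)
{
    $out = array();
    foreach ($headers as $name => $value) {
        if (is_string($name)) { // skips the numeric status-line entries
            $out[strtolower($name)] = is_array($value) ? end($value) : $value;
        }
    }
    return $out;
}

/** Fetch the response headers with a HEAD request. */
function fetch_head_headers($url)
{
    $context = stream_context_create(array('http' => array('method' => 'HEAD')));
    $headers = get_headers($url, true, $context);
    return $headers === false ? array() : normalise_headers($headers);
}

/**
 * Compare stored headers with fresh ones.
 * Uses Last-Modified when both sides have it; otherwise reports a change
 * if ETag or Content-Length differs (including appearing or disappearing).
 */
function has_probably_changed(array $previous, array $current)
{
    $previous = normalise_headers($previous);
    $current = normalise_headers($current);

    if (isset($previous['last-modified'], $current['last-modified'])) {
        return $previous['last-modified'] !== $current['last-modified'];
    }
    foreach (array('etag', 'content-length') as $name) {
        $before = isset($previous[$name]) ? $previous[$name] : null;
        $after = isset($current[$name]) ? $current[$name] : null;
        if ($before !== $after) {
            return true;
        }
    }
    return false;
}
```

In use, you would store the array returned by fetch_head_headers() after each check and pass it as $previous next time. When none of the three headers is present on either side, the function reports no change, so a fallback that compares the content itself is still needed for such sites.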


You could have a cron job that checks whether the site has been updated, either by checking the Last-Modified headers (if available) or by checking the content you are interested in.

If the cron job detects a change in content when it checks the site, it could append a message to a queue (something like Zend_Queue, http://framework.zend.com/manual/en/zend.queue.example.html). A worker could then work through the messages until a time or data limit has been reached, or until the queue is empty.
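
To make the worker idea concrete, here is a minimal sketch. It uses PHP's built-in SplQueue as an in-memory stand-in for a real queue such as Zend_Queue, and drain_queue() with its message and time limits is a hypothetical name, so treat it as an outline of the loop and not as Zend_Queue's API.

```php
<?php
// SplQueue stands in for a persistent message queue in this sketch.

/**
 * Work through queued messages until the queue is empty, the message
 * limit is reached, or the time limit has passed.
 * Returns the number of messages handled.
 */
function drain_queue(SplQueue $queue, $handler, $maxMessages, $maxSeconds)
{
    $started = microtime(true);
    $processed = 0;

    while (!$queue->isEmpty()
        && $processed < $maxMessages
        && (microtime(true) - $started) < $maxSeconds) {
        $message = $queue->dequeue(); // oldest message first
        call_user_func($handler, $message);
        $processed++;
    }

    return $processed;
}

// Producer side: what the cron job would do after detecting a change.
$queue = new SplQueue();
$queue->enqueue('http://www.website.com');

// Worker side: handle at most 10 messages or run for at most 5 seconds.
$handled = drain_queue($queue, function ($url) {
    echo "Content changed, rebuild feed for $url\n";
}, 10, 5.0);
echo "Handled $handled message(s)\n";
```

With a real queue the producer and the worker would be separate scripts, and the messages would have to live somewhere that survives between runs.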