
Page Render Errors and How to Fix Them

If you have ever looked at your Error Log, you have probably seen the message Error Rendering Page.  This post explains why the error happens and what you can do about it.

To filter and categorize articles, MyCurator needs to read the whole article.  Most RSS feeds just provide a title and a summary.  To get around the limits of RSS feeds, MyCurator uses a service called diffbot.  This service is part of the cloud processing that happens in the background with MyCurator.

Extracting the full text of an article is not easy.  As you might imagine, every site seems to use its own way to format and display articles.  The possibilities are endless.

We humans can quickly scan a web page, even if it's cluttered with ads and pictures, and find the text of the article.  The diffbot service uses machine vision techniques to do something similar when it extracts an article.

The service is very accurate, ranking at the top of some studies that compare how well various services extract the text and images from a web page.  One such study found it to be over 90% accurate.  Unfortunately, that still means some articles are not captured.  When this happens, MyCurator logs Error Rendering Page and skips the article.

MyCurator skips articles that could not be rendered because, without the text, it cannot filter and classify them correctly.  You can always check the article yourself by clicking on its URL in the Error Log.  If it is an article you like, you will have to cut and paste some of its text into a post manually.

Reasons for Error Rendering Page

The main cause of Error Rendering Page is that the website is responding slowly and the diffbot service times out trying to get the web page.  You can recognize this case when other articles from the same site are showing up on your training page.  MyCurator will retry the article each time it runs (up to 7 days).  Usually the site eventually responds and the page is converted.
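To make the retry rule concrete, here is a minimal Python sketch of the idea.  It is not MyCurator's actual code: the function name, the parameters, and the choice to count the 7 days from the first failure are all assumptions made for illustration.  Only the 7-day limit itself comes from the description above.

```python
from datetime import datetime, timedelta

# The 7-day limit comes from the post; measuring it from the first
# failed attempt is an assumption of this sketch.
RETRY_WINDOW = timedelta(days=7)


def should_retry(first_failed_at: datetime, now: datetime,
                 window: timedelta = RETRY_WINDOW) -> bool:
    """Return True while a failed article is still inside the retry window."""
    return now - first_failed_at <= window


first_failure = datetime(2024, 1, 1, 8, 0)
print(should_retry(first_failure, datetime(2024, 1, 5, 8, 0)))   # still inside the window
print(should_retry(first_failure, datetime(2024, 1, 9, 8, 0)))   # window has passed
```

Under these assumptions, an article that failed on January 1 would still be retried on January 5 but would be dropped by January 9.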

Another reason for render errors is that the page is actually a PDF or some other ‘encoded’ document type.  A related problem is pages that are behind a paywall.  Even if the page is rendered, it usually is just the message telling you that you have to pay for the article!

Finally, some websites are coded in a way that diffbot can’t figure out.  For these sites, you will usually see that all of the articles from the site have the page rendering error.  In these cases, we can use a correction feature of the diffbot service to try to get the site to render pages correctly.
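The difference between a slow site and a site that diffbot cannot read shows up as a pattern across articles, so it can be sketched as a small script.  The Python below is only an illustration of that reasoning, not something MyCurator provides: the function name, the input format (pairs of URL and whether the article rendered), the labels, and the min_articles threshold are all hypothetical, and you would have to gather the URLs from your Error Log and training page yourself.

```python
from collections import defaultdict
from urllib.parse import urlparse


def classify_sites(results, min_articles=2):
    """Group (url, rendered_ok) pairs by site and label each site.

    Labels (hypothetical names):
      "ok"             - no render errors seen
      "intermittent"   - some articles render; likely a slow site or timeouts
      "always failing" - every article fails; a candidate for a site fix
      "unknown"        - failures seen, but too few articles to judge
    """
    counts = defaultdict(lambda: [0, 0])  # host -> [failures, total]
    for url, rendered_ok in results:
        host = urlparse(url).netloc.lower()
        counts[host][1] += 1
        if not rendered_ok:
            counts[host][0] += 1

    labels = {}
    for host, (failed, total) in counts.items():
        if failed == 0:
            labels[host] = "ok"
        elif failed < total:
            labels[host] = "intermittent"
        elif total < min_articles:
            labels[host] = "unknown"
        else:
            labels[host] = "always failing"
    return labels


sample = [
    ("http://slow.example.com/a", True),
    ("http://slow.example.com/b", False),
    ("http://broken.example.org/x", False),
    ("http://broken.example.org/y", False),
]
print(classify_sites(sample))
```

In this made-up sample, slow.example.com would be labeled intermittent, which matches the timeout case, while broken.example.org would be labeled always failing, which is the kind of site worth reporting for a fix.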

Fixing a Site That Always Has Render Errors

If none of a site’s articles are rendering correctly, and the site is important to you, we can sometimes correct it.  Use our Contact Us form to send us the URLs of a couple of articles from the site that are getting errors.  You can just copy the URLs from the Error Log page.  We will attempt to fix the site’s article rendering and, if we are successful, it should start providing articles to your training page.

The good news is that diffbot is always learning, especially from the fixes that its clients add.  I have seen many pages that failed to render start working a few weeks later, once diffbot learned how to read them.