Connection not Content

A Blog for MOOCs and Other Animals

Designing a Comment Scraper for MOOCs (and other animals)

I published the daily output of a basic Comment Scraper program during the last week of the Change11 MOOC. Since then there has been some further interest and I have been happy to make the original program (in Python 2.7) available on request. I have now written a new and, I hope, improved version. Although still experimental and only intended for WordPress blogs, it is at the stage where realistic testing with user feedback would be very useful.

The Comment Scraper is intended to bring together brief summarised versions of recent blog posts along with the resulting comments (see A ‘Comment Scraper’ for Aggregating Blog Posts with Comments in a MOOC, the update and the FAQ). The idea is to provide nothing more than a quick impression of current MOOC activity – what it’s about and where it’s at. In principle, any online activity where discussion is distributed over a number of blogs could be treated in a similar way.

I would now like to experiment further by scraping a number of blogs associated with the Current/Future State of Education MOOC and publishing the output here: MOOC Comment Scraper Output (2). There is also an RSS feed available.

It is not realistic for me to ask all the bloggers and commenters who may be involved for permission to publish. (I have little idea of the legalities – can anyone advise?) Given the encouragement received during and after Change11, I’m assuming that publication is acceptable, but any request by a blog author not to scrape their blog will of course be respected. I should point out that I have absolutely no commercial interest in comment scraping!

Operation of the Comment Scraper:

The scraper works by downloading post and comment RSS files from each blog and creating post headings for each post along with the date, first line and any comments arising from that post. The form the output takes for a single post is illustrated below. The comments are always displayed in date order, earliest first.

Date: ‘Post Title’ by Post Author
—————————————————— First line of Post —————-
—————————— …
————————————— First line of comment 1 ———————-… (Comment1 Author1, Date1)
————————————— First line of comment 2 ————————-… (Comment2 Author2, Date2)
————————————— First line of comment 3 ———————–… (Comment3 Author3, Date3)
etc    etc
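
As a rough illustration of the layout above, here is a minimal Python 3 sketch that parses a WordPress-style comments feed and renders one post with its comments. It is not the Scraper's actual code: the embedded sample feed, the keys of the `post` dictionary and the helper names `parse_items`, `first_line` and `render_post` are all assumptions made for the example, and a real run would download the feeds rather than read a string.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Namespace commonly used for <dc:creator> in WordPress feeds.
DC = "{http://purl.org/dc/elements/1.1/}"

# Made-up sample standing in for a downloaded comments feed.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
  <dc:creator>Bob</dc:creator>
  <pubDate>Tue, 13 Nov 2012 09:30:00 +0000</pubDate>
  <description>I agree.   This would help a lot with following the course.</description>
</item>
<item>
  <dc:creator>Alice</dc:creator>
  <pubDate>Mon, 12 Nov 2012 18:00:00 +0000</pubDate>
  <description>Nice idea!</description>
</item>
</channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict (author, date, text) per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [
        {
            "author": item.findtext(DC + "creator", default="unknown"),
            "date": parsedate_to_datetime(item.findtext("pubDate")),
            "text": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


def first_line(text, width=50):
    """Collapse whitespace and cut the text to at most `width` characters."""
    line = " ".join(text.split())
    return line if len(line) <= width else line[: width - 3] + "..."


def render_post(post, comments):
    """Format a post heading followed by its comments, earliest first."""
    lines = [
        "{:%Y-%m-%d}: '{}' by {}".format(post["date"], post["title"], post["author"]),
        first_line(post["text"]),
    ]
    for c in sorted(comments, key=lambda c: c["date"]):
        lines.append(
            "    {} ({}, {:%Y-%m-%d})".format(first_line(c["text"]), c["author"], c["date"])
        )
    return "\n".join(lines)


# Hypothetical post record; in practice this would come from the posts feed.
post = {
    "title": "Designing a Comment Scraper",
    "author": "Gordon",
    "date": parsedate_to_datetime("Mon, 12 Nov 2012 16:42:00 +0000"),
    "text": "I published the daily output of a basic Comment Scraper program.",
}
print(render_post(post, parse_items(SAMPLE_FEED)))
```

Sorting the comments by date inside `render_post` means the feed order does not matter: the sample lists Bob's later comment first, yet Alice's earlier one is printed first.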

Outputs for all posts are displayed in date order with the latest first so that new posts always appear at the top of the display window and old posts drop off the end when their age exceeds a setting for maximum display time.
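
The ordering and drop-off rule can be sketched in a few lines of Python 3. This is only an illustration under assumptions: the function name `posts_to_display`, the `max_display_days` parameter and its default of 7 days are invented for the example and are not taken from the Scraper itself.

```python
from datetime import datetime, timedelta


def posts_to_display(posts, now, max_display_days=7):
    """Drop posts older than the cutoff and return the rest, latest first."""
    cutoff = now - timedelta(days=max_display_days)
    recent = [p for p in posts if p["date"] >= cutoff]
    return sorted(recent, key=lambda p: p["date"], reverse=True)


# Made-up posts for illustration.
posts = [
    {"title": "Old post", "date": datetime(2012, 11, 1)},
    {"title": "Newer post", "date": datetime(2012, 11, 10)},
    {"title": "Newest post", "date": datetime(2012, 11, 12)},
]
for p in posts_to_display(posts, now=datetime(2012, 11, 13)):
    print(p["date"].date(), p["title"])
```

With `now` set to 13 November and a 7-day window, the cutoff falls on 6 November, so "Old post" is dropped and the other two are listed newest first.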

Some Considerations affecting design:

  • Paring Down – RSS feeds contain a considerable amount of machine-readable information but to achieve a brief and convenient human-readable display the Comment Scraper allocates only 2 lines to the post heading (in bold) and 1 for each comment along with the commenter’s name and date. An additional line in the heading could provide more information such as the title of the blog but repeating this for every post seemed excessive.
  • ‘Pingbacks’ – These appear as comments in WordPress RSS feeds and usually record the existence of a link to a particular post from a different blog. The Scraper ignores pingbacks to avoid interrupting the flow of comments following the post heading. Pingbacks can vary from casual references to very significant linkages but they are often quite unrelated to the direct comments – maybe there is a case for displaying them independently.
  • Aggregation – The RSS feeds available via WordPress blogs are essentially updates (intended for RSS readers) so the initial comments for a particular post can rapidly vanish from feeds if there are numerous later comments and pingbacks. As the Scraper ignores pingbacks a cascade of pingbacks following an initial posting can push direct comments out of the feeds surprisingly quickly. The latest version of the Scraper therefore aggregates comments locally so that the early comments on a post are still displayed when absent from later feeds.
  • Maximum Display Time – Subtracting this from the current date sets a cutoff date for any display of posts and comments. Scanning a large number of blogs will tend to generate lengthy displays unless the maximum display time is reduced. It might be useful to vary this automatically so that a fixed number of posts or comments, or even an approximately constant page length, is achieved.
  • Ordering – Posts are displayed strictly in order of their publication date and consequently posts from any one blog will tend to be interspersed with posts from the others. An alternative would be to group together all posts from the same blog – this of course would alter the natural date order for posts as a whole.
  • No Comment – Although the Scraper can assemble headings for all posts within the allowable range of dates, it only displays posts with at least 1 comment – it is after all a Comment Scraper! In some circumstances it could be useful to raise this threshold above 1 comment so that a smaller number of posts, each attracting several comments, is displayed.
    As there may be numerous posts without comments, their inclusion can considerably lengthen the display. Martin Hawksey (in an impressive post on blog post comments) mentions, in the context of MOOCs, that “… it might be useful to know where the inactive nodes are so that moderators might want to either respond or direct others to comment”. This is a fair point and perhaps there is also a case for displaying commentless post headings independently.
  • HTML in Text – The Scraper extracts raw text from the first line of posts or comments but tries to ignore any HTML that may be present. In some cases it will insert a message in italics, e.g. ‘{link}’, when a link is detected.
  • Language Translation – If a post is written in an unfamiliar language, automatic translation services such as Google Translate can, at the very least, provide vital clues about content and comments. In principle, a Comment Scraper incorporating language translation could be effective in highlighting such posts.
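
Several of the considerations above (ignoring pingbacks, aggregating comments locally, requiring a minimum number of comments and stripping HTML) can be sketched together in Python 3. Everything here is hypothetical rather than the Scraper's own code: the record keys (`id`, `post`, `date`, `text`), the helper names and, in particular, the pingback test are assumptions. The test guesses that a pingback excerpt begins with "[...]" or "[…]", which is a heuristic and not something the feed states explicitly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML fragment, marking links with '{link}'."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.parts.append("{link} ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def is_pingback(comment):
    # Assumption: pingback excerpts start with "[...]" or "[…]".
    return comment["text"].lstrip().startswith(("[...]", "[…]"))


def merge_comments(store, fetched):
    """Add fetched comments to the local store, keyed by id, keeping earlier ones."""
    for c in fetched:
        store.setdefault(c["id"], c)
    return store


def comments_for(post_id, store, min_comments=1):
    """Return a post's direct comments, earliest first, or [] if below the threshold."""
    direct = [c for c in store.values() if c["post"] == post_id and not is_pingback(c)]
    direct.sort(key=lambda c: c["date"])
    return direct if len(direct) >= min_comments else []


# Two made-up fetches; dates are plain integers to keep the example short.
store = {}
first_fetch = [{"id": 1, "post": "p1", "date": 1, "text": "<p>Early comment</p>"}]
later_fetch = [
    {"id": 2, "post": "p1", "date": 2, "text": "[...] linked from another blog [...]"},
    {"id": 3, "post": "p1", "date": 3, "text": 'See <a href="http://example.com">this</a> too'},
]
merge_comments(store, first_fetch)
merge_comments(store, later_fetch)  # comment 1 has already dropped out of this feed
for c in comments_for("p1", store):
    print(strip_html(c["text"]))
```

Because the store is keyed by comment id, the early comment survives even though the later fetch no longer contains it, and the pingback is stored but never displayed. A persistent version would save `store` to disk between runs.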

I will try to keep the Scraper output updated in the hope that it begins to be useful as a means of tracking MOOC activity. Your feedback and comments would be greatly appreciated!

Written by Gordon Lockhart

November 12, 2012 at 4:42 pm
