Connection not Content

A Blog for MOOCs and Other Animals

Designing a Comment Scraper for MOOCs (and other animals)


I published the daily output of a basic Comment Scraper program during the last week of the Change11 MOOC. Since then there has been some further interest and I have been happy to make the original program (in Python 2.7) available on request. I have now written a new and, I hope, improved version. Although still experimental and only intended for WordPress blogs, it is at the stage where realistic testing with user feedback would be very useful.

The Comment Scraper is intended to bring together brief summarised versions of recent blog posts along with the resulting comments (see A ‘Comment Scraper’ for Aggregating Blog Posts with Comments in a MOOC, the update, and FAQ). The idea is to provide nothing more than a quick impression of current MOOC activity – what it’s about and where it’s at. In principle, any online activity where discussion is distributed over a number of blogs could be treated in a similar way.

I would now like to experiment further by scraping a number of blogs associated with the Current/Future State of Education MOOC and publishing the output here: MOOC Comment Scraper Output (2). There is also an RSS feed available.

It is not realistic for me to ask all the bloggers and commenters that may be involved for permission to publish. (I have little idea of the legalities – can anyone advise?) Following encouragement received during and after Change11, I’m assuming that publication is acceptable, but any request by a blog author not to scrape their blog will of course be respected. I should point out that I have absolutely no commercial interest in comment scraping!

Operation of the Comment Scraper:

The scraper works by downloading post and comment RSS files from each blog and creating a heading for each post along with the date, first line, and any comments arising from that post. The form the output takes for a single post is illustrated below. The comments are always displayed in date order, earliest first.

Date: ‘Post Title’ by Post Author
    First line of Post …
        First line of Comment 1 … (Author 1, Date 1)
        First line of Comment 2 … (Author 2, Date 2)
        First line of Comment 3 … (Author 3, Date 3)
        etc.

Outputs for all posts are displayed in date order with the latest first so that new posts always appear at the top of the display window and old posts drop off the end when their age exceeds a setting for maximum display time.
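The ordering rules just described – comments earliest-first within each post, posts latest-first overall, and a maximum-display-time cutoff – can be sketched roughly as follows. This is a minimal illustration in Python, not the Scraper’s actual code; the `Post` structure and the `MAX_DISPLAY_DAYS` value are my own assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

MAX_DISPLAY_DAYS = 14  # assumed setting: posts older than this drop off the display

@dataclass
class Post:
    title: str
    author: str
    published: datetime
    # each comment is a (first_line, commenter, date) tuple
    comments: list = field(default_factory=list)

def posts_to_display(posts, now=None):
    """Drop posts older than the cutoff, then sort latest-first.

    Comments within each post are sorted in date order, earliest first,
    matching the display format illustrated above.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=MAX_DISPLAY_DAYS)
    recent = [p for p in posts if p.published >= cutoff]
    for p in recent:
        p.comments.sort(key=lambda c: c[2])          # earliest comment first
    return sorted(recent, key=lambda p: p.published, reverse=True)  # newest post on top
```

With this arrangement a new post always appears at the top of the window, and an old post vanishes as soon as the current date moves past its cutoff.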

Some Considerations affecting design:

  • Paring Down – RSS feeds contain a considerable amount of machine-readable information, but to achieve a brief and convenient human-readable display the Comment Scraper allocates only 2 lines to the post heading (in bold) and 1 for each comment, along with the commenter’s name and date. An additional line in the heading could provide more information, such as the title of the blog, but repeating this for every post seemed excessive.
  • ‘Pingbacks’ – These appear as comments in WordPress RSS feeds and usually record the existence of a link to a particular post from a different blog. The Scraper ignores pingbacks to avoid interrupting the flow of comments following the post heading. Pingbacks can vary from casual references to very significant linkages, but they are often quite unrelated to the direct comments – maybe there is a case for displaying them independently.
  • Aggregation – The RSS feeds available via WordPress blogs are essentially updates (intended for RSS readers), so the initial comments on a particular post can rapidly vanish from the feed if there are numerous later comments and pingbacks. As the Scraper ignores pingbacks, a cascade of pingbacks following an initial posting can push direct comments out of the feed surprisingly quickly. The latest version of the Scraper therefore aggregates comments locally, so that the early comments on a post are still displayed when absent from later feeds.
  • Maximum Display Time – Subtracting this from the current date gives a cutoff date for the display of posts and comments. Scanning a large number of blogs will tend to generate lengthy displays unless the maximum display time is reduced. It might be useful to vary it automatically so as to achieve a fixed number of posts or comments – or even an approximately constant page length.
  • Ordering – Posts are displayed strictly in order of their publication date, so posts from any one blog will tend to be interspersed with posts from the others. An alternative would be to group together all posts from the same blog – this, of course, would alter the natural date order of posts as a whole.
  • No Comment – Although the Scraper can assemble headings for all posts within the allowable range of dates, it only displays posts with at least 1 comment – it is, after all, a Comment Scraper! In some circumstances it could be useful to raise this threshold so that only the smaller number of posts attracting several comments is displayed.
    As there may be numerous posts without comments, their inclusion can considerably lengthen the display. Martin Hawksey (in an impressive post on blog post comments) mentions, in the context of MOOCs, that ” … it might be useful to know where the inactive nodes are so that moderators might want to either respond or direct others to comment”. This is a fair point, and perhaps there is also a case for displaying commentless post headings independently.
  • HTML in Text – The Scraper extracts raw text from the first line of posts or comments but tries to ignore any HTML that may be present. In some cases it will insert a message in italics, e.g. ‘{link}’ when a link is detected.
  • Language Translation – If a post is written in an unfamiliar language, automatic translation services such as Google Translate can, at the very least, provide vital clues about content and comments. In principle, a Comment Scraper incorporating language translation could be effective in highlighting such posts.
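Two of the steps listed above – skipping pingbacks and extracting a plain-text first line while flagging links – might be sketched like this in Python, using only the standard library. The ‘Pingback:’ title prefix used to recognise pingbacks is an assumption for illustration, not necessarily how a given feed flags them, and the helper names are hypothetical:

```python
from html.parser import HTMLParser

def is_pingback(comment_title):
    """Heuristic: treat a feed item as a pingback if its title is so marked.

    Assumed convention for illustration; a real feed may flag pingbacks
    differently (e.g. via a comment-type field).
    """
    return comment_title.lower().startswith(("pingback:", "trackback:"))

class FirstLineExtractor(HTMLParser):
    """Strip HTML tags from a post or comment, noting any link as '{link}'."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.parts.append(" {link} ")  # message inserted when a link is detected
    def handle_data(self, data):
        self.parts.append(data)

def first_line(html_text, width=80):
    """Return the start of the plain text, with whitespace collapsed."""
    parser = FirstLineExtractor()
    parser.feed(html_text)
    text = " ".join("".join(parser.parts).split())
    return text[:width]
```

Local aggregation would then amount to keeping a per-post store of every comment seen so far, keyed by something stable such as the comment’s feed GUID, and merging each fresh feed into it rather than replacing it.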

I will try to keep the Scraper output updated in the hope that it begins to be useful as a means of tracking MOOC activity. Your feedback and comments would be greatly appreciated!

Written by Gordon Lockhart

November 12, 2012 at 4:42 pm

Posted in Uncategorized


9 Responses


  1. Marcius Herbert gave me the link to your scraper page.

    If new learning is about (socially) constructing knowledge, rather than passively receiving knowledge, then these scraping and aggregating tools (such as Storify) become increasingly important. So I’m working with my senior high school students to develop this practice of collaboratively constructing knowledge. We’ve started with Twitter as a note-taking tool – you can follow our progress on my blog (follow the links after the jump).

    How did you build the scraper? My students and I could use such a tool.


    • You have a very interesting experiment there Brad and it seems to be taking off in spite of the constraints imposed by traditional schooling – best of luck! I try to be pragmatic about tools such as Storify but tend to resent the manipulation and lack of freedom that usually comes with the inevitable monetization.

      The Scraper started off as an exercise in learning Python so it’s hardly a professional job, but I’d be happy to send the original version (the current one is a mess at present) to you or any of your students who might be interested. It’s written in Python 2.7 and would need at least some basic programming know-how to get going or to modify for different purposes.

      Gordon Lockhart


      November 12, 2012 at 9:20 pm

  2. Reblogged this on MOOC Madness and commented:
    OK all you mooc rats & obsessive compulsive serial mooc-ers out there…getting lost, losing track of comments? Fear not: help is on the way. Time to get scraped. Don’t forget: Gordon wants feedback. & comments.


    November 12, 2012 at 6:39 pm

    • Thanks Vanessa – much appreciated! There seems to be considerably more interest in the idea this time round.


      November 12, 2012 at 9:26 pm

  3. There is definitely more interest, Gordon, and it’s pretty exciting to see interest at the high school level. I’m tweeting news of Brad’s work mentioned in comment above to my English teacher-grad students to encourage them.

    Wanted to add that I agree with Martin Hawksey that it’s helpful for instructors and students/participants to see which posts don’t have comments. It’s a sad thing for someone’s post to be ignored, especially a newbie who needs the encouragement.

    You had kindly shared your original scraper version and I’m still hoping to use it in my new course, under development now for the spring. Sort of hit a wall though — it looks like Tumblr will work best for my students’ blogs. I think you said other blog services may be problematic, if possible at all. Am I out of the scraping business already if I go with Tumblr?


    November 13, 2012 at 5:18 am

    • Thanks for the encouragement Cris. I’ve come to the conclusion that good or even excellent blog posts are often ignored for no good reason. With so many blogs around getting comments may be more to do with being well-known or just luck in catching the right commenter at the right time. This may not be appreciated by newbie bloggers – particularly very young ones! Having said that, it can also be discouraging to place a comment on a post and be ignored by the blogger. Maybe what’s needed is some sort of commonly-accepted ‘moociquette’!

      I had a quick look at Tumblr and although it has a ‘posts’ feed I couldn’t see one for comments but I’ll investigate further. I had a version of the Scraper for Blogger feeds previously and together Blogger and WordPress must account for a very large proportion of all blogs so anything else tends to be a special case.


      November 13, 2012 at 4:26 pm

      • I actually include a criterion on my blogging rubric about serving as a gracious host and responding thoughtfully to commentors, Gordon. It’s something new bloggers don’t often think of but no one likes for their comments to end up in the dead letter bin.

        Now if there’s a tool that would help us keep up with where we’ve posted so we’ll be sure not to ignore a blogger’s response to our comments, that would be great. I worry I may be seen as rude when I’ve unfortunately forgotten to check the “Notify me of follow-up comments via email.” I think there was a tool “Co-comments” but it never worked well on a Mac.

        Thanks for taking a look at Tumblr. Its appeal for me is the ability to embed multimedia easily — something that WordPress is making more and more difficult for non-paying customers. And the Blogger CAPTCHA gives everybody on WordPress fits. I’m also told that Tumblr is even easier for newbies than WordPress.

        Not forgetting to check the follow-up comments 😉

        Cris Crissman

        November 13, 2012 at 8:31 pm

  4. “I actually include a criterion on my blogging rubric about serving as a gracious host and responding thoughtfully to commentors” – good idea Cris. Hmmm – I guess a local comment scraper could be ‘tuned in’ to auto-detect specific follow-up comments.


    November 14, 2012 at 7:40 am

  5. […] course forums for blogs and comments by participants. I’m blowing the cobwebs off my experimental Comment Scraper and if I can find enough activity I’ll try to bring it all together here in a summarised format […]
