Connection not Content

A Blog for MOOCs and Other Animals

MOOC Comment Scraper – Program Details

The current version of the Scraper abbreviates posts from WordPress or Blogger blogs and lists them along with their comments and details such as date and author – see sample output. My approach is briefly described below, along with a summarised version of the program.

Input/Output – The input is a text file containing the list of blogs to be scraped. Each blog must be numbered, and its URL and type (WordPress or Blogger) specified. The final output is a single HTML page suitable for embedding on a blog or website, formatted so that the latest posts are displayed first, each followed by its comments. The oldest comments for a post are displayed first in order to preserve some degree of conversational flow.

Basic Operation – Typically, blogs have separate RSS feeds for posts and for comments, such as:

WordPress:  example.wordpress.com/feed  and  example.wordpress.com/comments/feed
Blogger: example.blogspot.co.uk/feeds/posts/default  and  example.blogspot.co.uk/feeds/comments/default
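
As a minimal sketch of the URL patterns above, the two feed addresses can be derived from a blog's base URL and type. The function name feed_urls and its arguments are my own illustration, not taken from the Scraper source, and blogs with non-standard feed locations would need special handling.

```python
def feed_urls(base_url, blog_type):
    """Return (post_feed_url, comment_feed_url) for a blog.

    base_url  -- blog address, e.g. 'example.wordpress.com'
    blog_type -- 'WordPress' or 'Blogger' (case-insensitive)
    """
    base = base_url.rstrip("/")
    kind = blog_type.strip().lower()
    if kind == "wordpress":
        return base + "/feed", base + "/comments/feed"
    if kind == "blogger":
        return base + "/feeds/posts/default", base + "/feeds/comments/default"
    raise ValueError("unsupported blog type: " + blog_type)
```

For example, feed_urls("example.blogspot.co.uk", "Blogger") gives the two Blogger addresses listed above.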

The feeds contain the text of the latest posts and comments. The Scraper combines, abbreviates and processes these, normally by generating a header for each post followed by its comments. To do this, the feeds are passed to Python's feedparser module, which extracts items from the feeds, including the post ids. Every post from the post feed has a unique id and every comment from the comment feed carries the id of its associated post. The Scraper therefore matches up ids from both feeds and then processes each comment, adding it underneath the correct post header. Comments in the form of ‘pingbacks’ are ignored.
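
The id matching described above can be sketched roughly as follows. This works on plain dictionaries rather than real parsed feeds, and the keys 'id', 'post_id', 'date' and 'is_pingback' are hypothetical names chosen for illustration; how the post id is actually carried differs between WordPress and Blogger feeds.

```python
def attach_comments(posts, comments):
    """Group comment ids under their posts, oldest comment first.

    posts    -- list of dicts with a hypothetical key 'id'
    comments -- list of dicts with hypothetical keys 'id', 'post_id',
                'date' (any sortable value) and optionally 'is_pingback'
    """
    by_post = {post["id"]: [] for post in posts}
    for comment in sorted(comments, key=lambda c: c["date"]):
        if comment.get("is_pingback"):
            continue  # pingbacks are ignored
        if comment["post_id"] in by_post:
            by_post[comment["post_id"]].append(comment["id"])
        # comments whose post is not in the post feed are skipped here
    return by_post
```

Sorting by date before grouping is one simple way to get the oldest comments first under each post.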

Some of the earlier comments for a given post may not be present in a comment feed, so the Scraper archives its output after every run and any later comments are added the next time the Scraper runs. Time limits can be set so that posts are deleted once they are too old, or left out of the output if no comments have appeared within a certain time. Setting a tag (e.g. a MOOC hashtag) ensures that only posts bearing this tag will appear in the output.
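
The time limits and tag filter could be expressed along these lines. This is only a sketch under my own assumptions: the parameter max_age_days and the return value 'Delete' are hypothetical (the Scraper itself only records 'No Com', 'New Post' and 'Has Com'), and tags are compared case-insensitively, which may not match the real program.

```python
from datetime import datetime, timedelta


def post_status(post_date, has_comments, now, max_age_days, no_com_lim):
    """Classify a post by age and comment activity.

    Returns 'Delete' (hypothetical marker for an out-of-date post),
    'Has Com', 'New Post' or 'No Com'.
    """
    age = now - post_date
    if age > timedelta(days=max_age_days):
        return "Delete"
    if has_comments:
        return "Has Com"
    if age > timedelta(days=no_com_lim):
        return "No Com"  # no comments within the limit: not output
    return "New Post"


def has_tag(post_tags, wanted_tag):
    """True if wanted_tag is among the post's tags, ignoring case."""
    return wanted_tag.lower() in (tag.lower() for tag in post_tags)
```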

Summarised Version of the Scraper Program – The action of the Python program is summarised below and gives some idea of how the Scraper was implemented. It does not specify an efficient, polished program! If necessary, I will (shortly) make the source code available on request for non-commercial purposes.

Definition of terms: (highlighted in the program summary below)

BlogNo – Unique number associated with each blog on the input file
Comid – Unique comment id for a comment in the comment feed
ComidList – List of ‘Active’ comment ids
ComPostid – id of comment’s post from the comment feed
ComStatus – Status of a post in Posts. Can be, ‘No Com’, ‘New Post’ or ‘Has Com’
Html – HTML post header in Posts plus any comments
Hpage – Final HTML output file containing formatted posts and comments
NoComLim – Number of days posts with no comments appear in output
OutputList – Date sorted version of Posts used to generate the HTML output file
Posts – List containing post and comment info in the form:
Posts[BlogNo][PostNo] = [Postid, ComStatus, [list of Comids], date, Html]
Postid – Unique post id for a post in the post feed
PostidList – ‘Active’ post list
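
To make the layout of Posts concrete, here is a small sketch of one way it might look in Python, together with a round trip through an archive string. The use of a dictionary keyed by BlogNo, JSON as the archive format, ISO date strings and the sample ids are all assumptions for illustration; the definitions above do not say how the archive is stored.

```python
import json


def dump_posts(posts):
    """Serialise the Posts structure to a JSON string for the archive."""
    return json.dumps(posts)


def load_posts(text):
    """Rebuild Posts from archived JSON, restoring integer BlogNo keys."""
    return {int(blog_no): entries for blog_no, entries in json.loads(text).items()}


# Posts[BlogNo][PostNo] = [Postid, ComStatus, [list of Comids], date, Html]
posts = {
    1: [
        ["post-101", "Has Com", ["com-7", "com-9"], "2014-03-25", "<h3>First post</h3>"],
        ["post-102", "New Post", [], "2014-03-27", "<h3>Second post</h3>"],
    ],
    2: [],  # a blog with no active posts
}

# Unpack the first post of blog 1 into the terms defined above
postid, com_status, comids, date, html = posts[1][0]
```

JSON turns integer keys into strings, which is why load_posts converts them back.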

PROGRAM SUMMARY: (# indicates comment)

get BlogNo, URL and blog type for each blog from the input file
get archived Posts list from the archive file 

for each BlogNo:
     
     # Examine Posts list and create lists of 'active' post and comment ids in
     # PostidList and ComidList respectively. Update Posts as necessary -

     clear ComidList and PostidList
     for each post in Posts:
          delete post if out of date
          if ComStatus is 'New Post' and post is older than NoComLim days then:
               set ComStatus to 'No Com'     # has had no comments for NoComLim days
          otherwise:                         # is 'active' post
               add its Comids and Postid to ComidList and PostidList respectively
     # -----------------------------------

     # Now go online to fetch this blog's latest feeds -

     get URL for this blog
     fetch its latest post and comment feeds and pass these to feedparser
     extract post and comment entries from the parsed feeds

     # Now process this blog's post entries -

     for each post:
          dismiss post if:
               out of date
               without correct tag
               Postid is already in PostidList       # from past run
          otherwise:                                 # is new post
               add Postid to PostidList
               set ComStatus to 'New Post' if post younger than NoComLim days
                    otherwise to 'No Com'
               create HTML header for title, link, author, date, 2 lines post text
               set header as Html in Posts
     # -------------------------------------------------------------

     # Now process this blog's comment entries and find matching post ids -

     for each comment entry extract comments in reverse time order:
          dismiss if 'pingback' or if Comid is already in ComidList   # old comment
          otherwise:                                              # new comment
               get author, date, text, Comid and ComPostid
               for each post in Posts:
                    if Postid matches ComPostid  then:            # matching post found
                         add Comid to ComidList                   
                         copy Comid to comment list               # in Posts
                         set ComStatus to 'Has Com'               # ditto
                         create HTML for comment and add to Html  # ditto
     # ------------------------------------------------------

save Posts to archive file                             # for next Scraper run

# Now copy 'active' posts to OutputList and sort -

clear OutputList
for each BlogNo:
     for each post in Posts:
          add post to OutputList if ComStatus is 'Has Com' or 'New Post'
sort OutputList by date                                # latest posts first
# -----------------------------------------------------------

# Now create web page for output -

create HTML header for web page in Hpage
for each post in OutputList:
     append Html to Hpage
save Hpage to output file
# --------------------------------- END OF PROGRAM -------------
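
As a rough Python sketch of the last two stages of the summary (building OutputList and the web page), assuming Posts is a dictionary of lists in the form given in the definitions and that dates are stored as sortable ISO strings. The function name build_page and the bare page wrapper are hypothetical, not the Scraper's actual output format.

```python
def build_page(posts):
    """Collect active posts from every blog and return one HTML page.

    posts -- {BlogNo: [[Postid, ComStatus, [Comids], date, Html], ...]}
    """
    output_list = []
    for blog_posts in posts.values():
        for postid, com_status, comids, date, html in blog_posts:
            if com_status in ("Has Com", "New Post"):
                output_list.append((date, html))
    # Latest posts first
    output_list.sort(key=lambda item: item[0], reverse=True)
    body = "\n".join(html for _, html in output_list)
    return "<html><body>\n" + body + "\n</body></html>"
```

Note that the output list is built once across all blogs before sorting, so posts from different blogs are interleaved by date.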

Creative Commons Licence
[This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.]

Written by Gordon Lockhart

March 27, 2014 at 8:30 pm

4 Responses

  1. […] was something he had done previously for #rhizo14 What I did not realize that I learned from reading his documentation was that Blogger provided RSS feeds for comments- for some reason I thought that only WordPress had […]

  2. This is a brilliant attempt to help us find blog posts that more people relate to and discuss. One suggestion: since you mentioned that only a few blogs have been considered, can you please also publish these lists separately:
    1. Blogs considered (and not considered as well. This may help the author find why)
    2. Blogs with at least 1 post that received comments in the last 15 days
    3. Blogs with at least 1 new post (without comment) in the last 3 days

    This is only for me (and other blog authors like me) to see where my blog fits, and whether my blog is considered at all.

    Aparna Nagaraj

    October 29, 2014 at 4:00 am

    • Thanks for your kind words and interesting suggestions Aparna – difficult to choose exactly what to publish but I will certainly consider.
      1) I’ve tried to scan all WP and Blogger blogs appearing in the ccourses list that have ‘normal’ RSS post and comment feeds. (I’m very sorry that I seem to have omitted yours – not sure why – but should be OK now!) The Collector can’t deal with other types of commenting such as G+, tumblr etc. At present over 80 blogs are being scanned, which probably covers a majority of active participants. Frequent alterations make it difficult to be definitive but I’ve been happy to look at particular cases on request.
      2) This is immediately evident from the Collector’s daily output ( iberry.com/cc.htm ) – a blog’s name appears on the top left of a post’s entry and the title of the post links to the blog itself.
      3) Similar to 2) ‘New Posts’ are labelled and lack of comments can be identified.
      Thanks again for your interest.

      Gordon Lockhart

      October 29, 2014 at 2:09 pm

  3. […] develop the Collector. I have no intention to develop the Collector for any commercial purpose and Programming details are openly […]

