Chart Against The X Factor

Posted by – 2009/12/06

For the last four years running, Simon Cowell’s plastic karaoke acts have held the Christmas #1 spot in the UK singles chart thanks to ITV’s hit show The X Factor. People have been complaining that this has ruined the great British tradition of betting on which artist will take the number one slot, as it’s traditionally the only time of year when the chart is dominated by wacky Christmas songs rather than the latest boy bands and whatever else thirteen-year-old girls spend their pocket money on.

I’m not too bothered about popular music, the singles chart or who gets the Xmas #1 slot, but last week I was invited to join a growing group on Facebook who are campaigning to knock the X Factor winner from the top spot by mass purchasing Rage Against The Machine’s classic track Killing in the Name. The sound of rebellion to conquer the airwaves, political rap metal on future Christmas compilation albums, all for the princely sum of 79p? I don’t usually buy digital downloads but this time you can count me in!

According to Sky News the group had 43,000 members sometime on Friday, but by the time I got home on Saturday night there were 180,000 members and rising. As the media coverage increases so do the new members, which made me interested: how does a phenomenon like this evolve, how will it turn out next Sunday? What happens when the UK Charts people decide that it’s against the rules and disqualify the single?

So I decided to log and graph the group’s membership. Every fifteen minutes I grab the page using wget, extract the number of members and append it to a text file along with the current date and time. Then I cut through it using a couple of awk and sed one-liners, dump the results into an HTML file, graph it using Google Charts and upload the output to my file dump.

Update: These graphs are no longer live! Click for the live versions which are updated much more often using a different script

[Charts: total members (click for the source data) and members per hour]

Here’s the scraping script:

#!/bin/bash
 
# bail out rather than write files into the wrong directory
cd /home/gaz/ratm/ || exit 1
 
# get the timestamp
timestamp=$(date "+%Y/%m/%d %H:%M:%S")
 
# get the file (the URL is quoted so the shell doesn't treat the ? as a glob)
wget --max-redirect 2 -O temp.html --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5" "http://www.facebook.com/group.php?gid=2228594104"
 
# extract user count from the file
usercount=`sed -n -e "s/.* of \(.*\) members.*/\1/p" temp.html`
 
# remove any commas from the string
usercount=${usercount//[,]/}
 
# the count will be empty if Facebook is having problems and the page doesn't match
# in this case, we just record a -1 (not good practice from a stats PoV, but it keeps it simple)
if [ -z "$usercount" ]
then
    usercount="-1"
fi
 
# remove the temporary file
rm temp.html
 
# write the output in CSV format
echo "$timestamp,$usercount" >> data.dat
 
# next I run the graph generating script
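
As a minimal sketch, the extraction step can be tried on its own without hitting Facebook. The function name extract_count and the sample markup below are assumptions made up for illustration (this is not Facebook's real HTML), and the pattern is slightly stricter than the one in the script above, accepting only digits and commas:

```shell
#!/bin/bash

# Hypothetical helper: read HTML on stdin and print the member count.
# Prints -1 when no count is found, the same sentinel the script uses.
extract_count() {
    # capture the digits and commas between " of " and " members",
    # keep only the first matching line, then strip the commas
    count=$(sed -n -e 's/.* of \([0-9][0-9,]*\) members.*/\1/p' | head -n 1 | tr -d ',')
    # fall back to the sentinel when nothing matched
    echo "${count:--1}"
}

# made-up sample markup, for illustration only
echo '<span>Displaying 8 of 180,123 members</span>' | extract_count   # prints 180123
echo '<html>service unavailable</html>' | extract_count               # prints -1
```

Keeping the parsing in a function like this makes it easy to check a new pattern against a saved copy of the page before putting it into the scheduled job.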

And this one (no longer in use) creates the two above charts from the data:

#!/bin/bash
 
# gets a column from a line of a CSV file. The first index is 1, not 0.
# The value is returned in the global variable RESULT.
getElement() {
    RESULT=$(echo "$2" | cut -d, -f"$1")
}
 
# get the start and end times
 
getElement 1 "$(tail -1 data.dat)"
end=$RESULT
getElement 1 "$(sed -n '1p' data.dat)"
start=$RESULT
 
# get the current minimum and maximum values
min=$(cat minval)
max=$(cat maxval)
 
# get the last value
getElement 2 "$(tail -1 data.dat)"
lastval=$RESULT
 
# set new max value
 
if [ "$lastval" -gt "$max" ]
then
    echo "$lastval" > maxval
    # update max itself, as it is used for the axis labels and scaling below
    max=$lastval
    echo "New maximum, $lastval"
fi
 
# and the new min value
 
if [ "$lastval" -gt 0 ]
then
    if [ "$lastval" -lt "$min" ]
    then
        echo "$lastval" > minval
        min=$lastval
    fi
fi
 
# get values for the Y axis
quart=$((($max - $min) / 4))
q1=$(($min + $quart * 1))
q2=$(($min + $quart * 2))
q3=$(($min + $quart * 3))
 
# extract the data using regexp:
# 1. get every 4th line of the file, meaning hourly
# 2. take all the values from the file
# 3. remove the trailing comma
 
data=$(awk 'NR%4==0' data.dat | sed -n -e "s/.*,\([0-9]*\)/\1/p" | tr "\n" "," | sed -e "s/\(.*\),/\1/")
 
# build the URL to the total members chart
total_members="http://chart.apis.google.com/chart?chtt=Total+Members&chs=600x300&cht=ls&chxt=x,y&chxl=0:|$start|$end|1:|$min|$q1|$q2|$q3|$max&chds=$min,$max&chd=t:$data"
 
# now let's do members per hour
 
lastval=$min
min=0
max=0
data=""
inputList=$(awk 'NR%4==0' data.dat | sed -n -e "s/.*,\([0-9]*\)/\1/p")
while read -r line; do
    if [ "$line" -gt "0" ]
    then 
        val=$(($line - $lastval))
        lastval=$line
    else
        val=0
    fi
 
    if [ "$val" -gt "$max" ]
    then
        max=$val
    fi
 
    data="$data,$val"
done <<< "$inputList"
 
# remove comma prefix
data=$(echo "$data" | sed -e "s/,\(.*\)/\1/g")
 
# build the per hour chart
members_per_hr="http://chart.apis.google.com/chart?chtt=Members+per+hr&chs=600x300&cht=ls&chxt=x,y&chxl=0:|$start|$end|1:|$min|$max&chds=$min,$max&chd=t:$data"
 
# I then create an HTML file from some templates and upload everything to my dump
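
The members-per-hour step above can also be sketched as a single awk pass. This is an illustrative alternative under stated assumptions, not the script that produced the charts: the function name hourly_deltas is hypothetical, the input is assumed to be one "timestamp,count" line per hour, and unlike the loop above the first good sample sets the baseline instead of the stored minimum:

```shell
#!/bin/bash

# Hypothetical helper: read "timestamp,count" lines on stdin and print
# the increase between consecutive samples as a comma-separated list.
# Failed samples (count <= 0, i.e. the -1 sentinel) are reported as 0
# and do not move the baseline.
hourly_deltas() {
    awk -F, '
        $2 <= 0 { printf "%s0", sep; sep = ","; next }   # failed sample
        !seen   { prev = $2; seen = 1 }                  # first good sample is the baseline
                { printf "%s%d", sep, $2 - prev; sep = ","; prev = $2 }
        END     { print "" }
    '
}

# made-up sample data, for illustration only
printf '%s\n' \
    '2009/12/06 00:00:00,100' \
    '2009/12/06 01:00:00,150' \
    '2009/12/06 02:00:00,-1' \
    '2009/12/06 03:00:00,400' | hourly_deltas   # prints 0,50,0,250
```

The output is already in the comma-separated form that the chd=t: parameter of the chart URL expects.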
17 Comments on Chart Against The X Factor

  1. Jon Morter says:

    It’s my group….WOW! that is some graph!!!

    J :)

  2. RAGE says:

    Legend

  3. Lolly says:

    Meteoric!!!!!!!!!!!!!!! xx

  4. James French says:

    Good work Gaz. Love it that you posted the bash script too :)

  5. Gaz Davidson says:

    I’m afraid that it only covers the time since Saturday night at 12. The flat spot is because everyone was out partying or in bed. Hopefully it will get an interesting shape this week!

    I may make this a bit more robust and user friendly, then do a proper release so everyone can graph their Facebook groups :)

  6. Elliot says:

    This group FTW! Just seeing this graph makes you struck with awedom on how quickly a few words can spread around :D

    RATM! :D

  7. Dan Akers says:

    amazing man, this is awesome, posting it to my fb page asap

  8. Dave Holden says:

    The new versions of your graphs are awesome! Thanks for creating them.

    I’m curious about the number of members being predicted today, the curve seems to go up steeply. It’s predicting 70399 new members now. Where do these figures come from please?

  9. Gaz Davidson says:

    It assumes that there are similar volumes of users joining at the same time each day and predicts based on how many have joined so far today compared to the same time over the last three days:

     
    from time import gmtime, strftime

    # seconds elapsed since midnight today
    secs_today = int(total_range) % 86400

    # start time was 12am on Sunday, there are 86400 secs in a day
    days = []
    for i in range(int(start), int(start) + int(day_count + 1) * 86400, 86400):
        days.append((
            # day name
            strftime('%b%d', gmtime(i)),
            # total that day
            data[i + 86400] - data[i],
            # amount at the same time on that day
            data[i + secs_today] - data[i],
        ))

    # sum the values of the last three full days
    tot = 0.0
    partial = 0.0
    for day in days[-4:-1]:
        tot += day[1]
        partial += day[2]

    # ratio of a full day's joins to the joins by this time of day
    factor = tot / partial

    # make a prediction from the number who have joined so far today
    prediction = days[-1][2] * factor

    “data” is an instance of my InterpoList class, which does linear interpolation to fill in the gaps between measurements. You can get it here:
    http://svn.bitplane.net/misc/trunk/py/InterpoList.py

  10. Dave Holden says:

    Many thanks!

  11. James French says:

    This is too cool Gaz. I’m impressed. I think I’m going to learn a thing or two from these scripts – cheers!

    James

  12. [...] messing with Python to output some better graphs for my Facebook group scraper I stumbled upon an interesting problem. What happens if you have missing samples, or want to change [...]

  13. nick rose says:

    Excellent graph :D We’re an internet machine!!!

  14. Dimitri says:

    Hey Gaz,

    Great work on the script, and thanks for sharing the code. Just noticed, the index page doesn’t seem to load normally anymore (firefox tries to download it instead of interpreting the html). Any idea why?

    Thanks,
    Dimitri

  15. Gaz Davidson says:

    It should be working again now, Dimitri. It was caused by a change to the group’s name; I’ve fixed the regexp like so:

    # extract user count from the file
    usercount=`sed -n -e "s/.*>[0-9]* of \(.*\) members<.*/\1/p" temp.html`

  16. nick rose says:

    Hi, are there any plans to publish the full details once the campaign is over? I’d be really interested to see all the data collated.

  17. Gaz Davidson says:

    Yes, the raw data is currently available here:

    http://dump.bitplane.net/ratm/data.dat

    Sometime after Sunday I’ll move the latest graphs back to my blog along with the results, the scripts used to generate them and maybe even a bit of analysis. Watch this space!

    If you can think of any cool things to do with the data then please just go ahead and use it, post me a link and I’ll link back to it :)
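
The prediction described in comment 9 boils down to one ratio: joins on the last three full days divided by joins by the same time of day on those days, multiplied by today's joins so far. A rough worked example follows, in shell with awk instead of the Python from that comment. The function name predict_total and all of the numbers are made up for illustration:

```shell
#!/bin/bash

# Hypothetical helper.
# $1            = members who have joined so far today
# remaining args = one "full_day_total:joined_by_same_time" pair per past day
# Prints the predicted total for today, or -1 if there is no usable history.
predict_total() {
    so_far=$1
    shift
    printf '%s\n' "$@" | awk -F: -v so_far="$so_far" '
        { tot += $1; partial += $2 }
        END {
            if (partial == 0) { print -1; exit }
            # factor = tot / partial; prediction = so_far * factor
            printf "%d\n", so_far * tot / partial
        }'
}

# made-up numbers: 150000 joins over three full days, 60000 of them by this
# time of day, so factor = 2.5 and 30000 so far today predicts 75000
predict_total 30000 60000:20000 50000:25000 40000:15000   # prints 75000
```

The guard against a zero denominator is an addition in this sketch; the fragment in comment 9 does not show one.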
