Month: December 2009

Website update

Posted by – 2009/12/30

I’ve finally got around to editing the Simplish template and adding pages for some of my old software projects. There are some links up at the top; hover over them and you get a nice drop-down menu with a list of projects.

The meat of this is JavaScript stolen from another theme and some PHP code gratuitously pasted into my header. This isn’t IP theft, and not just because it’s the season of sharing! This is what’s great about free software licenses.

More Sidewiki Leaks

Posted by – 2009/12/14

Apparently the British press have been barred from publishing any of Tiger Woods’ sex pictures, and the block somehow extends to Internet news sites. They aren’t even allowed to mention what the block is about, so I’ve taken the liberty of filling in the blanks on Google SidewikiLeaks and will update it to include links to pictures when they are inevitably released:

Google SideWiki Leaks

It makes you glad that America has freedom of the press, even if Britain doesn’t.

Interpolated list class for Python

Posted by – 2009/12/12

While messing with Python to output some better graphs for my Facebook group scraper I stumbled upon an interesting problem. What happens if you have missing samples, or want to change the sampling frequency halfway through your log? Well, you could use a proper math library and have extra dependencies, or generate the missing data manually, but the simplest answer I could think of was to write a list class which interpolates between missing values, so you can do something like this:

>>> a = InterpoList()
>>> a[0]   = 0
>>> a[100] = 200
>>> a[200] = 0
>>> a[50]
100.0
>>> a[125]
150.0

You can then store the time values as seconds past the epoch. Now I can inject data from multiple sources taken at different frequencies and retrieve an interpolated value for any time, which means more nice graphs like this:

As usual I also decided to share it with the Internet, because I’m nice like that! You can get a copy of the InterpoList module here.
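For anyone curious how a class like this could work, here is a minimal sketch. It is not the actual InterpoList module: the class name InterpoListSketch, the sorted-key storage and the choice to raise IndexError outside the sampled range are all my own assumptions. It only does straight-line interpolation between the two nearest stored samples.

```python
from bisect import bisect_left, insort


class InterpoListSketch:
    """Hypothetical stand-in: stores samples and linearly interpolates between them."""

    def __init__(self):
        self._keys = []      # sample positions, kept sorted
        self._values = {}    # position -> stored value

    def __setitem__(self, key, value):
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def __getitem__(self, key):
        if not self._keys:
            raise IndexError("no samples stored")
        # assumption: no extrapolation beyond the first and last samples
        if key < self._keys[0] or key > self._keys[-1]:
            raise IndexError("index outside the sampled range")
        if key in self._values:
            return float(self._values[key])
        # find the stored samples on either side of the requested position
        right = bisect_left(self._keys, key)
        x0, x1 = self._keys[right - 1], self._keys[right]
        y0, y1 = self._values[x0], self._values[x1]
        return y0 + (y1 - y0) * (key - x0) / float(x1 - x0)


a = InterpoListSketch()
a[0] = 0
a[100] = 200
a[200] = 0
print(a[50])    # 100.0
print(a[125])   # 150.0
```

Keeping the keys sorted means a lookup is just a binary search for the neighbouring samples, so it stays quick even with a long log.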

Chart Against The X Factor

Posted by – 2009/12/06

For the last four years running Simon Cowell’s plastic karaoke acts have held the Christmas #1 spot in the UK singles charts thanks to ITV’s hit show The X Factor. People have been complaining that this has ruined the great British tradition of betting on which artist will take the number one slot, as it’s traditionally the only time of year when the chart is dominated by wacky Christmas songs rather than the latest boy bands and whoever else thirteen-year-old girls spend their pocket money on.

I’m not too bothered about popular music, the singles chart or who gets the Xmas #1 slot, but last week I was invited to join a growing group on Facebook who are campaigning to knock the X Factor winner from the top spot by mass purchasing Rage Against The Machine’s classic track Killing in the Name. The sound of rebellion to conquer the airwaves, political rap metal on future Christmas compilation albums, all for the princely sum of 79p? I don’t usually buy digital downloads but this time you can count me in!

According to Sky News the group had 43,000 members sometime on Friday, but by the time I got home on Saturday night there were 180,000 members and rising. As the media coverage increases so do the new members, which made me interested: how does a phenomenon like this evolve, how will it turn out next Sunday? What happens when the UK Charts people decide that it’s against the rules and disqualify the single?

So I decided to log and graph the group’s membership. Every fifteen minutes I grab the page using wget, extract the number of users and dump that into a text file along with the current date and time. Then I cut through it using a couple of awk and sed one-liners, dump the results into an HTML file, graph it using Google Charts and upload the output to my file dump.

Update: These graphs are no longer live! Click for the live versions, which are updated much more often using a different script.

Click for the source data. [Chart: Members per hour]

Here’s the scraping script:

#!/bin/bash
 
cd /home/gaz/ratm/ || exit 1
 
# get the timestamp
timestamp=$(date "+%Y/%m/%d %H:%M:%S")
 
# get the file
wget --max-redirect 2 -O temp.html "http://www.facebook.com/group.php?gid=2228594104" --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
 
# extract user count from the file
usercount=`sed -n -e "s/.* of \(.*\) members.*/\1/p" temp.html`
 
# remove any commas from the string
usercount=${usercount//[,]/}
 
# it must have a length, or it will cause problems when Facebook is having problems!
# in this case, we just give a -1 (not good practice from a stats PoV, but it keeps it simple) 
if [ "${#usercount}" -eq "0" ]
then
    usercount="-1"
fi
 
# remove the temporary file
rm temp.html
 
# write the output in CSV format
echo "$timestamp,$usercount" >> data.dat
 
# next I run the graph generating script

And this one (no longer in use) creates the two charts above from the data:

#!/bin/bash
 
# gets a column from a line of a CSV file. The first index is 1, not 0.
getElement() {
    RESULT=0
    local p="${1}p"
    RESULT=$(echo "$2" | sed 's/,/\n/g' | sed -n "$p")
}
 
# get the start and end times
 
getElement 1 "$(tail -1 data.dat)"
end=$RESULT
getElement 1 "$(sed -n '1p' data.dat)"
start=$RESULT
 
# get the current minimum and maximum values
min=$(cat minval)
max=$(cat maxval)
 
# get the last value
getElement 2 "$(tail -1 data.dat)"
lastval=$RESULT
 
# set new max value
 
if [ "$lastval" -gt "$max" ]
then
    echo "$lastval" > maxval
    max=$lastval
    echo "New maximum, $lastval"
fi
 
# and the new min value
 
if [ "$lastval" -gt 0 ]
then
    if [ "$lastval" -lt "$min" ]
    then
        echo "$lastval" > minval
        min=$lastval
    fi
fi
 
# get values for the Y axis
quart=$((($max - $min) / 4))
q1=$(($min + $quart * 1))
q2=$(($min + $quart * 2))
q3=$(($min + $quart * 3))
 
# extract the data using regexp:
# 1. get every 4th line of the file, meaning hourly
# 2. take all the values from the file
# 3. remove the trailing comma
 
data=$(awk 'NR%4==0' data.dat | sed -n -e "s/.*,\([0-9]*\)/\1/p" | tr "\n" "," | sed -e "s/\(.*\),/\1/")
 
# build the URL to the total members chart
total_members="http://chart.apis.google.com/chart?chtt=Total+Members&chs=600x300&cht=ls&chxt=x,y&chxl=0:|$start|$end|1:|$min|$q1|$q2|$q3|$max&chds=$min,$max&chd=t:$data"
 
# now let's do members per hour
 
lastval=$min
min=0
max=0
data=""
inputList=$(awk 'NR%4==0' data.dat | sed -n -e "s/.*,\([0-9]*\)/\1/p")
while read line; do
    if [ "$line" -gt "0" ]
    then 
        val=$(($line - $lastval))
        lastval=$line
    else
        val=0
    fi
 
    if [ "$val" -gt "$max" ]
    then
        max=$val
    fi
 
    data="$data,$val"
done <<< "$inputList"
 
# remove comma prefix
data=${data#,}
 
# build the per hour chart
members_per_hr="http://chart.apis.google.com/chart?chtt=Members+per+hr&chs=600x300&cht=ls&chxt=x,y&chxl=0:|$start|$end|1:|$min|$max&chds=$min,$max&chd=t:$data"
 
# I then create an HTML file from some templates and upload everything to my dump
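The members-per-hour step is easier to follow outside of bash, so here is a rough Python sketch of the same idea. It is not the script I actually run: the function name members_per_hour and the sample lines are made up for illustration. It also differs from the bash version in one way, which is that the first hourly sample gets a change of 0 rather than being measured from the stored minimum.

```python
def members_per_hour(lines, samples_per_hour=4):
    """Return the change in membership between successive hourly samples.

    `lines` are "timestamp,count" strings in the same layout as data.dat.
    """
    # keep every 4th line, like awk 'NR%4==0' (awk counts lines from 1)
    hourly = lines[samples_per_hour - 1::samples_per_hour]
    deltas = []
    last = None
    for line in hourly:
        count = int(line.rsplit(",", 1)[1])
        if count <= 0:
            # a failed scrape is logged as -1: report no change for that hour
            deltas.append(0)
            continue
        # assumption: the first good sample has nothing to compare against
        deltas.append(0 if last is None else count - last)
        last = count
    return deltas


# made-up sample data, one line per fifteen minutes
sample = [
    "2009/12/06 12:00:00,100", "2009/12/06 12:15:00,110",
    "2009/12/06 12:30:00,120", "2009/12/06 12:45:00,130",
    "2009/12/06 13:00:00,140", "2009/12/06 13:15:00,150",
    "2009/12/06 13:30:00,160", "2009/12/06 13:45:00,170",
    "2009/12/06 14:00:00,180", "2009/12/06 14:15:00,190",
    "2009/12/06 14:30:00,200", "2009/12/06 14:45:00,-1",
]
print(members_per_hour(sample))   # [0, 40, 0]
```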