Chat Channels — get your Skype chats as RSS feeds or web pages

Feb 04, 2006

I had this idea the other day, just out of the blue. What if you could read your Skype chats as RSS feeds?

Short answer: with Chat Channels, you can now. If you just want to get the program, jump to the Installation part. The following blabber is about how I got there.

Well… not really out of the blue. At Skype, we’ve always been toying with the idea of attention management. You already see elements of that in the current interface, such as the „Recent Chats” concept or event management. But a lot remains to be done.

So it really struck me when I listened to this attention podcast. A lot of work is happening in the attention management space there, such as attention.xml and other initiatives. So if we had Skype chats available as RSS feeds and had people read them with an attention-supporting feed reader, we could perhaps track attention for chats, and do a whole lot of fun things based on that. But obviously, before any attention stuff, you should just get the chats into an RSS feed, and that's what this post is about.

Another reason RSS channels are a suitable delivery medium for Skype chats is that we designed the chats to be persistent. That makes them more similar to forums and web sites than to, say, IRC, MSN Messenger, Jabber and those other IM clients where you only get messages while you’re actually connected, and to get the full history, you must have a „bot” that is always connected and collects it. Skype chat content is consistent for all participants, regardless of when they individually connect, so it is much more uniform and coherent because everyone gets the same picture. And you can now deliver this picture to „outsiders” as well through the RSS and web views.

There are other reasons why you would want to have your chat content in a database or available as an RSS feed. You could publish the chat to wider audiences than just the immediate membership – such as streaming the work-related chats straight to your intranet. You could do full-text search across chats. RSS and Atom (it doesn’t really matter which) are becoming the „glue” technologies of the web, enabling easy cross-site interaction, aggregation and content publishing/retrieval. So why not jump on the bandwagon?

Plus, it sounded like fun. I haven’t hacked together something like this for quite a while. I just wanted to see if I could do it and what it would take. (short answer: yes I could, it took about 12 hours of raw coding out of my own time which turned out to be quite fun, and another two hours of whipping up this text.)

Architecture

I figured I would need to structure the app in two basic pieces: one is a „collector” that talks to Skype over the API and pulls the chat messages into a central location, say, the database. The other is the „renderer” that works off the chats that are in the database and displays them as, say, RSS feeds or simple web pages.

I picked Django as both the web frontend and data/object model backend. Why? I’ve played with it and I just like it, mostly for its small code footprint and elegant code. Plus I wanted to learn more about Python. A Skype API Python wrapper was conveniently available, so I just went with that.

As for the data model, I just copied it straight off the Skype API documentation. It’s as simple as can be. There are two types of objects: chats and messages. Both have a bunch of properties. Each message belongs to a chat through a chat ID foreign key relation (more about identifiers later).
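As a rough illustration of that two-object model (not the actual Chat Channels code, and written as plain Python dataclasses rather than Django models), a sketch could look like this. The field names are assumptions based on the properties mentioned in this post (topic, timestamp, author, body); the real model follows whatever the Skype API documentation lists.

```python
from dataclasses import dataclass


@dataclass
class Chat:
    chat_id: str        # the "db" form of the Skype chat ID (see later)
    topic: str = ""
    timestamp: int = 0  # Unix seconds, UTC (assumed representation)


@dataclass
class Message:
    message_id: str     # unique ID constructed by the collector
    chat_id: str        # foreign-key style reference to Chat.chat_id
    author: str
    timestamp: int
    body: str


# Hypothetical sample data: one chat with one message pointing back at it.
chat = Chat(chat_id="(echo123)$pamela;897aefba432c", topic="Test chat")
msg = Message(message_id="0" * 40, chat_id=chat.chat_id,
              author="echo123", timestamp=1139011200, body="hello")
```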

I had some design objectives, which in short, were:

  • the whole thing should be robust. It must allow for data backends to disappear and recover gracefully. (This is the least polished part – when either the MySQL server dies or Skype API connection breaks, you must restart the collector, it doesn’t recover gracefully. The web frontends are pretty robust.)
  • data must be consistent. There cannot be data duplication or loss – output (database contents) must exactly match input (what the Skype client knows about the chats). This must be true (and actually is) across multiple invocations of the collector. (Read below for more – this is only really doable if you’re running the collector on one machine.)
  • the code should be elegant, short, to the point. No duplication and copy-paste.

Now that I had the high-level logic, design objectives, tools and data model in place, it was time to get some messages going.

Implementing the data model, getting the loop going

I figured I should do the collector first, since without any data, it would be pretty much useless to render anything to the web. The collector took the bulk of the work, something like 3 or 4 hours on each of 3 days, so the overall work effort was somewhere around 10 to 12 hours, or one very long working day.

Defining the initial data model in Django was nothing more than copy-pasting the relevant parts of Skype API documentation into Python code and adding some round brackets around it. I had to do a few purges and remakes as I polished the model, but the basic idea took just a few minutes. Then „django-admin.py install chatchannels” and I had the data skeleton in my database together with an autogenerated web-based admin interface, even though there was nothing interesting happening there yet.

So now on to actual data gathering. Connecting to the API and getting the message flow going was pretty straightforward, and the example bundled with the wrapper was quite helpful. I hadn’t really done anything with threads and I’m still not sure how threads, Python and the API wrapper work together, but the wrapper neatly hides any complexity there is. I soon had a message loop going where I was able to trigger API notifications and process the incoming events. From my early-days experience with the Skype API, I remember some messages being processed in synchronous blocking mode – you fired off a request and got a response. In the current Skype API, or at least the Python wrapper, all of the API is asynchronous and nonblocking – you fire off a request, and after a while the response comes, among other messages that are initiated by the Skype client itself. You can tag your requests with an ID and the responses carry the same ID, but as I found, it paid off to treat all the notifications, whether triggered by the collector or by Skype, as equal and not tag them. In places where I needed to simulate blocking API behaviour, I just inserted a loop that waits on a condition (I initially had „pass” as the loop body, but this was very hard on the CPU, so „sleep” made it much friendlier to the computer).
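To make the „simulate blocking with sleep” idea concrete, here is a minimal sketch. It does not use the Skype API wrapper at all: the `wait_for` helper, the `responses` dict and the `fake_api_reply` thread are all hypothetical stand-ins for a notification that arrives asynchronously on another thread.

```python
import threading
import time


def wait_for(condition, timeout=5.0, poll_interval=0.05):
    """Block until condition() is true; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            return False
        # Sleeping yields the CPU; a bare `pass` here would spin at 100%.
        time.sleep(poll_interval)
    return True


# Stand-in for state that an asynchronous notification handler fills in.
responses = {}


def fake_api_reply():
    time.sleep(0.2)               # the "API" answers a little later
    responses["TOPIC"] = "hello"


threading.Thread(target=fake_api_reply).start()
got_it = wait_for(lambda: "TOPIC" in responses)
```

The timeout is an addition of this sketch; without it, a lost response would hang the caller forever.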

Data format magic – chat ID mangling and message ID-s

It now became a bit tricky. Grabbing properties for a chat was easy by knowing its ID, but how do you identify a chat in the database? Chats in the Skype API have a unique identifier that looks something like this: „#echo123/$pamela;897aefba432c”. This is a very cool way to construct an ID, but neither the # nor the / character is „web-safe”, i.e. they cannot be used in a URL as the primary identifier, as # is the page bookmark delimiter and / is the path delimiter, and both would horribly break and confuse Django’s URL mapper. So should I use the boring integer object ID usually found in these homegrown web apps? I decided to still go with the chat ID, but mess around with it a little bit. So I figured there’s going to be a „real” chat ID, which is what the Skype API knows, and a „db” or „web” chat ID, which replaces # with ( and / with ), both of which are OK to use in a URL. (Well, to be pedantically correct you’d want to URL-encode them, but at least they don’t break anything.) I chose them simply because I looked at my keyboard and these were the first URL-safe characters that stood out. And the transformation between „db” and „real” chat ID-s is dead simple.
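The transformation described above can be sketched in a few lines. The function names `to_db_id` and `to_real_id` are made up for this example, and the round trip assumes that a real chat ID never contains parentheses of its own.

```python
def to_db_id(real_id):
    """Convert a Skype chat ID into its URL-friendly "db" form."""
    return real_id.replace("#", "(").replace("/", ")")


def to_real_id(db_id):
    """Convert a "db" chat ID back into the form the Skype API knows."""
    return db_id.replace("(", "#").replace(")", "/")


real = "#echo123/$pamela;897aefba432c"
db = to_db_id(real)      # "(echo123)$pamela;897aefba432c"
back = to_real_id(db)    # the original ID again
```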

I figured you’d never want to collect the data about all the chats you have on Skype – there’s always a limited subset of chats you want to work with in Chat Channels, and others you don’t want to touch. So the first thing the collector does is query the database for chat ID-s to work with, and then it starts collecting data about these. This means that before running the collector, you need to (and I did) insert the „db” versions of chat ID-s into the database.

Notice how little database-related code there is in the collector. No SQL and just a few object calls. Who said you can use Django only in apps that run in a web server and generate pages? Here we use it just as an object mapper, where it shares objects with the web views that we’ll get to later.

So. You now have a bunch of chat ID-s. You can connect to the API, query data about them, such as the topic and timestamp, and drop it back into the database. The next thing you want to do is to collect the messages for all of these chats, query their individual properties and insert them into the database if they’re not already there. Herein was a problem – unlike chats, chat messages in Skype do not have a unique global ID. They have a volatile integer ID that you address them with, but using it would not be really elegant, and if you cleared your history, chances are you’d get a different ID for the same message and would end up with duplicate data. Which would be really evil. So you need a way around that.

I decided to construct my own unique message ID. It’s a SHA-1 hash over most of the message properties (timestamp, author, body etc). It works pretty well for the purposes of this app, but has two problems.

First, it can violate the data consistency design goal if there are two messages posted by the same person with the same body and timestamp. They would then have the same hash and be counted as one, and only one copy of them would end up in the database. In practice, I haven’t seen this kind of message (same poster, body, timestamp) at all, but in some scenarios, especially if some „bot” posts automated data, it may happen, so watch out.

Second, due to the way that Skype chats work over P2P, the timestamps are almost guaranteed to differ across PC-s and across different Skype history instances on the same PC (if the chat is „synced back” to you after you clear history). Timezones are not a problem because the times are normalized to UTC before saving. Instead, this is because the timestamp recorded is a function of poster computer time and delivery time... or something. I’m not sure about the details, but expect the timestamps for the same message to differ across PC-s and users by a few seconds, that’s for sure. This means that if you ran the collector on different PC-s, you’d have different timestamps and thus different hashes for the same message. Umm... I guess the resolution to this, for now, is to run the collector on only one PC for each chat.

Notice how the message status is not part of the message ID and actually isn’t in the data model at all. This is because from a „channel” perspective, it doesn’t really matter if the current user has read the message or not, or whether it is SENDING or SENT. In the end, all the messages will be in the same „read” state. Also, unlike other message properties, the status can change over time, so you’d get two different hash values for the same message, and again you’d have duplicates and be screwed.
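The message ID scheme can be sketched with Python’s standard `hashlib` module. This is an illustration rather than the collector’s actual code: the exact set of hashed properties, their order, and the NUL separator are assumptions, and the function name `message_hash` is made up. The status is deliberately left out, for the reasons given above.

```python
import hashlib


def message_hash(chat_id, timestamp, author, body):
    """Return a SHA-1 hex digest over the immutable message properties."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing the same.
    joined = "\x00".join([chat_id, str(timestamp), author, body])
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


# Hypothetical sample values; timestamps are Unix seconds in UTC.
h1 = message_hash("(echo123)$pamela;897aefba432c", 1139011200, "echo123", "hi")
h2 = message_hash("(echo123)$pamela;897aefba432c", 1139011200, "echo123", "hi")
h3 = message_hash("(echo123)$pamela;897aefba432c", 1139011203, "echo123", "hi")
```

Here `h1` and `h2` are equal, which is the duplicate-detection property, while `h3` differs because the timestamp is three seconds off, which is exactly the cross-PC problem described above.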

The hash also proved useful to serve as bookmark anchor to individual messages in the web view – for example, when you do search and see an individual message and then click on it, you’re taken to the relevant place on the web view of the corresponding chat’s full history.

The hash has its downsides too. You can only know a message’s hash if you know all its properties. So if I wanted to know whether a message was already in the database or not, the only way was to query all its properties from the API, hash them and try to look up the hash in the database. So there are a lot of wasted API calls, especially during initialization. But hey, if you don’t have to pay for them (except with a slightly longer execution time), you can live with it.

So... now I was pulling chats and messages and saving them. It looked fine. But could we make this beast faster, leaner, meaner?

Optimizing

Turns out the first time you write code, at least if you’re as rookie as myself, you end up with horribly inefficient stuff. Rather than trying to get it right the first time, it seemed to make sense to get the functional part right first and then go over and iron things out. There was plenty to iron.

Initially there was no cache of chats or messages – every time I wanted to know if a chat or message ID existed in the database, I’d query the database directly. This was obviously ugly and inelegant. As the chat and message ID-s are immutable, you can just cache them in memory upon initialization for the given chat, and only need to query the database if you have a „cache miss”.
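A minimal sketch of such an ID cache follows. The `IdCache` class and the two callables it takes are hypothetical, and a plain Python set stands in for the database so the example runs on its own.

```python
class IdCache:
    """In-memory set of known IDs with a database fallback on a miss."""

    def __init__(self, load_all, exists_in_db):
        self._known = set(load_all())      # filled once at initialization
        self._exists_in_db = exists_in_db  # consulted only on a cache miss

    def __contains__(self, item_id):
        if item_id in self._known:
            return True
        if self._exists_in_db(item_id):
            self._known.add(item_id)       # IDs are immutable, safe to keep
            return True
        return False

    def add(self, item_id):
        self._known.add(item_id)


# Stand-in "database" plus a log of how often it is actually queried.
db = {"a", "b"}
db_lookups = []


def exists_in_db(item_id):
    db_lookups.append(item_id)
    return item_id in db


cache = IdCache(load_all=lambda: list(db), exists_in_db=exists_in_db)
hit = "a" in cache     # answered from memory, no database query
db.add("c")            # e.g. inserted after the cache was loaded
miss = "c" in cache    # cache miss, falls through to the database once
again = "c" in cache   # now answered from memory
```

Note that IDs which are in neither the cache nor the database still cost one lookup every time they are checked; the sketch only remembers positive answers.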

The other big thing, as I mentioned, was using „pass” instead of „sleep” which drove up the CPU and left less time for the actual processing. „sleep” frees the CPU up for the indicated period and is fine when you need to simulate a blocking API.

Now I had the collector done and it was on to the web view.

Web views, search and RSS renderer

Compared to the collector, doing the web views was pure fun – much simpler, easier and quicker – literally just an hour or so. There’s thus also much less to document. There are three main views – the chat index (with possible search results), the individual chat web view, and the RSS feed for a chat (actually Atom – the RSS generator wants you to have an email address for each message author). Nothing to say about these – just read the code. I tried to use Django’s „high-level” RSS framework to generate the feeds, but became frustrated and found it much easier to just make a custom view and use the „low-level” feed generator, which was actually much cleaner.
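For readers who want to see what a per-chat Atom feed boils down to, here is a standalone sketch. It deliberately swaps Django’s feed generator for the standard library’s `xml.etree.ElementTree`, so it is not how the actual view works. The `render_atom` function, the message dict keys and the example URL are all assumptions; each entry links to the chat page with the message hash as the bookmark anchor.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"


def rfc3339(unix_seconds):
    """Format a UTC Unix timestamp in the date format Atom uses."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).isoformat()


def render_atom(chat_url, topic, messages):
    """Build an Atom document for one chat and return it as a string.

    messages: list of dicts with the hypothetical keys
    'hash', 'author', 'timestamp' (Unix seconds, UTC) and 'body'.
    """
    ET.register_namespace("", ATOM_NS)  # serialize without an ns0: prefix

    def sub(parent, tag, text=None, **attrs):
        element = ET.SubElement(parent, "{%s}%s" % (ATOM_NS, tag), attrs)
        element.text = text
        return element

    feed = ET.Element("{%s}feed" % ATOM_NS)
    sub(feed, "id", chat_url)
    sub(feed, "title", topic)
    newest = max((m["timestamp"] for m in messages), default=0)
    sub(feed, "updated", rfc3339(newest))

    for m in messages:
        entry = sub(feed, "entry")
        permalink = "%s#%s" % (chat_url, m["hash"])  # hash as page anchor
        sub(entry, "id", permalink)
        sub(entry, "title", m["body"][:60])
        sub(entry, "updated", rfc3339(m["timestamp"]))
        sub(entry, "link", href=permalink)
        author = sub(entry, "author")
        sub(author, "name", m["author"])  # Atom needs a name, not an email
        sub(entry, "content", m["body"], type="text")

    return ET.tostring(feed, encoding="unicode")


xml_text = render_atom(
    "http://your_django_host/chatchannels/example-chat/",
    "Test chat",
    [{"hash": "abc123", "author": "echo123",
      "timestamp": 1139011200, "body": "hello"}],
)
```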

I don’t even pretend the web views are anywhere near complete. This is why they look so undone – because they are. You could do many more fun things with it – limit search to posters, dates and any other criteria, have nicer cosmetics etc, but I just wanted to get the basics and RSS going, so if you need anything more, code it away yourself.

Installation and using

You need to complete roughly the following steps to get this thing working and start viewing your Skype chats as RSS and web pages and searching across them. Note that you can run the collector and web parts on different machines. The collector only runs on Windows; the web parts should run anywhere that Django and its data backends run. (I’ve only tested all of this on Windows.)

On both:

  1. install Python. For Windows, ActivePython is good. I used ActivePython 2.4.2.
  2. install Django. I used v 0.91.
  3. grab the chatchannels zip
  4. copy „chatchannels” into MYPROJECT/apps (MYPROJECT is your Django project – it can obviously be called anything else)

Where the collector runs:

  1. install Skype API Python wrapper and its dependencies
  2. put the chatchannels-collector.py somewhere where it is convenient to run it.
  3. run „django-admin.py install chatchannels” to initialize the database schema.
  4. UPDATED: you need to make sure that the collector is talking to the Skype API using protocol 5. This is not included in my ZIP because I patched skypeserver.py to send a "PROTOCOL 5" request on initialization and initially forgot about it. Either do the same to skypeserver.py, or add a trigger_get_property request that sends "PROTOCOL 5" somewhere before the chat initialization.

Where the web views run:

  1. copy „templates/chatchannels” into TEMPLATE_DIRS
  2. add „(r'^chatchannels/', include('myproject.apps.chatchannels.urls')),” into MYPROJECT urls.py
  3. review that all the urls are good

You should now go to http://your_django_host/admin/chatchannels/ and you should see the empty chats list. Add the chats here that you want to work with – click „Add chat” and enter the „db name” of the relevant chat. To get the chat name, open the relevant chat in Skype and enter „/chatname”. It tells you something like #echo123/$pamela;897aefba432c, so enter (echo123)$pamela;897aefba432c into the chat name field. You don’t need to enter anything else, just hit Save.

After entering the chat ID-s, run the collector and it should start collecting all the chat data. If all is good, just leave it running so that all new messages are also automatically and immediately collected, and you’re done with the collector – it works now.

Now you can open http://your_django_host/chatchannels/ and you’ll see a list of the chats. You can view them as a web page or an Atom feed, or search across them.

Disclaimer and final words

Although I work for Skype, this is not an official Skype product and is not supported by Skype – it is purely an off-hours experiment for now. If you do anything with it and find it useful or rather think anything of it at all, drop a comment here.

This is not really tested -- I've only checked it works on one machine :) But I don't see any machine dependencies here, other than depending on win32 so you have to be on Windows, so it should be fine.

I don’t really have any further plans with Chat Channels for the time being beyond posting it here and using it as is for myself, so if you find it useful, feel free to use the original or modified code in your own projects, and drop me a line.