Monday 28 May 2012

Protein Interactions

Hello there...

This one took a little longer than expected, but here I am with another post.

This time it is about a collaboration with some of the guys in my lab: between ideas, data and some work, we have created a neat web interface to visualize protein-protein interactions. For now it uses Tuberculosis (TB) data, but the idea can be applied to any other protein interaction dataset. This was done in about a week, so don't expect anything more than a prototype.

So, two weeks ago in our weekly seminar, one of the other CBIO students was presenting something about the interactions between TB and human proteins. This is an approach used to try to understand how the mycobacterium affects the normal behaviour of a human, or something like that :-P

Anyway, she was presenting these graphs representing the interactions, and some questions came up about the surrounding proteins, how those interact, and whether there is any correlation between the unconnected parts of the graph. Obviously I was not understanding much of it, but I thought that an interactive visualisation of that data might help these clever people to answer their questions.

The graphics that my friend was showing were generated by Cytoscape, a very well-known tool for this kind of data. I don't know many details of this tool; I vaguely remember playing with it a while ago, so to be honest I cannot say much about its strengths or weaknesses. The only thing was that at that moment it was not available, and therefore the valid biological questions were not answered in the seminar room.

Part of the discussion was about the available visualisation tools for this data and how useful (or not) they are. I thought that a web visualisation of this should be not just possible but simple enough for me to implement, obviously hoping to find a generic graph library. And once the data is displayed, it would be possible to play with the representation using web techniques.

After the talk a very enthusiastic friend came to me and started talking about the same idea, plus he actually knew a couple of libraries to display networks: protovis and Jit. I also did my homework and found out that protovis has become d3, and found another option as well, called arbor.

Before seeing those examples, my worry was finding an algorithm to organize the layout of the nodes in the network, to avoid overlapping and to show the network in a nice way. But if you check those examples, they are using a very cool algorithm called force, which pretty much simulates repulsion forces among nodes and tension forces along the links, and it does it on the fly, creating a very cool effect.
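To make the idea of a force layout a bit more concrete, here is a minimal Python sketch. It is not the code that d3, protovis or arbor actually run, just a toy version of the same principle: every pair of nodes pushes each other apart, every link behaves like a spring, and positions are nudged a little on each iteration. The function name and all the constants (`repulsion`, `spring`, `rest_length`, `step`, `max_move`) are assumptions chosen for illustration.

```python
import math
import random


def force_layout(nodes, links, iterations=200, repulsion=0.05,
                 spring=0.1, rest_length=0.3, step=0.05,
                 max_move=0.1, seed=0):
    """Toy force-directed layout; returns {node: [x, y]}.

    nodes is a list of node ids, links is a list of (a, b) pairs.
    All constants are illustrative assumptions, not library defaults.
    """
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes each other apart (~ 1/d^2).
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion / (dist * dist)
                fx, fy = push * dx / dist, push * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy
        # Tension: each link is a spring pulling towards rest_length.
        for a, b in links:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = spring * (dist - rest_length)
            fx, fy = pull * dx / dist, pull * dy / dist
            force[a][0] += fx
            force[a][1] += fy
            force[b][0] -= fx
            force[b][1] -= fy
        # Move each node a small, capped amount along its net force.
        for n in nodes:
            for axis in (0, 1):
                move = step * force[n][axis]
                pos[n][axis] += max(-max_move, min(max_move, move))
    return pos
```

The pairwise repulsion loop is quadratic in the number of nodes, so this naive version gets slow quickly on large networks.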

While I was busy playing with different libraries, my friend got some of the data and created a dummy example using protovis. One of the ways to input data into protovis is to have a JSON file with a particular structure. So what my friend did was to write a Python script to convert a subset of the data into JSON, and use that to create the graph. Simple and nice, as I like it! Below is an image of a part of the generated graph, and here is the link for you to see the example.

15 - Network of 2000 interacting human proteins.

But here is our first real challenge: that was a subset of 2000 proteins, and its performance is very poor, as you can see. Clearly we were asking too much of the browser's SVG engine. The big problem is that the original file has over 300000 interactions; we were not even using 1% of the data and we were already reaching the limits of the browser.
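For reference, the kind of conversion script described above could look roughly like the sketch below. This is not my friend's actual script: the input format (one tab-separated line per interaction with two protein ids and a score) and the key names in the output (`nodeName`, `source`, `target`, `value`) are assumptions, modelled on the usual nodes-plus-links shape that force layout examples consume.

```python
import json


def interactions_to_graph(lines):
    """Convert 'proteinA<TAB>proteinB<TAB>score' lines into nodes and links.

    The input format and the output key names are assumptions.
    """
    index = {}   # protein id -> position in the nodes list
    nodes = []
    links = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        a, b, score = line.split("\t")
        for protein in (a, b):
            if protein not in index:
                index[protein] = len(nodes)
                nodes.append({"nodeName": protein})
        # Links refer to nodes by their index in the nodes list.
        links.append({"source": index[a],
                      "target": index[b],
                      "value": float(score)})
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    # Made-up ids and scores, only to show the shape of the output.
    sample = ["PROT_A\tPROT_B\t0.9", "PROT_A\tPROT_C\t0.4"]
    print(json.dumps(interactions_to_graph(sample), indent=2))
```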

In web development there is a pattern that I have followed for a while: try to make it happen in the client, and if that is not possible, simply do it on the server. OK, that's not exactly a massive piece of wisdom, especially when in web development there are virtually only those two options: server or client. However, I know of many developers who insist on doing stuff on the server that is now totally possible in the client, where the latter gives lots of potential advantages, especially from the usability point of view.

So what I did was to create a Solr server with the protein data. This opens the possibility of doing interesting queries with very short response times, and it is really easy to implement. A parenthesis here: if you are a developer and you don't know what Solr is, do yourself a favour and go learn about it; it is totally worth the effort.

Back on track: for this proof of concept I am using a very simple dataset of just over 66000 interactions, each of those being the 2 interacting proteins and a score of how reliable this information is. In conclusion, for now it is only possible to query by protein or score, but the potential of Solr is there and it is just a matter of putting in more info, such as annotations for each protein.
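To give an idea of what querying by protein or score means in practice, here is a small Python sketch that builds a Solr select URL. The `/select` handler, the `q`, `wt` and `rows` parameters and the `[x TO *]` range syntax are standard Solr; the field names (`protein1`, `protein2`, `score`) and the base URL in the usage note are assumptions, since the post does not show the real schema.

```python
from urllib.parse import urlencode


def build_query(protein=None, min_score=None):
    """Build a Solr query string; the field names are hypothetical."""
    clauses = []
    if protein is not None:
        # The protein can appear on either side of an interaction.
        clauses.append(f"(protein1:{protein} OR protein2:{protein})")
    if min_score is not None:
        # Open-ended range query: score >= min_score.
        clauses.append(f"score:[{min_score} TO *]")
    return " AND ".join(clauses) if clauses else "*:*"


def build_select_url(base_url, query, rows=100):
    """Return the URL of a JSON select request against a Solr core."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base_url}/select?{params}"
```

For example, `build_select_url("http://localhost:8983/solr/interactions", build_query("Q10387", 0.5))` would ask a (hypothetical) local core for the interactions involving Q10387 with a score of at least 0.5.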

There is a nice JavaScript library to deal with Solr queries called AJAX Solr. I have used it before in other projects, so I was familiar with it; besides, it is easy to use and has a well-organized code structure. Its concept is basically to have widgets that react the moment a Solr response is caught. Moreover, those widgets are also able to execute new queries.

So what I did was to create a widget for AJAX Solr that uses d3 to generate the graph based on a Solr response. For example, if I query for all the interactions where the protein with UniProt ID Q10387 is present, I get 45 interactions, so the widget creates nodes for each protein and links for each interaction and just lets d3 do its job!
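The real widget is JavaScript, but the transformation it performs is simple enough to sketch in Python. A Solr JSON response keeps its hits under `response.docs`; the sketch walks those documents, creates one node per distinct protein and one link per interaction, and flags the proteins that were explicitly queried. The document field names (`protein1`, `protein2`, `score`) and the `queried` flag are assumptions, not the project's actual schema.

```python
def solr_response_to_graph(response, queried=()):
    """Turn a parsed Solr JSON response into (nodes, links).

    Assumes each doc has 'protein1', 'protein2' and 'score' fields.
    """
    docs = response.get("response", {}).get("docs", [])
    index = {}   # protein id -> position in the nodes list
    nodes = []
    links = []
    for doc in docs:
        a, b = doc["protein1"], doc["protein2"]
        for protein in (a, b):
            if protein not in index:
                index[protein] = len(nodes)
                # Flag queried proteins so they can be drawn differently.
                nodes.append({"name": protein,
                              "queried": protein in queried})
        links.append({"source": index[a],
                      "target": index[b],
                      "score": doc["score"]})
    return nodes, links
```

Apart from Q10387, any ids used to try this out would have to be made up; the 45 interactions mentioned above would simply produce 45 links.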

The rest was just to adapt some of the widgets that the AJAX Solr tutorial has, like the autocomplete or the list of current queries. Here the main change is that the tutorial joins the queries as a conjunction, I mean using AND operators: query1 AND query2 AND query3. But in our case we need disjunctions (OR), and that changes the logic there a little, but it is no big deal.

So here is how it looks, and if you want to play with it you can go to this URL: http://biosual.cbio.uct.ac.za/interactions/. We have many ideas to implement here, and for that we need to put more data on the server; the way of drawing the graph can also be improved. However, we think this is a very nice starting point.
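The difference between the two ways of combining queries fits in a few lines. The following Python sketch (the function name is hypothetical, and it is not taken from AJAX Solr) joins a list of individual queries with either operator, wrapping each one in parentheses so the grouping stays unambiguous:

```python
def join_queries(queries, operator="OR"):
    """Combine individual Solr queries with AND or OR."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    cleaned = [q.strip() for q in queries if q.strip()]
    if not cleaned:
        return "*:*"   # Solr's match-everything query
    return f" {operator} ".join(f"({q})" for q in cleaned)
```

With AND every added protein narrows the result set, usually down to nothing, while with OR every added protein brings its own interactions into the graph, which is what a network viewer needs.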

16 - TB protein-protein interactions viewer

And you might have noticed I didn't put any of the actual project code in this post. Oh well, if you want some code, the good news is that, because this is a collaborative project with my friends at the lab, we set up a Google Code repository, so you can go and get all the code, and every time any of us makes an improvement it is going to be there for you: http://code.google.com/p/biological-networks/

Chao gente!!!

PS: After having developed the prototype, a friend told me about an HTML5 library from Cytoscape. I guess I didn't do my state-of-the-art homework as well as I thought, so I am going to have to check it and see if it is worth using instead of d3.

6 comments:

  1. Hi there,

    Long time reader, first time poster. I don't know if it'd be useful to the biologists, but wouldn't it be cool if you could pick two random proteins and get the shortest path(s) between them... Does it already do that?

    Keep up the good work!
    Dane.

    1. Hey Dane,
      Thanks for reading! I've thought about it... and what I think might be useful is to do that on the whole dataset, but that's not going to happen in the client, and Solr is smart for queries but not that smart, at least not that I know of.
      It should be relatively easy to implement in the client a simple version of Dijkstra's algorithm to work with the loaded proteins. I suppose that might be of interest, don't you think?
      Another guy in the lab promised to give use case scenarios for this kind of data, so I will mention it to him and see if this idea makes any biological sense!
      Cheers!

  2. Hey Tavo,

    Long time follower and first time poster (like Dane).. I know I can read up and find out myself but I'm being lazy.

    I was just wondering what is used as the basis for the protein-protein interaction and what determines the score and reliability of the scoring?

    I also like Dane's idea, similar to that app where you can pick any two organisms and it will tell you how evolutionary related they are.

    Ciaobye,
    Nat

  3. Hey Nat,

    Thanks for reading... the scores I'm using come from a compendium file that Gaston created using Kenneth's MSc project, where he compiles interaction information from different sources; each one has its own score and there is an aggregate score. I don't have much information about it, but I know the different scores are: neighborhood, fusion, cooccurrence, txt-mining, microarray, similarity, domain, experimental, knowledge, pdb, interlogs.
    Where he got it from and how it is calculated is out of my league!! :-)

    About Dane's idea, do you think it is useful even if it is just using the proteins that are loaded on the client??
    Cheers,
    Tavo.

  4. Hey Tavo,

    The reason I was asking about the scores ties in with Dane's question in a way. It depends on the question you're asking.

    To know the shortest path between two proteins is just pretty cool. :) But if you have any kind of hypothesis prior to finding the shortest path between two seemingly random proteins, then the information can serve as a guideline. This brings me back to my initial question on what the scores are. If neighbourhood scores highly then it may suggest that experimental may (I have no idea if this is actually the case). So it depends on the interpretation of the figures, for instance there may be something to be taken from the ratio of that which makes up the score for each interaction.

    I don't know if that makes sense. This is always easier over a jug.

    Ciaobye,
    Nat

  5. Cool Nat, honestly that's the kind of scenario I'm looking for here. For now I'm just playing with the library because it looks cool, but the whole point is to find cases where it is useful.
    Today I updated http://biosual.cbio.uct.ac.za/interactions/ and it now colours the requested proteins differently and paints in red the paths between two requested proteins.
    I'm thinking of developing a panel to create rules like:
    Highlight the paths of the interactions with score > 0.5
    Hide the proteins with a single interaction
    etc.
    Maybe that's the next blog post!! :-)
