The ICM-GB Project

The Interactive Chromatin Model (ICM) is developed by Dr. Tom Bishop at LATech. ICM is an interactive tool that allows users to rapidly assess nucleosome stability and fold DNA sequences into putative chromatin templates. We recently collaborated on an NIH proposal to transform ICM into a more advanced, high-performance epigenetic analysis application. We propose to integrate existing bioinformatics tools (genome browsers) and existing physics-based 3D models of chromatin (ICM-Tk) with distributed cyberinfrastructure through science gateway technology.

One specific aim of this proposal is ICM-GB: the web application interface that connects ICM, genome browsers (GB), and high-performance computing (HPC) resources.

Strategy

Web-based genome browsers such as JBrowse, GenomeMaps, and BioDalliance are gateways to genome projects such as the UCSC Genome Browser or Ensembl. They are built on current web technologies: HTML5, CSS, JavaScript, AJAX, and SVG. They integrate data from a wide variety of sources, either from data exchange servers via AJAX or directly from popular genomics file formats via the FileReader API. These genome browsers let users browse large amounts of data and retrieve data of interest with very short response times. The data servers return results in standard XML or JSON formats, so the genome browser can integrate the annotations and render them in generalized graphical or tabular form directly in the web browser using HTML5 canvas and SVG. SVG and canvas rendering in most web browsers (Google Chrome 20+, Apple Safari 5+, Opera 11+, Internet Explorer 10, and Mozilla Firefox 16+) can reach thousands of elements per second, which results in smooth, dynamic user interaction.

ICM-Tk will have no inherent ability to retrieve data from isolated data sources that use different data exchange protocols, nor to manage large amounts of data. To bridge the generalized genome browser, which has outstanding data access capability, and ICM-Tk, which will not, we will implement a web application gateway that connects the embedded genome browser to ICM-Tk for structure generation and uses Jmol for 3D molecular display.

Our ICM web application will be developed using the Sencha Ext JS 4 framework. Sencha provides development tools and services for building, managing, and deploying powerful, cross-device web applications. Internally, Sencha uses a combination of HTML5, CSS, SVG, and JavaScript, just like a native web application. BioDalliance, JBrowse, and GenomeMaps are open-source web-based genome browsers implemented with the same technologies, so they can coexist with ICM-Tk in our web application.

To implement the sequence data selection interface, a transparent data-picking layer will be added on top of the genome browser using JavaScript. This transparent layer accepts a preset mouse-key combination event for data picking and draws a selection window mask directly on top of the embedded genome browser. The selection window retrieves the data of interest from the sequence data that the genome browser has already downloaded through AJAX or loaded locally through the FileReader API. The selected sequence data will be encoded in a standard data format such as JSON, cached internally by the interface, and transferred asynchronously to ICM-Tk for computation upon request.
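The encode-and-cache step described above can be sketched as follows. The interface itself would be written in JavaScript; Python is used here only to show the shape of the payload, which is language-independent. The field names (`chrom`, `start`, `end`, `sequence`), the cache key format, and the half-open coordinate convention are all assumptions made for illustration and are not defined by ICM-Tk.

```python
import json
from typing import Dict

# Hypothetical in-memory cache keyed by "chrom:start-end" (illustrative only).
_selection_cache: Dict[str, str] = {}


def encode_selection(chrom: str, start: int, end: int, sequence: str) -> str:
    """Encode a picked region as a JSON string and cache it.

    Assumes 0-based, end-exclusive coordinates, so the sequence length
    must equal end - start.
    """
    if end - start != len(sequence):
        raise ValueError("sequence length does not match the selected range")
    payload = json.dumps(
        {"chrom": chrom, "start": start, "end": end, "sequence": sequence}
    )
    _selection_cache[f"{chrom}:{start}-{end}"] = payload
    return payload


def cached_selection(chrom: str, start: int, end: int) -> dict:
    """Return a previously cached selection as a dict (KeyError if absent)."""
    return json.loads(_selection_cache[f"{chrom}:{start}-{end}"])
```

Caching the serialized string rather than the object means the same bytes can later be sent to the server without re-encoding, at the cost of a parse when the interface needs to read the selection back.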

As ICM-Tk (Aim 1) and the distributed computing infrastructure capabilities (Aim 3) are developed, the user interface will be updated accordingly to ensure correct tool configuration and resource connection. The web application will provide the information and data visualization needed to support the user experience, e.g. energy level representation, chromatin placement, and computing statistics. Sencha Ext JS offers flexible layout configuration and a wide range of user interface widgets, such as scalable grids, trees, menus, forms, and tabs, which make UI design more efficient. For 2D visualization, we will use the SVG visualization library D3. If 3D visualization is required, we will use the web graphics technology WebGL and the JavaScript libraries that abstract it.

Planning for interface design

  1. Two layers of genome browser: the first is a normal browser for choosing a segment of the genome (0 to 10,000); the second is ICM-specific and includes 6 tracks of translation/orientation data from icm.par, 1 track of energy data from E.dat, and 1 track of folding position data from position.dat.
  2. Buttons: the ‘make default’ button starts with default parameters and creates a default position track. A corresponding position widget is created for users to manipulate; the ‘make custom’ button then sends the new position data to the server and folds with these positions. For the interface, Tom wants the user to be able to choose a nuc dat file for each position, which means the length of each nuc can differ (right now 147). This will not be used to fold the chromatin yet, so it is only for future use. Right now only the position (start, end) can be applied.
  3. Default nuc length: 147. Default minimum separation between nucs (linker?): 20. Both can be changed in a global configuration panel later.
  4. Nuc widget: a slider whose length can change when a nuc dat file is chosen. Widgets can be moved around but cannot overlap one another and must maintain the minimal separation. Nuc widgets can also be added or deleted.
  5. Temperature and Occupancy:
    • Temperature: used for free DNA; temperature = 0 means very straight DNA.
    • Occupancy: determines how many nucs to place in the sequence, evenly distributed. If Occupancy is set to 0, the nucs are packed as tightly as possible, leaving only the minimal separation (20) between two nucs (147).
  6. computation on the server:
    • run-icm.tcsh : use default parameters to create XYZ
    • run-fold.tcsh : use input positions to create XYZ
    • mkBigWigs.tcsh icm.par : create bigwig data from par data
    • run-minimize.tcsh : minimize....
  7. files: (click ‘get all data’ to show the Files)
    • icm.occ.xyz : folded XYZ
    • E.dat : energies
    • seqin.txt : sequence
    • icm.par : folded helical parameter data
    • position.dat : nuc positions. Note: this file doesn’t exist yet, possibly because the nuc positions are just evenly distributed, controlled by parameters such as occupancy?
    • icm.dat : look at the third column; if it’s 0, the DNA is free, otherwise folded.
  8. Sessions:
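
The spacing rules from items 3 to 5 above (nuc length 147, minimum separation 20, no overlapping widgets, and the tightest packing used when Occupancy is 0) can be sketched as below. This is a minimal illustration, not the ICM-Tk algorithm: the function names are hypothetical, coordinates are assumed to be 0-based with an exclusive end, and the first nuc is assumed to start at position 0. How non-zero Occupancy values map to spacing is not specified in these notes, so it is left out.

```python
from typing import List, Tuple

NUC_LENGTH = 147      # default nuc length from the notes
MIN_SEPARATION = 20   # default minimum separation (linker) from the notes

Position = Tuple[int, int]  # (start, end); 0-based, end exclusive (assumption)


def pack_positions(seq_length: int,
                   nuc_length: int = NUC_LENGTH,
                   min_sep: int = MIN_SEPARATION) -> List[Position]:
    """Place as many nucs as fit, leaving exactly min_sep between neighbours."""
    positions: List[Position] = []
    start = 0  # assumption: the first nuc starts at the first base
    while start + nuc_length <= seq_length:
        positions.append((start, start + nuc_length))
        start += nuc_length + min_sep
    return positions


def is_valid_layout(positions: List[Position],
                    seq_length: int,
                    min_sep: int = MIN_SEPARATION) -> bool:
    """Check that nucs stay inside the segment, do not overlap, and keep
    at least min_sep between neighbours. Nuc lengths may differ."""
    ordered = sorted(positions)
    for start, end in ordered:
        if start < 0 or end > seq_length or end <= start:
            return False
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end < min_sep:
            return False
    return True
```

For the 10,000-base segment mentioned in item 1, `pack_positions(10000)` yields 60 positions, the last being `(9853, 10000)`. A widget move or a ‘make custom’ request could be rejected whenever `is_valid_layout` returns `False` for the resulting positions; because the check reads each nuc's own start and end, it would keep working if per-position nuc lengths are introduced later.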