BioCloud: Evaluating Google App Engine for Bioinformatics

Introduction
As research and other scientific inquiry produce more and more information, it becomes more difficult to store the data and disseminate it to others. Cloud computing could offer an effective solution to this information-sharing problem. In cloud computing, all relevant information is stored and made available on the Internet. Data stored on the Internet is accessible to anyone who needs it at any time. Cloud computing not only addresses the problem of sharing information; it also alleviates the cost of storing massive amounts of data on one’s own computer. Since everything is stored online, users won’t have to fill up their own hard drives with data. Cloud computing is a fairly new idea, but the demand for it will increase as the flow of data continues to grow. Andre Mendes, chief information officer of the Special Olympics, argues that cloud computing helps abstract another layer of complexity away from the organization, letting it concentrate on providing value at the higher layers (Synder). We aim to experiment with cloud computing for bioinformatics.

Background and Motivation
A limited number of cloud computing services are available to the public. Google offers the Google App Engine, a free service that allows you to build and maintain applications on Google’s infrastructure. The App Engine allows your application to run reliably under heavy traffic and with large amounts of data. It also provides an application environment that purports to make it easy to build an application. During development, an application is run and tested locally on the programmer’s computer, which may cause the program to run slowly. Once the program is finished, it can be uploaded to the Google App Engine, where it should run much more quickly. One feature of the App Engine, the Datastore Application Programming Interface (API), allows users to store data on the App Engine. It also provides a query engine that allows a user to search the stored data, and transactional storage that allows data to be stored through multiple transactions instead of all at once. Every object in the datastore must be stored as an entity, where each entity has zero or more properties. We chose Google because the App Engine is free and is programmed in Python, a language that is steadily growing in popularity. We were also interested in learning the capabilities and limitations of the App Engine. It would give us hands-on experience with cloud computing and show whether it is a sufficient solution to the information-sharing problem.
The biology research at the Center for Coastal Margin Observation and Prediction (CMOP) on the Columbia River water system involves many processes and techniques. First, water is collected from some part of the water system and run through a filter to capture any organisms or particles that may be in the water. The filters are then brought back to the lab, where DNA and RNA are extracted. From the DNA, researchers can tell what organisms were present in the water sample. From the RNA, they can tell what those organisms can do, based on the genes that are expressed. After extraction, the DNA and RNA are sent to a separate lab at Washington University, where they are sequenced. Sequencing uses biological methods to determine the order of nucleotide bases in a strand of DNA or RNA. Washington University sends the digital sequences back to CMOP as FASTA files via email. Each file contains the plate number of the sample that was sent, the date when the sample was sequenced, and the sequence itself. Once CMOP has had a chance to look over the sequences, they are sent by mail to Pacific Northwest National Laboratory (PNNL), where a scalable implementation of BLAST (Oehmen) compares them to thousands of other sequences in a database to find any possible matches. The possible matches are posted on a website in the form of a hit table to be reviewed by the CMOP staff. In the hit table, each row represents a BLAST result that matches the sequence sent to PNNL. Each column represents an attribute of the match, such as the subject, gap openings, and mismatches between the sequence sent and the BLAST result. This process has been only partially managed, leaving copies of various files in multiple places. By implementing a cloud computing environment like the Google App Engine, we can provide a central workspace for all steps in the process.
As CMOP continues to collect data about the Columbia River, a reliable and efficient means of managing that data is becoming more necessary. Using the Google App Engine, we have created an application to collect, store, and organize CMOP’s continuous flow of biological information online. Once online, the data is easier for users to access. Users do not have to store data locally on their own hard drives and run the risk of losing all of their information if their computer crashes. Storing the data online will let everyone work from the same copy and will eliminate person-to-person transfer of information. Users will no longer have to wait on colleagues to email them a particular batch of data; they can simply access it from their own desktop. When data is emailed, there is a risk that the recipient cannot open it because the file is not compatible with their system. A cloud computing environment can potentially improve the quality of information sharing and handling at CMOP.

BioCloud Initial Design
Our goal is to build an application that is interactive and user-friendly and that helps improve the data flow at CMOP using the Google App Engine. BioCloud includes features that help collect, store, and organize information. One feature allows users to upload files containing hit tables and store each match in the table as an entity in the Google datastore. The program prompts the user to upload a file containing a hit table. It then splits the file on each newline, dividing the table into rows. Next it loops through the rows, and each one becomes a Match object. A Match object’s properties are the column headers of the table. Once the Match object has been created and values have been assigned to each property, it is stored in the datastore. The number of entities that were stored is then displayed on the screen. Once the hit tables have been uploaded, another feature allows users to search through the Match entities that have been stored. The user is presented with an HTML form that contains a text field for every property of a Match object. In each field the user enters an inequality expression for the search they want to perform. The application returns all the Match objects that meet the criteria of the user’s search. For example, if the user enters ‘>50’ in the gap openings field, BioCloud returns all the Match entities with more than 50 gap openings.
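The upload and search steps described above can be sketched in plain Python. This is a minimal illustration, not the BioCloud code: it assumes the hit table is tab-delimited with its column headers in the first row, it keeps the Match objects in an ordinary list instead of the datastore, and the names parse_hit_table, parse_inequality, and search are hypothetical.

```python
import operator
import re
from typing import Callable, Dict, List

# A Match is modeled as a dict whose keys are the table's column headers.
Match = Dict[str, str]

def parse_hit_table(text: str, delimiter: str = "\t") -> List[Match]:
    """Split the table on newlines; the first row is assumed to hold the headers."""
    rows = [line for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    headers = [h.strip() for h in rows[0].split(delimiter)]
    matches = []
    for row in rows[1:]:
        values = [v.strip() for v in row.split(delimiter)]
        matches.append(dict(zip(headers, values)))
    return matches

_OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}

def parse_inequality(expression: str) -> Callable[[float], bool]:
    """Turn a form entry such as '>50' into a test on a numeric value."""
    found = re.fullmatch(r"\s*(>=|<=|>|<|=)\s*(-?\d+(?:\.\d+)?)\s*", expression)
    if found is None:
        raise ValueError("not an inequality expression: %r" % expression)
    compare = _OPERATORS[found.group(1)]
    bound = float(found.group(2))
    return lambda value: compare(value, bound)

def search(matches: List[Match], field: str, expression: str) -> List[Match]:
    """Return the matches whose value in `field` satisfies the expression."""
    test = parse_inequality(expression)
    return [m for m in matches if test(float(m[field]))]

# Illustrative data only; real hit tables have more columns.
table = "subject\tmismatches\tgap openings\nA\t3\t60\nB\t0\t12\n"
stored = parse_hit_table(table)
print(len(stored), "entities stored")
print(search(stored, "gap openings", ">50"))
```

In the real application the list would be replaced by datastore entities and the filter by a datastore query, which is where the restrictions discussed later apply.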
Perhaps the most important feature of the application would allow users to blast their own sequences instead of sending them to PNNL. To help with this feature, we imported the Bio module from the Biopython library, which contains functions for handling FASTA files and blasting sequences. The user is prompted to upload a zip file containing only FASTA files. For each file that is pulled from the zip file, a File entity is created. Each File entity contains a name property and a content property and is stored in the datastore. After uploading the files from the zip file, the application displays each file on the screen with a checkbox next to it. The user then selects which files contain the sequences they want to run a BLAST search on. When the user clicks submit, the selected sequences are sent to the National Center for Biotechnology Information (NCBI) website, and the results of the BLAST algorithm are returned to the screen for the user to peruse.
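The zip upload step can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in for the design above: it uses zipfile in place of App Engine's upload handling, a hand-written FASTA reader in place of Biopython, and a dict of name to content in place of File entities. The function names and the sample header text are hypothetical.

```python
import io
import zipfile
from typing import Dict, List, Tuple

def read_fasta_files(zip_bytes: bytes) -> Dict[str, str]:
    """Return {file name: file content} for every file in the uploaded archive."""
    files = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip folder entries, keep only files
            files[info.filename] = archive.read(info).decode("utf-8")
    return files

def parse_fasta(content: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs; a header line starts with '>'."""
    records = []
    header = None
    chunks: List[str] = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Build a small archive in memory to stand in for the user's upload.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plate01.fasta", ">sample_1\nACGT\nTTGA\n")

uploaded = read_fasta_files(buffer.getvalue())
for name, content in uploaded.items():
    print(name, parse_fasta(content))
```

Each (name, content) pair corresponds to one File entity in the design; the parsed records are what would be offered to the user for a BLAST search.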

Reality Check
As we began to implement the programming necessary to achieve the goals of the initial design, we discovered some of the limitations of the Google App Engine. First, when trying to upload hit tables, we found that the application was taking too long to upload Match entities. After testing this several times, we concluded that the delay arose because the hit tables contained thousands of records and the program took a while to store each one in the datastore as a Match entity. When we tried a hit table with only 100 records, the application took only a few seconds to put them in the datastore. We suspected that the delay was caused by running the program locally on our own computer. When we uploaded the application to Google’s infrastructure and ran it from there, it ran much faster, which confirmed our suspicions.
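The observation that 100 records store in a few seconds suggests one possible mitigation for large tables: hand the records over in fixed-size groups. The sketch below only shows the grouping logic and is not something we implemented in BioCloud. The store_batch callback is a hypothetical stand-in for whatever call writes a group of entities, and the batch size of 100 is simply the figure from our test.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

def store_in_batches(
    items: Sequence[T],
    store_batch: Callable[[List[T]], None],
    size: int = 100,
) -> int:
    """Pass the items to `store_batch` group by group; return how many were stored."""
    stored = 0
    for batch in chunked(items, size):
        store_batch(batch)
        stored += len(batch)
    return stored

# Record each batch in a list instead of writing to a real datastore.
written: List[List[int]] = []
total = store_in_batches(list(range(250)), written.append, size=100)
print(total, [len(batch) for batch in written])
```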
When planning the query page, we intended for users to be able to enter inequality expressions in one or more of the input fields, according to the results they wanted from the search. Unfortunately, the App Engine allowed only one inequality expression per search. We were not able to find a way around this problem, and as a result the functionality of the program was compromised. The App Engine does allow more than one equality expression to be used, but that option can be ineffective when working with a variety of data.
Originally, after the user checked which files they wanted to blast, the application would blast them on the NCBI website and return the results. We expected the blasting portion to take a while because the App Engine had to access another website to get the results. But a few seconds after the program had started the blasts, we received a ‘time out’ error. Apparently the App Engine allots a specific amount of time to each request. If the request takes longer than that, an error is raised and the program exits. To get around this problem, we had to go from blasting multiple sequences at once to having the user manually blast one sequence at a time. Every file that the user selected would appear by name in a table on another screen, along with an option to view the sequence in that file and another option to blast that sequence. By blasting one sequence at a time, we hoped to stay under the time allotted for each request. Although this causes more work for the user, it’s a small price to pay to be able to blast sequences from your own computer.
Once we fixed the time out error, we tried to blast some sequences individually, but we ran into another error. It seemed that some of the libraries that were a part of the Biopython module were not compatible with the Google App Engine. We tried to modify some of the libraries, but the problems were too deeply embedded, and each aspect we changed caused another one to fail. We concluded that fixing this problem would be time-consuming and might become another project in itself.
Google also did not offer a method to check whether an entity had already been stored in the datastore, so we created a function that uses an entity’s key to check whether it has been stored. If the entity had been stored, we deleted the old copy and inserted the new one. If it had not been stored, it was put into the Google datastore normally.
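The check-then-replace logic can be sketched as follows. This is an illustration of the idea only: InMemoryStore is a hypothetical dictionary-backed stand-in for the datastore, and store_once is a hypothetical name, not the function in BioCloud.

```python
from typing import Any, Dict

class InMemoryStore:
    """Hypothetical stand-in for the datastore, keyed by entity key."""

    def __init__(self) -> None:
        self._entities: Dict[str, Dict[str, Any]] = {}

    def exists(self, key: str) -> bool:
        return key in self._entities

    def delete(self, key: str) -> None:
        del self._entities[key]

    def put(self, key: str, entity: Dict[str, Any]) -> None:
        self._entities[key] = dict(entity)

    def get(self, key: str) -> Dict[str, Any]:
        return self._entities[key]

    def __len__(self) -> int:
        return len(self._entities)

def store_once(store: InMemoryStore, key: str, entity: Dict[str, Any]) -> bool:
    """Store the entity under its key; return True if an older copy was replaced."""
    replaced = store.exists(key)
    if replaced:
        store.delete(key)  # remove the old copy first
    store.put(key, entity)
    return replaced

store = InMemoryStore()
first = store_once(store, "match-1", {"mismatches": 3})
second = store_once(store, "match-1", {"mismatches": 5})
print(first, second, len(store), store.get("match-1"))
```

However many times the same key is uploaded, only one copy remains, which is the property that matters when storage is limited.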
Learning the limitations of the Google App Engine made clear what the application would actually be able to do. We didn’t achieve all the goals of the initial design, but we were able to get some of the features working after making a few changes.

Recommendations
The Google App Engine includes many features that make cloud computing convenient, but a few measures should be taken to improve it. First, the time out error that is raised when a request takes more than a few seconds should be eliminated, or the time allotted to each request should be lengthened. Some requests, such as accessing another website for the BLAST function, can be expected to take several seconds, maybe even minutes. Some users would be willing to wait the extra time if it meant being able to BLAST sequences from their own computers.
Google should also take the query engine a step further and allow multiple inequalities for each search. The BioCloud would be more worthwhile if users were able to query based on any criteria. With the current app engine, if users wanted to query the Match entities with two or more stipulations, they would have to perform two separate queries and then manually find which Matches meet their criteria. This is a redundant process and it defeats the purpose of the query feature in the BioCloud application.
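The manual step described above, running one query per inequality and then finding the Matches common to all of them, could in principle be automated in application code. The sketch below is a hedged illustration and not part of BioCloud: the datastore is replaced by a plain dictionary of Match properties keyed by entity key, each call to query stands in for one single-inequality search, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Set

Match = Dict[str, float]

def query(
    matches: Dict[str, Match], field: str, test: Callable[[float], bool]
) -> Set[str]:
    """One search with a single inequality; returns the keys that satisfy it."""
    return {key for key, match in matches.items() if test(match[field])}

def query_all(
    matches: Dict[str, Match], filters: Dict[str, Callable[[float], bool]]
) -> List[str]:
    """Run one search per filter and keep only the keys found by every search."""
    keys = set(matches)
    for field, test in filters.items():
        keys &= query(matches, field, test)
    return sorted(keys)

# Illustrative values only.
matches = {
    "m1": {"gap openings": 60, "mismatches": 2},
    "m2": {"gap openings": 70, "mismatches": 9},
    "m3": {"gap openings": 10, "mismatches": 1},
}
result = query_all(
    matches,
    {"gap openings": lambda v: v > 50, "mismatches": lambda v: v < 5},
)
print(result)
```

The cost of this approach is that every intermediate result set must be fetched in full before the intersection, so it only suits tables small enough to retrieve within a request.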
Google also needs to provide a mechanism that checks whether an entity has already been stored. Since the App Engine we used was a trial version, we had only 500 megabytes of storage. To maximize the amount of data we could store, an entity-checking mechanism would have been a great help. There would be less wasted space because we could be sure there was only one copy of each entity in the datastore.
Finally, Google should be more receptive to the different modules that users might want to import into their applications. It was very disappointing to find out we couldn’t BLAST sequences because of an incompatibility between modules. Allowing a wider range of modules to be imported would let developers broaden the capabilities of their applications.

Future Work
The BioCloud includes features that benefit the data flow at CMOP, but more steps can be taken to improve the application as a whole. One important step would be to get the application to BLAST sequences. After a little debugging of the Biopython module, this step should be completely feasible. Unfortunately, we discovered the problem with the modules too late and ran out of time, but the debugging of the module shouldn’t take long. This essential feature would have the most benefit to the data flow for sample processing at CMOP.
Another feature that should be added to BioCloud is a query tracker that records the queries a user performs and saves them. This would allow users to keep ‘favorite’ queries that they can return to for later reference or run repeatedly.
Additional features can be added to BioCloud to make it more capable and better tailored to bioinformatics. As CMOP continues to expand its research into new areas of interest, those features should be added.

Conclusion
Despite the limitations and setbacks we experienced with the Google App Engine, it proved to be an efficient cloud computing service. It enabled us to explore some of the advantages of cloud computing, such as hosting a web application without the stress of maintaining a server. With the BioCloud application, we only scratched the surface of the App Engine’s capability. Continued research into this tool is strongly encouraged as cloud computing becomes more widely accepted. That research will help expand the capabilities of BioCloud and move it towards becoming a permanent part of CMOP’s data flow.

References

Oehmen, C., and Nieplocha, J. (2006) “ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis” IEEE Transactions on Parallel and Distributed Systems, Vol. 17, No. 8

Synder, Bill. (2008) “Cloud Computing: Tales from the front” Retrieved on August 8 2008 from

“What is Google App Engine?” Retrieved on August 8 2008 from .