Notes from eScience 2008

Last week, I had the opportunity to give two talks in Indianapolis: one at the IEEE eScience 2008 conference, and another at the co-located Microsoft eScience Workshop.

All the presentations were recorded and will soon be available online.

The event brought together a very diverse community, but managed to remain remarkably focused on the core research: new platforms for data-intensive science.

Key themes

Cloud architectures
What do we need to make commercial cloud offerings suitable for science? For example, interconnect speeds are not usually included in SLAs, which poses a problem for tightly-coupled parallel science apps such as our own ocean circulation models. Are clouds fractal? That is, will there always be smaller, local clouds at individual institutions, or will Watson turn out to be right? Is it cloud or cloud + client? (I think a local presence is necessary.)
Cloud programming models
MPI, Workflow, Relational Algebra, MapReduce, and Microsoft's Dryad are all points along a spectrum of parallel programming abstractions for manipulating massive datasets. What are the limitations in performance and expressiveness with respect to specific domain applications?
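
For a concrete reference point on that spectrum, here is a toy MapReduce in plain Python (my own sketch for these notes, not any particular framework's API). The skeleton makes the map, shuffle, and reduce phases explicit, with word count as the usual example.

    from itertools import groupby

    def map_reduce(inputs, mapper, reducer):
        # Map phase: emit (key, value) pairs from every input record.
        pairs = [kv for record in inputs for kv in mapper(record)]
        # Shuffle phase: bring equal keys together.
        pairs.sort(key=lambda kv: kv[0])
        # Reduce phase: fold each key's values into a single result.
        return {key: reducer(key, [v for _, v in group])
                for key, group in groupby(pairs, key=lambda kv: kv[0])}

    # The canonical example: word count.
    lines = ["the quick brown fox", "the lazy dog"]
    counts = map_reduce(lines,
                        mapper=lambda line: [(w, 1) for w in line.split()],
                        reducer=lambda word, ones: sum(ones))
    print(counts)  # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
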
Visualization
As data volumes explode, visualization is no longer a luxury but a necessity. There is simply no other way to convey the details of large datasets except by harnessing the high bandwidth of the human visual system. However, visualization alone is not enough. There is no effective method for visualizing high-dimensional data (50+ dimensions), so visualization techniques must be combined with dimension-reduction techniques such as multi-dimensional scaling or PCA. This area is referred to as visual analytics.
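
As a small illustration of the dimension-reduction half of that pipeline, here is PCA via the SVD in a few lines of Python/NumPy (my own sketch; the data and names are made up):

    import numpy as np

    def pca_project(X, k=2):
        """Project N samples x D features onto the top-k principal components."""
        Xc = X - X.mean(axis=0)                  # center each feature
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T                     # k-dimensional coordinates

    # e.g. squeeze 1000 observations in 50 dimensions down to a plottable scatter
    X = np.random.rand(1000, 50)
    print(pca_project(X, k=2).shape)             # (1000, 2)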

Applications
As computational technology becomes increasingly sophisticated, the skills required to operate it stretch further out of reach of non-specialists. Computer scientists can no longer simply throw generic tools over the wall for domain scientists to use in applications. We must build and deploy end-to-end applications as experiments, then extract the general techniques as they become apparent. The number and quality of application talks at this conference demonstrate that the eScience community has fully internalized this idea.

Notes

Excellent talk from George Djorgovski, an astronomer at Caltech.

Some highlights:

"All science will be eScience within a few years"

"Most data will never be seen by humans"

"Most data (and data constructs) are too complex to be comprehended by a human"

"Visualization is the bridge from quantitative information to our understanding."

"Data-driven science is not about data, it is about knowledge extraction"

"Computer science is the new mathematics"

"Teaching scientists and their students to think computationally"

--

Excel Tools for eScience on Microsoft CodePlex.
Joins and UnNest operators for spreadsheets.
Not just for eScience. Very useful.

http://www.codeplex.com/eScienceExcel
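
If you haven't met these operators outside a database, here is roughly what they do, sketched on list-of-dict "sheets" in Python (my own illustration, not the CodePlex tool's interface):

    stations = [{"station": "A", "depth_m": 5}, {"station": "B", "depth_m": 12}]
    obs = [{"station": "A", "readings": [1.1, 1.3]},
           {"station": "B", "readings": [2.4, 2.2]}]

    # Join: attach the matching station row to each observation row.
    depth = {s["station"]: s["depth_m"] for s in stations}
    joined = [dict(o, depth_m=depth[o["station"]]) for o in obs]

    # UnNest: flatten the list-valued cell into one row per element.
    flat = [dict(j, readings=r) for j in joined for r in j["readings"]]
    print(flat)   # four rows, one scalar reading each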

--
"On-the-fly environmental data visualization using wavelets"
Cyrus Shahabi, Kai Song and Farnoush Banaei-Kashani

Coining the cute term WOLAP (in reference to ROLAP and MOLAP) for wavelet online analytical processing.

Relevant to CMOP, as it provides a technique for efficiently answering range queries over timeseries data at varying degrees of resolution, which is exactly what we need to make the product factory more efficient.
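
To make that concrete, here is a bare-bones Haar-wavelet version in Python/NumPy (my own sketch of the general compress-then-query idea, not the authors' WOLAP algorithm): decompose, keep only the largest coefficients, and answer range sums approximately from the compressed representation.

    import numpy as np

    def haar_decompose(x):
        """Unnormalized Haar decomposition; len(x) must be a power of two."""
        levels = []
        while len(x) > 1:
            avg  = (x[0::2] + x[1::2]) / 2.0
            diff = (x[0::2] - x[1::2]) / 2.0
            levels.append(diff)        # detail coefficients at this scale
            x = avg
        levels.append(x)               # overall average
        return np.concatenate(levels)

    def haar_reconstruct(c, n):
        """Invert haar_decompose for a signal of length n."""
        sizes, m = [], n               # per-level coefficient counts: n/2, n/4, ..., 1
        while m > 1:
            m //= 2
            sizes.append(m)
        pieces, i = [], 0
        for s in sizes:
            pieces.append(c[i:i + s])
            i += s
        x = c[i:i + 1]                 # start from the overall average
        for diff in reversed(pieces):
            out = np.empty(2 * len(x))
            out[0::2] = x + diff
            out[1::2] = x - diff
            x = out
        return x

    signal = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.random.randn(256)
    coeffs = haar_decompose(signal)
    small = coeffs.copy()
    small[np.argsort(np.abs(small))[:-16]] = 0.0     # keep only the 16 largest
    approx = haar_reconstruct(small, len(signal))
    print(signal[40:80].sum(), approx[40:80].sum())  # range sum: exact vs. approximate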

Previous work in this area:
Wavelets for data compression:
Agrawal CIKM 00
Garofalakis VLDB 00
Vitter SIGMOD 99

New work: compress data AND query
Schmidt PODS 02
Schmidt EDBT 02
Jahangiri SIGMOD 05

We should try this with both timeseries observations and perhaps with model results. Wavelet representations of unstructured grids seem to have a fair amount of support in the literature.

Seems to require uniform time intervals, but the author claims otherwise. Update: I spoke with Farnoush after the talk and managed to convince him that his technique requires O(N) time unless you assume uniform time intervals or a special (but straightforward) index structure that can tell you the number of records between the start and end of any range query.

So we can still make use of it, but we need to build and maintain the ordinal position of each record in the table in addition to the time. That way, when you ask for the records between t1 and t2, we know that there will be id(t2) - id(t1) records between them, so we can build the bitmap. Easy to implement in Postgres; just use a sequence to tag each record.
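
The lookup itself is just a binary search. A minimal sketch in Python, assuming the timestamps are kept sorted (the list index plays the role of the sequence-assigned id; names are my own):

    import bisect

    def record_count(times, t1, t2):
        """How many records fall in [t1, t2)? O(log N) via binary search."""
        return bisect.bisect_left(times, t2) - bisect.bisect_left(times, t1)

    times = [0.0, 0.7, 1.9, 2.0, 5.5, 9.3]   # non-uniform intervals are fine
    print(record_count(times, 1.0, 6.0))      # -> 3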

Seems promising!

--

more to come...