MIDDLEWARE
Digging deep to unlock the Grid
Unlocking the true power of the Grid for data mining is a long-cherished aim of computer scientists, and researchers are making important strides towards that goal by developing the necessary tools and techniques. The real power of Grid computing lies in sharing resources across a network. These resources can be CPU cycles, storage, peripherals, network bandwidth, data and software. Ultimately, this will lead to the grand goal envisioned by Grid researchers, in which Grid users can seamlessly access and harness geographically distributed computing resources as if they were using a local system.
However, “trust, security, data privacy and reliability [or quality of service] in Grid computing is still a largely unresolved problem,” says Dr Werner Dubitzky, Professor of Bioinformatics at the School of Biomedical Sciences at the University of Ulster and co-coordinator of the EU's IST-funded DataMiningGrid project. “These issues are particularly important when commercial computing jobs are distributed across sites not belonging to the company that issued the jobs.”
DataMiningGrid is investigating some of these problems for the specific field of, predictably, data mining. This is important for two reasons: data mining was developed to analyse and interpret large quantities of data, and it is one of the most powerful technologies used in astronomy, finance and the biological sciences. One of the project's key objectives is to build technologies that make it easier to Grid-enable data mining technology, from data pre-processing through analysis to post-processing techniques, even if these intrinsically reside in widely dispersed locations. It is hoped that this technology will eventually help to improve the effectiveness and performance of data mining applications and provide much wider access to data mining technology.
By using a series of mature or near-mature tools to manage issues such as scheduling, workflow management, and data access and integration, DataMiningGrid avoids reinventing the wheel and can focus on the core problem: extracting relevant information from vast data sets across a Grid. “We're a research project, so we're not going to be producing a commercial product. We're putting together some demonstrators to show how the tools we develop can effectively mine data across a Grid,” says Dr Dubitzky. “Having said that, we are of course highly interested to bring this technology to the market using suitable exploitation channels.”
The project faces several critical challenges. First, the requirements for data mining applications vary widely across different domains and sectors, so bringing them all under a unified system architecture is difficult. Second, in many data mining problems the data must remain at its source, whether because of its volume, for privacy or for other reasons. In this case the analysis must be executed close to where the data resides. In addition, one logical data set may be physically distributed across different locations. Together, these requirements and constraints pose a significant challenge.
Dr Dubitzky says the project is on target to produce a selected set of demonstrator applications by summer 2006, including a demonstrator for text mining. DataMiningGrid is an important effort in realising the true potential of Grid computing.
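The constraint described above, where data must stay at its source while one logical data set is spread over several sites, can be illustrated with a small sketch. It is not DataMiningGrid's own software, and every name in it (PartialStats, summarise_locally, merge_partials, mean_and_variance) is hypothetical. The assumption is simply that each site computes a compact summary next to its data and ships only that summary to a coordinator, which combines the summaries into a global result.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PartialStats:
    """Compact summary a site can ship; the raw records never leave the site."""
    count: int
    total: float
    total_sq: float


def summarise_locally(values: Iterable[float]) -> PartialStats:
    """Runs at the site that holds the data (hypothetical helper)."""
    count, total, total_sq = 0, 0.0, 0.0
    for v in values:
        count += 1
        total += v
        total_sq += v * v
    return PartialStats(count, total, total_sq)


def merge_partials(parts: List[PartialStats]) -> PartialStats:
    """Runs at the coordinator; it only ever sees the summaries."""
    return PartialStats(
        count=sum(p.count for p in parts),
        total=sum(p.total for p in parts),
        total_sq=sum(p.total_sq for p in parts),
    )


def mean_and_variance(stats: PartialStats) -> Tuple[float, float]:
    """Global mean and population variance of the logical data set."""
    if stats.count == 0:
        raise ValueError("no records were summarised")
    mean = stats.total / stats.count
    variance = stats.total_sq / stats.count - mean * mean
    # Guard against tiny negative values caused by floating-point rounding.
    return mean, max(variance, 0.0)


if __name__ == "__main__":
    # Two sites holding parts of one logical data set (made-up numbers).
    site_a = [1.0, 2.0, 3.0]
    site_b = [4.0, 5.0]
    partials = [summarise_locally(site_a), summarise_locally(site_b)]
    print(mean_and_variance(merge_partials(partials)))
```

Only three numbers per site cross the network here, whatever the size of the local data. Real mining tasks need richer summaries than this, and the sum-of-squares formula can lose precision on large values, so the sketch shows the pattern rather than a production method.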