Data-intensive research is changing the way African researchers can work and the impact they can have. It is also opening up new career paths in the field of data science.
By increasing the volume of data that researchers can analyze and work with at any given time, data-intensive technology allows them to make bigger strides in less time in their chosen disciplines. Data scientists assist this process by providing the skills to help researchers and managers first analyze large volumes of data and then use that analysis to make effective decisions. Big data is already making a big difference in fields ranging from banking and social media to healthcare and astronomy.
Data-intensive research, or big data technology, has come to Africa by way of the stars: the establishment of the Square Kilometre Array (SKA) pointed to the need for the continent to be able to analyze the extremely high volumes of data to be generated by the network of telescope dishes that will ultimately be placed across remote regions of southern Africa.
The SKA project is an internationally renowned effort to build the world’s largest radio telescope with more than a square kilometer of collecting area. It is one of the largest scientific endeavors in history and drives one of the world’s most significant big data challenges of the coming decade.
Three South Africa-based universities involved in the SKA project—North-West University, the University of Cape Town (UCT) and the University of the Western Cape (UWC)—established a partnership in 2015 to form the Inter-University Institute for Data Intensive Astronomy (IDiA).
IDiA is mobilizing researchers in fields such as astronomy, computer science, statistics and eResearch technologies to create data science capacity for leadership in SKA precursor projects such as MeerKAT, which is scheduled to achieve full operation in early 2018. MeerKAT marks the beginning of a radio big data revolution in Africa. It will be operated as a South African national facility for about five years before it is incorporated into the SKA dish array.
IDiA is also establishing a data-intensive research and training program to develop capacity on the continent to use the data that MeerKAT will deliver. On its own, radio astronomy data is raw; it requires analysis to provide the kinds of answers astronomers and astrophysicists are seeking about the origins of the cosmos. The astronomy project will also involve developing data systems and tools for analyzing multi-wavelength astronomy data.
The SKA is a multinational project involving researchers and data scientists around the world. Thus, one of IDiA’s projects is to create a data platform that will allow remote teams to access the data: the African Research Cloud (ARC). IDiA will also develop and apply processing algorithms that allow for analysis of the data so that we can turn high volumes of information into knowledge we can apply and use.
Fuelling collaborations and solutions
The ARC involves collaborators from around the world. Much of this work is also part of a collaboration with SKA partners in the Netherlands to establish an Advanced European Network of E-infrastructure for astronomy.
The ARC is the first stage of a three-phase plan to address specific uses of data-intensive research. One such application is the African Research Cloud Astronomy Demonstration project (ARCADE), which will specifically serve MeerKAT teams. The MeerKAT large surveys will produce a terrifying deluge of data. Observations are expected to produce almost 100 terabytes of data each day, orders of magnitude more than the conventional volume from a radio telescope. This data will have to be transported, calibrated, imaged, processed and analyzed by dozens of astronomers around the world.
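To put the figure of almost 100 terabytes a day in perspective, the short sketch below converts it into a sustained transfer rate and a yearly total. It is only back-of-the-envelope arithmetic: it assumes decimal units (1 TB = 10^12 bytes) and round-the-clock observing, and the function names are illustrative rather than part of any MeerKAT or ARC software.

```python
# Back-of-the-envelope arithmetic for a daily data volume.
# Assumptions: decimal units (1 TB = 10**12 bytes) and continuous observing.

BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_second(terabytes_per_day):
    """Average number of bytes per second needed to move a daily volume."""
    return terabytes_per_day * BYTES_PER_TB / SECONDS_PER_DAY


def petabytes_per_year(terabytes_per_day, days=365):
    """Total volume accumulated over a year, in petabytes (1 PB = 1000 TB)."""
    return terabytes_per_day * days / 1000


rate = bytes_per_second(100)
print(f"{rate / 1e9:.2f} GB/s")              # 1.16 GB/s
print(f"{rate * 8 / 1e9:.2f} Gbit/s")        # 9.26 Gbit/s
print(f"{petabytes_per_year(100):.1f} PB")   # 36.5 PB
```

Under these assumptions, simply keeping up with the telescope would take most of a dedicated 10-gigabit-per-second link running without pause, which is one reason for processing the data close to where it is stored rather than shipping it to each astronomer.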
ARCADE thus focuses on two important aspects of scientific utility: the processing of radio data and large-scale scientific collaboration. It takes a proof-of-concept approach, developing compact and incisive interventions for well-defined technological problem statements.
One such successful intervention was a large-scale collaborative project that used a second-year astronomical techniques class at UCT as a test subject. The project focused on practical learning outcomes for the class of 50. Students had to perform a simple yet challenging set of analyses on radio and optical images, which included inspection, statistical analysis, plotting and documentation.
A cloud-based hub was created for the project, and a beefy virtual machine was populated with state-of-the-art software tools that are the contemporary standards in open-source big data initiatives. Students could log on to the ARC via a web browser in a computer lab during a supervised session, but they could also complete the exercise anywhere at UCT, on their own laptops and mobile devices.
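For readers curious about what the inspection and statistics steps of such an exercise might look like, here is a minimal sketch in Python. It is not the actual class material: the image is synthetic (random noise plus one artificial bright source), the function names are hypothetical, and only the standard library is used, so it runs anywhere without the specialist tools installed on the ARC.

```python
import random
import statistics


def make_test_image(size=32, noise_sigma=1.0, source_peak=50.0, seed=0):
    """Build a synthetic image: Gaussian noise plus one bright central pixel."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    image = [[rng.gauss(0.0, noise_sigma) for _ in range(size)]
             for _ in range(size)]
    image[size // 2][size // 2] += source_peak  # the artificial "source"
    return image


def summarize(image):
    """Basic pixel statistics of the kind used to inspect an image."""
    pixels = [value for row in image for value in row]
    return {
        "mean": statistics.fmean(pixels),
        "median": statistics.median(pixels),
        "stdev": statistics.stdev(pixels),
        "max": max(pixels),
    }


def brightest_pixel(image):
    """Return the (row, column) position of the brightest pixel."""
    positions = ((r, c) for r, row in enumerate(image)
                 for c in range(len(row)))
    return max(positions, key=lambda rc: image[rc[0]][rc[1]])


image = make_test_image()
print(summarize(image))
print("Brightest pixel at", brightest_pixel(image))
```

The median stays close to the noise level while the maximum picks out the source, which is the kind of contrast students would look for when checking whether a feature in an image is real.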
The power of big data
This successful case study demonstrated the power of big data solutions and the advantages of cloud-based technologies, and resulted in two very important findings.
First, the ARC and IDiA provide an unprecedented opportunity for training and collaboration in scientific analysis. The students in the test class were exposed to critical skills in mathematics, statistics and programming in an immersive and collaborative environment.
They were at liberty to discuss, share and work on their projects in a safe and robust programming environment. This sort of intervention can be deployed at a larger scale, and can provide a training environment for anyone with an internet connection. Additionally, the students experienced a first glimpse of tools and techniques that will provide them with an advantage in their future careers in academic institutions or industry.