Data science is an emerging field that plays an integral part in the so-called ‘big data’ drive, where the challenge is to extract value from vast amounts of data. This article aims to provide a backdrop and a case study for the application of data science thinking in the energy distribution sector.
Technology industry giants such as Google, Apple, and Amazon, along with social media companies such as Facebook, generate on the order of petabytes of user data daily, and these volumes are growing rapidly. Moreover, the Internet of Things (IoT) is also contributing high volumes of data as a wide variety of devices, sensors, systems, and services connect to the internet in an effort to achieve greater value by exchanging information more efficiently.
The hidden value of your data
Possibly the primary reason data is growing is advances in physics and engineering, which allow progressively faster information processing and storage. Consequently, companies now gather and store more data than they can effectively exploit in terms of business potential. This is where data science aims to bridge the gap between business opportunity and the data itself. The need to analyse extremely large amounts of information, in some cases in near real time, in order to derive value from it is undoubtedly increasing with this data explosion.
Data scientists specialising in machine learning aim to build algorithms capable of detecting patterns in data (hidden information). These patterns can be used to better understand the underlying dynamics captured in the form of digital information, or to develop data products that can be implemented in real-time systems that mimic or enhance human information processing tasks. Machine learning also empowers us to reason about uncertainty in the data.
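To make the idea of learning a pattern from data concrete, the sketch below fits a linear trend to a series of hypothetical daily meter readings using ordinary least squares, one of the simplest pattern-detection techniques. The data and function names are illustrative assumptions, not drawn from any utility's actual system:

```python
# Illustrative sketch: learn a linear trend (y = slope * x + intercept)
# from hypothetical daily energy readings using ordinary least squares.

def fit_line(xs, ys):
    """Closed-form least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical readings: days 0..6, consumption rising roughly 2 kWh/day.
days = [0, 1, 2, 3, 4, 5, 6]
kwh = [10.1, 12.0, 13.9, 16.2, 18.0, 19.9, 22.1]
slope, intercept = fit_line(days, kwh)
print(f"trend: {slope:.2f} kWh/day, baseline: {intercept:.2f} kWh")
```

The fitted slope is the "hidden information" here: a compact summary of how consumption changes over time, which a real system might use to forecast demand or flag anomalous meters.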
What do utilities need to achieve this?
The data needs to be stored on proper data management platforms that scale well and provide high-speed processing (particularly for machine learning applications). Platforms from the open-source community gaining popularity include: Hadoop, with its two-stage MapReduce paradigm; Apache Spark, with its in-memory iterative computation advantages; and Cluster Map Reduce, a Hadoop-like framework for distributed environments.
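As a toy illustration of the two-stage MapReduce paradigm mentioned above, the sketch below runs the classic word-count job on a single machine: a map stage emits key-value pairs, a shuffle groups them by key, and a reduce stage aggregates each group. This is a minimal sketch of the control flow only; on Hadoop these stages would run distributed across a cluster, and the sample records are invented:

```python
from collections import defaultdict

def map_stage(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups):
    """Reduce: aggregate the values for each key (here, sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical event log from field devices.
logs = ["meter online", "meter offline", "meter online"]
counts = reduce_stage(shuffle(map_stage(logs)))
print(counts)  # {'meter': 3, 'online': 2, 'offline': 1}
```

Because the map and reduce functions only ever see one record or one key at a time, the framework is free to partition the work across many machines, which is what makes the paradigm scale.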
Alternative platforms are emerging, but the choice of which platform to use will depend on factors such as the…