Practicing Python Data Science

Data analysis is a skill that I'm proud of; much of my work in this arena over the last decade has been done using MATLAB in a professional setting. Seeing that many compelling and accessible tools are developing in the Python community, I looked for an opportunity to practice methods I was familiar with in MATLAB, but using Python instead.

Data available in reference 1 and associated with the paper in reference 2 on power utilization in Tetouan, Morocco seemed like a good opportunity to try out various timeseries analysis techniques as well as some machine learning concepts and a variety of visualization techniques. Packages I used included:

  • pandas for data import and manipulation
  • NumPy for FFT and SVD algorithms
  • Matplotlib for basic plotting
  • Plotly for plots with exploration tools like zoom and datatips
  • Seaborn for statistical visualization

The data analyzed consists of timeseries of power utilization in three zones of the city, along with possibly causal variables including temperature, humidity, wind speed, and a couple of variations on solar radiation. The starting point of the analysis was a heatmap of the correlation matrix among the variables (my wife remarked "it looks like a quilt"):


Figure 1: Correlation Matrix For Available Variables
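A heatmap like Figure 1 can be produced with a few lines of pandas and Seaborn. The sketch below uses synthetic stand-in data and hypothetical column names ("Temperature", "Humidity", "P1"); the actual UCI CSV has its own headers.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the dataset: temperature-driven power plus noise.
rng = np.random.default_rng(0)
n = 1000
temp = rng.normal(20, 5, n)
df = pd.DataFrame({
    "Temperature": temp,
    "Humidity": rng.normal(60, 10, n),
    "P1": 1000 + 30 * temp + rng.normal(0, 100, n),  # power tracks temperature
})

corr = df.corr()  # Pearson correlation matrix among all columns

# Render the "quilt": annotated heatmap with a diverging colormap.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix")
```

The diagonal is exactly 1 by construction, and the synthetic power/temperature pair shows up as a strong positive off-diagonal entry.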

As expected, the power dissipation values of the three zones in the city are reasonably positively correlated with each other, and power dissipation is also somewhat correlated with temperature. The time history of the wind speed variable looked questionable, and it is possible that there were limitations of that instrument that I'm unaware of, but wind speed was somewhat correlated with temperature. My experience with the desert climates of the southwestern United States is similar, with winds tending to pick up as solar heating warms the surface. I did not quickly find a good description of "Generalized Diffuse Flows" and "Diffuse Flows", but the former clearly resembled the solar-radiation diurnal cycle typical of desert regions (Reference 3). This variable was also reasonably correlated with temperature.

FFT Analysis

Next was some FFT analysis. Clearly expected cycles included daily, weekly, and annual cycles. The amplitude spectrum of the available variables is shown in Figure 2:


Figure 2: Amplitude Spectrum of Available Variables
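An amplitude spectrum like Figure 2 can be sketched with NumPy's real FFT. The example below uses a synthetic series with 10-minute sampling (144 samples per day) rather than the actual dataset; the variable names are illustrative.

```python
import numpy as np

# Synthetic stand-in for one power-zone series: 10-minute samples
# (144 per day) with a dominant daily cycle plus noise.
samples_per_day = 144
days = 56
t = np.arange(days * samples_per_day)
x = (100 + 20 * np.sin(2 * np.pi * t / samples_per_day)
     + np.random.default_rng(1).normal(0, 1, t.size))

X = np.fft.rfft(x - x.mean())                           # one-sided FFT, DC removed
freqs = np.fft.rfftfreq(x.size, d=1 / samples_per_day)  # frequency axis in cycles/day
amp = 2 * np.abs(X) / x.size                            # single-sided amplitude spectrum

peak_freq = freqs[np.argmax(amp)]  # dominant peak, expected near 1 cycle/day
```

Plotting `amp` against `freqs` on a log scale reproduces the kind of view in Figure 2, with the daily peak at 1 cycle/day and its harmonics at integer multiples.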

The expected daily peaks are clearly present in all variables except for wind speed as shown in Figure 3. Harmonics of the daily cycle also appear reflecting the non-sinusoidal shape of the various waveforms.


Figure 3: Amplitude Peaks at Daily Cycle Frequency

A sub-harmonic at a frequency of 1/(7 days) shows up in two of the three zones, as shown in Figure 4. This may reflect different usage of zone 3 compared to zones 1 and 2, for instance industrial versus residential usage.


Figure 4: Amplitude Peaks at Weekly Cycle Frequency

The FFT results also allowed creation of an average daily cycle by inverse transforming the components at the daily frequency and harmonics thereof. Subtracting this daily cycle from the time history data creates some interesting results (Figure 5). There is a decided change in the time history of power utilization over days 146 through 175. All three zones show this trend, but the most pronounced difference shows up in zone 1 (Figure 6). The source country for this data is 99% Muslim (Reference 4), and Ramadan in 2017 ran from the evening of Friday, May 26 through the evening of Saturday, June 24, corresponding to days 146 through 175.

Figure 5: Variable Responses After Subtracting off Daily Cycle


Figure 6: Focus on Power Utilization Days 146 through 175 (Same Legend as Figure 5)
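The daily-cycle removal described above can be sketched as follows: keep only the FFT bins at the daily frequency and its harmonics, inverse transform to get a periodic "average day", then subtract it from the series. The data and names here are synthetic placeholders, not the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(2)
samples_per_day = 144
days = 56
t = np.arange(days * samples_per_day)
daily = (20 * np.sin(2 * np.pi * t / samples_per_day)
         + 5 * np.sin(4 * np.pi * t / samples_per_day))  # daily cycle + 2nd harmonic
trend = 0.01 * t                                          # slow drift the daily model should not absorb
x = 100 + daily + trend + rng.normal(0, 1, t.size)

X = np.fft.rfft(x)
keep = np.zeros_like(X)
harmonics = days * np.arange(1, 6)   # bin indices at 1/day, 2/day, ... 5/day
keep[harmonics] = X[harmonics]
daily_model = np.fft.irfft(keep, n=x.size)  # periodic average daily cycle

residual = x - daily_model  # what's left: trend, events, noise
```

The `residual` series is the analogue of Figure 5: with the repetitive daily shape removed, slower features such as a multi-week shift in usage stand out.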

Statistical Analysis

Moving on from FFT analysis, I conducted some statistical visualization of the data, beginning with a PairPlot (Figure 7).


Figure 7: PairPlot of Available Variables (Q1 is Dec-Feb, Q2 is Mar-May, Q3 is Jun-Aug, Q4 is Sep-Nov)


Bivariate Kernel Density Estimates (KDE) are plotted below the diagonal, univariate KDEs on the diagonal, and actual data points above the diagonal. Separating the data by quarter-year was an easy approximation to separating it by season, as the "quarters" presented here start 1 December instead of 1 January; the color coding is intended to be intuitive, with blue being approximately winter, green spring, red summer, and orange fall. Temperature shifts predictably through the seasons, and the tail of zone 3 power (P3) also shifts right during summer, while the other two zones shift somewhat less. This plot provides insights beyond what can be learned from Figure 1:

  • The bimodal distribution of wind speed, with samples all clustered near 0 or 5 km/h, points to a likely problem with this instrument.
  • Both of the solar radiation variables have a very large number of samples near zero which makes sense but is likely to distort any fit of this data unless nighttime data is removed.

SVD/PCA Analysis

The next step was some SVD/PCA analysis. The reasonably small number of variables in this data set may not really make dimensionality reduction necessary, but it was suitable for test driving the tools available in NumPy. I started with an SVD of the three power-zone signals; the resulting singular values (the diagonal of the Sigma matrix) are plotted in Figure 8:


Figure 8: Scaled Magnitudes of Sigma Matrix Diagonal
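The SVD step can be sketched with `numpy.linalg.svd`. The three zone series below are synthetic, built around a shared "city-wide" component so that one singular value dominates, as in Figure 8; for a PCA-style variance interpretation the columns are mean-centered first.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
base = rng.normal(0, 1, n)  # shared component driving all three zones
P = np.column_stack([
    3.0 * base + 0.3 * rng.normal(0, 1, n),
    2.5 * base + 0.3 * rng.normal(0, 1, n),
    2.0 * base + 0.8 * rng.normal(0, 1, n),
])
Pc = P - P.mean(axis=0)  # center columns so singular values reflect variance

U, s, Vt = np.linalg.svd(Pc, full_matrices=False)
var_explained = s**2 / np.sum(s**2)  # fraction of total variance per component
```

Plotting `var_explained` (or the scaled singular values `s`) gives the kind of view in Figure 8, where the first component carries the bulk of the variance.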

The first principal component accounts for 87% of the variance in the data set, so the power data could reasonably be collapsed to a single variable for some applications. Looking at this concept more closely, Figure 9 shows the error associated with using a rank-1 approximation to the three power zones:


Figure 9: Rank-1 Approximation to Power Utilization in three Zones

The performance of this approximation for zones 1 and 2 might be acceptable for some purposes, but the performance for zone 3 would be unlikely to be acceptable for any purpose unless the focus were the first half of the year.
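A rank-1 approximation like the one in Figure 9 can be built from the first singular triplet alone. The data below are the same kind of synthetic stand-ins used above, with zone 3 given more independent variation so its residual is larger.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
base = rng.normal(0, 1, n)
P = np.column_stack([
    3.0 * base + 0.3 * rng.normal(0, 1, n),
    2.5 * base + 0.3 * rng.normal(0, 1, n),
    2.0 * base + 0.8 * rng.normal(0, 1, n),  # "zone 3": weakest fit to the shared component
])

U, s, Vt = np.linalg.svd(P, full_matrices=False)
P1 = s[0] * np.outer(U[:, 0], Vt[0, :])  # rank-1 reconstruction of all three zones

rss = np.sum((P - P1) ** 2, axis=0)  # residual sum of squares per zone
```

Comparing the columns of `P1` against `P`, and the per-zone `rss` values, is the analogue of Figure 9: the zone with the most independent behavior shows the largest residual.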

When expanding the SVD analysis to all variables, it was necessary to normalize them, as they represent different physical parameters with different numerical ranges. Each signal was normalized to zero mean and unit standard deviation before performing the SVD. Figure 10 shows the relative importance of the decomposition components and illustrates that not much dimensionality reduction across the full data set is feasible:

Figure 10: Relative Importance of SVD components from SVD applied to all normalized variables
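The z-score normalization described above is a one-liner per column: subtract the mean and divide by the standard deviation, so no variable dominates the decomposition purely because of its numeric range. The columns below are synthetic placeholders for power, temperature, and humidity.

```python
import numpy as np

rng = np.random.default_rng(6)
data = np.column_stack([
    rng.normal(30000, 5000, 2000),  # power: large numeric scale
    rng.normal(20, 5, 2000),        # temperature
    rng.normal(60, 10, 2000),       # humidity
])

# Z-score each column: zero mean, unit standard deviation.
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# SVD of the normalized matrix; each variable now contributes equal variance.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
```

After this step the singular-value profile (Figure 10) reflects shared structure among the variables rather than their raw units.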

A rank-3 approximation was tried; for some variables and in some timeframes it matched general trends, but it also produced some significant misses. Since all variables were normalized to have the same amount of variance (unit standard deviation), it may make more sense to eliminate variables with obvious problems from this process, the most obvious choice being the wind sensor. Another possible choice is the variable labeled "DF", as this looks like a solar-radiation variable with all of the mid-day data zeroed out. Table 1 shows how the RSS errors are reduced by this approach for all variables except GDF.


Table 1: RSS Errors in SVD With All Variables and With Troublesome Variables Removed
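The comparison in Table 1 can be sketched as follows: compute the per-variable RSS of a rank-3 approximation with all columns included, then again with a problematic column dropped. Here the data are synthetic, with five variables built from three latent factors and one column overwritten with pure noise to play the role of the wind sensor.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
# Three latent factors driving five variables through fixed weights.
W = np.array([[2.0, 0.0, 0.0, 0.0, 1.0],
              [0.0, 2.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 2.0, 0.0, 1.0]])
Z = rng.normal(0, 1, (n, 3)) @ W + rng.normal(0, 0.1, (n, 5))
Z[:, 3] = rng.normal(0, 1, n)  # "bad sensor": pure noise, unrelated to the factors

def rank_k_rss(M, k):
    """Per-column residual sum of squares of the best rank-k approximation."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Mk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return np.sum((M - Mk) ** 2, axis=0)

rss_all = rank_k_rss(Z, 3)
rss_dropped = rank_k_rss(np.delete(Z, 3, axis=1), 3)  # same rank, bad column removed
```

With the noise column removed, the rank-3 approximation no longer spends capacity on an unexplainable variable, so the remaining columns fit at least as well; this mirrors the RSS reductions reported in Table 1.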

Conclusions

The main objective of this work was learning how to implement techniques I'm familiar with in the MATLAB environment using the tools available in Python. With a reasonable understanding of the objectives of the various analyses, I found the process reasonably easy, although there are terminology differences and a certain amount of IT overhead in using the Python environment.

References:

  1. UCI Machine Learning Repository for "Power consumption of Tetouan city"
  2. Salam, A., & El Hibaoui, A. (2018, December). Comparison of machine learning algorithms for the power consumption prediction: Case study of Tetouan city. In 2018 6th International Renewable and Sustainable Energy Conference (IRSEC) (pp. 1-5). IEEE.
  3. Global Climatic Data for Developing Military Products
  4. Morocco in CIA Factbook

I've recently created a GitHub account and uploaded the analysis code there. While I'm getting accustomed to GitHub I've decided to keep the project private, but I would be happy to share it with my connections if anyone is interested.
