This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 766186.

ECOLE blog

Short Introduction to the Class Imbalance Classification Problem

Jiawen (Fay) Kong

What is a class imbalance classification problem?

Strictly speaking, any dataset with an unequal distribution between its classes can be considered imbalanced [1]. In the community, however, only datasets with a significant or extreme imbalance are regarded as imbalanced datasets. An illustration of the class imbalance problem is shown in Figure 1. Even a classifier that predicts every sample as the majority class achieves 95% accuracy, which makes it seem highly effective although it completely ignores the minority class.

Fig 1. An illustration of class imbalance problems.

Several steps to deal with the class imbalance classification problem.

From Figure 1, we know that accuracy does not reflect the actual effectiveness of an algorithm in imbalanced domains. Hence, the first step in dealing with an imbalanced dataset is to change the performance evaluation metric. The Area Under the ROC Curve (AUC), the F-measure and the geometric mean are three commonly used performance metrics (detailed information on these metrics can be found in [1]). Over years of development, many techniques have proven effective in handling imbalanced datasets. These methods can be divided into data-level approaches and algorithmic-level approaches [2, 3]: data-level approaches aim to produce balanced datasets, while algorithmic-level approaches adjust classical classification algorithms to make them suitable for handling imbalanced datasets.
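To make the evaluation point concrete, here is a minimal sketch, assuming scikit-learn is available, that reproduces the 95:5 scenario of Figure 1 with a trivial majority-class "classifier"; the geometric mean is computed by hand from the per-class recalls.

```python
# A minimal sketch of imbalance-aware evaluation, assuming scikit-learn.
# The toy labels reproduce the 95:5 split from Figure 1 (class 1 = minority).
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, recall_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)           # a "classifier" that always predicts the majority class
y_score = rng.uniform(size=y_true.size)  # placeholder scores, only needed for AUC

accuracy = (y_pred == y_true).mean()            # 0.95, despite ignoring the minority class
auc = roc_auc_score(y_true, y_score)            # ~0.5 for uninformative scores
f1 = f1_score(y_true, y_pred, zero_division=0)  # 0.0: no minority sample is found
# Geometric mean of the per-class recalls (sensitivity and specificity).
recall_min = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
recall_maj = recall_score(y_true, y_pred, pos_label=0)
g_mean = np.sqrt(recall_min * recall_maj)       # 0.0: the imbalance is exposed

print(f"accuracy={accuracy:.2f}, AUC={auc:.2f}, F1={f1:.2f}, G-mean={g_mean:.2f}")
```

Unlike accuracy, all three metrics immediately reveal that the minority class is being ignored.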

Two popular data-level approaches.

Data-level approaches, also known as resampling techniques, can be divided into undersampling techniques (which remove majority-class samples) and oversampling techniques (which generate additional minority-class samples). Here we only introduce two popular oversampling techniques.

The synthetic minority over-sampling technique (SMOTE), proposed in 2002, is the most popular resampling technique [4]. SMOTE produces balanced data by creating artificial samples through interpolation between randomly chosen minority samples and their k nearest neighbors. The procedure for generating a new synthetic sample is shown in Figure 2.

Fig 2. SMOTE working procedure.
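The interpolation step of Figure 2 fits in a few lines of NumPy. The following is a simplified illustration of the idea, not the reference implementation; the function name and array layout are our own.

```python
# A simplified sketch of SMOTE's generation step (Figure 2), assuming
# `X_min` is an array of minority-class samples, one sample per row.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, rng=np.random.default_rng(0)):
    """Generate one synthetic minority sample by SMOTE-style interpolation."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own neighbor
    i = rng.integers(len(X_min))                   # 1. pick a random minority sample x_i
    _, idx = nn.kneighbors(X_min[i:i + 1])
    j = rng.choice(idx[0][1:])                     # 2. pick one of its k nearest minority neighbors
    gap = rng.uniform()                            # 3. random interpolation factor in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])  # 4. new point on the segment x_i -> x_j

X_min = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
                  [1.2, 2.5], [1.8, 1.9], [1.4, 2.1]])
print(smote_sample(X_min, k=3))
```

Repeating this step until both classes have the same number of samples yields the balanced dataset.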

The adaptive synthetic (ADASYN) sampling technique aims to generate minority samples adaptively according to their distribution [5]. Its main improvement over SMOTE is that samples which are harder to learn are given higher weight and are therefore oversampled more often. The general idea of ADASYN is shown in Figure 3.

Fig 3. ADASYN working procedure.
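In practice, both techniques are available off the shelf. Below is a minimal usage sketch, assuming the imbalanced-learn package (imblearn) is installed; the synthetic dataset is only a stand-in for real imbalanced data.

```python
# A minimal usage sketch, assuming imbalanced-learn is installed
# (pip install imbalanced-learn); X, y form a synthetic 95:5 binary dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("SMOTE:   ", Counter(y_sm))   # balanced via uniform interpolation

X_ad, y_ad = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)
print("ADASYN:  ", Counter(y_ad))   # more synthetic points near hard-to-learn samples
```

Both resamplers return a roughly balanced dataset; the difference is where the synthetic minority samples are placed.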

This blog covers only the most basic knowledge of class imbalance; more tutorials on class imbalance will be posted later.

References

[1]. He, H. and Garcia, E.A., 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), pp.1263-1284.
[2]. Ganganwar, V., 2012. An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), pp.42-47.
[3]. Santos, M.S., Soares, J.P., Abreu, P.H., Araujo, H. and Santos, J., 2018. Cross-validation for imbalanced datasets: Avoiding overoptimistic and overfitting approaches [research frontier]. IEEE Computational Intelligence Magazine, 13(4), pp.59-76.
[4]. Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P., 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, pp.321-357.
[5]. He, H., Bai, Y., Garcia, E.A. and Li, S., 2008, June. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.

Crucial Facets of Modern Time Series Analysis

Sibghat Ullah

Time series analysis is an important motif in domains such as macroeconomics, financial portfolio management, weather forecasting, disaster management and health care, to name a few. In the age of digitalization, industrial and ubiquitous systems such as smartphones, health care units and economic transaction processing systems produce huge volumes of time-ordered data. Understanding the complex and dynamic processes that produce these data can lead to significant benefits for society and businesses. This requires a holistic analysis of the time series generated by these processes, including feature selection, model selection, validation and prediction of future time-related events. In this blog post, I try to summarize some of the major challenges in time series analysis that require the attention of the scientific community. Two example signals are plotted for this blog, in Figures 1 and 2 respectively. Figure 1 shows the daily minimum temperature in Celsius, as measured by the Australian Bureau of Meteorology, for the city of Melbourne from 1981 to 1991. Figure 2 shows the power consumption of all households in Germany, together with the solar and wind power generation in the country, from 2006 to 2018. The unit of measurement in Figure 2 is gigawatt-hours (GWh).

Fig 1. Minimum daily temperature in Melbourne, Australia, as recorded by the Australian Bureau of Meteorology from 1981 to 1991.
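As a starting point for any such analysis, a series like the one in Figure 1 first has to be loaded and inspected. A minimal sketch with pandas follows; the CSV file name and column names are placeholders, not the actual dataset layout.

```python
# A minimal sketch for loading and inspecting a series like Figure 1,
# assuming pandas; the file and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("daily_min_temperatures.csv",   # hypothetical path
                 parse_dates=["Date"], index_col="Date")
series = df["Temp"]                              # daily minimum temperature in Celsius

print(series.describe())                         # quick distributional summary
monthly = series.resample("M").mean()            # monthly means expose the seasonal cycle
monthly.plot(title="Monthly mean of daily minimum temperature")
```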

The classical analysis of time series relies heavily on autoregressive (AR) and exponential smoothing models [1], which are typically linear and suited to univariate time series. These state space methods involve human expertise in the loop and require explicit specification of trend, seasonality, cyclical effects and shocks when modeling a time series. As a result, these methods are interpretable and their predictions are ready to use. However, this interpretability comes at the cost of prediction accuracy, since such methods cannot capture the dynamic and complex nature of modern industrial and ubiquitous processes with multivariate time series, e.g., stock changes or energy consumption in power plants. Recently, with the advent of empirical models such as deep neural networks [2], there has been growing interest in time series modeling with deep learning. This usually involves modeling the time series with a sequential neural network architecture such as the recurrent neural network and its variants. Predictions for future observations can be made once the parameters of the network have been optimized using any method for computing gradients in recurrent structures, e.g., back-propagation through time (BPTT) or real-time recurrent learning (RTRL) [3]. Although neural networks achieve higher accuracy than linear state space models on time series, their predictions are often not interpretable, since they behave as black-box models. Additionally, the computational complexity and the choice of architecture make things much more complicated. Hence, we are left with a dilemma of simplicity vs. complexity, where both sides have their own pros and cons.
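To illustrate the classical side of this dilemma, here is a minimal sketch of the two linear model families mentioned above, assuming the statsmodels package; the synthetic daily series is only a stand-in for real data.

```python
# A minimal sketch of classical linear forecasting, assuming statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic stand-in for a univariate daily series: trend + weekly cycle + noise.
rng = np.random.default_rng(0)
t = np.arange(730)
series = pd.Series(10 + 0.01 * t + 2 * np.sin(2 * np.pi * t / 7)
                   + rng.normal(0, 0.5, t.size))

# Autoregression: tomorrow's value as a linear function of the last 7 days.
ar_fit = AutoReg(series, lags=7).fit()
ar_forecast = ar_fit.predict(start=len(series), end=len(series) + 13)  # 14 days ahead

# Exponential smoothing with an additive trend; components are interpretable by design.
es_fit = ExponentialSmoothing(series, trend="add").fit()
es_forecast = es_fit.forecast(14)

print(ar_forecast.head())
print(es_forecast.head())
```

Both fits expose their parameters explicitly, which is exactly the interpretability that black-box recurrent models give up.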

Fig 2. Daily Open Power System Data for electricity consumption and generation in Germany from 2006 to 2018.

Recently, hybrid models such as deep state space models [4] have been proposed for time series analysis; they combine the best of both worlds to produce robust predictions. Hybrid models have the advantage of working with both simple and complex time series. This also means that the size of the time series does not affect the learning itself: for small datasets, the state space component makes full use of the structural assumptions in the data, while for large datasets the deep neural network can capture complex nonlinear relationships, so the resulting model usually performs better than a stand-alone model. It is important to understand that, despite the advances in hybrid models, there are still major issues in time series analysis that need the attention of the scientific community. These include feature selection, model selection, cross-validation and interpretation. To this end, Bayesian time series frameworks [5] have been proposed that take care of feature and model selection; however, such frameworks suffer from computational complexity, since estimating the posterior with MCMC methods is expensive in higher dimensions.

Added to these are the issues of missing data imputation, irregular and asynchronous time series [6], non-causal and non-invertible processes, categorical/qualitative processes, class imbalance learning, and mechanisms for early detection of outliers and changepoints [7], i.e., structural breaks. Failing to address these issues can have grave consequences for learning, and the resulting predictions may be misleading. Finally, there is a lack of cohesive literature on multivariate time series analysis, and consequently no consensus on how to select multiple related time series based on information from the application domain (e.g., disease diagnostics from electronic health records in the ICU) that could be useful for inference.

References

[1] Lütkepohl, Helmut. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2005.
[2] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.
[3] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1.2 (1989): 270-280.
[4] Rangapuram, Syama Sundar, et al. "Deep state space models for time series forecasting." Advances in Neural Information Processing Systems. 2018.
[5] Qiu, Jinwen, S. Rao Jammalamadaka, and Ning Ning. "Multivariate Bayesian structural time series model." Journal of Machine Learning Research 19.68 (2018): 1-33.
[6] Harutyunyan, Hrayr, et al. "Multitask learning and benchmarking with clinical time series data." arXiv preprint arXiv:1703.07771 (2017).
[7] Guo, Tian, et al. "Robust online time series prediction with recurrent neural networks." 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2016.

Sharing the Experience of My First Year in ECOLE

Thiago Rios

Personal Background:

Little more than a year ago, at the end of my Master's degree and after eight years in the Mechanical Engineering department at UFSC (Brazil), starting a PhD was not exactly what I had in mind. I had seen many friends in the post-graduate program doing purely academic research, sometimes without the prospect of any real-world industrial application for their developments. Since I had taken part in several extra-curricular activities in parallel to my studies (three one-year internships and six years dedicated to projects in student racing teams), I did not see myself working in an exclusively academic environment and enduring an additional three or four years of lectures and publications, even though the doctorate program there was really good. On top of that, I was afraid that what I had learned during practical assignments would slowly fade away to make room for new, but highly specialized, knowledge.

Marie Skłodowska-Curie Actions: Innovative Training Networks (ITN)

Innovative Training Networks (ITN) drive scientific excellence and innovation. They bring together universities, research institutes and other sectors from across the world to train researchers to doctorate level.

Motivation to apply for ECOLE:

The proposal of the ECOLE project and the concept of the Innovative Training Networks (ITN) challenged me with the idea that pursuing a PhD could be very different from what I had expected. Spending three years between industry and academia, living in two different countries, and still managing to finish a dissertation, papers and training modules sounded very intense even before the job interview. However, bringing fundamental research together with real-world industrial applications was exactly what I was looking for, and the fact that ECOLE is applied to automotive engineering, which is deeply related to my educational background, really pushed me to join the team.


Experience in ECOLE:

Now, almost one year after my start in the project, it is satisfying to realize how much this interesting mix between academia and industry, computer science and engineering, has improved my technical and professional skills. The difficulties of fitting the research into engineering applications still exist (several years of practice with mechanical engineering tasks help me, though), but it has become easier to face them as part of the research and take them as motivation to move forward with the PhD. The support of the supervisors has been essential: through constructive feedback and fruitful scientific discussions, we have already achieved great results (nine accepted publications distributed among the ESRs in the project!) and put together great ideas for future research.
