
We study a localized version of kernel ridge regression that can continuously and smoothly interpolate highly non-linear underlying functions from observed data points. This new method can handle data for which (a) the local density is highly uneven and (b) the function values change dramatically in certain small but unknown regions. By introducing a new rank-based interpolation scheme, the interpolated values produced by our local method vary continuously with the query points. Compared with traditional kernel ridge regression, our method is scalable because it avoids inverting the full kernel matrix.
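As a rough illustration of the localization idea (not the paper's rank-based interpolation scheme), a kernel ridge regression fit restricted to a query point's nearest neighbors avoids inverting the full \(n \times n\) kernel matrix. All function names and parameter choices below are hypothetical:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def local_krr_predict(X, y, x_query, k=20, lam=1e-2, gamma=1.0):
    # Fit kernel ridge regression only on the k nearest neighbors
    # of the query point, so each fit solves a small k x k system
    # instead of inverting the full n x n kernel matrix.
    idx = np.argsort(((X - x_query) ** 2).sum(1))[:k]
    Xk, yk = X[idx], y[idx]
    K = rbf_kernel(Xk, Xk, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(k), yk)
    kq = rbf_kernel(x_query[None, :], Xk, gamma)
    return float((kq @ alpha)[0])
```

Note that this plain nearest-neighbor localization is generally discontinuous at points where the neighbor set changes; the rank-based scheme in the abstract is precisely what restores continuity.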

Machine learning (ML) has become one of the most anticipated approaches to demand prediction. However, it is still uncertain how ML methods perform relative to traditional econometric methods at different data scales. This study estimates and compares the out-of-sample predictive accuracy of household budget share for organic fresh produce using two parametric models and six ML methods under different sample sizes. Results show that ML methods, particularly the Logistic Elastic Net, perform better than econometric models at regular sample sizes. In contrast, when dealing with big data, econometric models reach the same accuracy level as ML methods, whereas random forest presents a possible overfitting problem. This study illustrates the competence of ML methods in demand estimation, but choosing the optimal method requires considering product specifics, sample size, and observable features.
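For a sense of one of the ML methods compared, here is a minimal NumPy sketch of elastic-net-penalized logistic regression fit by proximal gradient descent. The penalty parameters and update rule are illustrative assumptions, not the study's implementation:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the l1 penalty.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def logistic_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, lr=0.1, iters=2000):
    # Proximal gradient descent on the elastic-net-penalized
    # logistic loss: mean log-loss + alpha * (l1_ratio * |w|_1
    # + 0.5 * (1 - l1_ratio) * |w|_2^2).  Labels y are in {0, 1}.
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        pr = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
        grad = X.T @ (pr - y) / n + alpha * (1 - l1_ratio) * w
        w = soft_threshold(w - lr * grad, lr * alpha * l1_ratio)
    return w
```

The l1 part of the penalty drives irrelevant coefficients exactly to zero, which is what makes the elastic net attractive when only a few observable features matter.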


Over-parameterized models like deep nets and random forests have become very popular in machine learning. However, the natural goals of continuity and differentiability, common in regression models, are now often ignored in modern over-parameterized, locally adaptive models. We propose a general framework to construct a globally continuous and differentiable model as a weighted average of locally learned models in corresponding local regions. This model is competitive on data whose density or scale of function values differs across local regions. We demonstrate that when we mix kernel ridge and polynomial regression terms in the local models, and stitch them together continuously, we achieve faster statistical convergence in theory and improved performance in various practical settings.
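The weighted-average construction can be sketched as a partition-of-unity blend: smooth, normalized weights stitch the local models into one globally continuous and differentiable predictor. The Gaussian weight choice below is an assumption for illustration, not necessarily the paper's stitching scheme:

```python
import numpy as np

def smooth_blend(x, centers, local_models, tau=1.0):
    # Partition-of-unity blend in one dimension: one smooth Gaussian
    # weight per local region, normalized to sum to one.  The blended
    # prediction inherits continuity and differentiability from the
    # weights and the local models.
    w = np.exp(-((x - centers) ** 2) / tau)
    w = w / w.sum()
    return sum(wi * f(x) for wi, f in zip(w, local_models))
```

Because every weight is infinitely differentiable and they sum to one everywhere, the global model agrees closely with each local model near its own center while transitioning smoothly in between.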

Variable screening is a useful research area for dealing with ultra-high-dimensional data. When both marginally and jointly dependent predictors are present, existing approaches such as conditional screening or iterative screening often suffer from instability with respect to the choice of the conditional set or from heavy computational burden, respectively. In this paper, we propose a new independence measure, named conditional martingale difference divergence (\(\text{CMD}_{\cal{H}}\)), that can be treated as either a conditional or a marginal independence measure. Under regularity conditions, we show that the sure screening property of \(\text{CMD}_{\cal{H}}\) holds for both marginally and jointly active variables. Based on this measure, we propose a kernel-based, model-free variable screening method that is efficient, flexible, and stable against high correlation and heterogeneity. In addition, we provide a data-driven method for selecting the conditional set when it is unknown. In simulations and real data applications, we demonstrate the superior performance of the proposed method.
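For intuition, the marginal (unconditional) martingale difference divergence, a building block for conditional versions, has a simple sample form, and screening by it ranks predictors by conditional-mean dependence. This sketch is not the \(\text{CMD}_{\cal{H}}\) statistic itself:

```python
import numpy as np

def mdd(x, y):
    # Sample martingale difference divergence MDD(Y | X)^2:
    # -(1/n^2) * sum_{i,j} (y_i - ybar)(y_j - ybar) |x_i - x_j|.
    # Nonnegative, and zero in population iff E[Y | X] is constant in X.
    yc = y - y.mean()
    D = np.abs(x[:, None] - x[None, :])
    return -(yc[:, None] * yc[None, :] * D).mean()

def mdd_screen(X, y, top_k):
    # Marginal screening: rank predictors by MDD and keep the top_k.
    scores = np.array([mdd(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k]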

A growing number of learning scenarios involve a set of learners/analysts, each equipped with a unique dataset and algorithm, who may collaborate with each other to enhance their learning performance. From the perspective of a particular learner, careless collaboration with task-irrelevant learners is likely to incur modeling error. A crucial problem is to search for the most appropriate collaborators so that their data and modeling resources can be effectively leveraged. Motivated by this, we propose to study the problem of 'meta clustering', where the goal is to identify subsets of relevant learners whose collaboration will improve the performance of each individual learner. In particular, we study the scenario where each learner is performing a supervised regression, and the meta clustering aims to categorize the underlying supervised relations (between responses and predictors) instead of the raw data. We propose a general method named Select-Exchange-Cluster (SEC) for performing such a clustering. Our method is computationally efficient as it does not require learners to exchange their raw data. We prove that the SEC method can accurately cluster the learners into appropriate collaboration sets according to their underlying regression functions. Synthetic and real data examples show the desired performance and wide applicability of SEC across a variety of learning tasks.
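A toy version of the underlying idea, clustering learners by their fitted models rather than their raw data, might look as follows. This is plain k-means on least-squares coefficients, not the SEC algorithm:

```python
import numpy as np

def fit_coefs(datasets):
    # Each learner fits least squares on its own data; only the
    # coefficient vectors (not the raw data) are shared.
    return [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in datasets]

def cluster_learners(coefs, n_clusters=2, iters=50):
    # Plain k-means on the coefficient vectors, with a deterministic
    # farthest-point initialization.
    C = np.stack(coefs)
    centers = [C[0]]
    for _ in range(1, n_clusters):
        d = np.min([((C - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(C[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        d = ((C[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = C[labels == k].mean(0)
    return labels
```

Learners in the same cluster share (approximately) the same regression function, so pooling within a cluster is the collaboration that helps; clustering raw data instead could group learners with similar inputs but entirely different response relations.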

This paper aims to better predict highly skewed auto insurance claims by combining candidate predictions. We analyze a version of the Kangaroo Auto Insurance company data and study the effects of combining different methods using five measures of prediction accuracy. The results show the following. First, when there is an outstanding (in terms of Gini index) prediction among the candidates, the "forecast combination puzzle" phenomenon disappears: the simple average performs much worse than the more sophisticated model combination methods, indicating that a well-chosen combination method can help avoid performance degradation. Second, the choice of the prediction accuracy measure is crucial in defining the best candidate prediction for "low frequency and high severity" (LFHS) data. For example, mean square error (MSE) does not distinguish well between model combination methods, as the values are close. Third, the performances of different model combination methods can differ drastically. We propose a new model combination method, named ARM-Tweedie, for such LFHS data; it benefits from an optimal rate of convergence and exhibits desirable performance in several measures for the Kangaroo data. Fourth, overall, model combination methods improve the prediction accuracy for auto insurance claim costs. In particular, Adaptive Regression by Mixing (ARM), ARM-Tweedie, and constrained linear regression can improve forecast performance when there are only weak learners or when no dominant learner exists.
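A minimal sketch of validation-loss-based model combination (an ARM-style exponential weighting, not the ARM-Tweedie method itself) illustrates how combination weights can favor stronger candidates instead of averaging them uniformly:

```python
import numpy as np

def exp_weight_combine(preds_val, y_val, preds_test, temp=1.0):
    # preds_val / preds_test: (n_candidates, n_obs) prediction arrays.
    # Weight each candidate by exp(-validation MSE / temp), normalized,
    # then combine the test predictions with those weights.
    losses = ((preds_val - y_val) ** 2).mean(1)
    w = np.exp(-(losses - losses.min()) / temp)
    w /= w.sum()
    return w @ preds_test
```

When one candidate is clearly dominant, its weight approaches one and the combination avoids the degradation that an equal-weight average would suffer, which mirrors the paper's first finding.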

High-dimensional linear regression with interaction effects is broadly applied in research fields such as bioinformatics and social science. In this paper, we first investigate the minimax rate of convergence for regression estimation in high-dimensional sparse linear models with two-way interactions. We derive matching upper and lower bounds under three types of heredity conditions: strong heredity, weak heredity and no heredity. From the results: (i) A stronger heredity condition may or may not drastically improve the minimax rate of convergence. In fact, in some situations, the minimax rates of convergence are the same under all three heredity conditions; (ii) The minimax rate of convergence is determined by the maximum of the total price of estimating the main effects and that of estimating the interaction effects, which goes beyond purely comparing the order of the number of non-zero main effects \(r_1\) and non-zero interaction effects \(r_{2}\); (iii) Under any of the three heredity conditions, the estimation of the interaction terms may be the dominant part in determining the rate of convergence for two different reasons: 1) there exist more interaction terms than main effect terms or 2) a large ambient dimension makes it more challenging to estimate even a small number of interaction terms. Second, we construct an adaptive estimator that achieves the minimax rate of convergence regardless of the true heredity condition and the sparsity indices \(r_1, r_2\).
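For concreteness, the two-way interaction model and the heredity conditions referenced above can be written in a standard form (the notation here is illustrative and may differ from the paper's exact formulation):
\[
y = \sum_{j=1}^{p} \beta_j x_j + \sum_{1 \le j < k \le p} \gamma_{jk} x_j x_k + \varepsilon,
\qquad \|\beta\|_0 \le r_1, \quad \|\gamma\|_0 \le r_2,
\]
where strong heredity requires \(\gamma_{jk} \neq 0 \Rightarrow \beta_j \neq 0\) and \(\beta_k \neq 0\), weak heredity relaxes the "and" to "or", and no heredity imposes no constraint between main effects and interaction effects.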

With model selection uncertainty now well recognized as non-negligible, data analysts should no longer be satisfied with the output of a single final model from a model selection process, regardless of its sophistication. To improve reliability and reproducibility in model choice, one constructive approach is to make good use of a sound variable importance measure. Although interesting importance measures are available and increasingly used in data analysis, little theoretical justification has been provided for them. In this paper, we propose a new variable importance measure, sparsity oriented importance learning (SOIL), for high-dimensional regression from a sparse linear modeling perspective, taking into account variable selection uncertainty via a sensible model weighting. The SOIL method is theoretically shown to have the inclusion/exclusion property: when the model weights properly concentrate around the true model, the SOIL importance can well separate the variables in the true model from the rest. In particular, even if the signal is weak, SOIL rarely gives variables outside the true model significantly higher importance values than those in the true model. Extensive simulations in several illustrative settings and real data examples with guided simulations show the desirable properties of the SOIL importance in contrast to other importance measures.
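The core of an inclusion-based importance measure can be sketched in a few lines: each variable's importance is the weighted frequency of its appearance across candidate models. The model supports and weights below are hypothetical inputs; SOIL's actual construction of the model weights is more sophisticated:

```python
import numpy as np

def soil_importance(models, weights, p):
    # models: list of sets of variable indices (candidate model supports).
    # weights: model weights summing to one.
    # A variable's importance is the total weight of the models
    # that include it, so it lies in [0, 1].
    imp = np.zeros(p)
    for support, w in zip(models, weights):
        imp[list(support)] += w
    return imp
```

If the weights concentrate on models containing the true support, true variables receive importance near one and spurious variables near zero, which is the inclusion/exclusion property in miniature.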