Title: Practical Significance Ranking of Regressors and Exact t Density
Description: Consider a possibly nonlinear nonparametric regression with p regressors. We provide 13 methods, including machine learning tools, to rank regressors by their practical significance or importance. The comprehensive methods are as follows. m6 = generalized partial correlation coefficient or GPCC by Vinod (2021) <doi:10.1007/s10614-021-10190-x> and Vinod (2022) <https://www.mdpi.com/1911-8074/15/1/32>. m7 = a generalization of psychologists' effect size, incorporating nonlinearity and many variables. m8 = local linear partial (dy/dxi) using the 'np' package for kernel regressions. m9 = partial (dy/dxi) using the 'NNS' package. m10 = importance measure using the 'NNS' boost function. m11 = Shapley value measure of importance (cooperative game theory). m12 and m13 = two versions of the random forest algorithm. Taraldsen's exact density for the sampling distribution of correlations is also provided.
Authors: Hrishikesh Vinod [aut, cre]
Maintainer: Hrishikesh Vinod <[email protected]>
License: GPL (>= 2)
Version: 0.1.2
Built: 2024-10-27 04:20:45 UTC
Source: https://github.com/cran/practicalSigni
Psychologists' so-called "effect size" reveals the practical significance of only one regressor. This function generalizes their algorithm to two or more regressors (p > 2). The generalization first converts the regressor xi into a categorical treatment variable with only two categories. One imagines that observations larger than the median (xit > median(xi)) are "treated," and those below the median are "untreated." The aim is to measure the size of the (treatment) effect of xi on y. Denote the other variables, with the subscript "o", as xo. Since we have p regressors in our multiple regression, we need to remove the nonlinear kernel regression effect of the other variables xo on y while focusing on the effect of xi. There are two options for treating xo: (i) leaving xo as it is in the data, or (ii) converting xo to binary at the median. One chooses the first option (i) by setting the logical argument ane=TRUE when calling the function; ane=TRUE is the default. Set ane=FALSE for the second option.
effSizCut(y, bigx, ane = TRUE)
y: A (T x 1) vector of dependent variable data values.
bigx: A (T x p) data matrix of regressor variables xi associated with the regression.
ane: Logical variable controlling the treatment of the other regressors. If ane=TRUE (default), the other regressors enter the kernel regression as they are, without being forced to be binary variables. If ane=FALSE, the kernel regression removes the effect of the other regressors after converting them to binary categorical variables as well.
out: a vector of p t-statistic values, one for each of the p regressors.
The aim is to answer the following question: which regressor has the largest effect on the dependent variable? We assume that the signs of the regressors are already adjusted so that a numerically larger effect size implies that the corresponding regressor is more important, with the largest effect size identifying the regressor that best explains the dependent variable y.
Prof. H. D. Vinod, Economics Dept., Fordham University, NY
set.seed(9)
y <- sample(1:15, replace = TRUE)
x1 <- sample(2:16, replace = TRUE)
x2 <- sample(3:17, replace = TRUE)
effSizCut(y, bigx = cbind(x1, x2), ane = TRUE)
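The idea behind effSizCut() can be sketched in a few lines of base R. This is an illustrative reconstruction, not the package's implementation: lm() stands in for the kernel regression the package actually uses to remove the effect of the other regressors, and the function name effect_t_sketch is hypothetical.

```r
# Sketch of the effect-size t-statistic idea (illustrative, not the package code):
# 1. remove the effect of the "other" regressors xo from y (lm() as a stand-in
#    for the package's kernel regression),
# 2. split observations at median(xi) into "treated" (above) and "untreated",
# 3. compare the two groups with a two-sample t-statistic.
effect_t_sketch <- function(y, bigx, i) {
  xo   <- bigx[, -i, drop = FALSE]       # the other regressors
  resy <- residuals(lm(y ~ xo))          # y with the effect of xo removed
  trt  <- bigx[, i] > median(bigx[, i])  # binary treatment at the median
  unname(t.test(resy[trt], resy[!trt])$statistic)
}

set.seed(9)
y    <- rnorm(40)
bigx <- cbind(x1 = rnorm(40), x2 = rnorm(40))
effect_t_sketch(y, bigx, i = 1)
```

A large absolute t-statistic indicates a large practical effect of xi on y after the other regressors are accounted for.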
This is an internal function of the R package practicalSigni. Psychologists use effect size to evaluate the practical importance of a treatment on a dependent variable using a binary (0, 1) variable. Assuming numerical data, we can always compute the median and regard values less than or equal to the median as zero and other values as unity.
fncut(x)
x: numerical vector of data values.
x: a vector of zeros and ones, split at the median.
Prof. H. D. Vinod, Fordham University, NY
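The median split that fncut() performs can be written in one line of base R. This is a minimal sketch; fncut_sketch is an illustrative name, not the package's internal code.

```r
# Minimal sketch of the median split performed by fncut(): values less than
# or equal to the median become 0, values above it become 1.
fncut_sketch <- function(x) as.numeric(x > median(x))

fncut_sketch(c(10, 20, 30, 40, 50))  # median is 30, so the result is 0 0 0 1 1
```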
Thirteen methods are denoted m1 to m13. Each yields p numbers when there are p regressors denoted xi.
m1 = OLS coefficient slopes.
m2 = t-statistic of each slope.
m3 = OLS beta coefficients after all variables are scaled to mean zero and sd = 1.
m4 = Pearson correlation coefficient between y and xi (only two variables at a time, assuming linearity).
Let r*(y|xi) denote the generalized correlation coefficient allowing for nonlinearity from Vinod (2021, 2022). It does not equal the analogous r*(xi|y). The larger of the two, max(r*(y|xi), r*(xi|y)), is given by the function depMeas() from the 'generalCorr' package.
m5 = depMeas, which allows nonlinearity. m5 is not comprehensive because it measures only two variables, y and xi, at a time.
m6 = generalized partial correlation coefficient or GPCC. This is the first comprehensive measure of practical significance.
m7 = a generalization of psychologists' "effect size" after incorporating the nonlinear effect of other variables.
m8 = local linear partial derivative (dy/dxi) using the 'np' package for kernel regressions and local linear derivatives.
m9 = partial derivative (dy/dxi) using the 'NNS' package.
m10 = importance measure using the NNS.boost() function of 'NNS'.
m11 = Shapley value measure of importance (cooperative game theory).
m12 and m13 = two versions of the random forest algorithm measuring the importance of regressors.
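The four older linear/bivariate measures m1 to m4 are easy to reproduce in base R. The following is a hedged sketch on simulated data; the package's own computations may differ in details, and the variable names are illustrative.

```r
# Sketch of the four older linear/bivariate measures (m1 to m4) in base R.
set.seed(9)
y    <- rnorm(50)
bigx <- cbind(x1 = rnorm(50), x2 = rnorm(50))

fit <- lm(y ~ bigx)
m1 <- coef(fit)[-1]                               # OLS slopes
m2 <- summary(fit)$coefficients[-1, "t value"]    # t-statistics of the slopes
m3 <- coef(lm(scale(y) ~ scale(bigx)))[-1]        # standardized (beta) coefficients
m4 <- apply(bigx, 2, function(xi) cor(y, xi))     # Pearson r, two variables at a time
cbind(m1, m2, m3, m4)
```

The comprehensive measures m5 to m13 require the additional packages named above ('generalCorr', 'np', 'NNS', 'randomForest').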
pracSig13(y, bigx, yes13 = rep(1, 13), verbo = FALSE)
y: input dependent variable data as a vector.
bigx: input matrix of p regressor variables.
yes13: vector of ones to compute the corresponding 13 measures m1 to m13. The default is all ones, meaning compute all; e.g., yes13[10]=0 means do not compute the m10 method.
verbo: logical; print results along the way. Default = FALSE.
Since m6 and m10 can slow down computations considerably, we recommend setting yes13[6]=0 and yes13[10]=0, at least initially, to turn off the slow computation of m6 and m10 and get quick answers for the other measures.
output matrix (p x 13) containing m1 to m13 criteria (numerical measures of practical significance) along columns and a row for each regressor (excluding the intercept).
Needs the function kern(), which requires the package 'np'. Also needs the 'NNS' and 'randomForest' packages.
The machine learning methods are subject to random seeds. For some seed values, m10 values from NNS.boost() become degenerate and are reported as NA or missing. In that case the average ranking output r613 from reportRank() needs manual adjustments.
Prof. H. D. Vinod, Economics Dept., Fordham University, NY
Vinod, H. D. "Generalized Correlation and Kernel Causality with Applications in Development Economics," Communications in Statistics - Simulation and Computation, 2015. doi:10.1080/03610918.2015.1122048
Vinod, H. D. "Generalized Correlations and Instantaneous Causality for Data Pairs Benchmark" (March 8, 2015). https://www.ssrn.com/abstract=2574891
Vinod, H. D. "Generalized, Partial and Canonical Correlation Coefficients," Computational Economics, vol. 59, 2021, pp. 1-28. https://link.springer.com/article/10.1007/s10614-021-10190-x
Vinod, H. D. "Kernel Regression Coefficients for Practical Significance," Journal of Risk and Financial Management, 15(1), 2022, pp. 1-13. https://doi.org/10.3390/jrfm15010032
Vinod, H. D. Hands-On Intermediate Econometrics Using R (2022), World Scientific Publishers: Hackensack, NJ. https://www.worldscientific.com/worldscibooks/10.1142/12831
See Also: effSizCut, reportRank
Compute the p-value for the exact correlation significance test using Taraldsen's exact method.
pvTarald(n, rho = 0, obsr)
n: number of observations; n-1 is the degrees of freedom.
rho: true unknown population correlation coefficient in the interval [-1, 1]; default = 0.
obsr: observed correlation coefficient r.
ans: the p-value, i.e., the probability under the sampling distribution of observing a correlation as extreme as, or more extreme than, the input obsr (observed r).
Needs the function hypergeo() from the 'hypergeo' package.
Prof. H. D. Vinod, Economics Dept., Fordham University, NY
Taraldsen, G. "The Confidence Density for Correlation" Sankhya: The Indian Journal of Statistics 2023, Volume 85-A, Part 1, pp. 600-616.
See Also: qTarald
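Taraldsen's exact density requires the hypergeometric function, but a rough cross-check of a pvTarald() result is possible with the classical Fisher z approximation. The sketch below is two-sided and approximate, and the function name pv_fisher_z is illustrative, not from the package.

```r
# Rough two-sided p-value via the Fisher z approximation (an approximate
# cross-check, NOT Taraldsen's exact density).
pv_fisher_z <- function(n, rho = 0, obsr) {
  z  <- atanh(obsr)   # Fisher z-transform of the observed correlation
  z0 <- atanh(rho)    # Fisher z-transform of the hypothesized rho
  # z - z0 is approximately normal with standard deviation 1/sqrt(n - 3)
  2 * pnorm(-abs(z - z0) * sqrt(n - 3))
}

pv_fisher_z(n = 30, rho = 0, obsr = 0.5)
```

For moderate n the approximation is usually close to the exact answer; large discrepancies suggest an input error.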
Compute the quantile for the exact density of the correlation coefficient using Taraldsen's method.
qTarald(n, rho = 0, cum)
n: number of observations; n-1 is the degrees of freedom.
rho: true unknown population correlation coefficient; default = 0.
cum: cumulative probability for which the quantile is needed.
r: the quantile of Taraldsen's density for the correlation coefficient.
Needs the function hypergeo::hypergeo(). The quantiles are computed by numerical methods and rounded to 3 decimal places.
Prof. H. D. Vinod, Economics Dept., Fordham University, NY
Taraldsen, G. "The Confidence Density for Correlation" Sankhya: The Indian Journal of Statistics 2023, Volume 85-A, Part 1, pp. 600-616.
See Also: pvTarald
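As with the p-value, an approximate counterpart of a qTarald() quantile is available in closed form from the Fisher z approximation, which can serve as a sanity check on the exact numerical inversion. The function name q_fisher_z is illustrative, not the package's code.

```r
# Approximate quantile of the sampling distribution of r via the Fisher z
# approximation (a sanity check, NOT Taraldsen's exact density).
q_fisher_z <- function(n, rho = 0, cum) {
  # invert the normal approximation on the z scale, then map back with tanh
  tanh(qnorm(cum, mean = atanh(rho), sd = 1 / sqrt(n - 3)))
}

q_fisher_z(n = 30, rho = 0, cum = 0.5)  # median under rho = 0 is 0
```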
This function generates a report based on the regression of y on bigx. It acknowledges that some methods for evaluating the importance of a regressor in explaining y may report the importance value with a wrong (unrealistic) sign. For example, m2 reports t-values. Imagine that, due to collinearity, the m2 value is negative when prior knowledge of the subject matter says the coefficient, and hence the t-statistic, should be positive. The wrong sign means the regressor should be regarded as relatively less important in explaining y; when the sign is wrong, the larger the absolute size of the t-statistic, the less its true importance. The ranking of coefficients computed here suitably deprecates the importance of a regressor when its coefficient has the wrong sign (perverse direction).
reportRank(y, bigx, yesLatex = 1, yes13 = rep(1, 13), bsign = 0, dig = 3, verbo = FALSE)
y: A (T x 1) vector of dependent variable data.
bigx: A (T x p) data matrix of regressor variables xi associated with the regression.
yesLatex: default 1 means print LaTeX-ready tables.
yes13: default vector of ones to compute all 13 measures.
bsign: A (p x 1) vector of the right signs of the regression coefficients. The default bsign=0 means the right sign is the same as the sign of the covariance cov(y, xi).
dig: digits to be printed in LaTeX tables; default dig=3.
verbo: logical; print results by pracSig13. Default = FALSE.
v15: practical significance index values (sign adjusted) for m1 to m5, using the older linear and/or bivariate methods.
v613: practical significance index values for m6 to m13, the newer comprehensive and nonlinear methods.
r15: ranks and average rank for m1 to m5, using the older linear and/or bivariate methods.
r613: ranks and average rank for m6 to m13, the newer comprehensive and nonlinear methods.
The machine learning methods are subject to random seeds. For some seed values, the m10 values from NNS.boost() can become degenerate and are reported as NA or missing. In that case, the average ranking output r613 here needs manual adjustment.
Prof. H. D. Vinod, Economics Dept., Fordham University, NY
set.seed(9)
y <- sample(1:15, replace = TRUE)
x0 <- sample(2:16, replace = TRUE)
x2 <- sample(3:17, replace = TRUE)
x3 <- sample(4:18, replace = TRUE)
options(np.messages = FALSE)
yes13 <- rep(1, 13)
yes13[10] <- 0
reportRank(y, bigx = cbind(x0, x2, x3), yes13 = yes13)
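The default sign convention used when bsign = 0 can be sketched directly in base R: the "right" sign of each coefficient is taken to be the sign of cov(y, xi). This is an illustrative reconstruction of the stated default, and bsign_default is a hypothetical name.

```r
# Sketch of the default sign convention (bsign = 0): the right sign of each
# coefficient is taken as the sign of the covariance cov(y, xi).
set.seed(9)
y    <- rnorm(20)
bigx <- cbind(x1 = rnorm(20), x2 = rnorm(20))
bsign_default <- apply(bigx, 2, function(xi) sign(cov(y, xi)))
bsign_default  # one sign (+1 or -1) per regressor
```

Passing a user-supplied bsign vector instead overrides this covariance-based default with prior subject-matter knowledge.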