Package 'rminer'

Title:	Data Mining Classification and Regression Methods
Description:	Facilitates the use of data mining algorithms in classification and regression (including time series forecasting) tasks by presenting a short and coherent set of functions. Versions: 1.4.8 improved help, several warning and error code fixes (more stable version, all examples run correctly); 1.4.7 improved Importance function and examples, minor error fixes; 1.4.6 / 1.4.5 / 1.4.4 new automated machine learning (AutoML) and ensembles, via improved fit(), mining() and mparheuristic() functions, and new categorical preprocessing, via improved delevels() function; 1.4.3 new metrics (e.g., macro precision, explained variance), new "lssvm" model and improved mparheuristic() function; 1.4.2 new "NMAE" metric, "xgboost" and "cv.glmnet" models (16 classification and 18 regression models); 1.4.1 new tutorial and more robust version; 1.4 - new classification and regression models, with a total of 14 classification and 15 regression methods, including: Decision Trees, Neural Networks, Support Vector Machines, Random Forests, Bagging and Boosting; 1.3 and 1.3.1 - new classification and regression metrics; 1.2 - new input importance methods via improved Importance() function; 1.0 - first version.
Authors:	Paulo Cortez [aut, cre]
Maintainer:	Paulo Cortez <[email protected]>
License:	GPL-2
Version:	1.4.8
Built:	2025-03-27 03:53:03 UTC
Source:	https://github.com/cran/rminer

Help Index

Create a training set (data.frame) from a time series using a sliding window.
Computes k-fold cross validation for rminer models.
Reduce, replace or transform levels of a data.frame or factor variable (useful for preprocessing datasets).
Fit a supervised data mining model (classification or regression) model
Computes indexes for holdout data split into training and test sets.
Measure input importance (including sensitivity analysis) given a supervised data mining model.
Missing data imputation (e.g. substitution by value or hotdeck method).
Compute long term forecasts.
Mining graph function
Powerful function that trains and tests a particular fit model under several runs and a given validation method
Compute classification or regression error metrics.
Function that returns a list of searching (hyper)parameters for a particular model (classification or regression) or for a multiple list of models (automl or ensembles).
predict method for fit objects (rminer)
Synthetic regression and classification datasets for measuring input importance of supervised learning models
Load/save into a file the result of a fit (model) or mining functions.
sin1 regression dataset
VEC plot function (to use in conjunction with Importance function).

Create a training set (data.frame) from a time series using a sliding window.

Description

Create a training set (data.frame) from a time series using a sliding window.

Usage

CasesSeries(t, W, start = 1, end = length(t))
CasesSeries(t, W, start = 1, end = length(t))

Arguments

`t`	a time series (numeric vector).
`W`	a sliding window (with time lags, numeric vector).
`start`	starting period.
`end`	ending period.

Details

Check reference for details.

Value

Returns a data.frame, where y is the output target and the inputs are the time lags.

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

To check for more details:
P. Cortez.
Sensitivity Analysis for Time Lag Selection to Forecast Seasonal Time Series using Neural Networks and Support Vector Machines.
In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2010), pp. 3694-3701, Barcelona, Spain, July, 2010. IEEE Computer Society, ISBN: 978-1-4244-6917-8 (DVD edition).
doi:10.1109/IJCNN.2010.5596890
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210

Examples

t=1:20
d=CasesSeries(1:10,c(1,3,4))
print(d)
d=CasesSeries(1:10,c(1,2,3))
print(d)
t=1:20
d=CasesSeries(1:10,c(1,3,4))
print(d)
d=CasesSeries(1:10,c(1,2,3))
print(d)

Computes k-fold cross validation for rminer models.

Description

Computes k-fold cross validation for rminer models.

Usage

crossvaldata(x, data, theta.fit, theta.predict, ngroup = 10, 
             mode = "stratified", seed = NULL, model, task, feature = "none",
             ...)
crossvaldata(x, data, theta.fit, theta.predict, ngroup = 10, 
             mode = "stratified", seed = NULL, model, task, feature = "none",
             ...)

Arguments

`x`	See `fit` for details.
`data`	See `fit` for details.
`theta.fit`	fitting function
`theta.predict`	prediction function
`ngroup`	number of folds
`mode`	Possibilities are: "stratified", "random" or "order" (see `holdout` for details).
`seed`	if `NULL` then no seed is used and the current R randomness is assumed; else a fixed seed is adopted to generate local random sample sequences, returning always the same result for the same seed (local means that it does not affect the state of other random number generations called after this function, see `holdout` example).
`model`	See `fit` for details.
`task`	See `fit` for details.
`feature`	See `fit` for details.
`...`	Additional parameters sent to `theta.fit` or `theta.predic` (e.g. `search`)

Details

Standard k-fold cross-validation adopted for rminer models. By default, for classification tasks ("class" or "prob") a stratified sampling is used (the class distributions are identical for each fold), unless mode is set to random or order (see holdout for details).

Value

Returns a list with:

$cv.fit – all predictions (factor if task="class", matrix if task="prob" or numeric if task="reg");
$model – vector list with the model for each fold.
$mpar – vector list with the mpar for each fold;
$attributes – the selected attributes for each fold if a feature selection algorithm was adopted;
$ngroup – the number of folds;
$leave.out – the computed size for each fold (=nrow(data)/ngroup);
$groups – vector list with the indexes of each group;
$call – the call of this function;

Note

A better control (e.g. use of several Runs) is achieved using the simpler mining function.

Author(s)

This function was adapted by Paulo Cortez from the crossval function of the bootstrap library (S original by R. Tibshirani and R port by F. Leisch).

References

Check the crossval function.

Examples

### dontrun is used when the execution of the example requires some computational effort.
## Not run: 
 data(iris)
 # 3-fold cross validation using fit and predict
 # the control argument is sent to rpart function 
 # rpart.control() is from the rpart package
 M=crossvaldata(Species~.,iris,fit,predict,ngroup=3,seed=12345,model="rpart",
                task="prob", control = rpart::rpart.control(cp=0.05))
 print("cross validation object:")
 print(M)
 C=mmetric(iris$Species,M$cv.fit,metric="CONF")
 print("confusion matrix:")
 print(C)

## End(Not run)
### dontrun is used when the execution of the example requires some computational effort.
## Not run: 
 data(iris)
 # 3-fold cross validation using fit and predict
 # the control argument is sent to rpart function 
 # rpart.control() is from the rpart package
 M=crossvaldata(Species~.,iris,fit,predict,ngroup=3,seed=12345,model="rpart",
                task="prob", control = rpart::rpart.control(cp=0.05))
 print("cross validation object:")
 print(M)
 C=mmetric(iris$Species,M$cv.fit,metric="CONF")
 print("confusion matrix:")
 print(C)

## End(Not run)

Reduce, replace or transform levels of a data.frame or factor variable (useful for preprocessing datasets).

Description

Reduce, replace or transform levels of a data.frame or factor variable (useful for preprocessing datasets).

Usage

delevels(x, levels, label = NULL)
delevels(x, levels, label = NULL)

Arguments

x

factor with several levels or a data.frame. If a data.frame, then all factor attributes are transformed.

levels

character vector with several options:

idf – factor is transformed into a numeric vector using IDF transform.
pcp or c("pcp",perc) – factor is transformed using PCP transform. If perc is not provided, the default 0.1 value is used.
any other values – all level values are merged into a single factor level according to label.

Another possibility is to define a vector list, with levels[[i]] values for each factor of the data.frame (see example).

label

the new label used for all levels examples (if NULL then "_OTHER" is assumed).

Details

The Inverse Document Frequency (IDF) uses f(x)= log(n/f_x), where n is the length of x and f_x is the frequency of x.
The Percentage Categorical Pruned (PCP) merges all least frequent levels (summing up to perc percent) into a single level.
When other values are used for levels, this function replaces all levels values with the single label value.

Value

Returns a transformed factor or data.frame.

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

PCP transform:
L.M. Matos, P. Cortez, R. Mendes, A. Moreau.
Using Deep Learning for Mobile Marketing User Conversion Prediction. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2019), paper N-19327, Budapest, Hungary, July, 2019 (8 pages), IEEE, ISBN 978-1-7281-2009-6.
doi:10.1109/IJCNN.2019.8851888
http://hdl.handle.net/1822/62771
IDF transform:
L.M. Matos, P. Cortez, R. Mendes and A. Moreau.
A Comparison of Data-Driven Approaches for Mobile Marketing User Conversion Prediction. In Proceedings of 9th IEEE International Conference on Intelligent Systems (IS 2018), pp. 140-146, Funchal, Madeira, Portugal, September, 2018, IEEE, ISBN 978-1-5386-7097-2.
https://ieeexplore.ieee.org/document/8710472
http://hdl.handle.net/1822/61586

Examples

### simples examples:
f=factor(c("A","A","B","B","C","D","E"))
print(table(f))
# replace "A" with "a":
f1=delevels(f,"A","a")
print(table(f1))
# merge c("C","D","E") into "CDE":
f2=delevels(f,c("C","D","E"),"CDE")
print(table(f2))
# merge c("B","C","D","E") into _OTHER:
f3=delevels(f,c("B","C","D","E"))
print(table(f3))

## Not run: 
# larger factor:
x=factor(c(1,rep(2,2),rep(3,3),rep(4,4),rep(5,5),rep(10,10),rep(100,100)))
print(table(x))
# IDF: frequent values are close to zero and
# infrequent ones are more close to each other:
x1=delevels(x,"idf")
print(table(x1))
# PCP: infrequent values are merged
x2=delevels(x,c("pcp",0.1)) # around 10
print(table(x2))

# example with a data.frame:
y=factor(c(rep("a",100),rep("b",20),rep("c",5)))
z=1:125 # numeric
d=data.frame(x=x,y=y,z=z,x2=x)
print(summary(d))

# IDF:
d1=delevels(d,"idf")
print(summary(d1))
# PCP:
d2=delevels(d,"pcp")
print(summary(d2))
# delevels:
L=vector("list",ncol(d)) # one per attribute
L[[1]]=c("1","2","3","4","5")
L[[2]]=c("b","c")
L[[4]]=c("1","2","3") # different on purpose
d3=delevels(d,levels=L,label="other")
print(summary(d3))

## End(Not run) # end dontrun 

### simples examples:
f=factor(c("A","A","B","B","C","D","E"))
print(table(f))
# replace "A" with "a":
f1=delevels(f,"A","a")
print(table(f1))
# merge c("C","D","E") into "CDE":
f2=delevels(f,c("C","D","E"),"CDE")
print(table(f2))
# merge c("B","C","D","E") into _OTHER:
f3=delevels(f,c("B","C","D","E"))
print(table(f3))

## Not run: 
# larger factor:
x=factor(c(1,rep(2,2),rep(3,3),rep(4,4),rep(5,5),rep(10,10),rep(100,100)))
print(table(x))
# IDF: frequent values are close to zero and
# infrequent ones are more close to each other:
x1=delevels(x,"idf")
print(table(x1))
# PCP: infrequent values are merged
x2=delevels(x,c("pcp",0.1)) # around 10
print(table(x2))

# example with a data.frame:
y=factor(c(rep("a",100),rep("b",20),rep("c",5)))
z=1:125 # numeric
d=data.frame(x=x,y=y,z=z,x2=x)
print(summary(d))

# IDF:
d1=delevels(d,"idf")
print(summary(d1))
# PCP:
d2=delevels(d,"pcp")
print(summary(d2))
# delevels:
L=vector("list",ncol(d)) # one per attribute
L[[1]]=c("1","2","3","4","5")
L[[2]]=c("b","c")
L[[4]]=c("1","2","3") # different on purpose
d3=delevels(d,levels=L,label="other")
print(summary(d3))

## End(Not run) # end dontrun

Fit a supervised data mining model (classification or regression) model

Description

Fit a supervised data mining model (classification or regression) model. Wrapper function that allows to fit distinct data mining (16 classification and 18 regression) methods under the same coherent function structure. Also, it tunes the hyperparameters of the models (e.g., kknn, mlpe and ksvm) and performs some feature selection methods.

Usage

fit(x, data = NULL, model = "default", task = "default", 
    search = "heuristic", mpar = NULL, feature = "none", 
    scale = "default", transform = "none", 
    created = NULL, fdebug = FALSE, ...)
fit(x, data = NULL, model = "default", task = "default", 
    search = "heuristic", mpar = NULL, feature = "none", 
    scale = "default", transform = "none", 
    created = NULL, fdebug = FALSE, ...)

Arguments

`x`	a symbolic description (formula) of the model to be fit. If `data=NULL` it is assumed that `x` contains a formula expression with known variables (see first example below).
`data`	an optional data frame (columns denote attributes, rows show examples) containing the training data, when using a formula.
`model`	Typically this should be a character object with the model type name (data mining method, as explained in valid character options). First usage: individual fit. Valid character options are the typical R base learning functions (individual models), namely one of: `naive` most common class (classification) or mean output value (regression) `ctree` – conditional inference tree (classification and regression, uses `ctree`) `cv.glmnet` – generalized linear model (GLM) with lasso or elasticnet regularization (classification and regression, uses `cv.glmnet`; note: cross-validation is used to automatically set the lambda parameter that is needed to compute the predictions) `rpart` or `dt` – decision tree (classification and regression, uses `rpart`) `kknn` or `knn` – k-nearest neighbor (classification and regression, uses `kknn`) `ksvm` or `svm` – support vector machine (classification and regression, uses `ksvm`) `lssvm` – least squares support vector machine (pure classification only, uses `lssvm`) `mlp` – multilayer perceptron with one hidden layer (classification and regression, uses `nnet` (in this version, for both `mlp` and `mlpe`, the maximum number of weights was increased and fixed to `MaxNWts=10000`)) `mlpe` – multilayer perceptron ensemble (classification and regression, uses `nnet`) `randomForest` or `randomforest` – random forest algorithm (classification and regression, uses `randomForest`) `xgboost` – eXtreme Gradient Boosting (Tree) (classification and regression, uses `xgboost`; note: `nrounds` parameter is set by default to 2) `bagging` – bagging from Breiman, 1996 (classification, uses `bagging`) `boosting` – adaboost.M1 method from Freund and Schapire, 1996 (classification, uses `boosting`) `lda` – linear discriminant analysis (classification, uses `lda`) `multinom` or `lr` – logistic regression (classification, uses `multinom`) `naiveBayes` or `naivebayes` – naive bayes (classification, uses `naiveBayes`) `qda` – quadratic discriminant analysis (classification, uses `qda`) `cubist` – M5 rule-based model (regression, uses `cubist`) `lm` – standard multiple/linear regression (uses `lm`) `mr` – multiple regression (regression, equivalent to `lm` but uses `nnet` with zero hidden nodes and linear output function) `mars` – multivariate adaptive regression splines (regression, uses `mars`) `pcr` – principal component regression (regression, uses `pcr`) `plsr` – partial least squares regression (regression, uses `plsr`) `cppls` – canonical powered partial least squares (regression, uses `cppls`) `rvm` – relevance vector machine (regression, uses `rvm`) Second usage: multiple models. `model` can be used to perform Automated Machine Learning (AutoML) or ensembles of several individual models: `auto` – first, the best model is automatically set by searching all models defined in `search` and selecting the one with the best “validation” metric on a validation set (depending on the method defined in `search`); then, the selected best model is fit to all training data. When `auto` is used, a ranked leaderboard of the models (and their selected hyperparameters) is returned as a new `$LB` field of the `@mpar` returned slot (e.g., try: `print(M@mpar$LB)`, where `M` is an object returned by `fit`). `AE`, `WE` or `SE` – all individual models are first fit to the data; then an ensemble is built by: `AE` – Average Ensemble, majority (if `task=="class"`) or average of the predictions; `WE`) – Weighted Ensemble, similar to `AE` but each prediction is weighted according to the validation metric (for `task=="class"` it is equal to `AE`); `SE` – Stacking Ensemble, applies a second-level GLM to weight the individual predictions. For any ensemble, when an individual model produces an error then it is excluded from the ensemble. After excluding invalid models, if there is just a single model then such model is returned (and no ensemble is produced). Third usage: `model` can be a `list` with 2 possibilities of fields A) and B). A) if you have your one fit function, then you can embed it using: `$fit` – a fit function that accepts the arguments `x`, `data` and `...`, the goal is to accept here any R classification or regression model, mainly for its use within the `mining` or `Importance` functions, or to use a hyperparameter search (via `search`). `$predict` – a predict function that accepts the arguments `object`, `newdata`, this function should behave as any rminer prediction, i.e., return: a factor when `task=="class"`; a matrix with Probabilities x Instances when `task=="prob"`; and a vector when `task=="reg"`. `$name` – optional field with the name of the method. B) automatically produced by some ensemble methods, for the sake of documentation the fields for the ensembles ("AE", "WE" or "SE") are listed here: `$m` – a vector character with the fit object model names. `$f` – a vector list with several fit objects. `$w` – a vector with the "weighting" of the individual models. Note: current rminer version emphasizes the use of native fitting functions from their respective packages, since these functions contain several specific hyperparameters that can now be searched or set using the `search` or `...` arguments. For compatibility with previous rminer versions, older `model` options are kept.
`task`	data mining task. Valid options are: `prob` (or `p`) – classification with output probabilities (i.e. the sum of all outputs equals 1). `class` (or `c`) – classification with discrete outputs (`factor`) `reg` (or `r`) – regression (numeric output) `default` tries to guess the best task (`prob` or `reg`) given the `model` and output variable type (if factor then `prob` else `reg`)
`search`	used to tune hyperparameter(s) of the model, such as: `kknn` – number of neighbors (k); `mlp` or `mlpe` – number of hidden nodes (size) or decay; `ksvm` – gaussian kernel parameter (sigma); `randomForest` – `mtry` parameter). This is a very flexible argument that can be used under several options: simpler use, complex tuning of an individual model or multiple models. The simpler use is kept for compatibility issues but it is advised to define this argument via the easier `mparheuristic` function. Valid options for a simpler character type `search` use: `heuristic` – simple heuristic, one search parameter (e.g., size=inputs/2 for `mlp` or size=10 if classification and inputs/2>10, sigma is set using `kpar="automatic"` and `kernel="rbfdot"` of `ksvm`). Important Note: instead of the "heuristic" options, it is advisable to use the explicit `mparheuristic` function that is designed for a wider option of models (all "heuristic" options were kept due to compatibility issues and work only for: `kknn`; `mlp` or `mlpe`; `ksvm`, with `kernel="rbfdot"`; and `randomForest`). `heuristic5` – heuristic with a 5 range grid-search (e.g., `seq(1,9,2)` for `kknn`, `seq(0,8,2)` for `mlp` or `mlpe`, `2^seq(-15,3,4)` for `ksvm`, `1:5` for `randomRorest`) `heuristic10` – heuristic with a 10 range grid-search (e.g., `seq(1,10,1)` for `kknn`, `seq(0,9,1)` for `mlp` or `mlpe`, `2^seq(-15,3,2)` for `ksvm`, `1:10` for `randomRorest`) `UD`, `UD1` or `UD2` – uniform design 2-Level with 13 (`UD` or `UD2`) or 21 (`UD1`) searches (only works for `ksvm` and `kernel="rbfdot"`). Another simpler use of the `search` argument is a numeric type: a-`vector` – numeric vector with all hyperparameter values that will be searched within an internal grid-search (the number of searches is `length(search)` when `convex=0`) A more complex but advised use of `search` is to use a `list`. Non expert users should create this list via the `mparheuristic` function, which is very easy to use. Nevertheless, the fields of the list for a single fit (individual model) are shown here: `$smethod` – type of search method. Valid options are: `none` – no search is executed, one single fit is performed. `matrix` – matrix search (tests only n searches, all search parameters are of size n). `grid` – normal grid search (tests all combinations of search parameters). `2L` - nested 2-Level grid search. First level range is set by `$search` and then the 2nd level performs a fine tuning, with `length($search)` searches around (original range/2) best value in first level (2nd level is only performed on numeric searches). `UD`, `UD1` or `UD2` – uniform design 2-Level with 13 (`UD` or `UD2`) or 21 (`UD1`) searches (note: only works for `model="ksvm"` and `kernel="rbfdot"`). Under this option, `$search` should contain the first level ranges, such as `c(-15,3,-5,15)` for classification (gamma min and max, C min and max, after which a `2^` transform is applied) or `c(-8,0,-1,6,-8,-1)` for regression (last two values are epsilon min and max, after which a `2^` transform is applied). `$search` – a-`list` with all hyperparameter values to be searched or character with previous described options (e.g., "heuristic", "heuristic5", "UD"). If a character, then `$smethod` equal to `"none"` or `"grid"` or `"UD"` is automatically assumed. `$convex` – number that defines how many searches are performed after a local minimum/maximum is found (if >0, the search can be stopped without testing all grid-search values) `$method` – type of internal (validation) estimation method used during the search (see `method` argument of `mining` for details) `$metric` – used to compute a metric value during internal estimation. Can be a single character such as `"SAD"` or a list with all the arguments used by the `mmetric` function except `y` and `x`, such as: `search$metric=list(metric="AUC",TC=3,D=0.7)`. See `mmetric` for more details. A more sophisticated definition of `search` involves the tuning of several models (used by the `model=` `auto`, `AE`, `WE` or `SE`). Again, this sophisticated definition should be automatically set using the `mparheuristic` function. The list of fields for the multiple tuning mode are: `$models` - a vector character with LM individual `model` values. This field can also include ensembles (`"AE"`, `"WE"`, `"SE"`) provided they appear at the end of this vector. They will work if more than one valid individual model is included. `$ls` - a vector list with LM search values (for each individual model, the values are the same as in individual search `$search` field). `$smethod` - must have the `auto` value. `$smethod` - must have the `auto` value. `$method` - internal (validation) estimation method (equal to the individual search `$method` field). `$metric` - internal (validation) estimation metric (equal to the individual search `$metric` field). `$convex` - equal to the individual search `$convex` field. Note: the `mpar` argument only appears due to compatibility issues. If used, then the `mpar` values are automatically fed into search. However, a direct use of the `search` argument is advised instead of `mpar`, since `search` is more flexible and powerful.
`mpar`	(important note: this argument only is kept in this version due to compatibility with previous rminer versions. Instead of `mpar`, you should use the more flexible and powerful `search` argument.) vector with extra default (fixed) model parameters (used for modeling, search and feature selection) with: c(vmethod,vpar,`metric`) – generic use of mpar (including most models); c(C,epsilon,vmethod,vpar,`metric`) – if `ksvm` and C and epsilon are explicitly set; c(nr,maxit,vmethod,vpar,`metric`) – if `mlp` or `mlpe` and nr and maxit are explicitly set; C and epsilon are default values for `svm` (if any of these is `=NA` then heuristics are used to set the value). nr is the number of `mlp` runs or `mlpe` individual models, while maxit is the maximum number of epochs (if any of these is `=NA` then heuristics are used to set the value). For help on vmethod and vpar see `mining`. `metric` is the internal error function (e.g., used by search to select the best model), valid options are explained in `mmetric`. When `mpar=NULL` then default values are used. If there are `NA` values (e.g., `mpar=c(NA,NA)`) then default values are used.
`feature`	feature selection and sensitivity analysis control. Valid `fit` function options are: `none` – no feature selection; a fmethod character value, such as `sabs` (see below); a-`vector` – vector with c(fmethod,deletions,Runs,vmethod,vpar,defaultsearch) a-`vector` – vector with c(fmethod,deletions,Runs,vmethod,vpar) fmethod sets the type. Valid options are: `sbs` – standard backward selection; `sabs` – sensitivity analysis backward selection (faster); `sabsv` – equal to `sabs` but uses variance for sensitivity importance measure; `sabsr` – equal to `sabs` but uses range for sensitivity importance measure; `sabsg` – equal to `sabs` (uses gradient for sensitivity importance measure); deletions is the maximum number of feature deletions (if -1 not used). Runs is the number of runs for each feature set evaluation (e.g., 1). For help on vmethod and vpar see `mining`. defaultsearch is one hyperparameter used during the feature selection search, after selecting the best feature set then `search` is used (faster). If not defined, then `search` is used during feature selection (may be slow). When feature is a vector then default values are used to fill missing values or `NA` values. Note: feature selection capabilities are expected to be enhanced in next rminer versions.
`scale`	if data needs to be scaled (i.e. for `mlp` or `mlpe`). Valid options are: `default` – uses scaling when needed (i.e. for `mlp` or `mlpe`) `none` – no scaling; `inputs` – standardizes (0 mean, 1 st. deviation) input attributes; `all` – standardizes (0 mean, 1 st. deviation) input and output attributes; If needed, the `predict` function of rminer performs the inverse scaling.
`transform`	if the output data needs to be transformed (e.g., `log` transform). Valid options are: `none` – no transform; `log` – y=(log(y+1)) (the inverse function is applied in the `predict` function); `positive` – all predictions are positive (negative values are turned into zero); `logpositive` – both `log` and `logpositive`;
`created`	time stamp for the model. By default, the system time is used. Else, you can specify another time.
`fdebug`	if TRUE show some search details.
`...`	additional and specific parameters send to each fit function model (e.g., `dt`, `rpart`, `randomforest`, `kernlab`). A few examples: – the `rpart` function is used for decision trees, thus you can have: `control=rpart.control(cp=.05)` (see `crossvaldata` example). – the `ksvm` function is used for support vector machines, thus you can change the kernel type: `kernel="polydot"` (see examples below). Important note: if you use package functions and get an error, then try to explicitly define the package. For instance, you might need to use `fit(`several-arguments`,control=Cubist::cubistControl())` instead of `fit(`several-arguments`,control=cubistControl())`.

Details

Fits a classification or regression model given a data.frame (see [Cortez, 2010] for more details). The ... optional arguments should be used to fix values used by specific model functions (see examples). Notes:
- if there is an error in the fit, then a warning is issued (see example).
- the new search argument is very flexible and allows a powerful design of supervised learning models.
- the search correct use is very dependent on the R learning base functions. For example, if you are tuning model="rpart" then read carefully the help of function rpart.
- mpar argument is only kept due to compatibility issues and should be avoided; instead, use the more flexible search.

Details about some models:

Neural Network: mlp trains nr multilayer perceptrons (with maxit epochs, size hidden nodes and decay value according to the nnet function) and selects the best network according to minimum penalized error ($value). mlpe uses an ensemble of nr networks and the final prediction is given by the average of all outputs. To tune mlp or mlpe you can use the search parameter, which performs a grid search for size or decay.
Support Vector Machine: svm adopts by default the gaussian (rbfdot) kernel. For classification tasks, you can use search to tune sigma (gaussian kernel parameter) and C (complexity parameter). For regression, the epsilon insensitive function is adopted and there is an additional hyperparameter epsilon.
Other methods: Random Forest – if needed, you can tune several parameters, including the default mtry parameter adopted by search heuristics; k-nearest neighbor – search by default tunes k. The remaining models can also be tunned but a full definition of search is required (e.g., with $smethod, $search and other fields); please check mparheuristic function for further tuning examples (e.g., rpart).

Value

Returns a model object. You can check all model elements with str(M), where M is a model object. The slots are:

@formula – the x;
@model – the model;
@task – the task;
@mpar – data.frame with the best model parameters (interpretation depends on model);
@attributes – the attributes used by the model;
@scale – the scale;
@transform – the transform;
@created – the date when the model was created;
@time – computation effort to fit the model;
@object – the R object model (e.g., rpart, nnet, ...);
@outindex – the output index (of @attributes);
@levels – if task=="prob"||task=="class" stores the output levels;
@error – similarly to mining this is the "validation" error for some search options;

Note

Author(s)

Paulo Cortez https://pcortez.dsi.uminho.pt

References

To check for more details about rminer and for citation purposes:
P. Cortez.
Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool.
In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1.
@Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44
http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210
For the grid search and other optimization methods:
P. Cortez.
Modern Optimization with R.
Use R! series, Springer, 2nd edition, July 2021, ISBN 978-3-030-72818-2.
https://link.springer.com/book/10.1007/978-3-030-72819-9
The automl is inspired in this work:
L. Ferreira, A. Pilastri, C. Martins, P. Santos, P. Cortez.
An Automated and Distributed Machine Learning Framework for Telecommunications Risk Management. In J. van den Herik et al. (Eds.), Proceedings of 12th International Conference on Agents and Artificial Intelligence – ICAART 2020, Volume 2, pp. 99-107, Valletta, Malta, February, 2020, SCITEPRESS, ISBN 978-989-758-395-7.
@INSTICC: http://hdl.handle.net/1822/66818
For the sabs feature selection:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
doi:10.1016/j.dss.2009.05.016
For the uniform design details:
C.M. Huang, Y.J. Lee, D.K.J. Lin and S.Y. Huang.
Model selection for support vector machines via uniform design,
In Computational Statistics & Data Analysis, 52(1):335-346, 2007.

Examples

### dontrun is used when the execution of the example requires some computational effort.

### simple regression (with a formula) example.
x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
M=fit(y~x1+x2,model="mlpe")
new1=rnorm(100,100,20); new2=rnorm(100,100,20)
ynew=0.7*sin(new1/(25*pi))+0.3*sin(new2/(25*pi))
P=predict(M,data.frame(x1=new1,x2=new2,y=rep(NA,100)))
print(mmetric(ynew,P,"MAE"))

### simple classification example.
## Not run: 
data(iris)
M=fit(Species~.,iris,model="rpart")
plot(M@object); text(M@object) # show model
P=predict(M,iris)
print(mmetric(iris$Species,P,"CONF"))
print(mmetric(iris$Species,P,"ALL"))
mgraph(iris$Species,P,graph="ROC",TC=2,main="versicolor ROC",
baseline=TRUE,leg="Versicolor",Grid=10)

M2=fit(Species~.,iris,model="ctree")
plot(M2@object) # show model
P2=predict(M2,iris)
print(mmetric(iris$Species,P2,"CONF"))

# ctree with different setup:
# (ctree_control is from the party package)
M3=fit(Species~.,iris,model="ctree",controls = party::ctree_control(testtype="MonteCarlo"))
plot(M3@object) # show model

## End(Not run)

### simple binary classification example with cv.glmnet and xgboost
## Not run: 
data(sa_ssin_2)
H=holdout(sa_ssin_2$y,ratio=2/3)
# cv.glmnet:
M=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet",task="cla") # pure classes
P=predict(M,sa_ssin_2[H$ts,])
cat("1st prediction, class:",as.character(P[1]),"\n")
cat("Confusion matrix:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P,"CONF")$conf)

M2=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet") # probabilities
P2=predict(M2,sa_ssin_2[H$ts,])
L=M2@levels
cat("1st prediction, prob:",L[1],"=",P2[1,1],",",L[2],"=",P2[1,2],"\n")
cat("Confusion matrix:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P2,"CONF")$conf)
cat("AUC of ROC curve:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P2,"AUC"))

M3=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet",nfolds=3) # use 3 folds instead of 10
plot(M3@object) # show cv.glmnet object
P3=predict(M3,sa_ssin_2[H$ts,])

# xgboost:
M4=fit(y~.,sa_ssin_2[H$tr,],model="xgboost",verbose=1) # nrounds=2, show rounds:
P4=predict(M4,sa_ssin_2[H$ts,])
print(mmetric(sa_ssin_2[H$ts,]$y,P4,"AUC"))
M5=fit(y~.,sa_ssin_2[H$tr,],model="xgboost",nrounds=3,verbose=1) # nrounds=3, show rounds:
P5=predict(M5,sa_ssin_2[H$ts,])
print(mmetric(sa_ssin_2[H$ts,]$y,P5,"AUC"))

## End(Not run)

### classification example with discrete classes, probabilities and holdout
## Not run: 
data(iris)
H=holdout(iris$Species,ratio=2/3)
M=fit(Species~.,iris[H$tr,],model="ksvm",task="class")
M1=fit(Species~.,iris[H$tr,],model="lssvm") # default task="class" is assumed
M2=fit(Species~.,iris[H$tr,],model="ksvm",task="prob")
P=predict(M,iris[H$ts,]) # classes
P1=predict(M1,iris[H$ts,]) # classes
P2=predict(M2,iris[H$ts,]) # probabilities
print(mmetric(iris$Species[H$ts],P,"CONF"))
print(mmetric(iris$Species[H$ts],P1,"CONF"))
print(mmetric(iris$Species[H$ts],P2,"CONF"))
print(mmetric(iris$Species[H$ts],P,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"AUC"))

### exploration of some rminer classification models:
models=c("lda","naiveBayes","kknn","randomForest","cv.glmnet","xgboost")
for(m in models)
 { cat("model:",m,"\n") 
   M=fit(Species~.,iris[H$tr,],model=m)
   P=predict(M,iris[H$ts,])
   print(mmetric(iris$Species[H$ts],P,"AUC")[[1]])
 }

## End(Not run)

### classification example with hyperparameter selection 
###    note: for regression, similar code can be used
### SVM 
## Not run: 
data(iris)
# large list of SVM configurations:
# SVM with kpar="automatic" sigma rbfdot kernel estimation and default C=1:
#  note: each execution can lead to different M@mpar due to sigest stochastic nature:
M=fit(Species~.,iris,model="ksvm")
print(M@mpar) # model hyperparameters/arguments
# same thing, explicit use of mparheuristic:
M=fit(Species~.,iris,model="ksvm",search=list(search=mparheuristic("ksvm")))
print(M@mpar) # model hyperparameters

# SVM with C=3, sigma=2^-7
M=fit(Species~.,iris,model="ksvm",C=3,kpar=list(sigma=2^-7))
print(M@mpar)
# SVM with different kernels:
M=fit(Species~.,iris,model="ksvm",kernel="polydot",kpar="automatic") 
print(M@mpar)
# fit already has a scale argument, thus the only way to fix scale of "tanhdot"
# is to use the special search argument with the "none" method:
s=list(smethod="none",search=list(scale=2,offset=2))
M=fit(Species~.,iris,model="ksvm",kernel="tanhdot",search=s) 
print(M@mpar)
# heuristic: 10 grid search values for sigma, rbfdot kernel (fdebug is used only for more verbose):
s=list(search=mparheuristic("ksvm",10)) # advised "heuristic10" usage
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# same thing, uses older search="heuristic10"
M=fit(Species~.,iris,model="ksvm",search="heuristic10",fdebug=TRUE)
print(M@mpar)
# identical search under a different and explicit code:
s=list(search=2^seq(-15,3,2))
M=fit(Species~.,iris,model="ksvm",search=2^seq(-15,3,2),fdebug=TRUE)
print(M@mpar)

# uniform design "UD" for sigma and C, rbfdot kernel, two level of grid searches, 
# under exponential (2^x) search scale:
M=fit(Species~.,iris,model="ksvm",search="UD",fdebug=TRUE)
print(M@mpar)
M=fit(Species~.,iris,model="ksvm",search="UD1",fdebug=TRUE)
print(M@mpar)
# now the more powerful search argument is used for modeling SVM:
# grid 3 x 3 search:
s=list(smethod="grid",search=list(sigma=2^c(-15,-5,3),C=2^c(-5,0,15)),convex=0,
            metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# identical search with different argument smethod="matrix" 
s$smethod="matrix"
s$search=list(sigma=rep(2^c(-15,-5,3),times=3),C=rep(2^c(-5,0,15),each=3))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# search for best kernel (only works for kpar="automatic"):
s=list(smethod="grid",search=list(kernel=c("rbfdot","laplacedot","polydot","vanilladot")),
       convex=0,metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# search for best parameters of "rbfdot" or "laplacedot" (which use same kpar):
s$search=list(kernel=c("rbfdot","laplacedot"),sigma=2^seq(-15,3,5))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)

### randomForest
# search for mtry and ntree
s=list(smethod="grid",search=list(mtry=c(1,2,3),ntree=c(100,200,500)),
            convex=0,metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="randomForest",search=s,fdebug=TRUE)
print(M@mpar)

### rpart
# simpler way to tune cp in 0.01 to 0.9 (10 searches):
s=list(search=mparheuristic("rpart",n=10,lower=0.01,upper=0.9),method=c("kfold",3,12345))
M=fit(Species~.,iris,model="rpart",search=s,fdebug=TRUE)
print(M@mpar)

# same thing but with more lines of code
# note: this code can be adapted to tune other rpart parameters,
#       while mparheuristic only tunes cp
# a vector list needs to be used for the search$search parameter
lcp=vector("list",10) # 10 grid values for the complexity cp
names(lcp)=rep("cp",10) # same cp name 
scp=seq(0.01,0.9,length.out=10) # 10 values from 0.01 to 0.18
for(i in 1:10) lcp[[i]]=scp[i] # cycle needed due to [[]] notation
s=list(smethod="grid",search=list(control=lcp),
            convex=0,metric="AUC",method=c("kfold",3,12345))
M=fit(Species~.,iris,model="rpart",search=s,fdebug=TRUE)
print(M@mpar)

### ctree 
# simpler way to tune mincriterion in 0.1 to 0.98 (9 searches):
mint=c("kfold",3,123) # internal validation method
s=list(search=mparheuristic("ctree",n=8,lower=0.1,upper=0.99),method=mint)
M=fit(Species~.,iris,model="ctree",search=s,fdebug=TRUE)
print(M@mpar)
# same thing but with more lines of code
# note: this code can be adapted to tune other ctree parameters,
#       while mparheuristic only tunes mincriterion
# a vector list needs to be used for the search$search parameter
lmc=vector("list",9) # 9 grid values for the mincriterion
smc=seq(0.1,0.99,length.out=9)
for(i in 1:9) lmc[[i]]=party::ctree_control(mincriterion=smc[i]) 
s=list(smethod="grid",search=list(controls=lmc),method=mint,convex=0)
M=fit(Species~.,iris,model="ctree",search=s,fdebug=TRUE)
print(M@mpar)

### some MLP fitting examples:
# simplest use:
M=fit(Species~.,iris,model="mlpe")  
print(M@mpar)
# same thing, with explicit use of mparheuristic:
M=fit(Species~.,iris,model="mlpe",search=list(search=mparheuristic("mlpe")))
print(M@mpar) # hidden nodes and number of ensemble mlps
# setting some nnet parameters:
M=fit(Species~.,iris,model="mlpe",size=3,decay=0.1,maxit=100,rang=0.9) 
print(M@mpar) # mlpe hyperparameters
# MLP, 5 grid search fdebug is only used to put some verbose in the console:
s=list(search=mparheuristic("mlpe",n=5)) # 5 searches for size
print(s) # show search
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# previous searches used a random holdout (seed=NULL), now a fixed seed (123) is used:
s=list(smethod="grid",search=mparheuristic("mlpe",n=5),convex=0,metric="AUC",
            method=c("holdout",2/3,123))
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# faster and greedy grid search:
s$convex=1;s$search=list(size=0:9)
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# 2 level grid with total of 5 searches 
#  note of caution: some "2L" ranges may lead to non integer (e.g., 1.3) values at
#  the 2nd level search. And some R functions crash if non integer values are used for
#  integer parameters.
s$smethod="2L";s$convex=0;s$search=list(size=c(4,8,12))
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)

# testing of all 17 rminer classification methods:
model=c("naive","ctree","cv.glmnet","rpart","kknn","ksvm","lssvm","mlp","mlpe",
 "randomForest","xgboost","bagging","boosting","lda","multinom","naiveBayes","qda")
inputs=ncol(iris)-1
ho=holdout(iris$Species,2/3,seed=123) # 2/3 for training and 1/3 for testing
Y=iris[ho$ts,ncol(iris)]
for(i in 1:length(model))
 {
  cat("i:",i,"model:",model[i],"\n")
  search=list(search=mparheuristic(model[i])) # rminer default values
  M=fit(Species~.,data=iris[ho$tr,],model=model[i],search=search,fdebug=TRUE)
  P=predict(M,iris[ho$ts,])
  cat("predicted ACC:",round(mmetric(Y,P,metric="ACC"),1),"\n")
 }


## End(Not run)

### example of an error (warning) generated using fit:
## Not run: 
data(iris)
# size needs to be a positive integer, thus 0.1 leads to an error:
M=fit(Species~.,iris,model="mlp",size=0.1)  
print(M@object)

## End(Not run)

### exploration of some rminer regression models:
## Not run: 
data(sa_ssin)
H=holdout(sa_ssin$y,ratio=2/3,seed=12345)
models=c("lm","mr","ctree","mars","cubist","cv.glmnet","xgboost","rvm")
for(m in models)
 { cat("model:",m,"\n") 
   M=fit(y~.,sa_ssin[H$tr,],model=m)
   P=predict(M,sa_ssin[H$ts,])
   print(mmetric(sa_ssin$y[H$ts],P,"MAE"))
 }

## End(Not run)

# testing of all 18 rminer regression methods:
## Not run: 
model=c("naive","ctree","cv.glmnet","rpart","kknn","ksvm","mlp","mlpe",
 "randomForest","xgboost","cubist","lm","mr","mars","pcr","plsr","cppls","rvm")
# note: in this example, default values are considered for the hyperparameters.
# better results can be achieved by tuning hyperparameters via improved usage
# of the search argument (via mparheuristic function or written code)
data(iris)
ir2=iris[,1:4] # predict iris "Petal.Width"
names(ir2)[ncol(ir2)]="y" # change output name
inputs=ncol(ir2)-1
ho=holdout(ir2$y,2/3,seed=123) # 2/3 for training and 1/3 for testing
Y=ir2[ho$ts,ncol(ir2)]
for(i in 1:length(model))
 {
  cat("i:",i,"model:",model[i],"\n")
  search=list(search=mparheuristic(model[i])) # rminer default values
  M=fit(y~.,data=ir2[ho$tr,],model=model[i],search=search,fdebug=TRUE)
  P=predict(M,ir2[ho$ts,])
  cat("predicted MAE:",round(mmetric(Y,P,metric="MAE"),1),"\n")
 }

## End(Not run)

### regression example with hyperparameter selection:
## Not run: 
data(sa_ssin)
# some SVM experiments:
# default SVM:
M=fit(y~.,data=sa_ssin,model="svm")
print(M@mpar)
# SVM with (Cherkassy and Ma, 2004) heuristics to set C and epsilon:
M=fit(y~.,data=sa_ssin,model="svm",C=NA,epsilon=NA)
print(M@mpar)
# SVM with Uniform Design set sigma, C and epsilon:
M=fit(y~.,data=sa_ssin,model="ksvm",search="UD",fdebug=TRUE)
print(M@mpar)

# sensitivity analysis feature selection
M=fit(y~.,data=sa_ssin,model="ksvm",search=list(search=mparheuristic("ksvm",n=5)),feature="sabs") 
print(M@mpar)
print(M@attributes) # selected attributes (1, 2 and 3 are the relevant inputs)

# example that shows how transform works:
M=fit(y~.,data=sa_ssin,model="mr") # linear regression
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,x4=0,y=NA)) # P should be negative
print(P)
M=fit(y~.,data=sa_ssin,model="mr",transform="positive")
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,x4=0,y=NA)) # P is not negative
print(P)

## End(Not run)

### pure classification example with a generic R (not rminer default) model ###
## Not run: 
### nnet is adopted here but virtually ANY fitting function/package could be used:

# since the default nnet prediction is to provide probabilities, there is
# a need to create this "wrapping" function:
predictprob=function(object,newdata)
{ predict(object,newdata,type="class") }
# list with a fit and predict function:
# nnet::nnet (package::function)
model=list(fit=nnet::nnet,predict=predictprob,name="nnet")
data(iris)
# note that size is not a fit parameter and it is sent directly to nnet:
M=fit(Species~.,iris,model=model,size=3,task="class") 
P=predict(M,iris)
print(P)

## End(Not run) 

### multiple models: automl and ensembles 
## Not run: 
data(iris)
d=iris
names(d)[ncol(d)]="y" # change output name
inputs=ncol(d)-1
metric="AUC"

# consult the help of mparheuristic for more automl and ensemble examples:
#
# automatic machine learining (automl) with 5 distinct models and "SE" ensemble.
# the single models are tuned with 10 internal hyperparameter searches, 
# except ksvm that uses 13 searches via "UD".
# fit performs an internal validation 
sm=mparheuristic(model="automl3",n=NA,task="prob", inputs= inputs )
method=c("kfold",3,123)
search=list(search=sm,smethod="auto",method=method,metric=metric,convex=0)
M=fit(y~.,data=d,model="auto",search=search,fdebug=TRUE)
P=predict(M,d)
# show leaderboard:
cat("> leaderboard models:",M@mpar$LB$model,"\n")
cat(">  validation values:",round(M@mpar$LB$eval,4),"\n")
cat("best model is:",M@model,"\n")
cat(metric,"=",round(mmetric(d$y,P,metric=metric),2),"\n")


# average ensemble of 5 distinct models
# the single models are tuned with 1 (heuristic) hyperparameter search 
sm2=mparheuristic(model="automl",n=NA,task="prob", inputs= inputs )
method=c("kfold",3,123)
search2=list(search=sm2,smethod="auto",method=method,metric=metric,convex=0)
M2=fit(y~.,data=d,model="AE",search=search2,fdebug=TRUE)
P2=predict(M,d)
cat("best model is:",M2@model,"\n")
cat(metric,"=",round(mmetric(d$y,P2,metric=metric),2),"\n")

# example with an invalid model exclusion: 
# in this case, randomForest produces an error and warning
# thus it is excluded from the leaderboard
sm=mparheuristic(model="automl3",n=NA,task="prob", inputs= inputs )
method=c("holdout",2/3,123)
search=list(search=sm,smethod="auto",method=method,metric=metric,convex=0)
d2=d
#
d2[,2]=as.factor(1:150) # force randomForest error
M=fit(y~.,data=d2,model="auto",search=search,fdebug=TRUE)
P=predict(M,d2)
# show leaderboard:
cat("> leaderboard models:",M@mpar$LB$model,"\n")
cat(">  validation values:",round(M@mpar$LB$eval,4),"\n")
cat("best model is:",M@model,"\n")
cat(metric,"=",round(mmetric(d$y,P,metric=metric),2),"\n")


## End(Not run)

### dontrun is used when the execution of the example requires some computational effort.

### simple regression (with a formula) example.
x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
M=fit(y~x1+x2,model="mlpe")
new1=rnorm(100,100,20); new2=rnorm(100,100,20)
ynew=0.7*sin(new1/(25*pi))+0.3*sin(new2/(25*pi))
P=predict(M,data.frame(x1=new1,x2=new2,y=rep(NA,100)))
print(mmetric(ynew,P,"MAE"))

### simple classification example.
## Not run: 
data(iris)
M=fit(Species~.,iris,model="rpart")
plot(M@object); text(M@object) # show model
P=predict(M,iris)
print(mmetric(iris$Species,P,"CONF"))
print(mmetric(iris$Species,P,"ALL"))
mgraph(iris$Species,P,graph="ROC",TC=2,main="versicolor ROC",
baseline=TRUE,leg="Versicolor",Grid=10)

M2=fit(Species~.,iris,model="ctree")
plot(M2@object) # show model
P2=predict(M2,iris)
print(mmetric(iris$Species,P2,"CONF"))

# ctree with different setup:
# (ctree_control is from the party package)
M3=fit(Species~.,iris,model="ctree",controls = party::ctree_control(testtype="MonteCarlo"))
plot(M3@object) # show model

## End(Not run)

### simple binary classification example with cv.glmnet and xgboost
## Not run: 
data(sa_ssin_2)
H=holdout(sa_ssin_2$y,ratio=2/3)
# cv.glmnet:
M=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet",task="cla") # pure classes
P=predict(M,sa_ssin_2[H$ts,])
cat("1st prediction, class:",as.character(P[1]),"\n")
cat("Confusion matrix:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P,"CONF")$conf)

M2=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet") # probabilities
P2=predict(M2,sa_ssin_2[H$ts,])
L=M2@levels
cat("1st prediction, prob:",L[1],"=",P2[1,1],",",L[2],"=",P2[1,2],"\n")
cat("Confusion matrix:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P2,"CONF")$conf)
cat("AUC of ROC curve:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P2,"AUC"))

M3=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet",nfolds=3) # use 3 folds instead of 10
plot(M3@object) # show cv.glmnet object
P3=predict(M3,sa_ssin_2[H$ts,])

# xgboost:
M4=fit(y~.,sa_ssin_2[H$tr,],model="xgboost",verbose=1) # nrounds=2, show rounds:
P4=predict(M4,sa_ssin_2[H$ts,])
print(mmetric(sa_ssin_2[H$ts,]$y,P4,"AUC"))
M5=fit(y~.,sa_ssin_2[H$tr,],model="xgboost",nrounds=3,verbose=1) # nrounds=3, show rounds:
P5=predict(M5,sa_ssin_2[H$ts,])
print(mmetric(sa_ssin_2[H$ts,]$y,P5,"AUC"))

## End(Not run)

### classification example with discrete classes, probabilities and holdout
## Not run: 
data(iris)
H=holdout(iris$Species,ratio=2/3)
M=fit(Species~.,iris[H$tr,],model="ksvm",task="class")
M1=fit(Species~.,iris[H$tr,],model="lssvm") # default task="class" is assumed
M2=fit(Species~.,iris[H$tr,],model="ksvm",task="prob")
P=predict(M,iris[H$ts,]) # classes
P1=predict(M1,iris[H$ts,]) # classes
P2=predict(M2,iris[H$ts,]) # probabilities
print(mmetric(iris$Species[H$ts],P,"CONF"))
print(mmetric(iris$Species[H$ts],P1,"CONF"))
print(mmetric(iris$Species[H$ts],P2,"CONF"))
print(mmetric(iris$Species[H$ts],P,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"AUC"))

### exploration of some rminer classification models:
models=c("lda","naiveBayes","kknn","randomForest","cv.glmnet","xgboost")
for(m in models)
 { cat("model:",m,"\n") 
   M=fit(Species~.,iris[H$tr,],model=m)
   P=predict(M,iris[H$ts,])
   print(mmetric(iris$Species[H$ts],P,"AUC")[[1]])
 }

## End(Not run)

### classification example with hyperparameter selection 
###    note: for regression, similar code can be used
### SVM 
## Not run: 
data(iris)
# large list of SVM configurations:
# SVM with kpar="automatic" sigma rbfdot kernel estimation and default C=1:
#  note: each execution can lead to different M@mpar due to sigest stochastic nature:
M=fit(Species~.,iris,model="ksvm")
print(M@mpar) # model hyperparameters/arguments
# same thing, explicit use of mparheuristic:
M=fit(Species~.,iris,model="ksvm",search=list(search=mparheuristic("ksvm")))
print(M@mpar) # model hyperparameters

# SVM with C=3, sigma=2^-7
M=fit(Species~.,iris,model="ksvm",C=3,kpar=list(sigma=2^-7))
print(M@mpar)
# SVM with different kernels:
M=fit(Species~.,iris,model="ksvm",kernel="polydot",kpar="automatic") 
print(M@mpar)
# fit already has a scale argument, thus the only way to fix scale of "tanhdot"
# is to use the special search argument with the "none" method:
s=list(smethod="none",search=list(scale=2,offset=2))
M=fit(Species~.,iris,model="ksvm",kernel="tanhdot",search=s) 
print(M@mpar)
# heuristic: 10 grid search values for sigma, rbfdot kernel (fdebug is used only for more verbose):
s=list(search=mparheuristic("ksvm",10)) # advised "heuristic10" usage
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# same thing, uses older search="heuristic10"
M=fit(Species~.,iris,model="ksvm",search="heuristic10",fdebug=TRUE)
print(M@mpar)
# identical search under a different and explicit code:
s=list(search=2^seq(-15,3,2))
M=fit(Species~.,iris,model="ksvm",search=2^seq(-15,3,2),fdebug=TRUE)
print(M@mpar)

# uniform design "UD" for sigma and C, rbfdot kernel, two level of grid searches, 
# under exponential (2^x) search scale:
M=fit(Species~.,iris,model="ksvm",search="UD",fdebug=TRUE)
print(M@mpar)
M=fit(Species~.,iris,model="ksvm",search="UD1",fdebug=TRUE)
print(M@mpar)
# now the more powerful search argument is used for modeling SVM:
# grid 3 x 3 search:
s=list(smethod="grid",search=list(sigma=2^c(-15,-5,3),C=2^c(-5,0,15)),convex=0,
            metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# identical search with different argument smethod="matrix" 
s$smethod="matrix"
s$search=list(sigma=rep(2^c(-15,-5,3),times=3),C=rep(2^c(-5,0,15),each=3))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# search for best kernel (only works for kpar="automatic"):
s=list(smethod="grid",search=list(kernel=c("rbfdot","laplacedot","polydot","vanilladot")),
       convex=0,metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# search for best parameters of "rbfdot" or "laplacedot" (which use same kpar):
s$search=list(kernel=c("rbfdot","laplacedot"),sigma=2^seq(-15,3,5))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)

### randomForest
# search for mtry and ntree
s=list(smethod="grid",search=list(mtry=c(1,2,3),ntree=c(100,200,500)),
            convex=0,metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="randomForest",search=s,fdebug=TRUE)
print(M@mpar)

### rpart
# simpler way to tune cp in 0.01 to 0.9 (10 searches):
s=list(search=mparheuristic("rpart",n=10,lower=0.01,upper=0.9),method=c("kfold",3,12345))
M=fit(Species~.,iris,model="rpart",search=s,fdebug=TRUE)
print(M@mpar)

# same thing but with more lines of code
# note: this code can be adapted to tune other rpart parameters,
#       while mparheuristic only tunes cp
# a vector list needs to be used for the search$search parameter
lcp=vector("list",10) # 10 grid values for the complexity cp
names(lcp)=rep("cp",10) # same cp name 
scp=seq(0.01,0.9,length.out=10) # 10 values from 0.01 to 0.18
for(i in 1:10) lcp[[i]]=scp[i] # cycle needed due to [[]] notation
s=list(smethod="grid",search=list(control=lcp),
            convex=0,metric="AUC",method=c("kfold",3,12345))
M=fit(Species~.,iris,model="rpart",search=s,fdebug=TRUE)
print(M@mpar)

### ctree 
# simpler way to tune mincriterion in 0.1 to 0.98 (9 searches):
mint=c("kfold",3,123) # internal validation method
s=list(search=mparheuristic("ctree",n=8,lower=0.1,upper=0.99),method=mint)
M=fit(Species~.,iris,model="ctree",search=s,fdebug=TRUE)
print(M@mpar)
# same thing but with more lines of code
# note: this code can be adapted to tune other ctree parameters,
#       while mparheuristic only tunes mincriterion
# a vector list needs to be used for the search$search parameter
lmc=vector("list",9) # 9 grid values for the mincriterion
smc=seq(0.1,0.99,length.out=9)
for(i in 1:9) lmc[[i]]=party::ctree_control(mincriterion=smc[i]) 
s=list(smethod="grid",search=list(controls=lmc),method=mint,convex=0)
M=fit(Species~.,iris,model="ctree",search=s,fdebug=TRUE)
print(M@mpar)

### some MLP fitting examples:
# simplest use:
M=fit(Species~.,iris,model="mlpe")  
print(M@mpar)
# same thing, with explicit use of mparheuristic:
M=fit(Species~.,iris,model="mlpe",search=list(search=mparheuristic("mlpe")))
print(M@mpar) # hidden nodes and number of ensemble mlps
# setting some nnet parameters:
M=fit(Species~.,iris,model="mlpe",size=3,decay=0.1,maxit=100,rang=0.9) 
print(M@mpar) # mlpe hyperparameters
# MLP, 5 grid search fdebug is only used to put some verbose in the console:
s=list(search=mparheuristic("mlpe",n=5)) # 5 searches for size
print(s) # show search
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# previous searches used a random holdout (seed=NULL), now a fixed seed (123) is used:
s=list(smethod="grid",search=mparheuristic("mlpe",n=5),convex=0,metric="AUC",
            method=c("holdout",2/3,123))
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# faster and greedy grid search:
s$convex=1;s$search=list(size=0:9)
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# 2 level grid with total of 5 searches 
#  note of caution: some "2L" ranges may lead to non integer (e.g., 1.3) values at
#  the 2nd level search. And some R functions crash if non integer values are used for
#  integer parameters.
s$smethod="2L";s$convex=0;s$search=list(size=c(4,8,12))
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)

# testing of all 17 rminer classification methods:
model=c("naive","ctree","cv.glmnet","rpart","kknn","ksvm","lssvm","mlp","mlpe",
 "randomForest","xgboost","bagging","boosting","lda","multinom","naiveBayes","qda")
inputs=ncol(iris)-1
ho=holdout(iris$Species,2/3,seed=123) # 2/3 for training and 1/3 for testing
Y=iris[ho$ts,ncol(iris)]
for(i in 1:length(model))
 {
  cat("i:",i,"model:",model[i],"\n")
  search=list(search=mparheuristic(model[i])) # rminer default values
  M=fit(Species~.,data=iris[ho$tr,],model=model[i],search=search,fdebug=TRUE)
  P=predict(M,iris[ho$ts,])
  cat("predicted ACC:",round(mmetric(Y,P,metric="ACC"),1),"\n")
 }


## End(Not run)

### example of an error (warning) generated using fit:
## Not run: 
data(iris)
# size needs to be a positive integer, thus 0.1 leads to an error:
M=fit(Species~.,iris,model="mlp",size=0.1)  
print(M@object)

## End(Not run)

### exploration of some rminer regression models:
## Not run: 
data(sa_ssin)
H=holdout(sa_ssin$y,ratio=2/3,seed=12345)
models=c("lm","mr","ctree","mars","cubist","cv.glmnet","xgboost","rvm")
for(m in models)
 { cat("model:",m,"\n") 
   M=fit(y~.,sa_ssin[H$tr,],model=m)
   P=predict(M,sa_ssin[H$ts,])
   print(mmetric(sa_ssin$y[H$ts],P,"MAE"))
 }

## End(Not run)

# testing of all 18 rminer regression methods:
## Not run: 
model=c("naive","ctree","cv.glmnet","rpart","kknn","ksvm","mlp","mlpe",
 "randomForest","xgboost","cubist","lm","mr","mars","pcr","plsr","cppls","rvm")
# note: in this example, default values are considered for the hyperparameters.
# better results can be achieved by tuning hyperparameters via improved usage
# of the search argument (via mparheuristic function or written code)
data(iris)
ir2=iris[,1:4] # predict iris "Petal.Width"
names(ir2)[ncol(ir2)]="y" # change output name
inputs=ncol(ir2)-1
ho=holdout(ir2$y,2/3,seed=123) # 2/3 for training and 1/3 for testing
Y=ir2[ho$ts,ncol(ir2)]
for(i in 1:length(model))
 {
  cat("i:",i,"model:",model[i],"\n")
  search=list(search=mparheuristic(model[i])) # rminer default values
  M=fit(y~.,data=ir2[ho$tr,],model=model[i],search=search,fdebug=TRUE)
  P=predict(M,ir2[ho$ts,])
  cat("predicted MAE:",round(mmetric(Y,P,metric="MAE"),1),"\n")
 }

## End(Not run)

### regression example with hyperparameter selection:
## Not run: 
data(sa_ssin)
# some SVM experiments:
# default SVM:
M=fit(y~.,data=sa_ssin,model="svm")
print(M@mpar)
# SVM with (Cherkassy and Ma, 2004) heuristics to set C and epsilon:
M=fit(y~.,data=sa_ssin,model="svm",C=NA,epsilon=NA)
print(M@mpar)
# SVM with Uniform Design set sigma, C and epsilon:
M=fit(y~.,data=sa_ssin,model="ksvm",search="UD",fdebug=TRUE)
print(M@mpar)

# sensitivity analysis feature selection
M=fit(y~.,data=sa_ssin,model="ksvm",search=list(search=mparheuristic("ksvm",n=5)),feature="sabs") 
print(M@mpar)
print(M@attributes) # selected attributes (1, 2 and 3 are the relevant inputs)

# example that shows how transform works:
M=fit(y~.,data=sa_ssin,model="mr") # linear regression
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,x4=0,y=NA)) # P should be negative
print(P)
M=fit(y~.,data=sa_ssin,model="mr",transform="positive")
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,x4=0,y=NA)) # P is not negative
print(P)

## End(Not run)

### pure classification example with a generic R (not rminer default) model ###
## Not run: 
### nnet is adopted here but virtually ANY fitting function/package could be used:

# since the default nnet prediction is to provide probabilities, there is
# a need to create this "wrapping" function:
predictprob=function(object,newdata)
{ predict(object,newdata,type="class") }
# list with a fit and predict function:
# nnet::nnet (package::function)
model=list(fit=nnet::nnet,predict=predictprob,name="nnet")
data(iris)
# note that size is not a fit parameter and it is sent directly to nnet:
M=fit(Species~.,iris,model=model,size=3,task="class") 
P=predict(M,iris)
print(P)

## End(Not run) 

### multiple models: automl and ensembles 
## Not run: 
data(iris)
d=iris
names(d)[ncol(d)]="y" # change output name
inputs=ncol(d)-1
metric="AUC"

# consult the help of mparheuristic for more automl and ensemble examples:
#
# automatic machine learining (automl) with 5 distinct models and "SE" ensemble.
# the single models are tuned with 10 internal hyperparameter searches, 
# except ksvm that uses 13 searches via "UD".
# fit performs an internal validation 
sm=mparheuristic(model="automl3",n=NA,task="prob", inputs= inputs )
method=c("kfold",3,123)
search=list(search=sm,smethod="auto",method=method,metric=metric,convex=0)
M=fit(y~.,data=d,model="auto",search=search,fdebug=TRUE)
P=predict(M,d)
# show leaderboard:
cat("> leaderboard models:",M@mpar$LB$model,"\n")
cat(">  validation values:",round(M@mpar$LB$eval,4),"\n")
cat("best model is:",M@model,"\n")
cat(metric,"=",round(mmetric(d$y,P,metric=metric),2),"\n")


# average ensemble of 5 distinct models
# the single models are tuned with 1 (heuristic) hyperparameter search 
sm2=mparheuristic(model="automl",n=NA,task="prob", inputs= inputs )
method=c("kfold",3,123)
search2=list(search=sm2,smethod="auto",method=method,metric=metric,convex=0)
M2=fit(y~.,data=d,model="AE",search=search2,fdebug=TRUE)
P2=predict(M,d)
cat("best model is:",M2@model,"\n")
cat(metric,"=",round(mmetric(d$y,P2,metric=metric),2),"\n")

# example with an invalid model exclusion: 
# in this case, randomForest produces an error and warning
# thus it is excluded from the leaderboard
sm=mparheuristic(model="automl3",n=NA,task="prob", inputs= inputs )
method=c("holdout",2/3,123)
search=list(search=sm,smethod="auto",method=method,metric=metric,convex=0)
d2=d
#
d2[,2]=as.factor(1:150) # force randomForest error
M=fit(y~.,data=d2,model="auto",search=search,fdebug=TRUE)
P=predict(M,d2)
# show leaderboard:
cat("> leaderboard models:",M@mpar$LB$model,"\n")
cat(">  validation values:",round(M@mpar$LB$eval,4),"\n")
cat("best model is:",M@model,"\n")
cat(metric,"=",round(mmetric(d$y,P,metric=metric),2),"\n")


## End(Not run)

Computes indexes for holdout data split into training and test sets.

Description

Computes indexes for holdout data split into training and test sets.

Usage

holdout(y, ratio = 2/3, internalsplit = FALSE, mode = "stratified", iter = 1, 
               seed = NULL, window=10, increment=1)
holdout(y, ratio = 2/3, internalsplit = FALSE, mode = "stratified", iter = 1, 
               seed = NULL, window=10, increment=1)

Arguments

`y`	desired target: numeric vector; or factor – then a stratified holdout is applied (i.e. the proportions of the classes are the same for each set).
`ratio`	split ratio (in percentage – sets the training set size; or in total number of examples – sets the test set size).
`internalsplit`	if `TRUE` then the training data is further split into training and validation sets. The same `ratio` parameter is used for the internal split.
`mode`	sampling mode. Options are: `stratified` – stratified randomized holdout if `y` is a factor; else it behaves as standard randomized holdout; `random` – standard randomized holdout; `order` – static mode, where the first examples are used for training and the later ones for testing (useful for time series data); `rolling` – rolling window, also known as sliding window (e.g. useful for stock market prediction), similar to `order` except that `window` is the window size, `iter` is the rolling iteration and `increment` is the number of samples slided at each iteration. In each iteration, the training set size is fixed to `window`, while the test set size is equal to `ratio` except for the last iteration (where it may be smaller). `incremental` – incremental retraining mode, also known as growing windows, similar to `order` except that `window` is the initial window size, `iter` is the incremental iteration and `increment` is the number of samples added at each iteration. In each iteration, the training set size grows (+increment), while the test set size is equal to `ratio` except for the last iteration (where it may be smaller).
`iter`	iteration of the incremental retraining mode (only used when `mode="rolling"` or `"incremental"`, typically `iter` is set within a cycle, see the example below).
`seed`	if `NULL` then no seed is used and the current R randomness is assumed; else a fixed seed is adopted to generate local random sample sequences, returning always the same result for the same seed (local means that it does not affect the state of other random number generations called after this function, see example).
`window`	training window size (if `mode="rolling"`) or initial training window size (if `mode="incremental"`).
`increment`	number of samples added to the training window at each iteration (if `mode="incremental"` or `mode="rolling"`).

Details

Computes indexes for holdout data split into training and test sets.

Value

A list with the components:

$tr – numeric vector with the training examples indexes;
$ts – numeric vector with the test examples indexes;
$itr – numeric vector with the internal training examples indexes;
$val – numeric vector with the internal validation examples indexes;

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

See fit.

Examples

### simple examples:
# preserves order, last two elements go into test set
H=holdout(1:10,ratio=2,internal=TRUE,mode="order")
print(H)
# no seed or NULL returns different splits:
H=holdout(1:10,ratio=2/3,mode="random")
print(H)
H=holdout(1:10,ratio=2/3,mode="random",seed=NULL)
print(H)
# same seed returns identical split:
H=holdout(1:10,ratio=2/3,mode="random",seed=12345)
print(H)
H=holdout(1:10,ratio=2/3,mode="random",seed=12345)
print(H)

### classification example
## Not run: 
data(iris)
# random stratified holdout
H=holdout(iris$Species,ratio=2/3,mode="stratified") 
print(table(iris[H$tr,]$Species))
print(table(iris[H$ts,]$Species))
M=fit(Species~.,iris[H$tr,],model="rpart") # training data only
P=predict(M,iris[H$ts,]) # test data
print(mmetric(iris$Species[H$ts],P,"CONF"))

## End(Not run)

### regression example with incremental and rolling window holdout:
## Not run: 
ts=c(1,4,7,2,5,8,3,6,9,4,7,10,5,8,11,6,9)
d=CasesSeries(ts,c(1,2,3))
print(d) # with 14 examples
# incremental holdout example (growing window)
for(b in 1:4) # iterations
  {
   H=holdout(d$y,ratio=4,mode="incremental",iter=b,window=5,increment=2)
   M=fit(y~.,d[H$tr,],model="mlpe",search=2)
   P=predict(M,d[H$ts,])
   cat("batch :",b,"TR from:",H$tr[1],"to:",H$tr[length(H$tr)],"size:",length(H$tr),
       "TS from:",H$ts[1],"to:",H$ts[length(H$ts)],"size:",length(H$ts),
       "mae:",mmetric(d$y[H$ts],P,"MAE"),"\n")
  }
# rolling holdout example (sliding window)
for(b in 1:4) # iterations
  {
   H=holdout(d$y,ratio=4,mode="rolling",iter=b,window=5,increment=2)
   M=fit(y~.,d[H$tr,],model="mlpe",search=2)
   P=predict(M,d[H$ts,])
   cat("batch :",b,"TR from:",H$tr[1],"to:",H$tr[length(H$tr)],"size:",length(H$tr),
       "TS from:",H$ts[1],"to:",H$ts[length(H$ts)],"size:",length(H$ts),
       "mae:",mmetric(d$y[H$ts],P,"MAE"),"\n")
  }

## End(Not run)

### local seed simple example
## Not run: 
# seed is defined, same sequence for N1 and N2:
# s2 generation sequence is not affected by the holdout call
set.seed(1); s1=sample(1:10,3)
set.seed(1);
N1=holdout(1:10,seed=123) # local seed
N2=holdout(1:10,seed=123) # local seed
print(N1$tr)
print(N2$tr)
s2=sample(1:10,3)
cat("s1:",s1,"\n") 
cat("s2:",s2,"\n") # s2 is equal to s1

## End(Not run)

### simple examples:
# preserves order, last two elements go into test set
H=holdout(1:10,ratio=2,internal=TRUE,mode="order")
print(H)
# no seed or NULL returns different splits:
H=holdout(1:10,ratio=2/3,mode="random")
print(H)
H=holdout(1:10,ratio=2/3,mode="random",seed=NULL)
print(H)
# same seed returns identical split:
H=holdout(1:10,ratio=2/3,mode="random",seed=12345)
print(H)
H=holdout(1:10,ratio=2/3,mode="random",seed=12345)
print(H)

### classification example
## Not run: 
data(iris)
# random stratified holdout
H=holdout(iris$Species,ratio=2/3,mode="stratified") 
print(table(iris[H$tr,]$Species))
print(table(iris[H$ts,]$Species))
M=fit(Species~.,iris[H$tr,],model="rpart") # training data only
P=predict(M,iris[H$ts,]) # test data
print(mmetric(iris$Species[H$ts],P,"CONF"))

## End(Not run)

### regression example with incremental and rolling window holdout:
## Not run: 
ts=c(1,4,7,2,5,8,3,6,9,4,7,10,5,8,11,6,9)
d=CasesSeries(ts,c(1,2,3))
print(d) # with 14 examples
# incremental holdout example (growing window)
for(b in 1:4) # iterations
  {
   H=holdout(d$y,ratio=4,mode="incremental",iter=b,window=5,increment=2)
   M=fit(y~.,d[H$tr,],model="mlpe",search=2)
   P=predict(M,d[H$ts,])
   cat("batch :",b,"TR from:",H$tr[1],"to:",H$tr[length(H$tr)],"size:",length(H$tr),
       "TS from:",H$ts[1],"to:",H$ts[length(H$ts)],"size:",length(H$ts),
       "mae:",mmetric(d$y[H$ts],P,"MAE"),"\n")
  }
# rolling holdout example (sliding window)
for(b in 1:4) # iterations
  {
   H=holdout(d$y,ratio=4,mode="rolling",iter=b,window=5,increment=2)
   M=fit(y~.,d[H$tr,],model="mlpe",search=2)
   P=predict(M,d[H$ts,])
   cat("batch :",b,"TR from:",H$tr[1],"to:",H$tr[length(H$tr)],"size:",length(H$tr),
       "TS from:",H$ts[1],"to:",H$ts[length(H$ts)],"size:",length(H$ts),
       "mae:",mmetric(d$y[H$ts],P,"MAE"),"\n")
  }

## End(Not run)

### local seed simple example
## Not run: 
# seed is defined, same sequence for N1 and N2:
# s2 generation sequence is not affected by the holdout call
set.seed(1); s1=sample(1:10,3)
set.seed(1);
N1=holdout(1:10,seed=123) # local seed
N2=holdout(1:10,seed=123) # local seed
print(N1$tr)
print(N2$tr)
s2=sample(1:10,3)
cat("s1:",s1,"\n") 
cat("s2:",s2,"\n") # s2 is equal to s1

## End(Not run)

Measure input importance (including sensitivity analysis) given a supervised data mining model.

Description

Measure input importance (including sensitivity analysis) given a supervised data mining model.

Usage

Importance(M, data, RealL = 7, method = "1D-SA", measure = "AAD", 
           sampling = "regular", baseline = "mean", responses = TRUE, 
           outindex = NULL, task = "default", PRED = NULL, 
           interactions = NULL, Aggregation = -1, LRandom = -1,
           MRandom = "discrete", Lfactor = FALSE)
Importance(M, data, RealL = 7, method = "1D-SA", measure = "AAD", 
           sampling = "regular", baseline = "mean", responses = TRUE, 
           outindex = NULL, task = "default", PRED = NULL, 
           interactions = NULL, Aggregation = -1, LRandom = -1,
           MRandom = "discrete", Lfactor = FALSE)

Arguments

`M`	fitted model, typically is the object returned by `fit`. Can also be any fitted model (i.e. not from rminer), provided that the predict function PRED is defined (see examples for details).
`data`	training data (the same data.frame that was used to fit the model, currently only used to add data histogram to VEC curve).
`RealL`	the number of sensitivity analysis levels (e.g. 7). Note: you need to use `RealL`>=2.
`method`	input importance method. Options are: 1D-SA – 1 dimensional sensitivity analysis, very fast, sets interactions to NULL. sens or SA – sensitivity analysis. There are some extra variants: sensa – equal to `sens` but also sets `measure="AAD"`; sensv – sets `measure="variance"`; sensg – sets `measure="gradient"`; sensr – sets `measure="range"`. if interactions is not null, then GSA is assumed, else 1D-SA is assumed. DSA – Data-based SA (good option if input interactions need to be detected). MSA – Monte-Carlo SA. CSA – Cluster-based SA. GSA – Global SA (very slow method, particularly if the number of inputs is large, should be avoided). randomForest – uses method of Leo Breiman (type=1), only makes sense when M is a randomRorest.
`measure`	sensitivity analysis measure (used to measure input importance). Options are: AAD – average absolute deviation from the median. gradient – average absolute gradient (y_i+1-y_i) of the responses. variance – variance of the responses. range – maximum - minimum of the responses.
`sampling`	for numeric inputs, the sampling scan function. Options are: regular – regular sequence (uniform distribution), do not change this value, kept here only due to compatibility issues.
`baseline`	baseline vector used during the sensitivity analysis. Options are: mean – uses a vector with the mean values of each attribute from `data`. median – uses a vector with the median values of each attribute from `data`. a data.frame with the baseline example (should have the same attribute names as `data`).
`responses`	if `TRUE` then all sensitivity analysis responses are stored and returned.
`outindex`	the output index (column) of `data` if `M` is not a model object (returned by fit).
`task`	the `task` as defined in `fit` if `M` is not a model object (returned by fit).
`PRED`	the prediction function of `M`, if `M` is not a model object (returned by fit). Note: this function should behave like the rminer `predict-methods`, i.e. return a numeric vector in case of regression; a matrix of examples (rows) vs probabilities (columns) (`task="prob"`) or a factor (`task="class"`) in case of classification.
`interactions`	numeric vector with the attributes (columns) used by Ith-D sensitivity analysis (2-D or higher, "GSA" method): if `NULL` then only a 1-D sensitivity analysis is performed. if `length(interactions)==1` then a "special" 2-D sensitivity analysis is performed using the index of interactions versus all remaining inputs. Note: the $sresponses[[interactions]] will be empty (in `vecplot` do not use `xval` `=interactions`). if `length(interactions)>1` then a full Ith-D sensitivity analysis is performed, where I=length(interactions). Note: Computational effort can highly increase if I is too large, i.e. O(RealL^I). Also, you need to preprocess the returned list (e.g. using `avg_imp`) to use the `vecplot` function (see the examples).
`Aggregation`	numeric value that sets the number of multi-metric aggregation function (used only for "DSA", ""). Options are: -1 – the default value that should work in most cases (if regression, sets Aggregation=3, else if classification then sets Aggregation=1). 1 – value that should work for classification (only use the average of all sensitivity values). 3 – value that should work for regression (use 3 metrics, the minimum, average and maximum of all sensitivity values).
`LRandom`	number of samples used by DSA and MSA methods. The default value is -1, which means: use a number equal to training set size. If a different value is used (1<= value <= number of training samples), then LRandom samples are randomly selected.
`MRandom`	sampling type used by MSA: "discrete" (default discrete uniform distribution) or "continuous" (from continuous uniform distribution).
`Lfactor`	sets the maximum number of sensitivity levels for discrete inputs. if FALSE then a maximum of up to RealL levels are used (most frequent ones), else (TRUE) then all levels of the input are used in the SA analysis.

Details

This function provides several algorithms for measuring input importance of supervised data mining models and the average effect of a given input (or pair of inputs) in the model. A particular emphasis is given on sensitivity analysis (SA), which is a simple method that measures the effects on the output of a given model when the inputs are varied through their range of values. Check the references for more details.

Value

A list with the components:

$value – numeric vector with the computed sensitivity analysis measure for each attribute.
$imp – numeric vector with the relative importance for each attribute (only makes sense for 1-D analysis).
$sresponses – vector list as described in the Value documentation of mining.
$data – if DSA or MSA, store the used data samples, needed for visualizations made by vecplot.
$method – SA method
$measure – SA measure
$agg – Aggregation value
$nclasses – if task="prob" or "class", the number of output classes, else nclasses=1
$inputs – indexes of the input attributes
$Llevels – sensitivity levels used for each attribute (NA means output attribute)
$interactions – which attributes were interacted when method=GSA.

Note

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

To cite the Importance function, sensitivity analysis methods or synthetic datasets, please use:
P. Cortez and M.J. Embrechts.
Using Sensitivity Analysis and Visualization Techniques to Open Black Box Data Mining Models.
In Information Sciences, Elsevier, 225:1-17, March 2013.
doi:10.1016/j.ins.2012.10.039

Examples

 
### dontrun is used when the execution of the example requires some computational effort.

### 1st example, regression, 1-D sensitivity analysis
## Not run: 
data(sa_ssin) # x1 should account for 55
M=fit(y~.,sa_ssin,model="ksvm")
I=Importance(M,sa_ssin,method="1D-SA") # 1-D SA, AAD
print(round(I$imp,digits=2))

L=list(runs=1,sen=t(I$imp),sresponses=I$sresponses)
mgraph(L,graph="IMP",leg=names(sa_ssin),col="gray",Grid=10)
mgraph(L,graph="VEC",xval=1,Grid=10,data=sa_ssin,
   main="VEC curve for x1 influence on y") # or:
vecplot(I,xval=1,Grid=10,data=sa_ssin,datacol="gray",
   main="VEC curve for x1 influence on y") # same graph
vecplot(I,xval=c(1,2,3),pch=c(1,2,3),Grid=10,
leg=list(pos="bottomright",leg=c("x1","x2","x3"))) # all x1, x2 and x3 VEC curves

## End(Not run)

### 2nd example, regression, DSA sensitivity analysis:
## Not run: 
I2=Importance(M,sa_ssin,method="DSA")
print(I2)
# influence of x1 and x2 over y
vecplot(I2,graph="VEC",xval=1) # VEC curve
vecplot(I2,graph="VECB",xval=1) # VEC curve with boxplots
vecplot(I2,graph="VEC3",xval=c(1,2)) # VEC surface
vecplot(I2,graph="VECC",xval=c(1,2)) # VEC contour

## End(Not run)

### 3th example, classification (pure class labels, task="cla"), DSA:
## Not run: 
data(sa_int2_3c) # pair (x1,x2) is more relevant than x3, all x1,x2,x3 affect y, 
                 # x4 has a null effect.
M2=fit(y~.,sa_int2_3c,model="mlpe",task="class")
I4=Importance(M2,sa_int2_3c,method="DSA")
# VEC curve (should present a kind of "saw" shape curve) for class B (TC=2):
vecplot(I4,graph="VEC",xval=2,cex=1.2,TC=2,
 main="VEC curve for x2 influence on y (class B)",xlab="x2")
# same VEC curve but with boxplots:
vecplot(I4,graph="VECB",xval=2,cex=1.2,TC=2,
 main="VEC curve with box plots for x2 influence on y (class B)",xlab="x2")

## End(Not run)

### 4th example, regression, DSA and GSA:
## Not run: 
data(sa_psin)
# same model from Table 1 of the reference:
M3=fit(y~.,sa_psin,model="ksvm",search=2^-2,C=2^6.87,epsilon=2^-8)
# in this case: Aggregation should be -1 (default), 1 (class) or 3 (reg), see ref. paper.
I5=Importance(M3,sa_psin,method="DSA",Aggregation=3)
print("Input importances:")
print(round(I5$imp,digits=2)) # INS 2013 similar results

# 2D analysis (check reference for more details), RealL=L=7:
# need to aggregate results into a matrix of SA measure by using the agg_matrix_imp function.
# important notes: 
# - agg_matrix_imp only works for the methods "DSA", "MSA" and "GSA".
# - reliable agg_matrix_imp results for "DSA" or "MSA" only for a 
#   a large LRandom value (e.g., LRandom=1000) or when LRandom=-1 (all training samples)
cm=agg_matrix_imp(I5)
print("show Table 8 DSA results (from the reference):")
print(round(cm$m1,digits=2))
print(round(cm$m2,digits=2))
# internal rminer function:
# show most relevant (darker) input pairs, in this case (x1,x2) > (x1,x3) > (x2,x3)
# to build a nice plot, a fixed threshold=c(0.05,0.05) is used. note that
# in the paper and for real data, we use threshold=0.1, 
# which means threshold=rep(max(cm$m1,cm$m2)*threshold,2)
fcm=cmatrixplot(cm,threshold=c(0.05,0.05)) 
# 2D analysis using pair AT=c(x1,x2') (check reference for more details), RealL=7:
# nice 3D VEC surface plot:
vecplot(I5,xval=c(1,2),graph="VEC3",xlab="x1",ylab="x2",zoom=1.1,
 main="VEC surface of (x1,x2') influence on y")
# same influence but know shown using VEC contour:
par(mar=c(4.0,4.0,1.0,0.3)) # change the graph window space size
vecplot(I5,xval=c(1,2),graph="VECC",xlab="x1",ylab="x2",
 main="VEC surface of (x1,x2') influence on y")
# slower GSA:
I6=Importance(M3,sa_psin,method="GSA",interactions=1:4)
print("Input importances:")
print(round(I6$imp,digits=2)) # INS 2013 similar results
cm2=agg_matrix_imp(I6)
# compare cm2 with cm1, almost identical:
print(round(cm2$m1,digits=2))
print(round(cm2$m2,digits=2))
fcm2=cmatrixplot(cm2,threshold=0.1) 

## End(Not run)

### 5th example, classification, 1D_SA, DSA, MSA and GSA:
## Not run: 
data(sa_ssin_n2p)
# same model from Table 1 of the reference:
M4=fit(y~.,sa_ssin_n2p,model="ksvm",kpar=list(sigma=2^-8.25),C=2^10)

I7=Importance(M4,sa_ssin_n2p,method="1D-SA")
print("1D-SA Input importances:")
print(round(I7$imp,digits=2)) # INS 2013 similar results (Table 6)

I8=Importance(M4,sa_ssin_n2p,method="GSA",interactions=1:4)
print("GSA Input importances:")
print(round(I8$imp,digits=2)) # INS 2013 similar results (Table 6)

I9=Importance(M4,sa_ssin_n2p,method="DSA",LRandom=1000)
print("DSA Ns=1000 Input importances:")
print(round(I9$imp,digits=2)) # INS 2013 similar results (Table 6)

I10=Importance(M4,sa_ssin_n2p,method="DSA",LRandom=10)
print("DSA Ns=10 Input importances:")
print(round(I10$imp,digits=2)) # INS 2013 similar results (Table 6)

I11=Importance(M4,sa_ssin_n2p,method="MSA",LRandom=10)
print("MSA Ns=10 Input importances:")
print(round(I11$imp,digits=2)) # INS 2013 similar results (Table 6)

# 2D analysis:
cm3=agg_matrix_imp(I8)
fcm3=cmatrixplot(cm3,threshold=c(0.05,0.05)) 
cm4=agg_matrix_imp(I9)
fcm4=cmatrixplot(cm4,threshold=c(0.05,0.05)) 

## End(Not run)

### If you want to use Importance over your own model (different than rminer ones):
# 1st example, regression, uses the theoretical sin1reg function: x1=70% and x2=30%
data(sin1reg)
mypred=function(M,data)
{ return (M[1]*sin(pi*data[,1]/M[3])+M[2]*sin(pi*data[,2]/M[3])) }
M=c(0.7,0.3,2000)
# 4 is the column index of y
I=Importance(M,sin1reg,method="sens",measure="AAD",PRED=mypred,outindex=4) 
print(I$imp) # x1=72.3% and x2=27.7%
L=list(runs=1,sen=t(I$imp),sresponses=I$sresponses)
mgraph(L,graph="IMP",leg=names(sin1reg),col="gray",Grid=10)
mgraph(L,graph="VEC",xval=1,Grid=10) # equal to:
par(mar=c(2.0,2.0,1.0,0.3)) # change the graph window space size
vecplot(I,graph="VEC",xval=1,Grid=10,main="VEC curve for x1 influence on y:")

### 2nd example, 3-class classification for iris and lda model:
## Not run: 
data(iris)
library(MASS)
predlda=function(M,data) # the PRED function
{ return (predict(M,data)$posterior) }
LDA=lda(Species ~ .,iris, prior = c(1,1,1)/3)
# 4 is the column index of Species
I=Importance(LDA,iris,method="1D-SA",PRED=predlda,outindex=4)
vecplot(I,graph="VEC",xval=1,Grid=10,TC=1,
main="1-D VEC for Sepal.Lenght (x-axis) influence in setosa (prob.)")

## End(Not run)

### 3rd example, binary classification for setosa iris and lda model:
## Not run: 
data(iris)
library(MASS)
iris2=iris;iris2$Species=factor(iris$Species=="setosa")
predlda2=function(M,data) # the PRED function
{ return (predict(M,data)$class) }
LDA2=lda(Species ~ .,iris2)
I=Importance(LDA2,iris2,method="1D-SA",PRED=predlda2,outindex=4)
vecplot(I,graph="VEC",xval=1,
main="1-D VEC for Sepal.Lenght (x-axis) influence in setosa (class)",Grid=10)

## End(Not run)

### Example with discrete inputs 
## Not run: 
data(iris)
ir1=iris
ir1[,1]=cut(ir1[,1],breaks=4)
ir1[,2]=cut(ir1[,2],breaks=4)
M=fit(Species~.,ir1,model="mlpe")
I=Importance(M,ir1,method="DSA")
# discrete example:
vecplot(I,graph="VEC",xval=1,TC=1,main="class: setosa (discrete x1)",data=ir1)
# continuous example:
vecplot(I,graph="VEC",xval=3,TC=1,main="class: setosa (cont. x1)",data=ir1)

## End(Not run)

### dontrun is used when the execution of the example requires some computational effort.

### 1st example, regression, 1-D sensitivity analysis
## Not run: 
data(sa_ssin) # x1 should account for 55
M=fit(y~.,sa_ssin,model="ksvm")
I=Importance(M,sa_ssin,method="1D-SA") # 1-D SA, AAD
print(round(I$imp,digits=2))

L=list(runs=1,sen=t(I$imp),sresponses=I$sresponses)
mgraph(L,graph="IMP",leg=names(sa_ssin),col="gray",Grid=10)
mgraph(L,graph="VEC",xval=1,Grid=10,data=sa_ssin,
   main="VEC curve for x1 influence on y") # or:
vecplot(I,xval=1,Grid=10,data=sa_ssin,datacol="gray",
   main="VEC curve for x1 influence on y") # same graph
vecplot(I,xval=c(1,2,3),pch=c(1,2,3),Grid=10,
leg=list(pos="bottomright",leg=c("x1","x2","x3"))) # all x1, x2 and x3 VEC curves

## End(Not run)

### 2nd example, regression, DSA sensitivity analysis:
## Not run: 
I2=Importance(M,sa_ssin,method="DSA")
print(I2)
# influence of x1 and x2 over y
vecplot(I2,graph="VEC",xval=1) # VEC curve
vecplot(I2,graph="VECB",xval=1) # VEC curve with boxplots
vecplot(I2,graph="VEC3",xval=c(1,2)) # VEC surface
vecplot(I2,graph="VECC",xval=c(1,2)) # VEC contour

## End(Not run)

### 3th example, classification (pure class labels, task="cla"), DSA:
## Not run: 
data(sa_int2_3c) # pair (x1,x2) is more relevant than x3, all x1,x2,x3 affect y, 
                 # x4 has a null effect.
M2=fit(y~.,sa_int2_3c,model="mlpe",task="class")
I4=Importance(M2,sa_int2_3c,method="DSA")
# VEC curve (should present a kind of "saw" shape curve) for class B (TC=2):
vecplot(I4,graph="VEC",xval=2,cex=1.2,TC=2,
 main="VEC curve for x2 influence on y (class B)",xlab="x2")
# same VEC curve but with boxplots:
vecplot(I4,graph="VECB",xval=2,cex=1.2,TC=2,
 main="VEC curve with box plots for x2 influence on y (class B)",xlab="x2")

## End(Not run)

### 4th example, regression, DSA and GSA:
## Not run: 
data(sa_psin)
# same model from Table 1 of the reference:
M3=fit(y~.,sa_psin,model="ksvm",search=2^-2,C=2^6.87,epsilon=2^-8)
# in this case: Aggregation should be -1 (default), 1 (class) or 3 (reg), see ref. paper.
I5=Importance(M3,sa_psin,method="DSA",Aggregation=3)
print("Input importances:")
print(round(I5$imp,digits=2)) # INS 2013 similar results

# 2D analysis (check reference for more details), RealL=L=7:
# need to aggregate results into a matrix of SA measure by using the agg_matrix_imp function.
# important notes: 
# - agg_matrix_imp only works for the methods "DSA", "MSA" and "GSA".
# - reliable agg_matrix_imp results for "DSA" or "MSA" only for a 
#   a large LRandom value (e.g., LRandom=1000) or when LRandom=-1 (all training samples)
cm=agg_matrix_imp(I5)
print("show Table 8 DSA results (from the reference):")
print(round(cm$m1,digits=2))
print(round(cm$m2,digits=2))
# internal rminer function:
# show most relevant (darker) input pairs, in this case (x1,x2) > (x1,x3) > (x2,x3)
# to build a nice plot, a fixed threshold=c(0.05,0.05) is used. note that
# in the paper and for real data, we use threshold=0.1, 
# which means threshold=rep(max(cm$m1,cm$m2)*threshold,2)
fcm=cmatrixplot(cm,threshold=c(0.05,0.05)) 
# 2D analysis using pair AT=c(x1,x2') (check reference for more details), RealL=7:
# nice 3D VEC surface plot:
vecplot(I5,xval=c(1,2),graph="VEC3",xlab="x1",ylab="x2",zoom=1.1,
 main="VEC surface of (x1,x2') influence on y")
# same influence but know shown using VEC contour:
par(mar=c(4.0,4.0,1.0,0.3)) # change the graph window space size
vecplot(I5,xval=c(1,2),graph="VECC",xlab="x1",ylab="x2",
 main="VEC surface of (x1,x2') influence on y")
# slower GSA:
I6=Importance(M3,sa_psin,method="GSA",interactions=1:4)
print("Input importances:")
print(round(I6$imp,digits=2)) # INS 2013 similar results
cm2=agg_matrix_imp(I6)
# compare cm2 with cm1, almost identical:
print(round(cm2$m1,digits=2))
print(round(cm2$m2,digits=2))
fcm2=cmatrixplot(cm2,threshold=0.1) 

## End(Not run)

### 5th example, classification, 1D_SA, DSA, MSA and GSA:
## Not run: 
data(sa_ssin_n2p)
# same model from Table 1 of the reference:
M4=fit(y~.,sa_ssin_n2p,model="ksvm",kpar=list(sigma=2^-8.25),C=2^10)

I7=Importance(M4,sa_ssin_n2p,method="1D-SA")
print("1D-SA Input importances:")
print(round(I7$imp,digits=2)) # INS 2013 similar results (Table 6)

I8=Importance(M4,sa_ssin_n2p,method="GSA",interactions=1:4)
print("GSA Input importances:")
print(round(I8$imp,digits=2)) # INS 2013 similar results (Table 6)

I9=Importance(M4,sa_ssin_n2p,method="DSA",LRandom=1000)
print("DSA Ns=1000 Input importances:")
print(round(I9$imp,digits=2)) # INS 2013 similar results (Table 6)

I10=Importance(M4,sa_ssin_n2p,method="DSA",LRandom=10)
print("DSA Ns=10 Input importances:")
print(round(I10$imp,digits=2)) # INS 2013 similar results (Table 6)

I11=Importance(M4,sa_ssin_n2p,method="MSA",LRandom=10)
print("MSA Ns=10 Input importances:")
print(round(I11$imp,digits=2)) # INS 2013 similar results (Table 6)

# 2D analysis:
cm3=agg_matrix_imp(I8)
fcm3=cmatrixplot(cm3,threshold=c(0.05,0.05)) 
cm4=agg_matrix_imp(I9)
fcm4=cmatrixplot(cm4,threshold=c(0.05,0.05)) 

## End(Not run)

### If you want to use Importance over your own model (different than rminer ones):
# 1st example, regression, uses the theoretical sin1reg function: x1=70% and x2=30%
data(sin1reg)
mypred=function(M,data)
{ return (M[1]*sin(pi*data[,1]/M[3])+M[2]*sin(pi*data[,2]/M[3])) }
M=c(0.7,0.3,2000)
# 4 is the column index of y
I=Importance(M,sin1reg,method="sens",measure="AAD",PRED=mypred,outindex=4) 
print(I$imp) # x1=72.3% and x2=27.7%
L=list(runs=1,sen=t(I$imp),sresponses=I$sresponses)
mgraph(L,graph="IMP",leg=names(sin1reg),col="gray",Grid=10)
mgraph(L,graph="VEC",xval=1,Grid=10) # equal to:
par(mar=c(2.0,2.0,1.0,0.3)) # change the graph window space size
vecplot(I,graph="VEC",xval=1,Grid=10,main="VEC curve for x1 influence on y:")

### 2nd example, 3-class classification for iris and lda model:
## Not run: 
data(iris)
library(MASS)
predlda=function(M,data) # the PRED function
{ return (predict(M,data)$posterior) }
LDA=lda(Species ~ .,iris, prior = c(1,1,1)/3)
# 4 is the column index of Species
I=Importance(LDA,iris,method="1D-SA",PRED=predlda,outindex=4)
vecplot(I,graph="VEC",xval=1,Grid=10,TC=1,
main="1-D VEC for Sepal.Lenght (x-axis) influence in setosa (prob.)")

## End(Not run)

### 3rd example, binary classification for setosa iris and lda model:
## Not run: 
data(iris)
library(MASS)
iris2=iris;iris2$Species=factor(iris$Species=="setosa")
predlda2=function(M,data) # the PRED function
{ return (predict(M,data)$class) }
LDA2=lda(Species ~ .,iris2)
I=Importance(LDA2,iris2,method="1D-SA",PRED=predlda2,outindex=4)
vecplot(I,graph="VEC",xval=1,
main="1-D VEC for Sepal.Lenght (x-axis) influence in setosa (class)",Grid=10)

## End(Not run)

### Example with discrete inputs 
## Not run: 
data(iris)
ir1=iris
ir1[,1]=cut(ir1[,1],breaks=4)
ir1[,2]=cut(ir1[,2],breaks=4)
M=fit(Species~.,ir1,model="mlpe")
I=Importance(M,ir1,method="DSA")
# discrete example:
vecplot(I,graph="VEC",xval=1,TC=1,main="class: setosa (discrete x1)",data=ir1)
# continuous example:
vecplot(I,graph="VEC",xval=3,TC=1,main="class: setosa (cont. x1)",data=ir1)

## End(Not run)

Missing data imputation (e.g. substitution by value or hotdeck method).

Description

Missing data imputation (e.g. substitution by value or hotdeck method).

Usage

imputation(imethod = "value", D, Attribute = NULL, Missing = NA, Value = 1)
imputation(imethod = "value", D, Attribute = NULL, Missing = NA, Value = 1)

Arguments

`imethod`	imputation method type: `value` – substitutes missing data by `Value` (with single element or several elements); `hotdeck` – searches first the most similar example (i.e. using a k-nearest neighbor method – `knn`) in the dataset and replaces the missing data by the value found in such example;
`D`	dataset with missing data (data.frame)
`Attribute`	if `NULL` then all attributes (data columns) with missing data are replaced. Else, `Attribute` is the attribute number (numeric) or name (character).
`Missing`	missing data symbol
`Value`	the substitution value (if `imethod=value`) or number of neighbors (k of `knn`).

Details

Check the references.

Value

A data.frame without missing data.

Note

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

M. Brown and J. Kros.
Data mining and the impact of missing data.
In Industrial Management & Data Systems, 103(8):611-621, 2003.
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210

Examples

d=matrix(ncol=5,nrow=5)
d[1,]=c(5,4,3,2,1)
d[2,]=c(4,3,4,3,4)
d[3,]=c(1,1,1,1,1)
d[4,]=c(4,NA,3,4,4)
d[5,]=c(5,NA,NA,2,1)
d=data.frame(d); d[,3]=factor(d[,3])
print(d)
print(imputation("value",d,3,Value="3"))
print(imputation("value",d,2,Value=median(na.omit(d[,2]))))
print(imputation("value",d,2,Value=c(1,2)))
print(imputation("hotdeck",d,"X2",Value=1))
print(imputation("hotdeck",d,Value=1))

## Not run: 
# hotdeck 1-nearest neighbor substitution on a real dataset:
require(kknn)
d=read.table(
   file="http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
   sep=",",na.strings="?",stringsAsFactors=TRUE)
print(summary(d))
d2=imputation("hotdeck",d,Value=1)
print(summary(d2))
par(mfrow=c(2,1))
hist(d$V26)
hist(d2$V26)
par(mfrow=c(1,1)) # reset mfrow

## End(Not run)

d=matrix(ncol=5,nrow=5)
d[1,]=c(5,4,3,2,1)
d[2,]=c(4,3,4,3,4)
d[3,]=c(1,1,1,1,1)
d[4,]=c(4,NA,3,4,4)
d[5,]=c(5,NA,NA,2,1)
d=data.frame(d); d[,3]=factor(d[,3])
print(d)
print(imputation("value",d,3,Value="3"))
print(imputation("value",d,2,Value=median(na.omit(d[,2]))))
print(imputation("value",d,2,Value=c(1,2)))
print(imputation("hotdeck",d,"X2",Value=1))
print(imputation("hotdeck",d,Value=1))

## Not run: 
# hotdeck 1-nearest neighbor substitution on a real dataset:
require(kknn)
d=read.table(
   file="http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
   sep=",",na.strings="?",stringsAsFactors=TRUE)
print(summary(d))
d2=imputation("hotdeck",d,Value=1)
print(summary(d2))
par(mfrow=c(2,1))
hist(d$V26)
hist(d2$V26)
par(mfrow=c(1,1)) # reset mfrow

## End(Not run)

Compute long term forecasts.

Description

Performs multi-step forecasts by iteratively using 1-ahead predictions as inputs

Usage

lforecast(M, data, start, horizon)
lforecast(M, data, start, horizon)

Arguments

`M`	fitted model, the object returned by `fit`.
`data`	training data, typically built using `CasesSeries`.
`start`	starting period (when out-of-samples start).
`horizon`	number of multi-step predictions.

Details

Check the reference for details.

Value

Returns a numeric vector with the multi-step predictions.

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210
To check for more details:
P. Cortez.
Sensitivity Analysis for Time Lag Selection to Forecast Seasonal Time Series using Neural Networks and Support Vector Machines.
In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2010), pp. 3694-3701, Barcelona, Spain, July, 2010. IEEE Computer Society, ISBN: 978-1-4244-6917-8 (DVD edition).
doi:10.1109/IJCNN.2010.5596890

Examples

ts=c(1,4,7,2,5,8,3,6,9,4,7,10,5,8,11,6,9)
d=CasesSeries(ts,c(1,2,3))
M=fit(y~.,d[1:7,],model="mlpe",search=2)
P1=predict(M,d[8:14,]) # single-step predictions
P2=lforecast(M,d,8,7) # multi-step predictions, horizon=7
print(mmetric(d$y[8:14],P1,"MAE"))
print(mmetric(d$y[8:14],P2,"MAE"))
L=vector("list",2); pred=vector("list",1);test=vector("list",1)
pred[[1]]=P1; test[[1]]=d$y[8:14]; L[[1]]=list(pred=pred,test=test,runs=1)
pred[[1]]=P2; test[[1]]=d$y[8:14]; L[[2]]=list(pred=pred,test=test,runs=1)
mgraph(L,graph="REG",Grid=10,leg=c("y","P1","P2"),col=c("black","cyan","blue"))
mgraph(L,graph="RSC",Grid=10,leg=c("P1","P2"),col=c("cyan","blue"))
ts=c(1,4,7,2,5,8,3,6,9,4,7,10,5,8,11,6,9)
d=CasesSeries(ts,c(1,2,3))
M=fit(y~.,d[1:7,],model="mlpe",search=2)
P1=predict(M,d[8:14,]) # single-step predictions
P2=lforecast(M,d,8,7) # multi-step predictions, horizon=7
print(mmetric(d$y[8:14],P1,"MAE"))
print(mmetric(d$y[8:14],P2,"MAE"))
L=vector("list",2); pred=vector("list",1);test=vector("list",1)
pred[[1]]=P1; test[[1]]=d$y[8:14]; L[[1]]=list(pred=pred,test=test,runs=1)
pred[[1]]=P2; test[[1]]=d$y[8:14]; L[[2]]=list(pred=pred,test=test,runs=1)
mgraph(L,graph="REG",Grid=10,leg=c("y","P1","P2"),col=c("black","cyan","blue"))
mgraph(L,graph="RSC",Grid=10,leg=c("P1","P2"),col=c("cyan","blue"))

Mining graph function

Description

Plots a graph given a mining list, list of several mining lists or given the pair y - target and x - predictions.

Usage

mgraph(y, x = NULL, graph, leg = NULL, xval = -1, PDF = "", PTS = -1, 
       size = c(5, 5), sort = TRUE, ranges = NULL, data = NULL,
       digits = NULL, TC = -1, intbar = TRUE, lty = 1, col = "black",
       main = "", metric = "MAE", baseline = FALSE, Grid = 0, 
       axis = NULL, cex = 1)
mgraph(y, x = NULL, graph, leg = NULL, xval = -1, PDF = "", PTS = -1, 
       size = c(5, 5), sort = TRUE, ranges = NULL, data = NULL,
       digits = NULL, TC = -1, intbar = TRUE, lty = 1, col = "black",
       main = "", metric = "MAE", baseline = FALSE, Grid = 0, 
       axis = NULL, cex = 1)

Arguments

`y`	if there are predictions (`!is.null(x)`), `y` should be a numeric vector or factor with the target desired responses (or output values). Else, `y` should be a list returned by the `mining` function or a vector list with several mining lists.
`x`	the predictions (should be a numeric vector if `task="reg"`, matrix if `task="prob"` or factor if `task="class"` (use if `y` is not a list).
`graph`	type of graph. Options are: `ROC` – ROC curve (classification); `LIFT` – LIFT accumulative curve (classification); `IMP` – relative input importance barplot; `REC` – REC curve (regression); `VEC` – variable effect curve; `RSC` – regression scatter plot; `REP` – regression error plot; `REG` – regression plot; `DLC` – distance line comparison (for comparing errors in one line);
`leg`	legend of graph: if `NULL` – not used; if -1 and `graph="ROC" or "LIFT"` – the target class name is used; if -1 and `graph="REG"` – `leg=c("Target","Predictions")`; if -1 and `graph="RSC"` – `leg=c("Predictions")`; if vector with "character" type (text) – the text of the legend; if is list – `$leg` = vector with the text of the legend and `$pos` is the position of the legend (e.g. "top" or c(4,5));
`xval`	auxiliary value, used by some graphs: `VEC` – if -1 means perform several 1-D sensitivity analysis VEC curves, one for each attribute, if >0 means the attribute index (e.g. 1). `ROC` or `LIFT` or `REC` – if -1 then `xval=1`. For these graphs, `xval` is the maximum x-axis value. `IMP` – `xval` is the x-axis value for the legend of the attributes. `REG` – `xval` is the set of plotted examples (e.g. 1:5), if -1 then all examples are used. `DLC` – `xval` is the `val` of the `mmetric` function.
`PDF`	if `""` then the graph is plotted on the screen, else the graph is saved into a pdf file with the name set in this argument.
`PTS`	number of points in each line plot. If -1 then `PTS=11` (for `ROC`, `REC` or `LIFT`) or `PTS=6` (`VEC`).
`size`	size of the graph, c(width,height), in inches.
`sort`	if TRUE then sorts the data (works only for some graphs, e.g. `VEC`, `IMP`, `REP`).
`ranges`	matrix with the attribute minimum and maximum ranges (only used by `VEC`).
`data`	the training data, for plotting histograms and getting the minimum and maximum attribute ranges if not defined in ranges (only used by `VEC`).
`digits`	the number of digits for the axis, can also be defined as c(x-axis digits,y-axis digits) (only used by `VEC`).
`TC`	target class (for multi-class classification class) from 1 to Nc, where Nc is the number of classes. If multi-class and TC==-1 then TC is set to the index of the last class.
`intbar`	if 95% confidence interval bars (according to t-student distribution) should be plotted as whiskers.
`lty`	the same `lty` argument of the `par` function.
`col`	color, as defined in the `par` function.
`main`	the title of the graph, as defined in the `plot` function.
`metric`	the error metric, as defined in `mmetric` (used by `DLC`).
`baseline`	if the baseline should be plotted (used by `ROC` and `LIFT`).
`Grid`	if >1 then there are GRID light gray squared grid lines in the plot.
`axis`	Currently only used by `IMP`: numeric vector with the axis numbers (1 – bottom, 3 – top). If `NULL` then `axis=c(1,3)`.
`cex`	label font size

Details

Plots a graph given a mining list, list of several mining lists or given the pair y - target and x - predictions.

Value

A graph (in screen or pdf file).

Note

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

To check for more details about rminer and for citation purposes:
P. Cortez.
Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool.
In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1.
@Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44
http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210

Examples

### regression
y=c(1,5,10,11,7,3,2,1);x=rnorm(length(y),0,1.0)+y
mgraph(y,x,graph="RSC",Grid=10,col=c("blue"))
mgraph(y,x,graph="REG",Grid=10,lty=1,col=c("black","blue"),
       leg=list(pos="topleft",leg=c("target","predictions")))
mgraph(y,x,graph="REP",Grid=10)
mgraph(y,x,graph="REP",Grid=10,sort=FALSE)
x2=rnorm(length(y),0,1.2)+y;x3=rnorm(length(y),0,1.4)+y;
L=vector("list",3); pred=vector("list",1); test=vector("list",1);
pred[[1]]=y; test[[1]]=x; L[[1]]=list(pred=pred,test=test,runs=1)
test[[1]]=x2; L[[2]]=list(pred=pred,test=test,runs=1)
test[[1]]=x3; L[[3]]=list(pred=pred,test=test,runs=1)
# distance line comparison graph:
mgraph(L,graph="DLC",metric="MAE",leg=c("x1","x2","x3"),main="MAE errors")

# new REC multi-curve single graph with NAREC (normalized Area of REC) values
# for maximum tolerance of val=0.5 (other val values can be used)
e1=mmetric(y,x,metric="NAREC",val=5)
e2=mmetric(y,x2,metric="NAREC",val=5)
e3=mmetric(y,x3,metric="NAREC",val=5)
l1=paste("x1, NAREC=",round(e1,digits=2))
l2=paste("x2, NAREC=",round(e2,digits=2))
l3=paste("x3, NAREC=",round(e3,digits=2))
mgraph(L,graph="REC",leg=list(pos="bottom",leg=c(l1,l2,l3)),main="REC curves")

### regression example with mining
## Not run: 
data(sin1reg)
M1=mining(y~.,sin1reg[,c(1,2,4)],model="mr",Runs=5)
M2=mining(y~.,sin1reg[,c(1,2,4)],model="mlpe",nr=3,maxit=50,size=4,Runs=5,feature="simp")
L=vector("list",2); L[[1]]=M2; L[[2]]=M1
mgraph(L,graph="REC",xval=0.1,leg=c("mlpe","mr"),main="REC curve")
mgraph(L,graph="DLC",metric="TOLERANCE",xval=0.01,
       leg=c("mlpe","mr"),main="DLC: TOLERANCE plot")
mgraph(M2,graph="IMP",xval=0.01,leg=c("x1","x2"),
       main="sin1reg Input importance",axis=1)
mgraph(M2,graph="VEC",xval=1,main="sin1reg 1-D VEC curve for x1")
mgraph(M2,graph="VEC",xval=1,
       main="sin1reg 1-D VEC curve and histogram for x1",data=sin1reg)

## End(Not run)

### classification example
## Not run: 
data(iris)
M1=mining(Species~.,iris,model="rpart",Runs=5) # decision tree (DT)
M2=mining(Species~.,iris,model="ksvm",Runs=5) # support vector machine (SVM)
L=vector("list",2); L[[1]]=M2; L[[2]]=M1
mgraph(M1,graph="ROC",TC=3,leg=-1,baseline=TRUE,Grid=10,main="ROC")
mgraph(M1,graph="ROC",TC=3,leg=-1,baseline=TRUE,Grid=10,main="ROC",intbar=FALSE)
mgraph(L,graph="ROC",TC=3,leg=c("SVM","DT"),baseline=TRUE,Grid=10,
       main="ROC for virginica")
mgraph(L,graph="LIFT",TC=3,leg=list(pos=c(0.4,0.2),leg=c("SVM","DT")),
       baseline=TRUE,Grid=10,main="LIFT for virginica")

## End(Not run)

### regression
y=c(1,5,10,11,7,3,2,1);x=rnorm(length(y),0,1.0)+y
mgraph(y,x,graph="RSC",Grid=10,col=c("blue"))
mgraph(y,x,graph="REG",Grid=10,lty=1,col=c("black","blue"),
       leg=list(pos="topleft",leg=c("target","predictions")))
mgraph(y,x,graph="REP",Grid=10)
mgraph(y,x,graph="REP",Grid=10,sort=FALSE)
x2=rnorm(length(y),0,1.2)+y;x3=rnorm(length(y),0,1.4)+y;
L=vector("list",3); pred=vector("list",1); test=vector("list",1);
pred[[1]]=y; test[[1]]=x; L[[1]]=list(pred=pred,test=test,runs=1)
test[[1]]=x2; L[[2]]=list(pred=pred,test=test,runs=1)
test[[1]]=x3; L[[3]]=list(pred=pred,test=test,runs=1)
# distance line comparison graph:
mgraph(L,graph="DLC",metric="MAE",leg=c("x1","x2","x3"),main="MAE errors")

# new REC multi-curve single graph with NAREC (normalized Area of REC) values
# for maximum tolerance of val=0.5 (other val values can be used)
e1=mmetric(y,x,metric="NAREC",val=5)
e2=mmetric(y,x2,metric="NAREC",val=5)
e3=mmetric(y,x3,metric="NAREC",val=5)
l1=paste("x1, NAREC=",round(e1,digits=2))
l2=paste("x2, NAREC=",round(e2,digits=2))
l3=paste("x3, NAREC=",round(e3,digits=2))
mgraph(L,graph="REC",leg=list(pos="bottom",leg=c(l1,l2,l3)),main="REC curves")

### regression example with mining
## Not run: 
data(sin1reg)
M1=mining(y~.,sin1reg[,c(1,2,4)],model="mr",Runs=5)
M2=mining(y~.,sin1reg[,c(1,2,4)],model="mlpe",nr=3,maxit=50,size=4,Runs=5,feature="simp")
L=vector("list",2); L[[1]]=M2; L[[2]]=M1
mgraph(L,graph="REC",xval=0.1,leg=c("mlpe","mr"),main="REC curve")
mgraph(L,graph="DLC",metric="TOLERANCE",xval=0.01,
       leg=c("mlpe","mr"),main="DLC: TOLERANCE plot")
mgraph(M2,graph="IMP",xval=0.01,leg=c("x1","x2"),
       main="sin1reg Input importance",axis=1)
mgraph(M2,graph="VEC",xval=1,main="sin1reg 1-D VEC curve for x1")
mgraph(M2,graph="VEC",xval=1,
       main="sin1reg 1-D VEC curve and histogram for x1",data=sin1reg)

## End(Not run)

### classification example
## Not run: 
data(iris)
M1=mining(Species~.,iris,model="rpart",Runs=5) # decision tree (DT)
M2=mining(Species~.,iris,model="ksvm",Runs=5) # support vector machine (SVM)
L=vector("list",2); L[[1]]=M2; L[[2]]=M1
mgraph(M1,graph="ROC",TC=3,leg=-1,baseline=TRUE,Grid=10,main="ROC")
mgraph(M1,graph="ROC",TC=3,leg=-1,baseline=TRUE,Grid=10,main="ROC",intbar=FALSE)
mgraph(L,graph="ROC",TC=3,leg=c("SVM","DT"),baseline=TRUE,Grid=10,
       main="ROC for virginica")
mgraph(L,graph="LIFT",TC=3,leg=list(pos=c(0.4,0.2),leg=c("SVM","DT")),
       baseline=TRUE,Grid=10,main="LIFT for virginica")

## End(Not run)

Powerful function that trains and tests a particular fit model under several runs and a given validation method

Description

Powerful function that trains and tests a particular fit model under several runs and a given validation method. Since there can be a huge number of models, the fitted models are not stored. Yet, several useful statistics (e.g. predictions) are returned.

Usage

mining(x, data = NULL, Runs = 1, method = NULL, model = "default", 
       task = "default", search = "heuristic", mpar = NULL,
       feature="none", scale = "default", transform = "none", 
       debug = FALSE, ...)
mining(x, data = NULL, Runs = 1, method = NULL, model = "default", 
       task = "default", search = "heuristic", mpar = NULL,
       feature="none", scale = "default", transform = "none", 
       debug = FALSE, ...)

Arguments

`x`	a symbolic description (formula) of the model to be fit. If `x` contains the data, then `data=NULL` (similar to x in `ksvm`, kernlab package).
`data`	an optional data frame (columns denote attributes, rows show examples) containing the training data, when using a formula.
`Runs`	number of runs used (e.g. 1, 5, 10, 20, 30)
`method`	a vector with c(vmethod,vpar,seed) or c(vmethod,vpar,window,increment), where vmethod is: `all` – all NROW examples are used as both training and test sets (no vpar or seed is needed). `holdout` – standard holdout method. If vpar<1 then NROWvpar random samples are used for training and the remaining rows are used for testing. Else, then NROW*vpar random samples are used for testing and the remaining are used for training. For classification tasks (`prob` or `class`) a stratified sampling is assumed (equal to `mode="stratified"` in `holdout`). `holdoutrandom` – similar to `holdout` except that assumes always a random sampling (not stratified). `holdoutorder` – similar to `holdout` except that instead of a random sampling, the first rows (until the split) are used for training and the remaining ones for testing (equal to `mode="order"` in `holdout`). `holdoutinc` – incremental holdout retraining (e.g. used for stock market data). Here, vpar* is the test size, window is the initial window size and increment is the number of samples added at each iteration. Note: argument `Runs` is automatically set when this option is used. See also `holdout`. `holdoutrol` – rolling holdout retraining (e.g. used for stock market data). Here, vpar is the test size, window is the window size and increment is the number of samples added at each iteration. Note: argument `Runs` is automatically set when this option is used. See also `holdout`. `kfold` – K-fold cross-validation method, where vpar is the number of folds. For classification tasks (`prob` or `class`) a stratified split is assumed (equal to `mode="stratified"` in `crossvaldata`). `kfoldrandom` – similar to `kfold` except that assumes always a random sampling (not stratified). `kfoldorder` – similar to `kfold` except that instead of a random sampling, the order of the rows is used to build the folds. vpar – number used by vmethod (optional, if not defined 2/3 for `holdout` and 10 for `kfold` is assumed); and seed (optional, if not defined then `NA` is assumed) is: `NA` – random seed is adopted (default R method for generating random numbers); a vector of size `Runs` with fixed seed numbers for each Run; a number – `set.seed`(number) is applied then a vector of seeds (of size Runs) is generated.
`model`	See `fit` for details.
`task`	See `fit` for details.
`search`	See `fit` for details.
`mpar`	Only kept for compatibility with previous `rminer` versions, as you should use `search` instead of `mpar`. See `fit` for details.
`feature`	See `fit` for more details about `feature="none"`, `"sabs"` or `"sbs"` options. For the `mining` function, additional options are `feature=`fmethod, where fmethod can be one of: `sens` or `sensg` – compute the 1-D sensitivity analysis input importances (`$sen`), gradient measure. `sensv` – compute the 1-D sensitivity analysis input importances (`$sen`), variance measure. `sensr` – compute the 1-D sensitivity analysis input importances (`$sen`), range measure. `simp`, `simpg` or `s` – equal to `sensg` but also computes the 1-D sensitivity responses (`$sresponses`, useful for `graph="VEC"`). `simpv` – equal to `sensv` but also computes the 1-D sensitivity responses (useful for `graph="VEC"`). `simpr` – equal to `sensr` but also computes the 1-D sensitivity responses (useful for `graph="VEC"`).
`scale`	See `fit` for details.
`transform`	See `fit` for details.
`debug`	If TRUE shows some information about each run.
`...`	See `fit` for details.

Details

Powerful function that trains and tests a particular fit model under several runs and a given validation method (see [Cortez, 2010] for more details).
Several Runs are performed. In each run, the same validation method is adopted (e.g. holdout) and several relevant statistics are stored. Note: this function can require some computational effort, specially if a large dataset and/or a high number of Runs is adopted.

Value

A list with the components:

$object – fitted object values of the last run (used by multiple model fitting: "auto" mode). For "holdout", it is equal to a fit object, while for "kfold" it is a list.
$time – vector with time elapsed for each run.
$test – vector list, where each element contains the test (target) results for each run.
$pred – vector list, where each element contains the predicted results for each test set and each run.
$error – vector with a (validation) measure (often it is a error value) according to search$metric for each run (valid options are explained in mmetric).
$mpar – vector list, where each element contains the fit model mpar parameters (for each run).
$model – the model.
$task – the task.
$method – the external validation method.
$sen – a matrix with the 1-D sensitivity analysis input importances. The number of rows is Runs times vpar, if kfold, else is Runs.
$sresponses – a vector list with a size equal to the number of attributes (useful for graph="VEC"). Each element contains a list with the 1-D sensitivity analysis input responses (n – name of the attribute; l – number of levels; x – attribute values; y – 1-D sensitivity responses.
Important note: sresponses (and "VEC" graphs) are only available if feature="sabs" or "simp" related (see feature).
$runs – the Runs.
$attributes – vector list with all attributes (features) selected in each run (and fold if kfold) if a feature selection algorithm is used.
$feature – the feature.

Note

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

To check for more details about rminer and for citation purposes:
P. Cortez.
Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool.
In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1.
@Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44
http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210
For the grid search and other optimization methods:
P. Cortez.
Modern Optimization with R.
Use R! series, Springer, 2nd edition, July 2021, ISBN 978-3-030-72818-2.
https://link.springer.com/book/10.1007/978-3-030-72819-9

Examples

### dontrun is used when the execution of the example requires some computational effort.

### simple regression example
set.seed(123); x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
# mining with an ensemble of neural networks, each fixed with size=2 hidden nodes
# assumes a default holdout (random split) with 2/3 for training and 1/3 for testing:
M=mining(y~x1+x2,Runs=2,model="mlpe",search=2)
print(M)
print(mmetric(M,metric="MAE"))

### more regression examples:
## Not run: 
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# 5 runs of an external holdout with 2/3 for training and 1/3 for testing, fixed seed 12345
# feature selection: sabs method
# model selection: 5 searches for size, internal 2-fold cross validation fixed seed 123
#                  with optimization for minimum MAE metric 
M=mining(y~.,data=sin1reg,Runs=5,method=c("holdout",2/3,12345),model="mlpe",
         search=list(search=mparheuristic("mlpe",n=5),method=c("kfold",2,123),metric="MAE"),
         feature="sabs")
print(mmetric(M,metric="MAE"))
print(M$mpar)
print("median hidden nodes (size) and number of MLPs (nr):")
print(centralpar(M$mpar))
print("attributes used by the model in each run:")
print(M$attributes)
mgraph(M,graph="RSC",Grid=10,main="sin1 MLPE scatter plot")
mgraph(M,graph="REP",Grid=10,main="sin1 MLPE scatter plot",sort=FALSE)
mgraph(M,graph="REC",Grid=10,main="sin1 MLPE REC")
mgraph(M,graph="IMP",Grid=10,main="input importances",xval=0.1,leg=names(sin1reg))
# average influence of x1 on the model:
mgraph(M,graph="VEC",Grid=10,main="x1 VEC curve",xval=1,leg=names(sin1reg)[1])

## End(Not run)

### regression example with holdout rolling windows:
## Not run: 
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# rolling with 20 test samples, training window size of 300 and increment of 50 in each run:
# note that Runs argument is automatically set to 14 in this example:
M=mining(y~.,data=sin1reg,method=c("holdoutrol",20,300,50),
         model="mlpe",debug=TRUE)

## End(Not run)

### regression example with all rminer models:
## Not run: 
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","mr","mars",
         "cubist","pcr","plsr","cppls","rvm")
for(model in models)
{ 
 M=mining(y~.,data=sin1reg,method=c("holdout",2/3,12345),model=model)
 cat("model:",model,"MAE:",round(mmetric(M,metric="MAE")$MAE,digits=3),"\n")
}

## End(Not run)

### classification example (task="prob")
## Not run: 
data(iris)
# 10 runs of a 3-fold cross validation with fixed seed 123 for generating the 3-fold runs
M=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="rpart")
print(mmetric(M,metric="CONF"))
print(mmetric(M,metric="AUC"))
print(meanint(mmetric(M,metric="AUC")))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
       main="versicolor ROC")
mgraph(M,graph="LIFT",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
       main="Versicolor ROC")
M2=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="ksvm")
L=vector("list",2)
L[[1]]=M;L[[2]]=M2
mgraph(L,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg=c("DT","SVM"),main="ROC")

## End(Not run)

### other classification examples
## Not run: 
### 1st example:
data(iris)
# 2 runs of an external 2-fold validation, random seed
# model selection: SVM model with rbfdot kernel, automatic search for sigma,
#                  internal 3-fold validation, random seed, minimum "AUC" is assumed
# feature selection: none, "s" is used only to store input importance values
M=mining(Species~.,data=iris,Runs=2,method=c("kfold",2,NA),model="ksvm",
         search=list(search=mparheuristic("ksvm"),method=c("kfold",3)),feature="s")

print(mmetric(M,metric="AUC",TC=2))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="SVM",main="ROC",intbar=FALSE)
mgraph(M,graph="IMP",TC=2,Grid=10,main="input importances",xval=0.1,
leg=names(iris),axis=1)
mgraph(M,graph="VEC",TC=2,Grid=10,main="Petal.Width VEC curve",
data=iris,xval=4)
### 2nd example, ordered kfold, k-nearest neigbor:
M=mining(Species~.,iris,Runs=1,method=c("kfoldo",3),model="knn")
# confusion matrix:
print(mmetric(M,metric="CONF"))

### 3rd example, use of all rminer models: 
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","bagging",
         "boosting","lda","multinom","naiveBayes","qda")
for(model in models)
{ 
 M=mining(Species~.,iris,Runs=1,method=c("kfold",3,123),model=model)
 cat("model:",model,"ACC:",round(mmetric(M,metric="ACC")$ACC,digits=1),"\n")
}

## End(Not run)

### multiple models: automl or ensembles 
## Not run: 

data(iris)
d=iris
names(d)[ncol(d)]="y" # change output name
inputs=ncol(d)-1
metric="AUC"

# simple automl (1 search per individual model),
# internal holdout and external holdout:
sm=mparheuristic(model="automl",n=NA,task="prob",inputs=inputs)
mode="auto"

imethod=c("holdout",4/5,123) # internal validation method
emethod=c("holdout",2/3,567) # external validation method

search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# 1 single model was selected:
cat("best",emethod[1],"selected model:",M$object@model,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")

# simple automl (1 search per individual model),
# internal kfold and external kfold: 
imethod=c("kfold",3,123) # internal validation method
emethod=c("kfold",5,567) # external validation method
search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# kfold models were selected:
kfolds=as.numeric(emethod[2])
models=vector(length=kfolds)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")

# example with weighted ensemble:
M=mining(y~.,data=d,model="WE",search=search,method=emethod,fdebug=TRUE)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")


## End(Not run)


### for more fitting examples check the help of function fit: help(fit,package="rminer")
### dontrun is used when the execution of the example requires some computational effort.

### simple regression example
set.seed(123); x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
# mining with an ensemble of neural networks, each fixed with size=2 hidden nodes
# assumes a default holdout (random split) with 2/3 for training and 1/3 for testing:
M=mining(y~x1+x2,Runs=2,model="mlpe",search=2)
print(M)
print(mmetric(M,metric="MAE"))

### more regression examples:
## Not run: 
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# 5 runs of an external holdout with 2/3 for training and 1/3 for testing, fixed seed 12345
# feature selection: sabs method
# model selection: 5 searches for size, internal 2-fold cross validation fixed seed 123
#                  with optimization for minimum MAE metric 
M=mining(y~.,data=sin1reg,Runs=5,method=c("holdout",2/3,12345),model="mlpe",
         search=list(search=mparheuristic("mlpe",n=5),method=c("kfold",2,123),metric="MAE"),
         feature="sabs")
print(mmetric(M,metric="MAE"))
print(M$mpar)
print("median hidden nodes (size) and number of MLPs (nr):")
print(centralpar(M$mpar))
print("attributes used by the model in each run:")
print(M$attributes)
mgraph(M,graph="RSC",Grid=10,main="sin1 MLPE scatter plot")
mgraph(M,graph="REP",Grid=10,main="sin1 MLPE scatter plot",sort=FALSE)
mgraph(M,graph="REC",Grid=10,main="sin1 MLPE REC")
mgraph(M,graph="IMP",Grid=10,main="input importances",xval=0.1,leg=names(sin1reg))
# average influence of x1 on the model:
mgraph(M,graph="VEC",Grid=10,main="x1 VEC curve",xval=1,leg=names(sin1reg)[1])

## End(Not run)

### regression example with holdout rolling windows:
## Not run: 
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# rolling with 20 test samples, training window size of 300 and increment of 50 in each run:
# note that Runs argument is automatically set to 14 in this example:
M=mining(y~.,data=sin1reg,method=c("holdoutrol",20,300,50),
         model="mlpe",debug=TRUE)

## End(Not run)

### regression example with all rminer models:
## Not run: 
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","mr","mars",
         "cubist","pcr","plsr","cppls","rvm")
for(model in models)
{ 
 M=mining(y~.,data=sin1reg,method=c("holdout",2/3,12345),model=model)
 cat("model:",model,"MAE:",round(mmetric(M,metric="MAE")$MAE,digits=3),"\n")
}

## End(Not run)

### classification example (task="prob")
## Not run: 
data(iris)
# 10 runs of a 3-fold cross validation with fixed seed 123 for generating the 3-fold runs
M=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="rpart")
print(mmetric(M,metric="CONF"))
print(mmetric(M,metric="AUC"))
print(meanint(mmetric(M,metric="AUC")))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
       main="versicolor ROC")
mgraph(M,graph="LIFT",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
       main="Versicolor ROC")
M2=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="ksvm")
L=vector("list",2)
L[[1]]=M;L[[2]]=M2
mgraph(L,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg=c("DT","SVM"),main="ROC")

## End(Not run)

### other classification examples
## Not run: 
### 1st example:
data(iris)
# 2 runs of an external 2-fold validation, random seed
# model selection: SVM model with rbfdot kernel, automatic search for sigma,
#                  internal 3-fold validation, random seed, minimum "AUC" is assumed
# feature selection: none, "s" is used only to store input importance values
M=mining(Species~.,data=iris,Runs=2,method=c("kfold",2,NA),model="ksvm",
         search=list(search=mparheuristic("ksvm"),method=c("kfold",3)),feature="s")

print(mmetric(M,metric="AUC",TC=2))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="SVM",main="ROC",intbar=FALSE)
mgraph(M,graph="IMP",TC=2,Grid=10,main="input importances",xval=0.1,
leg=names(iris),axis=1)
mgraph(M,graph="VEC",TC=2,Grid=10,main="Petal.Width VEC curve",
data=iris,xval=4)
### 2nd example, ordered kfold, k-nearest neigbor:
M=mining(Species~.,iris,Runs=1,method=c("kfoldo",3),model="knn")
# confusion matrix:
print(mmetric(M,metric="CONF"))

### 3rd example, use of all rminer models: 
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","bagging",
         "boosting","lda","multinom","naiveBayes","qda")
for(model in models)
{ 
 M=mining(Species~.,iris,Runs=1,method=c("kfold",3,123),model=model)
 cat("model:",model,"ACC:",round(mmetric(M,metric="ACC")$ACC,digits=1),"\n")
}

## End(Not run)

### multiple models: automl or ensembles 
## Not run: 

data(iris)
d=iris
names(d)[ncol(d)]="y" # change output name
inputs=ncol(d)-1
metric="AUC"

# simple automl (1 search per individual model),
# internal holdout and external holdout:
sm=mparheuristic(model="automl",n=NA,task="prob",inputs=inputs)
mode="auto"

imethod=c("holdout",4/5,123) # internal validation method
emethod=c("holdout",2/3,567) # external validation method

search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# 1 single model was selected:
cat("best",emethod[1],"selected model:",M$object@model,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")

# simple automl (1 search per individual model),
# internal kfold and external kfold: 
imethod=c("kfold",3,123) # internal validation method
emethod=c("kfold",5,567) # external validation method
search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# kfold models were selected:
kfolds=as.numeric(emethod[2])
models=vector(length=kfolds)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")

# example with weighted ensemble:
M=mining(y~.,data=d,model="WE",search=search,method=emethod,fdebug=TRUE)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")


## End(Not run)


### for more fitting examples check the help of function fit: help(fit,package="rminer")

Compute classification or regression error metrics.

Description

Compute classification or regression error metrics.

Usage

mmetric(y, x = NULL, metric, D = 0.5, TC = -1, val = NULL, aggregate = "no")mmetric(y, x = NULL, metric, D = 0.5, TC = -1, val = NULL, aggregate = "no")

Arguments

`y`	if there are predictions (`!is.null(x)`), `y` should be a numeric vector or factor with the target desired responses (or output values). Else, `y` should be a list returned by the `mining` function.
`x`	the predictions (should be a numeric vector if `task="reg"`, matrix if `task="prob"` or factor if `task="class"` (used if `y` is not a list).
`metric`	a R function or a character. Note: if a R function, then it should be set to provide lower values for better models if the intention is to be used within the `search` argument of `fit` and `mining` (i.e., "<" meaning). Valid character options are (">" means "better" if higher value; "<" means "better" if lower value): `ALL` – returns all classification or regression metrics (context dependent, multi-metric). if vector – returns all metrics included in the vector, vector elements can be any of the options below (multi-metric). `CONF` – confusion matrix (classification, matrix). `ACC` – classification accuracy rate, equal to micro averaged F1 score (classification, ">", [0-%100]). `macroACC` – macro average ACC score, for multiclass tasks (classification, ">", [0-%100]). `weightedACC` – weighted average ACC score, for multiclass tasks (classification, ">", [0-%100]). `CE` – classification error or misclassification error rate (classification, "<", [0-%100]). `MAEO` – mean absolute error for ordinal classification (classification, "<", [0-Inf[). `MSEO` – mean squared error for ordinal classification (classification, "<", [0-Inf[). `KENDALL` – Kendalls's coefficient for ordinal classification or (mean if) ranking (classification, ">", [-1;1]). Note: if ranking, `y` is a matrix and mean metric is computed. `SPEARMAN` – Mean Spearman's rho coefficient for ranking (classification, ">", [-1;1]). Note: if ranking, `y` is a matrix and mean metric is computed. `BER` – balanced error rate (classification, "<", [0-%100]). `KAPPA` – kappa index (classification, "<", [0-%100]). `CRAMERV` – Cramer's V (classification, ">", [0,1.0]). `ACCLASS` – classification accuracy rate per class (classification, ">", [0-%100]). `BAL_ACC` – balanced accuracy rate per class (classification, ">", [0-%100]). `TPR` – true positive rate, sensitivity or recall (classification, ">", [0-%100]). `macroTPR` – macro average TPR score, for multiclass tasks (classification, ">", [0-%100]). `weightedTPR` – weighted average TPR score, for multiclass tasks (classification, ">", [0-%100]). `TNR` – true negative rate or specificity (classification, ">", [0-%100]). `macroTNR` – macro average TNR score, for multiclass tasks (classification, ">", [0-%100]). `weightedTNR` – weighted average TNR score, for multiclass tasks (classification, ">", [0-%100]). `microTNR` – micro average TNR score, for multiclass tasks (classification, ">", [0-%100]). `PRECISION` – precision (classification, ">", [0-%100]). `macroPRECISION` – macro average precision, for multiclass tasks (classification, ">", [0-%100]). `weightedPRECISION` – weighted average precision, for multiclass tasks (classification, ">", [0-%100]). `F1` – F1 score (classification, ">", [0-%100]). `macroF1` – macro average F1 score, for multiclass tasks (classification, ">", [0-%100]). `weightedF1` – weighted average F1 score, for multiclass tasks (classification, ">", [0-%100]). `MCC` – Matthews correlation coefficient (classification, ">", [-1,1]). `BRIER` – overall Brier score (classification "prob", "<", [0,1.0]). `BRIERCLASS` – Brier score per class (classification "prob", "<", [0,1.0]). `ROC` – Receiver Operating Characteristic curve (classification "prob", list with several components). `AUC` – overall area under the curve (of ROC curve, classification "prob", ">", domain values: [0,1.0]). `AUCCLASS` – area under the curve per class (of ROC curve, classification "prob", ">", domain values: [0,1.0]). `NAUC` – normalized AUC (given a fixed `val=`FPR, classification "prob", ">", [0,1.0]). `TPRATFPR` – the TPR (given a fixed `val=`FPR, classification "prob", ">", [0,1.0]). `LIFT` – accumulative percent of responses captured (LIFT accumulative curve, classification "prob", list with several components). `ALIFT` – area of the accumulative percent of responses captured (LIFT accumulative curve, classification "prob", ">", [0,1.0]). `NALIFT` – normalized ALIFT (given a fixed `val=`percentage of examples, classification "prob", ">", [0,1.0]). `ALIFTATPERC` – ALIFT value (given a fixed `val=`percentage of examples, classification "prob", ">", [0,1.0]). `SAE` – sum absolute error/deviation (regression, "<", [0,Inf[). `MAE` – mean absolute error (regression, "<", [0,Inf[). `MdAE` – median absolute error (regression, "<", [0,Inf[). `GMAE` – geometric mean absolute error (regression, "<", [0,Inf[). `MaxAE` – maximum absolute error (regression, "<", [0,Inf[). `NMAE` – normalized mean absolute error (regression, "<", [0%,Inf[). Note: by default, this metric assumes the range of `y` as the denominator of `NMAE`; a different range can be set by setting the optional `val` argument (see example). `RAE` – relative absolute error (regression, "<", [0%,Inf[). `SSE` – sum squared error (regression, "<", [0,Inf[). `MSE` – mean squared error (regression, "<", [0,Inf[). `MdSE` – median squared error (regression, "<", [0,Inf[). `RMSE` – root mean squared error (regression, "<", [0,Inf[). `GMSE` – geometric mean squared error (regression, "<", [0,Inf[). `HRMSE` – Heteroscedasticity consistent root mean squared error (regression, "<", [0,Inf[). `RSE` – relative squared error (regression, "<", [0%,Inf[). `RRSE` – root relative squared error (regression, "<", [0%,Inf[). `ME` – mean error (regression, "<", [0,Inf[). `SMinkowski3` – sum of Minkowski loss function (q=3, heavier penalty for large errors when compared with SSE, regression, "<", [0%,Inf[). `MMinkowski3` – mean of Minkowski loss function (q=3, heavier penalty for large errors when compared with SSE, regression, "<", [0%,Inf[). `MdMinkowski3` – median of Minkowski loss function (q=3, heavier penalty for large errors when compared with SSE, regression, "<", [0%,Inf[). `COR` – Pearson correlation (regression, ">", [-1,1]). `q2` – =1-correlation^2 test error metric, as used by M.J. Embrechts (regression, "<", [0,1.0]). `R2` – coefficient of determination R^2 (regression, ">", squared pearson correlation coefficient: [0,1]). `R22` – 2nd variant of coefficient of determination R^2 (regression, ">", most general definition that however can lead to negative values: ]-Inf,1]. In previous rminer versions, this variant was known as "R2"). `EV` – explained variance, 1 - var(y-x)/var(y) (regression, ">", ]-Inf,1]). `Q2` – R^2/SD test error metric, as used by M.J. Embrechts (regression, "<", [0,Inf[). `REC` – Regression Error Characteristic curve (regression, list with several components). `NAREC` – normalized REC area (given a fixed `val=`tolerance, regression, ">", [0,1.0]). `TOLERANCE` – the tolerance (y-axis value) of a REC curve given a fixed `val=`tolerance value, regression, ">", [0,1.0]). `TOLERANCEPERC` – the tolerance (y-axis value) of a REC curve given a percentage `val=` value (in terms of y range), regression, ">", [0,1.0]). `MAPE` – Mean Absolute Percentage mmetric forecasting metric (regression, "<", [0%,Inf[). `MdAPE` – Median Absolute Percentage mmetric forecasting metric (regression, "<"), [0%,Inf[). `RMSPE` – Root Mean Square Percentage mmetric forecasting metric (regression, "<", [0%,Inf[). `RMdSPE` – Root Median Square Percentage mmetric forecasting metric (regression, "<", [0%,Inf[). `SMAPE` – Symmetric Mean Absolute Percentage mmetric forecasting metric (regression, "<", [0%,200%]). `SMdAPE` – Symmetric Median Absolute Percentage mmetric forecasting metric (regression, "<", [0%,200%]). `MRAE` – Mean Relative Absolute mmetric forecasting metric (`val` should contain the last in-sample/training data value (for random walk) or full benchmark time series related with out-of-sample values, regression, "<", [0,Inf[). `MdRAE` – Median Relative Absolute mmetric forecasting metric (`val` should contain the last in-sample/training data value (for random walk) or full benchmark time series, regression, "<", [0,Inf[). `GMRAE` – Geometric Mean Relative Absoluate mmetric forecasting metric (`val` should contain the last in-sample/training data value (for random walk) or full benchmark time series, regression, "<", [0,Inf[). `THEILSU2` – Theils'U2 forecasting metric (`val` should contain the last in-sample/training data value (for random walk) or full benchmark time series, regression, "<", [0,Inf[). `MASE` – MASE forecasting metric (`val` should contain the time series in-samples or training data, regression, "<", [0,Inf[).
`D`	decision threshold (for `task="prob"`, probabilistic classification) within [0,1]. The class is TRUE if prob>D.
`TC`	target class index or vector of indexes (for multi-class classification class) from 1 to Nc, where Nc is the number of classes:<cr> if `TC==-1` (the default value), then it is assumed: if `metric` is "CONF" – `D` is ignored and highest probability class is assumed (if `TC>0`, the metric is computed for positive `TC` class and `D` is used). if `metric` is "ACC", "CE", "BER", "KAPPA", "CRAMERV", "BRIER", or "AUC" – the global metric (for all classes) is computed (if `TC>0`, the metric is computed for positive `TC` class). if `metric` is "ACCLASS", "TPR", "TNR", "Precision", "F1", "MCC", "ROC", "BRIERCLASS", "AUCCLASS" – it returns one result per class (if `TC>0`, it returns negative (e.g. "TPR1") and positive (TC, e.g. "TPR2") result). if `metric` is "NAUC", "TPRATFPR", "LIFT", "ALIFT", "NALIFT" or "ALIFTATPERC" – TC is set to the index of the last class.
`val`	auxiliary value: when two or more metrics need different `val` values, then `val` should be a vector list, see example. if numeric or vector – check the `metric` argument for specific details of each metric val meaning.
`aggregate`	character with type of aggregation performed when y is a `mining` list. Valid options are: no – returns all metrics for all `mining` runs. If `metric` includes "CONF", "ROC", "LIFT" or "REC", it returns a vector list, else if `metric` includes a single metric, it returns a vector; else it retuns a data.frame (runs x metrics). sum – sums all run results. mean – averages all run results. note: both "sum" and "mean" only work if only `metric=="CONF"` is used or if `metric` does not contain "ROC", "LIFT" or "REC".

Details

Compute classification or regression error metrics:

mmetric – compute one or more classification/regression metrics given y and x OR a mining list.
metrics – deprecated function, same as mmetric(x,y,metric="ALL"), included here just for compatability purposes but will be removed from the package.

Value

Returns the computed error metric(s):

one value if only one metric is requested (and y is not a mining list);
named vector if 2 or more elements are requested in metric (and y is not a mining list);
list if there is a "CONF", "ROC", "LIFT" or "REC" request on metric (other metrics are stored in field $res, and y is not a mining list).
if y is a mining list then there can be several runs, thus:
- a vector list of size y$runs is returned if metric includes "CONF", "ROC", "LIFT" or "REC" and aggregate="no";
- a data.frame is returned if aggregate="no" and metric does not include "CONF", "ROC", "LIFT" or "REC";
- a table is returned if aggregate="sum" or "mean" and metric="CONF";
- a vector or numeric value is returned if aggregate="sum" or "mean" and metric is not "CONF".

Note

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

To check for more details about rminer and for citation purposes:
P. Cortez.
Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool.
In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1.
@Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44
http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210
About the Brier and Global AUC scores:
A. Silva, P. Cortez, M.F. Santos, L. Gomes and J. Neves.
Rating Organ Failure via Adverse Events using Data Mining in the Intensive Care Unit.
In Artificial Intelligence in Medicine, Elsevier, 43 (3): 179-193, 2008.
doi:10.1016/j.artmed.2008.03.010
About the classification and regression metrics:
I. Witten and E. Frank.
Data Mining: Practical machine learning tools and techniques.
Morgan Kaufmann, 2005.
About the forecasting metrics:
R. Hyndman and A. Koehler
Another look at measures of forecast accuracy.
In International Journal of Forecasting, 22(4):679-688, 2006.
About the ordinal classification metrics:
J.S. Cardoso and R. Sousa.
Measuring the Performance of Ordinal Classification.
In International Journal of Pattern Recognition and Artificial Intelligence, 25(8):1173-1195, 2011.

Examples

### pure binary classification 
y=factor(c("a","a","a","a","b","b","b","b"))
x=factor(c("a","a","b","a","b","a","b","a"))
print(mmetric(y,x,"CONF")$conf)
print(mmetric(y,x,metric=c("ACC","TPR","ACCLASS")))
print(mmetric(y,x,"ALL"))

### probabilities binary classification 
y=factor(c("a","a","a","a","b","b","b","b"))
px=matrix(nrow=8,ncol=2)
px[,1]=c(1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3)
px[,2]=1-px[,1]
print(px)
print(mmetric(y,px,"CONF")$conf)
print(mmetric(y,px,"CONF",D=0.5,TC=2)$conf)
print(mmetric(y,px,"CONF",D=0.3,TC=2)$conf)
print(mmetric(y,px,metric="ALL",D=0.3,TC=2))
print(mmetric(y,px,metric=c("ACC","AUC","AUCCLASS","BRIER","BRIERCLASS","CE"),D=0.3,TC=2))
# ACC and confusion matrix:
print(mmetric(y,px,metric=c("ACC","CONF"),D=0.3,TC=2))
# ACC and ROC curve:
print(mmetric(y,px,metric=c("ACC","ROC"),D=0.3,TC=2))
# ACC, ROC and LIFT curve:
print(mmetric(y,px,metric=c("ACC","ROC","LIFT"),D=0.3,TC=2))

### pure multi-class classification 
y=c('A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
'A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
'A','A','B','B','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C','C','C',
'C','C','C','C','C','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D',
'D','D','D','D','D','D','D','D','E','E','E','E','E')
x=c('A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
'A','A','A','A','A','A','A','A','A','A','A','A','A','A','E','E','E','E','E','D','D','D',
'D','D','B','B','B','B','B','B','B','B','B','D','C','C','C','C','C','C','C','B','B','B',
'B','B','C','C','C','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D',
'D','D','D','D','D','D','C','C','E','A','A','B','B')
y=factor(y)
x=factor(x)
print(mmetric(y,x,metric="CONF")$conf) # confusion matrix
print(mmetric(y,x,metric="CONF",TC=-1)$conf) # same thing
print(mmetric(y,x,metric="CONF",TC=1)$conf) # for target class TC=1: "A"
mshow=function(y,x,metric) print(round(mmetric(y,x,metric),digits=0))
mshow(y,x,"ALL")
mshow(y,x,c("ACCLASS","BAL_ACC","KAPPA"))
mshow(y,x,c("PRECISION")) # precision
mshow(y,x,c("TPR")) # recall 
mshow(y,x,c("F1")) # F1 score

# micro (=ACC), macro and weighted average:
mshow(y,x,c("ACC","macroPRECISION","weightedPRECISION")) 
mshow(y,x,c("ACC","macroTPR","weightedTPR")) 
mshow(y,x,c("ACC","macroF1","weightedF1"))
mshow(y,x,c("ACC","macroACC","weightedACC"))

# several metrics in a single returned object:
print(mmetric(y,x,metric=c("CONF","macroF1","weightedF1","ACC")))

### probabilities multi-class 
y=factor(c("a","a","b","b","c","c"))
px=matrix(nrow=6,ncol=3)
px[,1]=c(1.0,0.7,0.5,0.3,0.1,0.7)
px[,2]=c(0.0,0.2,0.4,0.7,0.3,0.2)
px[,3]=1-px[,1]-px[,2]
print(px)
print(mmetric(y,px,metric="ALL",TC=-1,val=0.1))
print(mmetric(y,px,metric=c("AUC","AUCCLASS","NAUC"),TC=-1,val=0.1))
print(mmetric(y,px,metric=c("AUC","NAUC"),TC=3,val=0.1))
print(mmetric(y,px,metric=c("ACC","ACCLASS"),TC=-1))
print(mmetric(y,px,metric=c("CONF"),TC=3,D=0.5)$conf)
print(mmetric(y,px,metric=c("ACCLASS"),TC=3,D=0.5))
print(mmetric(y,px,metric=c("CONF"),TC=3,D=0.7)$conf)
print(mmetric(y,px,metric=c("ACCLASS"),TC=3,D=0.7))

### ordinal multi-class (example in Ricardo Sousa PhD thesis 2012)
y=ordered(c(rep("a",4),rep("b",6),rep("d",3)),levels=c("a","b","c","d"))
x=ordered(c(rep("c",(4+6)),rep("d",3)),levels=c("a","b","c","d"))
print(mmetric(y,x,metric="CONF")$conf)
print(mmetric(y,x,metric=c("CE","MAEO","MSEO","KENDALL")))
# note: only y needs to be ordered
x=factor(c(rep("b",4),rep("a",6),rep("d",3)),levels=c("a","b","c","d"))
print(mmetric(y,x,metric="CONF")$conf)
print(mmetric(y,x,metric=c("CE","MAEO","MSEO","KENDALL")))
print(mmetric(y,x,metric="ALL"))

### ranking (multi-class) 
y=matrix(nrow=1,ncol=12);x=y
# http://www.youtube.com/watch?v=D56dvoVrBBE
y[1,]=1:12
x[1,]=c(2,1,4,3,6,5,8,7,10,9,12,11)
print(mmetric(y,x,metric="KENDALL"))
print(mmetric(y,x,metric="ALL"))

y=matrix(nrow=2,ncol=7);x=y
y[1,]=c(2,6,5,4,3,7,1)
y[2,]=7:1
x[1,]=1:7
x[2,]=1:7
print(mmetric(y,x,metric="ALL"))

### regression examples: y - desired values; x - predictions
y=c(95.01,96.1,97.2,98.0,99.3,99.7);x=95:100
print(mmetric(y,x,"ALL"))
print(mmetric(y,x,"MAE"))
mshow=function(y,x,metric) print(round(mmetric(y,x,metric),digits=2))
mshow(y,x,c("MAE","RMSE","RAE","RSE"))
# getting NMAE:
m=mmetric(y,x,"NMAE")
cat("NMAE:",round(m,digits=2)," (denominator=",diff(range(y)),")\n")
m=mmetric(y,x,"NMAE",val=5) # usage of different range
cat("NMAE:",round(m,digits=2)," (denominator=",5,")\n")
# get REC curve and other measures:
m=mmetric(y,x,c("REC","TOLERANCEPERC","MAE"),val=5)
print(m)

# correlation or similar measures:
mshow(y,x,c("COR","R2","R22","EV")) # ideal is close to 1
mshow(y,x,c("q2","Q2")) # ideal is close to 0
# other measures:
print(mmetric(y,x,c("TOLERANCE","NAREC"),val=0.5)) # if admitted/accepted absolute error is 0.5
print(mmetric(y,x,"TOLERANCEPERC",val=0.05)) # tolerance for a 5% of yrange 
# tolerance for fixed 0.1 value and 5% of yrange:
print(mmetric(y,x,c("TOLERANCE","TOLERANCEPERC"),val=c(0.1,0.05))) 
print(mmetric(y,x,"THEILSU2",val=94.1)) # val = 1-ahead random walk, c(y,94.1), same as below
print(mmetric(y,x,"THEILSU2",val=c(94.1,y[1:5]))) # val = 1-ahead random walk (previous y values)
print(mmetric(y,x,"MASE",val=c(88.1,89.9,93.2,94.1))) # val = in-samples
val=vector("list",length=4)
val[[2]]=0.5;val[[3]]=94.1;val[[4]]=c(88.1,89.9,93.2,94.1)
print(mmetric(y,x,c("MAE","NAREC","THEILSU2","MASE"),val=val))
# user defined error function example:
# myerror = number of samples with absolute error above 0.1% of y: 
myerror=function(y,x){return (sum(abs(y-x)>(0.001*y)))}
print(mmetric(y,x,metric=myerror))
# example that returns a list since "REC" is included:
print(mmetric(y,x,c("MAE","REC","TOLERANCE","EV"),val=1))


### mining, several runs, prob multi-class
## Not run: 
data(iris)
M=mining(Species~.,iris,model="rpart",Runs=2)
R=mmetric(M,metric="CONF",aggregate="no")
print(R[[1]]$conf)
print(R[[2]]$conf)
print(mmetric(M,metric="CONF",aggregate="mean"))
print(mmetric(M,metric="CONF",aggregate="sum"))
print(mmetric(M,metric=c("ACC","ACCLASS"),aggregate="no"))
print(mmetric(M,metric=c("ACC","ACCLASS"),aggregate="mean"))
print(mmetric(M,metric="ALL",aggregate="no"))
print(mmetric(M,metric="ALL",aggregate="mean"))

## End(Not run)

### mining, several runs, regression
## Not run: 
data(sin1reg)
S=sample(1:nrow(sin1reg),40)
M=mining(y~.,data=sin1reg[S,],model="ksvm",search=2^3,Runs=10)
R=mmetric(M,metric="MAE")
print(mmetric(M,metric="MAE",aggregate="mean"))
miR=meanint(R) # mean and t-student confidence intervals
cat("MAE=",round(miR$mean,digits=2),"+-",round(miR$int,digits=2),"\n")
print(mmetric(M,metric=c("MAE","RMSE")))
print(mmetric(M,metric=c("MAE","RMSE"),aggregate="mean"))
R=mmetric(M,metric="REC",aggregate="no")
print(R[[1]]$rec)
print(mmetric(M,metric=c("TOLERANCE","NAREC"),val=0.2))
print(mmetric(M,metric=c("TOLERANCE","NAREC"),val=0.2,aggregate="mean"))

## End(Not run)

### pure binary classification 
y=factor(c("a","a","a","a","b","b","b","b"))
x=factor(c("a","a","b","a","b","a","b","a"))
print(mmetric(y,x,"CONF")$conf)
print(mmetric(y,x,metric=c("ACC","TPR","ACCLASS")))
print(mmetric(y,x,"ALL"))

### probabilities binary classification 
y=factor(c("a","a","a","a","b","b","b","b"))
px=matrix(nrow=8,ncol=2)
px[,1]=c(1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3)
px[,2]=1-px[,1]
print(px)
print(mmetric(y,px,"CONF")$conf)
print(mmetric(y,px,"CONF",D=0.5,TC=2)$conf)
print(mmetric(y,px,"CONF",D=0.3,TC=2)$conf)
print(mmetric(y,px,metric="ALL",D=0.3,TC=2))
print(mmetric(y,px,metric=c("ACC","AUC","AUCCLASS","BRIER","BRIERCLASS","CE"),D=0.3,TC=2))
# ACC and confusion matrix:
print(mmetric(y,px,metric=c("ACC","CONF"),D=0.3,TC=2))
# ACC and ROC curve:
print(mmetric(y,px,metric=c("ACC","ROC"),D=0.3,TC=2))
# ACC, ROC and LIFT curve:
print(mmetric(y,px,metric=c("ACC","ROC","LIFT"),D=0.3,TC=2))

### pure multi-class classification 
y=c('A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
'A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
'A','A','B','B','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C','C','C',
'C','C','C','C','C','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D',
'D','D','D','D','D','D','D','D','E','E','E','E','E')
x=c('A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
'A','A','A','A','A','A','A','A','A','A','A','A','A','A','E','E','E','E','E','D','D','D',
'D','D','B','B','B','B','B','B','B','B','B','D','C','C','C','C','C','C','C','B','B','B',
'B','B','C','C','C','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D','D',
'D','D','D','D','D','D','C','C','E','A','A','B','B')
y=factor(y)
x=factor(x)
print(mmetric(y,x,metric="CONF")$conf) # confusion matrix
print(mmetric(y,x,metric="CONF",TC=-1)$conf) # same thing
print(mmetric(y,x,metric="CONF",TC=1)$conf) # for target class TC=1: "A"
mshow=function(y,x,metric) print(round(mmetric(y,x,metric),digits=0))
mshow(y,x,"ALL")
mshow(y,x,c("ACCLASS","BAL_ACC","KAPPA"))
mshow(y,x,c("PRECISION")) # precision
mshow(y,x,c("TPR")) # recall 
mshow(y,x,c("F1")) # F1 score

# micro (=ACC), macro and weighted average:
mshow(y,x,c("ACC","macroPRECISION","weightedPRECISION")) 
mshow(y,x,c("ACC","macroTPR","weightedTPR")) 
mshow(y,x,c("ACC","macroF1","weightedF1"))
mshow(y,x,c("ACC","macroACC","weightedACC"))

# several metrics in a single returned object:
print(mmetric(y,x,metric=c("CONF","macroF1","weightedF1","ACC")))

### probabilities multi-class 
y=factor(c("a","a","b","b","c","c"))
px=matrix(nrow=6,ncol=3)
px[,1]=c(1.0,0.7,0.5,0.3,0.1,0.7)
px[,2]=c(0.0,0.2,0.4,0.7,0.3,0.2)
px[,3]=1-px[,1]-px[,2]
print(px)
print(mmetric(y,px,metric="ALL",TC=-1,val=0.1))
print(mmetric(y,px,metric=c("AUC","AUCCLASS","NAUC"),TC=-1,val=0.1))
print(mmetric(y,px,metric=c("AUC","NAUC"),TC=3,val=0.1))
print(mmetric(y,px,metric=c("ACC","ACCLASS"),TC=-1))
print(mmetric(y,px,metric=c("CONF"),TC=3,D=0.5)$conf)
print(mmetric(y,px,metric=c("ACCLASS"),TC=3,D=0.5))
print(mmetric(y,px,metric=c("CONF"),TC=3,D=0.7)$conf)
print(mmetric(y,px,metric=c("ACCLASS"),TC=3,D=0.7))

### ordinal multi-class (example in Ricardo Sousa PhD thesis 2012)
y=ordered(c(rep("a",4),rep("b",6),rep("d",3)),levels=c("a","b","c","d"))
x=ordered(c(rep("c",(4+6)),rep("d",3)),levels=c("a","b","c","d"))
print(mmetric(y,x,metric="CONF")$conf)
print(mmetric(y,x,metric=c("CE","MAEO","MSEO","KENDALL")))
# note: only y needs to be ordered
x=factor(c(rep("b",4),rep("a",6),rep("d",3)),levels=c("a","b","c","d"))
print(mmetric(y,x,metric="CONF")$conf)
print(mmetric(y,x,metric=c("CE","MAEO","MSEO","KENDALL")))
print(mmetric(y,x,metric="ALL"))

### ranking (multi-class) 
y=matrix(nrow=1,ncol=12);x=y
# http://www.youtube.com/watch?v=D56dvoVrBBE
y[1,]=1:12
x[1,]=c(2,1,4,3,6,5,8,7,10,9,12,11)
print(mmetric(y,x,metric="KENDALL"))
print(mmetric(y,x,metric="ALL"))

y=matrix(nrow=2,ncol=7);x=y
y[1,]=c(2,6,5,4,3,7,1)
y[2,]=7:1
x[1,]=1:7
x[2,]=1:7
print(mmetric(y,x,metric="ALL"))

### regression examples: y - desired values; x - predictions
y=c(95.01,96.1,97.2,98.0,99.3,99.7);x=95:100
print(mmetric(y,x,"ALL"))
print(mmetric(y,x,"MAE"))
mshow=function(y,x,metric) print(round(mmetric(y,x,metric),digits=2))
mshow(y,x,c("MAE","RMSE","RAE","RSE"))
# getting NMAE:
m=mmetric(y,x,"NMAE")
cat("NMAE:",round(m,digits=2)," (denominator=",diff(range(y)),")\n")
m=mmetric(y,x,"NMAE",val=5) # usage of different range
cat("NMAE:",round(m,digits=2)," (denominator=",5,")\n")
# get REC curve and other measures:
m=mmetric(y,x,c("REC","TOLERANCEPERC","MAE"),val=5)
print(m)

# correlation or similar measures:
mshow(y,x,c("COR","R2","R22","EV")) # ideal is close to 1
mshow(y,x,c("q2","Q2")) # ideal is close to 0
# other measures:
print(mmetric(y,x,c("TOLERANCE","NAREC"),val=0.5)) # if admitted/accepted absolute error is 0.5
print(mmetric(y,x,"TOLERANCEPERC",val=0.05)) # tolerance for a 5% of yrange 
# tolerance for fixed 0.1 value and 5% of yrange:
print(mmetric(y,x,c("TOLERANCE","TOLERANCEPERC"),val=c(0.1,0.05))) 
print(mmetric(y,x,"THEILSU2",val=94.1)) # val = 1-ahead random walk, c(y,94.1), same as below
print(mmetric(y,x,"THEILSU2",val=c(94.1,y[1:5]))) # val = 1-ahead random walk (previous y values)
print(mmetric(y,x,"MASE",val=c(88.1,89.9,93.2,94.1))) # val = in-samples
val=vector("list",length=4)
val[[2]]=0.5;val[[3]]=94.1;val[[4]]=c(88.1,89.9,93.2,94.1)
print(mmetric(y,x,c("MAE","NAREC","THEILSU2","MASE"),val=val))
# user defined error function example:
# myerror = number of samples with absolute error above 0.1% of y: 
myerror=function(y,x){return (sum(abs(y-x)>(0.001*y)))}
print(mmetric(y,x,metric=myerror))
# example that returns a list since "REC" is included:
print(mmetric(y,x,c("MAE","REC","TOLERANCE","EV"),val=1))


### mining, several runs, prob multi-class
## Not run: 
data(iris)
M=mining(Species~.,iris,model="rpart",Runs=2)
R=mmetric(M,metric="CONF",aggregate="no")
print(R[[1]]$conf)
print(R[[2]]$conf)
print(mmetric(M,metric="CONF",aggregate="mean"))
print(mmetric(M,metric="CONF",aggregate="sum"))
print(mmetric(M,metric=c("ACC","ACCLASS"),aggregate="no"))
print(mmetric(M,metric=c("ACC","ACCLASS"),aggregate="mean"))
print(mmetric(M,metric="ALL",aggregate="no"))
print(mmetric(M,metric="ALL",aggregate="mean"))

## End(Not run)

### mining, several runs, regression
## Not run: 
data(sin1reg)
S=sample(1:nrow(sin1reg),40)
M=mining(y~.,data=sin1reg[S,],model="ksvm",search=2^3,Runs=10)
R=mmetric(M,metric="MAE")
print(mmetric(M,metric="MAE",aggregate="mean"))
miR=meanint(R) # mean and t-student confidence intervals
cat("MAE=",round(miR$mean,digits=2),"+-",round(miR$int,digits=2),"\n")
print(mmetric(M,metric=c("MAE","RMSE")))
print(mmetric(M,metric=c("MAE","RMSE"),aggregate="mean"))
R=mmetric(M,metric="REC",aggregate="no")
print(R[[1]]$rec)
print(mmetric(M,metric=c("TOLERANCE","NAREC"),val=0.2))
print(mmetric(M,metric=c("TOLERANCE","NAREC"),val=0.2,aggregate="mean"))

## End(Not run)

Function that returns a list of searching (hyper)parameters for a particular model (classification or regression) or for a multiple list of models (automl or ensembles).

Description

Easy to use function that returns a list of searching (hyper)parameters for a particular model (classification or regression) or for a multiple list of models (automl or ensembles). The result is to be put in a search argument, used by fit or mining functions. Something like:
search=list(search=mparheuristic(...),...).

Usage

mparheuristic(model, n = NA, lower = NA, upper = NA, by = NA, exponential = NA, 
              kernel = "rbfdot", task = "prob", inputs = NA)
mparheuristic(model, n = NA, lower = NA, upper = NA, by = NA, exponential = NA, 
              kernel = "rbfdot", task = "prob", inputs = NA)

Arguments

`model`	model type name. See `fit` for the individual model details (e.g., `"ksvm"`). For multiple models use: `automl` - 5 individual machine learning algorithms: generalized linear model (GLM, via `cv.glmnet`), support vector machine (SVM, via `ksvm`), multilayer perceptron (MLP, via `mlpe`), random forest (RF, via `randomForest`) and extreme gradient boosting (XG, via `xgboost`). The `n="heuristic"` setting (see below) is assumed for all algorithms, thus just one hyperparameter is tested for each model. This option is thus the fastest automl to run. `automl2` - same 5 individual machine learning algorithms as `automl`. For each algorithm, a grid search is executed with 10 searches (same as: `n="heuristic10"`), except for `ksvm`, which uses 13 searches of an uniform design (`"UD"`). `automl3` - same as `automl2` except that a six extra stacking ensemble (`"SE"`) model is performed using the 5 best tuned algorithm versions (GLM, SVM, MLP, RF and XG). a character vector with several models - see the example section for a demonstration of this option.
`n`	number of searches or heuristic or numeric vector (either `n` or `by` should be used, `n` has prevalence over `by`). By default, the searches are linear for all models except for SVM several `rbfdot` kernel based models (`"ksvm"`,`"rsvm"`,`"lssvm"`, which can assume `2^`search-range; please check the result of this function to confirm if the search is linear or `2^`search-range). If this argument is 1 then the main hyperparameter of the model is set to the value of `lower` (if lower is not NA). If this argument is a numeric vector, then the main hyperparameter of the model is set to this vector. If this argument is a character type, then it is assumed to be an heuristic. Possible heuristic values are: `heuristic` - only one model is fit, uses default rminer values, same as `mparheuristic(model)`. `heuristic5` - 5 hyperparameter searches from lower to upper, only works for the following models: `ctree`, `rpart`, `kknn`, `ksvm`, `lssvm`, `mlp`, `mlpe`, `randomForest`, `multinom`, `rvm`, `xgboost`. Notes: `rpart` - different `cp` values (see `rpart.control`); `ctree` - different `mincriterion` values (see `ctree_control`); `randomForest` – upper argument is limited by the number of `inputs` (`mtry` is searched); `ksvm`, `lssvm` or `rvm` - the optional `kernel` argument can be used. `heuristic10` - same as `heuristic5` but with 10 searches from `lower` to `upper`. `UD` or `UD1` - `UD` or `UD1` uniform design search (only for `ksvm` and `rbfdof` kernel). This option assumes 2 hyperparameters for classification (sigma, C) and 3 hyperparameters (sigma, C, epsilon) for regression, thus `task="reg"` argument needs to be set when regression is used. `xgb9` - 9 searches (3 for `eta` and 3 for `max_depth`, works only when `model=xgboost`. `mlp_t` - heuristic 33 from Delgado 2014 paper, 10 searches, works only when `model=mlp` or `model=mlpe`. `avNNet_t` - heuristic 34 from Delgado 2014 paper, 9 searches, works only when `model=mlpe`. `nnet_t` - heuristic 36 from Delgado 2014 paper, 25 searches, works only when `model=mlp` or `model=mlpe`. `svm_C` - heuristic 48 from Delgado 2014 paper, 130 searches (may take time), works only when `model=ksvm`. `svmRadial_t` - heuristic 52 from Delgado 2014 paper, 25 searches, works only when `model=ksvm`. `svmLinear_t` - heuristic 54 from Delgado 2014 paper, 5 searches, works only when `model=ksvm`. `svmPoly_t` - heuristic 55 from Delgado 2014 paper, 27 searches, works only when `model=ksvm`. `lsvmRadial_t` - heuristic 56 from Delgado 2014 paper, 10 searches, works only when `model=lssvm`. `rpart_t` - heuristic 59 from Delgado 2014 paper, 10 searches, works only when `model=rpart`. `rpart2_t` - heuristic 60 from Delgado 2014 paper, 10 searches, works only when `model=rpart`. `ctree_t` - heuristic 63 from Delgado 2014 paper, 10 searches, works only when `model=ctree`. `ctree2_t` - heuristic 64 from Delgado 2014 paper, 10 searches, works only when `model=ctree`. `rf_t` - heuristic 131 from Delgado 2014 paper, 10 searches, works only when `model=randomForest`. `knn_R` - heuristic 154 from Delgado 2014 paper, 19 searches, works only when `model=kknn`. `knn_t` - heuristic 155 from Delgado 2014 paper, 10 searches, works only when `model=kknn`. `multinom_t` - heuristic 167 from Delgado 2014 paper, 10 searches, works only when `model=multinom`.
`lower`	lower bound for the (hyper)parameter (if `NA` a default value is assumed). If `n=1` and `!is.na(lower)` then lower is the main hyperparameter value for some models (e.g., "kknn", "mlpe", "ksvm", "xgboost").
`upper`	upper bound for the (hyper)parameter (if `NA` a default value is assumed).
`by`	increment in the sequence (if `NA` a default value is assumed depending on `n`).
`exponential`	if an exponential scale should be used in the search sequence (the `NA` is a default value that assumes a linear scale unless `model` is a support vector machine).
`kernel`	optional kernel type, only used when `model="ksvm"`, `model="rsvm"` or `model="lssvm"`. Currently mapped kernels are `"rbfdot"` (Gaussian), `"polydot"` (polynomial) and `"vanilladot"` (linear); see `ksvm` for kernel details.
`task`	optional task argument, only used for uniform design (`UD` or `UD1`) (with `"ksvm"` and `"rbfdot"`).
`inputs`	optional inputs argument: the number of inputs, only used by `"randomForest"`.

Details

This function facilitates the definition of the search argument used by fit or mining functions. Using simple heuristics, reasonable (hyper)parameter search values are suggested for several rminer models. For models not mapped in this function, the function returns NULL, which means that no hyperparameter search is executed (often, this implies using rminer or R function default values).

The simple usage of heuristic assumes lower and upper bounds for a (hyper)parameter. If n=1, then rminer or R defaults are assumed. Else, a search is created using seq(lower,upper,by), where by was set by the used or computed from n. For some model="ksvm" setups, 2^seq(...) is used for sigma and C, (1/10)^seq(...) is used for scale. Please check the resulting object to inspect the obtained final search values.

This function also allows to easily set multiple model searches, under the: "automl", "automl2", "automl3" or vector character options (see below examples).

Value

A list with one ore more (hyper)parameter values to be searched.

Note

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

To check for more details about rminer and for citation purposes:
P. Cortez.
Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool.
In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1.
@Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44
http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf
The automl is inspired in this work:
L. Ferreira, A. Pilastri, C. Martins, P. Santos, P. Cortez.
An Automated and Distributed Machine Learning Framework for Telecommunications Risk Management. In J. van den Herik et al. (Eds.), Proceedings of 12th International Conference on Agents and Artificial Intelligence – ICAART 2020, Volume 2, pp. 99-107, Valletta, Malta, February, 2020, SCITEPRESS, ISBN 978-989-758-395-7.
@INSTICC: https://www.insticc.org/Primoris/Resources/PaperPdf.ashx?idPaper=89528
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210
Some lower/upper bounds and heuristics were retrieved from:
M. Fernandez-Delgado, E. Cernadas, S. Barro and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems?. In The Journal of Machine Learning Research, 15(1), 3133-3181, 2014.

Examples

## "kknn"
s=mparheuristic("kknn",n="heuristic")
print(s) 
s=mparheuristic("kknn",n=1) # same thing
print(s) 
s=mparheuristic("kknn",n="heuristic5")
print(s) 
s=mparheuristic("kknn",n=5) # same thing
print(s)
s=mparheuristic("kknn",lower=5,upper=15,by=2)
print(s)
# exponential scale:
s=mparheuristic("kknn",lower=1,upper=5,by=1,exponential=2)
print(s)

## "mlpe"
s=mparheuristic("mlpe")
print(s) # "NA" means set size with min(inputs/2,10) in fit
s=mparheuristic("mlpe",n="heuristic10")
print(s) 
s=mparheuristic("mlpe",n=10) # same thing
print(s) 
s=mparheuristic("mlpe",n=10,lower=2,upper=20) 
print(s) 

# numeric (single number or vector) usage of n:
s=mparheuristic("mlpe",n=NA) #
print(s) 
s=mparheuristic("mlpe",n=1,lower=NA) # same thing
print(s) 
s=mparheuristic("mlpe",n=1,lower=2) # size=2
print(s) 
s=mparheuristic("mlpe",n=1:9) # size=1:9
print(s) 


## "randomForest", upper should be set to the number of inputs = max mtry
s=mparheuristic("randomForest",n=10,upper=6)
print(s) 

## "ksvm"
s=mparheuristic("ksvm",n=10)
print(s) 
s=mparheuristic("ksvm",n=10,kernel="vanilladot")
print(s) 
s=mparheuristic("ksvm",n=10,kernel="polydot")
print(s) 

## lssvm
s=mparheuristic("lssvm",n=10)
print(s) 

## rvm 
s=mparheuristic("rvm",n=5)
print(s) 
s=mparheuristic("rvm",n=5,kernel="vanilladot")
print(s) 

## "rpart" and "ctree" are special cases (see help(fit,package=rminer) examples):
s=mparheuristic("rpart",n=3) # 3 cp values
print(s) 
s=mparheuristic("ctree",n=3) # 3 mincriterion values
print(s) 

### examples with fit
## Not run: 
### classification
data(iris)
# ksvm and rbfdot:
model="ksvm";kernel="rbfdot"
s=mparheuristic(model,n="heuristic5",kernel=kernel)
print(s) # 5 sigma values
search=list(search=s,method=c("holdout",2/3,123))
# task "prob" is assumed, optimization of "AUC":
M=fit(Species~.,data=iris,model=model,search=search,fdebug=TRUE)
print(M@mpar)

# different lower and upper range:
s=mparheuristic(model,n=5,kernel=kernel,lower=-5,upper=1)
print(s) # from 2^-5 to 2^1 
search=list(search=s,method=c("holdout",2/3,123))
# task "prob" is assumed, optimization of "AUC":
M=fit(Species~.,data=iris,model=model,search=search,fdebug=TRUE)
print(M@mpar)

# different exponential scale: 
s=mparheuristic(model,n=5,kernel=kernel,lower=-4,upper=0,exponential=10)
print(s) # from 10^-5 to 10^1 
search=list(search=s,method=c("holdout",2/3,123))
# task "prob" is assumed, optimization of "AUC":
M=fit(Species~.,data=iris,model=model,search=search,fdebug=TRUE)
print(M@mpar)

# "lssvm" Gaussian model, pure classification and ACC optimization, full iris:
model="lssvm";kernel="rbfdot"
s=mparheuristic("lssvm",n=3,kernel=kernel)
print(s)
search=list(search=s,method=c("holdout",2/3,123))
M=fit(Species~.,data=iris,model=model,search=search,fdebug=TRUE)
print(M@mpar)

# test several heuristic5 searches, full iris:
n="heuristic5";inputs=ncol(iris)-1
model=c("ctree","rpart","kknn","ksvm","lssvm","mlpe","randomForest")
for(i in 1:length(model))
 {
  cat("--- i:",i,"model:",model[i],"\n")
  if(model[i]=="randomForest") s=mparheuristic(model[i],n=n,upper=inputs) 
  else s=mparheuristic(model[i],n=n)
  print(s)
  search=list(search=s,method=c("holdout",2/3,123))
  M=fit(Species~.,data=iris,model=model[i],search=search,fdebug=TRUE)
  print(M@mpar)
 }


# test several Delgado 2014 searches (some cases launch warnings):
model=c("mlp","mlpe","mlp","ksvm","ksvm","ksvm",
        "ksvm","lssvm","rpart","rpart","ctree",
        "ctree","randomForest","kknn","kknn","multinom")
n=c("mlp_t","avNNet_t","nnet_t","svm_C","svmRadial_t","svmLinear_t",
    "svmPoly_t","lsvmRadial_t","rpart_t","rpart2_t","ctree_t",
    "ctree2_t","rf_t","knn_R","knn_t","multinom_t")
inputs=ncol(iris)-1
for(i in 1:length(model))
 {
  cat("--- i:",i,"model:",model[i],"heuristic:",n[i],"\n")
  if(model[i]=="randomForest") s=mparheuristic(model[i],n=n[i],upper=inputs) 
  else s=mparheuristic(model[i],n=n[i])
  print(s)
  search=list(search=s,method=c("holdout",2/3,123))
  M=fit(Species~.,data=iris,model=model[i],search=search,fdebug=TRUE)
  print(M@mpar)
 }

## End(Not run) #dontrun

### regression
## Not run: 
data(sa_ssin)
s=mparheuristic("ksvm",n=3,kernel="polydot")
print(s)
search=list(search=s,metric="MAE",method=c("holdout",2/3,123))
M=fit(y~.,data=sa_ssin,model="ksvm",search=search,fdebug=TRUE)
print(M@mpar)

# regression task, predict iris "Petal.Width":
data(iris)
ir2=iris[,1:4]
names(ir2)[ncol(ir2)]="y" # change output name
n=3;inputs=ncol(ir2)-1 # 3 hyperparameter searches
model=c("ctree","rpart","kknn","ksvm","mlpe","randomForest","rvm")
for(i in 1:length(model))
 {
  cat("--- i:",i,"model:",model[i],"\n")
  if(model[i]=="randomForest") s=mparheuristic(model[i],n=n,upper=inputs)
  else s=mparheuristic(model[i],n=n)
  print(s)
  search=list(search=s,method=c("holdout",2/3,123))
  M=fit(y~.,data=ir2,model=model[i],search=search,fdebug=TRUE)
  print(M@mpar)
 }

## End(Not run) #dontrun

### multiple model examples:
## Not run: 
data(iris)
inputs=ncol(iris)-1; task="prob"

# 5 machine learning (ML) algorithms, 1 heuristic hyperparameter per algorithm:
sm=mparheuristic(model="automl",task=task,inputs=inputs)
print(sm)

# 5 ML with 10/13 hyperparameter searches:
sm=mparheuristic(model="automl2",task=task,inputs=inputs)
# note: mtry only has 4 searches due to the inputs limit:
print(sm)

# regression example:
ir2=iris[,1:4]
inputs=ncol(ir2)-1; task="reg"
sm=mparheuristic(model="automl2",task=task,inputs=inputs)
# note: ksvm contains 3 UD hyperparameters (and not 2) since task="reg": 
print(sm)

# 5 ML and stacking:
inputs=ncol(iris)-1; task="prob"
sm=mparheuristic(model="automl3",task=task,inputs=inputs)
# note: $ls only has 5 elements, one for each individual ML 
print(sm)

# other manual design examples: --------------------------------------

# 5 ML and three ensembles:
# the fit or mining functions will search for the best option
# between any of the 5 ML algorithms and any of the three 
# ensemble approaches:
sm2=mparheuristic(model="automl3",task=task,inputs=inputs)
# note: ensembles need to be at the end of the $models field:
sm2$models=c(sm2$models,"AE","WE") # add AE and WE
sm2$smethod=c(sm2$smethod,rep("grid",2)) # add grid to AE and WE
# note: $ls only has 5 elements, one for each individual ML 
print(sm2)

# 3 ML example:
models=c("cv.glmnet","mlpe","ksvm") # just 3 models
# note: in rminer the default cv.glmnet does not have "hyperparameters"
# since the cv automatically sets lambda 
n=c(NA,10,"UD") # 10 searches for mlpe and 13 for ksvm 
sm3=mparheuristic(model=models,n=n)
# note: $ls only has 5 elements, one for each individual ML 
print(sm3)

# usage in sm2 and sm3 for fit (see mining help for usages in mining):
method=c("holdout",2/3,123)
d=iris
names(d)[ncol(d)]="y" # change output name
s2=list(search=sm2,smethod="auto",method=method,metric="AUC",convex=0)
M2=fit(y~.,data=d,model="auto",search=s2,fdebug=TRUE)

s3=list(search=sm3,smethod="auto",method=method,metric="AUC",convex=0)
M3=fit(y~.,data=d,model="auto",search=s3,fdebug=TRUE)
# -------------------------------------------------------------------

## End(Not run)

## "kknn"
s=mparheuristic("kknn",n="heuristic")
print(s) 
s=mparheuristic("kknn",n=1) # same thing
print(s) 
s=mparheuristic("kknn",n="heuristic5")
print(s) 
s=mparheuristic("kknn",n=5) # same thing
print(s)
s=mparheuristic("kknn",lower=5,upper=15,by=2)
print(s)
# exponential scale:
s=mparheuristic("kknn",lower=1,upper=5,by=1,exponential=2)
print(s)

## "mlpe"
s=mparheuristic("mlpe")
print(s) # "NA" means set size with min(inputs/2,10) in fit
s=mparheuristic("mlpe",n="heuristic10")
print(s) 
s=mparheuristic("mlpe",n=10) # same thing
print(s) 
s=mparheuristic("mlpe",n=10,lower=2,upper=20) 
print(s) 

# numeric (single number or vector) usage of n:
s=mparheuristic("mlpe",n=NA) #
print(s) 
s=mparheuristic("mlpe",n=1,lower=NA) # same thing
print(s) 
s=mparheuristic("mlpe",n=1,lower=2) # size=2
print(s) 
s=mparheuristic("mlpe",n=1:9) # size=1:9
print(s) 


## "randomForest", upper should be set to the number of inputs = max mtry
s=mparheuristic("randomForest",n=10,upper=6)
print(s) 

## "ksvm"
s=mparheuristic("ksvm",n=10)
print(s) 
s=mparheuristic("ksvm",n=10,kernel="vanilladot")
print(s) 
s=mparheuristic("ksvm",n=10,kernel="polydot")
print(s) 

## lssvm
s=mparheuristic("lssvm",n=10)
print(s) 

## rvm 
s=mparheuristic("rvm",n=5)
print(s) 
s=mparheuristic("rvm",n=5,kernel="vanilladot")
print(s) 

## "rpart" and "ctree" are special cases (see help(fit,package=rminer) examples):
s=mparheuristic("rpart",n=3) # 3 cp values
print(s) 
s=mparheuristic("ctree",n=3) # 3 mincriterion values
print(s) 

### examples with fit
## Not run: 
### classification
data(iris)
# ksvm and rbfdot:
model="ksvm";kernel="rbfdot"
s=mparheuristic(model,n="heuristic5",kernel=kernel)
print(s) # 5 sigma values
search=list(search=s,method=c("holdout",2/3,123))
# task "prob" is assumed, optimization of "AUC":
M=fit(Species~.,data=iris,model=model,search=search,fdebug=TRUE)
print(M@mpar)

# different lower and upper range:
s=mparheuristic(model,n=5,kernel=kernel,lower=-5,upper=1)
print(s) # from 2^-5 to 2^1 
search=list(search=s,method=c("holdout",2/3,123))
# task "prob" is assumed, optimization of "AUC":
M=fit(Species~.,data=iris,model=model,search=search,fdebug=TRUE)
print(M@mpar)

# different exponential scale: 
s=mparheuristic(model,n=5,kernel=kernel,lower=-4,upper=0,exponential=10)
print(s) # from 10^-5 to 10^1 
search=list(search=s,method=c("holdout",2/3,123))
# task "prob" is assumed, optimization of "AUC":
M=fit(Species~.,data=iris,model=model,search=search,fdebug=TRUE)
print(M@mpar)

# "lssvm" Gaussian model, pure classification and ACC optimization, full iris:
model="lssvm";kernel="rbfdot"
s=mparheuristic("lssvm",n=3,kernel=kernel)
print(s)
search=list(search=s,method=c("holdout",2/3,123))
M=fit(Species~.,data=iris,model=model,search=search,fdebug=TRUE)
print(M@mpar)

# test several heuristic5 searches, full iris:
n="heuristic5";inputs=ncol(iris)-1
model=c("ctree","rpart","kknn","ksvm","lssvm","mlpe","randomForest")
for(i in 1:length(model))
 {
  cat("--- i:",i,"model:",model[i],"\n")
  if(model[i]=="randomForest") s=mparheuristic(model[i],n=n,upper=inputs) 
  else s=mparheuristic(model[i],n=n)
  print(s)
  search=list(search=s,method=c("holdout",2/3,123))
  M=fit(Species~.,data=iris,model=model[i],search=search,fdebug=TRUE)
  print(M@mpar)
 }


# test several Delgado 2014 searches (some cases launch warnings):
model=c("mlp","mlpe","mlp","ksvm","ksvm","ksvm",
        "ksvm","lssvm","rpart","rpart","ctree",
        "ctree","randomForest","kknn","kknn","multinom")
n=c("mlp_t","avNNet_t","nnet_t","svm_C","svmRadial_t","svmLinear_t",
    "svmPoly_t","lsvmRadial_t","rpart_t","rpart2_t","ctree_t",
    "ctree2_t","rf_t","knn_R","knn_t","multinom_t")
inputs=ncol(iris)-1
for(i in 1:length(model))
 {
  cat("--- i:",i,"model:",model[i],"heuristic:",n[i],"\n")
  if(model[i]=="randomForest") s=mparheuristic(model[i],n=n[i],upper=inputs) 
  else s=mparheuristic(model[i],n=n[i])
  print(s)
  search=list(search=s,method=c("holdout",2/3,123))
  M=fit(Species~.,data=iris,model=model[i],search=search,fdebug=TRUE)
  print(M@mpar)
 }

## End(Not run) #dontrun

### regression
## Not run: 
data(sa_ssin)
s=mparheuristic("ksvm",n=3,kernel="polydot")
print(s)
search=list(search=s,metric="MAE",method=c("holdout",2/3,123))
M=fit(y~.,data=sa_ssin,model="ksvm",search=search,fdebug=TRUE)
print(M@mpar)

# regression task, predict iris "Petal.Width":
data(iris)
ir2=iris[,1:4]
names(ir2)[ncol(ir2)]="y" # change output name
n=3;inputs=ncol(ir2)-1 # 3 hyperparameter searches
model=c("ctree","rpart","kknn","ksvm","mlpe","randomForest","rvm")
for(i in 1:length(model))
 {
  cat("--- i:",i,"model:",model[i],"\n")
  if(model[i]=="randomForest") s=mparheuristic(model[i],n=n,upper=inputs)
  else s=mparheuristic(model[i],n=n)
  print(s)
  search=list(search=s,method=c("holdout",2/3,123))
  M=fit(y~.,data=ir2,model=model[i],search=search,fdebug=TRUE)
  print(M@mpar)
 }

## End(Not run) #dontrun

### multiple model examples:
## Not run: 
data(iris)
inputs=ncol(iris)-1; task="prob"

# 5 machine learning (ML) algorithms, 1 heuristic hyperparameter per algorithm:
sm=mparheuristic(model="automl",task=task,inputs=inputs)
print(sm)

# 5 ML with 10/13 hyperparameter searches:
sm=mparheuristic(model="automl2",task=task,inputs=inputs)
# note: mtry only has 4 searches due to the inputs limit:
print(sm)

# regression example:
ir2=iris[,1:4]
inputs=ncol(ir2)-1; task="reg"
sm=mparheuristic(model="automl2",task=task,inputs=inputs)
# note: ksvm contains 3 UD hyperparameters (and not 2) since task="reg": 
print(sm)

# 5 ML and stacking:
inputs=ncol(iris)-1; task="prob"
sm=mparheuristic(model="automl3",task=task,inputs=inputs)
# note: $ls only has 5 elements, one for each individual ML 
print(sm)

# other manual design examples: --------------------------------------

# 5 ML and three ensembles:
# the fit or mining functions will search for the best option
# between any of the 5 ML algorithms and any of the three 
# ensemble approaches:
sm2=mparheuristic(model="automl3",task=task,inputs=inputs)
# note: ensembles need to be at the end of the $models field:
sm2$models=c(sm2$models,"AE","WE") # add AE and WE
sm2$smethod=c(sm2$smethod,rep("grid",2)) # add grid to AE and WE
# note: $ls only has 5 elements, one for each individual ML 
print(sm2)

# 3 ML example:
models=c("cv.glmnet","mlpe","ksvm") # just 3 models
# note: in rminer the default cv.glmnet does not have "hyperparameters"
# since the cv automatically sets lambda 
n=c(NA,10,"UD") # 10 searches for mlpe and 13 for ksvm 
sm3=mparheuristic(model=models,n=n)
# note: $ls only has 5 elements, one for each individual ML 
print(sm3)

# usage in sm2 and sm3 for fit (see mining help for usages in mining):
method=c("holdout",2/3,123)
d=iris
names(d)[ncol(d)]="y" # change output name
s2=list(search=sm2,smethod="auto",method=method,metric="AUC",convex=0)
M2=fit(y~.,data=d,model="auto",search=s2,fdebug=TRUE)

s3=list(search=sm3,smethod="auto",method=method,metric="AUC",convex=0)
M3=fit(y~.,data=d,model="auto",search=s3,fdebug=TRUE)
# -------------------------------------------------------------------

## End(Not run)

predict method for fit objects (rminer)

Description

predict method for fit objects (rminer)

Arguments

`object`	a model object created by `fit`
`newdata`	a data frame or matrix containing new data

Details

Returns predictions for a fit model. Note: the ... optional argument is currently only used by cubist model (see example).

Value

If task is prob returns a matrix, where each column is the class probability.
If task is class returns a factor.
If task is reg returns a numeric vector.

Methods

signature(object = "model"): describe this method here

References

To check for more details about rminer and for citation purposes:
P. Cortez.
Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool.
In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1.
@Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44
http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
http://hdl.handle.net/1822/36210

Examples

### simple classification example with logistic regression
data(iris)
M=fit(Species~.,iris,model="lr")
P=predict(M,iris)
print(mmetric(iris$Species,P,"CONF")) # confusion matrix

### simple regression example
data(sa_ssin)
H=holdout(sa_ssin$y,ratio=0.5,seed=12345)
Y=sa_ssin[H$ts,]$y # desired test set
# fit multiple regression on training data (half of samples)
M=fit(y~.,sa_ssin[H$tr,],model="mr") # multiple regression
P1=predict(M,sa_ssin[H$ts,]) # predictions on test set
print(mmetric(Y,P1,"MAE")) # mean absolute error

### fit cubist model
M=fit(y~.,sa_ssin[H$tr,],model="cubist") #
P2=predict(M,sa_ssin[H$ts,],neighbors=3) #
print(mmetric(Y,P2,"MAE")) # mean absolute error
P3=predict(M,sa_ssin[H$ts,],neighbors=7) #
print(mmetric(Y,P3,"MAE")) # mean absolute error

### check fit for more examples
### simple classification example with logistic regression
data(iris)
M=fit(Species~.,iris,model="lr")
P=predict(M,iris)
print(mmetric(iris$Species,P,"CONF")) # confusion matrix

### simple regression example
data(sa_ssin)
H=holdout(sa_ssin$y,ratio=0.5,seed=12345)
Y=sa_ssin[H$ts,]$y # desired test set
# fit multiple regression on training data (half of samples)
M=fit(y~.,sa_ssin[H$tr,],model="mr") # multiple regression
P1=predict(M,sa_ssin[H$ts,]) # predictions on test set
print(mmetric(Y,P1,"MAE")) # mean absolute error

### fit cubist model
M=fit(y~.,sa_ssin[H$tr,],model="cubist") #
P2=predict(M,sa_ssin[H$ts,],neighbors=3) #
print(mmetric(Y,P2,"MAE")) # mean absolute error
P3=predict(M,sa_ssin[H$ts,],neighbors=7) #
print(mmetric(Y,P3,"MAE")) # mean absolute error

### check fit for more examples

Synthetic regression and classification datasets for measuring input importance of supervised learning models

Description

5 Synthetic regression (sa_fri1, sa_ssin, sa_psin, sa_int2, sa_tree) and 4 classification (sa_ssin_2, sa_ssin_n2p, sa_int2_3c, sa_int2_8p) datasets for measuring input importance of supervised learning models

Usage

data(sa_fri1)data(sa_fri1)

Format

A data frame with 1000 observations on the following variables.

xn: input (numeric or factor, depends on the dataset)
y: output target (numeric or factor, depends on the dataset)

Details

Check reference or source for full details

Source

See references

References

To cite the Importance function, sensitivity analysis methods or synthetic datasets, please use:
P. Cortez and M.J. Embrechts.
Using Sensitivity Analysis and Visualization Techniques to Open Black Box Data Mining Models.
In Information Sciences, Elsevier, 225:1-17, March 2013.
doi:10.1016/j.ins.2012.10.039

Examples

data(sa_ssin)
print(summary(sa_ssin))
## Not run: plot(sa_ssin$x1,sa_ssin$y)
data(sa_ssin)
print(summary(sa_ssin))
## Not run: plot(sa_ssin$x1,sa_ssin$y)

Load/save into a file the result of a fit (model) or mining functions.

Description

Load/save into a file the result of a fit (model) or mining functions.

Usage

savemining(mmm_mining, file, ascii = TRUE)savemining(mmm_mining, file, ascii = TRUE)

Arguments

`mmm_mining`	the list object that is returned by the `mining` function.
`file`	filename that should include an extension
`ascii`	if `TRUE` then ascii format is used to store the file (larger file size), else a binary format is used.

Details

Very simple functions that do what their names say. Additional usages are:
loadmining(file)
savemodel(MM_model,file,ascii=FALSE)
loadmodel(file)

Value

loadmining returns a mining mining list, while loadmodel returns a model object (from fit).

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

See fit.

Examples

### dontrun is used here to avoid the creation of a new file
### in the CRAN servers. The example should work fine:
## Not run: 
data(iris)
M=fit(Species~.,iris,model="rpart")
tempdirpath=tempdir()
filename=paste(tempdirpath,"/iris.model",sep="")
savemodel(M,filename) # saves to file
M=NULL # cleans M
M=loadmodel(filename) # load from file
print(M)

## End(Not run)
### dontrun is used here to avoid the creation of a new file
### in the CRAN servers. The example should work fine:
## Not run: 
data(iris)
M=fit(Species~.,iris,model="rpart")
tempdirpath=tempdir()
filename=paste(tempdirpath,"/iris.model",sep="")
savemodel(M,filename) # saves to file
M=NULL # cleans M
M=loadmodel(filename) # load from file
print(M)

## End(Not run)

sin1 regression dataset

Description

Simple synthetic dataset with 1000 points, where y=0.7*sin(pi*x1/2000)+0.3*sin(pi*x2/2000)

Usage

data(sin1reg)data(sin1reg)

Format

The format is: chr "sin1reg"

Details

Simple synthetic dataset with 1000 points, where y=0.7*sin(pi*x1/2000)+0.3*sin(pi*x2/2000)

Source

See references

References

To cite the Importance function, sensitivity analysis methods or synthetic datasets, please use:
P. Cortez and M.J. Embrechts.
Using Sensitivity Analysis and Visualization Techniques to Open Black Box Data Mining Models.
In Information Sciences, Elsevier, 225:1-17, March 2013.
doi:10.1016/j.ins.2012.10.039

Examples

data(sin1reg)
print(summary(sin1reg))
data(sin1reg)
print(summary(sin1reg))

VEC plot function (to use in conjunction with Importance function).

Description

VEC plot function (to use in conjunction with Importance function).

Usage

vecplot(I, graph = "VEC", leg = NULL, xval = 1, sort = FALSE, data = NULL, 
digits = c(1, 1), TC = 1, intbar = NULL, lty = 1, pch = 19, col = NULL, 
datacol = NULL, main = "", main2 = "", Grid = 0, 
xlab = "", ylab = "", zlab = "", 
levels = NULL, levels2 = NULL, showlevels = FALSE, 
screen = list(z = 40, x = -60), zoom = 1, cex = 1)
vecplot(I, graph = "VEC", leg = NULL, xval = 1, sort = FALSE, data = NULL, 
digits = c(1, 1), TC = 1, intbar = NULL, lty = 1, pch = 19, col = NULL, 
datacol = NULL, main = "", main2 = "", Grid = 0, 
xlab = "", ylab = "", zlab = "", 
levels = NULL, levels2 = NULL, showlevels = FALSE, 
screen = list(z = 40, x = -60), zoom = 1, cex = 1)

Arguments

`I`	the output list of the `Importance` function.
`graph`	type of VEC graph: `VEC` – 1-D VEC curve; `VECB` – 1-D VEC curve with box plots (only valid for SA methods: "DSA", "MSA"); `VEC3` – 2-D VEC surface; `VECC` – 2-D VEC contour;
`leg`	see `mgraph`
`xval`	the attribute input index (e.g. 1), only used if `graph="VEC"` or (`graph="VEC3" or "VECC"` and `length(interactions)=1`, see `Importance`). if a vector, then several VEC curves are plotted (in this case, x-axis is scaled).
`sort`	if factor inputs are sorted: `increasing` – sorts the first attribute (if factor) according to the response values, increasing order; `decreasing` – similar to `increasing` but uses reverse order; `TRUE` – similar to `increasing`; `increasing2` – sorts the second attribute (for `graph="VEC3" or "VECC"`, if factor, according to the response values), increasing order; `decreasing2` – similar to `increasing2` but uses reverse order; `FALSE` – no sort is used;
`data`	see `mgraph`
`digits`	see `mgraph`
`TC`	see `mgraph`
`intbar`	see `mgraph`
`lty`	see `mgraph`
`pch`	point type for the `graph="VEC"` curve, can be a vector if there are several VEC curve plots
`col`	color (e.g. "black", "grayrange", "white")
`datacol`	color of the data histogram for `graph="VEC"`
`main`	see `mgraph`
`main2`	key title for `graph="VECC"`
`Grid`	see `mgraph`
`xlab`	x-axis label
`ylab`	y-axis label
`zlab`	z-axis label
`levels`	if x1 is factor you can choose the order of the levels to this argument
`levels2`	if x2 is factor you can choose the order of the levels to this argument
`showlevels`	if you want to show the factor levels in x1 or x2 axis in `graph="VEC3"`: `FALSE` or `TRUE` – do not (do) show the levels in x1, x2 and z axis for factor variables; vector with 3 logical values – if you want to show the levels in each of the x1, x2 or z axis for factor variables (e.g. `c(FALSE,FALSE,TRUE)` only shows for z-axis).
`screen`	select the perspective angle of the `VEC3` graph: `x` – assumes `list(z=0,x=-90,y=0)`; `X` – assumes `list(x=-75)`; `y` – assumes `list(z=0,x=-90,y=-90)`; `Y` – assumes `list(z=10,x=-90,y=-90)`; `z` – assumes `list(z=0,x=0,y=0)`; `xy` – assumes `list(z=10,x=-90,y=-45)`; else you need to specify a list with z, x an y angles, see `wireframe`
`zoom`	zoom of the wireframe (`graph="VEC3"`)
`cex`	label font size

Details

For examples and references check: Importance

Value

A VEC curve/surface/contour plot.

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

To cite the Importance function or sensitivity analysis method, please use:

P. Cortez and M.J. Embrechts.
Using Sensitivity Analysis and Visualization Techniques to Open Black Box Data Mining Models.
In Information Sciences, Elsevier, 225:1-17, March 2013.

doi:10.1016/j.ins.2012.10.039

Package 'rminer'

Help Index

Create a training set (data.frame) from a time series using a sliding window.

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Computes k-fold cross validation for rminer models.

Description

Usage

Arguments

Details

Value

Note

Author(s)

References

See Also

Examples

Reduce, replace or transform levels of a data.frame or factor variable (useful for preprocessing datasets).

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Fit a supervised data mining model (classification or regression) model

Description

Usage

Arguments

Details

Value

Note

Author(s)

References

See Also

Examples

Computes indexes for holdout data split into training and test sets.

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Measure input importance (including sensitivity analysis) given a supervised data mining model.

Description

Usage

Arguments

Details

Value

Note

Author(s)

References

See Also

Examples

Missing data imputation (e.g. substitution by value or hotdeck method).

Description

Usage

Arguments

Details

Value

Note

Author(s)

References

See Also

Examples

Compute long term forecasts.

Description

Usage

Arguments