library(tidymodels)
library(neuralnet)
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels 1.3.0 ──
 broom        1.0.7      recipes      1.1.1
 dials        1.4.0      rsample      1.2.1
 dplyr        1.1.4      tibble       3.2.1
 ggplot2      3.5.1      tidyr        1.3.1
 infer        1.0.7      tune         1.3.0
 modeldata    1.4.0      workflows    1.2.0
 parsnip      1.3.0      workflowsets 1.1.0
 purrr        1.0.4      yardstick    1.3.2
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
 purrr::%||%()    masks base::%||%()
 purrr::discard() masks scales::discard()
 dplyr::filter()  masks stats::filter()
 dplyr::lag()     masks stats::lag()
 recipes::step()  masks stats::step()
Attaching package: ‘neuralnet’
The following object is masked from ‘package:dplyr’:

    compute

Neural Networks and Deep Learning#

Neural network models emerged from early attempts to model how neurons in the brain might work. While they have had only limited success in actually modeling anything biological and their popularity has ebbed and flowed (mostly due to computational limitations over time), they have become quite useful in machine learning.

In fact, you may have heard the phrase “deep learning”, which is used to solve problems in speech recognition, image recognition, and 3-D object recognition. Deep learning is just a very large, very complex neural network.

So how does it work?

In 1958, Frank Rosenblatt developed the perceptron (in a sense, the most primitive neural network) to make classifications.

The perceptron takes inputs \(x_1\), \(x_2\), etc. (the values of predictor variables) and calculates the weighted sum \(w_0 + w_1 x_1 + w_2 x_2 + \ldots\), where the weights can be positive or negative. However, it outputs either a 0 or a 1: a 1 if the weighted sum is positive, and a 0 otherwise.

If you think the process of making a prediction with the perceptron sounds like linear regression, you’re right.

\[ \text{Regression: } \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots \]
\[\begin{split} \text{Perceptron: } \hat{y} = \begin{cases} 0 & \text{if } w_0 + w_1 x_1 + w_2 x_2 + \ldots \le 0 \\ 1 & \text{if } w_0 + w_1 x_1 + w_2 x_2 + \ldots > 0 \end{cases} \end{split}\]

In fact, the perceptron looks a lot like a linear regression model trying to predict the probability of a level and then classifying accordingly. Linear regression is usually a pretty bad model for this application (which is why we use logistic regression instead).
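To make the perceptron concrete, here is a minimal sketch in base R; the weights are made up for illustration and are not fit to any data:

```r
# A perceptron "by hand" -- hypothetical weights for two predictors
w <- c(-1, 0.5, 2)   # w0 (the "bias"/intercept), w1, w2

perceptron <- function(x1, x2, w) {
  z <- w[1] + w[2]*x1 + w[3]*x2  # weighted sum of the predictors
  ifelse(z > 0, 1, 0)            # step function: 1 if positive, 0 otherwise
}

perceptron(x1 = 1, x2 = 1, w)   # -1 + 0.5 + 2 = 1.5 > 0, so output is 1
perceptron(x1 = 4, x2 = -1, w)  # -1 + 2 - 2 = -1 <= 0, so output is 0
```

Training a perceptron amounts to searching for weights that classify the training data well; making a prediction is just this one weighted sum plus a threshold.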

Activation Function#

People realized that instead of transforming the weighted sum of predictors to an output using a “step function” (0 if less than some threshold, 1 if greater), a better idea would be to feed the weighted sum into a more general activation function. One popular choice is the logistic function.

\[z = w_0 + w_1 x_1 + w_2 x_2 + \ldots\]
\[y = \frac{e^z}{1+ e^z} = \frac{1}{1 + e^{-z}}\]

When the perceptron uses the logistic activation function, the end result is logistic regression, which we can use to model the probability that an individual belongs to one of two classes.

By taking the weighted sum of the predictor variables and passing it through the logistic activation function, the output becomes a number between 0 and 1 that can be interpreted as a probability. The upshot: logistic regression has been re-invented.
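As a quick sanity check, the logistic function squashes any weighted sum into the interval (0, 1); base R's built-in `plogis()` computes exactly this quantity:

```r
z <- c(-5, 0, 3)              # some example weighted sums
manual <- 1/(1 + exp(-z))     # the logistic function written out by hand
round(manual, 4)              # 0.0067 0.5000 0.9526 -- always strictly between 0 and 1
all.equal(manual, plogis(z))  # TRUE: plogis() is base R's logistic (sigmoid) function
```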

There are many choices of activation functions, though by far the most common is the logistic (aka sigmoid) due to some “nice” theoretical properties.

par(mfrow=c(2,2))
plot( seq(-4,4,by=0.01), c( rep(0,401),rep(1,400)),type="l",xlab="Weighted Sum",ylab="Output")
legend("topleft","Step")
curve( 1/(1+exp(-x)), from=-4,to=4,xlab="Weighted Sum",ylab="Output")
legend("topleft","Logistic/Sigmoid")
curve( atan(x), from=-4,to=4,xlab="Weighted Sum",ylab="Output")
legend("topleft","ArcTan")
curve( x/(1+abs(x)), from=-4,to=4,xlab="Weighted Sum",ylab="Output")
legend("topleft","Softsign")
par(mfrow=c(1,1))
[Figure: four activation functions plotted as output vs. weighted sum — Step, Logistic/Sigmoid, ArcTan, and Softsign.]

There’s not a huge variety in what activation functions look like.

Hidden Layer#

The model is greatly improved by adding what’s known as a hidden layer.

  • Each neuron in the hidden layer receives a weighted sum of the predictor variables (each one receives a different sum), then outputs an “activated sum” (e.g., by passing it through the logistic function).

  • Whereas the perceptron uses the output from the activation function as the predicted value of \(y\), these outputs in the intermediate layer (the hidden layer) are essentially used as new predictor variables.

  • A weighted sum of these new predictor variables is created, run through one more activation function and transformed, and the output is finally the predicted value of \(y\).

  • The hidden layer is where this ``feature engineering” (variable creation) takes place. It constructs better predictors of \(y\) from the measured variables.

Imagine forecasting what will happen to a medical patient. A neural network would create weighted sums of predictor variables like age, sex, BMI, and properties of a tumor, run them through a hidden layer to transform them into new predictors, and then combine those in a weighted sum to make the prognosis. Note: the “bias” in a neural network diagram is analogous to the “intercept” term in a regression (a weighted sum of predictors, plus a numeric constant).
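A forward pass through one hidden layer is just two rounds of “weighted sum, then activate.” Here is a minimal sketch with made-up weights (nothing here is fit to data; the dimensions and values are arbitrary):

```r
sigmoid <- function(z) 1/(1 + exp(-z))

x  <- c(1.2, -0.5, 0.3)                # one individual: 3 scaled predictor values
W1 <- matrix(c( 0.4, -0.7,
                0.2,  0.9,
               -0.5,  0.1), nrow = 3, byrow = TRUE)  # weights: 3 inputs -> 2 hidden neurons
b1 <- c(0.1, -0.2)                     # hidden-layer biases ("intercepts")
w2 <- c(0.8, -1.1)                     # weights: 2 hidden outputs -> final output
b2 <- 0.05                             # output-layer bias

h <- sigmoid(drop(x %*% W1) + b1)  # hidden layer: two new "engineered" predictors in (0,1)
y <- sum(w2 * h) + b2              # final weighted sum of those new predictors = the prediction
```

Fitting the network means searching for the entries of `W1`, `b1`, `w2`, and `b2` that make predictions like `y` match the training data.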

Multiple Hidden Layers (Deep Learning)#

Multiple hidden layers can be added to a neural network. They all work the same: transforming a weighted sum of inputs via an activation function and outputting the result. In effect, each hidden layer helps to improve on the transformations made by the layer before it, so it’s “like” a boosted tree in that sense.

Deep learning means having a neural network with many hidden layers. As you can imagine, figuring out the correct weights in the weighted sums can be a complex, computationally intensive task.

Example: MNIST Digits#

How do you personally figure out what digit is what? What features do you look for?

The hidden layer in the neural network might “construct” features like:

A “weighted sum” of those four features could easily produce a 0! Other sets of features would be able to produce other digits.

Tuning Parameters#

When building a neural network for predictive modeling, you need to design it:

  • Number of hidden layers: More layers = lower bias (fitting the training data well) but larger variance and larger risk of overfitting unless you have a really big dataset.

  • Number of neurons in each hidden layer: essentially, the number of new variables to create from the original predictors. More neurons = lower bias but larger variance and larger risk of overfitting.

  • “Weight decay” (regularization): penalty given to large weights in the weighted sum to prevent overfitting and to improve generalization. This will prevent any particular predictor/feature from contributing “too much” and makes the model somewhat less sensitive to the set of individuals in the training data. Small weights make the modeled relationships “more linear” and prevents crazy curviness that may be unique to the training set. Not all implementations of the algorithm have this parameter.
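In symbols, weight decay changes the quantity being minimized: instead of just the training error, the algorithm minimizes the error plus a penalty proportional to the sum of squared weights (sketched here for a regression-style loss; \(\lambda\) denotes the weight-decay tuning parameter):

\[\text{Penalized Loss} = \sum_{i} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} w_j^2\]

Larger \(\lambda\) shrinks the weights toward zero, making the fitted relationship smoother and less sensitive to the particular training set.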

Example and Pros and Cons#

The `neuralnet` package lets us explore some aspects of the neural network model (we will not be using it with `train`), but the data frame has to have categorical variables converted to indicator variables, and each variable needs to be scaled.

data(TIPS, package='regclass')
#Replace categorical variables with indicator variables (using only some of the predictors here)
DATA <- model.matrix(~TipPercentage+Bill+Gender+Smoker+PartySize,data=TIPS)[,-1]
#Create variables to store the mean and standard deviation of the y variable; need those later
mean.y <- mean(TIPS$TipPercentage); sd.y <- sd(TIPS$TipPercentage) 
DATA <- scale(DATA)  #Scale the data
NNET <- neuralnet(TipPercentage~Bill+GenderMale+SmokerYes+PartySize,data=DATA,
                  hidden=3,linear.output = TRUE,stepmax=1e6)

The `hidden` argument tells it how many neurons are in the hidden layer (supply a vector to add more hidden layers). `stepmax` dictates how many search steps the algorithm is allowed to take to find the optimal set of weights.
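For example, a two-hidden-layer network can be requested by passing a vector to `hidden`. The snippet below is a self-contained sketch on simulated data (the layer sizes 4 and 2 are arbitrary choices for illustration, not recommendations):

```r
library(neuralnet)

set.seed(1)
D <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
D$y <- as.numeric(scale(2*D$x1 - D$x2 + rnorm(100, sd = 0.1)))  # scaled response

# hidden = c(4, 2): first hidden layer has 4 neurons, the second has 2
FIT <- neuralnet(y ~ x1 + x2, data = D, hidden = c(4, 2),
                 linear.output = TRUE, stepmax = 1e6)

# One weight matrix per connection: inputs->layer1, layer1->layer2, layer2->output
length(FIT$weights[[1]])
```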

Visualizing the neural network with neuralnet#

Warning: you may not be able to knit the plot of a neural network because of some odd choices the authors of the package made with the plot syntax (in fact, there is some chance your computer may not even be able to make the plot at all).

plot(NNET,rep="best")
[Figure: plot of the fitted network NNET, showing the weights on each connection between the input, hidden, and output nodes.]

Making predictions with neuralnet#

The writers of the `neuralnet` package also decided to make their neural network model incompatible with `predict`. Instead, you must use `compute`.

Note: when giving it a dataset on which to make predictions, the columns must match up exactly with the columns used to train the model (unlike every other model we’ve discussed). The example below uses everything but the TipPercentage column (since that is the \(y\) variable, not an \(x\) variable in the predictor matrix).

predictions.nnet <- as.numeric( compute(NNET,DATA[,-1])$net.result )  #ie without column of y
predictions.nnet <- predictions.nnet*sd.y + mean.y  #unscale
#RMSE on training data; not the most interesting
sqrt( mean( (TIPS$TipPercentage - predictions.nnet)^2 ) )  
4.72070819670374

Examples with `tidymodels`#

Example: neural networks (classification)#

data(EX6.WINE, package='regclass')

# Using the full dataset for training is not usually right if we want to investigate testing performance.
TRAIN <- EX6.WINE

REC <- recipe(Quality ~ ., TRAIN) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_corr(all_predictors()) %>%
    step_lincomb(all_predictors())

WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(mlp( mode = 'classification', hidden_units = tune(), penalty = tune() ))

GRID <- expand.grid( hidden_units = 1:7, penalty = 10^seq(-2,2,length=20) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID
    )

METRICS <- collect_metrics(RES)

METRICS
A tibble: 420 × 8
   hidden_units    penalty .metric     .estimator      mean     n      std_err .config
          <int>      <dbl> <chr>       <chr>          <dbl> <int>        <dbl> <chr>
              1 0.01000000 accuracy    binary     0.8188889     5  0.004771888 Preprocessor1_Model001
              1 0.01000000 brier_class binary     0.1615436     5  0.002081009 Preprocessor1_Model001
              1 0.01000000 roc_auc     binary     0.8960086     5  0.006585751 Preprocessor1_Model001
              2 0.01000000 accuracy    binary     0.8437037     5  0.007041909 Preprocessor1_Model002
              2 0.01000000 brier_class binary     0.1509120     5  0.002011134 Preprocessor1_Model002
              2 0.01000000 roc_auc     binary     0.9148295     5  0.005190027 Preprocessor1_Model002
              3 0.01000000 accuracy    binary     0.8451852     5  0.013902463 Preprocessor1_Model003
              3 0.01000000 brier_class binary     0.1482455     5  0.003201304 Preprocessor1_Model003
              3 0.01000000 roc_auc     binary     0.9179203     5  0.007189008 Preprocessor1_Model003
              4 0.01000000 accuracy    binary     0.8388889     5  0.004721314 Preprocessor1_Model004
              4 0.01000000 brier_class binary     0.1491965     5  0.001408213 Preprocessor1_Model004
              4 0.01000000 roc_auc     binary     0.9150855     5  0.005121324 Preprocessor1_Model004
              5 0.01000000 accuracy    binary     0.8555556     5  0.007790994 Preprocessor1_Model005
              5 0.01000000 brier_class binary     0.1403109     5  0.001886026 Preprocessor1_Model005
              5 0.01000000 roc_auc     binary     0.9287602     5  0.005023629 Preprocessor1_Model005
              6 0.01000000 accuracy    binary     0.8503704     5  0.007300149 Preprocessor1_Model006
              6 0.01000000 brier_class binary     0.1433502     5  0.002394401 Preprocessor1_Model006
              6 0.01000000 roc_auc     binary     0.9222099     5  0.006045245 Preprocessor1_Model006
              7 0.01000000 accuracy    binary     0.8525926     5  0.009633190 Preprocessor1_Model007
              7 0.01000000 brier_class binary     0.1420321     5  0.003782513 Preprocessor1_Model007
              7 0.01000000 roc_auc     binary     0.9206950     5  0.007970750 Preprocessor1_Model007
              1 0.01623777 accuracy    binary     0.8188889     5  0.005629142 Preprocessor1_Model008
              1 0.01623777 brier_class binary     0.1616034     5  0.002086724 Preprocessor1_Model008
              1 0.01623777 roc_auc     binary     0.8959839     5  0.006575368 Preprocessor1_Model008
              2 0.01623777 accuracy    binary     0.8362963     5  0.007444352 Preprocessor1_Model009
              2 0.01623777 brier_class binary     0.1538296     5  0.001384560 Preprocessor1_Model009
              2 0.01623777 roc_auc     binary     0.9052567     5  0.002742090 Preprocessor1_Model009
              3 0.01623777 accuracy    binary     0.8362963     5  0.006186405 Preprocessor1_Model010
              3 0.01623777 brier_class binary     0.1496267     5  0.002705010 Preprocessor1_Model010
              3 0.01623777 roc_auc     binary     0.9153666     5  0.007604319 Preprocessor1_Model010
              ⋮
              5   61.58482 accuracy    binary     0.7577778     5 0.0115410640 Preprocessor1_Model131
              5   61.58482 brier_class binary     0.2138267     5 0.0011151672 Preprocessor1_Model131
              5   61.58482 roc_auc     binary     0.8449092     5 0.0061148461 Preprocessor1_Model131
              6   61.58482 accuracy    binary     0.7674074     5 0.0098444695 Preprocessor1_Model132
              6   61.58482 brier_class binary     0.2127550     5 0.0010660400 Preprocessor1_Model132
              6   61.58482 roc_auc     binary     0.8453169     5 0.0060603434 Preprocessor1_Model132
              7   61.58482 accuracy    binary     0.7718519     5 0.0086741025 Preprocessor1_Model133
              7   61.58482 brier_class binary     0.2120647     5 0.0010521195 Preprocessor1_Model133
              7   61.58482 roc_auc     binary     0.8455773     5 0.0060577024 Preprocessor1_Model133
              1  100.00000 accuracy    binary     0.6074074     5 0.0124363951 Preprocessor1_Model134
              1  100.00000 brier_class binary     0.2368593     5 0.0009043861 Preprocessor1_Model134
              1  100.00000 roc_auc     binary     0.8310725     5 0.0056761244 Preprocessor1_Model134
              2  100.00000 accuracy    binary     0.6074074     5 0.0124363951 Preprocessor1_Model135
              2  100.00000 brier_class binary     0.2357033     5 0.0009584880 Preprocessor1_Model135
              2  100.00000 roc_auc     binary     0.8297361     5 0.0056658875 Preprocessor1_Model135
              3  100.00000 accuracy    binary     0.6074074     5 0.0124363951 Preprocessor1_Model136
              3  100.00000 brier_class binary     0.2357781     5 0.0009893254 Preprocessor1_Model136
              3  100.00000 roc_auc     binary     0.8286669     5 0.0056627580 Preprocessor1_Model136
              4  100.00000 accuracy    binary     0.6074074     5 0.0124363951 Preprocessor1_Model137
              4  100.00000 brier_class binary     0.2357890     5 0.0012697318 Preprocessor1_Model137
              4  100.00000 roc_auc     binary     0.8284353     5 0.0058590361 Preprocessor1_Model137
              5  100.00000 accuracy    binary     0.6074074     5 0.0124363951 Preprocessor1_Model138
              5  100.00000 brier_class binary     0.2347382     5 0.0011317313 Preprocessor1_Model138
              5  100.00000 roc_auc     binary     0.8288569     5 0.0056654048 Preprocessor1_Model138
              6  100.00000 accuracy    binary     0.6074074     5 0.0124363951 Preprocessor1_Model139
              6  100.00000 brier_class binary     0.2345973     5 0.0011312843 Preprocessor1_Model139
              6  100.00000 roc_auc     binary     0.8286722     5 0.0056750661 Preprocessor1_Model139
              7  100.00000 accuracy    binary     0.6074074     5 0.0124363951 Preprocessor1_Model140
              7  100.00000 brier_class binary     0.2343394     5 0.0011876321 Preprocessor1_Model140
              7  100.00000 roc_auc     binary     0.8287848     5 0.0056720982 Preprocessor1_Model140
for(metric in unique(METRICS$.metric)){
    metrics <- METRICS %>% filter(.metric==metric)
    plot <- ggplot(metrics, aes(x=penalty, y=mean, color=as.factor(hidden_units))) + geom_line() + scale_x_log10() + labs(title=metric)
    print(plot)
}
[Figures: tuning curves of mean accuracy, brier_class, and roc_auc versus penalty (log scale), one line per number of hidden units.]
BEST <- select_best(RES, metric = 'accuracy')
BEST
A tibble: 1 × 3
  hidden_units   penalty .config
         <int>     <dbl> <chr>
             7 0.1128838 Preprocessor1_Model042
MODEL <- WF %>%
    finalize_workflow(BEST) %>%
    fit(TRAIN)
MODEL
══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: mlp()

── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_normalize()
• step_dummy()
• step_nzv()
• step_corr()
• step_lincomb()

── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
a 11-7-1 network with 92 weights
inputs: fixed.acidity volatile.acidity citric.acid residual.sugar free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol chlorides_Lots 
output(s): ..y 
options were - entropy fitting  decay=0.1128838

Example: neural networks (regression)#

data(EX9.BIRTHWEIGHT, package='regclass')

# Using the full dataset for training is not usually right if we want to investigate testing performance.
TRAIN <- EX9.BIRTHWEIGHT

REC <- recipe(Birthweight ~ ., TRAIN) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_corr(all_predictors()) %>%
    step_lincomb(all_predictors())

WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(mlp( mode = 'regression', hidden_units = tune(), penalty = tune() ))

GRID <- expand.grid( hidden_units = c(2,3), penalty = 10^seq(-2,1,length=15) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID
    )

METRICS <- collect_metrics(RES)

METRICS
A | warning: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
There were issues with some computations   A: x1
There were issues with some computations   A: x1

A tibble: 60 × 8
   hidden_units     penalty .metric .estimator         mean     n     std_err .config
          <dbl>       <dbl> <chr>   <chr>             <dbl> <int>       <dbl> <chr>
              2  0.01000000 rmse    standard   500.34200416     5 27.78098862 Preprocessor1_Model01
              2  0.01000000 rsq     standard     0.11557329     5  0.05032802 Preprocessor1_Model01
              3  0.01000000 rmse    standard   514.41492660     5 25.03946676 Preprocessor1_Model02
              3  0.01000000 rsq     standard     0.07509909     5  0.02697816 Preprocessor1_Model02
              2  0.01637894 rmse    standard   492.35315438     5 19.92295040 Preprocessor1_Model03
              2  0.01637894 rsq     standard     0.12062614     5  0.02622690 Preprocessor1_Model03
              3  0.01637894 rmse    standard   491.98653197     5 18.37399980 Preprocessor1_Model04
              3  0.01637894 rsq     standard     0.12164642     5  0.03845223 Preprocessor1_Model04
              2  0.02682696 rmse    standard   516.37838731     5 28.15773717 Preprocessor1_Model05
              2  0.02682696 rsq     standard     0.07208546     5  0.03804960 Preprocessor1_Model05
              3  0.02682696 rmse    standard   505.25170670     5 21.36287174 Preprocessor1_Model06
              3  0.02682696 rsq     standard     0.08323415     5  0.04277127 Preprocessor1_Model06
              2  0.04393971 rmse    standard   525.74500034     5 19.65821297 Preprocessor1_Model07
              2  0.04393971 rsq     standard     0.04341909     5  0.02249748 Preprocessor1_Model07
              3  0.04393971 rmse    standard   510.49917381     5 22.08933728 Preprocessor1_Model08
              3  0.04393971 rsq     standard     0.09457031     5  0.01637357 Preprocessor1_Model08
              2  0.07196857 rmse    standard   490.09232893     5 20.78144716 Preprocessor1_Model09
              2  0.07196857 rsq     standard     0.13148280     5  0.02021711 Preprocessor1_Model09
              3  0.07196857 rmse    standard   475.98731310     5 24.07979661 Preprocessor1_Model10
              3  0.07196857 rsq     standard     0.17049092     5  0.04620822 Preprocessor1_Model10
              2  0.11787686 rmse    standard   471.04992567     5 14.09626010 Preprocessor1_Model11
              2  0.11787686 rsq     standard     0.18146804     5  0.02421311 Preprocessor1_Model11
              3  0.11787686 rmse    standard   502.03561725     5 16.16797090 Preprocessor1_Model12
              3  0.11787686 rsq     standard     0.11848598     5  0.02220887 Preprocessor1_Model12
              2  0.19306977 rmse    standard   514.19557350     5 20.74797506 Preprocessor1_Model13
              2  0.19306977 rsq     standard     0.05839087     5  0.01773080 Preprocessor1_Model13
              3  0.19306977 rmse    standard   505.10256344     5 24.47833230 Preprocessor1_Model14
              3  0.19306977 rsq     standard     0.11085628     5  0.02902798 Preprocessor1_Model14
              2  0.31622777 rmse    standard   510.32814056     5 22.04282333 Preprocessor1_Model15
              2  0.31622777 rsq     standard     0.08208713     5  0.03050509 Preprocessor1_Model15
              3  0.31622777 rmse    standard   497.31421593     5 14.03847181 Preprocessor1_Model16
              3  0.31622777 rsq     standard     0.09585007     5  0.03419205 Preprocessor1_Model16
              2  0.51794747 rmse    standard   501.06108491     5 15.59500375 Preprocessor1_Model17
              2  0.51794747 rsq     standard     0.10274716     5  0.02478217 Preprocessor1_Model17
              3  0.51794747 rmse    standard   509.04153544     5 22.07454587 Preprocessor1_Model18
              3  0.51794747 rsq     standard     0.09985624     5  0.03019326 Preprocessor1_Model18
              2  0.84834290 rmse    standard   493.68760683     5 17.52614032 Preprocessor1_Model19
              2  0.84834290 rsq     standard     0.10083736     5  0.02973864 Preprocessor1_Model19
              3  0.84834290 rmse    standard   498.82005941     5  8.34972820 Preprocessor1_Model20
              3  0.84834290 rsq     standard     0.14346320     5  0.02431361 Preprocessor1_Model20
              2  1.38949549 rmse    standard   504.56243534     5 14.14084938 Preprocessor1_Model21
              2  1.38949549 rsq     standard     0.09870968     5  0.02500284 Preprocessor1_Model21
              3  1.38949549 rmse    standard   489.76809898     5 14.97913350 Preprocessor1_Model22
              3  1.38949549 rsq     standard     0.11992775     5  0.02481531 Preprocessor1_Model22
              2  2.27584593 rmse    standard   491.91889212     5 20.71737596 Preprocessor1_Model23
              2  2.27584593 rsq     standard     0.11130928     5  0.04203917 Preprocessor1_Model23
              3  2.27584593 rmse    standard   486.16095784     5 22.57745993 Preprocessor1_Model24
              3  2.27584593 rsq     standard     0.15155921     5  0.03405783 Preprocessor1_Model24
              2  3.72759372 rmse    standard   496.04322669     5 18.10167754 Preprocessor1_Model25
              2  3.72759372 rsq     standard     0.11230776     5  0.01724078 Preprocessor1_Model25
              3  3.72759372 rmse    standard   513.14136223     5 20.32292991 Preprocessor1_Model26
              3  3.72759372 rsq     standard     0.10368869     5  0.02293471 Preprocessor1_Model26
              2  6.10540230 rmse    standard   534.69164622     5 10.94583540 Preprocessor1_Model27
              2  6.10540230 rsq     standard     0.07818785     5  0.01840885 Preprocessor1_Model27
              3  6.10540230 rmse    standard   515.85816570     5 15.29563700 Preprocessor1_Model28
              3  6.10540230 rsq     standard     0.06405564     5  0.02175738 Preprocessor1_Model28
              2 10.00000000 rmse    standard   521.08322173     5 24.08594502 Preprocessor1_Model29
              2 10.00000000 rsq     standard     0.08096775     5  0.01654549 Preprocessor1_Model29
              3 10.00000000 rmse    standard   521.26876409     5 22.82716363 Preprocessor1_Model30
              3 10.00000000 rsq     standard     0.09045355     5  0.02116048 Preprocessor1_Model30
for(metric in unique(METRICS$.metric)){
    metrics <- METRICS %>% filter(.metric==metric)
    plot <- ggplot(metrics, aes(x=penalty, y=mean, color=as.factor(hidden_units))) + geom_line() + scale_x_log10() + labs(title=metric)
    print(plot)
}
[Figures: mean rmse and rsq versus penalty (log scale), one line per number of hidden units.]

The results are very close to each other. What if we add `std_err` as error bars?

for(metric in unique(METRICS$.metric)){
    metrics <- METRICS %>% filter(.metric==metric)
    plot <- ggplot(metrics, aes(x=penalty, y=mean, color=as.factor(hidden_units))) + geom_line() + scale_x_log10() + labs(title=metric)
    plot <- plot + geom_errorbar(aes(ymin=mean-std_err, ymax=mean+std_err))
    print(plot)
}
[Figures: the same tuning curves for rmse and rsq, now with ±1 standard error bars.]

The error bars show that the different models indeed have very similar performance on this particular dataset. In the following, we still select the best model, but technically the others are expected to work about as well as the selected one.
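When many settings sit within one standard error of the best, the `tune` package's `select_by_one_std_err()` is worth knowing about: it picks the simplest model whose performance is within one standard error of the numerically best one. "Simplest" is defined by a sorting choice we supply; here, continuing from the `RES` object above, we prefer larger penalties (more regularization):

```r
# Among settings within 1 SE of the best RMSE, prefer the largest penalty
SIMPLER <- select_by_one_std_err(RES, desc(penalty), metric = 'rmse')
SIMPLER
```

This is one principled alternative to `select_best()` when the tuning curves are as flat as they are here.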

BEST <- select_best(RES, metric = 'rmse')
BEST
A tibble: 1 × 3
  hidden_units   penalty .config
         <dbl>     <dbl> <chr>
             2 0.1178769 Preprocessor1_Model11
MODEL <- WF %>%
    finalize_workflow(BEST) %>%
    fit(TRAIN)
MODEL
══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: mlp()

── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_normalize()
• step_dummy()
• step_nzv()
• step_corr()
• step_lincomb()

── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
a 14-2-1 network with 33 weights
inputs: Gestation MotherAge MotherHeight MotherWeight FatherAge FatherHeight FatherWeight MotherEducation_College MotherEducation_HS FatherRace_Black FatherRace_White Father_Education_College Father_Education_HS Smoking_now 
output(s): ..y 
options were - linear output units  decay=0.1178769

Pros and Cons#

Pros:

  • Can learn nonlinear relationships and create relevant predictors automatically through the use of hidden layers.

  • Does well on problems whose distributions resemble a Normal (symmetric, bell-shaped) curve (many image-, text-, or speech-based problems that humans do well), as long as there is a lot of data.

  • Once trained, predictions are very fast.

  • Deep learning models, which are massive neural networks, are amazingly effective at what they are tuned to do.

Cons:

  • Overhyped, both early in their history and now. Neural networks will not solve every problem ever created.

  • Often doesn’t work the best for business problems where distributions do not resemble a Normal curve.

  • Hard to interpret (like most other models).

  • Computationally intensive to train and tune (like most other good models).

  • A saying well-known among neural network researchers: a neural network is the second best way to solve any problem. The best way is to actually understand the problem.