10. What is broadcasting?
Broadcasting is a concept in NumPy that describes how operations are performed on arrays with different shapes. It vectorizes the operations so the looping occurs in C, which can be orders of magnitude faster than an equivalent Python loop. For two arrays to broadcast together, each pair of trailing dimensions must either be equal or one of them must be one.
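As a small illustration (the array names here are made up), NumPy stretches size-one dimensions so a (3, 1) array and a (4,) array combine into a (3, 4) result:

```python
import numpy as np

# Each trailing dimension must match or be 1; size-1 dimensions are stretched.
a = np.array([[0], [10], [20]])  # shape (3, 1)
b = np.array([1, 2, 3, 4])       # shape (4,)
result = a + b                   # broadcast to shape (3, 4); the loop runs in C
print(result.shape)   # (3, 4)
print(result[1])      # [11 12 13 14]
```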
11. Are metrics generally calculated using the training set or the validation set? Why?
The model evaluation stage of the machine learning process uses metrics to evaluate the performance of the trained model on the validation set rather than the training set, because metrics computed on the training data can’t reveal overfitting. The metrics are used to detect overfitting and to tune the hyperparameters to improve the model’s performance. A final model trained with the best hyperparameters is then evaluated on the test set.
12. What is SGD?
Stochastic Gradient Descent (SGD) is an algorithm in machine learning that’s used to find the model parameters that produce the best fit between the predicted values and the actual values. It calculates the gradient using randomly selected instances of the training data and updates the model parameters on each iteration, which greatly reduces the computational burden of full-batch gradient descent. The noise in these updates can also nudge the parameters out of a local minimum and toward the global minimum.
13. Why does SGD use mini-batches?
Optimization algorithms calculate the gradients using one or more data items. They can use the average over the whole dataset, but that takes a long time per step and may not fit into memory; they can use a single data item, which is fast but imprecise and unstable. A mini-batch of a few data items strikes a balance: the averaged gradient becomes more accurate and stable as the batch size increases, while each step stays cheap.
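A minimal pure-NumPy sketch of the trade-off, using a toy loss whose gradient is easy to write down (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, size=1000)   # toy dataset centred on 3.0

def gradient(w, batch):
    # gradient of the loss mean((w - x) ** 2) with respect to w
    return 2 * (w - batch).mean()

w, batch_size = 0.0, 32
for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]  # a few items per step
    w -= 0.1 * gradient(w, batch)           # noisier than full-batch, cheaper per step
print(round(w, 2))   # close to 3.0, the loss minimum
```

Each mini-batch gradient is a noisy estimate of the full-batch gradient, but averaged over many steps the parameter still converges.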
14. What are the seven steps in SGD for machine learning?
Imagine being lost in the mountains with your car parked at the lowest point. A good strategy is to always take steps downhill, which eventually leads to the destination. You also need to know how big a step to take, and to keep taking steps until you reach the bottom, which is the parking lot.
1. Initialize the Random Parameters
2. Calculate the Predictions
3. Calculate the Loss
4. Calculate the Gradients
5. Update the Weights
6. Go to Step Two and Repeat the Process
7. Stop When the Model is Good Enough
15. How do we initialize the weights in a model?
The first step in training a model is to initialize the parameters, which are also referred to as the weights and biases. They can be initialized with plain random numbers, which works most of the time, except when training neural networks with many layers, where it can cause exploding or vanishing gradients. Special weight initialization techniques still use random numbers but scale them so the gradients stay within a reasonable range.
16. What is loss?
Loss is an evaluation metric that’s used in machine learning to measure how wrong the predictions are. It calculates the distance between the predicted values and the actual values where zero represents a perfect score. It also gets calculated using one of several different loss functions that vary based on whether the model is solving a classification or a regression problem.
17. Why can’t we always use a high learning rate?
Learning Rate is a hyperparameter that’s used in machine learning to control how much to adjust the weights at each iteration of the training process. It can be too low, which makes training take too long and makes the model more likely to get stuck in a local minimum. It can also be too high, which overshoots the global minimum and bounces around without ever reaching it.
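A small sketch of why: minimizing f(w) = w² (gradient 2w) starting from w = 1.0, a modest learning rate converges, while a rate above 1.0 overshoots the minimum on every step and diverges:

```python
def descend(learning_rate, steps=20):
    w = 1.0
    for _ in range(steps):
        w -= learning_rate * 2 * w   # gradient of w**2 is 2*w
    return w

print(abs(descend(0.1)))   # shrinks toward the minimum at 0
print(abs(descend(1.1)))   # grows without bound: each step overshoots
```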
18. What is a gradient?
The Gradient is a vector that’s used in machine learning to identify the direction in which the loss function produces the steepest ascent. It measures the change in all weights with regard to the change in error. It also gets used to update the weights during the training process where the product of the gradient and learning rate is subtracted from the weights.
19. Do you need to know how to calculate gradients yourself?
No, it’s not necessary to know how to manually calculate gradients. They can be calculated automatically with respect to the associated variable using the requires_grad_ method of the Tensor class in the PyTorch library. It tags the variable so that every operation applied to the tensor is tracked, which allows backward propagation to calculate the gradients.
variable_name = torch.tensor(3.).requires_grad_()
20. Why can’t we use accuracy as a loss function?
Accuracy isn’t good to use as a loss function because it only changes when the predictions of the model change. It can improve the confidence of its predictions, but unless the predictions actually change, the accuracy will remain the same. It also produces gradients that are mostly equal to zero which prevents the parameters from updating during the training process.
21. Draw the sigmoid function. What is special about its shape?
The sigmoid function is an activation function that’s named after its shape, which resembles the letter “S” when plotted. It has a smooth curve that gradually transitions from values just above 0.0 to values just below 1.0. It is also monotonically increasing, which makes it easier for SGD to find meaningful gradients.
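A rough text plot (a sketch of the shape, not a substitute for a real chart) shows the S-shaped curve:

```python
import math

def sigmoid(x):
    # smooth, monotonically increasing, bounded between 0 and 1
    return 1 / (1 + math.exp(-x))

for x in range(-6, 7, 2):
    print(f"{x:+d} | {'*' * int(sigmoid(x) * 40)}")
```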
22. What is the difference between a loss function and a metric?
The loss function is used to evaluate and diagnose how well the model is learning during the optimization step of the training process. It responds to small changes in confidence levels, which helps to minimize the loss and monitor for things like overfitting, underfitting, and convergence. It gets calculated for each item in the dataset, and at the end of each epoch the loss values are averaged and the overall mean is reported.
The metric is used to evaluate the model and perform model selection during the evaluation process after the training process. It provides an interpretation of the performance of the model that’s easier for humans to understand which helps give meaning to the performance in the context of the goals of the overall project and project stakeholders. It also gets printed at the end of each epoch which reports the performance of the model.
23. What is the function to calculate new weights using a learning rate?
The optimizer is the function that updates the weights based on the gradients during the optimization step of the training process, using the rule new_weight = weight - learning_rate * gradient. Training starts by defining some kind of loss function and proceeds by minimizing that loss using one of the optimization routines. The choice of optimizer can also make the difference between reaching a good accuracy in hours or in days.
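A plain-Python sketch of the update rule, using a toy loss f(w) = w² whose gradient is 2w (the names are illustrative):

```python
def step(weight, gradient, learning_rate):
    # the core SGD update: move against the gradient, scaled by the learning rate
    return weight - learning_rate * gradient

w = 2.0
grad = 2 * w                          # gradient of w**2 at w = 2
w = step(w, grad, learning_rate=0.1)
print(w)   # 2.0 - 0.1 * 4.0 = 1.6
```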
24. What does the DataLoader class do?
DataLoader is a class that’s used in PyTorch to preprocess the dataset into the format that’s expected by the model. It specifies the dataset to load, randomly shuffles the dataset, creates the mini-batches, and loads the mini-batches in parallel. It also returns a dataloader object that contains tuples of tensors that represent the batches of independent and dependent variables.
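A minimal plain-Python sketch of that behavior (the real class also collates items into tensors and loads batches in parallel; the function name here is made up):

```python
import random

def simple_dataloader(dataset, batch_size, shuffle=True, seed=0):
    # shuffle the indices, then yield fixed-size mini-batches
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

dataset = list(zip(range(8), "abcdefgh"))  # (independent, dependent) pairs
batches = list(simple_dataloader(dataset, batch_size=3))
print(len(batches))       # 3 batches
print(len(batches[-1]))   # the last batch holds the 2 leftover items
```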
25. Write pseudocode showing the basic steps taken in each epoch for SGD.
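One epoch can be sketched as (pseudocode; the names follow the surrounding answers):

```
for x_batch, y_batch in dataloader:             # get a mini-batch
    predictions = model(x_batch)                # forward pass
    loss = loss_function(predictions, y_batch)  # calculate the loss
    loss.backward()                             # calculate the gradients
    for p in parameters:
        p.data -= learning_rate * p.grad        # update the weights
        p.grad = None                           # zero the gradients
```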
26. Create a function that, if passed two arguments [1,2,3,4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure?
The output is special because it has the same data structure as the Dataset object that’s used in PyTorch. It contains a list of tuples where each tuple stores an item with the associated label. It also contains all the items and labels from the first and second parameters which are paired at each index.
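One way to write the function the question asks for, using Python’s built-in zip:

```python
def pair(items, labels):
    # zip pairs each item with the label at the same index
    return list(zip(items, labels))

print(pair([1, 2, 3, 4], "abcd"))  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```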
27. What does view do in PyTorch?
View is a method that’s used in PyTorch to reshape a tensor without changing its contents. It doesn’t create a copy of the data, which allows for memory-efficient reshaping, slicing, and element-wise operations. It also shares the underlying data with the original tensor, which means any changes made to the data through the view will be reflected in the original tensor.
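The sharing behavior can be sketched with NumPy’s reshape, which acts the same way on contiguous data (view itself is the PyTorch method):

```python
import numpy as np

original = np.arange(6)            # [0 1 2 3 4 5]
reshaped = original.reshape(2, 3)  # same six elements, viewed as 2 rows of 3
reshaped[0, 0] = 99                # write through the view...
print(original[0])                 # 99: the underlying buffer is shared
```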
28. What are the bias parameters in a neural network? Why do we need them?
Bias is a parameter that’s used in machine learning to offset the output inside the model to better fit the data during the training process. It shifts the activation function to the left or right which moves the entire curve to delay or accelerate the activation. It also gets added to the product of the inputs and weights before being passed through the activation function.
output = sum(inputs * weights) + bias
29. What does the @ operator do in Python?
@ is an operator that’s used in Python to perform matrix multiplication between two arrays. It performs the same operation as the matmul function from the NumPy library. It also makes matrix formulas much easier to read which makes it much easier to work with for both experts and non-experts.
np.matmul(np.matmul(np.matmul(A, B), C), D)  # NumPy function form
A @ B @ C @ D                                # equivalent @ form
30. What does the backward method do?
Backward is a method that’s used in PyTorch to calculate the gradient of the loss. It performs the backpropagation using the backward method in the Tensor class from the PyTorch library. It also adds the gradients to any other gradients that are currently stored in the grad attribute in the tensor object.
31. Why do we have to zero the gradients?
In PyTorch, the gradients accumulate on subsequent backward passes by default. It helps train recurrent neural networks that work with time-series data where the backpropagation is repeated to perform backpropagation through time. It also must be manually set to zero for most neural networks before the backward pass is performed to update the parameters correctly.
learning_rate = 1e-5
parameters.data -= learning_rate * parameters.grad.data
parameters.grad = None
32. What information do we have to pass to Learner?
Learner is a class that’s used in Fastai to train the model. It specifies the data loaders and model objects that are required to train the model and perform transfer learning. It can also specify the optimizer function, loss function, and other optional parameters that already have default values.
learner = Learner(dataloaders, model, loss_func=loss_function, opt_func=optimizer_function, metrics=metrics)
33. Show Python or pseudocode for the basic steps of a training loop.
Training is a process in machine learning that’s used to build a model that can make accurate predictions on unseen data. It involves an architecture, dataset, hyperparameters, loss function, and optimizer. It also involves splitting the dataset into training, validation, and testing data, making predictions about the data, calculating the loss, and updating the weights.
for _ in range(epochs):
    predictions = model(x_batch, parameters)
    loss = loss_function(predictions, labels)
    loss.backward()
    for parameter in parameters:
        parameter.data -= learning_rate * parameter.grad.data
        parameter.grad = None
34. What is ReLU? Draw a plot of it for values from -2 to +2.
Rectified Linear Unit (ReLU) is an activation function that’s used in machine learning to address the vanishing gradient problem. It passes all positive values through unchanged and replaces all negative values with zero. However, when too many activations are zero, the model’s ability to train degrades, because the gradient of zero is zero, which prevents those parameters from being updated during the backward pass.
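A rough text plot for -2 to +2 (a sketch of the shape, not a real chart):

```python
def relu(x):
    # zero for negatives, the identity for positives, with a kink at 0
    return max(0.0, x)

for i in range(-4, 5):
    x = i / 2                      # -2.0 to 2.0 in steps of 0.5
    print(f"{x:+.1f} | {'*' * int(relu(x) * 4)}")
```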
35. What is an activation function?
Activation Function is a function that’s used in machine learning to decide whether the input is relevant or irrelevant. It gets attached to each neuron in the artificial network and determines whether to activate based on whether the input is relevant for the prediction of the model. Some activation functions also bound the output of each neuron, for example between 0 and 1 for sigmoid or between -1 and 1 for tanh.
output = activation_function(parameters)
36. What’s the difference between F.relu and nn.ReLU?
F.relu is a function that’s used in PyTorch to apply the rectified linear unit inside a model whose forward pass is defined manually in a class, where the layers and functions are defined as class attributes. It computes exactly the same function as the nn.ReLU class.
nn.ReLU is a class that’s used in PyTorch to apply the rectified linear unit in a model that’s defined using sequential modules. It is used alongside other modules that represent the layers and functions of the artificial neural network, and it computes exactly the same function as F.relu.
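The equivalence can be checked directly (assuming PyTorch is installed):

```python
import torch
from torch import nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])
print(F.relu(x))      # functional form: tensor([0., 0., 2.])
print(nn.ReLU()(x))   # module form, identical result
```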
37. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
An artificial neural network with just two layers of weights and a nonlinear activation function can approximate any function, but there are practical benefits to using more layers. It turns out that smaller matrices with more layers perform better than large matrices with fewer layers, which means the model trains faster, uses fewer parameters, and takes up less memory.