On my about page, I detail a little bit about my machine learning (ML) involvement. I’ve worked through a good portion of Coursera’s Machine Learning course, read a few chapters of Jurafsky’s Speech and Language Processing textbook, and learned a lot about the general framework for ML research from assisting with natural language processing (NLP) research.
Despite all that, when I started my internship, I realized my understanding of ML and NLP was still pretty poor. So, I decided to go through a few resources to get my understanding up to speed with my peers. Below, I’ve dumped some notes.
Recurrent Neural Networks (RNN)
- Vanilla neural networks only work with fixed-size inputs and fixed-size outputs. RNNs allow us to have variable-length sequences as both inputs and outputs.
- Examples include machine translation and sentiment analysis
- RNNs are recurrent because they use the same set of weights for each step
- weights: W_xh, W_hh, W_hy
- weights = matrices, biases = vectors
- equations for h_t and y_t:
- h_t = tanh(W_xh * x_t + W_hh * h_(t-1) + b_h)
- y_t = W_hy * h_t + b_y
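To make those two equations concrete, here’s a minimal numpy sketch of one forward step. The sizes and random weights are made up for illustration; variable names mirror the equations above.

```python
import numpy as np

# Made-up sizes for illustration: 10-word vocab, 4 hidden units.
vocab_size, hidden_size = 10, 4
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(size=(vocab_size, hidden_size))   # hidden -> output
b_h = np.zeros(hidden_size)
b_y = np.zeros(vocab_size)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh * x_t + W_hh * h_(t-1) + b_h)
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    # y_t = W_hy * h_t + b_y
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# The same weights are reused at every step of the sequence.
h = np.zeros(hidden_size)
for x_t in np.eye(vocab_size)[[2, 5, 1]]:  # three one-hot inputs
    h, y = rnn_step(x_t, h)
```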
- One-hot vectors
- construct a vocab of all words —> assign an integer index to each word
- the “one” in each one-hot vector is at the word’s corresponding index
- serves as the input to the forward phase
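A quick sketch of the vocab-plus-one-hot setup; the sentence here is just a made-up example.

```python
import numpy as np

sentence = "the cat sat on the mat".split()

# Construct a vocab of all words and assign an integer index to each word.
vocab = {word: i for i, word in enumerate(sorted(set(sentence)))}

def one_hot(word):
    # The "one" sits at the word's corresponding index.
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1
    return v

inputs = [one_hot(w) for w in sentence]  # fed to the forward phase, one per step
```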
- Backward phase
- cross-entropy loss: L = -ln(p_c), where p_c is the predicted probability of the correct class
- use gradient descent to minimize loss
- to fully calculate the gradient of W_xh, use backpropagation through time (BPTT)
- clipping gradient values —> mitigates the exploding gradient problem
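Here’s a small sketch of the loss and the clipping trick; the clip limit of 1.0 is an arbitrary choice on my part, not a recommendation.

```python
import numpy as np

def cross_entropy(probs, correct_idx):
    # L = -ln(p_c), where p_c is the probability assigned to the correct class.
    return -np.log(probs[correct_idx])

def clip_gradients(grads, limit=1.0):
    # Clamp every gradient value into [-limit, limit] so BPTT doesn't explode.
    return [np.clip(g, -limit, limit) for g in grads]
```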
Sequence to Sequence (Seq2Seq) Models (with attention)
- Takes a sequence of items (words, letters, features of images) —> outputs another sequence of items
- An encoder processes the input —> compiles it into a context vector
- the input at each step is one word (represented using a word embedding) from the input sentence + a hidden state
- last hidden state of encoder is passed as context —> decoder
- Limitation: Seq2Seq with no attention —> a single context vector becomes a bottleneck for long sentences
- Using attention, encoder can pass ALL hidden states to decoder (instead of just one hidden state)
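Here’s a rough sketch of the encoder side, with random vectors standing in for word embeddings; it reuses the same kind of recurrent step as the RNN notes above.

```python
import numpy as np

rng = np.random.default_rng(1)
embed_size, hidden_size, seq_len = 8, 4, 5

W_xh = rng.normal(size=(hidden_size, embed_size))   # embedding -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden

embeddings = rng.normal(size=(seq_len, embed_size))  # stand-ins for word embeddings

h = np.zeros(hidden_size)
encoder_states = []
for x_t in embeddings:
    # each step: one word embedding + the previous hidden state
    h = np.tanh(W_xh @ x_t + W_hh @ h)
    encoder_states.append(h)

context = encoder_states[-1]           # no attention: only the last hidden state
all_states = np.stack(encoder_states)  # with attention: pass every hidden state
```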
- Attention decoder
- prepare inputs (encoder hidden states, decoder hidden state)
- score each hidden state
- softmax the scores
- multiply each vector by its softmaxed score (amplifies hidden states with high scores)
- sum up the weighted vectors = context vector
- context vector + hidden state —> feedforward NN —> output word
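A sketch of those steps using simple dot-product scoring with random stand-in vectors. The scoring function is my assumption; real attention decoders often learn it, so treat this as a toy version.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, hidden_size = 5, 4

encoder_states = rng.normal(size=(seq_len, hidden_size))  # one per input word
decoder_state = rng.normal(size=hidden_size)

# score each encoder hidden state against the decoder hidden state
scores = encoder_states @ decoder_state

# softmax the scores
weights = np.exp(scores) / np.exp(scores).sum()

# multiply each hidden state by its softmaxed score, then sum = context vector
context = (weights[:, None] * encoder_states).sum(axis=0)

# context vector + decoder hidden state -> fed to a feedforward layer for the output word
combined = np.concatenate([context, decoder_state])
```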
Convolutional Neural Networks (CNN)
- Neural networks that have a convolutional layer (conv layers have filters, which are 2D matrices of numbers)
- Steps for filtering (ex: vertical 3x3 Sobel filter)
- overlay filter on input image at some location
- perform element-wise multiplication between the values of the filter and the corresponding values in the image
- sum all the element-wise products (the sum = output value for the destination pixel in the output image)
- repeat for all locations
- Sobel filters act as edge detectors!
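Here’s a sketch of those filtering steps as a plain nested loop (no padding here, so the output shrinks); the tiny test image is made up.

```python
import numpy as np

# A vertical 3x3 Sobel filter (a common edge-detecting kernel).
sobel_v = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # overlay the filter, multiply element-wise, sum the products
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                   # a sharp vertical edge down the middle
edges = convolve2d(image, sobel_v)   # large values where the edge is
```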
- Padding: add zeros around the input image —> output image is now the same size as the input
- this is called “same” padding
- no padding = “valid” padding
- Primary parameter for conv layer is the number of filters
- Pooling: reduce the size of the input by pooling values together (e.g. max, min, avg)
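A sketch of “same” zero padding and 2x2 max pooling; the helper names and the assumption that the image dimensions divide evenly by the pool size are mine.

```python
import numpy as np

def pad_same(image, kernel_size=3):
    # "same" padding: add zeros so a conv with this kernel keeps the input's size.
    p = kernel_size // 2
    return np.pad(image, p)

def max_pool(image, size=2):
    # Take the max over each size x size block, shrinking each dimension by `size`.
    h, w = image.shape[0] // size, image.shape[1] // size
    return image[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))
```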
- Softmax layer: fully-connected (dense) layer using the softmax function as its activation
- the digit represented by the node with the highest probability —> output of the CNN
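And the softmax activation itself, with the usual subtract-the-max trick for numerical stability; the logits here are made-up numbers.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the outputs sum to 1.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.array([1.2, 0.3, 4.0])   # made-up outputs of the dense layer
probs = softmax(logits)
prediction = int(np.argmax(probs))   # node with the highest probability
```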
Where to Learn
Victor Zhou’s Blog for RNN and CNN
Jay Alammar’s Blog for Seq2Seq
Google’s Paper on Attention (that I also need to go through)