# Negative Sampling

Knowledge of [[CBOW]] (Continuous Bag of Words: use the context to predict the center word) or [[skipgram]] (continuous skip-gram: use the center word to predict the context) is required.

A naive way to train a model of words is to

- encode input words and output words using vectors,
- use the input word vector to predict the output word vector,
- calculate the error between the predicted output word vector and the real output word vector,
- minimize the errors.
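The steps above can be sketched in a few lines. This is only a minimal illustration under assumptions that are not in the original text: the prediction is read as a softmax over the whole vocabulary, the error is the cross-entropy against the true context word, and the toy vocabulary, `dim`, `lr`, and the helper names `init_matrix` and `naive_step` are all hypothetical.

```python
import math
import random

random.seed(0)

# Toy vocabulary (hypothetical); real vocabularies are far larger.
vocab = ["intended", "extravagant", "display", "to", "attract", "I", "a"]
dim = 4  # embedding size, an arbitrary choice for this sketch


def init_matrix(rows, cols):
    """Small random vectors, one row per word."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


W_in = init_matrix(len(vocab), dim)   # input (center) word vectors
W_out = init_matrix(len(vocab), dim)  # output (context) word vectors


def naive_step(center_id, context_id, lr=0.1):
    """One gradient step using a softmax over the whole vocabulary."""
    h = list(W_in[center_id])
    # Score the center word against EVERY output vector:
    # this is the part whose cost grows with the vocabulary size.
    scores = [sum(w * x for w, x in zip(row, h)) for row in W_out]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[context_id])
    # Error between the predicted distribution and the one-hot true word.
    errors = [p - (1.0 if i == context_id else 0.0) for i, p in enumerate(probs)]
    # Gradient for the center vector, computed before W_out is updated.
    grad_h = [sum(errors[i] * W_out[i][d] for i in range(len(vocab)))
              for d in range(dim)]
    for i, row in enumerate(W_out):  # every output vector gets updated
        for d in range(dim):
            row[d] -= lr * errors[i] * h[d]
    for d in range(dim):
        W_in[center_id][d] -= lr * grad_h[d]
    return loss


center = vocab.index("intended")
context = vocab.index("display")
losses = [naive_step(center, context) for _ in range(50)]
```

Note that both the scoring loop and the update loop run over every word in `vocab` for a single training pair.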

However, it is very expensive to compute a prediction over all possible output words and calculate the error at every training step. A common trick to avoid this is **negative sampling**.

Negative sampling adds a new column to the data that serves as the prediction target: whether the output word is a neighbour of the input word.

Input (Center Word) | Output (Context) | Target (is Neighbour) |
---|---|---|
`intended` | `extravagant` | 1 |
`intended` | `display` | 1 |
`intended` | `to` | 1 |
`intended` | `attract` | 1 |

Now we have a problem: the target is always 1. A network trained on this dataset could learn to output 1 all the time. We need some negative samples to add noise to it, so we randomly sample words from the dictionary and give them the target 0.

Input (Center Word) | Output (Context) | Target (is Neighbour) |
---|---|---|
`intended` | `extravagant` | 1 |
`intended` | `display` | 1 |
`intended` | `to` | 1 |
`intended` | `attract` | 1 |
`intended` | `I` | 0 |
`intended` | `a` | 0 |
`intended` | `intellect` | 0 |
`intended` | `mating` | 0 |
`intended` | `course` | 0 |
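A table like the one above could be generated along the following lines. This is a sketch under assumptions, not the method of any particular library: the token order, the window size of 2, the `dictionary` contents, and the function name `build_training_rows` are all hypothetical, and negatives are drawn uniformly and without replacement.

```python
import random


def build_training_rows(tokens, center_index, dictionary,
                        window=2, num_negatives=5, seed=0):
    """Return (input, output, target) rows for one center word."""
    rng = random.Random(seed)
    center = tokens[center_index]

    # Positive rows: words inside the window around the center word.
    start = max(0, center_index - window)
    stop = min(len(tokens), center_index + window + 1)
    neighbours = [tokens[i] for i in range(start, stop) if i != center_index]
    rows = [(center, word, 1) for word in neighbours]

    # Negative rows: random dictionary words that are neither the center
    # word nor one of its true neighbours.
    excluded = set(neighbours) | {center}
    candidates = [word for word in dictionary if word not in excluded]
    negatives = rng.sample(candidates, k=min(num_negatives, len(candidates)))
    rows += [(center, word, 0) for word in negatives]
    return rows


# Assumed token order; only the center word and its neighbours matter here.
tokens = ["extravagant", "display", "intended", "to", "attract"]
# Hypothetical dictionary to sample negatives from.
dictionary = tokens + ["I", "a", "intellect", "mating", "course", "the", "of"]

rows = build_training_rows(tokens, tokens.index("intended"), dictionary)
for row in rows:
    print(row)
```

Uniform sampling keeps the sketch simple. The original word2vec implementation instead samples negatives from the unigram distribution raised to the power 0.75, so frequent words are picked more often.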

For a more rigorous derivation, see Goldberg2014^{1}.
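As a rough pointer to where such a derivation ends up, the quantity maximized for one center word and one true context word is usually written as below. The symbols are assumptions introduced here for illustration: $v_w$ is the input vector of the center word, $v_c$ the output vector of the true context word, $v_{n_i}$ the output vectors of the $k$ sampled negative words, and $\sigma$ the logistic sigmoid.

$$
\log \sigma(v_c \cdot v_w) + \sum_{i=1}^{k} \log \sigma(-v_{n_i} \cdot v_w),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
$$

The first term pushes rows with target 1 towards a prediction of 1, and each term in the sum pushes a row with target 0 towards a prediction of 0. Only $k + 1$ output vectors are touched per training pair, rather than the whole vocabulary.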

- The Illustrated Word2vec

