Natural language processing (NLP) is a field of computer science concerned with analyzing the language that humans use every day.
An enormous number of sentences are written on the Internet, including on social media and blogs. Because analyzing them all requires tremendous processing power, there is strong demand for more efficient analysis methods.
Released in 2013, Word2vec was a breakthrough in natural language processing despite its relative simplicity. In this article, we give an overview of Word2vec, explain basic concepts such as the “semantic vectors” that underlie the technique, and suggest how to study it further.
- What is Word2vec
  - History of Word2vec development
  - Natural language processing
- What you can do with Word2vec
  - Sentiment analysis
  - Arithmetic processing
  - Automatically generate sentences in combination with RNN
- How Word2vec works
  - CBoW model and Skip-gram model
- Use cases of Word2vec
  - Analysis of recommendations
  - Review analysis
  - Machine translation
  - Q&A system
- How to study Word2vec
What is Word2vec
Word2vec is a natural language processing technique that converts words in sentences into numerical vectors and grasps their meanings. Since it can be implemented using Python, which is relatively easy to learn, it is a natural language processing method that is easy to use even for beginners.
History of Word2vec development
Word2vec is a technique proposed by Google researcher Tomas Mikolov in 2013. It is more accurate than conventional natural language processing methods, and in particular it dramatically improved accuracy in capturing the meaning of text.
Word2vec is a technique that enables a vector representation of words called “semantic vectors”. Hence the name Word2vec (word to vector). Its design concept is based on the distributional hypothesis that “words appearing in the same context have similar meanings”.
Natural language processing
Among natural language processing methods, there are two typical methods of representing words as vectors: “one-hot representation” and “distributed representation.” Word2vec’s “semantic vector” corresponds to distributed representation.
For example, if the words appearing in a text are “Alice, Bob, Carol, Dave”, the one-hot representation for “Alice” would be (1, 0, 0, 0).
Similarly, “Bob” becomes (0, 1, 0, 0), “Carol” (0, 0, 1, 0), and “Dave” (0, 0, 0, 1): each word gets a 1 in its own dimension and 0 everywhere else.
There are two drawbacks to the one-hot representation.
The first is that operations between the vectors do not yield meaningful results, since every pair of distinct one-hot vectors is orthogonal. The second is that the dimensionality grows with the vocabulary and becomes enormous. For example, if 10,000 distinct words appear in a text, the one-hot table above becomes a 10,000 × 10,000 matrix.
Distributed representation is a method of expressing each word with a dense vector of about a few hundred dimensions. In this scheme, a word like “Alice” might be represented by real-valued components such as (0.8, 0.1, 0.3, …) rather than a single 1 among 0s.
In contrast to the one-hot representation, in which every vector value is either 1 or 0, the distributed representation enables meaningful operations between words. This allows us to calculate similarities between them.
In addition, since each word is represented by a vector of only a few hundred dimensions, distributed representation keeps the data compact even when the vocabulary is enormous. In this way, distributed representation overcomes both disadvantages of the one-hot representation.
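The contrast above can be sketched in a few lines of Python. The dense vectors below are invented purely for illustration; real distributed representations are learned from text by a model such as Word2vec.

```python
import numpy as np

# One-hot representation: one dimension per vocabulary word.
vocab = ["Alice", "Bob", "Carol", "Dave"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Any two distinct one-hot vectors are orthogonal, so their dot
# product is 0 and tells us nothing about similarity.
print(one_hot["Alice"] @ one_hot["Bob"])  # 0.0

# Distributed representation: a dense, low-dimensional vector per word
# (illustrative values only; real embeddings are learned from text).
dense = {
    "Alice": np.array([0.8, 0.1, 0.3]),
    "Bob":   np.array([0.7, 0.2, 0.4]),
    "Carol": np.array([-0.5, 0.9, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Dense vectors support meaningful similarity comparisons: here
# "Alice" is closer to "Bob" than to "Carol".
print(cosine(dense["Alice"], dense["Bob"]) >
      cosine(dense["Alice"], dense["Carol"]))  # True
```

Note that the one-hot table would grow quadratically with the vocabulary, while the dense table grows only linearly, which is the size advantage described above.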
What you can do with Word2vec
Word2vec makes it possible to mathematically express the “meaning” of words through distributed representation.
Closeness of meaning can then be measured numerically and even computed with arithmetic (addition and subtraction), a way of grasping word meaning completely different from how our brains and minds do it.
Sentiment analysis
Word2vec can be used for “sentiment analysis”, which analyzes the emotions contained in text.
For example, the semantic vectors of the words appearing in a sentence can be compared with those of the six basic emotions proposed by American psychologist Paul Ekman: anger, disgust, fear, happiness, sadness, and surprise. This measures how close the sentence is to each of the six basic emotions.
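A minimal sketch of this idea is shown below, assuming toy 2-dimensional word vectors (invented for illustration; a real system would take them from a trained Word2vec model) and only two of the six emotions.

```python
import numpy as np

# Toy "semantic vectors" (invented for illustration; a real system
# would load these from a trained Word2vec model).
vectors = {
    "happiness": np.array([1.0, 0.2]),
    "sadness":   np.array([-1.0, 0.1]),
    "great":     np.array([0.9, 0.3]),
    "day":       np.array([0.1, 0.8]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Score a sentence by averaging its word vectors and comparing the
# result with each basic-emotion vector.
sentence = ["great", "day"]
avg = np.mean([vectors[w] for w in sentence], axis=0)
scores = {e: cosine(avg, vectors[e]) for e in ("happiness", "sadness")}
print(max(scores, key=scores.get))  # happiness
```

Averaging word vectors is only one simple way to get a sentence vector; more sophisticated aggregation schemes exist, but the comparison principle is the same.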
Arithmetic processing
Because Word2vec represents words not just as 0s and 1s but as real-valued vectors, it can perform arithmetic such as adding and subtracting words.
This enables the famous operation “king − man + woman ≈ queen”.
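The analogy above can be reproduced with toy vectors. The embeddings below are hand-picked so that one axis roughly encodes "royalty" and the other "gender"; real Word2vec embeddings have hundreds of dimensions and are learned from text.

```python
import numpy as np

# Toy embeddings chosen so that the royalty direction is roughly
# (1, 0) and the gender direction roughly (0, 1); invented for
# illustration, not taken from a trained model.
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman should land near "queen".  As is standard for
# analogy queries, the input word itself is excluded from candidates.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w != "king"),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```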
Automatically generate sentences in combination with RNN
Word2vec can be combined with an RNN (Recurrent Neural Network) to generate sentences automatically.
For example, it is possible to train on Natsume Soseki’s writing and generate new text in the “Natsume Soseki style”. With an RNN, even habits such as his word order and frequently used expressions can be reproduced.
How Word2vec works
Word2vec lets you adjust several parameters, such as the type of model, how surrounding words are sampled from the text (the context window), and the number of dimensions of the vectors.
In the following, we introduce the mechanism of the two neural network models built into Word2vec: CBoW (Continuous Bag-of-Words Model) and Skip-gram (Continuous Skip-gram Model).
CBoW model and Skip-gram model
| | CBoW | Skip-gram |
| --- | --- | --- |
| Structure | Predicts a word from surrounding words | Predicts surrounding words from a word |
The basic mechanism of Word2vec is to predict words in a sentence using a two-layer neural network (input layer → intermediate layer → output layer). The two models make their predictions in completely opposite directions.
CBoW predicts a word from its surrounding words (context), and Skip-gram predicts the surrounding words (context) from a given word. In other words, each model solves a fill-in-the-blank problem: CBoW fills in the blank in “The cat ___ on the mat” from the context, while Skip-gram, given “sat”, predicts the words around it.
Although Skip-gram generally requires more processing time, it is said to be more accurate.
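The difference between the two models is easiest to see in the training pairs they are fed. The sketch below builds both kinds of (input, target) pairs from a token sequence, assuming a context window of one word on each side; the window size and tokens are illustrative choices.

```python
# Generate (input, target) training pairs for both models from a
# token sequence, using a context window of 1 on each side.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
window = 1

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(tokens):
    # Collect the surrounding words (the context) of position i.
    context = [tokens[j]
               for j in range(max(0, i - window),
                              min(len(tokens), i + window + 1))
               if j != i]
    # CBoW: predict the center word from its whole context.
    cbow_pairs.append((context, center))
    # Skip-gram: predict each context word from the center word.
    for c in context:
        skipgram_pairs.append((center, c))

print(cbow_pairs[1])       # (['the', 'sat'], 'cat')
print(skipgram_pairs[:2])  # [('the', 'cat'), ('cat', 'the')]
```

Because Skip-gram emits one training pair per (center, context-word) combination while CBoW emits one per center word, Skip-gram processes more pairs, which is consistent with its longer training time noted above.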
Use cases of Word2vec
Analysis of recommendations
In recommendations, an algorithm called “collaborative filtering” is widely used. Collaborative filtering works like this: “this user is interested in product A; people interested in product A tend to also be interested in product B; so let’s recommend product B to this user.”
In today’s Internet space, where the number of users and products is growing rapidly, collaborative filtering requires a tremendous amount of processing.
However, by applying Word2vec to product-ID and user-ID data, the similarity between product vectors can directly represent the “recommendation level”, so the processing behind the recommendation system can be greatly simplified.
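A common way to do this (sometimes called item2vec) is to treat each user's sequence of product IDs as a "sentence" whose "words" are products. The sketch below shows that data preparation plus a toy similarity lookup; the sessions and product vectors are invented for illustration, and a real pipeline would train an actual Word2vec implementation on the sessions instead of hand-writing vectors.

```python
import numpy as np

# Each user's browsing/purchase history becomes a "sentence" whose
# "words" are product IDs: exactly the input format Word2vec expects.
sessions = [
    ["p1", "p2", "p3"],   # user A
    ["p2", "p3", "p4"],   # user B
    ["p1", "p3"],         # user C
]
# (A real pipeline would now train a Word2vec model on `sessions`.)

# Toy product vectors standing in for trained embeddings (invented).
emb = {
    "p1": np.array([0.9, 0.1]),
    "p2": np.array([0.8, 0.3]),
    "p3": np.array([0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommend(product, k=1):
    """Return the k products whose vectors are closest to `product`."""
    others = [p for p in emb if p != product]
    return sorted(others, key=lambda p: cosine(emb[product], emb[p]),
                  reverse=True)[:k]

print(recommend("p1"))  # ['p2']
```

Once embeddings are trained, each recommendation is just a nearest-neighbor lookup in vector space, which is the simplification described above.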
Review analysis
By analyzing reviews of a service using Word2vec’s sentiment-analysis technique described above, you can understand the sentiment each review expresses.
Therefore, it is possible to grasp the general emotional trend without directly reading a large number of reviews. These data can be used to improve products, develop new products, and predict trends.
Machine translation
Word2vec developer Tomas Mikolov was himself interested in applying the technology to machine translation. Word2vec’s mechanism of predicting a word from its surrounding words (or vice versa), combined with deep learning, has dramatically improved the accuracy of machine translation.
In the official tutorial of Google’s deep learning framework “TensorFlow”, you can experience building a model that translates from Spanish to English by combining TensorFlow and Word2vec.
Q&A system
Word2vec is also applied to question-and-answer systems, so-called chatbots.
Here, the sentence-generation technique described above is used; by introducing it into a company’s Q&A system or similar, a question-and-answer service that gives more natural and accurate answers can be realized.
How to study Word2vec
Finally, here are some materials for studying how to use Word2vec in more detail.
Books that cover not only the theoretical mechanism of Word2vec but also other natural language processing methods are recommended.
Also, although it is more advanced, if you want a firm theoretical grounding in natural language processing in general, including Word2vec, the lecture series CS224N: Natural Language Processing with Deep Learning | Winter 2019, which Stanford University has released for free on YouTube, is a good reference.
The advent of Word2vec was a breakthrough in natural language processing. Its mechanism is simple yet widely applicable; it can truly be called an ingenious invention.
Word2vec can be introduced relatively easily even by people who are not familiar with natural language processing. Why not actually run Word2vec once and experience the appeal of natural language processing for yourself?