NEGATIVE-SAMPLING WORD-EMBEDDING METHOD

One of the best-known authors of the method is Tomas Mikolov; his software and the theory behind it are the main subjects of our consideration here, and it is worth noting from the start that the treatment is mathematically oriented. The use of embedding models to turn knowledge graphs (KGs) into vector spaces has become a well-known field of research, and in recent years a plethora of embedding-learning approaches have been proposed in the literature. Many of these models rely on the data already stored in the input KG. Under the closed-world assumption, knowledge not present in the KG cannot be judged untrue; it may only be labelled as unknown. On the other hand, embedding models, like most machine-learning algorithms, require negative instances to learn embeddings efficiently, and a variety of negative-sample generation strategies have been developed to deal with this. Mikolov's own contribution is primarily mathematical: his method provides first a theoretical and then a practical solution to the problem we analyse here. Dense vector word representations have lately gained popularity as fixed-length features for machine-learning algorithms, and Mikolov's system is now widely used. We investigate one of its main components, negative sampling, which indicates improbable word-context pairs and excludes them, and which scales efficiently because it performs a single action at a time when processing the recognition of a vector or word. Throughout, it is important to pay attention to the mathematical theory and to understand the role of the neural network in this field.


Introduction
It is better to start with the fact that the development of computer technology began in the twentieth century; since the 1920s, computing machines have gradually evolved into full-fledged personal computers, vivid examples of which can already be observed in the 1960s. With the development of computer technology there arose an urgent need to enter texts, as computers became a means not only of calculation but also of text input and search. Embedding itself can be understood as a translation of text into the language of the computer. At the level of processor architecture there is only an open and a closed gate, the presence of current and its absence; hence the existence of the 1 and 0 states. The simplest idea that comes to mind is to number words directly: with sequences of 0s and 1s, n binary digits give 2^n − 1 distinct non-zero codes, more than enough for the 26 letters of the alphabet. But this method is extremely inconvenient and wastes processing capacity. With programming languages, and with formulas that express whole operations at once, we can instead consider single actions that reduce the load on the processor's computing power, and in this way we find more acceptable approaches to embedding. This paper presents such a method of embedding and shows the convenience of recognising and using the formulas below. It is also worth noting that this way of embedding is already used by search engines such as Google and Yandex, where word-recognition methods of this kind run on the servers that answer queries. In this article we will answer these questions: What is this method? How is it applied, and where? What formulas exist?
Graphs and an illustrative demonstration of the translation of a word into a form a program can work with are presented. These and many other questions to which you would like to find answers may be addressed in this article. [1, p. 25] The following sections introduce the method itself and its application in practice.

Preliminaries
In the first part it is worth considering conditional probabilities over words and contexts taken from the corpus. For a word w and a context c, the model assigns a conditional probability p(c | w; θ), and the corpus-wide objective is

arg max_θ ∏_{w ∈ Text} ∏_{c ∈ C(w)} p(c | w; θ),   (1)

where C(w) is the set of contexts of the word w. Equation (1) generates, for each word, the probabilities of its possible contexts. An alternative formulation tracks the word-context pairs directly:

arg max_θ ∏_{(w,c) ∈ D} p(c | w; θ),   (2)

where D is the set of all word-context pairs that we extract from the selected text. This formulation is equivalent to the first, but having the basic set D at hand makes accurate calculation more convenient.
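As a rough illustration of equation (2), p(c | w; θ) can be parameterised by word and context vectors and a softmax over the context vocabulary. This is a sketch with invented sizes and names, not Mikolov's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

V, dim = 5, 4                             # toy vocabulary size and embedding dimension
word_vecs = rng.normal(size=(V, dim))     # v_w, one row per word
context_vecs = rng.normal(size=(V, dim))  # v_c, a separate matrix for contexts

def p_context_given_word(c, w):
    """Softmax parameterisation of p(c | w; theta):
    exp(v_c . v_w) / sum over c' of exp(v_c' . v_w)."""
    scores = context_vecs @ word_vecs[w]
    scores -= scores.max()                # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c]

# Sanity check: the probabilities over all contexts of a word sum to one.
total = sum(p_context_given_word(c, 0) for c in range(V))
```

The point of the sketch is the cost hidden in the denominator: every evaluation of p(c | w; θ) sums over the whole context vocabulary, which is exactly what negative sampling avoids.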

Main body
Meanwhile, Mikolov introduced a new negative-sampling objective that is much more efficient in terms of both probability modelling and computational resources. This method is built on top of the skip-gram model and reasons about the probability that a pair was observed. [2, p. 20] Consider a pair (w, c) of a word and a context: was it taken from the training corpus? Let us denote by p(D = 1 | w, c) the probability that (w, c) was obtained directly from the corpus, and by p(D = 0 | w, c) = 1 − p(D = 1 | w, c) the probability that it was not. As before, we assume that there are parameters θ governing the distribution p(D = 1 | w, c; θ).
Throughout the computation we assume that words and contexts are taken from separate vocabularies, so that the vector associated with the word "airplane" will be quite different from the vector associated with the context "airplane". This follows from the very logic of separating a word's direct meaning from the contexts it appears in: the same surface form plays two different roles, and each role gets its own vector. One reason for the separation is the following. A word rarely appears in its own immediate context, so the model should assign a low probability to a pair such as p(airplane | airplane); with a single shared vector this would require a low value of v · v, but v · v is the squared norm of the vector, which cannot be pushed down arbitrarily, so we may regard it as impossible. Questions of this kind appear everywhere, and we therefore proceed by dividing a word into its role as a target and its role as a context, and then performing the matching operation between the two sets of vectors.
Our task now is to build on this model so as to maximise the probability of the observed data. Let us express this in the formulas below, to demonstrate the necessary steps: this maximisation is what makes the method viable and adapted to working with a negative sample.
The quantity p(D = 1 | w, c; θ) in this model can be parameterised by a softmax-style function of the vectors, and the corresponding optimisation objective looks as follows:

arg max_θ ∏_{(w,c) ∈ D} p(D = 1 | w, c; θ).

It is indicative that this problem has a trivial solution: if we set θ so that p(D = 1 | w, c; θ) = 1 for every pair (w, c), the objective is maximised outright. This is not difficult to achieve by choosing θ such that v_c = v_w and v_c · v_w = K for every v_c, v_w, where K is a sufficiently large number (in practice the probability is indistinguishable from 1 already at K ≈ 40). Such a solution is worth considering for theoretical understanding, but it tells us nothing about meaning: words with very different meanings end up with identical vectors, even though the probability of a word given a similar word and the probability given an unrelated word should differ greatly. We therefore need a mechanism that prevents all vectors from taking the same value, i.e. that prohibits some combinations (w, c). A good way to achieve this is to require that for some pairs (w, c) the probability p(D = 1 | w, c; θ) must be low. For pairs that are not observed in the corpus, we reach this goal by generating a set D' of random pairs (w, c) and assuming that they are all incorrect (the name "negative sampling" comes precisely from this set D' of randomly sampled negative examples). The goal of the optimisation is then to tell the observed pairs apart from the generated negatives.
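The trivial solution is easy to verify numerically. The check below is pure arithmetic, assuming nothing beyond the formulas above: once every dot product v_c · v_w equals a large constant K, σ(K) saturates to 1 in double precision.

```python
import math

def sigmoid(x):
    # p(D = 1 | w, c) = 1 / (1 + e^(-v_c . v_w))
    return 1.0 / (1.0 + math.exp(-x))

# If v_c = v_w and v_c . v_w = K for every pair, each factor of the
# objective equals sigmoid(K), which saturates for large K.
print(sigmoid(10))   # ~0.9999546
print(sigmoid(40))   # indistinguishable from 1 in double precision
```

This is why the unconstrained objective is degenerate: nothing in it penalises setting every dot product to the same large K.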
If we let σ(x) = 1 / (1 + e^(−x)), the objective becomes:

arg max_θ ∏_{(w,c) ∈ D} p(D = 1 | w, c; θ) · ∏_{(w,c) ∈ D'} p(D = 0 | w, c; θ)
= arg max_θ Σ_{(w,c) ∈ D} log σ(v_c · v_w) + Σ_{(w,c) ∈ D'} log σ(−v_c · v_w).

Following this construction there is a difference from Mikolov's presentation, and it is worth making it clear to readers: we state the goal over the corpus D ∪ D' as a whole, while Mikolov and his co-authors state it per example, for one pair (w, c) ∈ D together with k pairs (w, c_j) ∈ D', following the basic law of constructing D' in the negative-sampling method. [4, p. 30] In particular, with a negative-sampling rate of k, Mikolov and the related authors construct D' to be k times larger than D: for each observed pair (w, c) ∈ D we create k samples (w, c_1), . . . , (w, c_k), where each c_j is drawn from the unigram distribution raised to the 3/4 power. This is equivalent to drawing the pairs (w, c) in D' from the distribution (w, c) ∼ p_words(w) · p_contexts(c)^(3/4) / Z, where p_words(w) and p_contexts(c) are the unigram distributions of words and of contexts respectively, and Z is a normalisation constant. In the work of Mikolov and his co-authors each context is itself a word (all words also appear as contexts), so we may take p_contexts(x) = p_words(x) = count(x) / |Text|. [6, p. 130] A remark on the topic: the main significant difference from the skip-gram model described earlier is that this formulation does not model p(c | w) directly, but rather a quantity related to the joint distribution of w and c.
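The noise distribution for D' can be sketched as follows. The toy corpus and names are invented for the example, but the 3/4-power weighting follows the formula just given:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Invented toy corpus standing in for |Text|.
text = "the plane is an aircraft and the plane flies".split()
counts = Counter(text)
vocab = sorted(counts)

# Unigram distribution count(x)/|Text| raised to the 3/4 power,
# normalised by the constant Z (here, weights.sum()).
weights = np.array([counts[w] ** 0.75 for w in vocab])
noise_dist = weights / weights.sum()

def sample_negatives(k):
    """Draw k negative contexts c_j ~ p_words(c)^(3/4) / Z."""
    return list(rng.choice(vocab, size=k, p=noise_dist))

negatives = sample_negatives(5)
```

The 3/4 exponent flattens the distribution slightly, so very frequent words are sampled as negatives a little less often than their raw frequency would dictate.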

From theory to practice
Having understood the theoretical meaning and structure, and having addressed the trivial-solution problem of negative sampling by example, we now show readers the main practical steps for understanding negative sampling in the simplest way. Take, for example, the word "train", or any other word, and trace how the code processes it. [7] The processor, it turns out, can recognise words through single actions, each with a calculated probability. It is also worth keeping in mind that a word's vector is built over a prepared array rather than assembled letter by letter; this saves resources and has clear advantages, since processing a ready array is cheaper than having the processor perform several separate actions to build its own representation each time. Thus, during negative-sampling processing, the processor runs over information in an existing, suitably stored array. [3, p. 20] This greatly reduces the load not only on the processor but also on the service as a whole.
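A minimal sketch of what such code might look like follows. The vocabulary, hyperparameters, and helper names are all invented for the illustration, and real word2vec implementations differ in many details; the point is only that one update pushes σ(v_c · v_w) towards 1 for the observed pair ("train", "station") and towards 0 for the sampled negatives:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["train", "station", "banana", "cloud", "rail"]
idx = {w: i for i, w in enumerate(vocab)}
dim, lr = 8, 0.1
W = rng.normal(scale=0.1, size=(len(vocab), dim))  # word vectors v_w
C = rng.normal(scale=0.1, size=(len(vocab), dim))  # context vectors v_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(word, context, negatives):
    """One stochastic update of the negative-sampling objective."""
    w, c = idx[word], idx[context]
    grad_w = np.zeros(dim)
    # Observed pair: push sigmoid(v_c . v_w) towards 1.
    g = 1.0 - sigmoid(C[c] @ W[w])
    grad_w += g * C[c]
    C[c] += lr * g * W[w]
    # Sampled negatives: push sigmoid(v_cj . v_w) towards 0.
    for neg in negatives:
        n = idx[neg]
        g = -sigmoid(C[n] @ W[w])
        grad_w += g * C[n]
        C[n] += lr * g * W[w]
    W[w] += lr * grad_w

before = sigmoid(C[idx["station"]] @ W[idx["train"]])
for _ in range(50):
    sgns_step("train", "station", ["banana", "cloud"])
after = sigmoid(C[idx["station"]] @ W[idx["train"]])
# 'after' ends up larger than 'before': the model has learned to
# score the observed pair above the random negatives.
```

Note that each step touches only the observed context and k negative rows, never the whole vocabulary; this locality is the efficiency gain over the full softmax.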
Basically, we allude to a selection method over a grid: a square in which the presence of current and its absence are laid out along one side. A fixed sequence of such marks makes it convenient to encode letters; the cells 1A and 1B, for instance, contiguously encode one "a" and one "b". Given such a matrix, we can imagine a complete working software system for word recognition and for plotting the probability of a semantic hit. Next, it should be borne in mind that it is precisely the exclusion of improbable pairs that gives negative sampling its meaning in embedding: the main idea is to eliminate duplicate values, which otherwise clutter the computation with unnecessary work. Having analysed the topic, we can see that by eliminating such a scenario in advance we arrive at the most convenient computational method. [4, p. 20] Nevertheless, the central concept remains the array from which the basic information for processing comes. Without understanding the array it is impossible to imagine semantic selection for words at all. Arrays can take many forms, but the classic notation, in square brackets with the prepared combinations, looks like this: [a, b, c, d, e, f, g, . . .]. Thus we create a word by selection and exclusion in negative processing in the simplest way, which is the essence of embedding. Real implementations are considerably larger; in this work we have examined only the main examples needed for a complete understanding of the working state of the code and of the negative-sampling method. [5, p. 30] Finally, the algorithm: an algorithm is a basic principle in programming and in life, a certain sequence of actions. When processing, first of all we refer to an array or resource.
After accessing the resource, we proceed to identify the variables and, from there, apply the negative-sampling method to rule out repetitions of the improbable pairs that we presented above.

Conclusion
It is worth saying that the main analysis for understanding this article has been given through practical examples backed by references to theoretical material. The negative-sampling method itself was mainly presented in 2013. [9] Speaking about the development of neural networks, one should not forget the basic mathematical theory behind the probability of word selection, [10] for in this context the deep theory is closely tied to the neural network. It is also crucial to concentrate on the real application of the code in practice, and we have applied practical code with reference to the existing developments. The negative-sampling method serves to indicate the possibility of an improbable pairing and to exclude it, and it is deeply focused on a single action at a time when processing the recognition of a vector or a word in its linguistic context. [12, p. 10] For an approximate understanding, we calculated existing points in relation to each other and considered their vectors; in the abstraction presented, one may picture a square showing several adjacent points with similar meanings, from which we must exclude the other points not related to our selection. Speaking of this method, many forget that its main task is to create certainty in the answer during recognition itself.