Abstract
Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For a computer, "society" means "the database," and "use" means "a way to search the database." This paper introduces a new method to measure the similarity between words and phrases based on information distance and Kolmogorov complexity. To do this we use the world-wide-web as the database and Google as the search engine; the method is also applicable to other search engines and databases. The theory is then applied to construct a method that automatically extracts the similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by billions of independent users averages out to provide automatic semantics of useful quality. We present applications of this theory to hierarchical clustering, classification, and language translation.
We also give examples that distinguish between colors and numbers, and between names of paintings by 17th century Dutch masters and titles of English novels; we examine the ability to understand emergencies and primes; and we demonstrate the ability to perform a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.
I. Introduction
Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like "the four-letter genome of a mouse," or "the text of War and Peace by Tolstoy." There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in the background common knowledge of humankind, like "home" or "red". To make computers more intelligent one would like to represent meaning in computer-digestible form. Long-term and labor-intensive efforts like the Cyc project [22] and the WordNet project [33] try to establish semantic relations between common objects, or, more precisely, names for those objects. The idea is to create a semantic web of such vast proportions that rudimentary intelligence, and knowledge about the real world, spontaneously emerge. This comes at the great cost of designing structures capable of manipulating knowledge, and of entering high-quality contents in these structures by knowledgeable human experts. While the efforts are long-running and large-scale, the overall information entered is minute compared to what is available on the world-wide-web.
The rise of the world-wide-web has enticed millions of users to type in trillions of characters to create billions of web pages of, on average, low-quality contents. The sheer mass of information about almost every conceivable topic makes it likely that extremes will cancel and that the majority or average is meaningful in a low-quality, approximate sense. We devise a general method to tap the amorphous low-grade knowledge available for free on the world-wide-web, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available to all by using any search engine that can return aggregate page-count estimates for a large range of search queries, like Google.
Previously, we and others developed a compression-based method to establish a universal similarity metric among objects given as finite binary strings [2], [25], [26], [7], [8], [39], [40], which was widely reported [20], [21], [13]. Such objects can be genomes, music pieces in MIDI format, computer programs in Ruby or C, pictures in simple bitmap formats, or time sequences such as heart rhythm data. This method is feature-free in the sense that it doesn’t analyze the files looking for particular features; rather it analyzes all features simultaneously and determines the similarity between every pair of objects according to the most dominant shared feature. The crucial point is that the method analyzes the objects themselves. This precludes comparison of abstract notions or other objects that don’t lend themselves to direct analysis, like emotions, colors, Socrates, Plato, Mike Bonanno and Albert Einstein. While the previous method that compares the objects themselves is particularly suited to obtain knowledge about the similarity of objects themselves, irrespective of common beliefs about such similarities, here we develop a method that uses only the name of an object and obtains knowledge about the similarity of objects, a quantified relative Google semantics, by tapping available information generated by multitudes of web users. Here we are reminded of the words of D.H. Rumsfeld [31] “A trained ape can know an awful lot/ Of what is going on in this world / Just by punching on his mouse / For a relatively modest cost!” In this paper, the Google semantics of a word or phrase consists of the set of web pages returned by the query concerned.
A. An Example:
While the theory we propose is rather intricate, the resulting method is simple enough. We give an example: At the time of doing the experiment, a Google search for "horse" returned 46,700,000 hits. The number of hits for the search term "rider" was 12,200,000. Searching for the pages where both "horse" and "rider" occur gave 2,630,000 hits, and Google indexed 8,058,044,651 web pages. Using these numbers in the main formula (III.3) we derive below, with N = 8,058,044,651, this yields a Normalized Google Distance between the terms "horse" and "rider" as follows:

NGD(horse, rider) = (max{log 46,700,000, log 12,200,000} − log 2,630,000) / (log 8,058,044,651 − min{log 46,700,000, log 12,200,000}) ≈ 0.443.
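For concreteness, the following minimal Python sketch plugs the page counts quoted above into the NGD formula. The function name ngd and the hard-coded counts are ours, for illustration only; the ratio of differences of logarithms is independent of the logarithm base.

```python
import math

def ngd(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """Normalized Google Distance from the page counts of two terms,
    their joint page count, and the total number of indexed pages n."""
    log_fx, log_fy, log_fxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Counts quoted in the example above: "horse", "rider", both terms, and index size.
print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))  # ~0.443
```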
In the remainder of the paper we argue that the NGD is a normed semantic distance between the terms in question, usually (but not always, see below) between 0 (identical) and 1 (unrelated), in the cognitive space invoked by the usage of the terms on the world-wide-web as filtered by Google. Because of the vastness and diversity of the web, this may be taken as related to the current use of the terms in society.
We did the same calculation when Google indexed only one-half the number of pages: 4,285,199,774. It is instructive that the probabilities of the search terms used did not change significantly over this doubling of pages, with the number of hits for "horse" equal to 23,700,000, for "rider" equal to 6,270,000, and for "horse, rider" equal to 1,180,000. The NGD(horse, rider) we computed in that situation was ≈ 0.460. This is in line with our contention that the relative frequencies of web pages containing search terms give objective information about the semantic relations between the search terms. If this is the case, then the Google probabilities of search terms and the computed NGDs should stabilize (become scale invariant) with a growing Google database.
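The stability claim can be checked with the same kind of computation; below is a self-contained repetition of our illustrative helper using the counts from the half-sized index.

```python
import math

def ngd(f_x, f_y, f_xy, n):
    # Same NGD formula as in the previous sketch.
    lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Counts quoted above, from the time Google indexed 4,285,199,774 pages.
print(round(ngd(23_700_000, 6_270_000, 1_180_000, 4_285_199_774), 3))  # ~0.460
```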
B. Related Work:
There is a great deal of work in cognitive psychology [37], linguistics, and computer science on using word and phrase frequencies in text corpora to develop measures for word similarity or word association, partially surveyed in [34], [36], and going back to at least [35]. One of the most successful approaches is Latent Semantic Analysis (LSA) [37], which has been applied in various forms in a great number of applications. We discuss LSA and its relation to the present approach in Appendix VII. As with LSA, many other previous approaches to extracting correlations from text documents are based on text corpora that are many orders of magnitude smaller, that are in local storage, and on assumptions that are more refined than what we propose. In contrast, [11], [1] and the many references cited there use the web and Google counts to identify lexico-syntactic patterns or other data. Again, the theory, aim, feature analysis, and execution are different from ours, and cannot meaningfully be compared. Essentially, our method below automatically extracts semantic relations between arbitrary objects from the web in a manner that is feature-free, up to the search engine used, and computationally feasible. This seems to be a new direction altogether.
C. Outline:
The main thrust is to develop a new theory of semantic distance between a pair of objects, based on (and unavoidably biased by) a background contents consisting of a database of documents. An example of the latter is the set of pages constituting the world-wide-web. Similarity relations between pairs of objects are distilled from the documents by just using the number of documents in which the objects occur, singly and jointly (irrespective of location or multiplicity). For us, the Google semantics of a word or phrase consists of the set of web pages returned by the query concerned. Note that this can mean that terms with different meanings have the same semantics, and that opposites like "true" and "false" often have a similar semantics. Thus, we just discover associations between terms, suggesting a likely relationship. As the web grows, the Google semantics may become less primitive. The theoretical underpinning is based on the theory of Kolmogorov complexity [27], and is in terms of coding and compression.
This allows us to express and prove properties of absolute relations between objects that cannot even be expressed by other approaches. The theory, application, and the particular NGD formula expressing the bilateral semantic relations are (as far as we know) not equivalent to any earlier theory, application, or formula in this area. The current paper is the next step in a decade of cumulative research in this area, of which the main thread is [27], [2], [28], [26], [7], [8], with [25], [3] using the related approach of [29]. We start with a technical introduction outlining some notions underpinning our approach: Kolmogorov complexity, information distance, and the compression-based similarity metric (Section II). Then we give a technical description of the Google distribution, the Normalized Google Distance, and the universality of these notions (Section III). While it may be possible in principle that other methods can use the entire world-wide-web to determine semantic similarity between terms, we do not know of a method that uses, or computationally could use, the entire web and that has the same aims as our method. To validate our method we therefore cannot compare its performance to other existing methods. Ours is a new proposal for a new task. We validate the method in the following way: by theoretical analysis, by anecdotal evidence in a plethora of applications, and by systematic and massive comparison of accuracy in a classification application against the uncontroversial body of knowledge in the WordNet database. In Section III we give the theoretical underpinning of the method and prove its universality. In Section IV we present a plethora of clustering and classification experiments to validate the universality, robustness, and accuracy of our proposal. In Section V we test repetitive automatic performance against uncontroversial semantic knowledge: we present the results of a massive randomized classification trial we conducted to gauge the accuracy of our method against the expert knowledge implemented over the decades in the WordNet database. The preliminary publication [9] of this work on the web archives was widely reported and discussed, for example in [16], [17]. The actual experimental data can be downloaded from [5]. The method is implemented as an easy-to-use software tool, available to all on the web [6].
D. Materials and Methods:
The application of the theory we develop is a method justified by the vastness of the world-wide-web and by the assumption that the mass of information is so diverse that the frequencies of pages returned by Google queries average out the semantic information in such a way that one can distill a valid semantic distance between the query subjects. It appears to be the only method that starts from scratch, is feature-free in that it uses just the web and a search engine to supply contents, and automatically generates relative semantics between words and phrases. A possible drawback of our method is that it relies on the accuracy of the returned counts. As noted in [1], the returned Google counts are inaccurate, especially if one uses the Boolean OR operator between search terms, at the time of writing. The AND operator we use is less problematic, and we do not use the OR operator. Furthermore, Google apparently estimates the number of hits based on samples, and the number of indexed pages changes rapidly. To compensate for the latter effect, we have inserted a normalizing mechanism in the CompLearn software. Generally, though, if search engines have peculiar ways of counting numbers of hits, in large part this should not matter, as long as some reasonable conditions hold on how counts are reported. Linguists judge the accuracy of Google counts trustworthy enough: in [23] (see also the many references to related research) it is shown that web searches for rare two-word phrases correlated well with the frequencies found in traditional corpora, as well as with human judgments of whether those phrases were natural. Thus, Google is the simplest means to get the most information. Note, however, that a single Google query takes a fraction of a second, and that Google restricts every IP address to a maximum of (currently) 500 queries per day, although they are cooperative enough to extend this quota for noncommercial purposes. The experimental evidence provided here shows that the combination of Google and our method yields reasonable results, gauged against common sense ('colors' are different from 'numbers') and against the expert knowledge in the WordNet database. A reviewer suggested downscaling our method by testing it on smaller text corpora. This does not seem useful. Clearly, performance will deteriorate with decreasing database size; a thought experiment using the extreme case of a single web page consisting of a single term suffices. Practically addressing this issue is begging the question. Instead, in Section III we theoretically analyze the relative semantics of search terms established using all of the web, and its universality with respect to the relative semantics of search terms established using subsets of web pages.
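As a hedged illustration of the practical constraints just discussed (sampled counts, a changing index, and a per-IP query quota), the sketch below caches page counts on disk and enforces a daily query budget. The page_count stub, the budget constant, and the cache layout are our own assumptions for illustration; they are not part of CompLearn or of any real search-engine API.

```python
import json
from pathlib import Path

DAILY_BUDGET = 500                      # assumed per-IP daily query limit
CACHE_FILE = Path("counts_cache.json")  # hypothetical on-disk cache

def page_count(query: str) -> int:
    """Placeholder: return the number of pages matching `query`.
    Wire this to whatever count source is available (a search-engine
    API or a local index); no real API call is made here."""
    raise NotImplementedError

class CountCache:
    """Serve page counts from a disk cache so that repeated NGD
    computations do not exhaust the limited daily query budget."""

    def __init__(self) -> None:
        self.cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
        self.remaining = DAILY_BUDGET

    def count(self, query: str) -> int:
        if query not in self.cache:
            if self.remaining <= 0:
                raise RuntimeError("daily query budget exhausted")
            self.remaining -= 1
            self.cache[query] = page_count(query)
            CACHE_FILE.write_text(json.dumps(self.cache))
        return self.cache[query]
```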
II. TECHNICAL PRELIMINARIES