Location Intelligence: 2015

Saturday, 15 August 2015

A Study of Hybrid Similarity Measures for Semantic Relation Extraction

Abstract

This paper describes several novel hybrid semantic similarity measures. We study various combinations of 16 baseline measures based on WordNet, Web as a corpus, corpora, dictionaries, and encyclopedia. The hybrid measures rely on 8 combination methods and 3 measure selection techniques and are evaluated on (a) the task of predicting semantic similarity scores and (b) the task of predicting semantic relation between two terms. Our results show that hybrid measures outperform single measures by a wide margin, achieving a correlation up to 0.890 and MAP(20) up to 0.995.

1 Introduction
Semantic similarity measures and relations are proven to be valuable for various NLP and IR applications, such as word sense disambiguation, query expansion, and question answering.
Let R be a set of synonyms, hypernyms, and co-hyponyms of terms C, established by a lexicographer. A semantic relation extraction method aims at discovering a set of relations Rˆ approximating R. The quality of the relations provided by existing extractors is still lower than the quality of the manually constructed relations. This motivates the development of new relation extraction methods.

Monday, 13 April 2015

Phương pháp trích xuất thể hiện quan hệ dựa trên dư thừa thông tin

In this chapter we present an approach, that extracts instances of relations defined in an ontology. We use generic, domain independent techniques to extract candidate relation instances from the Web and exploit the redundancy of information on the Web to compensate for loss of precision caused by the use of these generic methods. The candidate relation instances are ranked based on co-occurrence with a small seed set. In an experiment, we extracted instances of the relation between artists and art styles. The results were manually evaluated against selected art resources. The method was also tested in a football domain. We also compare the performance of our ranking to that of a Google-hit count based method.

This chapter is based on an article co-authored with Maarten van Someren and Bob Wielinga, "A redundancy-based method for the extraction of relation instances from the Web" [de Boer et al., 2007], which appeared in the International Journal of Human-Computer Studies, volume 65(9), 2007. Elements of this chapter have been presented at the 3rd European Semantic Web Conference (ESWC 2006) [de Boer et al., 2006b], the 17th European Conference of Artificial Intelligence (ECAI 2006) [de Boer et al., 2006a], and the 29th Annual German Conference on AI (KI 2006) [de Boer et al., 2006d].

2.1 Giới thiệu
The emerging notion of the Semantic Web envisions a next generation of the World Wide Web in which content can be semantically interpreted with the use of ontologies. Following Maedche and Staab [2001], we distinguish two parts of an ontology: the data model and the knowledge base. The ontology data model consists of its classes (concepts) and relations that make up a conceptualization of a domain. The knowledge base contains the content of the ontology, consisting of the instances of the classes and relations in the underlying ontology data model. The Semantic Web calls for a large number of ontologies on different domains as well as knowledge base content.
It has been argued (e.g. [Maedche and Staab, 2001], [Cimiano et al., 2004]) that manual construction of ontologies is time consuming and that (semi-)automatic methods for the construction of the ontology data models would be of great benefit to the field. For the same reason, to avoid the knowledge acquisition bottleneck, we also would like to extract the content of the knowledge base in a (semi)-automatic way from existing sources of information such as the World Wide Web. This task is called ontology population. Automatic methods for ontology population are needed to avoid the tedious labor of manually identifying the instances in the corpus and adding them to the knowledge base.
The task of Ontology Learning can be decomposed into learning ontology classes for a domain, discovering the class hierarchy by identifying taxonomic relations and learning other relations between the classes. We can also decompose the task of ontology population into the extraction of class instances or instances of relations. In this chapter, we focus on the task of automatically extracting instances of relations that are defined in the ontology data model. This task, further defined in the next section, we call relation instantiation. For this task, we designed a method that extracts the information from heterogeneous sources on the World Wide Web.
The task of Ontology Learning and Population is related to Information Retrieval tasks such as Information Extraction and Question Answering. In these tasks, the goal is to extract a specific piece of information from a corpus of documents (such as the Web). Question Answering starts with a user-defined question formulated in natural language. This question is then analyzed and the answer is extracted from the corpus. In Information Extraction, the query is in a structured form. Named Entity Recognition tasks such as extracting all instances of geographical locations in a text are an example of such an Information Extraction task. Ontology Learning and Population also aim to extract relevant pieces of information from a corpus, but these are then used to form, populate or enrich ontologies. Both the query and the result are structured terms and not pieces of text. Our method is designed to extract instances of relations for partly populated ontologies.
Current approaches to Information Extraction are used for ontology population. However, these methods assume a specific structure of the corpus documents. Wrapper-induction techniques such as proposed by Kushmerick et al. [1997] exploit structures within documents (e.g. lists and tables) and structures that are similar across individual corpus documents. They perform best in domains with a highly structured corpus. Methods that learn natural language patterns such as Hearst patterns [Hearst, 1992] generally perform well on free text, but do not work as well for more structured data. We design our method to be structure-independent.
Methods that use some form of supervised Machine Learning assume a large number of tagged example instances to be able to learn patterns for extracting new instances and this is a serious limitation for large scale use [Cimiano, 2005]. We designed our method to require only a small amount of examples that are used as a seed set.
A number of methods that use learned patterns for Information Extraction perform very well on the domain they were constructed for. Their performance drops however when they are applied in a new, unknown domain (cf. [Riloff and Jones, 1999]). Our method as presented in this chapter is domain independent.
Our approach incorporates generic methods that do not rely on assumptions about the domain or the type of documents in the corpus. By using these general methods for the extraction, we will lose in precision since the general methods are not optimized for a specific corpus or domain. However, since we use more generic methods, we are able to extract information from a greater number of sources. The main assumption behind our method is that because of the redundancy of information on the Web and the ability to combine information from heterogeneous sources, we can compensate for this loss of precision.
In the next section we will give a more elaborate definition of the relation instantiation task and state our assumptions with respect to this task. In Section 2.3, we will describe our approach to this task. We tested the method in two domains. Experiments in the cultural heritage domain will be discussed in Section 2.4 and we report on experiments in the Football domain in Section 2.5. We compared the method to an other redundancy-based web metric: the Normalized Google Distance and describe this and other related research in Section 2.6. In the last section we will look at conclusions and further research.
2.2 Công việc trích xuất thể hiện quan hệ
We define an ontology data model as a set of labeled classes C1, ..., Cn, ordered by a subclass relation. Relations between classes other than the subclass relation are also defined (R : Ci × Cj). We speak of a (partly) populated ontology when, besides the ontology data model, a knowledge base with instances of both classes and relations from the ontology data model is also present.
We define the task of relation instantiation from a corpus as follows:
Given two classes Ci and Cj in a partly populated ontology, with sets of instances Ii and Ij and given a relation R : Ci × Cj, identify for an instance i ∈ Ii all instances j ∈ Ij such that the relation R(i, j) holds given the information in the corpus.
Furthermore, we make a number of additional assumptions:

R is not a one-to-one relation. The instance i is related to multiple elements of Ij.
We know all elements of Ij.
We have a method available that recognizes these elements elements of Ij in the documents in our corpus. For a textual corpus such as the Web, this implies that the instances in the knowledge base must have at least one textual label. Instances can have more than one label.
There must be multiple documents in the corpus, in which multiple instances of the relation are represented.
We have a (small) example set of instances of Ci and Cj for which the relation R holds. This is used as a seed set.

The second assumption states that we do not attempt to extract new instances of a class with this method but attempt to find instances of relations between already known instances. An example of such a relation instantiation task is the extraction of instances of the relation appears_in between films (instances of class Film) and actors (instances of class Actor) in an ontology about movies. Another example is finding the relation has_artist between instances of the class Art Style and instances of the class Artist in a populated ontology describing the cultural heritage domain. In Section2.4, we report on experiments with our method to extract instances of this relation.
2.3 Trích xuất thể hiện quan hệ dựa trên dư thừa thông tin
In Section 2.3.1, we present our general approach to the relation instantiation task. We have done experiments with a basic version of the method and a slightly expanded version that uses bootstrapping. These versions are described in Section 2.3.2.
2.3.1 Phương pháp
The basic approach of the method is outlined in Figure 1. To extract instances of the relation R : Ci × Cj, the method takes as input a single instance i of Ci and the set of instances of Cj. Further input is in the form of a (small) seed set of instances for which we already know that the given relation holds.
The method uses a generic method such as Named Entity Recognizers to identify instances of Cj in the individual documents from the Corpus, extracted from the Web, and marks them as candidates for the right-hand side of a relation instance. The documents are then given a score that reflects how well the relation R is represented in those documents. The level of co-occurrence of the candidates with the seed set is used to calculate this Document Score. All candidates are then scored based on the Document Scores of the pages they appear on, resulting in a ranked list of right-hand side instances. In the basic version of the method, this ranked list is the output. In the iterative method, on each iteration, the top n candidates are added to the seed set and all scores are recalculated.
The next section further specifies the four steps of the method. We describe the extraction methods used, as well as the formulas for scoring the documents and the candidates. In this section, the method is also expanded to an iterative bootstrapping method.
2.3.2 Mô tả phương pháp
The basic method consists of four steps, as shown in Figure 1. Step 1 takes the instance i of Ci as input and constructs a ‘working corpus’. This is done by constructing a query from the label(s) of the instance i. If there are multiple labels, the individual labels are double-quoted and separated by a disjunctive operator, specific to the search engine. In this chapter, we use Google [Brin and Page, 1998] as our search engine. For Google, a query formed with multiple labels looks like: ‘"Label A" OR "Label B" OR "Label C"’. This query is then presented to the search engine. The first M pages are then retrieved from the web, forming the working corpus. M, the size of this working corpus, is a parameter of the method. Larger working corpora will yield more reliable results, but processing these takes more time. In the experiments described in Section 2.4, we use M=200 and M = 1000.

Figure 1: Outline of the redundancy based method

In step 2, we identify the instances of Cj in the documents of the working corpus. Since we assume that we already know all instances of Cj, this step consists of matching the instances to their representations in the documents. These representations are extracted from the document using a extraction method as listed in our assumptions. Named Entity Recognizers can extract different types of entities such as dates, persons, locations, companies, etc. These extracted representations (strings) are then matched to the instances from the knowledge base. This matching process itself aims for a high precision. The use of a large number of documents to extract information from, will raise recall. The identified instances in the documents are the right-hand side instances of the candidate relation instances.
In steps 3 and 4, the method combines the evidence from the different documents to produce a ranking for these candidate instances. We base this ranking on the assumption that on average in individual web pages, a target relation is either well represented (the web page contains a relatively large number of correct right-hand side instances) or not represented (it contains few or none of these instances). We view this process as an instance of Semi-Supervised Learning, more precisely self-training. In self-training, one trains a classifier with a small amount of labeled data and classifies unlabeled data. Typically, the most confidently classified data points are then added to the labeled data set [Zhu, 2005]. We view the pages as classifiers for correct right-hand side instances. Since we have no negatively labeled instances, the confidence of such an individual classifier is the number of positive labeled instances divided by the number of positive and unknown instances on that page [Lin et al., 2003]. We use this confidence as our Document Score DS. In step 3, the value of DS is calculated for each of the documents:

where µdoc is the number of instances of Cj identified in document doc for which the relation is already in our seed set and νdoc is the total number of instances of Cj identified in document doc.
In step 4, we then combine all the evidence from the individual classifiers for each of the candidate instances. For this we take for each of the candidate instances the sum of the Document Scores from the documents in which the candidate instance has been identified. This number is then normalized by the total number of corpus documents, N. This results in an Instance Score IS:

where i ∈ Ij, i ∈ doc.
At the end of this step, we have an ordered list of candidates for new relation instances. In the basic method, this is the output of the method. The issue of how many instances are to be added to the knowledge base depends largely on the needs of the ontology engineer and the specific task for which the ontology population has been executed. If precision is preferred over recall, adding considerably less instances leads to higher quality. In Section 2.4.2, we examine the performance of the method with respect to a threshold based on the value of IS.
This threshold is domain- and task-specific and for a new task needs to be determined by hand. To make it more task-independent, we expanded the basic method with a bootstrapping cycle where after each step, we add new relation instances found in that iteration to the knowledge base. This cycle is also shown in Figure 1. Corpus construction, instance identification and the scoring of documents and candidate instances are done in the same way as in the basic method. From the resulting ordered list we add the top n candidates to the seed set. Then on the next iteration, the document and instance scores are again calculated, using the updated seed list. In our experiments, we set n = 1. This procedure iterates by recalculating all DS and IS, based on the expanded seed set. In the basic method, pages that do not have any instances from the seed set will never get a DS higher than zero. In this iterative method with its expanding seed set, these documents can receive a higher DS when instances in these documents appear in the seed set.
Again, an issue is after how many iterations to stop adding instances. However, now we can use the number of iterations rather than a value for IS as our threshold, making the threshold less domain-specific. In Section 2.4.4, we introduce a method to determine to stop iterating.
2.4 Trích xuất quan hệ <artist> - <art style>
We tested our method in the cultural heritage domain. In this section, we describe the setup of the experiments and the results in this domain. This section is structured as follows: In Section 2.4.1, we give a description of the cultural heritage domain and the specific implementation of the method in this domain. The experiments done in the domain are grouped in three sections. In Section 2.4.2, the experiments done with the basic method are covered and in Section 2.4.3 the experiments with the iterative method are described. The effect of the iterative threshold parameters is examined in the experiments that make up Section 2.4.4.
2.4.1 Miền di sản văn hóa
We used two well-known cultural heritage vocabularies that we interpret as partly populated ontologies. One is the Art and Architecture Thesaurus (AAT) [The Getty Foundation, 2000a], a thesaurus defining more then 133.000 terms used to describe and classify cultural heritage objects. The other is the Unified List of Artist names (ULAN) [The Getty Foundation, 2000c], a list of more then 255.000 names of artists. We also added a relation aua:has_artist 1 between the AAT class aat:Styles_and_Periods and the top-level ULAN class ulan:Artist. The aua:has_artist relation describes which artists represent a specific art style. In the experiments described in the coming three sections, the task is to find new instances of this aua:has_artist relation with the use of a seed set of relation instances. R is aua:has_artist, Ci is aat:Styles_and_Periods and Cj is ulan:Artist. Note that this relation satisfies the requirement that it is not a one-to-one relation since a single art style is represented by a number of artists. In each of the experiments, we manually added a number of instances of the aua:has_artist relation to the knowledge base.
These experiments were inspired by the need for enrichment of actual cultural heritage vocabularies for use in the MultimediaN E-culture project [Schreiber et al., 2006]. The goal of this project is to build a Semantic Web application that makes it possible to access different cultural heritage collections from across the Netherlands using existing vocabularies, including AAT and ULAN.
For each experiment, we first chose a number of instances of aat: Styles_and _Periods for which we extract new relation instances. Then in Step 1, a number of pages were extracted as a working corpus by querying Google with a combination of the preferred and non-preferred labels from the AAT (for ‘Dada’ this resulted in the query ‘Dada OR Dadaist OR Dadaism’), the number of pages differs per experiment and are noted in their respective sections. Since the right-hand side instances in this task are persons, we first identified in Step 2 all person names in the documents. For this we used the Person Name Extractor from the tOKo toolkit [Anjewierden et al., 2004]. We evaluated the performance of this Person Name Extractor on our domain. For this, we extracted person names from 11 authoritative web pages on the art style ‘Impressionism’. On these pages the person name extractor performed very well with an average precision of 0.95 and a recall of 0.93. To go from string representation in the document to a knowledge base instance, the extracted names are then matched to the ULAN list of artists. This matching step is problematic as the number of artists in the ULAN is very large and so is the number of possible occurrences of person names in the texts. For example, ‘Vincent van Gogh’ can also appear as ‘V. van Gogh’, ‘van Gogh, V.’ or ‘van Gogh’.
To tackle this matching problem, we performed tokenization on both the labels of all ULAN instances and the extracted Person Name strings. An ULAN instance is a possible match if all tokens in the extracted string are also tokens of that instance. If a string has exactly one possible match, we accept that match. If there still is ambiguity (the string ‘van Gogh’ matches three different artists), we reject the string and proceed to the next candidate string.
We assume that because of the redundancy of names from the corpus, a non-ambiguous name will eventually be extracted and correctly matched. If a non-ambiguous name exists in our corpus, we will be able to determine the correct individual. However, as we found in earlier experiments, some individuals can not be distinguished only using their names (we give an example in Section 2.4.2.3). In addition, some names will not be extracted due to imperfections of the Person Name Extractor.
After the candidate instances have been extracted, in step 3, we calculated the Document Scores for each document. In step 4 the evidence from all documents is combined into Instance Scores for each of the instances, resulting in an ordered list of scored candidates.
2.4.2 Các thử nghiệm cơ bản
Evaluation of Ontology Learning and Population still is an open issue. Since the specific task we tackle resembles Information Retrieval, we would like to calculate standard IR evaluation measures such as precision, recall and the combination: the F-measure [van Rijsbergen, 1979]. However, this requires a gold standard. Although we assume we know all artists, there is no classic gold standard that for a single art style indicates which artists represent that art style. This is due to the vagueness of the domain. Art websites, encyclopedias and experts disagree about which individual artists represent a certain art style. Although this fuzziness occurs in many domains, it is very apparent in the Art domain. Also, manually annotating the large number of web pages from the working corpus for each instance is too time-consuming.
In experiments described in this section, we opted to construct a strict gold standard. For this, we chose a number of representative web pages on a specific art style and manually identified the artists that were designated as representing that art style. If there was a relative consensus about an artist representing the art style among the pages, we added it to our ‘gold standard’. The gold standard we obtained using this method is used to measure recall, precision and F-measure values.

Table 1: Our gold standard for ‘Expressionism’. The names of the three artists selected for the seed set are italicized.

2.4.2.1 Thử nghiệm 1: Expressionism
Experiments 1 and 2 were conducted to test the performance of the basic method, specifically the ranking based on Instance Scores. In our first experiment, we chose ’Expressionism’ as the instance of Ci. We manually constructed a gold standard from 12 authoritative web pages. For a total of 30 artists that were considered Expressionists in three or more of these documents we used the relation aua:has_artist from Expressionism to those artists as our gold standard. The actual artists that make up our gold standard are shown in Table 1. From these 30 instances of the relation, we randomly selected three instances with the use of the MS Excel random function as our seed set. The three artists in the resulting seed set are italicized in Table 1. After that, we followed the approach described above to retrieve the remaining instances of the relation.
In Step 1 (the retrieval step) 200 documents were extracted. We now describe the resulting data set by focusing on the first ten documents of this working corpus. Of these ten pages seven pages were encyclopedic pages about our target art style ’Expressionism’. One page described a single Expressionist artist, ’Edvard Munch’. One page was an encyclopedic page about ’Abstract Expressionism’, a completely different art style. The last page was an online art poster store, advertising posters of different artists. These ten pages are heterogeneous in structure: Three of the ten pages consisted completely of natural language. On one page (the poster store page), artist names occurred only in structured tables. In the other six pages, the artist names occurred both in natural language and in list or table structures on the same page. All of the first ten pages from the working corpus were written in English. However, in the rest of the extracted documents other languages such as German and French occur. In Step 2, in six of the first ten documents, at least one artist instance could be identified. In total 65 artist occurrences were identified across the ten documents. For the whole working corpus, in 81 pages at least one artist was identified adding up to a total of 399 artist occurrences for the 200 documents.

Table 2: The first fifteen artists of the resulting ordered list for Experiment 1, where i =‘Expressionism’. For each identified artist, we have listed whether it appears in the gold standard (‘1’) or not (‘0’).

We then calculated the DS score for each document in Step 3 and the IS score for each artist in Step 4, as described in the previous sections. Table 2, shows the top 15 candidates for the instantiation of the relation according to the resulting ranked list. It shows that the first fourteen instances found are all correct instances according to the gold standard. The fifteenth instance is the first false positive. The table shows that the ranking we use does put the correct instances on top of the ranked list.
In Figure 2, we plotted the value for recall, precision and F-measure against the value of the Instance Score for all of the instances in the resulting ranked list. The F-measure value decreases as the value for the IS decreases. The highest Fmeasure value is 0.70 (this occurs at values for recall and precision of respectively 0.56 and 0.94). This highest F-measure value is obtained if we stop adding instances after the IS 0.0012.
To test the robustness of the method with respect to the content of the seed set, we performed the same experiments using two different seed sets selected from the gold standard, one consisting of the three most likely Expressionists from our gold standard and consisting of the least likely ones. The likelihood was based on the number of authoritative web pages used to create the gold standard on which they were considered Expressionists. The seed set with the most likely Expressionists consisted of the artists Emil Nolde, Edvard Munch and Egon Schiele, who were considered an Expressionist on respectively 11, 10 and 9 pages out of the 12. This seed set yielded the same results: a maximum value of F-measure of 0.69 was found (recall = 0.63, precision = 0.77). The seed set with the least likely Expressionists consisted of Jules Pascin, Ernst Barlach and Gustav Klimt all three of which were considered Expressionists on only three of the authoritative pages. This seed set resulted in a lower maximum value of F-measure: 0.58 (recall = 0.63, precision = 0.53). This suggests that using the seed set with highly likely Expressionists as our example extension of the has _artist relation results in a better representation of the relation. In this case, the correct artists are given higher scores than when the least likely Expressionists are used in the seed set. We also conducted this experiment using different sizes of the seed set with the same gold standard (15 seeds leaving 15 to be found and 9 seed leaving 21 to be found). These experiments yielded approximately the same maximum values for the F-measure. Before we discuss further findings, we first present the results of a second experiment within the art domain, using a different instance of Ci: Impressionism.

Figure 2: Recall, precision and F-measure for Experiment 1

2.4.2.2 Thử nghiệm 2: Impressionism
From 11 authoritative web pages on the art style Impressionism we identified 18 artists as our gold standard. From these 18 instances of the relation, we again randomly selected three as our seed set and followed the approach described above to retrieve the 15 remaining instances of the relation. Again, the actual artists are shown in Table 3.
We again extracted a working corpus of 200 documents and performed the described steps. In Table 4, we show a part of the resulting ordered list. These results are slightly worse than the results from Experiment 1. The first false positive is ‘Vincent van Gogh’. Still, in the first fifteen instances, only two errors are made. Again, evaluating the first fifteen results shows that the method performs relatively well.

Table 3: Our gold standard for ‘Impressionism’. The names of the three artists selected for the seed set are italicized.

Table 4: Part of the resulting ordered list for Experiment 2 (i =‘Impressionism’)

Again, to show the results of the evaluation of all extracted instances in the ranked list, we plot the value of precision, recall and F-measure(Figure 3). In this experiment, the F-measure reaches a maximum value of 0.83 (where recall = 0.80 and precision = 0.86) at a threshold value of 0.0084. We also tested for robustness by using different content for the seed set in the same way as in Experiment 1. If the seed set contained the most likely Impressionists gathered in the same way as in Experiment 1, the maximum value of the F-measure is 0.72 (recall = 0.60, precision is 0.90). If we start with the least likely Impressionists the maximum value of the F-measure is 0.69 (recall = 0.8, precision = 0.6), showing the same effect as in Experiment 1.

Figure 3: Recall, precision and F-measure for Experiment 2

2.4.2.3 Thảo luận
In the experiments, we find almost the same maximum value of F-measure under different conditions. In both cases, the first few found artist are always in the gold standard, after which the precision drops due to false positives. The values of F-measure are encouraging. There are several reasons that the F-measure does not reach higher values. These can be divided into reasons for lack of precision and for lack of recall.
First of all, one of the reasons for the false positives is precision errors of the Person Name Extraction module. For example, in Experiment 2 the string "d’Orsay" (name of a museum on impressionist art) is first misclassified as a person name and then passes the disambiguation step and is mapped to the ULAN entity "Comte d’Orsay". It appears as a false positive as the 29th candidate instance.
Another portion of the error in precision is caused by the strictness of the gold standard that we used. In Experiment 2, Vincent van Gogh is a false positive since he is not in our gold standard. However, two of the authoritative web pages cite him as an Impressionist painter. A less strict gold standard would have included this painter. This strictness of the gold standard accounts for portion of the lack of precision. Errors in recall are also caused by three factors. We find that 2 of the 15 Impressionists and 10 of the 27 Expressionists are not in our ordered list at all. As with precision, errors made by the Person Name Extraction module account for a part of the lack of recall. The module, has apparent difficulty with non-English names such as ‘Ernst Ludwig Kirchner’ and ‘Claude Monet’. A better Person Name Extractor would yield a higher recall and consequently, a better value for the F-measure.
Another cause for recall errors is the difficulty of the disambiguation of the artist names. From some extracted names, it is even impossible to identify the correct ULAN entity. An example is the string ‘Lyonel Feininger’. In the ULAN there are two different artists: one with the name ‘Lyonel Feininger’ and one with the name ‘Andreas Bernard Lyonel Feininger’. Our method cannot determine which one of these entities is found in the text and so the name is discarded.
Of course, a number of artists are not retrieved because they simply do not appear in the same (retrieved) page as one of the artist from a seed list. This shortcoming can be lifted if we use the iterative version of the method, as described in Section 2.4.3.
A problem that is not directly related to recall and precision is that from the experiments featured above, it is not possible to a priori determine a standard value for a threshold, with which the value of the F-measure is maximized. An optimal threshold value for Experiment 1 is 0.0012, whereas in Experiment 2 it is 0.0043. The lack of a method to determine this threshold value poses a problem when the method is used in different, real life situations. It requires experimentation to find the optimal value for the F-measure. In the next section we describe an extension to our method to eliminate the need for a threshold value.
2.4.3 Thử nghiệm 3: Iterative Experiment
Experiment 3 is conducted to determine if the performance of the iterative method is better than the basic method. In this experiment, we added bootstrapping as defined in Section 2.3.2 to the ‘Expressionism’ experiment from the previous section. We again used 200 pages, the same three artists in the seed set and evaluated the results against the same strict gold standard. The top fifteen instances results from the iterative method (shown in in Table 5) are comparable to those of Experiment 1 (‘Santiago Calatrava’ and ‘Chaim Soutine’ are replaced by ‘Vincent van Gogh’ and ‘James Ensor’ in the top 15 results).
The precision, recall and F-measure for all 50 iterations are plotted in Figure 4. While we find approximately the same values for the F-measure, we have indeed eliminated the need for a threshold on the value of the Instance Score. In the next section, we introduce a method for determining after which iteration to stop adding instances. We also find that three more Expressionists that were not extracted using the basic method have now been extracted. In the basic method, only Expressionists whose names occur on pages that also contain seed set artists can be identified. The Instance score for these artists is always 0. The use of bootstrapping makes it possible that these instances will get higher scores after a number of iterations by expanding this seed set. It is however not guaranteed that correct instances will be identified. In this experiment using the bootstrapping method 7 out of 27 Expressionists are still not retrieved after 50 iterations.

Table 5: The first 15 iterative results for Experiment 3 (‘Expressionism’).

Figure 4: Recall, precision and F-measure for the Iterative Experiment

Table 6: Art styles used in Experiment 4

2.4.4 Thử nghiệm 4: Ten Art Styles
In this experiment, we attempt to extract instances for a larger number of art styles. We also introduce and evaluate a domain-independent method for determining when to stop adding instances.
Due to the larger number of art styles for which we extracted relation instances, constructing a gold standard as in the previous experiments proved too costly. In this experiment, we opted to only calculate precision. We did this by having two annotators manually evaluate each of the 40 retrieved relation instances for each art style. For this, the annotators were allowed to consult a fixed set of sources: the articles on both the art style and the artist on the Wikipedia web encyclopedia2, the art style page on the Artcyclopedia website3 and any encyclopedic webpage that Google retrieved in the first ten results when queried with both the art style’s label and the artist’s name. If in any of these sources the artist was explicitly stated as a participant in the art style, the annotator was to mark the relation instance ‘correct’ and else mark it ‘incorrect’.
After separately evaluating the relation instances in this way, inter-annotator agreement was calculated using Cohen’s Kappa measure. Calculated over all ten art styles, this resulted in a value of 0.83. The annotators then reached agreement over the instances that initially differed. The consensus annotations are used to calculate precision.
From the instances of aat:Styles_and_Periods, we chose ten modern European art styles to extract. We list their preferred labels from the AAT in Table 6. For each of these art styles, we applied the iterative method.
2.4.4.1 Kết quả đối với 'Neue Sachlichkeit’
We first illustrate the results for a single art style: ‘Neue Sachlichkeit’ (‘New Objectivity’). The three artists we added to the seed set were ‘George Grosz’, ‘Otto Dix’ and ‘Christian Schad’. Table 7 shows the top 15 results of the 40 artists iteratively extracted from the documents. For each of the artists, we also list the Instance Score for that instance at the iteration that it was extracted. The last column shows the evaluation (1=‘correct’, 0=‘incorrect’). In these top 15 candidates, we again find only one false positive, ‘Erich Heckel’.
Figure 5 shows the Instance Scores for the top artists for all 40 iterations as well as the value for the precision (number of extracted candidates evaluated as correct divided by the total number of extracted candidates). The Instance Score represents the confidence at each iteration that for the top ranked artist a relation should be added to the knowledge base. As can be seen, this confidence for the first candidate instance is relatively high (0.0651), then drops to about 0.025 and stays relatively constant for a number of iterations. After 13 iterations, the Instance Score again drops to a new constant level of about 0.01.

Table 7: Top ranked candidate artists for the has_artist relation for the art style ‘Neue Sachlichkeit’ for the first 15 iterations

Looking at the precision curve, at this point the method starts adding more and more false relation instances. For this art style, we achieve the best precision/number of extractions ratio if we set the maximum number of iterations somewhere between 13 and 21 iterations (after 21 iterations, only incorrect instances are added).
For popular art styles, with many associated artists, this drop in precision will occur after more iterations than for relatively small art styles such as ‘Neue Sachlichkeit’, so we predict that this maximum number of iterations will depend on the specific art style. We also cannot cut off the iterations by setting an absolute threshold value for the Instance Score since it is highly variable for the different art styles.
As can be seen in the figure, the drop in precision co-occurs with a drop in the Instance Score. We choose the iteration threshold to be dependant on the relative drop in the Instance Score. The relative drop at an iteration is the Instance Score at that iteration divided by the maximum of the Instance Scores up to that iteration. We introduce a threshold on this relative drop, the Drop Factor, (DF). The algorithm stops adding relation instances to the knowledge base if the relative drop is lower than DF. We also stop adding instances after an absolute maximum number of iterations has been reached (Max). For example, in the case of ‘Neue Sachlichkeit’, if we set DF to 0.2 and Max to 40, the algorithm stops adding new relation instances after iteration 16, when the relative drop is lower than 0.2. This leads to a precision of 0.933, with 15 correct relations and one incorrect relation added to the knowledge base.

Figure 5: Instance score versus rank number for ‘Neue Sachlichkeit’

2.4.4.2 Kết quả đối với 10 Art Styles
In this section, we present the results for all ten art styles for which the relation instances were extracted.
In Table 8, we show the precision and the number of correct relation instances extracted for each of the ten art styles for an arbitrarily chosen value for the two threshold parameters (DF=0.3 and Max=20). For these values, the precision for each of the art styles ranges from 0.667 to 1. The number of correct extractions differs considerably between the art styles, for a ‘small’ art style such as ‘Surrealist’ only 5 correct new relation instances are extracted with a threshold at 7 iterations, resulting in a precision of 0.714. This seems to confirm our prediction in Section 2.4.4.1 that a good cutoff point varies between art styles. The average precision in this example is 0.84 with a standard deviation of 0.14.
In Table 9, we list both the average precision and the total sum of the number of correct relation instances extracted for the ten art styles for 24 combinations of the two threshold parameters DF and Max. The lowest value for precision is .616. This occurs at DF=0.1 and Max=40. In that case 241 extractions are evaluated as correct.
The highest average precision, 0.921 with a standard deviation of 0.11, is reached at DF=0.6 and Max=10, with only 55 correct relation instances added to the knowledge base. In this case, DF has a big effect. For some art styles (e.g. Expressionist, Impressionist) ten instances are extracted, while for other styles such as ‘Neue Sachlichkeit’, only one relation instance is extracted. The values for the standard deviation for each of these values of average precision ranged from 0.10 to 0.20.

Table 8: Precision and number of correct extractions (extr.) for the ten Art Styles for DF=0.3 and Max=20

Table 9: Average precision (prec.) and total number of correct extractions (extr.) for the ten Art Styles

2.4.4.3 Thảo luận
We observe a tradeoff between precision and number of correct extractions comparable to that of the traditional precision/recall tradeoff. Depending on further processing of these results, different parameter setting can be used. If the results will be manually validated by a domain expert before they are added to the knowledge base, the threshold parameters can be chosen in such a way that recall is maximized (high value for Max and a low value for DF). If the aim is for a high precision, one can use a higher DS and a lower value for Max. Since we don’t have real recall values, we cannot calculate the value of F-measure. But if, for the sake of finding an optimal parameter setting, we assume that 247 is the maximum number of artists that could have been found, we can calculate a type of recall with respect to this number (at DF=0 and Max=40, this ‘recall’ would be 1; at DF=0.6 and Max=10, recall would be 55/247=0.22). If we use the recall values obtained in this way to calculate a F-measure, we find that the F-measure is has a maximum value of 0.786 at DF=0.2 and Max=40.
2.5 Thử nghiệm trên miền thứ 2: football players
To test the generality of our method, we tested the iterative method on a similar relation instantiation task in a different domain. As our domain, we chose the football (soccer) domain, a popular domain for Ontology Learning and Population methods (eg. [Weber and Buitelaar, 2006]). We use the method to extract instances of a relation between football clubs and players. In the next sections, we give a more precise description of the task, we present the experimental setup and present the results.
2.5.1 Task Description
Unlike the cultural heritage experiments, the ‘ontology’ and the knowledge base consisting of the instances that are being used in the experiments from this section were created by hand specifically for this purpose. In the football domain, we chose to extract instances of the has_player relation between Football Club (Ci) and Football Player (Cj). Since there can be multiple relations between football clubs and (former) players, we specify the intrinsic meaning of the relation as including both current and past players for a the first team of that football club.
For our experiments, we populated the Football Club class with three Dutch clubs as its instances: ‘Ajax’, ‘Feyenoord’ and ‘AZ’. We also populated the class Football Player with 590 instances that we extracted from the Wikipedia page listing the most well-known Dutch football players from the past and present, thus forming our Ij. Each football player instance has exactly one label, as opposed to the artists example, where multiple labels were available for each instance.
The relation instantiation task in this domain is to extract the correct instances of the has_player relation between Football Club and Football Player, starting from a seed set of example relation instances.
2.5.2 Cài đặt thử nghiệm và đánh giá
We use the iterative version of the method, as described in Section 2.3.2. For each of the Football Club instances, we first manually constructed a seed set of three well known football players of that club, from both past and present, that were also in Ij. In Step 1, we extracted a working corpus of 1000 documents by querying the Google search engine with the label of the football club instance. In Step 2, we used the same Person Name Extractor and applied the same name matching method as in the cultural heritage experiments to identify the instances of Cj in the working corpus pages, although here we only have a single label for each instance. We use equations 2.1 and 2.2 on each iteration to calculate the next top candidate. For each of the three football clubs we evaluated the list that is the result of 100 iterations.
Evaluation is considerably more straightforward in this specific task. Since almost all of our candidates had a personal Wikipedia page listing their past and present clubs, we used these pages to evaluate for each football club whether a candidate football player is or was a player for that team. For the few players that did not have a personal Wikipedia page, we verified the relation in one of the first ten results from the Google search engine queried with both the player’s an the team’s name.
Again, calculating recall is problematic since we do not know which subset of Ij are correct candidates for a relation instance. Contrary to the previous experiments, this time Wikipedia pages for the football clubs are available listing past and present players for that club6. However, these are not completely exhaustive and in validating the results from our experiments, we found a small number instances that were evaluated as ’correct’ although they did not appear on this list. Since we did want to calculate some form of recall, in these experiments, we took the size of the intersection of this page and Ij as our gold standard to calculate recall values. Although in theory this could lead to recall values larger than 1, this did not occur.
2.5.3 Các kết quả
In Table 10, we first show the results for a single Football Club instance: ’Feyenoord’.
In Figure 6, we show the values for the F-measure after each iteration for each of the three Football Club instances and the average of these values. The value of the average F-measure reaches a maximum of 0.60 after 56 iterations.

Table 10: Top ranked candidate football players for the has_player relation for ’Feyenoord’ for the first 16 iterations

The individual F-measure maximums of instances ‘Feyenoord’ and ‘Ajax’ are higher than that of ‘AZ’ (respectively 0.73, 0.75 and 0.58). Also, the maximum of both ‘Feyenoord’ and ‘Ajax’ occurs after more iterations than the maximum of ‘AZ’. This can be explained by the fact that the former teams have been popular for a longer period of time than ‘AZ’ and therefore have a larger number of well-known past and present players. Recall therefore reaches a high value only later on in the iterative process, influencing the F-measure accordingly. We also examined the effect of the threshold parameters in these experiments. Other than in Table 9, we list the F-measure for a number of combinations of DF and Max in Table 11. We find the maximum F-measure value at DF = 0.2 and Max = 60. The value of 0.618 is slightly higher than the 0.60 found without the threshold parameters. At these values, the method stops adding candidates for ‘AZ’ after 47 iterations and for the other two instances after 60 iterations. Note that in this domain, we find values of the F-measure comparable to the experiments in sections 2.4.2 and 2.4.3. We observe that the maximum value of the F-measure in these experiments occurs at the same value for DF as that of the experiments in Section 2.4.
2.6 So sánh với phương pháp dựa trên khoảng cách google chuẩn
We are not aware of any state-of-the-art systems or methods that tackle the exact same task as our method as described in section 2.2. However, we are able to compare the result of using our proposed evidence measure as opposed to a state-of-the-art redundancy based measure, the Normalized Google Distance.

Figure 6: Average F-measure values for the three Football Club instances

Table 11: Average value for the F-measure for the three Football Clubs

Normalized Google Distance (NGD) was proposed by Cilibrasi and Vitanyi [2004] as a measure for the semantic distance between two terms. It uses the Google search engine to estimate the number of hits for each of the two terms and the combination of the terms. The semantic distance between two terms x and y is defined as:

where f(x) denotes the number of pages containing x, and f(x, y) denotes the number of pages containing both x and y, as reported by Google. N is the total number of webpages that Google indexes.
Although NGD is a measure for semantic relatedness of terms rather than ontological instances, Cilibrasi and Vitanyi in their paper argue that the NGD can be used for Ontology Learning and/or Population and present some examples.
For comparing the two methods, we chose the task from the experiment as described in Section 2.4. However, there are two main problems for Normalized Google Distance in handling the exact same task as our Redundancy Method.
The first problem is caused by the large number of Artist instances in the ULAN, that is 255.000. For a single Art Style, NGD requires two Google queries for each Artist instance to determine the semantic distance between the Art Style and the Artist. Geleijnse et al. [2006] identify as the Google complexity of a measure. The large number of queries combined with the restrictions the Google search engine imposes on automated queries does not allow us to calculate the Normalized Google Distance between all Art Styles and all Artists. To be able to make a comparison between the results of our measure and NGD, we only calculated the NGD between an Art Style and the first 40 Artists that our method returned for that Art Style (i.e. the results from step 2 in our algorithm).
A second issue to address is for a single Artist instance what the exact term should be to query Google. As our method uses the complete instance with all its preferred and non-preferred labels from the AAT, we would like for NGD to also use all available textual information. Therefore, for a single AAT instance, we presented Google with a binary query consisting of the disjunction of all these labels7, resulting in the value for f(x). For f(y), we also used the disjunction of the labels of each of the ten art styles from Section 2.4.4. For f(x, y), we combined these two queries using Google’s ‘AND’ operator. With these three values, we calculate the NGD value for the 40 Art Style-Artist pairs. In Table 12 we show the first ten artists according to the ranking by NGD for the art style ‘Neue Sachlichkeit’. We compared the resulting rank ordering by distance to our own results. For this we used Spearman’s Rank Correlation Coefficient [Lehmann and D’Abrera, 1998] (denoted by ρ).
In Table 13, we list the value of ρ between our ranking and the ranking produced by NGD for 10 art styles. With N=40 and a significance level of 0.05, positive correlations are significant if ρ > 0.304. Observe that the rankings of only four of the ten art styles are significantly positively correlated with the rankings from the Redundancy method.

Table 12: First 16 results from the resulting NGD rank ordering for the art style ‘Neue Sachlichkeit’, including NGD value and the evaluation (compare to Table 7). ρ=0.208

Table 13: Spearman’s Rank Correlation Coefficient (ρ) between the rankings produced by the Redundancy Method and the NGD-based method, for each of the ten art styles

Figure 7: Precision with respect to the ranking position for the Redundancy Method and Normalized Google Distance for the ‘Neue Sachlichkeit’ art style.

Since our threshold parameters do not apply to the Normalized Google Distance, we are not able to reproduce Table 9 for NGD. To compare the NGD and Redundancy Method rankings for one art style, we determined the precision at each ranking position for both methods. For ‘Neue Sachlichkeit’, we show this in Figure 7. Since both methods rank the same 40 instances, their precision at rank 40 is the equal.
Since both rankings use the same 40 artists, the resulting precision at position 40 is equal. However, in the case of ‘Neue Sachlichkeit’, the precision of the Redundancy Method is equal or greater than that of NGD at all positions of the ranking. Figure 8 plots the precision averaged over all ten art styles with respect to the ranking position. The graph shows that in these experiments, the average precision of the Redundancy Method at every position of the rankings is higher than the NGD’s precision at the same position.
2.7 Các nghiên cứu liên quan
Recent research presents multiple systems and method for the extraction of information including relations from the Web using redundancy. An example of such a system is the Armadillo system [Ciravegna et al., 2004]. Armadillo is designed to model a domain and construct a knowledge base by extracting information from the World Wide Web. It uses different modules to extract entities and objects and find relations between them. The Armadillo method starts with a seed set, extracted from highly structured and easily minable domain sources such as lists or databases. It then uses bootstrapping to train more complex modules to extract information from other sources. Evidence from different sources is combined, thus exploiting the redundancy of information on the Web. Contrary to our method, Armadillo does not require a complete list of instances and it does extract new instances from the web. However, for each new domain, a different Armadillo application is created and valuable Web Services for that domain have to be manually identified (such as the CiteSeer web site for modeling the Computer Science domain). Our method does not use domain specific extraction modules but instead uses a form of similarity-based reasoning on a working corpus extracted using Google. In the method proposed by Cimiano et al. [2004], evidence from heterogeneous sources is combined to extract taxonomic relations between concepts. One of these sources is the result of applying Hearst patterns on the Web. For a set of concepts a number of queries that use predefined patterns such as ‘C1 such as X’and ‘C2 such as X’ are formed. It then uses the Google hitcounts for those queries as evidence for taxonomic relations between the concepts. As an example, if the query ‘C1 such as X’ generates a higher Google hit count than ‘C2 such as X’, X is more likely to be a subclass of C1 than of C2. The method, through the use of Google, also uses redundancy and combines evidence from multiple sources to enrich an ontology. However, in contrast to our method, the method of Cimiano et al. uses (handcrafted) patterns and is only used to extract "is_a" relations.

Figure 8: Average Precision with respect to the ranking of the ten art styles for the Redundancy Method and Normalized Google Distance

A number of methods for the automatic extraction of relation instances differ from our method in that they use patterns for the extraction. The DIPRE system by Brin [1998] is an early example of how with bootstrapping techniques a small set of related pairs can be expanded to larger lists. This system identifies hypertext patterns that express the target relation. These patterns are learned for individual web pages. In SnowBall [Agichtein and Gravano, 2000], a named entity recognizer is combined with bootstrapping techniques to construct patterns for web Information Extraction. The idea of combining patterns and bootstrapping is also used in the paper by Geleijnse and Korst [2005]. They present an automatic and domain-independent method for ontology population by querying Google. They also combine evidence from multiple sources (i.e. Google excerpts) and use a form of bootstrapping that enables the method to start with a small seed set. The method differs from our method in that it currently uses handcrafted rules to extract these instances. In [Geleijnse et al., 2006], the method by Geleijnse and Korst and a slightly modified version of our method are compared.
The KnowItAll system [Etzioni et al., 2005] aims to automatically extract ‘facts’ (instances) from the web autonomously and domain-independently. The method, unlike our method, uses hyponym patterns such as Hearst patterns [Hearst, 1992] to extract instances. It starts with universal extraction patterns and uses grammar induction algorithms [Crescenzi and Mecca, 2004] to learn more specific extraction patterns. In combination with techniques that exploit list structures the method is able to extract information from heterogeneous sources. Downey et al. [2005] use a probabilistic model based on redundancy of information on the Web to evaluate results extracted by the KnowItAll system.
In general, our method differs significantly from the methods described above in that we do not extract new instances but only extract relations between known instances. Therefore, the task our method is designed for is in that way more restricted than for the methods described in this section that do extract new instances.
We here present a semi-supervised method, which uses a small set of labeled instances to learn more relation instances. In general, for Information Extraction tasks, effective supervised methods exist such as described by Bunescu and Mooney [2006] or Zelenko et al. [2003]. These methods however require significant amounts of examples to learn from.
2.8 Kết luận và các nghiên cứu tiếp theo
We presented a generic, domain-independent method for relation instantiation, a subtask of ontology population. Our method uses co-occurrence of relation instances on Web documents and exploits the redundancy of information on the Web to find new relation instances. The relation instantiation method we propose in this chapter does not use the label of the specific relation. It is based on the extension of the relation in the form of the seed set. If multiple, different relations would exist between two classes then the effect of the method depends on the extent to which the seed set separates the two relations. Documents with higher percentage of instances in the seed set will receive a higher Document Score leading to higher Instance Scores for the correct right hand side instances. Between the concepts from our experiments, no obvious other relations than the target relation (has_artist and has_player) existed that adhered to all assumptions from Section 2.2 and we did not conduct any experiments to test the level with which the method is able to distinguish between two or more relations between the same classes.
In the task description in Section 2.2, we define relation instantiation as finding the relations between instances in a knowledge base. However, the method described in this chapter could also be used to extract the instance_of relation between a (new) labeled class and a set of instances, this would expand the task to include classification. If, for example, ULAN is expanded with the concept ‘Impressionist Artist’ as a subclass of ‘Artist’, the method could be used to find the instances of this new concept in almost the same way as was done in Section 2.4. However, in this chapter we used the method to extract instances of relations between instances only.
To test the performance of the method, we used it in a number of experiments to extract instances of the Artist-Art Style relation. This was done using actual ontologies from the cultural heritage domain. Results show a tradeoff of precision and the number of correct extractions analogous to the precision/recall tradeoff. Considering the method uses very generic methods and intuitive ranking scores, the results are encouraging but also suggest that further processing of the results could improve the relation instantiation. For example, the top of the resulting ranking lists could be considered as hypothesis relations, where other methods could be used to verify this hypothesis. The relations could also be verified using other knowledge from the ontology. For the Art Style-Artist relation, biographical information such as birth- and death-dates could prove to be helpful. The same could hold for geographical knowledge about the style and artist. In this chapter, we do not use any knowledge stored in the ontology in the extraction process other than the different labels of an instance. In Chapter 5, we provide an example general guidelines on how ontological background knowledge can be used to aid the relation instantiation process.
For the aforementioned MultimediaN E-Culture application, The 247 Art StyleArtist relation instances found in Experiment 4 that were evaluated as ‘correct’ were added to the knowledge base. This knowledge base is used in the current implementation of the demo application8. Analysis of the documents from which information was extracted showed that the documents were highly heterogeneous in structure. Some documents were essays and consisted of free text while other documents such as art prints web shops featured list structures. Also, content was extracted from pages in a language different from English.
Improvement in the Person Name Extraction module or combining different Person Name Extractors could improve the extraction. Using a different, less strict named-entity matching procedure is also a possible improvement. Also, other measures for the Document Score and Instance Score could be considered.
To show that the method works in a different domain, the method was also used to extract instances of the Football Club-Player relation in the football domain. The performance of the method is similar across both domains.
The assumptions in Section 2.2 give an indication for the types of specific relation extraction tasks for which this method is useful: The second and third assumptions state that lists or gazetteers of right-hand side instances must be available for the task. This limits the method to domains where such lists exists and its elements can be found in texts using NER(-like) methods (geographical locations, person names, movies, etc.). The fourth assumption requires that in multiple documents in the corpus, multiple instances will be encountered. And because we rely on redundancy and co-occurrence, the method will not work if the target relation is only very sparsely represented in the corpus. This restricts the method to domains where the target elements are redundantly available. When extracting from the Web, a significant number of web pages will have to be available with the target instances. As an example, our method will be useful for extracting relations instances for relatively well known people but not for unknown people. In general, our method will be useful for instantiating relations between instances that occur in large numbers on the web and will be not so useful for more obscure instances.
The ranking procedure used by the method is compared to that of the Normalized Google Distance, a method for determining the semantic distance between two terms. We find that our ranking outperforms the Normalized Google Distance ranking.
An obvious direction for further research is to test this method on more domains where relations that satisfy our assumptions are to be instantiated. An example could be geography (eg. which cities are located in a country). In both of the relation instantiation tasks described in this chapter, we extracted the same type of right-hand side candidates: persons. Extracting instances of relations with different range types such as ‘cities’ in the aforementioned task could also be an interesting test case for this method.

Sunday, 12 April 2015

Sử dụng các nguồn tài nguyên Web để nâng cao chất lượng việc làm giàu Ontology

Trong chương này, chúng tôi sẽ nghiên cứu việc kết hợp thông tin từ nhiều nguồn khác nhau để cải tiến kết quả của quá trình trích xuất thông tin để làm giàu ontology. Chúng tôi kết hợp kết quả từ rất nhiều phương pháp trích xuất thông tin với các tri thức nền có sẵn trong ontology đích để lọc kết quả. Việc kết hợp thông tin từ các nguồn khác nhau này giúp cho các kết quả đầu ra của quá trình làm giàu ontology có chất lượng cao hơn. Ở đây, chúng tôi giới thiệu một phương pháp tổng quát để kết hợp các loại thông tin khác nhau này và giới thiệu một số quy tắc (rules) chung mà đã được sử dụng để xử lý thông tin nền. Thông tin từ các nguồn khác nhau có thể được kết hợp bằng cách sử dụng các phương pháp tích hợp. Trong chương này, chúng tôi giới thiệu ba trong số các phương pháp khả thi này và thảo luận về các ưu điểm và lợi ích của từng phương pháp. Chúng tôi kiểm tra các phương pháp này với 3 thử nghiệm trên các miền khác nhau mà nó minh họa cách thức mà phương pháp và các quy tắc được áp dụng. Theo báo cáo của chúng tôi thì các phương pháp này giúp tăng đáng kể hiệu năng của quá trình làm giàu ontology.
5.1 Giới thiệu
Trong các chương trước, chúng tôi đã giới thiệu một số phương pháp trích xuất thông tin từ Web để làm giàu ontology. Trong phần thảo luận của các phương pháp này và trong các kết quả thử nghiệm. chúng tôi đã đưa ra một số các nhận xét rằng việc sử dụng tri thức nền sẵn có trong ontology đích sẽ giúp cải tiến hiệu năng của các phương pháp này. Các tri thức nền này có thể được sử dụng để lọc các cấu trúc ontology mà đã được đề xuất bởi một loạt các phương pháp trích xuất thông tin. Một giải pháp khai thác các tri thức nền sẵn có này là sử dụng nó trực tiếp như là các bộ lọc trong từng module trích xuất thông tin. Một ví dụ của điều này được đưa ra trong chương 4, ở đó chúng tôi sử dụng tri thức nền mà một nghệ sĩ chỉ có thể có 1 nơi sinh kết hợp với nguyên tắc phân loại địa lý (geographical taxonomy) để lọc ra và loại đi rất nhiều thể hiện quan hệ Artist - Birthplace đã được đề xuất. Tuy nhiên, trong chương này chúng tôi mô tả một phương pháp kết hợp thông tin nền với các kết quả của quá trình trích xuất thông tin như một cách mà nó độc lập với các nguồn trích xuất thông tin chính xác. Mặt khác, các nguồn này được coi như là "hộp đen" mà nó cung cấp cho thể hiện quan hệ ứng viên.
Việc sử dụng đa nguồn và các nguồn hỗn tạp sẽ cải tiến hiệu năng của quá trình làm giàu ontology. Để làm được các điều này, các nguồn thông tin được sử dụng cho quá trình trích xuất thông tin sẽ phải bao trùm (overlap) một khối lượng đáng kể để tập hợp các bằng chứng thực sự đủ mạnh để chọn ra được các thể hiện ứng viên chính xác. Tại cùng thời điểm đó, các nguồn thông tin phải đủ khác biệt để chúng thực sự có thể bổ sung các thông tin mới vào quá trình làm giàu ontology. Chúng sẽ không tạo ra các lỗi tương tự nhau, bởi vậy các sai sót và các khẳng định lỗi (false positives) đã được trích xuất bởi một nguồn có thể được chính xác hóa bởi một nguồn thông tin khác. Mặt khác, các nguồn thông tin cá thể phải có độ chênh (biases) khác nhau để chúng đưa ra khằng định lỗi và phủ định lỗi (false negatives) khác nhau. Trong trường hợp đó, việc kết hợp các kết quả từ đa nguồn sẽ cải tiến toàn bộ hiệu năng của phương pháp.
Ví dụ, các lỗi mà là kết quả của quá trình so khớp (matching) tên không tốt trong phương pháp trích xuất thông tin dựa trên nhận dạng thực thể tên có thể được chính xác hóa bởi việc kiểm tra nó đối với tri thức nền mà nó sử dụng các thông tin về thời gian (temporal) đã được lưu trong ontology. Giả thuyết mà chúng tôi nghiên cứu thông qua một số các thử nghiệm trong chương này là việc kết hợp thông từ các nguồn thông tin khác nhau một cách đầy đủ thực sự giúp tăng hiệu năng của quá trình làm giàu ontology.
Trường hợp đặc biệt là trường hợp bỏ sót các giá trị. Các thể hiện chính xác có thể bị bỏ sót hoàn toàn bởi một nguồn thông tin, ví dụ do quá trình so khớp thuật ngư lỗi hoặc do thuật ngữ đó không xuất hiện trong kho văn bản đã được sử dụng bởi nguồn thông tin đó. Nếu các giá trị này có thể được tìm bởi một hoặc miền các nguồn thông tin bổ sung mà nó sử dụng các thủ tục so khớp khác nhau hoặc cùng sử dụng một kho văn bản khác nhau, giá trị bỏ sót đó bây giờ có thể được tìm thấy.
Trong phần tiếp theo, chúng tôi đi vào nghiên cứu chi tiết công việc này. Trong phần 5.3, sau đó chúng tôi giới thiệu phương pháp thu thập thông tin của chúng tôi từ các nguồn thông tin khác nhau và giới thiệu một số các quy tắc ví dụ mà có thể được sử dụng như là tri thức nền. Tất cả các thông tin này được kết hợp bằng cách sử dụng một kỹ thuật tích hợp thông tin, kỹ thuật này phụ thuộc vào các cài đặt cụ thể mà phương pháp đó đã sử dụng. Chúng tôi mô tả ba trường hợp tích hợp thông tin để cung cấp một phân loại true/false đối với một thể hiện quan hệ ứng viên. Trong phần 5.4, sau đó chúng tôi chỉ ra cách thức các trường hợp (incarnations) khác nhau của phương pháp làm việc đối với ba công việc trích xuất thể hiện quan hệ khác nhau này.
5.2 Định nghĩa bài toán
Mục tiêu của công việc làm giàu ontology là tìm ra các lớp và quan hệ mới (Ontology Learning) và các thể hiện của cả các lớp và các quan hệ này (Ontology Population) cho một ontology đã có sẵn. Chúng tôi giả sử rằng ontology này là ontology đã được trích xuất một phần trước đó và vì vậy rất nhiều tri thức là có sẵn về miền của ontology đó. Như chúng ta đã thực hiện trong các chương trước, ở đây chúng ta tập trung vào việc trích xuất thể hiện quan hệ: phát hiện các thể hiện quan hệ mới giữa các thể hiện lớp đã có trong cơ sở tri thức ontology. Trong chương này, chúng tôi giả sử rằng tri thức nền về các thể hiện đích là có sẵn trong ontology đó.
Trong quá trình làm giàu ontology, các phương pháp trích xuất thông tin khai phá các cấu trúc ontology ứng viên mới và sự hợp lý (likelihoods) có liên quan. Các ứng viên này sau đó được lưu trữ trong ontology hoặc bằng cách thủ công hoặc tự động (ví dụ, bằng việc chấp nhận tất cả các cầu trúc ứng viên với một sự hợp lý cao hơn một vài ngưỡng sau đó). Bằng việc sử dụng tri thức nền sẵn có đối với ứng viên đó, chúng ta có thể hiệu chỉnh sự hợp lý của thể hiện quan hệ ứng viên đó hoặc là lọc nó ra khỏi danh sách ứng viên.
Ví dụ, khi trích xuất các thể hiện quả quan hệ has_artist giữa các lớp Art Style và Artist, thông tin nền có sẵn đã được lưu trữ trong cơ sở tri thức về cả các phong cách nghệ thuật đơn lẻ và nghệ sĩ có thể được kết hợp để lọc độ hợp lý (likelihood) của thể hiện quan hệ ứng viên. Ví dụ về độ hợp lý, khả năng Vincent van Gogh là một nghệ sĩ theo xu hướng kỳ quái (Baroque artist) đưa ra là rất thấp bởi ông ta sinh ra sau gần một thế kỷ của phong cách nghệ thuật này. Đặc biệt trong các ontology di sản văn hóa đã được làm giàu một phần đã được mô tả trong các chương trước, rất nhiều các thông tin nền này hữu ích đối với việc cải tiến công việc trích xuất thông tin là sẵn có.
Việc sử dụng ontology đó và các ngôn ngữ luật đã được sử dụng trong web ngữ nghĩa hiện nay, các kỹ sư ontology về mặt lý thuyết là có thể đưa trực tiếp các kiểu ràng buộc này vào trong ontology đó (trong trường hợp ví dụ ở trên mọt ràng buộc nói rằng không nghệ sĩ nào có thể liên quan đến một phong cách nghệ thuật nếu khoảng thời gian của nghệ sĩ đó và khoảng thời gian tồn tại phong cách đó là không chồng lấn (overlap) lên nhau). Trong trường hợp đó, khi một cấu trúc ontology ứng viên lỗi được thêm vào cơ sở tri thức ontology, thì tính không nhất quán (inconsistency) sẽ xảy ra. Tuy nhiên, trong thực tế phần lớn các ontology thế giới thực không có đủ mật độ logic cao này. Nhiều ontology mà thực sự đã được sử dụng trong web ngữ nghĩa, bao gồm the MultimediaN E-culture ontology [Schreiber et al., 2008] là các nguyên tắc phân loại (taxonomies) và bộ từ vựng có cấu trúc có tính logic tương đối thấp.
Tuy nhiên, khi mô hình hóa một miền lớn và thực tế thì trước hết không cần các luật khó này. Do tính chất không chắc chắn của hầu hết các tri thức thể giới thực, nên việc bổ sung quá nhiều các ràng buộc khó đối với mô hình dữ liệu sẽ chắc chắn dẫn đến nhiều khả năng không nhất quán, làm cho mô hình dữ liệu kém hiệu quả. Trong ví dụ về quan hệ artist-art style ở trên, chúng tôi có lẽ muốn cho phép một nghệ sĩ sinh sau thời kỳ tồn tại của một phong cách nghệ thuật để vẫn có thể làm việc được với phong cách nghệ thuật đó. Thay vì khai thác các tri thức nền có sẵn bằng cách sử dụng các ràng buộc khó và mang tính bản thể học, chúng tôi muốn sử dụng các quy tắc mềm để hiệu chỉnh độ hợp lý của các cấu trúc ontology ứng viên.
Để cải tiến hơn nữa quá trình trích xuất thông tin, chúng tôi không chỉ muốn kết hợp các kết quả trích xuât thông tin với tri thức nền thông qua việc sử dụng các luật này mà còn kết hợp các kết quả của một loạt các phương pháp trích xuất thông tin này. Các phương pháp khác nhau có các sai số hệ thống (biases) khác nhau và bằng cách kết hợp các kết quả từ một loạt phương pháp chúng ta có thể làm giảm tính hợp lý của các khẳng định lỗi và tăng tính hợp lý của các phủ định lỗi từ một phương pháp đơn giản qua đó nâng cao độ chính xác và sai số. Phương pháp đã đưa ra này có thể kết hợp các nguồn thông tin khác nhau.
Khung được mô tả dưới đây đơn giản hóa việc kết hợp của các kết quả từ các phương pháp trích xuất thông tin mà nó mang lại thông tin không chắc chắn (uncertain information) và thông tin để đánh giá các cấu trúc ontology có được việc sử dụng tri thức nền. Chúng tôi đã xác định đươc 3 loại kết hợp khác nhau như sau:
Trường hợp 1: Kết hợp các kết quả trích xuất thông tin không chắc chắn (uncertain) với các kết quả trích xuất thông tin không chắc chắn khác.
Trường hợp 2: Kết hợp các kết quả trích xuất thông tin không chắc chắn với các tri thức nền không chắc chắn.
Trường hợp 3: Kết hợp các kết quả trích xuất thông tin không chắc chắn với tri thức nền từ các nguồn trích xuất thông tin không chắc chắn.
Trong kết hợp loại 1, nguồn thông tin cung cấp thông tin về cùng một trúc ontology ứng viên. Ví dụ, phương pháp trích xuất thông tin dựa trên khả năng đồng xuất hiện chạy trên một kho văn bản riêng và trả về hai độ hợp lý khác nhau cho cùng ột bộ ứng viên (chính xác) [Baroque, has_artist, Vincent van Gogh].

Figure 21: Outline of the three method variants

Trong trường hợp 2, chúng tôi kết hợp độ hợp lý lấy từ quá trình trích xuất thông tin (IE-derived likelihood) cho một bộ ba (triple) ứng viên với tri thức nền về chủ ngữ và tân ngữ của bộ ba này. Chúng tôi sử dụng các luật để lấy được các tuyên bố mới (new statement) về bộ ba ứng viên từ tri thức nền sẵn có để nó có thể được kết hợp với các kết quả từ quá trình trích xuất thông tin. Tùy thuộc vào tính tổng quát của chúng, các luật được tạo thủ công có thể được sử dụng lại cho các công việc / miền khác. Trong ví dụ về Vincent van Gogh đã mô tả ở trên một luật được sử dụng bắt đầu với quan hệ art_style-artist hầu như ít có khả năng áp dụng khi mà hai khoảng thời gian của chúng không chồng lấn lên nhau. Trong trường hợp này, tri thức nền về các nghệ sĩ đơn lẻ và phong cách nghệ thuật đã được sử dụng để tạo ra một tuyên bố về độ hợp lý của bộ ba [Baroque, has_artist, Vincent van Gogh]. Trong phần 5.3.1, chúng tôi cung cấp một số luật rất tổng quát và có thể sử dụng được trong rất nhiều miền khác nhau để khai thác tri thức miền.
Các luật mà đã được sử dụng trong trường hợp thứ 2 cũng có thể được sử dụng với thông tin không chắc chắn được lấy từ các phương pháp trích xuất thông tin, tương ứng với trường hợp 3. Trong ví dụ trên, ngày sinh và ngày chết của nghệ sĩ có lẽ là có sẵn trong ontology, nhưng thời kỳ tồn tại của các phong cách nghệ thuật có lẽ không có. Khi đó, các thông tin muộn hơn được nhận bằng cách sử dụng phương pháp trích xuất thông tin với một độ hợp lý nào đó, các luật tri thức nền có thể được sử dụng trong cùng cách thức như là trong trương hợp 2, kết quả của nó là một tuyên bố về bộ ba ứng viên. Sự không chắc chắn của các thông tin đã được trích xuất có thể được xem xét.
5.3 Mô tả phương pháp
Hình 21 cho ta một biểu diễn đồ họa của các bước khác nhau của phương pháp này. Trong hai bước đầu tiên, thông tin được thu thập. Chúng tôi phân loại ra thành hai loại nguồn thông itn: nguồn đầu vào và nguồn lọc. Các nguồn đầu vào cung cấp các thể hiện quan hệ ứng viên cũng như là một vài điểm chỉ ra độ chắc chắn của thể hiện quan hệ là đúng. Các ví dụ là các kết quả từ một phương pháp trích xuất thông tin như là phương pháp dựa trên dư thừa thông tin đã được mô tả trong chương 2, số điểm liên quan trong trường hợp đó là một hệ số trượt (drop factor). Trong bước đầu tiên của phương pháp này, các thông tin này đã được thu thập. Phương pháp cần ít nhát một nguồn đầu vào được biểu diễn để tổng quát hóa các thể hiện quan hệ ứng viên. Đa nguồn đầu vào có thể được sử dụng, trong trường hợp này các ứng viên mới đã được bổ sung vào tập ứng viên. Chúng tôi giả sử rằng nếu một nguồn đầu vào không cung cấp bất kỳ điểm độ hợp lý nào đối với một thể hiện quan hệ ứng viên mà đã được cung cấp bởi một nguồn đầu vào thứ hai, thì giá trị điểm độ hợp lý này được đặt về 0.
Loại nguồn thứ hai là nguồn dùng để lọc. Nguồn này không cung cấp các thể hiện quan hệ ứng viên mới mà chỉ đưa ra điểm độ hợp lý đối với tập các ứng viên được tạo ra bởi các nguồn đầu vào. Một ví dụ của điều này là sử dụng khoảng các google chuẩn (the Normalized Google Distance) [Cilibrasi and Vitanyi, 2004] đối với các thể hiện quan hệ. Số lượng lớn các kết hợp chủ ngữ - tân ngữ khả thi làm cho nó không khả thi để sử dụng nó như là một nguồn trích xuất thông tin (input source), nhưng có thể được sử dụng để tạo ra các điểm độ hợp lý đối với các thể hiện quan hệ ứng viên đã có. Tri thức nền cũng được sử dụng như là một nguồn lọc thông tin để cung cấp các điểm độ hợp lý, hoặc là bằng cách sử dụng thông tin từ ontology (trường hợp 2) hoặc bằng cách sử dụng thông tin đã được trích xuất (trường hợp 3). Trong phần 5.3.1, chúng tôi sẽ mô tả cách chúng tôi sử dụng tri thức nền như thế nào.
Trong bước thứ 3, các độ hợp lý đối với một thể hiện quan hệ ứng viên đã được kết hợp với nhau. Thể hiện quan hệ ứng viên này sau đó đã được đưa vào cơ sở tri thức trong trường hợp nó được phân loại là đúng và sẽ không được đưa vào nếu nó được phân loại là sai. Các phương pháp tích hợp thông tin khác nhau có thể được thực hiện ở đây, mỗi một phương pháp đều có khuynh hướng (bias), giá trị và lợi thế riêng. Trong chương này, chúng tôi thảo luận ba phương pháp khả thi: tính toán sác xuất phi trọng số (unweighted probability) (3a), sử dụng phương pháp bầu cử (voting method) (3b) hoặc tập huấn mô hình phân loại sử dụng một phần chú thích bằng tay của dữ liệu (3c). Trong phần 5.3.2, chúng tôi mô tả chi tiết từng phương pháp này.
5.3.1 Sử dụng tri thức nền
Để tạo ra điểm độ hợp lý, tri thức nền về chủ ngữ, quan hệ và tân ngữ của bộ ba (triple) đã được kết hợp với các luật tổng quát, đã được khởi tạo đối với một miền cụ thể. Các luật tổng quát này áp dụng đối với một loạt miền khác nhau và có thể được tái sử dụng thông qua các công việc. Trong phần này, chúng tôi giới thiệu một số các luật tổng quát này, nó đã được khởi tạo trong các thử nghiệm đã được mô tả trong phần 5.4. Các luật khác có thể được thêm vào bằng tay và chúng có thể hoặc là tổng quát hoặc là cụ thể đối với miền đó. Đối với một công việc làm giàu ontology đã cho trong đó một quan hệ đã được trích xuất, các luật tổng quát có thể được khởi tạo bằng cách sử dụng tri thức có sẵn trong có sở tri thức (điều này tương ứng với trường hợp 2).
Các luật này cũng có thể được sử dụng với tri thức đã được trích xuất từ một nguồn khác. Điều này tương ứng với trường hợp 3. Trong trường hợp đó, tính không chắc chắn của các thông tin đã được trích xuất cũng có thể được đưa vào danh sách quan tâm sau đó xác định độ hợp lý của bộ ba quan hệ ứng viên này. Trong các thử nghiệm của chúng tôi, chúng tôi sử dụng một phiên bản đơn giản, trong đó chúng tôi quan tâm các thông tin đã được trích xuất là chính xác mà nó đã được sử dụng trong trường hợp thứ 2. Trong độ hợp lý của một bộ ba đã được trích xuất bên dưới, ngưỡng mà bộ ba này đã bị loại và nó không được sử dụng như là tri thức nền có sẵn đối với đã được khởi tạo với luật đó.
Định dạng luật đó mà chúng tôi sử dụng ở đây bao gồm 3 phần. Phần đầu tiên là phần 'FOR', mà nó xác định loại bộ ba ứng viên [Subject, Relation, Object]candidate mà luật đó đã định nghĩa. Phần thứ 2 của luật này là danh sách một số tiền điều kiện mà nó đã được biểu diễn trong các bộ ba này. Phần cuối cùng của luật này là phần 'THEN', trong đó điểm độ hợp lý L đã được tính toán dựa trên các giá trị đã được xác định bởi các tiền điều kiện. Độ đo hợp lý này có thể hoặc là điểm khoảng cách (distance score), trong đó các giá trị thấp hơn cho một giá trị độ hợp lý cao hơn hoặc một điểm độ hợp lý trực tiếp, trong đó điểm cao hơn cho một giá trị độ hợp lý cao hơn. Trong bước tích hợp thông tin này, các giá trị này có thể được sử dụng trực tiếp hoặc đã chuẩn hóa phụ thuộc vào phương pháp tích hợp.
Ở đây chúng tôi giới thiệu một số ví dụ luật đối với các bộ ba ứng viên của loại [Person, member_of, Group]. Đây là một quan hệ mà có thể được tìm thấy trong nhiều miền và trong nhiều cơ sở tri thức hoặc ontology. Trong các chương trước, chúng tôi đã mô tả các phương pháp mà tìm các thể hiện của các loại quan hệ này bao gồm việc trích xuất các thể hiện quan hệ [Artist, has_style, Art_Style] và [Football Player, plays_for, Football Club]. Chúng tôi sử dụng các luật này trong các thử nghiệm đã được mô tả trong phần 5.4.
5.3.1.1 Luật tri thức nền thời gian
Luật này tương ứng với khả năng trực giác mà không tưởng đối với một agent (một người) tương ứng với một nhóm các agent nếu khoảng thời gian tồn tại của cái cũ (former) không trùng với khoảng thời gian tồn tại của cái sau (latter). Nếu hai khoảng thời gian này chồng xếp lên nhau, Ltime = 0. Ngược lại, nếu nó bằng với độ dài của khoảng thời gian giữa hai thời kỳ này. Ở đây, giá trị Ltime lớn hơn tương ứng với hầu như ít các ứng viên hơn.

5.3.1.2 Luật tri thức nền không gian
Luật tri thức nền không gian có thể được áp dụng đối với các thể hiện quan hệ trong đó cả chủ thể (subject) và đối tượng (object) đều có một vị trí và các vị trí gần hơn tạo ra thể hiện quan hệ ứng viên nhiều hơn. Ở đây chúng ta giới thiệu luật đối với các quan hệ [Person, member_of, Group]. Trong luật bên dưới, chúng ta giả sử rằng một nhóm có thể có nhiều vị trí khác nhau, và chúng ta sử dụng khoảng cách tối thiểu. Trong luật này, bất kỳ độ đo khoảng cách địa lý nào cũng có thể được sử dụng, trong thử nghiệm được mô tả trong phần 5.4.3, chúng tôi sử dụng một độ đo khoảng cách giữa các quốc gia, trong đó khoảng cách giữa hai quốc gia là 1 nếu chúng tương đương hoặc nếu các quốc gia này có cùng biên giới và bằng 0 trong các trường hợp còn lại. Trong các trường hợp khác, nhiều hơn các độ đo khoảng cách hạt mịn (fine-grained distance measures) có thể được sử dụng.

5.3.1.3 Luật tri thức nền kết hợp
Luật này nói rằng một người hầu như thuộc về một nhóm nhiều hơn nếu người đó có liên quan đến người thứ hai, trong số họ được biết rằng anh ta hoặc cô ta có liên quan đến nhóm này. Đối với một kết hợp đơn giản, điểm độ hợp lý tương quan (association likelihood score) bẳng tổng độ hợp lý hậu (posteriori likelihood) mà người có liên quan là một phần của nhóm đích đó. Nếu quan hệ này được biểu diễn trong một ontology mà trong đó tất cả các tri thức được giả sử một cách logic là đúng, chúng ta có thể giả sử một giá trị tối đa cho độ hợp lý sau này. Trong thử nghiệm được mô tả trong phần 5.4.3, độ hợp lý là kết quả của một nguồn thông tin đầu vào, bởi vậy nó tương ứng với trường hợp thứ 3 như đã mô tả trong phần 5.2. Nếu nhiều hơn các sự kết hợp là sẵn có, thì các giá trị độ hợp lý thu được sẽ được cộng lại.

5.3.2 Bước tích hợp thông tin
Đối với một thể hiện quan hệ ứng viên, các điểm độ hợp lý khác nhau L được đưa vào bước tích hợp thông tin. Ở đây, các độ hợp lý được kết hợp vào một giá trị độ hợp lý tổng, dựa vào giá trị đó mà một thể hiện quan hệ ứng viên có thể hoặc không thể được thêm vào trong cơ sở tri thức của ontology. Vì vậy, phần công việc mà module này xử lý là một trong các công việc tích hợp thông tin. Các giải pháp khác nhau cho bài toán tích hợp này đã được đề xuất trong tài liệu và đã được triển khai trong các hệ thống làm việc. Các phương pháp này có các khuynh hướng (biases) và các yêu cầu khác nhau như là các ngưỡng được xác định một cách thủ công hoặc các bước chuẩn hóa. Trong phần này, chúng tôi mô tả ba thủ tục tích hợp thông tin khác nhau, mỗi trong số đó được sử dụng trong các thử nghiệm được mô tả trong phần 5.4.
5.3.2.1 Xác suất trung bình (average probabilities)
Một cách rất đơn giản của việc kết hợp các điểm độ hợp lý là chuyển đổi mỗi giá trị điểm này thành một giá trị xác suất (vd: một giá trị giữa 0 và 1, trong đó giá trị cao hơn có nghĩa là xác suất cao hơn). Giá trị trung bình có trọng số của các giá trị xác suất này có thể được sử dụng để xác định xác xuất hậu nghiệm (posteriori probability). Một ngưỡng giá trị xác suất này xác định phân lớp của thể hiện quan hệ ứng viên. Điều này tương ứng với phân loại sử dụng một perceptron. Trong trường hợp này, các trọng số đã được xác định bằng tay. Nếu chúng ta không có tri thức trước về độ chính xác của các nguồn thông tin, tất cả các giá trị trọng số này có thể được đặt là 1. Các kết quả này trong một xác xuất hậu nghiệm mà nó là giá trị trung bình của xác suất tất cả các nguồn thông tin.
Để chuyển một giá trị độ hợp lý (Lsource) thành một giá trị xác suất Psource, chúng ta thực hiện một bước chuẩn hóa. Nếu phân bố của các giá trị điểm độ hợp lý là đã biết, điều này có thể được thực hiện cho quá trình chuẩn hóa. Ví dụ, đối với các giá trị điểm độ hợp lý mà tương ứng với tần suất xuất hiện trong văn bản, chúng ta có thể sử dụng một phương pháp chuẩn hóa lo-ga-rit thay vì một phương pháp tuyến tính, bởi vì chúng ta biết rằng thuật ngữ tần suất xuất hiện theo một phân bố Zipf [Zipf, 1932]. Nếu phân bố này mà chưa biết trước, chúng ta sử dụng một phương pháp chuẩn hóa tuyến tính mặc định bởi việc chia giá trị độ hợp lý của một thể hiện quan hệ ứng cho giá trị độ hợp lý tối đa của tất cả các ứng viên:

Nếu giá trị độ hợp lý này là một độ đo khoảng cách (hầu như các giá trị cao hơn là ít), chúng ta sử dụng 1 trừ đi giá trị này để có được giá trị xác suất.
Phương pháp tích hợp thông tin này có ưu điểm là nó chỉ có một tham số: giá trị ngưỡng xác suất hậu nghiệm mà nó xác định phân lớp cuối cùng. Trong các thử nghiệm trong phần 5.4, chúng tôi chọn một giá trị mặc định là 0,5 và đưa ra báo cáo về hiệu năng của phương pháp sử dụng giá trị này. Chúng tôi cũng nghiên cứu giá trị tối ưu đối với ngưỡng này cho các dữ liệu được chú thích và so sánh tham số tối ưu (optima) giữa các thử nghiệm. Bởi vì phương pháp này chỉ có một tham số nên nó có thể được sử dụng khi không có thông tin về độ chính xác của các nguồn thông tin là có sẵn. Trong trường hợp mà chúng ta không có tri thức trước thì chúng ta sử dụng các giá trị trọng số bằng nhau, phương pháp này giả sử rằng tất cả các nguồn đều có thể xác định xác suất cho một ứng viên. Cũng như vậy, nếu bước chuẩn hóa được thực hiện sử dụng các phân bố chênh lệch (skewed distributions), các giá trị xác suất có lẽ không tương ứng với các giá trị độ chính xác của ứng viên.
5.3.2.2 Biểu quyết (voting)
Đối với công việc biểu quyết, chúng ta sử dụng một giá trị ngưỡng điểm độ hợp lý của nguồn (Tsource) để tạo ra một giá trị xác suất hoặc là 1 hoặc là 0 cho thể hiện quan hệ ứng viên. Nếu điểm độ hợp lý này là một độ đo khoảng cách, một ứng viên với giá trị điểm thấp hơn giá trị ngưỡng sẽ được gán một giá trị xác suất là 1 và là 0 trong trường hợp ngược lại. Đối với các điểm giá trị độ hợp lý trực tiếp, một giá trị điểm cao hơn giá trị ngưỡng sẽ cho kết quả độ hợp lý là 1. Công việc biểu quyết xác định phân lớp cuối cùng của ứng viên dựa trên. Có thể sử dụng thiết lập biểu quyết nhiều lần. Các ví dụ là biểu quyết đa số (majority voting) trong đó ứng viên được phân lớp là true nếu hơn một nửa số nguồn đều có giá trị độ hợp lý cao hơn giá trị ngưỡng hoặc biểu quyết nhất trí (unanimity voting) trong đó giá trị độ hợp lý của tất cả các nguồn phải trên giá trị ngưỡng.
Phương pháp tích hợp thông tin này sử dụng nhiều tham số hơn: với mỗi nguồn, một giá trị ngưỡng phải được xác định bằng tay và công việc biểu quyết cũng phải được chọn. Sử dụng phương pháp tích hợp này cho phép nhiều điều chỉnh khả thi và bất kỳ tri thức hoặc giả thuyết nào về các giá trị ngưỡng có thể được sử dụng đều ảnh hưởng đến hiệu lực của các nguồn thông tin cụ thể hoặc để chọn một thiết lập mà nó ưu đãi hoặc là độ chính xác hoặc là sai số. Nó cũng cung cấp cái nhìn sâu sắc về ảnh hưởng của các nguồn thông tin khác nhau, bởi vì ở đây chúng ta có thể kiểm tra hiệu quả cá nhân đối lập với phương pháp xác suất trung bình. Trong phần 5.4 chúng tôi nhìn vào ảnh hưởng của các giá trị ngưỡng và các thiết lập biểu quyết khác nhau này và cố gắng để đưa ra một chiến lược tốt cho việc chọn các thiết lập này.
5.3.2.3 Huấn luyện một phân lớp (classifier)
Phương pháp tích hợp thứ ba mà chúng tôi mô tả ở đây khác so với hai phương pháp đầu tiên mà trong đó chúng tôi giả sử rằng một phần của các thể hiện quan hệ ứng viên là đã được phân lớp hoặc là tru hoặc là false. Nếu dữ liệu này là có sẵn, chúng ta có thể huấn luyện một mô hình để phân lớp các thể hiện quan hệ ứng viên. Ở đây, các giá trị điểm độ hợp lý của nguồn được coi như là các giá trị đặc trưng. Phương pháp này là rất hữu ích khi một phần của dữ liệu đã được gắn thẻ như các ví dụ khẳng định và phủ định. Một trường hợp ví dụ là trong một kịch bản làm giàu ontology, khi mà một tập mới các thể hiện đã được thêm vào cơ sở tri thức trong đó rất nhiều các quan hệ đã được biểu diễn.
Chúng ta có thể sử dụng các phương pháp học có giám sát hoặc học bán giám sát. Đối với các phương pháp học bán giám sát, trên mỗi vòng lặp chúng ta có thể thêm vào hầu hết các thể hiện quan hệ ứng viên dựa trên mô hình hiện tại của tập nhân các thể hiện quan hệ đã biết và hiệu chỉnh các tham số mô hình dựa trên tập nhân đã được xây dựng mới. Chúng ta huấn luyện hai loại mô hình trên dữ liệu đã được đánh giá thủ công để cho ra một chỉ số hiệu suất có thể đạt được bằng phương pháp này. Loại mô hình đầu tiên là phân lớp Naive Bayes [Mitchell, 1997]. Mô hình thứ hai là Cây quyết định (Decision Tree) được học sử dụng thuật toán J48 [Witten and Frank, 2005] mà nó chia dữ liệu tại mỗi node phụ thuộc vào giá trị của một đặc tính, đến tận node lá cung cấp một phân lớp. Các phân lớp này có thể được học từ các dữ liệu có sẵn hoặc được xây dựng thủ công.
5.4 Thử nghiệm
Ở đây chúng tôi mô tả ba thử nghiệm mà nó chỉ ra công việc của phương pháp và lợi ích của việc quan tâm đến đa nguồn thông tin đối với việc trích xuất thể hiện quan hệ. Trong mỗi thử nghiệm này, chúng tôi chỉ ra hiệu quả của việc sử dụng ba phương pháp tích hợp thông tin: phương pháp xác suất trung bình, phương pháp biểu quyết và mô hình huấn luyện phân lớp. Các kết quả đã được đánh giá theo độ chính xác, sai số và giá trị độ đo F dựa trên một chuẩn vàng (gold standard). Để đo hiệu quả của mô hình đã được huấn luyện, chúng ta thực hiện việc đánh giá 10-fold cross validation.
5.4.1 Tướng lĩnh La Mã
5.4.1.1 Setup
Thử nghiệm đầu tiên chúng tôi mô tả ở đây được sử dụng để minh họa các công việc của phương pháp với một công việc trích xuất thể hiện quan hệ tương đối nhỏ. Chúng tôi cũng xem xét các kết quả trên hiệu suất tổng thể. Ở đây, quan hệ đích đã được khởi tạo là quan hệ [Roman General, participated_in, Roman War]. Đối với quan hệ này, chúng ta nhận được 51 cuộc chiến tranh trong lịch sử liên quan đến thành Rome và 210 tướng lĩnh La Mã từ các trang Wikipedia tương ứng và đã sử dụng các kết quả này như là các thể hiện của các lớp Roman War và Roman General. Đối với cả các tướng lĩnh và các cuộc chiến tranh chúng ta cũng trích xuất các khoảng thời gian liên quan đến họ: năm sinh và năm qua đời đối với tướng lĩnh năm bắt đầu và năm kết thúc đối với các cuộc chiến tranh. Nếu chỉ có các năm xấp xỉ (gần đúng), chúng ta tự quyết định dựa trên năm bắt đầu và năm kết thúc đã cố định.
Chúng tôi sử dụng ba nguồn thông tin: phương pháp trích xuất thông tin dựa trên công cụ tOKo được mô tả trong chương 4 được sử dụng như là một nguồn đầu vào và khoảng cách google chuẩn và tri thức nền về thời gian như là các nguồn lọc.
Đối với phương pháp trích xuất thông tin dựa trên công cụ tOKo, đầu tiên chúng ta xây dựng một kho văn bản bằng cách truy vấn google với câu truy vấn "Roman Rome War General". Chúng tôi đã tải về 1000 trang kết quả đầu tiên và loại bỏ đi các trang trắng, cuối cùng được một kho văn bản với 678 trang văn bản. Kho văn bản này và các lớp đã được trích xuất được đưa vào công cụ tOKo. Sự xuất hiện đồng thời của một tướng lĩnh La Mã (Roman general) và một cuộc chiến tranh (war) bên trong 20 token của nhau (twenty tokens of each other) trong các tài liệu kho văn bản được xem như là bằng chứng cho sự tồn tại của các thể hiện quan hệ liên quan. Chúng tôi đã trích xuất toàn bộ các thể hiện của các câu truy vấn tOKo sau: <RomanWar>...20<RomanGeneral> và <RomanGeneral>...20<RomanWar> và kết hợp các kết quả đó với nhau. Các mẫu này đã xuất hiện với tần số thấp: 59 kết quả được tìm thấy trong toàn bộ kho văn bản trong đó có 23 thể hiện quan hệ ứng viên khác nhau. Các thể hiện quan hệ ứng viên này và các tần suất xuất hiện tOKo của nó được sử dụng như là nguồn thông tin đầu vào.
Loại nguồn thông tin thứ hai mà chúng tôi đã sử dụng là khoảng cách google chuẩn (NGD) mà chúng tôi đã sử dụng nó như là một nguồn lọc (filter source). Khoảng cách google chuẩn được tính toán giữa các nhãn của lớp tướng lĩnh La Mã và nhãn của các cuộc chiến tranh đối với mỗi thể hiện quan hệ ứng viên. Khoảng cách được tính toán cao nhất là đối với thể hiện [Julius Caesar, participated_in, Jugurthine War], với giá trị là 0,534 và thể hiện quan hệ ứng viên có điểm tốt nhất là [Publius S. Galba Maximus, participated_in, Second Macedonian War] với giá trị độ đo NGD = 0,146.
Nguồn thông tin thứ 3 là luật tri thức nền thời gian, như đã được mô tả trong phần 5.3.1.1, nguồn này chúng tôi sử dụng như một nguồn lọc. Đối với mỗi thể hiện quan hệ ứng viên, khoảng cách giữa các khoảng thời gian có liên quan được chúng tôi tính toán, khoảng cách này có giá trị là 0 đối với các ứng viên mà các khoảng thời gian của chúng chồng xếp lên nhau.
Để tính toán hiệu quả của phương pháp này, chúng tôi đã đánh giá bằng cách thủ công của mỗi thể hiện ứng viên trong số 23 thể hiện quan hệ ứng viên dựa vào các trang Wikipedia của tướng lĩnh đó hoặc các nguồn Web đã được chứng thực khác. 11 trong số 23 thể hiện quan hệ ứng viên được đánh giá là chính xác. Để so sánh các kết quả sai số khi sử dụng các nguồn thông tin khác nhau, chúng tôi coi như 11 ứng viên chính xác là maximum. Do đó "unfiltered recall" là 1,00. Điều này mang lại giá trị độ đo F là 0,65.
Bảng 34 chỉ ra danh sách 23 thể hiện quan hệ ứng viên, các kết quả đánh giá và các giá trị tương ứng với ba nguồn thông tin.
5.4.1.2 Kết quả
Xác suất trung bình (average probabilities). Đối với phương pháp tích hợp thông tin này, chúng tôi chuẩn hóa mỗi điểm độ hợp lý. Cả sự khác nhau về thời gian và điểm khoảng cách google chuẩn là các độ đo khoảng cách, vì vậy chúng tôi lấy giá trị nghịch đảo của chuẩn hóa tuyến tính (linear normalization). Đối với tần suất xuất hiện tOKo, chúng tôi sử dụng chuẩn hóa logarit (logarithmic normalization). Sác xuất được tính trung bình đối với mỗi ứng viên. Ví dụ, các giá trị đã chuẩn hóa đối với thể hiện quan hệ ứng viên đầu tiên trong bảng 34 là Ptoko = 1 , Ptime = 1 và PNGD = 0,504, cho kết quả sác xuất trung bình là 0,835. Đối với thể hiện không chính xác đầu tiên (Julius Caesar - Second Punic War), các giá trị điểm là thấp hơn đáng kể với Ptoko = 0,610, Ptime = 0,684 và PNGD = 0,424 kết quả giá trị xác suất trung bình là 0,572.

Table 34: The 23 candidate relation instances for the roman generals task. For each of the candidate relations, the evaluation result is also listed as well as the values for each of the three information sources. For time difference, higher values indicate a larger interval between the lifespan and the war, making the relation less likely.

Bằng việc thay đổi giá trị ngưỡng trên đó sác xuất posteriori (posteriori probability) mà chúng ta có thể chọn lựa hoặc là độ chính xác hoặc là sai số. Bất kỳ giá trị ngưỡng giữa khoảng 0,10 và 0,58 sẽ cho kết quả giá trị độ đo F cao hơn 0,65 so với tập chưa được lọc. Giá trị ngưỡng mặc định là 0,50 trong thử nghiệm này cũng tạo ra một phân lớp với giá trị độ đo F cao nhất. Với giá trị này, 16 thể hiện đã được phân lớp như là chính xác, 10 trong số này đã được phân lớp chính xác. Điều này tương ứng với giá trị độ đo F là 0,74, một cải tiến đáng kể so với tập chưa lọc.
Phương pháp biểu quyết. Đối với phương pháp biểu quyết, hiệu năng được xác định bằng các giá trị của mỗi giá trị ngưỡng trên điểm độ hợp lý. Để minh họa điều này, chúng ta xem xét hiệu năng đạt được bằng cách sử dụng mỗi phân lớp của nguồn và xác định các giá trị ngưỡng thực thi tốt nhất.
Sử dụng một giá trị ngưỡng tần suất xuất hiện tOKo cao hơn (Ttoko) tạo ra độ chính xác và sai số thấp hơn, Đối với các thể hiện quan hệ ứng viên này, giá trị độ đo F cao nhất nhận được với Ttoko = 0, nó tương ứng với các thiết lập chưa lọc.
Đối với sự khác nhau về thời gian, độ hợp lý của một thể hiện quan hệ ứng viên là 1 nều sự khác nhau về thời gian <= Ttime và là 0 trong trường hợp ngược lại. Các giá trị khả thi đối với Ttime từ 0 đến vô định, các giá trị sau đó tương ứng với tập chưa lọc. Với Ttime = 0, giá trị độ đo F là lớn nhất (0,83) với 10 trong số 13 ứng viên còn lại là chính xác.
Đối với khoảng cách google chuẩn, độ hợp lý biểu quyết là 1 được gán cho các ứng viên với giá trị NGD <= TNGD và là 0 trong trường hợp ngược lại. Bộ lọc này là tồi hơn đáng kể so với các bộ lọc trước. Giá trị độ đo F cao nhất đạt được là TNGD = 0,46: ở đây chúng tôi giữ lại 17 thể hiện quan hệ ứng viên, 10 trong số này là chính xác, cho kết quả độ chính xác là 0,59 và sai số là 0,91 và giá trị độ đo F là 0,71. Điều này chỉ cải thiện đôi chút so với đường cơ sở (baseline).
Tiếp theo, chúng tôi kết hợp cả ba nguồn thông tin này bằng cách sử dụng phương pháp tích hợp thông tin biểu quyết. Chúng tôi sử dụng các giá trị ngưỡng điểm tốt nhất đối với ba nguồn thông tin này (Ttoko = 0, Ttime = 0, TNGD = 0.46) để tạo ra 3 giá trị độ hợp lý nguồn thông tin là 1 hoặc 0 đối với mỗi thể hiện quan hệ ứng viên. Trong bảng 35, chúng tôi trình bày các ảnh hưởng đến hiệu năng khi hoặc là giá trị tối thiểu của một, hoặc là giá trị tối thiểu của hai hoặc là tất cả các nguồn phải cung cấp một giá trị độ hợp lý là 1 để giữ lại một thể hiện quan hệ ứng viên. Bảng này chỉ ra rằng giá trị độ đo F đạt được đối với các thiết lập nghiêm ngặt, khi tất cả ba nguồn thông tin này cần cung cấp giá trị độ hợp lý là 1. Với yêu cầu tối thiểu của hai trong số 3 giá trị độ hợp lý là 1, chúng tôi cũng nhận được kết quả làm tăng độ chính xác, trong khi giá trị sai số vẫn là 100% dẫn đến làm tăng nhẹ đối với giá trị độ đo F.
Các mô hình phân lớp có huấn luyện. Để huấn luyện hai mô hình phân lớp, chúng tôi đã import dữ liệu này từ bảng 34 vào công cụ WEKA [Witten and Frank, 2005] và phân lớp đã được huấn luyện mặc định Naive Bayes. Các giá trị nguồn thông tin này: tần suất xuất hiện tOKo, khoảng cách thời gian (temporal difference) và khoảng cách NGD được sử dụng như là các đặc trưng và kết quả đánh giá (true/false) được sử dụng như là lớp đích. Mô hình này đã được đánh giá sử dụng 10-fold cross validation. Trong các mô hình kết quả, chỉ yếu tố thời gian được quan tâm để xác định lớp: tất cả các thể hiện ứng viên mà có khoảng cách thời gian <= 1,5 được phân lớp là "true", phần còn lại là "false". Điều này cho kết quả độ đo F là 0,83.

Table 35: Performance of the different combination methods. For the average probabilities, the default threshold value also produces the best F-measure at 0.50. For voting method, we show the results of varying the minimum number of "1" likelihoods for the threshold parameter values Ttoko = 0, Ttime = 0, TNGD = 0.46. Shown is the number of retained candidates, the number of correct candidates and the resulting precision, recall and F-measures.

Chúng tôi cũng đã học cây quyết định đối với dữ liệu này sử dụng thuật toán J48 như đã được cài đặt trong bộ công cụ WEKA sử dụng các thiết lập mặc định và các kết quả đã được đánh giá lại sử dụng 10-fold cross validation. Trong các kết quả phù hợp nhất với cây thì yếu tố thời gian là đặc trưng dự báo tốt nhất, việc phân lớp tất cả các thể hiện với khoảng cách thời gian > 0 hầu như là false. Các thể hiện ứng viên còn lại được phân lớp là "true" nếu khoảng cách google chuẩn của nó NGD > 0,23, Các kết quả này trong 11 khẳng định true và 1 khằng định false (precision= 0.91, recall=0.91, F-measure=0.91).
Chúng tôi so sánh các kết quả của một loạt các phương pháp tích hợp thông tin khác nhau trong bảng bảng 35. Trong mọi trường hợp trong đó thông tin được kết hợp, hiệu năng tốt hơn khi được so sánh với đường cơ sở. Trong công việc cụ thể này, thông tin thời gian cung cấp lợi ích (gain) cao nhất. Phương pháp kết hợp Naive Bayes nhanh hơn so với tất cả các phương pháp kết hợp khác.
Trong thử nghiệm này, chúng tôi đã chỉ ra rằng việc sử dụng tri thức nền và các kết quả trích xuất thông tin bổ sung như là các bộ lọc trên một tập của các quan hệ ứng viên có thể cải tiến chất lượng của tập đó. Đặc biệt, bộ lọc thời gian cung cấp các kết quả tốt, mà không đáng ngạc nhiên cho rằng một tướng lĩnh tham gia trong một cuộc chiến tranh mà ông ta đã được tìm thấy trong khoảng thời gian trước khi ông ấy sinh ra hoặc sau khi ông ấy mất đi. Quan hệ ứng viên chính xác duy nhất trong đó khoảng cách thời gian > 0 là [Lusius Quietus, participated_in, Bar Kokhba’s Revolt]. Lý do cho điều này là nguồn đó được sử dụng để xác định thời kỳ diễn ra cuộc chiến tranh và nguồn mà chúng tôi đã đánh giá quan hệ vớ một định nghĩa khác nhau về cuộc chiến: một nguồn bao gồm các cuộc nổi dậy sớm hơn trong khi nguồn khác là một định nghĩa giới hạn hơn. Điều này là một ngoại lệ rõ ràng đối với ràng buộc thời gian giới hạn sáng suốt khác trên quan hệ này. Với hai thử nghiệm tiếp theo, chúng tôi chỉ ra các kết quả đối với các ràng buộc ít hạn chế hơn.
5. 4. 2 Art Style-Artist
Thử nghiệm trước đã chỉ ra trong miền đó, hiệu năng có thể tăng bằng cách sử dụng đa nguồn. Với thử nghiệm này, chúng tôi kiểm tra hiệu năng của phương pháp đối với công việc lớn hơn: với nhiều hơn các thể hiện ứng viên. Chúng ta trở lại với công việc trích xuất thể hiện quan hệ ứng viên từ chương 3 và chương 5: Việc trích xuất các nghệ sĩ tham gia đối với các phong cách nghệ thuật hiện đại trong ngữ cảnh của ontology và cơ sở tri thức văn hóa điện tử MultimediaN. Chúng tôi sử dụng lại ba nguồn thông tin: như là các nguồn đầu vào chúng tôi sử dụng cả các kết quả từ phương pháp dư thừa thông tin đã được mô tả trong chương 2 và phương pháp trích xuất thông tin dựa trên công cụ phân tích văn bản tOKo. Chúng tôi cũng sử dụng lại thông tin nền thời gian như là một nguồn lọc.
5.4.2.1 Cài đặt
Quan hệ mà đã được trích xuất là [Art Style, has_artist, Artist]. Lớp artist được trích xuất với các thể hiện từ các tên trong danh sách Union List of Artist [The Getty Foundation, 2000c]. Chúng tôi đã sử dụng lại 10 phong cách nghệ thuật cũng đã được sử dụng trong chương từ bộ từ điển the Art and Architecture Thesaurus [The Getty Foundation, 2000a].
Trong thử nghiệm này, chúng tôi sử dụng hai nguồn đầu vào khác nhau. Nguồn thông tin đầu tiên này của đầu vào là các kết quả của thử nghiệm đã được mô tả trong chương 2 trong đó phương pháp dư thừa thông tin đã được sử dụng để trích xuất các quan hệ art style - artist đối với 10 phong cách nghệ thuật này. Chúng tôi sử dụng 400 thể hiện quan hệ ứng viên mà đã được đánh giá thủ công cộng với tập nhân với 30 thể hiện. Đối với điểm độ hợp lý chúng tôi sử dụng giá trị trượt (drop value). Đối với 30 thể hiện trong tập nhân, giá trị này là 1.
Đối với nguồn đầu vào thứ 2, chúng tôi sử dụng lại phương pháp trích xuất dựa trên mẫu để tìm ra các thể hiện quan hệ. Mô tả chi tiết hơn của thử nghiệm này có thể xem trong chương 4. Một kho văn bản với 5000 tài liệu đã được trích xuất bằng việc truy vấn máy tìm kiếm Google với 10 phong cách nghệ thuật hiện đại đó. Kho văn bản này đã được load vào công cụ tOKo cùng với các lớp đích và các thể hiện của nó. Hai mẫu tOKo đã được sử dụng để trích xuất các thể hiện quan hệ ứng viên là:
1 { [painte r; disj ] ∧ <word; capt> }{|_} ...20 [style; disj]
2 [style; disj]{|_} ...20 {[painter; disj] ∧ <word; capt>}
Đối với mục đích đánh giá, chúng tôi chỉ bổ sung các ứng viên với tần suất xuất hiện tOKo > 4, bởi vì các ứng viên này đã được đánh giá đối với các thử nghiệm trong chương 4 cộng thêm các thể hiện quan hệ ứng viên này đã được đánh giá bằng phương pháp dư thừa thông tin với nguồn đầu vào. Tổng cộng, nguồn đầu vào này cung cấp 243 thể hiện quan hệ ứng viên đã được đánh giá.
Phương pháp dư thừa thông tin và các nguồn đầu vào dựa trên công cụ tOKo cung cấp tất cả các thể hiện quan hệ ứng viên. Trong tổng số 528 thể hiện quan hệ ứng viên đã được sử dụng như là đầu vào. Nếu đối với một ứng viên được phát hiện với phương pháp dư thừa thông tin, không có thể hiện nào đã được tìm thấy bằng cách sử dụng mẫu tOKo, thì kết quả là tần suất xuất hiện tOKo được đặt là 0 và ngược lại. Điều này tạo ra một tập lớn số lượng các "giá trị bị mất" mà có thể tiềm năng gây ra vấn đề đối với phân lớp cuối cùng như là một giá trị của 0 bây giờ cũng có thể là nguyên nhân bởi việc không xuất hiện đồng thời của một thể hiện quan hệ ứng viên trong một trong kho văn bản hoặc do công việc so khớp (mathching) lỗi.
Trong số 528 ứng viên có 309 ứng viên là chính xác, tương ứng với giá trị độ chính xác là 0,59. Một lần nữa, với các khai phá tiếp theo, chúng tôi giả sử rằng đây là giá trị sai số tối đa, bởi vậy ở đây chúng tôi đặt sai số là 1,00, kết quả cho giá trị độ đo F là 0,74 đối với tập đầu vào chưa lọc.
Ở đây, chúng tôi sử dụng một nguồn lọc dưới dạng của một luật tri thức nền thời gian. Danh sách ULAN năm sinh và năm mất đối với các nghệ sĩ. Đối với các thời kỳ tồn tại của 10 phong cách nghệ thuật chúng tôi sử dụng cùng năm bắt đầu và năm kết thúc để đánh giá việc trích xuất theo khoảng thời gian trong miền này như đã mô tả trong chương 3. Đối với mỗi thể hiện quan hệ ứng viên, khoảng cách giữa các khoảng thời gian liên quan được tính toán, với một giá trị là 0 đối với các ứng viên mà khoảng thời gian của chúng là chồng xếp lên nhau.
5.4.2.2 Kết quả
Xác suất trung bình. Việc chuẩn hóa tần suất xuất hiện tOKo và khoảng cách thời gian đã được thực hiện cùng cách với thử nghiệm trước. Bởi vì giá trị trượt (drop value) của phương pháp dư thừa đã được định nghĩa trong khoảng 0 đến 1, xác suất chuẩn hóa nàytương đương với yếu tố trượt này (drop factor). Chúng tôi thực hiện việc đo lại ảnh hưởng đến hiệu năng của một loạt ngưỡng giá trị trung bình của các sác xuất này. Đối với giá trị ngưỡng nằm trong khoảng từ 0,14 đến 0,40, giá trị độ đo F cao hơn so với tập các thể hiện quan hệ ứng viên chưa lọc. Với giá trị mặc định là 0,50, độ chính xác cao hơn với giá trị ngưỡng là 0,77 nhưng giá trị độ đo F trong cả hai trường hơp đều thấp hơn so với các giá trị ngưỡng 0,45 và 0,57 một cách tương ứng. Giá trị ngưỡng mặc định trong trường hợp này là giới hạn nhiều và không đưa ra tập các thể hiện quan hệ tốt hơn, khi được đo bởi giá trị độ đo F. Độ đo F tối đa đạt được tại giá trị ngưỡng 0,36. Ở đây, 568 ứng viên đã được giữ lại, 355 trong số này là chính xác, dẫn đến kết quả là giá trị độ đo F đạt 0,76.

This is only a small increase. This is caused by the large number of ’missing values’ for the tOKo and Redundancy method scores. A lot of relation instances receive a high probability for the one source, while the other source provides a 0. This is the case if a relation is not found in one corpus, but is found in the other.
Voting. For the voting-based information integration method, we now have three threshold parameters: TRM on the redundancy method’s drop factor score (higher is better), Ttoko on the tOKo frequency (higher is better) and Ttime on the temporal difference (lower is better). We again first discuss the individual effects of these threshold values.
For both TRM and Ttoko , a higher value results in an increase in precision, while recall deteriorates. The optimal F-measure for both threshold parameters is achieved at a value of 0, when all 528 candidates receive likelihoods of 1. Filtering out candidate relation instances based on time differences does increase the F-measure and the best performance is achieved at Ttime = 0, where the highest possible amount of candidates are filtered out. With these parameter settings, two of the three sources assign a likelihood of 1 to all candidates. This has as a consequence that the voting settings where one or two out of the three sources are required to provide a likelihood of 1 for a candidate result in the same performance, namely that of the baseline. When three out of the three sources are required to produce a 1, only 93 candidates remain, 74 of which are correct, resulting in a the precision of 0.80, a recall of 0.24 and a F-measure of 0.37.

Table 36: Performance of the different combination methods on the art style - artist relations. For the average probabilities integration method, the optimal threshold value is 0.36. For voting method, we show the results of varying the minimum number of "1" likelihoods for the threshold parameters TRM = 0.17, Ttoko = 1 and Ttime = 0. Shown is the number of retained candidates, the number of correct candidates and the resulting precision, recall and F-measures.

Better results with the voting method are obtained when slightly higher values for the threshold parameters are chosen and some candidates actually receive a likelihood of 0. The best performance is achieved at TRM = 0.17, Ttoko = 1 and Ttime = 0. Table 36 shows the performance for different voting settings using these parameter settings.
Trained classification models. We again also learned the two models on this data. The results of these two learned classifiers is also shown in Table 36. Both models achieve a very slight improvement with respect to the unfiltered set.
In this experiment, the gain in F-measure of using multiple information sources is relatively low for all combination methods and settings and for some settings such as the default threshold value for the average probabilities method, the F-measure is lower than for the unfiltered set. The highest found F-measure is achieved by discarding a lot of candidate relation instances based on the combination of their Redundancy Method drop factor and the Time differences.
One explanation for this lack in improvement is the ’missing values’ problem discussed earlier. A second problematic property of this data set is that most of the false candidate relation instances are between art styles and artists that actually occurred in the same time periods, since they occur in the same ’modern art’ corpus. This big overlap in time periods for the ten art styles has as a consequence that time difference is not as good a classifying feature as it could be. In the experiment described in Section 5.4.3, we use different art styles with less overlap in time periods.
A third problem is that the 430 evaluated candidate relation instances that come from the Redundancy Method already have a relatively good precision and the incorrect instances are relatively hard to identify by either the tOKo frequency or the Time difference. Even so, we can achieve some increase in performance with a number of information integration methods.
5.4.3 Regional Artists
With the third experiment, we show the workings of the two remaining background knowledge rules as we combine information from five different sources: the tOKo-based Information Extraction method as the only input source, the Normalized Google Distance and the Temporal, Spatial and Association background knowledge rules as filter sources. With this experiment we show how much the performance can be increased if all these sources fit the task.
As in the previous experiment, the target relation is [Art Style, has_artist, Artist]. As our input information source, we again use the pattern-based relation extraction method. We also use Normalized Google Distance [Cilibrasi and Vitanyi, 2004] between the relation’s subject and object as a filter source and we use three types of background knowledge as additional filter sources: the Temporal, Spatial and Association Background Rules. The temporal and spatial information about the art styles and artists is provided by the knowledge base, corresponding to case 2 as described in Section 5.2. The association information is provided by an Information Extraction module, corresponding to case 3.
5.4.3.1 Setup
For this experiment, we selected a subset of the AAT tree containing European regional art styles, making it possible to use the spatial rule from Section 5.3.1.2. From this subtree we selected 10 art styles. Information about the regions associated with the styles is present in the AAT, we manually normalized these to at least one nationality. We also extracted the time period (in start year-end year format) for these art styles from the AAT scope notes or by extracting them from the Web. The ten art styles were selected such that there was some variation in their associated time periods. In Table 37, we show the ten art styles along with their associated regions and time period.
We use only one input source for this task. This is again the tOKo candidate relation extraction from Chapter 4. For each of the ten art styles, we created a Google query consisting of the style label (shown in Table 37 and the disambiguating term "art style". These queries were sent to the Google search engine and for each style, the first 100 retrieved pages were downloaded. After filtering out error-producing web pages, this resulted in a corpus of 795 documents which was loaded into tOKo. The ten regional art styles, instances of the Art Style class, were also loaded into the tool.
For the candidate artists, we loaded the ULAN [The Getty Foundation, 2000c] painters into the tool. As we did in Chapter 4, we first determined the subset of these painters that occurred anywhere in our corpus with their full name. This resulted in a subset of 1066 artists. For these artist instances we added an extra label consisting of only the last name. This was done to increase the recall when searching for the relation instances in the corpus. We then used the same two tOKo queries as described in the previous section to extract the candidate relation instances. A total of 122 separate candidate relation instances were found, occurring with frequencies ranging from 210 to 1.

Table 37: The ten regional art styles, the associated region and the time period as used in the artist-art style population experiment.

We use four filter sources: NGD, time difference, spatial background knowledge and the association background knowledge. For the first two sources, we extract likelihood scores in the same manner as described in the previous experiment, we here only elaborate on the latter two.
Spatial background knowledge. For the spatial rule from Section 5.3.1.2, we use the information from Table 37 as the locations of the art styles. We retrieved the artist’s nationalities from the ULAN. We here use a very simple distance measure: the distance is set to 1 if the two nationalities are the same, if one is a subset of the other or if the two countries share borders.
Association background knowledge. For the Association rule from Section 5.3.1.3, we here use extracted information (corresponding to case 3 from Section 5.2. This information consists of the associations between artists extracted in the same way as we have described in Chapter 4. For this, we searched for the following pattern in our corpus:
{ [painte r; disj ] ∧ <word; capt> }{|_} ...10 and ...10 { [painter; disj ] ∧<word; capt>}
This resulted in a list of related artists with an associated frequency corresponding to the strength of the association. Every candidate relation instance received an associated frequency score which is the sum of the tOKo frequencies of other candidate relation instances containing the same art style and an associated artist. In this case, all extracted information is considered to be ’true’ and we do not use information about the strength or probability of the association. However, the associated frequency score is still a good indication of the correctness of a candidate relation instance.

Table 38: The first thirteen candidate relation instances, ordered by tOKo frequency. Listed are the evaluation result and the values for each of the five information sources.

We manually evaluated all 122 candidates. Of these candidates 35 were evaluated as correct, resulting in a unfiltered precision of 0.29. Again, we set maximum recall to 35 correct instances, so here recall is 1.00. This corresponds to a F-measure of 0.45. For illustration, Table 38 shows a subset of the 122 candidate relation instances. The first thirteen instances, when ordered by their tOKo frequency are shown. The table shows that some of the errors are matching errors ("Frank Young" and "Young Poland", while other incorrect candidates are more complex (S. Jaffe is not a member of De Stijl, although she painted in related art styles).
5.4.3.2 Results
Average probabilities. We normalized the tOKo, NGD and time difference scores as before. For the place value, no normalization is required since the values are already between 0 and 1. For the association, we used the logarithmic normalization since it is based on term frequencies. After averaging, the a posteriori probability for the first (incorrect) relation instance in Table 38 is 0.63, whereas the probability for the second (correct) instance is 0.88.
Threshold values between 0 and 0.68 produce a gain in F-measure, which is a much greater interval than in the other two experiments. A threshold value of 0.50 produces a F-measure of 0.72, which is a significant increase compared to the unfiltered set. The highest F-measure is achieved for a threshold of 0.56. Here 36 candidates are retained, 28 of which are correct, resulting in a F-measure of 0.79. This is a very large increase when compared to the unfiltered set.
Voting. We again first discuss the individual effects of using the each of the information sources separately. For each of the information sources, we use a threshold parameter to produce a likelihood of either 1 or 0. In Table 39, we show the best performing threshold parameter values for the five information sources. The table shows that using each of the information sources significantly boosts precision at the expense of some recall of the retained set of candidate relation instances and that the overall quality as indicated by the F-measure increases. The largest improvement is achieved by the spatial background knowledge rule, raising the F-measure from 0.45 to 0.64.

Table 39: Performance measures for the baseline and the effect on the performance of using each of the information sources as a filter in the regional art experiment.

Table 40: Performance measures for the baseline and the effect on the performance of using different voting thresholds. The threshold parameter values for the individual sources are Ttoko = 5, TNGD = 0.58, Ttime = 0, Tplace = 1 , Tassoc = 1.

We combined the information of all sources described above using the voting procedure. For each candidate relation instance, the threshold values of Table 39 are used. We again explore the effects of different voting settings (minimum amount of 1-likelihoods to retain a candidate) on the performance. Since there are 5 information sources, a threshold of 5 corresponds to a unanimous voting policy, while a threshold of 3 corresponds to a majority vote. Table 40 shows the effect of the various voting thresholds on the quality of the remaining candidate relation instances when all tOKo results are considered.

Figure 22: Decision Tree best fitting the regional art data using the J48 algorithm with default settings as implemented in the WEKA toolkit. Leaf nodes show the number of correctly and incorrectly classified instances respectively between parentheses.

Table 40 show that voting thresholds of 2 and 3 produce a significant increase in performance. When considering all 122 candidates, a value of 3 results in the highest gain. For that value, 86 candidate triples are discarded, 6 of which are evaluated as correct. This results in a F-measure score of 0.82, which is almost twice the F-measure score for the baseline. A voting threshold value of 2 provides a less strict filtering, removing 50 triples, only one of which was evaluated as correct. The lower the voting threshold value, the more recall is favored over precision. Depending on the specific type of application or post-processing the desired point on the precision/recall tradeoff can be chosen using this voting threshold parameter.
Trained classification models. We again also learned a Decision Tree and a Naive Bayes classifier on this data using 10-fold cross validation. The best classifying decision tree is considerably more complex than the previously learned trees with a depth of 4. It uses all types of information to classify the candidate instances. We show the resulting decision tree in Figure 22. The best classifying Naive Bayes classifier also uses all information sources to determine the probability of the evaluation class. The performance of both learned models is also shown in Table 40.
In this experiment we have seen that using the separate information as filters can boost performance up to a F-measure of 0.64. Using multiple information sources can drastically improve the performance, to a maximum F-measure of 0.82, almost double that of the baseline. By using the voting procedure, one can also choose to retain a high recall, but reduce the total number of false positives. In a scenario where extraction results are filtered out by hand, this can significantly reduce the workload.
5. 5 Related work
The idea of combining multiple information sources to improve Information Extraction is not new. For example, in the Armadillo system described by Ciravegna et al. [2004], target information is collected from different web sites using various wrappers. This results in multiple extractions of the same instances with minor variations. Armadillo integrates this information to identify these identical entities and merges them. Spelling variants, for example, are mapped onto the same entity. The fact that an entity is discovered using different extraction methods or the fact that the entity is extracted from multiple web sites is additional evidence that the extraction is correct. In our proposed method, we combine not only Information Extraction results about a single entity, but also combine it with background knowledge found in the target ontology.
Korst et al. [2006] describe how filtering results using background knowledge improves Ontology Population results. To filter candidate person names, additional Google queries are generated that check whether or not a candidate term is indeed a name. In our system, we use actual ontological information about relations through the background knowledge rules.
In the OntoSyphon system developed by McDowell and Cafarella [2006], Information Extraction queries are directly derived from an input ontology. The results of these queries, retrieved from various places in the web are then combined, together with information from the ontology, to automatically verify these candidate instances and relations. Here, ontological background information is directly used as input for the specific Information Extraction procedure. In our approach, we use background knowledge as a filter on results of some unspecified Information Extraction method.
Different methods and systems exist that use ontological background knowledge to improve Information Extraction. One method is to enhance extraction patterns with other terms extracted from Wordnet hyponym relations [Fellbaum, 1998] to improve pattern recall.
The method used by Cimiano et al. [2004] combines evidence from heterogeneous sources for learning subsumption relations. Here, multiple heterogeneous sources are used to gather evidence for a instances of a subsumption relation. One of the Information Extraction methods used is for example using Hearst patterns [Hearst, 1992] on a textual corpus. The system also uses hyponym information from Wordnet as evidence for a subsumption relation instance. The evidence for the target subsumption instances are combined to calculate a final likelihood for the instance. In Cimiano et al. [2004] method, structured background knowledge (in this case from Wordnet) is used to improve Information Extraction. However, the method is only defined for subsumption relations whereas the method here describes rules for using more domain-specific information to improve identifying the domain-specific relation instances.
5. 6 Conclusions
In this chapter, we have presented a method for combining information from multiple Information Extraction sources with background knowledge about a target relation instance. We identify two types of information sources: input sources and filter sources. The background knowledge is combined with general rules to produce likelihood scores for candidate relation instances and is therefore a filter source. In the information integration step, the different likelihood scores are combined through one of three methods to derive whether a candidate relation instance is correct or incorrect: unweighted average probabilities, a voting procedure and learning classification models. We have done three separate experiments in which we show the effect of combining information from different sources on the quality of the set of candidate relation instances.
We here use simple rules to generate likelihoods out of background information directly related to the subject and/or the object of the target related. Although a lot of background information can be exploited using this method, as we show in the experiments, this format is not the only way in which background information could be used. For instance, by cascading multiple rules, information that is located further away from the target relation could be exploited. Also, more complex combinations or comparisons of different types of information could lead to additional beneficial information. In this Chapter, we do not claim that our rules are complete but do show that through the use of these simple and general rules we can already exploit a good part of the available information.
When using average probabilities, no parameters have to be set other than the final threshold on the a posteriori probability. It is therefore a very useful method when one has no knowledge about the amount that the different sources should contribute to the final classification. In all three experiments, we see that a well chosen value for this parameter raises the performance, although in the second experiment only by a small amount. The default value of 0.50 increases the overall performance in the first and third experiment but is too high for the second experiment as it produces a drop in F-measure.
If we take for a threshold value the average of the F-measures of the three experiments as an indicator for the overall quality, the best value is 0.42. Using this value the F-measures for the Roman Generals, Modern Art Styles and Regional Art styles are 0.73, 0.72 and 0.68 respectively, averaging at 0.71. Further experiments could be done to determine whether this optimal value is somewhat robust, but the value of 0.42 is a good starting point for this parameter.
With the voting procedure, we have shown the effects of using the individual information sources as filters on the data and compared this to the effect of using multiple information sources. For all three tasks, we saw that using individual information sources raises the performance, but that by combining information from multiple sources, even higher boosts in performance can be obtained.
The best performance gain is reported for the third experiment, where we extract the relation between regional art styles and artists. Here the various types of background information and Information Extraction modules provide a drastic performance boost, more than doubling the F-measure. This is caused by the specifics of the task (both time, place and association information have a significant effect) and by the fact that the baseline F-measure was relatively low, leaving much room for improvement.
We have seen some overlap in the optimal threshold values for the different infor mation sources across the different experiments. This indicates that these optimal values are somewhat robust across domains. For the Temporal and Spatial background knowledge rules, the extreme val ues of 0 and 1 can be used respectively. The threshold on the tOKo input source varies from 1 to 5, making it difficult to give an indication of what value should be used. The same holds for NGD and the redundancy method drop factor.
The choice of information integration method determines the amount of control that one has over the outcome. Using the voting method, one can choose a setting on the precision/recall tradeoff spectrum by setting the individual thresholds. In the experiments above, we trained the J48 Decision Tree and the Naive Bayes classifier directly on the manually evaluated data using 10-fold cross validation. A decision tree or Naive Bayes classifier can be built by hand. In this case, the decision tree node characteristics or the probability distributions will have to be determined beforehand. Alternatively, if a subset of the candidate relation instances are evaluated, the models can be learned on this subset. The results from the first and third experiment show that using this method, the overall performance can be increased without setting additional parameters.
For the second and third experiment, a lot of the performance gain stems from the fact that missing values for one information source are indeed found by other sources. Candidate instances that are for instance not found by the Redundancy Method because of faulty matching are indeed found by the tOKo pattern-based method. The background knowledge rules can then provide additional evidence for that candidate. For dealing with missing values, the voting method works better than the average probability method. In the latter, the ’0’ likelihood for the missing value is always taken into account when determining the final likelihood. For the (non-unanimous) voting method, a missing value can be completely ignored, if a sufficient amount of information sources give it a high enough likelihood.
The results from the experiments show that combining different information sources indeed improves the overall performance of the Ontology Enrichment task. This is especially clear in the experiment described in Section 5.4.3, where five information sources are combined. Since each of these sources provides different type of information, errors (both false positives and false negatives) are not reproduced across the various sources. At the same time the probability of the true positives remains high indicating that on average, the individual methods all give high scores to the instances of the target relation. The combination of information sources that on the one hand extract the same target class and on the other hand have sufficiently different biases indeed results in a better performance.