[ CogSci Summaries home | UP | email ]

Ponzetto, S.P., Strube, M. (2007). Knowledge Derived From Wikipedia For Computing Semantic Relatedness. Journal of Artificial Intelligence Research, 30, 181-212.

@Article{Ponzetto2007,
  author = 	 {Ponzetto, Simone Paolo and Strube, Michael},
  title = 	 {Knowledge Derived From Wikipedia For Computing Semantic Relatedness},
  journal = 	 {Journal of Artificial Intelligence Research},
  year = 	 {2007},
  volume = 	 {30},
  pages = 	 {181--212}
}

Author of the summary: Craig J. Greenberg, 2007, cgreenbe@connect.carleton.ca

Cite this paper for:

The actual paper can be found at http://www.jair.org/media/2308/live-2308-3485-jair.pdf

Natural language processing has usually involved statistical techniques, but real-world general knowledge is needed to progress further. Wikipedia may be a good source of such knowledge.

The quality of Wikipedia for this purpose is assessed in two ways: semantic relatedness judgements and coreference resolution. Performance is compared to human judgements on the same tasks.

Wikipedia articles are structured by a network of category relationships. Several page types model linguistic relations, such as: [p182]

  1. Redirect pages: point alternate expressions to the same article; model synonymy
  2. Disambiguation pages: list the possible articles a polysemous expression could refer to; model homonymy
  3. Internal links: links from one article to another; model cross-reference

One problem with using Wikipedia is its vast category depth, branching, and multiple inheritance relationships.

Calculating relatedness can be done in a number of ways. Some systems perform better or worse depending on which measure of relatedness is used. For two given words, the measures used here are the following [p185]:

  1. Path-Based Measures: Relatedness is inversely proportional to the number of edges on the shortest path between the two words in the category network.
  2. Information Content Measures: Relatedness is based on how much information is carried by the lowest superordinate in the category network common to both words.
  3. Text Overlap Measures: Relatedness is based on how much text is shared between the "gloss" (brief definition) of each of the two words.
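As a rough illustration of the first and third measures (not the exact formulas from the paper), the following sketch computes a path-based score over a toy category graph and a gloss-overlap score over two made-up definitions. The graph, the glosses, and the 1/(1+d) scaling are all illustrative assumptions:

```python
from collections import deque

def shortest_path_length(graph, start, goal):
    """Breadth-first search over an undirected graph; returns the
    number of edges on the shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr == goal:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

def path_relatedness(graph, a, b):
    """Path-based measure: inversely proportional to path length."""
    d = shortest_path_length(graph, a, b)
    return 0.0 if d is None else 1.0 / (1 + d)

def gloss_overlap(gloss_a, gloss_b):
    """Text-overlap measure: count of words shared by the two glosses."""
    return len(set(gloss_a.lower().split()) & set(gloss_b.lower().split()))

# Toy category network (undirected adjacency list)
graph = {
    "car":      ["vehicle"],
    "truck":    ["vehicle"],
    "vehicle":  ["car", "truck", "machine"],
    "machine":  ["vehicle", "computer"],
    "computer": ["machine"],
}
```

Here `path_relatedness(graph, "car", "truck")` yields 1/3, since the shortest path has two edges.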

Because many expressions yield disambiguation pages when searched in Wikipedia, a method is required to decide which entry to use. The method used in these experiments was (roughly) to compare the entries listed on each word's disambiguation page and, if any word appeared on both pages, to use that entry; otherwise, the first listed entry was used. [p186]
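A minimal sketch of that heuristic, assuming each sense is represented by its article title; the sense lists are hypothetical and the word-overlap test is a simplification of the paper's method:

```python
def pick_senses(senses_a, senses_b):
    """Rough disambiguation heuristic: prefer a pair of sense titles
    that share a word; otherwise fall back to the first sense listed
    on each disambiguation page."""
    for sa in senses_a:
        words_a = set(sa.lower().split())
        for sb in senses_b:
            if words_a & set(sb.lower().split()):
                return sa, sb
    return senses_a[0], senses_b[0]

# Hypothetical disambiguation-page listings for "king" and "queen"
kings  = ["King (monarch)", "King (chess)"]
queens = ["Queen (chess)", "Queen (band)"]
```

With these listings, the shared word "(chess)" selects the chess senses of both words rather than the first-listed entries.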

Once the disambiguated pages are found, the relatedness measures can be applied. A depth-limited search of depth 4 is performed to find the category superordinate closest to both words. Performance was improved by limiting the depth and by searching only categories deeper (more specific) than the 2nd level in the network; otherwise, all words are marginally related simply because they all descend from the root category [p186].
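The depth-limited search can be sketched as an upward breadth-first walk from each article through its category links, intersecting the two ancestor sets and skipping categories that are too general. The toy hierarchy and the pruning rule (an explicit set of excluded near-root categories) are illustrative assumptions:

```python
from collections import deque

def ancestors(parents, start, max_depth=4):
    """BFS upward through category links; maps category -> minimum depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        d = depths[node]
        if d == max_depth:
            continue  # depth limit: do not expand further
        for p in parents.get(node, ()):
            if p not in depths:
                depths[p] = d + 1
                queue.append(p)
    return depths

def closest_subsumer(parents, a, b, max_depth=4, too_general=frozenset()):
    """Shared ancestor minimizing combined depth, ignoring categories
    that are too close to the root to be informative."""
    da = ancestors(parents, a, max_depth)
    db = ancestors(parents, b, max_depth)
    shared = (set(da) & set(db)) - too_general
    if not shared:
        return None
    return min(shared, key=lambda c: da[c] + db[c])

# Toy category hierarchy (child -> list of parent categories)
parents = {
    "car":      ["vehicle"],
    "truck":    ["vehicle"],
    "vehicle":  ["artifact"],
    "artifact": ["entity"],
}
```

For "car" and "truck", the closest informative subsumer is "vehicle" (combined depth 2), while the near-root "entity" is excluded.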

The number of overlapping hits on Google was used as a baseline against which to compare WordNet and Wikipedia. Performance was evaluated with the Pearson correlation between human relatedness judgements and the scores produced by the WordNet- and Wikipedia-based algorithms. Both WordNet and Wikipedia outperformed Google, but did not differ significantly from each other.
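The evaluation step can be sketched as a plain Pearson correlation between the two score lists; the human and system scores below are made-up illustrations, not data from the paper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical human judgements vs. system scores for five word pairs
human  = [3.9, 1.2, 2.8, 0.5, 3.1]
system = [0.8, 0.3, 0.6, 0.1, 0.7]
```

A correlation near 1.0 indicates the system ranks pairs much as humans do, even though the two scales differ.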

WordNet's specific failing is "sense proliferation": it considers all possible sense combinations of two polysemous words and uses the pair with the shortest path, even when that pairing is not semantically appropriate. By contrast, the Wikipedia search algorithm disambiguates first, then finds the shortest path between the two chosen senses. [p190]

The particular datasets on which Wikipedia outperformed WordNet were those designed with semantic relatedness in mind, rather than just semantic similarity, which makes sense given the algorithmic differences described above. Because of the small word sets used, the authors are not convinced that the comparison is fair. [p191]

A more realistic natural language task is judging whether terms are co-referring.

Using a wide array of semantic features, as well as the relatedness measures from the previous task, the accuracy of WordNet's and Wikipedia's coreference judgements was analyzed. WordNet relies mainly on surface lexical features, while Wikipedia emphasizes semantic features. They performed approximately equally, "which indicates the usefulness of using an encyclopedic knowledge base as a replacement for a lexical taxonomy." [p200]

Additionally, semantic features are more useful for judging nouns, while surface features "such as string matching and alias suffice" for proper nouns [p200]. However, Wikipedia was still better for proper nouns, because "Wikipedia contains a larger amount of information about named entities than WordNet" [p200].

There is a lack of well-structured lexical databases for non-English languages. Because Wikipedia's article translations preserve the semantic category structure of the links, it can readily be used for relatedness judgements in other languages as well. It performed as well as an existing German lexical database on relatedness judgements.

Summary author's notes:
  • Page numbers are from the original periodical publication.

  • Back to the Cognitive Science Summaries homepage
    Cognitive Science Summaries Webmaster:
    Jim Davies (jim@jimdavies.org)