Google releases 'Wikilinks' corpus to interpret human language

published 10 March 2013

Google has announced the release of an extremely large dataset, the Wikilinks Corpus, which is intended to help developers build software that can understand human language.

The collection contains a total of 40 million disambiguated 'mentions' across 10 million web pages, making it around 100 times bigger than the next-largest corpus of its kind.

The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the page it points to.
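As a rough illustration of that matching step, here is a minimal Python sketch. It is not Google's actual pipeline: the function names, the use of `difflib` for similarity and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def wikipedia_title(url):
    """Return the page title encoded in a Wikipedia URL, or None."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    if not is_wikipedia or not parsed.path.startswith("/wiki/"):
        return None
    return unquote(parsed.path[len("/wiki/"):]).replace("_", " ")


def find_mentions(html, threshold=0.8):
    """Return (anchor text, Wikipedia title) pairs that closely match.

    The threshold is an assumed value chosen for illustration only.
    """
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    mentions = []
    for href, anchor in collector.links:
        title = wikipedia_title(href)
        if title is None or not anchor:
            continue
        score = SequenceMatcher(None, anchor.lower(), title.lower()).ratio()
        if score >= threshold:
            mentions.append((anchor, title))
    return mentions
```

Given a page that links the words "Apple Inc" to the Wikipedia article "Apple Inc." and the words "this fruit" to the article "Banana", the sketch would keep the first link as a mention and discard the second, because only the first anchor text resembles the title of its target.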

A post on Google's official research blog says that when it comes to deciding which meaning an ambiguous word refers to, "humans are amazingly good at it (when was the last time you confused a fruit and a giant tech company?), computers need help."

The release is a step in the right direction, giving developers a resource for creating programs that can distinguish between the different meanings of an ambiguous word.

With help from researchers at the University of Massachusetts Amherst, Google has produced a significantly larger dataset than its predecessors, and it's free.

Although the actual annotated web pages won't be made available for copyright reasons, an index of URLs and the tools needed to recreate the dataset will be provided.

Source: Google Research Blog
