Google has announced the release of the Wikilinks Corpus, an extremely large dataset intended to help developers build software that can understand human language.
The collection contains a total of 40 million disambiguated 'mentions' across 10 million web pages, making it 100 times bigger than the largest previous corpus.
The mentions are found by looking for links to Wikipedia pages in which the anchor text closely matches the title of the target page.
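As a rough illustration of that matching idea, the sketch below scans a fragment of HTML for links to Wikipedia and keeps those whose anchor text resembles the linked page's title. It is not Google's actual pipeline: the function names (`find_mentions`, `wikipedia_title`), the use of `difflib` as the similarity measure, and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def wikipedia_title(href):
    """Return the page title for a Wikipedia article URL, else None."""
    parsed = urlparse(href)
    host = parsed.netloc.lower()
    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    if not is_wikipedia or not parsed.path.startswith("/wiki/"):
        return None
    return unquote(parsed.path[len("/wiki/"):]).replace("_", " ")


def find_mentions(html, threshold=0.8):
    """Keep links whose anchor text is similar to the target page title.

    The threshold and the similarity measure are illustrative assumptions.
    """
    collector = AnchorCollector()
    collector.feed(html)
    mentions = []
    for href, text in collector.anchors:
        title = wikipedia_title(href)
        if title is None or not text:
            continue
        score = SequenceMatcher(None, text.lower(), title.lower()).ratio()
        if score >= threshold:
            mentions.append((text, title))
    return mentions
```

Calling `find_mentions` on a page returns a list of (anchor text, Wikipedia title) pairs; a link labelled "click here" that points at a Wikipedia article would be dropped because its text does not resemble the title.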
A post on Google's official research blog says that, when it comes to deciding which entity an ambiguous word refers to, while "humans are amazingly good at it (when was the last time you confused a fruit and a giant tech company?), computers need help."
The dataset is a step in the right direction, giving developers a resource for building programs that can distinguish between such meanings.
With help from researchers at the University of Massachusetts Amherst, Google has produced a significantly larger dataset than its predecessors, and it is free.
Although the actual annotated web pages will not be made available for copyright reasons, an index of URLs and the tools to recreate the dataset will be provided.
Source: Google Research Blog