Google has announced the release of an extremely large dataset, the Wikilinks Corpus, which will help developers build software that can understand human language.
The collection contains a total of 40 million disambiguated 'mentions' across more than 10 million web pages, making it 100 times bigger than the previous largest corpus.
The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target page.
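As a rough illustration of that matching step, here is a minimal Python sketch. It is not Google's actual pipeline: the `WikipediaLinkParser` class, the `find_mentions` function, the use of `difflib` for string similarity and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikipediaLinkParser(HTMLParser):
    """Collects (anchor text, target title) pairs for links to Wikipedia."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (anchor_text, title) pairs
        self._title = None   # title of the Wikipedia link currently open
        self._text = []      # pieces of anchor text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        url = urlparse(dict(attrs).get("href") or "")
        host = url.netloc.lower()
        is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
        if is_wikipedia and url.path.startswith("/wiki/"):
            # Turn "/wiki/Apple_Inc." into the page title "Apple Inc."
            self._title = unquote(url.path[len("/wiki/"):]).replace("_", " ")
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._title is not None:
            self.links.append(("".join(self._text).strip(), self._title))
            self._title = None


def find_mentions(html, threshold=0.8):
    """Return links whose anchor text closely matches the target page title.

    The similarity measure and threshold are illustrative assumptions.
    """
    parser = WikipediaLinkParser()
    parser.feed(html)
    parser.close()
    mentions = []
    for anchor, title in parser.links:
        score = SequenceMatcher(None, anchor.lower(), title.lower()).ratio()
        if score >= threshold:
            mentions.append((anchor, title))
    return mentions
```

Under these assumptions, a link to the "Apple Inc." page with the anchor text "Apple Inc" would count as a mention, while a link to the "Apple" page with the anchor text "click here" would not.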
A post on Google's official research blog says that, when it comes to deciding what an ambiguous word refers to, "humans are amazingly good at it (when was the last time you confused a fruit and a giant tech company?), computers need help."
The dataset is a step in the right direction: developers can use the Wikilinks Corpus to create programs that are able to distinguish between meanings.
With help from researchers at the University of Massachusetts Amherst, Google has produced a significantly larger dataset than its predecessors, and it's free.
Although the actual annotated web pages won't be made available for copyright reasons, an index of URLs and the tools to recreate the dataset will be provided.
Source: Google Research Blog
