The think tank TAUS now offers a new technique, named Matching Data, for selecting datasets to train machine translation systems. It is the result of a multi-year STW project on machine translation, supervised by Khalil Sima'an of the University of Amsterdam.
Computer-based translation services, of which Google Translate is a well-known example, are getting better and better as their algorithms become increasingly sophisticated. One of the challenges facing the machine translation industry is deciding which data sources to use. A good translation requires training the machine on reliable sources and on datasets that contain the relevant type of words. Translating a policy document or legal text, for example, requires a completely different vocabulary and a different type of translation than a newspaper report.
STW
In 2013, a project called DatAptor, led by Professor Khalil Sima'an of the UvA Institute for Logic, Language and Computation, received a major grant from technology foundation STW to tackle this problem, and it did so successfully. The research results of the DatAptor project have now been implemented by TAUS, an important think tank in the field of machine translation, which offers the new technology under the name Matching Data.
On the TAUS weblog, Sima'an says: “Our dream was to make the world wide web itself the source of all data selections. But we decided to start more modest and make the very large TAUS Data repository our hunting field first. In DatAptor we learned that every domain is a mixture of many subdomains. The combinatorics of subdomains in a very large repository harbors a wealth of new, untapped selections. Therefore, if the user provides a Query corpus representing their domain of interest, the Matching Data method is likely to find a suitable selection in the repository.”
This release was first published on 16 January 2019 by the University of Amsterdam.