Sunday, January 22, 2017

More Translations, less Noise

There have been a large number of changes to how the WikDict dictionaries are generated. While many of them are related, some are included in this post simply to give you a good summary of what happened over the last few months.

More Translations

Deriving Translations from Intermediate Languages

When a translation is not found in the dictionary, you could give up and tell the user that there is no such translation. Or you could try to use translations between other languages to give a (hopefully accurate) answer. Here's an example: if the word "dog" can't be found in the English-German dictionary, you could try to use French as an intermediate language:

dog (en) -> chien (fr)
chien (fr) -> Hund (de)
=> dog (en) -> Hund (de)

While this is useful, it can generate wrong translations when a word in the intermediate language is ambiguous. WikDict tries to get the best of both worlds by applying a score that depends on several factors to rank and filter the results of this approach.
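To make the idea concrete, here is a minimal Python sketch of deriving translations through an intermediate language. It is not WikDict's actual implementation: the function name `pivot_translations`, the per-translation scores, and the rule of multiplying scores along a path and summing over paths are all assumptions made for illustration, and the example data is made up.

```python
from collections import defaultdict


def pivot_translations(src_to_pivot, pivot_to_tgt, word, min_score=0.0):
    """Derive translations of `word` through an intermediate language.

    Both dictionaries map a word to {translation: score}, where the score
    is an assumed reliability value between 0 and 1.
    """
    scores = defaultdict(float)
    for pivot_word, s1 in src_to_pivot.get(word, {}).items():
        for target, s2 in pivot_to_tgt.get(pivot_word, {}).items():
            # Assumed scoring: multiply along the path, sum over all paths
            # that lead to the same target word.
            scores[target] += s1 * s2
    # Best score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(target, score) for target, score in ranked if score >= min_score]


# Made-up example data (hypothetical scores).
en_fr = {"dog": {"chien": 1.0}, "bank": {"banque": 0.9, "rive": 0.6}}
fr_de = {"chien": {"Hund": 1.0}, "banque": {"Bank": 0.9}, "rive": {"Ufer": 0.8}}

print(pivot_translations(en_fr, fr_de, "dog"))
print(pivot_translations(en_fr, fr_de, "bank", min_score=0.5))
```

With this toy data, "dog" resolves to "Hund", while the ambiguous "bank" yields both "Bank" and "Ufer". Raising `min_score` is one way to filter out the weaker candidate.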

Bug Fixes and Workarounds

Some bugs, especially one in the Virtuoso database, made it necessary to skip some translations. The known bugs have now either been fixed or worked around.

More Recent Data

As always, the people working on Wiktionary and DBnary aren't lazy, either. Their changes trickle down to WikDict with some delay and lead to visible improvements over time.

Less Noise

More Filtering and Better Sorting

The scoring mentioned above has also been used to improve the sorting of words, senses and translations, as well as to filter some less reliable results introduced when reading dictionaries in reverse.

Less Unparsed Markup in Senses

When senses/definitions for words are extracted from Wiktionary, quite a lot of different markup can be left inside those strings. WikDict got better at parsing those texts, so you will see fewer [[brackets]], <tags> and [1] left 1. over | numbers : or symbols than before. If you still see those, please let me know.
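As a rough illustration of this kind of cleanup, the following Python sketch strips a few common leftovers from a sense string. The post does not describe how WikDict actually parses these texts, so the function name `clean_sense` and the three regular expressions are assumptions, and real Wiktionary markup has many more cases than these.

```python
import re


def clean_sense(text):
    """Remove some common wiki markup leftovers from a definition string."""
    # [[target|label]] -> label, [[word]] -> word
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop HTML-like tags such as <i> or </sup>, keeping their content.
    text = re.sub(r"<[^>]+>", "", text)
    # Drop numeric reference markers such as [1].
    text = re.sub(r"\[\d+\]", "", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(clean_sense("A [[domestic animal|domesticated]] <i>canine</i> [1]"))
```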