Saturday, November 11, 2017

Now with 22% more translations!

WikDict builds on data extracted by the dbnary project. That project changed the way it stores data, which required adaptations on the WikDict side. This is why WikDict data has not been updated for the last few months.
Now this work is finally done and new data is available in the web interface. It includes all changes made to the underlying Wiktionaries as well as fixes for bugs that prevented some translations from showing up properly. Overall this yields 22% more translations than the previous data from March 2017. As always, please let me know about any problems you encounter or suggestions for improvement.

Sunday, February 19, 2017

Get translations while typing

Typeahead autocompletion is very helpful when the word you are looking for is long or hard to type. It gets even better when the translation is shown along with the autocompletion. This is now available on WikDict.

As always, feedback is very welcome!

Sunday, January 22, 2017

More Translations, Less Noise

There have been a large number of changes to the generation of the WikDict dictionaries. While many of them are related, some are included in this post just to give you a good summary of what happened in the last months.

More Translations

Deriving Translations from Intermediate Languages

When a translation is not found in the dictionary, you could give up and tell the user that there is no such translation. Or you could try to use translations between other languages to give a (hopefully accurate) answer. Here's an example. Let's say the word "dog" can't be found in the English-German dictionary. You could then try to use French as an intermediate language:

dog (en) -> chien (fr)
chien (fr) -> Hund (de)
=> dog (en) -> Hund (de)

While this is useful, it can generate wrong translations due to ambiguities. WikDict tries to get the best of both worlds by applying a score that depends on several factors to rank and filter the results of this approach.
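
The idea can be sketched in a few lines of Python. This is only an illustration, not the actual WikDict code: the toy dictionaries, the function name derive_translations and the scoring rule (counting how many intermediate languages agree on a candidate) are all assumptions made for this example. The real scoring takes more factors into account.

```python
from collections import defaultdict

# Toy dictionaries, invented for this example (not real WikDict data).
EN_FR = {"dog": ["chien"]}
FR_DE = {"chien": ["Hund"]}
EN_ES = {"dog": ["perro"]}
ES_DE = {"perro": ["Hund", "Rüde"]}


def derive_translations(word, pivots):
    """Derive translations of `word` through intermediate languages.

    `pivots` is a list of (source_to_pivot, pivot_to_target) dictionary pairs.
    Hypothetical score: the number of intermediate languages that lead to
    the same target word.
    """
    support = defaultdict(int)
    for source_to_pivot, pivot_to_target in pivots:
        candidates = set()
        for pivot_word in source_to_pivot.get(word, []):
            candidates.update(pivot_to_target.get(pivot_word, []))
        # Count each intermediate language at most once per candidate.
        for candidate in candidates:
            support[candidate] += 1
    # Best supported candidates first, ties broken alphabetically.
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))


ranked = derive_translations("dog", [(EN_FR, FR_DE), (EN_ES, ES_DE)])
print(ranked)
```

Candidates that are confirmed by several intermediate languages end up at the top, while candidates that only appear through a single, possibly ambiguous, path can be ranked lower or filtered out.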

Bug Fixes and Workarounds

Some bugs, especially one in the Virtuoso database, made it necessary to skip some translations. The known bugs are now either fixed or handled by a workaround.

More Recent Data

As always, the people working on Wiktionary and DBnary aren't lazy, either. Their changes trickle down to WikDict with some delay and lead to visible improvements over time.

Less Noise

More Filtering and Better Sorting

The scoring mentioned above has also been used to improve the sorting of words, senses and translations, as well as to filter some less reliable results introduced when reading dictionaries in reverse.

Less Unparsed Markup in Senses

When senses/definitions for words are extracted from Wiktionary, quite a lot of different markup might be left inside those strings. WikDict got better at parsing those texts, so you will see fewer [[brackets]], <tags> and [1] left 1. over | numbers : or symbols than before. If you still see those, please let me know.

Sunday, July 24, 2016

Links now link to searches in WikDict

Previously, clicking on a term in the dictionary results led to the corresponding Wiktionary page. Feedback from users has shown that this is not what a typical user expects. Now all linked terms lead to a search in WikDict using the clicked term as search text.

The Wiktionary links can now be found in the sidebar on the right instead. As always, feedback on this change is very welcome!

Sunday, April 24, 2016

Stemming support for English

All English entries are now searched using the Porter stemming algorithm, which means that more translations will be found if you search for something other than the base form of a word. The most common case is searching for a plural (e.g. "stoats") and getting a translation for the singular ("stoat"), even though the plural form does not appear anywhere in the data set.
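
To show why stemming helps, here is a minimal sketch. It implements only step 1a of the Porter algorithm (the plural rules), not the full algorithm, and the index layout, the function names and the sample entry are assumptions for this example rather than the actual WikDict implementation.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter stemmer (plural handling)."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # stoats -> stoat
    return word


def build_index(entries):
    """Index translations by the stemmed, lower-cased headword."""
    index = {}
    for headword, translations in entries.items():
        stem = porter_step_1a(headword.lower())
        index.setdefault(stem, []).extend(translations)
    return index


def lookup(index, query):
    """Stem the query the same way as the headwords before searching."""
    return index.get(porter_step_1a(query.lower()), [])


# Sample entry invented for this example; only the singular is stored.
index = build_index({"stoat": ["Hermelin"]})
print(lookup(index, "stoats"))
```

Because both the stored headwords and the query go through the same stemming function, the plural query ends up at the entry for the singular.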


Sunday, April 17, 2016

WikDict source code in Bitbucket

If you're interested in how WikDict works or you want to contribute any fixes or improvements, head straight to the WikDict Bitbucket page and have a look at the different parts of this project. If you need any help with one of the repositories, feel free to contact me for additional information.

Sunday, January 17, 2016

Filtering of HTML entities and tags

In some cases, HTML tags (<center>, <ref>, etc.) or entities (usually &nbsp;) remain in the data from dbnary, which is used as input material for WikDict. I'm now using a basic HTML parser to improve the handling of these cases.

Entities

From now on, all entities should be properly converted, resulting in
Gerät für Turnübungen, auf einem Gestell befestigter 10 cm breiter und 5 m langer Holzbalken
instead of
Gerät für Turnübungen, auf einem Gestell befestigter 10&nbsp;cm breiter und 5&nbsp;m langer Holzbalken
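
A conversion like this can be sketched with html.unescape from the Python standard library. This is not necessarily how WikDict does it; the variable names are made up for this example.

```python
import html

raw = "10&nbsp;cm breiter und 5&nbsp;m langer Holzbalken"

# html.unescape replaces named and numeric character references with the
# corresponding Unicode characters; &nbsp; becomes U+00A0 (no-break space).
clean = html.unescape(raw)
print(clean)
```

Note that the result contains a no-break space rather than a plain space, which keeps the number and its unit together when the text is wrapped.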

Stripped Tags

Most HTML tags will be ignored, leaving the text inside the tag untouched. However, some tags will be stripped together with their content, since that content is not relevant for the translation. One such tag is <ref>, resulting in
  • „fiktives Land, in dem absurde Verhältnisse herrschen“ als sinnbildhafte Bezeichnung für „unverständliche (absurde) politische Situationen“, „bestimmte Verhältnisse[, die] nicht nachvollziehbar sind“, für „etwas völlig Absurdes“
instead of
  • „fiktives Land, in dem absurde Verhältnisse herrschen“<ref></ref> als sinnbildhafte Bezeichnung für „unverständliche (absurde) politische Situationen“<ref name="WP"></ref>, „bestimmte Verhältnisse[, die] nicht nachvollziehbar sind“<ref name="WP"/>, für „etwas völlig Absurdes“<ref>, Stichwort »absurd«, Seite 45.</ref>
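
The following sketch shows how such a filter could look with html.parser from the Python standard library. The class and function names and the set of dropped tags are assumptions for this example (the post only names <ref>); it is not the actual WikDict code.

```python
from html.parser import HTMLParser

# Hypothetical set of tags that are removed together with their content.
DROP_WITH_CONTENT = {"ref"}


class SenseCleaner(HTMLParser):
    """Keep text, ignore most tags, drop some tags including their content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_sense(text):
    parser = SenseCleaner()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


sample = 'Land<ref>, Seite 45.</ref> als <center>Text</center><ref name="WP"/>!'
print(clean_sense(sample))
```

Self-closing tags such as <ref name="WP"/> are reported by the parser as a start tag directly followed by an end tag, so they do not leave the cleaner stuck in skip mode.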

Sub- and Superscripts

Sub- and superscript tags get special handling: in the most common cases, their content is converted to the corresponding Unicode characters. This produces beautiful chemical texts like
  • Organische Chemie: eine farblose, viskose Säure mit der Summenformel C₄H₆O₃
instead of
  • Organische Chemie: eine farblose, viskose Säure mit der Summenformel C<sub>4</sub>H<sub>6</sub>O<sub>3</sub>
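
As a rough sketch of this conversion, the example below uses regular expressions and translation tables instead of a full HTML parser. That is a simplification chosen for brevity, not the approach described above, and the covered characters (digits, plus and minus) are an assumption about what "the most common cases" are.

```python
import re

# Translation tables for the characters that have Unicode equivalents here.
SUBSCRIPTS = str.maketrans("0123456789+-", "₀₁₂₃₄₅₆₇₈₉₊₋")
SUPERSCRIPTS = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")


def replace_scripts(text):
    """Replace <sub>/<sup> tags by Unicode sub- and superscript characters.

    Characters without an entry in the tables are kept unchanged.
    """
    text = re.sub(r"<sub>(.*?)</sub>",
                  lambda m: m.group(1).translate(SUBSCRIPTS), text)
    text = re.sub(r"<sup>(.*?)</sup>",
                  lambda m: m.group(1).translate(SUPERSCRIPTS), text)
    return text


print(replace_scripts("C<sub>4</sub>H<sub>6</sub>O<sub>3</sub>"))
```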

Feedback

Did you find cases where these changes produce bad results, or do you see obvious improvements? Let me know and I'll try to fix them.