Germanic Lexicon Project








Note from Sean Crist, July 2020

My main period of activity with the Germanic Lexicon Project was around 1999-2006. I was a grad student (later a professor) in linguistics with historical Germanic linguistics as an area of focus. My personal motivation in starting the project was that I wanted to use computational techniques in the area of historical linguistics.

This meant that I needed data. One way to get the data was to digitize copyright-expired dictionaries. When you perform OCR on dictionaries of premodern languages, you get a lot of OCR errors.

Wikipedia was brand new at the time, and I thought about how that general kind of crowdsourcing approach could also be a way to correct the OCR errors: anybody can pitch in to help fix them, and everybody can use and share the resulting free data. So, I built a web-based system to allow volunteers to reserve one page of a dictionary, correct the errors, and submit the corrected text.

The crowdsourcing part of the project was a big success. A lot of people helped out. By around 2008, two of the major dictionaries (Bosworth/Toller and Cleasby/Vigfusson) had each had one round of corrections by volunteers. (A history of the early part of the project up to 2004 is here.)

Unfortunately, the bigger picture did not work out. Over a period of several years, I submitted multiple grant applications to support further work and research with the data, but these attempts were unsuccessful. The Germanic Lexicon Project was one piece of the larger academic career plans I had at the time, but those plans were going nowhere fast. In 2006, I made the decision to change career paths, and left academia to go into industry to work in Natural Language Processing.

Meanwhile, a major piece of the project had been picked up by Ondrej Tichy of Univerzita Karlova (Charles University) in Prague, Czechia. Ondrej and his team took on the further correction and improvement of Bosworth-Toller's dictionary and built a superior interface for it, which can be seen here. I am grateful to Ondrej for providing the hosting space for the rest of the Germanic Lexicon Project from 2006 to 2020.

After 2008

I could have left the project behind altogether, but I was still interested. Now the project became a personal hobby rather than a professional career focus.

I had a real technical problem on my hands. When I built the original project, I didn't know anything about writing web apps. Back around 2003, I had made a decision which made sense at the time: a dictionary page can be corrected by only one volunteer, and then it's done. The amount of text to be corrected was vast, and finishing it seemed far in the future. This tradeoff allowed me to get an initial system out the door fairly quickly in the limited time I had to work on it.

By 2008, volunteers had done corrections on all of the text of the two main dictionaries. It was clear that they hadn't caught all the errors. However, adapting the system to allow further corrections was far from a trivial problem.

I came up with a design for a new system which would allow an unlimited number of wiki-style edits. I worked on it in my spare time over a period of years.

Unfortunately, it was just too big. After poking at it for several years and then taking a week of vacation time to try to finish it, I reluctantly concluded that it was more than I could finish while holding down a full-time corporate job. I had to set it aside, although I still hope that some change in my future fortunes will allow me to get back to it.

July 2020

In July 2020, I finally had a chance to move the project to my own server and do some housecleaning to reflect the fact that there isn't currently a way to volunteer. I fixed several things that were broken.

About the data

If you download the data, you will probably notice that the character encoding scheme is very odd.

The reason for this is that Unicode support was weak back in 2003. I could have written the website software to use UTF-8, and that would have been my first choice. The problem, though, was that volunteers needed to be able to edit the text. In those days, a lot of text editors and word processors didn't support UTF-8.

So, I worked out an awkward way of representing special characters with entities such as &thorn; or &o-long-acute; or &u-long-short;.

In 2020, of course, UTF-8 has long since become the lingua franca. I'd like to convert all the data to UTF-8. Converting the files themselves would be simple enough.
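To give a rough idea of what such a conversion could look like, here is a minimal Python sketch. It is not the project's actual code. It assumes that an entity is either a standard HTML name such as &thorn;, or a single base letter followed by hyphen-separated modifier names, and that "long", "short", and "acute" mean macron, breve, and acute accent. The `COMBINING` table and the function names are hypothetical; the real data may use other modifier names or conventions, so anything the sketch does not recognize is left untouched.

```python
import re
import unicodedata
from html.entities import name2codepoint

# ASSUMPTION: modifier names map to these Unicode combining marks.
# The project's real entity inventory may differ.
COMBINING = {
    "long": "\u0304",   # combining macron
    "short": "\u0306",  # combining breve
    "acute": "\u0301",  # combining acute accent
}

# Matches things like &thorn; or &o-long-acute;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9-]*);")


def decode_entity(match):
    """Return the Unicode text for one entity, or the entity unchanged."""
    name = match.group(1)
    # Standard HTML entity names (thorn, eth, aelig, ...).
    if name in name2codepoint:
        return chr(name2codepoint[name])
    # Hypothetical scheme: one base letter plus named modifiers.
    base, *modifiers = name.split("-")
    if len(base) == 1 and modifiers and all(m in COMBINING for m in modifiers):
        return base + "".join(COMBINING[m] for m in modifiers)
    # Unknown entity: leave it as-is so nothing is silently lost.
    return match.group(0)


def to_unicode(text):
    """Replace recognized entities and normalize the result to NFC."""
    return unicodedata.normalize("NFC", ENTITY.sub(decode_entity, text))


def convert_file(src_path, dst_path, src_encoding="latin-1"):
    """Read an entity-encoded file and write it back out as UTF-8.

    ASSUMPTION: the source encoding is a guess; adjust as needed.
    """
    with open(src_path, encoding=src_encoding) as src:
        converted = to_unicode(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(converted)
```

Leaving unrecognized entities in place makes it easy to search the output for any leftover `&...;` sequences and extend the table by hand.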

Unfortunately, all of the site software (the search and message boards) is coded to use the old makeshift encoding. Those systems would break if I simply converted the data.

The website code I wrote in 2003 is totally out of date, and it wouldn't make sense to put the work into adapting it to handle UTF-8. I know a heck of a lot more now about how to not build web applications. New code would be the only way forward. If I get a chance, I'll work on it at some point.


Thanks to the many people who have helped out with the project in numerous ways. Also thanks to the many people who have expressed thanks for the free materials.

A lot of people have contacted me since 2008 offering to help. I wish I had a better answer for you. Maybe in the future.

As of this writing, I'm 51, and I'm now at a point in my life when retirement is on the distant horizon. The project isn't aligned with my current career direction, but it is something I could pick up as a full-time retirement project. We'll see.