Germanic Lexicon Project
Message Board


Author: Sean Crist (Humedica)
Email: kurisuto at panix dot com
Date: 2016-01-06 16:26:28
Subject: Re: machine readable format

> I'm trying to extract translations from the Icelandic dictionary. I've downloaded the 11M file containing everything, but the format seems really complicated. Is there a format I could use that's machine readable (e.g. xml) or does anyone have a script to parse the dictionary?
>
> Thanks if you can help.


That doesn't currently exist.

You are correct that extracting data from a human-readable dictionary is not an easy problem. There is actually a whole academic literature on that problem, mostly from the 1980s and early 1990s. If you've got a human-readable dictionary in electronic form, how do you parse it to extract the information?

What motivated that line of research is the need for machine-readable lexicons (like a dictionary, but in such a regular form that a computer program can conveniently read it and use it). Many forms of natural language processing require a lexicon, and it's expensive and labor-intensive to create these lexicons by hand (I should know; when I was in grad school, my RA-ship for two years was to create machine-readable lexicons of German and Japanese).

Human-readable dictionaries look like structured data, so can't we just read what we need out of one of these dictionaries? That sounds good until you actually try to do it. In reality, dictionaries are only semi-structured data.

For most dictionaries, it's not the case that the author rigidly follows some well-specified entry format. A lot of entries seem to be following some sort of regularity, but there are always cases where the dictionary writer seems to violate his own rules. The literature on processing dictionaries often refers to these entries as "malformed" entries.

My take on it is that there is not actually any clear dividing line between well-formed and malformed entries, and that this property of dictionaries follows from two considerations: 1) the dictionaries are written for human consumers, and 2) the subject matter itself is not inherently regular; while some aspects of human languages lend themselves to orderly cataloging, there are always quirky facts about some lexical items which don't fit into any orderly scheme.

Most of the field gave up on this approach because it seemed unworkable. I do think there are cases where it still makes sense. The reason the Germanic Lexicon Project exists at all is that I thought (and still think) that historical languages such as the older Germanic languages are one of those cases. My whole reason for creating the project was that I wanted a large body of etymological data in machine-readable form so that I could experiment with computational techniques in historical linguistics.

One approach to processing dictionaries is to write a script with hand-made rules. You can do this; but if you try it, then know when to draw the line. You can't cover every detail and every exception, and you'll drive yourself crazy trying to do it (plus your code will look like crap, since it basically will amount to a funny encoding of every irregularity in the dictionary). Go for maybe 90% accuracy, and then fix things by hand as needed.
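A minimal sketch of the hand-made-rules approach, for illustration only. It assumes a deliberately simplified, hypothetical entry shape ("headword, pos. gloss"); the real Cleasby/Vigfusson entries are far messier, and the sample lines, field names, and function names below are made up. The point is the shape of the workflow: one simple rule, plus a pile of leftovers to fix by hand rather than a growing list of special cases.

```python
import re

# Hypothetical entry shape: "headword, pos. gloss". Real entries are messier.
ENTRY_RULE = re.compile(
    r"^(?P<headword>[^,]+),\s*"   # everything up to the first comma
    r"(?P<pos>[a-z]{1,4})\.\s+"   # short abbreviation such as "m." or "adj."
    r"(?P<gloss>.+)$"             # the rest of the line
)


def parse_entry(line):
    """Return a dict of fields, or None if the line does not fit the rule."""
    match = ENTRY_RULE.match(line.strip())
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


def parse_all(lines):
    """Split lines into parsed entries and leftovers for manual correction."""
    parsed, needs_review = [], []
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            needs_review.append(line)
        else:
            parsed.append(entry)
    return parsed, needs_review


# Made-up sample lines, not taken from any real dictionary file.
sample = [
    "hestr, m. a horse",
    "góðr, adj. good",
    "see also under hestr",
]
parsed, needs_review = parse_all(sample)
coverage = len(parsed) / len(sample)  # fraction of lines the rule handled
```

Anything that lands in `needs_review` gets fixed by hand. Watching `coverage` as you add rules is one way to notice when a new rule is no longer worth its complexity.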

I would no longer use hand-made rules myself. I've done research in this area using a machine-learning technique called Conditional Random Fields, and I got pretty good results (95% to 98% accuracy, depending on the complexity of the dictionary). My results aren't published, but this is the technique I'd use at this point. It does require some specialized knowledge of natural language processing and machine learning.
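To give a rough idea of what the sequence-labeling setup looks like, here is a sketch of the feature-extraction step only. It is not the feature set from the unpublished work mentioned above; the features, label names, and sample entry are all assumptions made up for illustration. Each token of an entry becomes a dict of features, and a person labels each token with its field. Lists of such feature dicts, paired with label lists, are the kind of training input that CRF toolkits such as sklearn-crfsuite accept.

```python
def token_features(tokens, i):
    """Describe token i of an entry using itself and its neighbors."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "ends_with_period": tok.endswith("."),
        "has_digit": any(ch.isdigit() for ch in tok),
        "position": i,
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # first token of the entry
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # last token of the entry
    return feats


def entry_features(tokens):
    """Turn one tokenized entry into a list of per-token feature dicts."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Made-up entry and hypothetical labels; real training data would be
# a few hundred hand-labeled entries from the actual dictionary.
tokens = "hestr, m. a horse".split()
labels = ["HEADWORD", "POS", "GLOSS", "GLOSS"]
X = entry_features(tokens)
```

The model then learns which feature patterns go with which field, including the irregular cases, instead of you encoding each irregularity as a rule.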

I wish I could just give you what you're asking for (i.e. a version of Cleasby/Vigfusson already marked up into fields in XML)! I left academia in 2006 to go into industry. If I had been able to stay in academia and continue working in this area, I would have produced that long ago. There is no money in computational historical linguistics, unfortunately.

--Sean

Messages in this thread | Name | College/University | Date
machine readable format | Gary Krug | | 2016-01-05 01:42:19
Re: machine readable format | Sean Crist | Humedica | 2016-01-06 16:26:28