Germanic Lexicon Project
Message Board


Author: Sean Crist (Swarthmore College)
Email: kurisuto at unagi dot cis dot upenn dot edu
Date: 2004-10-27 14:39:49
Subject: Re: dictionary database available for download?

> Hmmm... so basically, your database is just raw text and not reorganized as a
> database file?

Well, yes and no. The database files are automatically rebuilt from the flat raw text files.

Here's how the process works. First, the raw text files for Bosworth/Toller and Cleasby/Vigfusson are automatically re-assembled every Saturday evening, using the most recent version of each page. The resulting text files are the ones you can download. These exact same files are also used as input to rebuild the search database.

The back end to the search system is a MySQL database containing two tables. Table one is a word index. For example, the word eorendel occurs in three entries, so the following three rows occur in table one (one column has the word form, and the other has an entry ID number):

eorendel bt_b0254:6
eorendel bt_d0171:2
eorendel bt_d0386:29

An entry in table two contains an entry ID number in one column, and the full text of the corresponding entry in the other column. So you can see how the query is structured: first we ask the database "Give me the IDs for all the rows in table one where the word form is eorendel." Then we ask the database, "Give me all the rows in table two where the ID is in this list of IDs."
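To make the two-step lookup concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names (word_index, entries), the column names, the placeholder entry text, and the extra demo row are all assumptions made up for illustration, not the actual schema or data.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory stand-in for the two tables described above."""
    conn = sqlite3.connect(":memory:")
    # Table one: word form -> entry ID (hypothetical table and column names).
    conn.execute("CREATE TABLE word_index (word TEXT, entry_id TEXT)")
    # Table two: entry ID -> full text of the entry.
    conn.execute("CREATE TABLE entries (entry_id TEXT PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO word_index VALUES (?, ?)", [
        ("eorendel", "bt_b0254:6"),
        ("eorendel", "bt_d0171:2"),
        ("eorendel", "bt_d0386:29"),
        ("placeholder", "demo_0001:1"),  # made-up row, only to show filtering
    ])
    conn.executemany("INSERT INTO entries VALUES (?, ?)", [
        ("bt_b0254:6", "(placeholder for the text of entry bt_b0254:6)"),
        ("bt_d0171:2", "(placeholder for the text of entry bt_d0171:2)"),
        ("bt_d0386:29", "(placeholder for the text of entry bt_d0386:29)"),
        ("demo_0001:1", "(placeholder for a made-up entry)"),
    ])
    return conn

def lookup(conn, word):
    """Step 1: collect entry IDs for the word. Step 2: fetch those entries."""
    ids = [row[0] for row in conn.execute(
        "SELECT entry_id FROM word_index WHERE word = ? ORDER BY entry_id",
        (word,),
    )]
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    return conn.execute(
        "SELECT entry_id, body FROM entries "
        f"WHERE entry_id IN ({placeholders}) ORDER BY entry_id",
        ids,
    ).fetchall()

if __name__ == "__main__":
    for entry_id, body in lookup(build_demo_db(), "eorendel"):
        print(entry_id, body)
```

The same result could be had with a single JOIN; the sketch keeps the two queries separate only because that mirrors the description above.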

So to rebuild the database, here's what we do:

-Run one Perl script to assign a unique ID number to each entry. (The output of this step is essentially table two.)

-Run a second Perl script to shred the entries into the right format for table one, removing duplicates within each entry.

-When both input files are ready, disable the web interface to the query system. Erase the old tables from last week. Load in the new tables, which takes about 90 minutes because MySQL indexes the tables for fast lookup. Then re-enable the web interface.
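The first two steps can be sketched roughly as follows. This is in Python rather than Perl, and it is not the actual scripts: the page:position ID shape is only guessed from the example IDs above, the tokenizer (lowercase, split on runs of word characters) is an assumption, and the page name and entry texts in the usage example are made up.

```python
import re

def assign_ids(entries_by_page):
    """Step 1: give every entry a unique ID (assumed shape: page:position).

    The resulting (entry_id, text) pairs correspond to table two.
    """
    table_two = []
    for page, entries in entries_by_page.items():
        for position, text in enumerate(entries, start=1):
            table_two.append((f"{page}:{position}", text))
    return table_two

def shred(table_two):
    """Step 2: break each entry into (word, entry_id) rows for table one.

    A word that occurs several times in one entry yields only one row
    for that entry; the same word in another entry still gets its own row.
    """
    rows = []
    for entry_id, text in table_two:
        seen = set()
        for word in re.findall(r"\w+", text.lower()):
            if word not in seen:
                seen.add(word)
                rows.append((word, entry_id))
    return rows

if __name__ == "__main__":
    # Hypothetical page name and entry texts, for illustration only.
    pages = {"demo_p0001": ["Eorendel the eorendel star", "a star"]}
    table_two = assign_ids(pages)
    for row in shred(table_two):
        print(*row)
```

Loading the two outputs into the database and indexing them is the slow part described in the last step, and is not shown here.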

So I could make these database files available for download, but since it is a purely mechanical task to convert the raw text files into tables one and two, I hadn't considered it to be worthwhile to post this trivially derived form. I could be convinced to change my mind if there's real interest, but this would be one more thing for me to document and maintain.

In general, I haven't provided multiple formats of the same data because of the extra effort. I'd rather put my energies into getting the data itself online. Then, others can use the data to create whatever they want. But, I'll make exceptions if there is an obvious demand.


> As for the searching software, thanks for the background info. I suppose it would
> be way over my head as a beginning programmer to even try to do anything with
> the code, especially since I'm learning Python and not Perl. But I'm
> sure I'm not the only one interested in the progress of your program. Maybe
> once you're able to put more time into it, it might be a good idea to create
> a web page about the program's progression. ..?

I'm not quite sure what you're asking here. Which program do you mean?

--Sean


Messages in this thread

Subject | Name | College/University | Date
dictionary database available for download? | chicane | | 2004-10-26 22:25:09
Re: dictionary database available for download? | Sean Crist | Swarthmore College | 2004-10-27 08:18:41
Re: dictionary database available for download? | chicane | | 2004-10-27 12:55:06
Re: dictionary database available for download? | Sean Crist | Swarthmore College | 2004-10-27 14:39:49
Re: dictionary database available for download? | chicane | | 2004-10-27 15:11:38