Transcribing Texts

A fundamental value of our project is correctly attributing Jewish liturgy and liturgy-related work. Even when the original author of a work is lost to history, we strive to record every adaptation and variation sourced within particular manuscripts and extant published works in the Public Domain.

To do this, our volunteers help produce transcriptions that can easily be verified as authentic witnesses of a given work, whether it is the earliest known version or some other variation. We use Wikisource as the collaborative transcription and proofreading environment for Public Domain and free-culture licensed texts. Volunteers have already begun to transcribe and proofread these texts.

If you’d like to begin transcribing a work, let us know in the comments below! We can help you.

How to Get Started Transcribing!

Install a Hebrew Keyboard layout for your Operating System. (We recommend the Biblical Tiro layout.) Refer to the key mapping images and familiarize yourself with the four levels of the Biblical Tiro keyboard layout.
Download and install Unicode Hebrew Fonts supporting the full range of Hebrew diacritics. We recommend installing the Taamey Frank CLM font from the Culmus Project (available in the Open Siddur Font Pack).
Configure your web browser to display Unicode Hebrew Fonts supporting the full range of Hebrew Diacritics. See below for specific details for changing the default Hebrew fonts displayed in Mozilla Firefox.
Register a new user account with Wikisource.
Log in and set your preferred settings for the language and editing interface in Wikisource (see below).
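
Once your keyboard and fonts are set up, it can help to confirm that the text you type really contains the diacritics you intended. The following is a minimal sketch in Python, not part of any Open Siddur or Wikisource tooling: the function name `summarize_hebrew` is hypothetical, and the code point ranges are taken from the Unicode Hebrew block (U+0590 to U+05FF). It counts base letters, vowel points, and cantillation accents in a string.

```python
import unicodedata


def summarize_hebrew(text):
    """Count Hebrew letters, points (niqqud), and accents (te'amim) in text.

    Hypothetical helper for illustration only.
    """
    counts = {"letters": 0, "points": 0, "accents": 0}
    for ch in text:
        cp = ord(ch)
        if 0x05D0 <= cp <= 0x05EA:
            # Alef through tav, including final forms.
            counts["letters"] += 1
        elif 0x0591 <= cp <= 0x05AF:
            # Cantillation accents.
            counts["accents"] += 1
        elif 0x05B0 <= cp <= 0x05C7 and unicodedata.category(ch) == "Mn":
            # Vowel points and other combining marks (dagesh, shin dot, ...).
            # The category check skips punctuation such as maqaf (U+05BE).
            counts["points"] += 1
    return counts


# "shalom" with vowels: shin + shin dot + qamats, lamed, vav + holam, final mem
sample = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
print(summarize_hebrew(sample))  # {'letters': 4, 'points': 3, 'accents': 0}
```

If a fully pointed word reports zero points, the keyboard layout is probably not producing the combining marks you expect.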

Preparing Mozilla Firefox for Transcription in Wikisource

Download the fonts in the Open Siddur Unicode Hebrew Font Pack and install them.
Open Firefox Options.
Select the Content tab.
Click the Advanced button. Under Fonts, select Hebrew.
Choose your favorite Hebrew fonts and font size for transcription.
Click OK when finished.

Familiarizing Yourself with Wikisource

Login and Settings

If you haven’t yet registered an account on Wikisource, please create your account now.
To log in to your account at Wikisource, click on the login link at the top right of the web page.
Click “My Preferences” in the top right corner. In Hebrew Wikisource, click “כניסה לחשבון” in the top left corner.
Click “User Profile” to choose your preferred language for working within Wikisource’s interface. To navigate Hebrew Wikisource in English or another language, click “פרטי המשתמש” and the context menu next to “שפת הממשק” to select your preferred language.
Click “Editing” to set your preferred settings for using the editing interface. See below for my preferred settings to edit.

Using the Transcription Interface

The great thing about collaborative transcription and proofreading is that you can correct others’ work and others can correct your errors. The key is knowing how to navigate the Wikisource interface.

To edit a page of text, click on the ‘Edit’ link (next to the ‘Read’ link) above the page image.
When you are done editing or proofreading a page, don’t forget to indicate in what state you’ve left it. Reading the help page “Help:Editing” will help you better understand how Wikisource users track their transcription and proofreading.

What about OCR for Hebrew?

Tesseract-OCR is an excellent open-source OCR engine for Hebrew. Combined with a user interface such as VietOCR.net, it is fairly easy to use.
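
For those who prefer a script to a graphical interface, Tesseract can also be run from the command line as `tesseract imagename outputbase -l lang`. The sketch below wraps that invocation in Python. It assumes that the `tesseract` executable is on your PATH and that the Hebrew language data (`heb`) is installed; the function names and the file names in the usage note are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="heb"):
    """Return the argument list for a basic Tesseract run.

    Tesseract writes its plain-text result to <output_base>.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_hebrew_ocr(image_path, output_base):
    """Run Tesseract on one page image and return the output file path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return output_base + ".txt"
```

Calling `run_hebrew_ocr("page_001.png", "page_001")` would, under these assumptions, leave the recognized text in `page_001.txt`, ready to be proofread against the page image.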

With technology in its current state, manual transcription (typing) is the only reliable way to transcribe Hebrew text with vowels. Open-source tools cannot yet reliably convert images of Hebrew letters with diacritical marks into machine-readable Hebrew text: proofreading their output takes more work than transcribing the text from scratch.[ref]hOCR is available for testing on Linux. Unfortunately, an effort to continue Kobi Zamir’s work on hOCR stalled in 2010. An early version of hOCR compiled for use on Windows is available for download here.[/ref] Until such tools improve, projects such as the Open Siddur must depend on manual transcription by humans.

Making Hebrew OCR with diacritical support available and accessible needs attention! While Tesseract cannot OCR Hebrew with niqqud “out of the box,” researchers have had success in training Tesseract to do so. Take a look at Adi Oz and Vered Shani’s work at their project page and in their PowerPoint presentation. If Adi and Vered’s work could be made available to the wider community, we’d be grateful.

Another OCR project to keep an eye on is that of Assaf Urieli. What is interesting about Assaf’s approach is that it will check the OCR against a list of words so that the software can measure the confidence of its recognition.
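
To illustrate the general idea of checking OCR output against a word list, here is a minimal sketch in Python. It is not Assaf Urieli’s implementation, whose details are not described here; the function names and the tiny lexicon are hypothetical. Each recognized word is stripped of its combining marks and looked up in a set of known words, and the share of words found serves as a rough confidence score.

```python
import unicodedata


def strip_marks(word):
    """Remove combining marks (niqqud, accents) so words compare by letters only."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def lexicon_confidence(ocr_text, lexicon):
    """Return the fraction of whitespace-separated words found in the lexicon.

    A crude score: it ignores punctuation handling and word frequency.
    """
    words = [strip_marks(w) for w in ocr_text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in lexicon)
    return hits / len(words)


# Hypothetical lexicon of unpointed words: "shalom" and "olam".
lexicon = {"\u05E9\u05DC\u05D5\u05DD", "\u05E2\u05D5\u05DC\u05DD"}

# OCR output: pointed "shalom", then "olam" with the final mem misread as samekh.
ocr_text = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD \u05E2\u05D5\u05DC\u05E1"
print(lexicon_confidence(ocr_text, lexicon))  # 0.5
```

A low score flags a page for closer proofreading. A high score does not prove the diacritics are right, because the comparison deliberately ignores them.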