Steps to Digitizing a Document …
What does it take to get a document digitized and published online here at the Rudolf Steiner Archive? Here are the steps:
- Locate and acquire the document, either by purchase or from a library.
- Scan each page, saving it as a computer file (preferably, but not necessarily, a TIFF [Tagged Image File Format] file). At this point it is a graphic image, like a photo of each page.
- Run the files through OCR (Optical Character Recognition) software, which converts any alphabetic characters it finds in the image to actual “text” characters and produces text files. The accuracy of this process ranges from 95% recognition for very clear documents to no recognition at all for some old manuscripts, which have to be keyed in (typed) by hand.
- Proofread and correct each text file, comparing it against the original document (preferred) or the scanned images, and save the result as one or more revised files. This includes:
  - editing for typographic errors, whether caused by OCR inaccuracy or present in the original document (it happens!),
  - verifying special characters, especially left and right quotation marks, and diacritical marks such as umlauts.
- Proofread to locate all footnotes and graphics (e.g., diagrams, drawings) in order to place them correctly in the online version.
- Proofread to locate all references to items online in order to set up links for cross-references.
- Convert to HTML. For a single lecture, this is a single file. For a book or collection, there are multiple files, including the cover image, contents, prefaces, appendices, synopses, notes, footnotes, and cross-references. Much of this is automated, but a good deal must still be done manually and checked by the human eye.
- Not all browsers are equal! Quite a bit of work is needed to make the document render at least roughly the same way in all browsers. What looks fine in one browser may look terrible in another, and fixing it for the second browser can break it in the first. We recommend Firefox!
- Load the document into the database(s), cross-reference it with other documents, and create the index, keywords, and other information needed for our database and the search/research tools we have created.
- Publish on the website. Whew!
From start to finish, a 10-page lecture can take anywhere from one to eight hours to go from initial scanning to appearing online. A collection or book can take 10-50% more time, to handle all the indexing, cross-referencing, and formatting. Graphics and diagrams can also take a lot of work to clean up after they have been scanned. Some of our materials are original typewritten manuscripts on very fragile, yellowed paper, and are nearly impossible to process with OCR. Currently, there are 615 on-site volumes and 2315 individual documents here at the Archive!
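To illustrate the special-character check in the proofreading step above, here is a minimal sketch of a helper a proofreader might run over OCR output. It is not the Archive's actual tooling: the function name `flag_special_characters` and the sample sentence are hypothetical, and the script only points at places to look. It flags straight quotation marks (which may need to become left or right curly quotes) and names every non-ASCII character so that umlauts and similar marks can be confirmed against the original page.

```python
import unicodedata

# Straight quotes that OCR often emits in place of curly ones.
# Note: this also flags ordinary apostrophes, which a human then judges.
STRAIGHT_QUOTES = {'"', "'"}


def flag_special_characters(text):
    """Return (line, column, character, note) tuples for a proofreader to review."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in STRAIGHT_QUOTES:
                findings.append(
                    (line_no, col, ch, "straight quote; check for curly quote")
                )
            elif ord(ch) > 127:
                # Non-ASCII character: report its Unicode name so it can be
                # compared with what is printed in the original document.
                name = unicodedata.name(ch, "UNKNOWN CHARACTER")
                findings.append((line_no, col, ch, name))
    return findings


if __name__ == "__main__":
    sample = 'He said "Goetheanum" and wrote über.'  # hypothetical sample text
    for finding in flag_special_characters(sample):
        print(finding)
```

A script like this cannot decide whether a character is correct. It only shortens the list of places where the human eye is needed.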
Most of the digitizing is done in-house, but we have wonderful volunteers all over the world who acquire and scan documents, run them through their own OCR software (if they have it), and create files that they send to us. The final proofreading, cross-references, creation of HTML files, setting up for our databases and tools, and online publication are all done in-house. And, of course, we provide the heavy-duty servers and broadband to make it all available to the world.
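The automated part of HTML creation might look something like the following sketch. It is an assumption-laden illustration, not the Archive's converter: the function name `text_to_html` is hypothetical, paragraphs are assumed to be separated by blank lines, and footnotes, images, and cross-reference links (the parts that need manual work) are left out entirely.

```python
import html
import re


def text_to_html(title, text):
    """Wrap proofread plain text in a minimal HTML page, one <p> per paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    body = "\n".join(
        # Collapse line wraps inside a paragraph, then escape &, < and >.
        "<p>{}</p>".format(html.escape(" ".join(p.split()), quote=False))
        for p in paragraphs
    )
    safe_title = html.escape(title)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        "<title>{0}</title>\n"
        "</head>\n"
        "<body>\n"
        "<h1>{0}</h1>\n"
        "{1}\n"
        "</body>\n"
        "</html>\n"
    ).format(safe_title, body)


if __name__ == "__main__":
    # Hypothetical title and text, for illustration only.
    print(text_to_html("A Sample Lecture", "First paragraph.\n\nSecond paragraph."))
```

Declaring `utf-8` in the page keeps umlauts and curly quotation marks intact without converting each one to an HTML entity.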
Our Search and Research Tools and Database Management
Jim Stewart has designed and created the online tools (the database, search capability, keyword indexing, cross-referencing, etc.) that enable users to access and research the online documents with ease. This has been an ongoing project for almost 30 years.
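For readers curious about what keyword indexing involves in general, here is a toy sketch of an inverted index, which maps each word to the documents that contain it. It is only an illustration of the idea and says nothing about how the Archive's own database and tools are built. The function names and document identifiers are hypothetical.

```python
import re
from collections import defaultdict


def build_keyword_index(documents):
    """Map each lowercase word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


if __name__ == "__main__":
    # Hypothetical documents, for illustration only.
    docs = {
        "lecture-1": "Colour and light",
        "lecture-2": "Light and warmth",
    }
    index = build_keyword_index(docs)
    print(search(index, "light warmth"))
```

A real research tool would add much more, such as handling of word forms, phrases, and cross-references between documents.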
How to Help
We have a tremendous backlog of materials we want to get online, and there are so many irreplaceable resources at risk worldwide! If you can afford to donate even a little to support this initiative, you will be helping to save irreplaceable works and to make the information available to so many others! Please check our Donation and Appeal pages to see how you can help!