20061208

Journey on Rails #3

Each volume in ChaptersS came as one big file containing all of its books, chapters, and verses.

Remember the format I was shooting for?

Chapter ##

1 This is verse one.

2 This is verse two.

The format the books came in was:

Book Name

1:1 This is chapter 1, verse 1.

1:2 This is chapter 1, verse 2.

2:1 This is chapter 2, verse 1.

2:2 This is chapter 2, verse 2.

and so on...

OK, so here is my text-processing journey:

Processing Attempt #1:
  • Open volume file in text editor.
  • Manually cut-paste-save individual books.
  • Open individual book files:
    • Manually cut-paste-save individual chapters. Filenames were "book##.txt", where "##" is the chapter number padded to two digits where necessary.
    • Manually create the "Chapter ##" header, using a keyboard macro to try to speed things up.
    • Use a very simple regex search-replace to replace each chapter:verse marker with just the verse number.
    • Use a very simple regex search-replace to wrap each verse in paragraph tags.
  • Use the text editor to replace every single quotation mark/apostrophe in the chapter files with \' so that my SQL insert script would not break.
  • Total Time: About 7 hours.
Very tedious. You would think I would have known better. In college I had a job where I cleaned up and formatted scanned texts for the folio indexing software. How embarrassing.
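
The apostrophe step is easy to get subtly wrong once it moves out of the text editor, so here is a minimal sketch of it in Ruby. The method name escape_apostrophes is made up for illustration; only the \' replacement itself comes from the steps above.

```ruby
# Escape every apostrophe as \' so the text can sit inside a
# single-quoted SQL string literal.
def escape_apostrophes(text)
  # The block form is deliberate: in a plain replacement string,
  # \' is a back-reference meaning "the text after the match",
  # so gsub("'", "\\'") would not insert a literal backslash.
  text.gsub("'") { "\\'" }
end

puts escape_apostrophes("This is Paul's letter.")
# prints: This is Paul\'s letter.
```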

Processing Attempt #2:
  • Open volume file in text editor.
  • Manually cut-paste-save individual books.
  • Open individual book files:
    • Use regex search-replace to create chapter headers for every chapter in the book.
    • Use regex search-replace to replace the chapter:verse markers with just the verse number.
    • Use regex search-replace to create paragraph tags around the verses.
    • Search and replace all single quotation marks/apostrophes with \'.
  • Total Time: About 4 hours.
Not bad, but still very time-consuming.
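
The three regex passes in this attempt can be sketched in Ruby roughly as follows. This is an illustration, not the exact patterns I typed into the editor: the method name format_book is made up, the <p> tags are an assumption about what "paragraph tags" means, and each verse is assumed to sit on a single line.

```ruby
# Apply the search-replace passes to the text of a whole book.
def format_book(text)
  # Pass 1: put a "Chapter ##" header above verse 1 of each chapter.
  with_headers = text.gsub(/^(\d+):1 /) { "Chapter #{$1}\n\n#{$1}:1 " }

  # Passes 2 and 3 combined: drop the chapter part of each marker
  # and wrap the remaining "verse text" line in paragraph tags.
  with_headers.gsub(/^\d+:(\d+) (.*)$/) { "<p>#{$1} #{$2}</p>" }
end

book = <<~TEXT
  1:1 This is chapter 1, verse 1.

  1:2 This is chapter 1, verse 2.

  2:1 This is chapter 2, verse 1.
TEXT

puts format_book(book)
```

The first pattern ends with a space after the 1, so it matches 1:1 but not 1:10 or 1:11.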

Processing Attempt #3:
  • The books were already broken out for this volume. I must have done it earlier and forgotten.
  • This time I decided to write a Ruby script to do all the work for me. Not only would this be fun and educational, but it would also speed up processing of future volumes, since the script would be reusable.
  • In summary my script does the following:
    • Opens the individual book file.
    • Figures out when a new chapter begins.
    • Opens a new chapter file with the name "book##.txt" like before.
    • Adds the chapter header line to the file ("Chapter ##").
    • Replaces the chapter:verse marker with just the verse number in every verse.
    • Adds paragraph tags around every verse.
    • Replaces single quotation marks/apostrophes with \'.
    • Writes verses to file.
  • Total Time: 2 hours 30 seconds (2 hours to create script, 30 seconds to run it)
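
A minimal sketch of a script along these lines is below. It is a reconstruction from the summary above, not the original script: the names split_book and VERSE are made up, the <p> tags are an assumption, and it assumes each verse sits on one line that starts with its chapter:verse marker.

```ruby
# A verse line looks like "2:13 Some text": chapter, verse, text.
VERSE = /\A(\d+):(\d+)\s+(.*)\z/

# Split one book file into chapter files named "<prefix>##.txt".
# Returns the list of files written.
def split_book(book_path, prefix, out_dir = ".")
  out = nil
  current = nil
  written = []
  File.foreach(book_path) do |line|
    m = VERSE.match(line.strip)
    next unless m # skips the book title and blank lines
    chapter, verse, text = m[1].to_i, m[2], m[3]
    if chapter != current # a new chapter begins
      out&.close
      path = File.join(out_dir, format("%s%02d.txt", prefix, chapter))
      out = File.open(path, "w")
      out.puts "Chapter #{chapter}", ""
      written << path
      current = chapter
    end
    # Block form so that \' is inserted literally.
    escaped = text.gsub("'") { "\\'" }
    out.puts "<p>#{verse} #{escaped}</p>", ""
  end
  written
ensure
  out&.close
end
```

It would be called as something like split_book("matthew.txt", "matthew"), where both names are placeholders.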

Processing Attempt #4:

This hasn't happened yet. When it does, I will try to enhance my script to process an entire volume instead of individual book files. The only problem I foresee is that the book titles within a volume can be very long, like "The Gospel According to Matthew". Some manual pre-processing of the volume file may be needed before I can script it completely.
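
One possible starting point, purely as an untested idea: treat any non-blank line that does not begin with a chapter:verse marker as a book title, and map long titles to short file prefixes with a small hand-written lookup table. Everything here is hypothetical (split_volume, file_prefix, the short_names table), and it assumes no verse wraps onto a second line.

```ruby
MARKER = /\A\d+:\d+\s/

# Returns a hash of book title => array of that book's verse lines.
def split_volume(text)
  books = {}
  title = nil
  text.each_line do |raw|
    line = raw.strip
    next if line.empty?
    if line =~ MARKER
      raise "verse before any book title: #{line}" unless title
      books[title] << line
    else
      # Anything that is not a verse is taken to be a book title.
      title = line
      books[title] = []
    end
  end
  books
end

# Use a hand-picked short name if one exists, otherwise fall back
# to a lowercase, underscore-separated version of the full title.
def file_prefix(title, short_names = {})
  short_names.fetch(title) do
    title.downcase.gsub(/[^a-z0-9]+/, "_").gsub(/\A_|_\z/, "")
  end
end
```

The lookup table is where the manual pre-processing would end up: for example, { "The Gospel According to Matthew" => "matthew" }.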

Getting the Chapters in the Database

I created a shell script that processes each chapter file and inserts it with an appropriate title, volume_id, and volumeOrder.
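
The shell script itself is not shown here, but the statement it has to produce for each chapter can be sketched in Ruby. The table name chapters and the body column are assumptions for illustration; title, volume_id, and volumeOrder are the fields mentioned above. The body is expected to have its apostrophes escaped already by the earlier processing.

```ruby
# Build one INSERT statement for a processed chapter.
# Assumed schema: chapters(title, body, volume_id, volumeOrder).
def chapter_insert_sql(title, body, volume_id, volume_order)
  # Titles have not been through the chapter processing, so escape here.
  safe_title = title.gsub("'") { "\\'" }
  format(
    "INSERT INTO chapters (title, body, volume_id, volumeOrder) " \
    "VALUES ('%s', '%s', %d, %d);",
    safe_title, body, volume_id, volume_order
  )
end

puts chapter_insert_sql("Matthew 1", "Chapter 1\n\n<p>1 ...</p>", 1, 1)
```

Hand-built SQL strings like this are only reasonable for a one-off import of text I control.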

Whewww. Hopefully, the length of this post reflects the pain of processing the volume text files into usable content.

Next up: ChaptersS on Rails