Centre for Research in Applied Linguistics

Teenage Health Freak Corpus: The Data Transformation Process

Although the data for this project was 'born digital', it posed a number of challenges that needed to be overcome in order to get the most out of it. The data was supplied by the website technical team as an MS Access database. The first step was a simple transformation into an MS Excel spreadsheet using functionality within Access itself, and the data was also manually split into individual years. The remaining clean-up of the data was achieved primarily using Python scripts. The steps taken are outlined below.


Step 1: Create XML

A Python script was used to turn the MS Excel spreadsheet into XML.
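
The conversion can be sketched as follows. The element and attribute names (`messages`, `message`, `date`) are illustrative assumptions, as the project's actual XML schema is not described here; in practice the rows would be read from the exported spreadsheet with a library such as openpyxl rather than given as literals.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(rows):
    """Convert spreadsheet rows of (date, text) into a simple XML document.

    Element and attribute names here are illustrative, not the project's
    actual schema.
    """
    root = ET.Element("messages")
    for date, text in rows:
        msg = ET.SubElement(root, "message", date=date)
        msg.text = text
    return ET.tostring(root, encoding="unicode")

# In the project the rows would come from the Excel export;
# literal data is used here for illustration.
sample = [("2004-01-05", "how do i no if i have flu?")]
print(rows_to_xml(sample))
```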

Step 2: Basic Clean-up

A Python script was used to fix character-encoding problems, remove empty messages, and remove exact duplicate messages sent on the same date. At the same time, any message generated as the result of a user taking a quiz on the website was tagged both as having been generated from a quiz and with the particular quiz it was associated with. The strapline that identified the message as having come from a quiz (for example, “Question asked from the smoking quiz”) was also removed; if this left a message containing no text at all, the message was deleted.
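
A minimal sketch of this clean-up step is given below. The strapline pattern is an assumption based on the single example quoted above, and the dict keys (`date`, `text`, `quiz`) are hypothetical; the encoding fixes are omitted for brevity.

```python
import re

# Assumed strapline pattern, based on the quoted example
# "Question asked from the smoking quiz"; the real wording may vary.
QUIZ_RE = re.compile(r"Question asked from the (\w+) quiz", re.IGNORECASE)

def clean_messages(messages):
    """messages: list of dicts with 'date' and 'text' keys.

    Tags quiz-generated messages with their quiz name, strips the
    strapline, then drops empty messages and exact same-date duplicates.
    """
    seen = set()
    cleaned = []
    for msg in messages:
        text = msg["text"].strip()
        m = QUIZ_RE.search(text)
        if m:
            msg["quiz"] = m.group(1).lower()   # tag with the quiz name
            text = QUIZ_RE.sub("", text).strip()  # remove the strapline
        if not text:
            continue  # empty (possibly after strapline removal): delete
        key = (msg["date"], text)
        if key in seen:
            continue  # exact duplicate sent on the same date: delete
        seen.add(key)
        msg["text"] = text
        cleaned.append(msg)
    return cleaned
```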

Step 3: Identification and Deletion of Possible Duplicates

In the data there were many occurrences of very similar messages being sent one after the other. In some cases spelling was corrected in the later messages; in others a few details were added, deleted or changed. It was decided that if two very similar messages were highly likely to have been sent by the same person, then one of them should be removed from the corpus. To facilitate this, a script was used to identify possible duplicate messages. The script used the Levenshtein distance algorithm, which calculates the similarity between two strings, to find messages that were very similar. The calculation was performed both against the length of the longer message and against the length of the shorter message, which allows the algorithm to find messages that have later been added to. Only messages from the same date were considered as possible duplicates. The possible duplicate pairs were logged to a text file, as the process is too slow for a user to make the final deletion decisions in real time.
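
The core of the detection step can be sketched as below: a standard dynamic-programming Levenshtein distance, with the similarity ratio computed against both the longer and the shorter message as described above. The 0.85 threshold is an illustrative assumption, not the project's actual setting.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def possible_duplicate(a, b, threshold=0.85):
    """Flag a pair as a possible duplicate if it is similar enough
    relative to EITHER the longer or the shorter message.
    The threshold value is illustrative."""
    d = levenshtein(a, b)
    sim_long = 1 - d / max(len(a), len(b))
    sim_short = 1 - d / min(len(a), len(b))
    return sim_long >= threshold or sim_short >= threshold
```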

Once the duplicates had been identified and logged, another script read the log file and presented each pair of potential duplicates to the user along with the timestamp of each message. This allowed the user to consider each pair of messages and decide which, if any, should be deleted. The general principle was that the messages should have been sent within 10 minutes of each other and should be the same question expressed in largely the same way. The message deleted was either the one with more spelling errors or the shorter message.
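
The two heuristics from this step (the 10-minute window and preferring to delete the shorter message) can be sketched as follows; the actual decision in the project was made interactively by a user, and the tuple format used here is an assumption.

```python
from datetime import datetime

def review_pair(msg_a, msg_b, minutes=10):
    """Suggest which of a candidate pair to delete.

    msg_a / msg_b: (iso_timestamp, text) tuples (an assumed format).
    Returns the text suggested for deletion (the shorter message),
    or None if the pair was not sent within the 10-minute window.
    Spelling-error counts, which the user also weighed, are not modelled.
    """
    (ts_a, text_a), (ts_b, text_b) = msg_a, msg_b
    delta = abs(datetime.fromisoformat(ts_a) - datetime.fromisoformat(ts_b))
    if delta.total_seconds() > minutes * 60:
        return None  # outside the window: not treated as a duplicate
    return text_a if len(text_a) < len(text_b) else text_b
```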

Step 4: Spelling Correction

The data presents many challenges with regard to spelling. Spelling was corrected as far as possible by using the keyword procedure in WordSmith Tools to identify consistently misspelled words. The reference corpus selected for this task was the written BNC. Once the keywords had been generated, the misspelled words were manually corrected if the vast majority were instances of the same word (this was established using concordance lines from the corpus). The resulting corrections were added to the corpus using the TEI <choice>, <corr> and <sic> tags so that the original text was not lost.
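
A small helper illustrates how such a correction is encoded with the TEI tags named above, so that both the original and corrected forms survive in the markup:

```python
def tei_correct(original, corrected):
    """Wrap a misspelling in TEI <choice>/<sic>/<corr> markup,
    preserving the original form alongside the correction."""
    return ("<choice><sic>{}</sic><corr>{}</corr></choice>"
            .format(original, corrected))

# e.g. a consistently misspelled word found via the keyword procedure
print(tei_correct("becuase", "because"))
```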

An analysis of the volume and type of spelling errors in the corpus is available.



