As the Teenage Health Freak Corpus consists of unedited typed messages, it is inevitable that it will contain spelling errors and typos. Another feature, which for some analytical purposes poses the same problems as spelling errors, is the deliberate use of abbreviations and acronyms of the kind found in text messages, instant messages and internet forums.



Volume of Spelling Errors

The first step was to investigate the types of spelling error found in the corpus, how frequent they are and therefore how big a problem they may pose for corpus analysis. To do this, 50 messages were selected at random from each year in the corpus, using a random number generator in a Python script. The samples were then analysed by hand to identify spelling errors. The results of this stage of the processing can be seen below.
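The original sampling script is not reproduced here. The following is only a minimal sketch of how such a per-year sample could be drawn, assuming the messages are held in a dictionary keyed by year; the function name, the data layout and the toy data are all hypothetical.

```python
import random


def sample_messages_per_year(messages_by_year, k=50, seed=None):
    """Draw up to k messages at random, without replacement, from each year."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    samples = {}
    for year, messages in sorted(messages_by_year.items()):
        # Guard against years that hold fewer than k messages.
        samples[year] = rng.sample(messages, min(k, len(messages)))
    return samples


# Toy stand-in for the corpus (assumption): 200 placeholder messages per year.
toy_corpus = {
    year: [f"message {year}-{i}" for i in range(200)]
    for year in range(2004, 2010)
}
samples = sample_messages_per_year(toy_corpus, k=50, seed=1)
```

Sampling without replacement ensures that no message is analysed twice within a year.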

Year | No. of Words | No. of Errors | Percentage Errors
2004 | 1209 | 76 | 6.3
2005 | 1403 | 116 | 8.3
2006 | 758 | 89 | 11.7
2007 | 898 | 55 | 6.1
2008 | 1000 | 70 | 7.0
2009 | 871 | 60 | 6.9
All Years | 6139 | 466 | 7.6

Assuming these samples are representative of the corpus, we could expect around 168,592 words to be incorrectly, or at least unconventionally, spelled. With a number this large, further investigation of the spelling errors in the corpus was warranted.
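The arithmetic behind these figures can be sketched as follows. The per-year counts are taken from the table above; the total size of the corpus is not stated in this section, so the corpus size derived at the end is only what the 168,592 estimate implies, not a reported figure.

```python
# (words, errors) per sampled year, copied from the table above.
sample_counts = {
    2004: (1209, 76),
    2005: (1403, 116),
    2006: (758, 89),
    2007: (898, 55),
    2008: (1000, 70),
    2009: (871, 60),
}


def error_rate(words, errors):
    """Proportion of sampled words that were marked as errors."""
    return errors / words


total_words = sum(words for words, _ in sample_counts.values())
total_errors = sum(errors for _, errors in sample_counts.values())
overall_rate = error_rate(total_words, total_errors)


def expected_errors(corpus_words, rate=overall_rate):
    """Scale the sample error rate up to a corpus of the given size."""
    return round(corpus_words * rate)


# Corpus size implied by the estimate of 168,592 errors (an inference only).
implied_corpus_words = 168_592 / overall_rate

for year, (words, errors) in sample_counts.items():
    print(year, f"{100 * error_rate(words, errors):.1f}%")
print("All years", f"{100 * overall_rate:.1f}%")
```

On these numbers the estimate corresponds to a corpus of roughly 2.2 million words.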

Type of Spelling Errors

Understanding the types of spelling error regularly encountered in the corpus is important because spelling correction algorithms often work better with some types of error than others. Finding out about the types of error we have to deal with could therefore help with the selection and evaluation of algorithms. To investigate this, errors were first classified into five main classes, explained and illustrated in the table below.

Error Class | Description | Examples
Chat-style | abbreviations or acronyms which might be expected in text messages or Instant Messaging | u > you; 4 > for; cuz > because; sum > some
Phonetic | words which can reasonably be pronounced in the same way as the original word and are less likely to be typographical | probarbly > probably; egsisting > existing; marige > marriage
Typographical | errors which are more likely to be caused by mistyping | iam > i am; resulst > results; alchohl > alcohol
Emphasis | deliberate errors made for emphasis (typically additions) | soooo > so; yoooo > yo
Unclassified | errors that don't seem to fit in any of the categories | pencise > penis

The results of the analysis can be seen in the table below. If a word includes more than one class of error, it is counted in each relevant class.

Error Class | Total Occurrences
Typographical | 257
Chat-Style | 125
Phonetic | 83
Emphasis | 3
Unclassified | 1
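Because a word can be counted in more than one class, the class totals add up to slightly more than the 466 errors found in the samples. The sketch below shows one way such counting could be done and checks the surplus against the published totals. The annotation format and the class labels given to the three example words are assumptions made for illustration, not the original hand annotations.

```python
from collections import Counter


def count_by_class(annotated_errors):
    """Count each error once in every class it has been assigned to."""
    counts = Counter()
    for _typed, _target, classes in annotated_errors:
        counts.update(set(classes))
    return counts


# Hypothetical annotations: (typed form, target form, assigned classes).
annotated = [
    ("soooo", "so", {"Emphasis"}),
    ("resulst", "results", {"Typographical"}),
    ("frendz", "friends", {"Chat-Style", "Phonetic"}),  # assumed double label
]
counts = count_by_class(annotated)

# Published class totals from the table above.
published = {
    "Typographical": 257,
    "Chat-Style": 125,
    "Phonetic": 83,
    "Emphasis": 3,
    "Unclassified": 1,
}
# Number of extra class assignments beyond one per error.
surplus = sum(published.values()) - 466
```

The class totals sum to 469, so three class assignments in the samples are additional labels on words that already belong to another class.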

Typographical Errors

The largest class of errors in the corpus is typographical errors. Included in this figure are 123 errors which involve only a missing apostrophe. These were included in typographical errors because instances of “Im” and “I'm” will be treated as different tokens by corpus processing software. For many corpus tasks, however, this will probably not be of much concern. Even if these examples are removed, typographical errors are still the largest class, with 134 examples in our selection.
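Where the missing apostrophes do matter, one simple option is a small lookup of unambiguous forms applied before tokenisation. The sketch below is illustrative only: “Im” comes from the text above, while the other entries and the function name are assumptions. Forms such as “its”, “were” or “well” are deliberately left out because they are also valid words.

```python
import re

# Apostrophe-less forms mapped to their conventional spelling.
# Only "im" is taken from the text; the rest are assumed examples.
APOSTROPHE_FIXES = {
    "im": "I'm",
    "ive": "I've",
    "dont": "don't",
    "doesnt": "doesn't",
}


def restore_apostrophes(text):
    """Replace known apostrophe-less forms, leaving all other words alone."""
    def fix(match):
        word = match.group(0)
        # Original capitalisation is not preserved for replaced words.
        return APOSTROPHE_FIXES.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", fix, text)


print(restore_apostrophes("Im worried and dont know what to do"))
```

Words that already contain an apostrophe are split at the apostrophe by the pattern, so neither half matches the lookup and they pass through unchanged.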

Looking further into the typographical errors, 36 involve errors of space placement.

Space Placement Error | Examples | Count
Deletion | eachother > each other; iam > i am | 25
Insertion | when ever > whenever; every thing > everything | 9
Transposition | o fmy > of my; wantt o > want to | 2
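The three kinds of space placement error can each be checked against a word list. The following is a rough sketch, assuming a vocabulary set is available; the tiny vocabulary and the function names here are hypothetical and serve only to run the examples from the table.

```python
# Toy vocabulary (assumption); a real word list would be far larger.
VOCAB = {
    "each", "other", "i", "am", "when", "ever", "whenever",
    "every", "thing", "everything", "of", "my", "want", "to",
}


def split_run_on(token, vocab):
    """Space deletion: return (left, right) if token is two known words joined."""
    if token in vocab:
        return None
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in vocab and right in vocab:
            return left, right
    return None


def join_split_word(first, second, vocab):
    """Space insertion: return the joined word if it is a known word."""
    joined = first + second
    return joined if joined in vocab else None


def fix_shifted_space(first, second, vocab):
    """Space transposition: try moving the space one character left or right."""
    run = first + second
    cut = len(first)
    for i in (cut - 1, cut + 1):
        if 0 < i < len(run) and run[:i] in vocab and run[i:] in vocab:
            return run[:i], run[i:]
    return None


print(split_run_on("eachother", VOCAB))
print(join_split_word("when", "ever", VOCAB))
print(fix_shifted_space("o", "fmy", VOCAB))
```

Insertion errors are the hardest of the three to detect automatically, since both halves (“when” and “ever”) are usually valid words in their own right.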

A further 7 are examples of word substitution where the substituted word is not a homophone of the intended word (homophones are included in phonetic errors). Examples of these include “you” > “your” and “my” > “me”.

The remaining typographical errors fall into the categories in the table below.

Error Type | Examples | Count
Letter Deletion | becaue > because; syptoms > symptoms | 35
Letter Transposition | lieks > likes; develpoed > developed | 18
Letter Insertion | piulls > pills; pregnaunt > pregnant | 17
Letter Substitution | mush > much; ma > my | 14
Complex Combination | alchohl > alcohol; pregnate > pregnant | 5
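The classification in this table was carried out by hand. As an illustration of how the four single-edit categories relate to one another, the sketch below labels a typed/target pair by the one edit that separates them and falls back to “complex” otherwise. The function name and labels are assumptions, not part of the original analysis.

```python
def classify_edit(typed, target):
    """Label the single-letter edit that turns target into typed, if any."""
    if typed == target:
        return "none"
    lt, lg = len(typed), len(target)
    if lt + 1 == lg:
        # Typed form is the target with one letter left out.
        if any(target[:i] + target[i + 1:] == typed for i in range(lg)):
            return "deletion"
    elif lt == lg + 1:
        # Typed form is the target with one extra letter.
        if any(typed[:i] + typed[i + 1:] == target for i in range(lt)):
            return "insertion"
    elif lt == lg:
        diffs = [i for i in range(lt) if typed[i] != target[i]]
        if len(diffs) == 1:
            return "substitution"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and typed[diffs[0]] == target[diffs[1]]
            and typed[diffs[1]] == target[diffs[0]]
        ):
            # Two adjacent letters swapped.
            return "transposition"
    return "complex"


pairs = [
    ("becaue", "because"),
    ("lieks", "likes"),
    ("piulls", "pills"),
    ("mush", "much"),
    ("alchohl", "alcohol"),
]
for typed, target in pairs:
    print(typed, ">", target, ":", classify_edit(typed, target))
```

These four edit types are the operations of Damerau-Levenshtein distance, which is one reason edit-distance-based correctors tend to cope well with this class of error.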

Chat-Style Errors

A further analysis of the chat-style errors showed that the overwhelming majority are abbreviations rather than acronyms.

There were only two examples of acronyms, both occurring at the end of the same message: wb for write back and the more commonly used asap. (Here we are talking specifically about acronyms for chat-related functions rather than things such as BMI for Body Mass Index.)

The abbreviations used tend to fit general patterns or conventions, and there is generally a one-to-one relationship between abbreviations and target words. In the selection analysed we have examples of:

vowel changes

  • vowels being missed out of words (jst > just; bt > but; thr > there; rly > really)
  • vowels and final e being changed for a single vowel (sum > some; lyk > like)
  • diphthongs changing to a single vowel (duznt > doesn't; frendz > friends; shud > should)

consonant changes

  • s changing to z even where this does not result in an abbreviation (frendz > friends; itz > it's)
  • th going to z, d or f (za > the; fink > think; deir > their)
  • f changing to v (ov > of)
  • silent h missing (wen > when; wat > what)
  • final g dropped (aveing > having)
  • opening consonant dropped (aveing > having)

syllable changes

  • er changes to a (ova > over; uva > other)
  • ough shortened (tho > though)

full word changes

  • numbers being used in abbreviations (4 > for; m8 > mates)
  • letters standing for words (n > and; r > are; u > you; y > why; bf > boyfriend)
  • word shortening (brill > brilliant)

two words joined

  • of appended with a (loadsa > loads of; kinda > kind of)
  • to appended with a (wanna > want to)
  • other contractions (dunno > don't know; waza > what's up)


  • please changing to plz
  • because abbreviated to cus, cuz, cos, coz

This is a fairly small set of messages and there is likely to be more variety in the corpus as a whole. In this selection the most varied abbreviations are found with the word “because”, where we have “cos”, “coz”, “cus” and “cuz”; even these, however, are combinations of the single features described above. An interesting observation on spelling in general, and one which is particularly true of chat-style abbreviations, is that there is a huge difference between messages, with some users avoiding chat-style language and others making full use of it. This may reflect the user's familiarity with instant messaging, forum writing and perhaps text messaging, but it also reflects the register considered appropriate for addressing medical questions to Dr Ann, with some users selecting a very formal register and others a much more informal one.
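Since the chat-style abbreviations listed above mostly map one-to-one onto target words, a plain lookup table is one obvious way to normalise them. The sketch below uses only pairs that appear in the lists above; the function name and the whitespace tokenisation are simplifying assumptions.

```python
# Abbreviation -> target word, taken from the examples listed above.
CHAT_LOOKUP = {
    "u": "you", "r": "are", "n": "and", "y": "why", "4": "for",
    "cos": "because", "coz": "because", "cus": "because", "cuz": "because",
    "sum": "some", "lyk": "like", "jst": "just", "bt": "but",
    "rly": "really", "wen": "when", "wat": "what", "plz": "please",
    "wanna": "want to", "kinda": "kind of", "dunno": "don't know",
}


def expand_chat(text, lookup=CHAT_LOOKUP):
    """Replace whitespace-separated tokens that match a known abbreviation."""
    return " ".join(lookup.get(token.lower(), token) for token in text.split())


print(expand_chat("plz help cuz u r my only hope"))
```

A lookup of this kind will over-correct tokens that are also legitimate in their own right, such as “4”, “sum”, “cos” or “y”, and it ignores tokens with punctuation attached, so in practice some context checking would be needed.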

Phonetic Errors

Phonetic errors have been separated from typographical errors because the two have a different relationship with the target word. In the case of typographical errors, the relationship between the typed word and the target word is based on the position of letters on the keyboard or on sequences of frequently typed letters. With phonetic errors there is a more direct relationship between the typed word and the target word, which phonetic-based algorithms should be able to handle effectively.
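The text does not say which phonetic algorithm was under consideration. Purely as a familiar example of the approach, the sketch below implements standard American Soundex, which maps words that sound alike to the same four-character key; the function name is an assumption.

```python
# Soundex digit for each coded consonant; vowels, y, h and w have no code.
SOUNDEX_CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the four-character American Soundex key for a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (key + "000")[:4]


for typed, target in [("marige", "marriage"), ("egsisting", "existing"),
                      ("probarbly", "probably")]:
    print(typed, soundex(typed), target, soundex(target))
```

The first two pairs share a key, while “probarbly” and “probably” do not, because the inserted “r” is itself a coded consonant. This illustrates that a simple phonetic key handles some, but not all, of the errors classed as phonetic here.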

More detailed analysis of the phonetic errors follows the same pattern as that used for typographical errors and the results can be seen in the table below.

Error Type | Examples | Count
Letter Insertion | dissorder > disorder; scruews > screws; drinkes > drinks | 36
Letter Substitution | shrivals > shrivels; raisen > raisin; descusting > disgusting | 26
Letter Deletion | gaynes > gayness; realy > really; obsesive > obsessive | 19
Homophone Substitution | too > to; no > know; band > banned | 17
Complex Combination | flemmy > phlegmy; masterbaiting > masturbating; sigerate > cigarette | 7
Multiple Letter Substitution | dieing > dying; egsisting > existing | 3


The examples of emphasis in the messages used for analysis involve only the word “so” being emphasised with the addition of several “o”s and the word “yo” being emphasised in the same way. In the larger corpus, however, examples have been seen which involve “please” being extended on the “e”. This is by no means the only, or even the most common, way that emphasis is expressed in the corpus. Capital letters are very frequently used, as are repeated exclamation marks and question marks; repeated words, in particular the word “please”, are also found.
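Letter lengthening of this kind can be reduced with a simple pattern that collapses any run of three or more identical letters to a single letter. This is only a sketch under that assumption; the function name is hypothetical, and “pleeease” is an invented spelling of the lengthened “please” described above.

```python
import re

# A letter followed by at least two more copies of itself.
_STRETCHED = re.compile(r"([A-Za-z])\1{2,}")


def collapse_emphasis(text):
    """Collapse runs of three or more identical letters to one letter."""
    return _STRETCHED.sub(r"\1", text)


print(collapse_emphasis("i am soooo worried pleeease help"))
```

Collapsing to a single letter gives the wrong result where the target word itself contains a double letter (a lengthened “too” would become “to”), so the output would still need checking against a word list.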


