Health Communication and the Internet: An Analysis of Adolescent Language Use on the Teenage Health Freak Website

Funded by the Economic and Social Research Council (ESRC)


One of the outputs from this project is an encyclopedia of 100 keywords in 5 topics.


Project Overview

The project (ESRC RES-000-22-3448) explores the integration of corpus-linguistic and sociolinguistic approaches for the analysis of a unique, 2-million-word longitudinal corpus of messages posted to the ‘Teenage Health Freak’ website. The descriptive advantages afforded by the tools of corpus linguistics will be utilised to inform sociolinguistic observations of adolescent language innovation and change on the specific topic of health care. Keywords and key phrases used by adolescent advice-seekers, with associated meanings and patterns of use over a period of 6 years, will be extracted from the corpus and then analysed to highlight emergent trends in adolescent sociolinguistic style and register. Beyond the academic value of this combined methodological innovation, the findings of the analysis will be made available to health care providers and users of health care services in the form of a practical, encyclopaedic resource. This will contribute to the continuous professional development of user groups in the NHS and serve as a resource for parents, teachers and adolescents themselves.

Over the past decade there has been a considerable rise in the use of the internet for the provision of health advice and information. The anonymous nature of internet communication is particularly appealing to young people who may be reluctant to discuss sensitive matters with a health professional in face-to-face interaction. The success of interactive, reputable health care websites aimed at adolescents provides a further indication of the growing demand for online advice on health-related issues. With a potential shift in the preferred discourse domain in which adolescents initially voice their health concerns, it is paramount that we develop methods and frameworks for capturing and analysing key concerns on the basis of data that is customarily collected by providers of interactive health care websites.

A combined, computer-assisted corpus linguistic and sociolinguistic approach to the analysis of these data allows us to establish common patterns of usage of individual words and phrases, associated meanings and distribution across users and contexts. The sociolinguistic analysis of specialised corpora of internet health communication is an area which is currently under-explored, and there is a need to develop a better understanding of the requirements for linguistic analysis of key sites of social interaction such as the one that forms the basis of this proposal. Corpus linguistics is well established as a methodology for analysing language in use. However, the main area of application of this methodology is still in language description for lexicographical purposes and for the development of technological applications. With ever larger language data-sets becoming available, many of which are ‘born digitally’, there is a real opportunity now to explore the value of corpus linguistics in addressing a range of key social science questions in the particular area of health communication. By integrating the descriptive linguistic advantages of corpus methods with a sociolinguistic focus on adolescents’ language innovation and change, the intention is to take corpus linguistics beyond the descriptive and produce a context-sensitive, socially informed approach with a clear practical outcome: a useful and informative resource for those providing adolescents with healthcare advice and information.


Details of papers and presentations based on this research can be found here.

Key words and Clusters

Topic Sketches

The topic sketches can be downloaded as PDF files below.


The key outputs will be a set of analytical results that relate to the meaning of key words and phrases in use in the data, a set of guidelines relating to the application of different corpus linguistic methods to this kind of data-set, and a list of recommendations relating to the presentation of results for use by health care practitioners and end-users.

Research Questions

  • How can the descriptive tools of corpus linguistics be combined with sociolinguistic approaches to 'style' to produce findings which will be of practical relevance to end users?
  • How can the corpus be divided into different time-based sub-corpora to enable a longitudinal analysis of changes in use of particular sociolinguistic styles and registers over the period of 6 years since the data collection started?
  • How can the results of the combined corpus and sociolinguistic analysis be best presented to be of maximum benefit to health professionals and other end users?

Methodological Questions

  • Drawing on methods in corpus linguistics, how may we best extract meaningful units from the data?
  • What are the effects of using different reference corpora for the purpose of comparative keyword analysis, and how do they affect the keyword list generated from the data of the Teenage Health Freak website? How do we identify suitable reference corpora to generate keywords from the different sets of data?
  • What are the effects of variations in spelling and terminology used in the messages that are submitted on the results of the analysis?

Background Information

The Teenage Health Freak website

Operated by UK-based GPs specialising in adolescent health, the Teenage Health Freak website has been running and continuously updated on a weekly basis since its launch in 2000. It has established itself as a very popular site. It is designed to be interactive, confidential and evidence-based, providing adolescents with accessible advice and information pertaining to a broad range of health issues. Adolescents are able to submit their health questions anonymously to the online GP persona, Doctor Ann.

The Teenage Health Freak corpus

The corpus used in the project comprises health questions sent to Doctor Ann through the 'Ask Doctor Ann' facility on the Teenage Health Freak website. The messages date from January 2004 to December 2009, a period of 6 years. In total the corpus contains 113,480 messages and 2,217,919 words. A more detailed overview of the corpus is available.


The main aim of the proposed research is to explore a corpus-linguistic approach to the analysis of teenage health concerns as evidenced in the messages sent to an interactive health advice website. The following methods were used:

Preparation of data

The data has been converted from spreadsheets to XML files and has been cleaned to remove empty, duplicate and near-duplicate messages. As far as possible, the spelling has also been corrected.
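The cleaning step can be illustrated with a minimal Python sketch. It is not the project's own procedure: the similarity measure (difflib's SequenceMatcher), the 0.9 threshold, the function names and the sample messages are all assumptions made up for illustration.

```python
import re
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and collapse punctuation and whitespace so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def clean_messages(messages, threshold=0.9):
    """Drop empty, duplicate and near-duplicate messages, keeping the first of each.

    `threshold` is a hypothetical similarity cut-off, not a value from the project.
    Every message is compared with every kept message, so the cost is quadratic
    and this is only practical for small samples.
    """
    kept = []       # original text of the messages we keep
    kept_norm = []  # their normalised forms, used for comparison
    for message in messages:
        norm = normalise(message)
        if not norm:
            continue  # empty message
        if any(SequenceMatcher(None, norm, other).ratio() >= threshold
               for other in kept_norm):
            continue  # exact or near duplicate of an earlier message
        kept.append(message)
        kept_norm.append(norm)
    return kept


# Invented sample messages, not taken from the corpus.
sample = [
    "Am I too young to start shaving?",
    "am i too young to start shaving??",
    "",
    "Am I too young to start shaving",
    "How can I stop smoking?",
]
print(clean_messages(sample))
```

For a corpus of more than 100,000 messages, a pairwise comparison of this kind would be slow, so some form of hashing or blocking would probably be needed before near-duplicates are compared.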

Further details of how this was achieved can be found here.

An analysis of the volume and type of spelling errors in the corpus can be found here.

Extraction and analysis of key words and key sequences

WordSmith Tools was used to extract key words from the corpus using the British National Corpus (BNC) as the reference corpus. One of the main aspects of a text highlighted by keywords is the text's 'aboutness'. The keywords were therefore used to establish the main themes in the data and were classified by theme.
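To make the idea of keyness concrete, the sketch below scores words by comparing their frequency in a study corpus with their frequency in a reference corpus. The page does not say which keyness statistic was selected in WordSmith Tools; log-likelihood is one commonly used option and is assumed here. The function names and the two toy token lists are invented for illustration, with the second list standing in for a reference corpus.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood score for a word that occurs a times in a study corpus
    of c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens, top=5):
    """Return the `top` words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]  # Counter returns 0 for unseen words
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Invented toy data, not taken from either corpus.
study = "i am worried about my weight and my periods".split()
reference = "the committee said the report about the budget was late".split()
for word, score in keywords(study, reference):
    print(f"{word}\t{score:.2f}")
```

Swapping in a different reference list changes which words score highly, which is the effect raised in the methodological questions about the choice of reference corpus.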

Sequences of two to six words were extracted from the Teenage Health Freak corpus using an adjusted frequency algorithm, specifically the serial cascading algorithm outlined in O'Donnell, M.B. (2011), 'The adjusted frequency list: A method to produce cluster-sensitive frequency lists', ICAME Journal 35: 135-169. Key clusters were then generated using this word list with clusters extracted from the BNC as the reference corpus.
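As a rough illustration of what extracting sequences of two to six words involves, the sketch below counts raw word sequences within each message. It does not implement the adjusted frequency or serial cascading procedure from O'Donnell (2011); it produces plain, unadjusted counts only. The function name, the whitespace tokenisation and the sample messages are assumptions.

```python
from collections import Counter


def ngram_counts(messages, n_min=2, n_max=6):
    """Count every sequence of n_min to n_max words, message by message,
    so that no sequence runs across a message boundary."""
    counts = Counter()
    for message in messages:
        tokens = message.lower().split()  # naive whitespace tokenisation
        for n in range(n_min, n_max + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return counts


# Invented sample messages, not taken from the corpus.
messages = ["i think i am pregnant", "i think i need help"]
for cluster, freq in ngram_counts(messages).most_common(3):
    print(freq, cluster)
```

In raw counts like these, a short sequence is counted every time it appears inside a longer one, so shorter sequences always look at least as frequent as the longer sequences that contain them.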

A selection of the keywords and key clusters generated can be downloaded in the Results/Materials section below.

Representation of emerging descriptions of health concerns

Using the keywords as described above, five main areas of concern emerge. These are:

Sex, pregnancy and relationships
Sexual body parts
Body changes
Smoking, drugs and alcohol
Weight and eating

Topic sketches were created to illustrate these topics. The topic sketches show the distribution of the words from the topic by gender and age and also by year. The topic sketches can be downloaded in the Results/Materials section below.
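The kind of distribution shown in a topic sketch can be approximated with a short Python sketch. The record layout (the field names "year", "gender" and "text"), the topic word list and the sample records are hypothetical, and normalising to occurrences per 1,000 words is an assumption rather than the project's stated method.

```python
from collections import defaultdict


def topic_rate(records, topic_words, field):
    """Occurrences of topic words per 1,000 tokens, grouped by a metadata field."""
    hits = defaultdict(int)    # topic-word occurrences per group
    totals = defaultdict(int)  # all tokens per group
    for record in records:
        tokens = record["text"].lower().split()
        group = record[field]
        totals[group] += len(tokens)
        hits[group] += sum(1 for token in tokens if token in topic_words)
    return {group: 1000 * hits[group] / totals[group]
            for group in totals if totals[group]}


# Invented records and topic words, not taken from the corpus.
records = [
    {"year": 2004, "gender": "f", "text": "i want to stop smoking"},
    {"year": 2004, "gender": "m", "text": "is smoking bad"},
    {"year": 2005, "gender": "f", "text": "how much alcohol is safe"},
]
topic_words = {"smoking", "alcohol"}
print(topic_rate(records, topic_words, "year"))
print(topic_rate(records, topic_words, "gender"))
```

Normalising by the number of tokens in each group keeps years or groups with more messages from dominating the comparison.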

Analysis of units of meaning and sociolinguistic patterns of use

Word sketches were created to illustrate the use of particular words in the corpus. Like the topic sketches, the word sketches show the distribution of each word by gender, age and year. They also show sample concordance lines and the clusters the word forms with other words. The top 20 keyword sketches from each of the five topics can be downloaded in the Results/Materials section below.
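A concordance line shows a word together with a few words of context on either side. The sketch below is a minimal keyword-in-context routine, assuming whitespace tokenisation and a fixed window of words; the function name, the window size and the sample messages are invented for illustration and do not reproduce the output of the tools used in the project.

```python
def concordance(messages, keyword, width=4):
    """Return (left context, matched token, right context) for each occurrence
    of `keyword`, with up to `width` words of context on each side."""
    lines = []
    for message in messages:
        tokens = message.split()
        for i, token in enumerate(tokens):
            # Compare case-insensitively and ignore trailing punctuation.
            if token.lower().strip(".,!?") == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines


# Invented sample messages, not taken from the corpus.
messages = [
    "I am worried about my weight.",
    "Is my weight normal for my age?",
]
for left, token, right in concordance(messages, "weight"):
    print(f"{left:>30}  [{token}]  {right}")
```

Right-aligning the left context lines up the matched word in a single column, which makes recurring patterns to its left and right easier to spot.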

Project Details

Grant Period

January 2010 – December 2010


ESRC Grant number

RES-000-22-3448

Health Language Research Group

Visit HLRG website


