Centre for Research in Applied Linguistics

Corpus Linguistics Network


Aims of the Corpus Linguistics Network

The Corpus Linguistics Network is a research group which brings together people who have an interest in corpus linguistics, including students, PhD researchers and more established academics, in the School of English and other Schools and departments across the University of Nottingham. Anyone is welcome to join, including those who are new to the discipline.

Over the course of this academic year, the Corpus Linguistics Network will be hosting an exciting series of monthly talks, given by a mixture of internal and external speakers, which demonstrate the wide range of research in and applications of corpus linguistics methods at Nottingham and other Universities.


Upcoming Talks

David Wright (Nottingham Trent University)

N-gram textbites and author identification: A corpus approach to a forensic problem

Wed. 9 Dec. 2015, 3.30pm (Trent A35)

Kim-Sue Kreischer (University of Nottingham)

Integrating a cognitive perspective into corpus analysis

Wed. 20 Jan. 2016, 3pm (Trent A35)

Paul Bonham (University of Nottingham)

Mental health online: language and identity in self-help

Wed. 10 Feb. 2016, 3pm (Trent A35)

Olivia Walsh (University of Nottingham)

Linguistic Purism in France and Quebec (1865-2000)

Wed. 9 Mar. 2016, 4pm (Trent A35)

Mark Cole (University of Nottingham)

A Discourse Analysis of Hand Hygiene Policy in NHS Trusts

Wed. 13 Apr. 2016, 4pm (Trent A35)

Karen Kinloch (Lancaster University)

Uncertain outcomes - discourses of risk and chance around infertility

Wed. 11 May 2016, 3pm (Trent A35) 

If you have any suggestions or would like to get involved (by presenting your own research), please contact Gavin Brookes.

Previous Events

Postgraduate Symposium 'Corpus Linguistics beyond Boundaries: Interdisciplinary Applications' (10 July 2015)

Summer School 'Corpus Linguistics: Tools and Applications' (7-9 July 2015)

25/03/2015: Talk by Paul Rayson (Lancaster University) - 'Can you adapt a modern semantic tagger for Early Modern English corpora?'


Paul Rayson

  • Lancaster University
  • Director of the UCREL Research Centre
  • Senior lecturer in the School of Computing and Communications

Paul Rayson is director of the UCREL Research Centre and a senior lecturer in the School of Computing and Communications. His research interests are based on applications of corpus-based natural language processing to address significant challenges in a number of different areas: child protection in online social networks, better understanding of the language of extremism and counter extremism, text mining for conceptual history studies, the quality of the corporate financial information environment and the use of metaphorical language in end-of-life care. His methodological contributions are in the areas of key semantic domains and corpus analysis software – Wmatrix was in fact developed by Paul. His talk will be entitled: “Can you adapt a modern semantic tagger for Early Modern English corpora?” (see the full abstract below).


Can you adapt a modern semantic tagger for Early Modern English corpora?

In this talk, I will present joint research from the Samuels project, where we are carrying out a number of case studies on two very large corpora of around 1-2 billion words each: (a) Early English Books Online (EEBO) Text Creation Partnership (TCP) consisting of over 53,000 transcribed books published between 1473 and 1700 and (b) two hundred years of UK Parliamentary Hansard made up from over 7 million files. In this talk I will describe the changes that we've made to the Wmatrix tag wizard in order to address historical spelling variation and meaning change over time. I will describe the latest version of the VARD (Variant Detector) software which allows us to pre-process historical corpora and match modern forms to historical variants, thus improving tagging accuracy. In order to have a historically valid taxonomy, we have adopted the Historical Thesaurus of English (developed at the University of Glasgow) and the Oxford English Dictionary, thus helping us improve methods for the automatic semantic analysis of historical texts. The Historical Thesaurus contains 793,742 word forms arranged into 225,131 semantic categories. The combination and scale of the corpora and the size of the taxonomy pose significant computational challenges for existing retrieval methods (Wmatrix) and annotation software (USAS) and I will describe our current solutions to these problems.
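As a rough illustration of the pre-processing idea mentioned in the abstract (matching historical spelling variants to modern forms before tagging), the sketch below uses a plain lookup table. It is a simplified stand-in and not the VARD software or its interface; the variant list and the function name `normalise` are assumptions made for this example only.

```python
# Hypothetical, hand-made variant list for illustration only.
# A real tool would draw on far larger resources and further matching methods.
KNOWN_VARIANTS = {
    "loue": "love",
    "haue": "have",
    "vertue": "virtue",
    "vpon": "upon",
}


def normalise(tokens, variants=KNOWN_VARIANTS):
    """Return (modern_form, original_form) pairs for each token.

    Tokens not found in the variant list are passed through unchanged.
    The original spelling is kept alongside the modern form so that no
    information is lost before tagging.
    """
    return [(variants.get(token.lower(), token), token) for token in tokens]


if __name__ == "__main__":
    for modern, original in normalise(["I", "haue", "loue", "vpon", "vertue"]):
        print(f"{original} -> {modern}")
```

Keeping both forms means a tagger can be run on the modernised spelling while searches and concordances can still show the text as originally printed.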


13/03/2015: Discussion of the research article ‘Language is never, ever, ever, random’ (Kilgarriff, 2005)

We are meeting on Friday, 13 March 2015, 3.00pm at Trent A35 to discuss the research article ‘Language is never, ever, ever, random’ by Adam Kilgarriff published in Corpus Linguistics and Linguistic Theory, 1(2), 2005.

We will have tea and biscuits and hope to see you there!


27/02/2015: Talk by Laurence Anthony (Waseda University, Japan): 'New Developments in Corpus Tools for Data Collection, Analysis, and Visualization'

The Corpus Linguistics Workshop and the Vocabulary Research Group are hosting a talk by Dr Laurence Anthony (Waseda University, Japan) on 27 February, entitled 'New Developments in Corpus Tools for Data Collection, Analysis, and Visualization'. This event will take place on Friday, February 27th, Trent A35, from 3:30 pm, followed by a small wine reception. As usual, please let us know if you want to attend via email to Lorenzo Mastropierro or Viola Wiegand.


Dr Laurence Anthony

(Center for English Language Education, Waseda University, Japan; Honorary Research Fellow, Lancaster University, UK)


Laurence Anthony is Professor of Educational Technology and Applied Linguistics at the Faculty of Science and Engineering, Waseda University, Japan. His main interests are in corpus linguistics tools development and English for Specific Purposes (ESP) program design and teaching methodologies. He received the National Prize of the Japan Association for English Corpus Studies (JAECS) in 2012 for his work in corpus software tools design. He is the developer of various corpus tools including AntConc, AntWordProfiler, AntMover, EncodeAnt, SarAnt, TagAnt, and VariAnt.


In this talk, I will first discuss recent changes made to AntConc that will allow the software to work quicker and more easily with very large, annotated corpora. Next, I will introduce a range of newly developed freeware desktop and web-based parallel corpus tools that enable corpus linguists to easily collect, clean, and standardize corpus data, analyze monolingual and parallel corpora in a variety of ways, and also visualize the results of corpus analyses in the form of tables, dispersion plots, and network graphs. At the end of the talk, I will discuss future directions for corpus research and invite the audience to consider possible tools that might facilitate their own corpus linguistics research.


20/02/2015: Spring term opening event - Talk by Andrew Kehoe

We are pleased to host a special event with Dr Andrew Kehoe to open the spring term [see the event poster].

The event will take place on Friday, February 20th, Trent A35, from 3:30 pm, followed by a small wine reception. Please let us know if you want to attend via email to Lorenzo or Viola.

Andrew is Director of the Research & Development Unit for English Studies (RDUES) at Birmingham City University. The RDUES team has in recent years developed the WebCorp suite of online search tools for linguistic study and the eMargin collaborative text annotation system. He has research interests in all aspects of corpus linguistics, including the development of software tools for the identification and visualisation of language change across time and the use of the web as a source of natural language data. The title of his talk is: “Reader comments on online news articles: A corpus-based analysis”.


Reader comments on online news articles: a corpus-based analysis

Andrew Kehoe, Birmingham City University

Launched in March 2006, ‘Comment is Free’ is a section on The Guardian website where non-journalists can, by invitation, write a blog post on any subject of their choosing. Readers are encouraged to comment on these blog posts and take part in discussions, with some posts generating over 1000 comments. A fortnight after the launch of Comment is Free, The Guardian began to allow reader comments on conventional news articles across all sections of its website. Hermida & Thurman (2008: 6) report that five other UK newspaper websites were allowing reader comments on news articles by the end of 2006. The integration of blogs and reader comments – so-called ‘user-generated content’ – across such websites has led to a blurring of the boundaries between opinion and hard news, and between professional and non-professional writing.

This paper presents a corpus linguistic analysis of comments across The Guardian website since their introduction in 2006, based upon a corpus of over 500,000 articles and blog posts. The first part of the paper adopts a ‘key words’ approach to explore the differences between Comment is Free and the other sections of the website, and whether or not these differences are becoming less pronounced over time.

The second half of the paper explores the distribution of reader comments across blog posts and articles (henceforth referred to collectively as ‘articles’). Our initial analysis has suggested that comments are permitted on around 40% of articles and, where commenting is permitted, the vast majority of articles (85%) have at least one comment. The Guardian’s commenting policy is rather vague, stating only that comments are not allowed on ‘stories about particularly divisive or emotional issues’. In this paper, we are able to identify sub-sections of the newspaper’s website where commenting is most prevalent and where it is most likely to be banned outright. Taking the analysis further, through the extraction of keywords we identify the specific topics which are most likely to generate debate, often relating to politics, religion and social issues. Moreover, we are able to identify specific words indicative of particular styles of writing which encourage the most reader discussion.

Overall, this paper offers insights into changing newspaper practices and reader behaviour through lexical analyses of a large corpus of articles and comments. With the continued growth of user-generated content, the work is of potential interest across disciplines. From a practical perspective, the work offers suggestions for the refinement of automated spam detection and moderation procedures.

Hermida, A. & N. Thurman (2008) ‘A clash of cultures: the integration of user-generated content within professional journalistic frameworks at British newspaper websites’. Journalism Practice 2(3), 343-356.
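For readers new to the 'key words' approach mentioned in the abstract above, the following is a minimal sketch of one common way to compute keyness: the log-likelihood statistic comparing a word's frequency in a study corpus against a reference corpus. The abstract does not say which statistic was used, so the choice of log-likelihood, the function names and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness score for one word in two corpora.

    Compares observed frequencies with the frequencies expected if the
    word were spread evenly across both corpora.
    """
    combined = freq_study + freq_ref
    total = size_study + size_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def keywords(study_tokens, ref_tokens, top_n=10):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    n_study, n_ref = len(study_tokens), len(ref_tokens)
    scored = []
    for word, freq in study.items():
        # Keep only positive keywords (over-represented in the study corpus).
        if freq / n_study > ref[word] / n_ref:
            scored.append((word, log_likelihood(freq, n_study, ref[word], n_ref)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Toy token lists standing in for two sections of a website.
    opinion = "i think this is wrong and i think we all know it".split()
    news = "the minister said the report is due next week".split()
    for word, score in keywords(opinion, news, top_n=3):
        print(word, round(score, 2))
```

In practice both corpora would contain many thousands of tokens, and a minimum frequency threshold is often applied so that very rare words do not dominate the list.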


20/01/2015: 'Challenges to surveillance II: Interdisciplinary reflections on the UK Terrorism Act and other cases' [Research workshop]

The Corpus Linguistics Workshop co-hosts the research workshop 'Challenges to surveillance II: Interdisciplinary reflections on the UK Terrorism Act and other cases' on Tuesday, 20 January, 9.30-13.00 at Highfield A09.

The first event in this series ('Challenges to surveillance – Interdisciplinary perspectives') took place in December 2014. The speakers, with backgrounds ranging from law and history to linguistics, are listed on the event flyer.

Anyone interested in the topic of surveillance is welcome and light refreshments will be served.

Please contact Viola Wiegand in case of any questions.


05/12/2014: Discussion of the article 'A linguistic account of wordplay: The lexical grammar of punning' (Partington, 2009)

The next regular meeting of the Corpus Linguistics Workshop is taking place on Friday, 5 December, 2014, 3.00pm at Trent A35. Everyone is welcome; tea and Christmassy snacks will be provided. We'll discuss Llauradó & Tolchinsky's (2013) "Growth of text-embedded lexicon in Catalan: From childhood to adolescence", published in First Language, 33 (6). Eduard Abelenda will also present his own research based on the same corpus.


Lexical development is a key facet of later language development. To characterize the linguistic knowledge of school age children, performance in the written modality must also be considered. This study tracks the growth of written text-embedded lexicon in Catalan-speaking children and adolescents. Participants (N = 2161), aged from 5 to 16 years produced six different texts: a film explanation, a film recommendation, a joke telling and definitions of a noun, a verb and an adjective. The resultant corpus of 11,332 texts was analyzed using four distributional measures of lexical development: word length, lexical density, use of adjectives and nominalizations. Heylighen’s F-measure of level of text formality was also computed. Word length, use of adjectives and nominalizations were powerful indicators of lexical development. Text type and home language had an effect on these measures. Lexical density showed no clear developmental change, and did not vary by type of text. Heylighen’s F-measure was a weaker developmental indicator. Educational implications are discussed.
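To make the measures named in the abstract above more concrete, here is a minimal sketch of three of them computed over part-of-speech tagged tokens: mean word length, lexical density and Heylighen's F-measure of formality. The tag names follow a Universal-POS-style scheme, lexical density is taken as the share of content words, and determiners stand in for articles; these are assumptions for the example and not necessarily how the study operationalised the measures.

```python
from collections import Counter

# Assumed content-word tags (Universal-POS-style labels).
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def mean_word_length(tagged):
    """Average number of characters per token."""
    return sum(len(word) for word, _ in tagged) / len(tagged)


def lexical_density(tagged):
    """Proportion of tokens that are content words."""
    content = sum(1 for _, tag in tagged if tag in CONTENT_TAGS)
    return content / len(tagged)


def f_measure(tagged):
    """Heylighen's F-measure of formality, on a 0-100 scale.

    F = (noun + adjective + preposition + article
         - pronoun - verb - adverb - interjection + 100) / 2,
    where each term is that category's percentage of all tokens.
    DET is used here as an approximation of the article category.
    """
    counts = Counter(tag for _, tag in tagged)
    pct = {tag: 100 * n / len(tagged) for tag, n in counts.items()}
    formal = sum(pct.get(tag, 0) for tag in ("NOUN", "ADJ", "ADP", "DET"))
    deictic = sum(pct.get(tag, 0) for tag in ("PRON", "VERB", "ADV", "INTJ"))
    return (formal - deictic + 100) / 2


if __name__ == "__main__":
    sample = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), ("quietly", "ADV")]
    print(mean_word_length(sample), lexical_density(sample), f_measure(sample))
```

Higher F-measure values indicate a more formal, context-independent style; tracking such scores by age group is one way of charting developmental change.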


02/12/2014: Research Workshop "Challenges to Surveillance - Interdisciplinary Perspectives"

The CRAL Corpus Linguistics Workshop is co-hosting the research workshop “Challenges to Surveillance: Interdisciplinary Perspectives” on Tuesday, 2 December 2014 (Trent, A35). This workshop is collaboratively organised by researchers from the Department of History, the Horizon Centre for Doctoral Training, the Human Rights Law Centre, the School of Computer Science and CRAL. In four talks, the topic of surveillance will be discussed from the perspectives of the different disciplines.

No prior registration is required and everyone is welcome. If you have any questions, please contact Viola Wiegand.

See the workshop programme.

21/11/2014: Discussion of the article "Discovering formulaic language through data-driven learning: Student attitudes and efficacy" (Geluso & Yamaguchi, 2014)

Discussion of the article "Discovering formulaic language through data-driven learning: Student attitudes and efficacy" by Joe Geluso and Atsumi Yamaguchi (ReCALL, 2014).
21st November 2014, Trent A35, 3.00pm


Corpus linguistics has established that language is highly patterned. The use of patterned language has been linked to processing advantages with respect to listening and reading, which has implications for perceptions of fluency. The last twenty years has seen an increase in the integration of corpus-based language learning, or data-driven learning (DDL), as a supporting feature in teaching English as a foreign / second language (EFL/ESL). Most research has investigated student attitudes towards DDL as a tool to facilitate writing. Other studies, though notably fewer, have taken a quantitative perspective of the efficacy of DDL as a tool to facilitate the inductive learning of grammar rules. The purpose of this study is three-fold: (1) to present an EFL curriculum designed around DDL with the aim of improving spoken fluency; (2) to gauge how effective students were in employing newly discovered phrases in an appropriate manner; and (3) to investigate student attitudes toward such an approach to language learning. Student attitudes were investigated via a questionnaire and then triangulated through interviews and student logs. The findings indicate that students believe DDL to be a useful and effective tool in the classroom. However, students do note some difficulties related to DDL, such as encountering unfamiliar vocabulary and cut-off concordance lines. Finally, questions are raised as to the students’ ability to embed learned phrases in a pragmatically appropriate way.


07/11/2014: Discussion of the article "Corpus linguistics and theoretical linguistics: A love–hate relationship? Not necessarily..." (Gries, 2010)

Discussion of the article "Corpus linguistics and theoretical linguistics: A love–hate relationship? Not necessarily..." by Stefan Gries (International Journal of Corpus Linguistics, 2010). You can download the article from the International Journal of Corpus Linguistics.
7th November 2014, Trent A35, 3.00pm

24/10/2014: 2014/2015 Opening Event - Talk by Dr Paul Thompson ('Writing between disciplines: data-driven approaches to interdisciplinary research discourse')

We are pleased to have Dr Paul Thompson from the University of Birmingham speak at this year's opening event (also see the event poster). The talk will take place on Friday, 24th October, 3:00pm at Trent A35 and will be followed by a drinks reception.


It is generally accepted now that many real-world problems are best addressed by a number of disciplines working together rather than by individual disciplines alone. In the UK, research councils promote interdisciplinary research activity, and universities in turn encourage academics to collaborate with colleagues in other disciplines. In order to facilitate effective communication between such researchers, we believe that it is important to develop a fuller description of what the distinctive features of discourse practices in interdisciplinary research are and of how they differ from discourse practices in conventional disciplines.

As a step toward this goal we are investigating the discourse of a successful journal in an interdisciplinary field: Global Environmental Change (GEC). We are investigating the extent to which this field operates as a unified whole, the extent to which journal authors in the field broaden their messages to a multidisciplinary audience, and the extent to which each discipline in the field maintains a discrete identity.

One of the challenges of investigating interdisciplinary research discourse is to find ways to categorise texts - within an interdisciplinary journal, for example, are some articles more 'interdisciplinary' than others? Are some papers more typical of one or another discipline involved than of others? Typically, in corpus building, texts are classified by external criteria, but in this project, after the initial choice of texts in given journals, we approach the categorisation, or clustering of texts within the corpus through text-internal features, using Multidimensional Analysis.

The corpus contains the entire holdings of the journal Global Environmental Change in the period 1990-2010, and also the articles for five journals identified as monodisciplinary and another five which are classed as multidisciplinary, for the period 2001-2010. Taking all of the articles in this collection that are over 2000 words in length, we develop a new set of dimensions, in collaboration with Doug Biber, and then use the new set of dimensions to cluster articles. On this basis, we will see whether the texts cluster following discipline boundaries. We also look at this across time, to see whether discourse practices change as a field becomes more established and as an interdisciplinary community develops. In addition to MD analysis, we have also used topic modelling, we have examined the use of metadiscourse in the texts, and we have labelled all the papers in the corpus into types. I will report on some of the findings of our initial investigations in this talk.


04/07/2014: Discussion of research article "The peaks and troughs of corpus-based contextual analysis"

Trent A35, 3.00pm


"This paper focuses upon two issues. Firstly, the question of identifying diachronic trends, and more importantly significant outliers, in corpora which permit an investigation of a feature at many sampling points over time. Secondly, we consider how best to combine more qualitatively oriented approaches to corpus data with the type of trends that can be observed in a corpus using quantitative techniques. The work uses a recently completed ESRC-funded project as a case study, the representation of Islam in the UK press, in order to demonstrate the potential of the approach taken to establishing significant peaks in diachronic frequency development, and the fruitful interface that may be created between qualitative and quantitative techniques."

International Journal of Corpus Linguistics, 17(2)

Please download this article from the International Journal of Corpus Linguistics.


20/06/2014: Discussion of research article (Bednarek, 2008)


“In this paper I want to re-examine the key corpus-linguistic notion of semantic preference. This is defined here as the collocation of a lexical item with items from a specific (more or less general) semantic subset. The article aims to throw some light on the term semantic preference, and to examine in more detail some aspects of semantic preference that are frequently neglected in research. It also discusses how semantic preference interacts with syntax and meaning, and what happens when semantic preferences are not ‘realized’ in context. Finally, it seeks to illuminate the distinction between semantic preference and semantic prosody, and points to future research in this area.”

Corpus Linguistics and Linguistic Theory, 4(2)

Please download the article from the journal Corpus Linguistics and Linguistic Theory.


06/06/2014: Discussion of research article "The discourse of Olympic security: London 2012" (MacDonald & Hunter, 2013)


"This article uses a combination of critical discourse analysis (CDA) and corpus linguistics (CL) to investigate the discursive realization of the security operation for the London 2012 Olympic Games. Drawing on Didier Bigo’s (2008) conceptualization of the ‘ban-opticon’, it addresses two questions: (1) What distinctive linguistic features are used in documents relating to security for London 2012? (2) How is Olympic security realized as a discursive practice in these documents? Findings suggest that the documents indeed realized key features of the ban-opticon: exceptionalism, exclusion and prediction, as well as what we call ‘pedagogization’. Claims were made for the exceptional scale of the Olympic events; predictive technologies were proposed to assess the threat from terrorism; and documentary evidence suggests that access to Olympic venues was being constituted to resemble transit through national boundaries."

Discourse & Society, 24(1)

Please download this article from the journal Discourse & Society.


23/05/2014: Discussion of research article "Tracking learners’ actual uses of corpora: Guided vs non-guided corpus consultation" (Perez-Paredes et al., 2014)

Tracking learners’ actual uses of corpora: Guided vs non-guided corpus consultation (Perez-Paredes et al., 2014) 

Much of the research into language learners’ use of corpus resources has been conducted by means of indirect observation methodologies, like questionnaires or self-reports. While this type of study provides an excellent opportunity to reflect on the benefits and limitations of using corpora to teach and learn language, the use of indirect observation methodologies may confine the scope of research to learners’ opinions about the benefits of using corpora for language learning and their self-perceived difficulties in consulting them. This article proposes and discusses the use of logs to research learners’ actual use of corpus-based resources, analyzing the number of events or actions performed by each individual, the total number of different web services used, the number of activities completed, the number of searches performed on the British National Corpus (BNC) and, last, the number of words or wildcards per BNC search. Our research used these parameters to investigate whether learner interaction with corpus-based resources differed under different corpus consultation conditions: guided versus non-guided consultation. Our findings show that the individuals in the two research conditions behaved differently in two of the parameters analyzed: the number of different web services used during the completion of the tasks and the number of BNC searches. Our results corroborate empirically the suggestions found in the literature that skills and guidance are necessary when teachers take a corpus to the classroom. Similarly, we offer evidence that user tracking is essential to claim research and results validity.

Computer Assisted Language Learning Journal, 24(3).

Please download the paper from the Computer Assisted Language Learning Journal.

30/04/2014: Corpus Stylistics Workshop

This workshop is part of the ICAME conference, but is also open to day delegates and aims to contribute to the growing area of research that employs corpus linguistic methods in the study of literary texts. Further information on the programme and registration can be found on the Corpus Stylistics Workshop website.


04/04/2014: Discussion of research article "Phrasal irony: Its form, function and exploitation" (Partington, 2011)

Phrasal irony: Its form, function and exploitation (Partington, 2011)

Please find the abstract below and download the full article from the Journal of Pragmatics.

This paper is an examination of the as yet little-studied phenomenon of phrasal irony, defined as the reversal of customary collocational patterns of use of certain lexical items. The first research question is how phrasal irony is structured. A second, very closely related question is how, why and where writers use it, and a third question is how it relates to other more familiar types of irony. During the course of these investigations it was observed that, occasionally, the ironic use of a particular phrase or phrase template is found to be repeated frequently and productively and can therefore be said to have become a recognised usage in its own right. However, it was also noted that by no means all reversal of normal collocational patterning is performed with an ironic intent, and so yet a further research question is how the circumstances when phrasal irony is at play might differ from those of simple counterinstances to the statistically normal collocational patterns of use. Corpus methodology is used to locate ironic uses of phrase templates for examination. As Louw (1993) points out, before the advent of language corpora, detecting sufficient instances of such use, which can be quite rare, was problematic, and this may explain why so little previous attention has been given to these phenomena.

Journal of Pragmatics, 43(6)


21/03/2014: Discussion of research article "Individual differences and usage-based grammar" (Barlow, 2013)

Our next meeting will take place on Friday, March 21 (Trent A35, 15.00-16.30). We will discuss the research article "Individual differences and usage-based grammar" by Michael Barlow (International Journal of Corpus Linguistics, 18 (4), 2013). Please find the abstract below and download the full article from the International Journal of Corpus Linguistics.


Individual differences and usage-based grammar (Barlow, 2013)

Since usage-based theories such as cognitive grammar assume an intimate relationship between mental representations of grammar and the processing of instances of language (usage events), corpora have an important role in the development of grammatical analyses. One consequence of relying on corpus data is that individual differences in usage tend to be obscured. To overcome this problem and investigate individual differences in spoken usage, we examine a large corpus consisting of the spoken output of six White House press secretaries. The results provide strong evidence that within this one particular discourse context the patterns of speech of each individual are clearly recognisable. Furthermore, these idiolectal preferences are consistent and are maintained over a period of at least a year or two. In addition, we briefly explore some theoretical consequences and possible explanations for the disparity found between the speech of the individual and that of the discourse community.


06/03/2014: Talk by Dr Maristella Gatto (“A virtual(ly) multilingual corpus. Approaches, tools and resources, perspective”)

We are pleased to have a special guest for our launch event. Dr. Maristella Gatto, from the University of Bari (Italy), will speak about the use of the web as a corpus: “A virtual(ly) multilingual corpus. Approaches, tools and resources, perspective”. The event will take place on Thursday, March 6, Trent A19, 2:00 pm, and will be opened by Professor Michaela Mahlberg. Also see the event poster.

Dr Maristella Gatto is a researcher in English Language and Translation at the University of Bari (Italy), where she teaches English Language and Translation at both undergraduate and postgraduate level.


The web as a virtual(ly) multilingual corpus. Approaches, tools and resources, perspectives

Corpus resources have undergone significant changes over the past few years, moving along those lines of multilinguality, dynamic content, distributed architecture, virtuality, connection with web search, which Wynne (2002) envisaged as bound to characterize all linguistic resources in the 21st century. These changes have become particularly evident in a distinct trend at the confluence between Corpus Linguistics and Computational Linguistics, where the enormous potential of the web as a linguistic resource has been addressed under the umbrella term “Web as/for Corpus”. From the widespread – albeit controversial - practice of searching the web for immediate evidence of attested usage, to the development of web concordancers, to specific tools for the semi-automated compilation and exploration of monolingual and comparable corpora, the web plays a key role in corpus linguistics. Furthermore, corpus linguistics is now meeting the challenge of embracing the new paradigm subsumed under the label Web 2.0, whereby the web is increasingly experienced as a platform which enables users from all over the world to cooperate in processes and share products, turning virtually each single corpus linguist into a potential prosumer, i.e. a producer as well as a consumer, of corpus resources.

Such approaches and methods have not only changed the practice of corpus compilation and exploration but have also affected the way we are to conceive of corpora in the new millennium. This can be represented as a shift occurring in the basic metaphor underlying corpus resources, whereby the reassuring notion of a corpus as a ‘body’ of texts (i.e. a well proportioned corpus of authentic texts sampled so as to be representative of the whole) is complemented by a less reassuring, but possibly more functional, image of a corpus as a ‘web’ of texts. It goes without saying that while the notion of a linguistic corpus as a body of texts rests on some related issues such as finite size, balance, permanence, the very idea of a web of texts brings about notions of non-finiteness, flexibility, provisionality which need to be addressed if the web is to be used as a corpus on sound methodological bases.

It is against this complex background that we will explore this multifaceted research field, on the assumption that - far from being the easy way - using the web as/for corpus requires awareness of both corpus-as-body issues and specific web-as-corpus issues, which in turn entail a different approach on the user’s part and hence different research methods.



Gavin Brookes


Centre for Research in Applied Linguistics

The University of Nottingham

telephone: +44 (0) 115 951 5900
fax: +44 (0) 115 951 5924