Loading and exploring the data

Executive summary

In this document, we show that we should use the Python-generated total word count for calculating normalised frequencies (or the counts provided in the metadata, as the two are nearly identical and well correlated across all text lengths).


In this file, we start by loading and exploring the data for the Australian Obesity Corpus.

Code
library(here)
library(janitor)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
theme_set(theme_minimal())
# Read a single CQPWeb export file (tab-separated, with three header lines to skip)
read_cqpweb <- function(filename){
  read.csv(
    here("100_data_raw", filename), 
    skip = 3, sep = "\t") %>% 
    janitor::clean_names()
}
# Read and row-bind all per-text frequency files in a raw-data subdirectory
clean_fatneglabel <- function(dirname){
  purrr::map_dfr(
    list.files(
      here("100_data_raw", dirname )
      ),
    ~{read.csv(
      here(
        paste0("100_data_raw/", dirname), .x), 
      skip = 3, sep = "\t") %>% 
        janitor::clean_names()
      })
}
condition_first <- read_cqpweb("aoc_all_condition_first.txt")
person_first <- read_cqpweb("aoc_all_person_first.txt")
adj_obese <- read_cqpweb("aoc_all_obese_tagadjlemma.txt")
adj_overweight <- read_cqpweb("aoc_all_overweight_tagadjlemma.txt")
fat_labelled <- clean_fatneglabel("fat_neg_label_yes_textFreqs")
metadata <- read_csv(here("100_data_raw", "corpus_cqpweb_metadata.csv"))
additional_source_metadata <- read_csv(here("100_data_raw", "addition_source_metadata.csv"))
metadata_full <- inner_join(metadata, additional_source_metadata)
condition_first_annotated <- inner_join(
  condition_first, metadata_full, by = c("text" = "article_id"))
person_first_annotated <- inner_join(
  person_first, metadata_full, by = c("text" = "article_id"))
fat_annotated  <- inner_join(
  fat_labelled, metadata_full, by = c("text" = "article_id")
)
write_csv(fat_annotated, file = here::here("200_data_clean", "fat_annotated.csv"))

Check that all texts in the CQPWeb export files are found in the article_id column of the metadata file, to ensure that no article_ids were corrupted when we imported into and exported from CQPWeb.

Code
janitor::tabyl(condition_first$text %in% metadata$article_id) %>% kable()
condition_first$text %in% metadata$article_id n percent
TRUE 3208 1
Code
janitor::tabyl(person_first$text %in% metadata$article_id) %>% kable()
person_first$text %in% metadata$article_id n percent
TRUE 106 1
Code
janitor::tabyl(adj_obese$text %in% metadata$article_id) %>% kable()
adj_obese$text %in% metadata$article_id n percent
TRUE 10821 1
Code
janitor::tabyl(adj_overweight$text %in% metadata$article_id) %>% kable()
adj_overweight$text %in% metadata$article_id n percent
TRUE 7136 1
Code
janitor::tabyl(fat_annotated$text %in% metadata$article_id) %>% kable()
fat_annotated$text %in% metadata$article_id n percent
TRUE 2300 1

Yes, this is true for all of them.
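
If any text IDs had failed to match, an anti_join would list them directly. A minimal sketch of such a check, using the objects already loaded above:

Code
# Sketch: list any CQPWeb text IDs absent from the metadata (expected: zero rows)
condition_first %>%
  anti_join(metadata, by = c("text" = "article_id")) %>%
  select(text)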

Articles by month, year and publication (based on metadata)

Have we sampled the articles consistently by month, year and publication?

Code
# Cross-tabulate article counts by source and a supplied variable (e.g. year or month)
assess_year_source <- function(df, var){
  var <- enquo(var)
  df %>% 
  select(source, !!var) %>%
  group_by(source, !!var) %>% 
  count(!!var) %>%
  rename(count = n) %>%
  pivot_wider(id_cols = c(source), names_from = !!var, values_from = c(count), values_fill = 0) %>%
  janitor::adorn_totals(c("row", "col")) %>%
  kable()
} 
assess_year_source(metadata_full, year)
Table 1: Number of articles by source and year.
source 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 Total
Advertiser 386 309 300 284 326 395 287 293 221 210 164 174 3349
Age 354 243 218 223 204 317 220 325 234 191 177 120 2826
Australian 296 221 178 201 168 172 155 110 95 113 147 104 1960
BrisTimes 0 0 0 0 0 5 11 81 16 14 53 48 228
CanTimes 188 188 166 180 165 238 218 212 114 154 149 72 2044
CourierMail 382 334 309 274 239 296 246 226 188 200 244 193 3131
HeraldSun 459 409 404 379 338 413 273 252 196 209 214 176 3722
HobMercury 199 141 149 98 135 177 88 85 107 91 101 94 1465
NorthernT 104 94 97 67 64 73 56 61 64 59 43 40 822
SydHerald 349 311 352 329 338 342 265 372 285 237 242 214 3636
Telegraph 0 0 0 0 0 2 190 134 190 233 191 149 1089
WestAus 283 222 221 210 185 190 210 114 119 80 40 17 1891
Total 3000 2472 2394 2245 2162 2620 2219 2265 1829 1791 1765 1401 26163

We can see that:

  • we are missing articles from the Brisbane Times and the Daily Telegraph in 2008-2012;
  • there are fewer articles in 2019 than in previous years. Let’s explore whether we have data for all of the months of that year:
Code
metadata_full %>%
  filter(year == 2019) %>%
  assess_year_source(., month_metadata)
Table 2: Number of articles by source and month of publication.
source 01 02 03 04 05 06 07 08 09 10 11 12 Total
Advertiser 19 14 16 23 21 16 10 14 12 11 9 9 174
Age 9 6 4 0 12 12 9 10 13 15 16 14 120
Australian 8 11 11 6 11 7 10 14 3 8 9 6 104
BrisTimes 4 2 2 6 3 5 6 3 6 5 3 3 48
CanTimes 11 6 10 10 9 0 5 5 5 6 4 1 72
CourierMail 15 24 17 24 0 16 18 18 17 17 8 19 193
HeraldSun 16 23 10 20 15 17 14 16 12 14 6 13 176
HobMercury 7 17 15 10 1 8 7 8 4 3 9 5 94
NorthernT 3 7 3 2 3 2 5 3 1 6 1 4 40
SydHerald 15 14 18 14 22 19 17 17 19 22 21 16 214
Telegraph 13 20 9 23 10 18 13 12 7 10 8 6 149
WestAus 1 0 3 1 1 2 0 2 2 3 0 2 17
Total 121 144 118 139 108 122 114 122 101 120 94 98 1401

We can see that we do have data for each month of 2019, although there are many fewer articles that year from the West Australian and, to a lesser extent, the Canberra Times.

Let’s also visualise this trend by publication:

Code
metadata_full %>%
  group_by(source, year) %>%
  count() %>%
  ggplot(aes(col = source, y = n, x = year)) + 
  geom_line() +
  scale_x_continuous(breaks = unique(metadata_full$year)) + 
  labs(x = "", y = "Number of articles")

Figure 1: Number of articles by publication and year. The overall number of articles featuring the terms seems to decrease with time.

We can also look at this using facets:

Code
metadata_full %>%
  group_by(source, year) %>%
  count() %>%
  ggplot(aes(col = source, y = n, x = year)) + 
  geom_line() +
  scale_x_continuous(breaks = unique(metadata_full$year)) + 
  labs(x = "", y = "Number of articles") + 
  facet_wrap(~source, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "none")

Figure 2: Number of articles by publication and year, faceted by source publication.

This most clearly shows that the number of articles mentioning obes* declines with time in Australian media.

This could be due to:

  • Fewer articles in general being published
  • Proportionally fewer articles about obesity being written
  • More duplicate articles not having been cleaned in the earlier years

It is important to acknowledge this trend, as it will influence any inferences that rely on raw article counts in temporal comparisons.
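
If temporal comparisons are still needed despite this trend, one option (sketched below; not part of the original analysis) is to work with each source's share of the articles published in a given year rather than raw counts:

Code
# Sketch: proportion of each year's articles contributed by each source,
# so that comparisons are not driven by the overall decline in article numbers
metadata_full %>%
  count(year, source, name = "n_articles") %>%
  group_by(year) %>%
  mutate(prop_of_year = n_articles / sum(n_articles)) %>%
  ungroup()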

Code
metadata_full %>%
  # don't have all years of data for these
  filter(!(source %in% c("BrisTimes", "Telegraph"))) %>%
  group_by(source, year) %>%
  count() %>%
  ggplot(aes(x = year, y = n, group = year)) + 
  geom_boxplot() +
  scale_x_continuous(breaks = unique(metadata_full$year)) + 
  labs(x = "", y = "Number of articles") 

Figure 3: Boxplot of articles by year. The overall number of articles featuring the terms seems to decrease with time.


CQPweb vs metadata-provided and Python-generated word counts

When carrying out modelling, it is often important to use data normalised to article word counts. This depends on obtaining a correct word count for each article. Below, we compare article word counts from CQPWeb to those generated by Python and those reported in the Lexis-supplied metadata, using the condition-first language dataset as an example. We expect the effects seen here to be consistent across the other datasets as well.
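
As a quick numerical companion to the plots below, the pairwise correlations between the three word-count columns can be computed directly (a sketch using the columns already present in condition_first_annotated):

Code
# Sketch: pairwise Pearson correlations between the three word-count columns
condition_first_annotated %>%
  select(wordcount_total, wordcount_from_metatata, no_words_in_text) %>%
  cor(use = "complete.obs") %>%
  round(3)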

First, compare the counts from Python and the metadata:

Code
p <- condition_first_annotated %>%
  ggplot(aes(x = wordcount_total,
             y = wordcount_from_metatata)) + 
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, col = "red") + 
  geom_abline(slope = 1, intercept = 0, col = "blue", lty = 2) +
  xlab("Word count, Python") + 
  ylab("Word count, metadata") 
plotly::ggplotly(p) 

Figure 4: Correlation between word counts derived from the Lexis-supplied metadata and generated via Python counting. The correlation seems to be near-perfect (the blue dashed line shows perfect equality; red is the line of best fit).

The word counts are well correlated across all text lengths.

Next, we compare the word counts from CQPweb and the metadata:

Code
p <- condition_first_annotated %>%
  ggplot(aes(x = no_words_in_text,
             y = wordcount_from_metatata)) + 
  geom_point() + 
  geom_abline(slope = 1, intercept = 0, col = "blue", lty = 2) +
  xlab("Word count, CQPWeb") + 
  ylab("Word count, metadata") 
plotly::ggplotly(p) 

Figure 5: Correlation between word counts derived from the Lexis-supplied metadata and reported by CQPweb. It appears that there is a strong overestimate of word counts by CQPweb.

We can see that the longer the text, the more the CQPweb counts diverge from those in the metadata (and those counted via Python). This is most likely because CQPweb counts punctuation marks as tokens, so longer texts accumulate more extra tokens.
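
To put a rough number on this, a sketch (using the same columns as above) summarising the ratio of the CQPweb token count to the Python word count by text-length decile:

Code
# Sketch: median ratio of CQPweb tokens to Python-counted words, by text-length decile
condition_first_annotated %>%
  mutate(length_decile = ntile(wordcount_total, 10),
         cqp_to_python_ratio = no_words_in_text / wordcount_total) %>%
  group_by(length_decile) %>%
  summarise(median_ratio = median(cqp_to_python_ratio), n_texts = n())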

We can add a smoothed conditional means line (in red) to show the deviation between the identity line (y = x, shown dashed in blue) and the line of best fit:

Code
condition_first_annotated %>%
  ggplot(aes(x = no_words_in_text,
             y = wordcount_from_metatata)) + 
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE, col = "red") + 
  geom_abline(slope = 1, intercept = 0, col = "blue", lty = 2) +
  xlab("Word count, CQPWeb") + 
  ylab("Word count, metadata") 

Figure 6: Correlation between word counts derived from the Lexis-supplied metadata and reported by CQPweb. The blue dashed line shows perfect equality; red is the line of best fit. The overestimate by CQPweb is clearly detectable.

To make this more apparent, we plot the absolute differences between the Python and metadata counts and between the CQPweb and metadata counts:

Code
condition_first_annotated %>%
  mutate(
    `Python - metadata` = abs(wordcount_total - wordcount_from_metatata),
    `CQPweb - metadata` = abs(no_words_in_text - wordcount_from_metatata),
    # deciles of text length, based on the metadata-supplied word count
    word_count_decile = ntile(wordcount_from_metatata, 10)) %>%
  select(`Python - metadata`, `CQPweb - metadata`, word_count_decile) %>%
  pivot_longer(cols = c(`Python - metadata`, `CQPweb - metadata`)) %>%
  ggplot(aes(x = as.factor(word_count_decile), 
             y = value,
             fill = name)) + 
  geom_boxplot() +
  labs(
    x = "Text length decile, based on metadata-supplied word count",
    y = "Absolute difference in word count between each count and the metadata-provided count",
    caption = "The longer the text, the more the CQPweb count diverges from that of the metadata (and Python)") + 
  guides(fill = guide_legend(title = "Difference"))

Figure 7: Boxplot of absolute difference between (Python-metadata) and (CQPweb-metadata) derived word counts. It’s clear that the longer the text, the more CQPweb overestimates the word count, especially for the top decile of texts by length.

This also affects the normalised counts. To show this, we have broken the dataset into 6 groups based on text length, with 1 being the shortest and 6 being the longest texts. We can see that for the ~500 longest texts (panel 6), the line of best fit between the two frequencies is curved, showing how the normalised frequency is under-estimated for longer texts (because the denominator used for normalisation is larger when punctuation is included in the count).

Code
condition_first_annotated %>%
  mutate(frequency = 10^6 * no_hits_in_text / wordcount_total,
         # 6 length bins: 1 = shortest texts, 6 = longest texts
         length_quantile = ntile(wordcount_total, 6)) %>%
  ggplot(aes(x = frequency, y = freq_per_million_words)) + 
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE, col = "red") + 
  geom_abline(slope = 1, intercept = 0, col = "blue", lty = 2) +
  facet_wrap(~length_quantile, scales = "free_y") +
  labs(x = "Python-counted frequency per million words",
       y = "CQP-calculated frequency per million words")

Figure 8: Comparison of the effect of using CQPweb vs Python word counts, with texts grouped into 6 length bins.

CONCLUSION: Use the Python-generated total word count for calculating normalised frequencies (or the counts provided in the metadata, as the two are nearly identical and well correlated across all text lengths). This also means we cannot use the normalised frequencies provided by CQPweb.
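
In practice, this means recomputing the per-million-word frequency ourselves from the Python counts. A minimal sketch for the condition-first dataset (the same pattern applies to the other annotated datasets):

Code
# Sketch: normalised frequency per million words using the Python-generated
# total word count, rather than CQPweb's freq_per_million_words
condition_first_annotated <- condition_first_annotated %>%
  mutate(freq_pmw_python = 10^6 * no_hits_in_text / wordcount_total)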