Loading and exploring the data

Executive summary

In this document, we show that we should use the Python-generated total word count for calculating normalised frequencies (or the counts provided in the metadata, as the two are nearly identical and well correlated across all text lengths).


In this file, we start by loading and exploring the data for the Australian Obesity Corpus.

Code
library(here)
library(janitor)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
theme_set(theme_minimal())
# Read a single CQPWeb export file (tab-separated, with three header lines to skip)
read_cqpweb <- function(filename){
  read.csv(
    here("100_data_raw", filename), 
    skip = 3, sep = "\t") %>% 
    janitor::clean_names()
}
# Read and row-bind all per-text frequency files in a raw-data subdirectory
clean_fatneglabel <- function(dirname){
  purrr::map_dfr(
    list.files(
      here("100_data_raw", dirname )
      ),
    ~{read.csv(
      here(
        paste0("100_data_raw/", dirname), .x), 
      skip = 3, sep = "\t") %>% 
        janitor::clean_names()
      })
}
condition_first <- read_cqpweb("aoc_all_condition_first.txt")
person_first <- read_cqpweb("aoc_all_person_first.txt")
adj_obese <- read_cqpweb("aoc_all_obese_tagadjlemma.txt")
adj_overweight <- read_cqpweb("aoc_all_overweight_tagadjlemma.txt")
fat_labelled <- clean_fatneglabel("fat_neg_label_yes_textFreqs")
metadata <- read_csv(here("100_data_raw", "corpus_cqpweb_metadata.csv"))
additional_source_metadata <- read_csv(here("100_data_raw", "addition_source_metadata.csv"))
metadata_full <- inner_join(metadata, additional_source_metadata)
condition_first_annotated <- inner_join(
  condition_first, metadata_full, by = c("text" = "article_id"))
person_first_annotated <- inner_join(
  person_first, metadata_full, by = c("text" = "article_id"))
fat_annotated  <- inner_join(
  fat_labelled, metadata_full, by = c("text" = "article_id")
)
write_csv(fat_annotated, file = here::here("200_data_clean", "fat_annotated.csv"))

Check that all texts in the CQPWeb export files are found in the article_id column of the metadata file, to ensure that no article_ids were corrupted when we imported into and exported from CQPWeb.

Code
janitor::tabyl(condition_first$text %in% metadata$article_id) %>% kable()
condition_first$text %in% metadata$article_id n percent
TRUE 3208 1
Code
janitor::tabyl(person_first$text %in% metadata$article_id) %>% kable()
person_first$text %in% metadata$article_id n percent
TRUE 106 1
Code
janitor::tabyl(adj_obese$text %in% metadata$article_id) %>% kable()
adj_obese$text %in% metadata$article_id n percent
TRUE 10821 1
Code
janitor::tabyl(adj_overweight$text %in% metadata$article_id) %>% kable()
adj_overweight$text %in% metadata$article_id n percent
TRUE 7136 1
Code
janitor::tabyl(fat_annotated$text %in% metadata$article_id) %>% kable()
fat_annotated$text %in% metadata$article_id n percent
TRUE 2300 1

Yes, this is true for all of them.
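
If any text IDs had failed to match, an anti_join would list them directly. A minimal sketch of such a check, using the objects already loaded above:

Code
# Sketch: list any CQPWeb text IDs absent from the metadata (expected: zero rows)
condition_first %>%
  anti_join(metadata, by = c("text" = "article_id")) %>%
  select(text)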

Articles by month, year and publication (based on metadata)

Have we sampled the articles consistently by month, year and publication?

Code
# Cross-tabulate article counts by source and a supplied variable (e.g. year or month)
assess_year_source <- function(df, var){
  var <- enquo(var)
  df %>% 
  select(source, !!var) %>%
  group_by(source, !!var) %>% 
  count(!!var) %>%
  rename(count = n) %>%
  pivot_wider(id_cols = c(source), names_from = !!var, values_from = c(count), values_fill = 0) %>%
  janitor::adorn_totals(c("row", "col")) %>%
  kable()
} 
assess_year_source(metadata_full, year)
Table 1: Number of articles by source and year.
source 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 Total
Advertiser 386 309 300 284 326 395 287 293 221 210 164 174 3349
Age 354 243 218 223 204 317 220 325 234 191 177 120 2826
Australian 296 221 178 201 168 172 155 110 95 113 147 104 1960
BrisTimes 0 0 0 0 0 5 11 81 16 14 53 48 228
CanTimes 188 188 166 180 165 238 218 212 114 154 149 72 2044
CourierMail 382 334 309 274 239 296 246 226 188 200 244 193 3131
HeraldSun 459 409 404 379 338 413 273 252 196 209 214 176 3722
HobMercury 199 141 149 98 135 177 88 85 107 91 101 94 1465
NorthernT 104 94 97 67 64 73 56 61 64 59 43 40 822
SydHerald 349 311 352 329 338 342 265 372 285 237 242 214 3636
Telegraph 0 0 0 0 0 2 190 134 190 233 191 149 1089
WestAus 283 222 221 210 185 190 210 114 119 80 40 17 1891
Total 3000 2472 2394 2245 2162 2620 2219 2265 1829 1791 1765 1401 26163

We can see that:

  • we are missing articles from the Brisbane Times and the Daily Telegraph in 2008-2012;
  • there are fewer articles in 2019 than in previous years. Let’s explore whether we have data for all of the months of that year:
Code
metadata_full %>%
  filter(year == 2019) %>%
  assess_year_source(., month_metadata)
Table 2: Number of articles by source and month of publication.
source 01 02 03 04 05 06 07 08 09 10 11 12 Total
Advertiser 19 14 16 23 21 16 10 14 12 11 9 9 174
Age 9 6 4 0 12 12 9 10 13 15 16 14 120
Australian 8 11 11 6 11 7 10 14 3 8 9 6 104
BrisTimes 4 2 2 6 3 5 6 3 6 5 3 3 48
CanTimes 11 6 10 10 9 0 5 5 5 6 4 1 72
CourierMail 15 24 17 24 0 16 18 18 17 17 8 19 193
HeraldSun 16 23 10 20 15 17 14 16 12 14 6 13 176
HobMercury 7 17 15 10 1 8 7 8 4 3 9 5 94
NorthernT 3 7 3 2 3 2 5 3 1 6 1 4 40
SydHerald 15 14 18 14 22 19 17 17 19 22 21 16 214
Telegraph 13 20 9 23 10 18 13 12 7 10 8 6 149
WestAus 1 0 3 1 1 2 0 2 2 3 0 2 17
Total 121 144 118 139 108 122 114 122 101 120 94 98 1401

We can see that we do have data for each month of 2019, although there are many fewer articles that year from the West Australian and, to a lesser extent, the Canberra Times.

Let’s also visualise this trend by publication:

Code
metadata_full %>%
  group_by(source, year) %>%
  count() %>%
  ggplot(aes(col = source, y = n, x = year)) + 
  geom_line() +
  scale_x_continuous(breaks = unique(metadata_full$year)) + 
  labs(x = "", y = "Number of articles")

Figure 1: Number of articles by publication and year. The overall number of articles featuring the terms seems to decrease with time.

We can also look at this using facets:

Code
metadata_full %>%
  group_by(source, year) %>%
  count() %>%
  ggplot(aes(col = source, y = n, x = year)) + 
  geom_line() +
  scale_x_continuous(breaks = unique(metadata_full$year)) + 
  labs(x = "", y = "Number of articles") + 
  facet_wrap(~source, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "none")

Figure 2: Number of articles by publication and year, faceted by source publication.

This most clearly shows that the number of articles mentioning obes* declines with time in Australian media.

This could be due to:

  • Fewer articles in general being published
  • Proportionally fewer articles about obesity being written
  • More duplicate articles not having been cleaned in the earlier years

It is important to acknowledge this trend, as it will influence any inferences that rely on raw article counts in temporal comparisons.
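
If temporal comparisons are still needed despite this trend, one option (sketched below; not part of the original analysis) is to work with each source's share of the articles published in a given year rather than raw counts:

Code
# Sketch: proportion of each year's articles contributed by each source,
# so that comparisons are not driven by the overall decline in article numbers
metadata_full %>%
  count(year, source, name = "n_articles") %>%
  group_by(year) %>%
  mutate(prop_of_year = n_articles / sum(n_articles)) %>%
  ungroup()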

Code
metadata_full %>%
  # don't have all years of data for these
  filter(!(source %in% c("BrisTimes", "Telegraph"))) %>%
  group_by(source, year) %>%
  count() %>%
  ggplot(aes(x = year, y = n, group = year)) + 
  geom_boxplot() +
  scale_x_continuous(breaks = unique(metadata_full$year)) + 
  labs(x = "", y = "Number of articles") 

Figure 3: Boxplot of articles by year. The overall number of articles featuring the terms seems to decrease with time.


CQPweb vs metadata-provided and Python-generated word counts

When carrying out modelling, it is often important to use data normalised to article word counts. This depends on obtaining a correct word count for each article. Below, we compare article word counts from CQPWeb to those generated by Python and those reported in the Lexis-supplied metadata, using the condition-first language dataset as an example. We expect the effects seen here to be consistent across the other datasets as well.
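
As a quick numerical companion to the plots below, the pairwise correlations between the three word-count columns can be computed directly (a sketch using the columns already present in condition_first_annotated):

Code
# Sketch: pairwise Pearson correlations between the three word-count columns
condition_first_annotated %>%
  select(wordcount_total, wordcount_from_metatata, no_words_in_text) %>%
  cor(use = "complete.obs") %>%
  round(3)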

First, compare the counts from Python and the metadata:

Code
p <- condition_first_annotated %>%
  ggplot(aes(x = wordcount_total,
             y = wordcount_from_metatata)) + 
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, col = "red") + 
  geom_abline(slope = 1, intercept = 0, col = "blue", lty = 2) +
  xlab("Word count, Python") + 
  ylab("Word count, metadata") 
plotly::ggplotly(p) 

Figure 4: Correlation between word counts derived from the Lexis-supplied metadata and generated via Python counting. The correlation seems to be near-perfect (the blue dashed line shows perfect equality; red is the line of best fit).

The word counts are well correlated across all text lengths.

Next, we compare the word counts from CQPweb and the metadata:

Code
p <- condition_first_annotated %>%
  ggplot(aes(x = no_words_in_text,
             y = wordcount_from_metatata)) + 
  geom_point() + 
  geom_abline(slope = 1, intercept = 0, col = "blue", lty = 2) +
  xlab("Word count, CQPWeb") + 
  ylab("Word count, metadata") 
plotly::ggplotly(p) 

Figure 5: Correlation between word counts derived from the Lexis-supplied metadata and reported by CQPweb. It appears that there is a strong overestimate of word counts by CQPweb.

We can see that the longer the text, the more the CQPweb counts diverge from those in the metadata (and those counted via Python). This is most likely because CQPweb counts punctuation marks as tokens, so longer texts accumulate more extra tokens.
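
To put a rough number on this, a sketch (using the same columns as above) summarising the ratio of the CQPweb token count to the Python word count by text-length decile:

Code
# Sketch: median ratio of CQPweb tokens to Python-counted words, by text-length decile
condition_first_annotated %>%
  mutate(length_decile = ntile(wordcount_total, 10),
         cqp_to_python_ratio = no_words_in_text / wordcount_total) %>%
  group_by(length_decile) %>%
  summarise(median_ratio = median(cqp_to_python_ratio), n_texts = n())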

We can add a smoothed conditional means line (in red) to show the deviation between the identity line (y = x, shown dashed in blue) and the line of best fit:

Code
condition_first_annotated %>%
  ggplot(aes(x = no_words_in_text,
             y = wordcount_from_metatata)) + 
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE, col = "red") + 
  geom_abline(slope = 1, intercept = 0, col = "blue", lty = 2) +
  xlab("Word count, CQPWeb") + 
  ylab("Word count, metadata") 

Figure 6: Correlation between word counts derived from the Lexis-supplied metadata and reported by CQPweb. The blue dashed line shows perfect equality; red is the line of best fit. The overestimate by CQPweb is clearly detectable.

To make this more apparent, we plot the absolute differences between the Python and metadata counts and between the CQPweb and metadata counts:

Code
condition_first_annotated %>%
  mutate(
    `Python - metadata` = abs(wordcount_total - wordcount_from_metatata),
    `CQPweb - metadata` = abs(no_words_in_text - wordcount_from_metatata),
    # deciles of text length, based on the metadata-supplied word count
    word_count_decile = ntile(wordcount_from_metatata, 10)) %>%
  select(`Python - metadata`, `CQPweb - metadata`, word_count_decile) %>%
  pivot_longer(cols = c(`Python - metadata`, `CQPweb - metadata`)) %>%
  ggplot(aes(x = as.factor(word_count_decile), 
             y = value,
             fill = name)) + 
  geom_boxplot() +
  labs(
    x = "Text length decile, based on metadata-supplied word count",
    y = "Absolute difference in word count between each count and the metadata-provided count",
    caption = "The longer the text, the more the CQPweb count diverges from that of the metadata (and Python)") + 
  guides(fill = guide_legend(title = "Difference"))

Figure 7: Boxplot of absolute difference between (Python-metadata) and (CQPweb-metadata) derived word counts. It’s clear that the longer the text, the more CQPweb overestimates the word count, especially for the top decile of texts by length.

This also affects the normalised counts. To show this, we have broken the dataset into 6 groups based on text length, with 1 being the shortest and 6 being the longest texts. We can see that for the ~500 longest texts (panel 6), the line of best fit between the two frequencies is curved, showing how the normalised frequency is under-estimated for longer texts (because the denominator used for normalisation is larger when punctuation is included in the count).

Code
condition_first_annotated %>%
  mutate(frequency = 10^6 * no_hits_in_text / wordcount_total,
         # 6 length bins: 1 = shortest texts, 6 = longest texts
         length_quantile = ntile(wordcount_total, 6)) %>%
  ggplot(aes(x = frequency, y = freq_per_million_words)) + 
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE, col = "red") + 
  geom_abline(slope = 1, intercept = 0, col = "blue", lty = 2) +
  facet_wrap(~length_quantile, scales = "free_y") +
  labs(x = "Python-counted frequency per million words",
       y = "CQP-calculated frequency per million words")

Figure 8: Comparison of the effect of using CQPweb vs Python word counts, with texts grouped into 6 length bins.

CONCLUSION: Use the Python-generated total word count for calculating normalised frequencies (or the counts provided in the metadata, as the two are nearly identical and well correlated across all text lengths). This also means we cannot use the normalised frequencies provided by CQPweb.
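
In practice, this means recomputing the per-million-word frequency ourselves from the Python counts. A minimal sketch for the condition-first dataset (the same pattern applies to the other annotated datasets):

Code
# Sketch: normalised frequency per million words using the Python-generated
# total word count, rather than CQPweb's freq_per_million_words
condition_first_annotated <- condition_first_annotated %>%
  mutate(freq_pmw_python = 10^6 * no_hits_in_text / wordcount_total)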