In this notebook, we explore whether there is a difference in the use of condition- vs person-first language in the Australian Obesity Corpus.
Executive summary
1. Condition-first language is used in 9-14% of articles from all sources, while person-first language is used in less than 1% of articles.
2. Condition-first language is used in 7-14% of articles per year across the study time period, while person-first language is used in 0.17-1.14% of articles per year.
3. Person-first language is present in approximately the same number of articles in broadsheet and tabloid newspapers, whereas articles with only condition-first language are higher in number in tabloid publications.
The Pearson’s Chi-squared test with Yates’ continuity correction contrasting articles from tabloids and broadsheets that use only condition-first vs only person-first language indicates a significant association (X-squared = 4.8274, p-value = 0.02801) between type of publication and the number of articles using a specific language type. The effect size is quite small (< 0.2), indicating that while the result is statistically significant, the fields are weakly associated. So while the numbers of articles from broadsheets and tabloids that use condition- and person-first language differ, the magnitude of this difference (i.e. the gap between the number of articles we observe and the number we would expect by random chance) is not very high.
The total number of uses of condition-first language we observe is higher in tabloids and lower in broadsheets than we would expect based on the word count in these subcorpora (p < 0.001).
The number of articles with condition-first language we observe is also higher in tabloids and lower in broadsheets than we would expect based on the total article count in these subcorpora (p < 0.001).
The total number of uses of person-first language we observe is somewhat higher in tabloids and lower in broadsheets than we would expect based on the word count in these subcorpora, but this result is not strongly significant (p < 0.05).
The number of articles with person-first language we observe is, in contrast, lower in tabloids and higher in broadsheets than we would expect based on the total article count in these subcorpora (p < 0.002).
4. Person-first language is present in approximately the same number of articles in left- and right-leaning newspapers, whereas articles with only condition-first language are higher in number in right-leaning publications.
The total number of uses of condition-first language we observe is higher in right and lower in left-leaning publications than we would expect based on the word count in these subcorpora (p < 0.001).
The number of articles with condition-first language we observe is also higher in right- and lower in left-leaning publications than we would expect based on the total article count in these subcorpora (p < 0.001).
The total number of uses of person-first language we observe is higher in left and lower in right-leaning publications than we would expect based on the word count in these subcorpora (p < 0.002).
The number of articles with person-first language we observe is also higher in left- and lower in right-leaning publications than we would expect based on the total article count in these subcorpora (p < 0.002).
The Pearson’s Chi-squared test with Yates’ continuity correction contrasting articles from left- and right-leaning publications that use only condition-first vs only person-first language indicates a significant association (X-squared = 4.6405, p-value = 0.03123) between type of publication and the number of articles using a specific language type. The effect size is, however, also negligible, indicating that while the result is statistically significant, the fields are weakly associated. This indicates that while we do see more articles that use only condition-first language and fewer that use person-first language in right-leaning than in left-leaning publications, the difference between the observed numbers of articles and what we would expect by random chance is not very high.
5. Re-sampling the corpus 10000 times to select 1000 articles at a time without replacement results in a mean of 4 articles per 1000 using person-first language, and 122 per 1000 using condition-first language - so more articles in the corpus use condition-first language than person-first language.
The Welch Two Sample t-test testing the difference between person_first and condition_first bootstrapping (mean of person_first = 4.11, mean of condition_first = 122.57) suggests that the effect is negative, statistically significant, and large (difference = -118.46, 95% CI [-118.67, -118.26], t(10751.55) = -1138.14, p < .001; Cohen’s d = -16.10, 95% CI [-16.31, -15.88]).
6. In texts that use condition-first language, person-first language, or both, the frequency of condition-first language is higher (mean 4 words per 1000) than that of person-first language (mean 2.7 words per 1000).
The Welch Two Sample t-test testing the difference between condition_first_frequencies and person_first_frequencies (mean of x = 4.34, mean of y = 2.67) suggests that the effect is positive, statistically significant, and small (difference = 1.66, 95% CI [1.16, 2.17], t(131.59) = 6.49, p < .001; Cohen’s d = 0.44, 95% CI [0.30, 0.58]).
7. Relative to the Advertiser, the Age, Australian, Canberra Times, Courier Mail and Sydney Morning Herald had a lower frequency of condition-first language.
Code
library(here)
library(dplyr)
library(ggplot2)
library(ggvenn)
library(readr)
library(tidyr)
library(knitr)
library(ggrepel)
library(report)
library(lme4)
library(optimx)

# set ggplot2 to use the minimal theme for all figures in the document
# unless explicitly specified otherwise
theme_set(theme_minimal())

source(here::here("400_analysis", "functions.R"))

condition_first <- read_cqpweb("aoc_all_condition_first.txt")
person_first <- read_cqpweb("aoc_all_person_first.txt")

metadata <- read_csv(here("100_data_raw", "corpus_cqpweb_metadata.csv"))
additional_source_metadata <- read_csv(here("100_data_raw", "addition_source_metadata.csv"))
metadata_full <- inner_join(metadata, additional_source_metadata)

condition_first_annotated <- inner_join(metadata_full, condition_first,
                                        by = c("article_id" = "text")) %>%
  mutate(frequency = 10^3 * no_hits_in_text / wordcount_total)
person_first_annotated <- inner_join(metadata_full, person_first,
                                     by = c("article_id" = "text")) %>%
  mutate(frequency = 10^3 * no_hits_in_text / wordcount_total)

corpus_articlecounts <- read_csv(here("100_data_raw", "articlecounts_full.csv"),
                                 col_names = TRUE, skip = 1) %>%
  filter(year != "source") %>%
  rename(source = year)
As discussed in the exploratory data analysis, we use the Python-generated word counts to calculate the frequency of occurrences per thousand words, as these counts exclude punctuation tokens and hence do not inflate the denominators for longer texts.
We group articles into tabloids and broadsheets, and by orientation, in the following manner:
Table 1: Classification of sources into types and by orientation.
| source | source_type | orientation |
|---|---|---|
| Advertiser | tabloid | right |
| Australian | broadsheet | right |
| NorthernT | tabloid | right |
| CourierMail | tabloid | right |
| Age | broadsheet | left |
| SydHerald | broadsheet | left |
| Telegraph | tabloid | right |
| WestAus | tabloid | right |
| CanTimes | broadsheet | left |
| HeraldSun | tabloid | right |
| HobMercury | tabloid | right |
| BrisTimes | broadsheet | left |
Total number of articles with each of the language usages
First, we explore how many articles (absolute numbers and relative to the total number of articles in each source) use condition-first vs person-first language.
Code
condition_person_rbound <- rbind(
  articles_per_journal(person_first_annotated, "Person-first"),
  articles_per_journal(condition_first_annotated, "Condition-first")
)

# generate how many articles per source are in the corpus
corpus_total_articles_bysource <- corpus_articlecounts %>%
  rowwise() %>%
  mutate(total = sum(c_across(where(is.numeric)))) %>%
  select(source, total)

condition_person_rbound %>%
  select(-year) %>%
  group_by(type) %>%
  count(source) %>%
  inner_join(corpus_total_articles_bysource) %>%
  mutate(percent = round(100 * n / total, 2)) %>%
  rename(count = n) %>%
  pivot_wider(id_cols = source, names_from = type,
              values_from = c(count, total, percent),
              names_glue = "{type} {.value}") %>%
  rename(Total_articles = `Person-first total`) %>%
  select(-`Condition-first total`) %>%
  kable()
Table 2: Number and percentage (out of 100%) of articles in which person-first and condition-first language is used in the corpus, by publication.
| source | Condition-first count | Person-first count | Total_articles | Condition-first percent | Person-first percent |
|---|---|---|---|---|---|
| Advertiser | 456 | 8 | 3349 | 13.62 | 0.24 |
| Age | 315 | 19 | 2826 | 11.15 | 0.67 |
| Australian | 191 | 8 | 1960 | 9.74 | 0.41 |
| BrisTimes | 22 | 1 | 228 | 9.65 | 0.44 |
| CanTimes | 212 | 8 | 2044 | 10.37 | 0.39 |
| CourierMail | 434 | 12 | 3131 | 13.86 | 0.38 |
| HeraldSun | 509 | 14 | 3722 | 13.68 | 0.38 |
| HobMercury | 172 | 2 | 1465 | 11.74 | 0.14 |
| NorthernT | 95 | 2 | 822 | 11.56 | 0.24 |
| SydHerald | 430 | 23 | 3636 | 11.83 | 0.63 |
| Telegraph | 144 | 6 | 1089 | 13.22 | 0.55 |
| WestAus | 228 | 3 | 1891 | 12.06 | 0.16 |
We can see that condition-first language is used in 9-14% of articles, while person-first language is used in less than 1% of articles across all sources.
Code
corpus_total_articles_byyear <- corpus_articlecounts %>%
  pivot_longer(cols = -source, names_to = "year", values_to = "number_of_articles") %>%
  select(-source) %>%
  group_by(year) %>%
  summarise(total = sum(number_of_articles)) %>%
  mutate(year = as.numeric(year))

# count the number of articles per year that use each language type
condition_person_rbound %>%
  select(-source) %>%
  group_by(type) %>%
  count(year) %>%
  inner_join(corpus_total_articles_byyear) %>%
  mutate(percent = round(100 * n / total, 2)) %>%
  rename(count = n) %>%
  pivot_wider(id_cols = year, names_from = type,
              values_from = c(count, total, percent),
              names_glue = "{type} {.value}") %>%
  rename(Total_articles = `Person-first total`) %>%
  select(-`Condition-first total`) %>%
  kable()
Table 3: Number and percentage (out of 100%) of articles in which person-first and condition-first language is used in the corpus, by year.
| year | Condition-first count | Person-first count | Total_articles | Condition-first percent | Person-first percent |
|---|---|---|---|---|---|
| 2008 | 398 | 6 | 3000 | 13.27 | 0.20 |
| 2009 | 348 | 6 | 2472 | 14.08 | 0.24 |
| 2010 | 304 | 4 | 2394 | 12.70 | 0.17 |
| 2011 | 295 | 5 | 2245 | 13.14 | 0.22 |
| 2012 | 289 | 5 | 2162 | 13.37 | 0.23 |
| 2013 | 283 | 11 | 2620 | 10.80 | 0.42 |
| 2014 | 283 | 8 | 2219 | 12.75 | 0.36 |
| 2015 | 256 | 8 | 2265 | 11.30 | 0.35 |
| 2016 | 248 | 15 | 1829 | 13.56 | 0.82 |
| 2017 | 201 | 16 | 1791 | 11.22 | 0.89 |
| 2018 | 196 | 6 | 1765 | 11.10 | 0.34 |
| 2019 | 107 | 16 | 1401 | 7.64 | 1.14 |
We can see that condition-first language is used in 7-14% of articles per year across the study time period, while person-first language is used in 0.17-1.14% of articles per year.
Furthermore, the numbers of articles that use person-first language within the corpus are quite small, so we cannot simultaneously explore whether this type of language changes across both publication and year:
Code
assess_year_source(person_first_annotated)
Table 4: Number of articles that use person-first language by source and year in the Australian Obesity Corpus.
| source | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Advertiser | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 8 |
| Age | 2 | 0 | 1 | 2 | 2 | 1 | 2 | 3 | 2 | 2 | 1 | 1 | 19 |
| CourierMail | 1 | 0 | 1 | 0 | 0 | 1 | 2 | 1 | 4 | 1 | 1 | 0 | 12 |
| HeraldSun | 1 | 0 | 0 | 2 | 0 | 3 | 1 | 0 | 2 | 1 | 1 | 3 | 14 |
| SydHerald | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 7 | 23 |
| Australian | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 1 | 2 | 8 |
| CanTimes | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 2 | 8 |
| HobMercury | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 |
| WestAus | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 3 |
| NorthernT | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 |
| Telegraph | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 2 | 1 | 0 | 6 |
| BrisTimes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| Total | 6 | 6 | 4 | 5 | 5 | 11 | 8 | 8 | 15 | 16 | 6 | 16 | 106 |
There are also few articles that use such language from each publication (2-23 articles, mean 9.9 +/- 7.14), so modelling the trend by publication is unlikely to yield meaningful results.
We do have a reasonable number of articles that use condition-first language, so we can model this if desired (except for the Brisbane Times and Daily Telegraph, where we are missing data prior to 2014):
Code
assess_year_source(condition_first_annotated)
Table 5: Number of articles that use condition-first language by source and year in the Australian Obesity Corpus.
| source | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Advertiser | 40 | 53 | 43 | 49 | 40 | 51 | 43 | 41 | 37 | 25 | 19 | 15 | 456 |
| Age | 49 | 24 | 24 | 28 | 37 | 25 | 22 | 34 | 26 | 18 | 19 | 9 | 315 |
| Australian | 35 | 26 | 20 | 22 | 17 | 11 | 16 | 8 | 9 | 7 | 12 | 8 | 191 |
| CanTimes | 15 | 18 | 14 | 18 | 24 | 17 | 31 | 24 | 18 | 10 | 19 | 4 | 212 |
| CourierMail | 65 | 44 | 50 | 31 | 25 | 40 | 38 | 36 | 29 | 29 | 28 | 19 | 434 |
| HeraldSun | 66 | 71 | 56 | 61 | 51 | 49 | 35 | 25 | 32 | 28 | 22 | 13 | 509 |
| HobMercury | 31 | 16 | 15 | 20 | 14 | 23 | 8 | 13 | 8 | 8 | 11 | 5 | 172 |
| NorthernT | 12 | 21 | 13 | 5 | 7 | 10 | 7 | 5 | 4 | 7 | 3 | 1 | 95 |
| SydHerald | 47 | 52 | 41 | 35 | 38 | 41 | 30 | 37 | 36 | 23 | 32 | 18 | 430 |
| WestAus | 38 | 23 | 28 | 26 | 36 | 16 | 17 | 14 | 17 | 9 | 3 | 1 | 228 |
| BrisTimes | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 5 | 2 | 2 | 5 | 5 | 22 |
| Telegraph | 0 | 0 | 0 | 0 | 0 | 0 | 33 | 14 | 30 | 35 | 23 | 9 | 144 |
| Total | 398 | 348 | 304 | 295 | 289 | 283 | 283 | 256 | 248 | 201 | 196 | 107 | 3208 |
Also, among articles that use person-first language, nearly half also use condition-first language in the same article.
This means that comparing the use of person-first and condition-first language using a Chi-square test will not be appropriate, as the same article will be counted towards both condition-first and person-first language.
We can, however, compare the number of articles that use either language type (i.e. ONLY condition-first and only person-first) by type of publication:
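The code for this table is not shown; presumably it uses the same helper as the orientation analysis below, along these lines (a minimal sketch, assuming get_article_counts_nooverlap() from functions.R accepts any grouping column):
Code
# Articles that use ONLY one of the two language types, by source type
get_article_counts_nooverlap(condition_first_annotated,
                             person_first_annotated,
                             var = source_type) %>%
  kable()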
Table 6: Number of articles that use either language type (i.e. ONLY condition-first and only person-first) by type of publication
| source_type | condition-first | person-first |
|---|---|---|
| broadsheet | 1141 | 30 |
| tabloid | 2020 | 29 |
We can see that person-first language is present in approximately the same number of articles in broadsheet and tabloid newspapers, whereas articles with only condition-first language are higher in number in tabloid publications.
Table 8: Results of Chi-Square test without Yates continuity correction of number of articles that use only person-first and only condition-first language in the corpus.
| variable | value |
|---|---|
| uncorrected_method | Pearson’s Chi-squared test |
| uncorrected_parameter | 1 |
| uncorrected_statistic | 5.446196 |
| uncorrected_p.value | 0.01961098 |
| uncorrected_effect_size | 0.0411262107211088 |
| uncorrected_condition_first_observed | 1141 |
| uncorrected_person_first_observed | 30 |
| uncorrected_condition_first_observed | 2020 |
| uncorrected_person_first_observed | 29 |
| uncorrected_condition_first_expected | 1149.54378881988 |
| uncorrected_person_first_expected | 21.4562111801242 |
| uncorrected_condition_first_expected | 2011.45621118012 |
| uncorrected_person_first_expected | 37.5437888198758 |
Let’s next use a similar approach to identify whether left- or right- leaning publications are different in their use of condition- vs person-first language. What is the total number of articles that use EITHER condition-first or person-first language by orientation of publication?
Code
language_orientation_table <- get_article_counts_nooverlap(
  condition_first_annotated,
  person_first_annotated,
  var = orientation
)
language_orientation_table %>% kable()
Table 9: Number of articles that use either condition-first or person-first language by orientation of publication.
| orientation | condition-first | person-first |
|---|---|---|
| left | 954 | 26 |
| right | 2207 | 33 |
Next, let’s run a Chi-square test on this contingency table:
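The test itself is not shown in this notebook; the following is a minimal sketch using the counts from Table 9 (for a 2x2 table, chisq.test() applies the Yates correction by default), with Cramér's V as the effect size:
Code
counts <- matrix(c(954, 26, 2207, 33), nrow = 2, byrow = TRUE,
                 dimnames = list(c("left", "right"),
                                 c("condition-first", "person-first")))
corrected <- chisq.test(counts)  # Yates correction on by default for 2x2 tables
corrected
# Cramér's V for a 2x2 table: sqrt(X-squared / n)
sqrt(unname(corrected$statistic) / sum(counts))  # ~0.038, a negligible effect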
Table 10: Results of Chi-Square test with Yates continuity correction of number of articles that use only person-first and only condition-first language in the corpus.
| variable | value |
|---|---|
| corrected_method | Pearson’s Chi-squared test with Yates’ continuity correction |
| corrected_parameter | 1 |
| corrected_statistic | 4.640452 |
| corrected_p.value | 0.03122677 |
| corrected_effect_size | 0.0379622735273705 |
| corrected_condition_first_observed | 954 |
| corrected_person_first_observed | 26 |
| corrected_condition_first_observed | 2207 |
| corrected_person_first_observed | 33 |
| corrected_condition_first_expected | 962.04347826087 |
| corrected_person_first_expected | 17.9565217391304 |
| corrected_condition_first_expected | 2198.95652173913 |
| corrected_person_first_expected | 41.0434782608696 |
The Chi-square test of independence is significant.
However, the effect size is negligible (<= 0.2), indicating that once again the fields are only weakly associated.
There has been criticism of the use of the Yates correction, so we also provide the uncorrected results below:
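A minimal sketch of the uncorrected variant (same contingency table as in the sketch above, with correct = FALSE):
Code
chisq.test(matrix(c(954, 26, 2207, 33), nrow = 2, byrow = TRUE),
           correct = FALSE)  # X-squared ~5.28, cf. Table 11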
Table 11: Results of Chi-Square test without Yates continuity correction of number of articles that use only person-first and only condition-first language in the corpus.
| variable | value |
|---|---|
| uncorrected_method | Pearson’s Chi-squared test |
| uncorrected_parameter | 1 |
| uncorrected_statistic | 5.276 |
| uncorrected_p.value | 0.02162136 |
| uncorrected_effect_size | 0.0404785049139109 |
| uncorrected_condition_first_observed | 954 |
| uncorrected_person_first_observed | 26 |
| uncorrected_condition_first_observed | 2207 |
| uncorrected_person_first_observed | 33 |
| uncorrected_condition_first_expected | 962.04347826087 |
| uncorrected_person_first_expected | 17.9565217391304 |
| uncorrected_condition_first_expected | 2198.95652173913 |
| uncorrected_person_first_expected | 41.0434782608696 |
Comparing article counts that use condition-first, person-first and no language type
As discussed above, the corpus contains articles that use condition-first, person-first and neither of these two language types. We can use repeated sampling of 1000 articles from the corpus 10000 times to explore how frequently we would observe articles from each of the three groups.
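The code that builds the resampling object (referred to below as diff_boot) is not shown in this notebook; the following is a minimal sketch of the count part of such a procedure, assuming an article-level table with per-article hit counts for each language type (similar to condition_person_comparison_together, constructed later in this notebook). The column names PersonFirst and ConditionFirst are assumptions based on the output below.
Code
# Draw 1000 articles without replacement, 10 000 times, and count how many
# sampled articles contain each language type. A sketch, not the original code.
set.seed(42)  # hypothetical seed, for reproducibility
diff_boot_sketch <- purrr::map_dfr(1:10000, function(i) {
  sampled <- dplyr::slice_sample(condition_person_comparison_together,
                                 n = 1000, replace = FALSE)
  tibble::tibble(
    PersonFirst    = sum(sampled$person_first > 0),     # articles with >= 1 hit
    ConditionFirst = sum(sampled$condition_first > 0)
  )
})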
We can visualise the observed counts per 1000 articles from the 10000 resamples:
Code
diff_boot%>%select(-starts_with("freq"))%>%pivot_longer(cols =everything(),
names_to ="language_type",
values_to ="count_per_10000_articles")%>%ggplot(aes(x =count_per_10000_articles,
fill =language_type))+geom_histogram(bins =150)+theme(legend.position ="bottom")+labs(
x ="Count per 1000 articles sampled",
y ="Resamples with observed count",
caption ="Total 10 000 resamples of corpus with 1000 articles each")
We can compare the mean of these two observed resamples:
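The summary below reads like report::report() applied to a Welch two-sample t-test; a minimal sketch, assuming the diff_boot columns described above:
Code
# t.test() uses Welch's correction by default (var.equal = FALSE);
# report() narrates the result in the style quoted below
report::report(t.test(diff_boot$PersonFirst, diff_boot$ConditionFirst))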
The Welch Two Sample t-test testing the difference between PersonFirst and ConditionFirst (mean of x = 4.03, mean of y = 122.61) suggests that the effect is negative, statistically significant, and large (difference = -118.58, 95% CI [-118.78, -118.38], t(10751.74) = -1151.43, p < .001; Cohen’s d = -16.28, 95% CI [-16.50, -16.06])
This shows that, on average, of every 1000 articles sampled from the corpus, 4 will use person-first language and 122 will use condition-first language.
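The permutation test output below matches coin::oneway_test() with an approximate null distribution; a minimal sketch, where the long-format data frame and its wc/label columns are assumptions based on the printed output:
Code
library(coin)
# one row per resample per language type
boot_long <- data.frame(
  wc    = c(diff_boot$ConditionFirst, diff_boot$PersonFirst),
  label = factor(rep(c("condition", "person"), each = nrow(diff_boot)))
)
# Fisher-Pitman permutation test with 10 000 Monte Carlo resamples
oneway_test(wc ~ label, data = boot_long,
            distribution = approximate(nresample = 10000))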
Approximative Two-Sample Fisher-Pitman Permutation Test
data: wc by label (condition, person)
Z = 140.36, p-value < 1e-04
alternative hypothesis: true mu is not equal to 0
Code
ConditionFirst <- NULL
PersonFirst <- NULL
Comparing the frequency of condition-first, person-first and no language type
We can also look at the frequency of the different language types used across the two sets of resamples.
We can visualise the observed frequency per million words from the 10000 resamples:
Code
diff_boot %>%
  select(starts_with("freq")) %>%
  pivot_longer(cols = everything(),
               names_to = "language_type",
               values_to = "freq_per_million_words") %>%
  ggplot(aes(x = freq_per_million_words,
             fill = language_type)) +
  geom_histogram(bins = 150) +
  theme(legend.position = "bottom") +
  labs(
    x = "Frequency per million words in subcorpus",
    y = "Resamples with observed count",
    caption = "Total 10 000 resamples of corpus with 1000 articles each")
We can compare the mean of these two observed resamples:
The Welch Two Sample t-test testing the difference between PersonFirst and ConditionFirst (mean of x = 8.23, mean of y = 284.87) suggests that the effect is negative, statistically significant, and large (difference = -276.64, 95% CI [-277.26, -276.01], t(10658.76) = -870.06, p < .001; Cohen’s d = -12.30, 95% CI [-12.47, -12.14])
This shows that, across the resamples, the frequency of person-first language (mean ~8 instances per million words) is much lower than that of condition-first language (mean ~285 instances per million words).
Approximative Two-Sample Fisher-Pitman Permutation Test
data: wc by label (condition, person)
Z = 139.59, p-value < 1e-04
alternative hypothesis: true mu is not equal to 0
Comparing the number of phrases that use condition-first vs person-first language
We can also take a different approach, comparing the number of phrases that use each language type. Here, each phrase contributes to only one group, i.e. it is counted towards either person-first or condition-first language. A phrase is defined in this context as an instance of language use; for example, the article "AD150801123" contains 7 phrases, numbered below, that are classified by CQPweb as condition-first language:
The discovery by the Murdoch Childrens Institute raises hope that if we can tackle obesity in childhood(1) we can avoid a tsunami of obesity-related(2) health expenses in the future.
<..> “The findings will have major implications for how we treat childhood obesity(3),” says Professor Sabin.
A review by the Murdoch Childrens Research Institute published in the Journal of Paediatrics and Child Health has found childhood obesity(4) has doubled in prevalence since the 1980s. Professor Sabin says while rates of childhood overweight and obesity(5) have plateaued the severity of the problem has increased.
“Childhood obesity(6) has become a global crisis and is one of the world’s most pressing public health issues,” he says.The Murdoch Childrens Institute is undertaking a number of studies and programs to combat childhood obesity(7).
We do, however, need to confirm that our counts are underpinned by many articles each containing a few phrases, rather than by a small number of articles each containing many phrases. Let’s look at how many articles have how many counts of each language usage:
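The code behind Table 13 is not shown; a minimal sketch of how it could be derived from the annotated tables:
Code
# Total CQPweb hits vs number of articles with at least one hit, per type
rbind(
  condition_first_annotated %>% mutate(type = "condition-first"),
  person_first_annotated %>% mutate(type = "person-first")
) %>%
  group_by(type) %>%
  summarise(instances = sum(no_hits_in_text),
            articles = n()) %>%
  kable()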
Table 13: Total number of CQP-web determined instances and articles with at least one instance of condition-first and person-first language.
| type | instances | articles |
|---|---|---|
| condition-first | 4677 | 3208 |
| person-first | 136 | 106 |
Relative frequency of condition- vs person-first language
Let’s explore what the relative frequency, calculated as 10^3*no_hits_in_text/wordcount_total (where no_hits_in_text is determined by CQPweb and wordcount_total is the Python word count), of condition-first vs person-first language looks like.
Code
freq_1 <- condition_first_annotated %>%
  select(frequency) %>%
  mutate(condition = "Condition-first language") %>%
  rbind({
    person_first_annotated %>%
      select(frequency) %>%
      mutate(condition = "Person-first language")
  })

freq_1_gt100words <- condition_first_annotated %>%
  filter(wordcount_from_metatata >= 100) %>%
  select(frequency) %>%
  mutate(condition = "Condition-first language") %>%
  rbind({
    person_first_annotated %>%
      filter(wordcount_from_metatata >= 100) %>%
      select(frequency) %>%
      mutate(condition = "Person-first language")
  })

freq_1 %>%
  ggplot(aes(x = frequency, fill = condition)) +
  facet_grid(condition ~ ., scales = "free_y") +
  geom_histogram(bins = 100) +
  xlab("Frequency per thousand words") +
  ylab("Number of articles") +
  theme(legend.position = "none") +
  geom_vline(xintercept = 20, lty = 2)
Let’s create a box plot to compare the frequency per thousand words:
The Welch Two Sample t-test testing the difference between condition_first_frequencies and person_first_frequencies (mean of x = 4.34, mean of y = 2.67) suggests that the effect is positive, statistically significant, and small (difference = 1.66, 95% CI [1.16, 2.17], t(131.59) = 6.49, p < .001; Cohen’s d = 0.44, 95% CI [0.30, 0.58])
In texts that use condition-first language, person-first language, or both, the frequency of condition-first language is higher (mean 4 words per 1000) than that of person-first language (mean 2.7 words per 1000).
We can also use a non-parametric Fisher-Pitman (FP) permutation test to support this:
Approximative Two-Sample Fisher-Pitman Permutation Test
data: wc by label (condition, person)
Z = 3.5855, p-value = 0.0012
alternative hypothesis: true mu is not equal to 0
The plot below shows the article IDs of articles with a word count of less than 100 for person-first language, and the article IDs of articles with a word count of less than 100 and a frequency greater than 20 for condition-first language.
Code
condition_first_annotated %>%
  # select only texts less than 100 words
  filter(wordcount_total <= 100) %>%
  select(article_id, frequency) %>%
  mutate(condition = "Condition-first language") %>%
  # note that for condition-first only looking at those that are very high frequency here
  filter(frequency >= 20) %>%
  rbind({
    person_first_annotated %>%
      # select only texts less than 100 words
      filter(wordcount_total <= 100) %>%
      select(article_id, frequency) %>%
      mutate(condition = "Person-first language")
  }) %>%
  group_by(frequency) %>%
  mutate(cnt = n()) %>%
  ggplot(aes(x = frequency, y = cnt, fill = condition, label = article_id)) +
  facet_grid(condition ~ ., scales = "free_y") +
  geom_text_repel(check_overlap = TRUE, angle = 90) +
  xlab("Frequency per thousand words") +
  ylab("Article ID & count") +
  theme(legend.position = "none") +
  geom_vline(xintercept = 20, lty = 2)
There are a few texts with very high frequencies. These mostly occur in cases where the text length itself is quite short. We can consider whether we want to filter out texts with a word count of less than 100 words.
If we run a t-test on the dataset filtered to only contain texts of more than 100 words, we can see that while the results are still significant, the mean difference is smaller.
The Welch Two Sample t-test testing the difference between condition_first_frequencies_gt100 and person_first_frequencies_gt100 (mean of x = 3.69, mean of y = 2.60) suggests that the effect is positive, statistically significant, and small (difference = 1.10, 95% CI [0.62, 1.57], t(118.21) = 4.58, p < .001; Cohen’s d = 0.38, 95% CI [0.21, 0.55])
Approximative Two-Sample Fisher-Pitman Permutation Test
data: wc by label (condition, person)
Z = 3.3794, p-value = 8e-04
alternative hypothesis: true mu is not equal to 0
Person-first language frequency
Let’s visualise the frequency, calculated as 10^3*no_hits_in_text/wordcount_total (where no_hits_in_text is determined by CQPweb and wordcount_total is the Python word count), of person-first language by publication:
Code
person_first_annotated %>%
  select(source, frequency, year, source_type) %>%
  ggplot(aes(x = reorder(source, frequency),
             y = frequency,
             fill = source_type)) +
  geom_boxplot(outlier.shape = NA) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  labs(x = NULL,
       y = "Frequency per thousand words") +
  geom_jitter(width = 0.25, alpha = 0.5)
And per year:
Code
person_first_annotated %>%
  select(source, frequency, year, source_type) %>%
  ggplot(aes(x = as.factor(year), y = frequency,
             fill = source_type, shape = source_type)) +
  facet_wrap(~source_type) +
  geom_boxplot(outlier.shape = NA) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  labs(x = NULL, y = "Frequency per thousand words") +
  geom_jitter(width = 0.1, alpha = 0.5)
Condition- and person-first language normalised by total article length
We can also look at the frequency of person-first and condition-first language by dividing the number of observations of each language type by the total word count of articles in which they are found (i.e. frequency per 1000 words not on a per-article basis, but on a per total word count of articles in the group).
For example, in the table below, there are 6 instances of person-first language in 2008, and the total word count of the articles that contain these 6 instances is 4386 words. Therefore, the normalised frequency is 6*1000/4386 = 1.37.
Code
freq_per_subcorpus <- function(df, group_vars) {
  df %>%
    group_by(!!!group_vars) %>%
    summarise(
      total_instances = sum(no_hits_in_text),
      total_wc = sum(wordcount_total),
      instances_per_1000 = 1000 * total_instances / total_wc)
}

inner_join({
  freq_per_subcorpus(person_first_annotated, group_vars = vars(year)) %>%
    rename("instances_person_first" = "total_instances",
           "wc_personfirst" = "total_wc",
           "person first instances per 1000 words" = "instances_per_1000")
}, {
  freq_per_subcorpus(condition_first_annotated, group_vars = vars(year)) %>%
    rename("instances_cond_first" = "total_instances",
           "wc_condfirst" = "total_wc",
           "condition first instances per 1000 words" = "instances_per_1000")
}) %>%
  kable(digits = 2)
Table 14: Number of instances of the two language types by year, total word count of articles that use the specified language type and the relative frequency of instances per 1000 words across all articles that year.
| year | instances_person_first | wc_personfirst | person first instances per 1000 words | instances_cond_first | wc_condfirst | condition first instances per 1000 words |
|---|---|---|---|---|---|---|
| 2008 | 6 | 4386 | 1.37 | 553 | 225124 | 2.46 |
| 2009 | 7 | 4319 | 1.62 | 507 | 187891 | 2.70 |
| 2010 | 4 | 2104 | 1.90 | 452 | 170920 | 2.64 |
| 2011 | 5 | 4144 | 1.21 | 447 | 152357 | 2.93 |
| 2012 | 5 | 2820 | 1.77 | 385 | 170630 | 2.26 |
| 2013 | 12 | 7251 | 1.65 | 382 | 156399 | 2.44 |
| 2014 | 9 | 3322 | 2.71 | 408 | 155966 | 2.62 |
| 2015 | 9 | 5223 | 1.72 | 436 | 145468 | 3.00 |
| 2016 | 15 | 8846 | 1.70 | 388 | 145091 | 2.67 |
| 2017 | 18 | 10394 | 1.73 | 309 | 104202 | 2.97 |
| 2018 | 9 | 3460 | 2.60 | 273 | 145452 | 1.88 |
| 2019 | 37 | 10379 | 3.56 | 137 | 59344 | 2.31 |
Code
rbind({
  freq_per_subcorpus(person_first_annotated, group_vars = vars(year)) %>%
    mutate(language = "Person-first")
}, {
  freq_per_subcorpus(condition_first_annotated, group_vars = vars(year)) %>%
    mutate(language = "Condition-first")
}) %>%
  ggplot(aes(x = as.factor(year),
             y = instances_per_1000,
             col = language)) +
  geom_point() +
  labs(x = "Year",
       y = "Frequency per 1000 words",
       caption = "Instances per 1000 words across all articles that use language type")
We can also look at this across sources:
Code
inner_join({
  freq_per_subcorpus(person_first_annotated, group_vars = vars(source)) %>%
    rename("instances_person_first" = "total_instances",
           "wc_personfirst" = "total_wc",
           "person first instances per 1000 words" = "instances_per_1000")
}, {
  freq_per_subcorpus(condition_first_annotated, group_vars = vars(source)) %>%
    rename("instances_cond_first" = "total_instances",
           "wc_condfirst" = "total_wc",
           "condition first instances per 1000 words" = "instances_per_1000")
}) %>%
  kable(digits = 2)
Table 15: Number of instances of the two language types by source, total word count of articles that use the specified language type and the relative frequency of instances per 1000 words across all articles from that source.
| source | instances_person_first | wc_personfirst | person first instances per 1000 words | instances_cond_first | wc_condfirst | condition first instances per 1000 words |
|---|---|---|---|---|---|---|
| Advertiser | 9 | 3655 | 2.46 | 636 | 215470 | 2.95 |
| Age | 21 | 13586 | 1.55 | 509 | 228541 | 2.23 |
| Australian | 10 | 6368 | 1.57 | 283 | 163862 | 1.73 |
| BrisTimes | 1 | 1101 | 0.91 | 35 | 16760 | 2.09 |
| CanTimes | 17 | 4774 | 3.56 | 342 | 130561 | 2.62 |
| CourierMail | 12 | 5018 | 2.39 | 607 | 236038 | 2.57 |
| HeraldSun | 16 | 8030 | 1.99 | 705 | 239849 | 2.94 |
| HobMercury | 2 | 673 | 2.97 | 240 | 68552 | 3.50 |
| NorthernT | 2 | 521 | 3.84 | 120 | 30149 | 3.98 |
| SydHerald | 37 | 18111 | 2.04 | 697 | 319498 | 2.18 |
| Telegraph | 6 | 3636 | 1.65 | 186 | 70524 | 2.64 |
| WestAus | 3 | 1175 | 2.55 | 317 | 99040 | 3.20 |
Code
rbind({
  freq_per_subcorpus(person_first_annotated, group_vars = vars(source)) %>%
    mutate(language = "Person-first")
}, {
  freq_per_subcorpus(condition_first_annotated, group_vars = vars(source)) %>%
    mutate(language = "Condition-first")
}) %>%
  ggplot(aes(x = source,
             y = instances_per_1000,
             col = language)) +
  geom_point() +
  labs(x = "Source",
       y = "Frequency per 1000 words",
       caption = "Instances per 1000 words across all sources that use language type") +
  theme(axis.text.x = element_text(angle = 90))
We can also do this across source and year:
Code
inner_join({
  freq_per_subcorpus(person_first_annotated, group_vars = vars(source, year)) %>%
    rename("instances_person_first" = "total_instances",
           "wc_personfirst" = "total_wc",
           "person first instances per 1000 words" = "instances_per_1000")
}, {
  freq_per_subcorpus(condition_first_annotated, group_vars = vars(source, year)) %>%
    rename("instances_cond_first" = "total_instances",
           "wc_condfirst" = "total_wc",
           "condition first instances per 1000 words" = "instances_per_1000")
}) %>%
  kable(digits = 2)
Table 16: Number of instances of the two language types by source, year, total word count of articles that use the specified language type and the relative frequency of instances per 1000 words across all articles from that source that year.
| source | year | instances_person_first | wc_personfirst | person first instances per 1000 words | instances_cond_first | wc_condfirst | condition first instances per 1000 words |
|---|---|---|---|---|---|---|---|
| Advertiser | 2008 | 1 | 243 | 4.12 | 49 | 13109 | 3.74 |
| Advertiser | 2009 | 1 | 159 | 6.29 | 69 | 22082 | 3.12 |
| Advertiser | 2012 | 1 | 462 | 2.16 | 58 | 20250 | 2.86 |
| Advertiser | 2013 | 1 | 413 | 2.42 | 67 | 22586 | 2.97 |
| Advertiser | 2016 | 1 | 561 | 1.78 | 52 | 19254 | 2.70 |
| Advertiser | 2017 | 3 | 1571 | 1.91 | 40 | 12681 | 3.15 |
| Advertiser | 2019 | 1 | 246 | 4.07 | 21 | 7529 | 2.79 |
| Age | 2008 | 2 | 1316 | 1.52 | 68 | 36713 | 1.85 |
| Age | 2010 | 1 | 487 | 2.05 | 44 | 13081 | 3.36 |
| Age | 2011 | 2 | 2630 | 0.76 | 47 | 13058 | 3.60 |
| Age | 2012 | 2 | 953 | 2.10 | 49 | 27123 | 1.81 |
| Age | 2013 | 1 | 1319 | 0.76 | 34 | 21790 | 1.56 |
| Age | 2014 | 2 | 935 | 2.14 | 33 | 14387 | 2.29 |
| Age | 2015 | 3 | 2257 | 1.33 | 79 | 25206 | 3.13 |
| Age | 2016 | 2 | 1721 | 1.16 | 48 | 20528 | 2.34 |
| Age | 2017 | 2 | 973 | 2.06 | 26 | 14002 | 1.86 |
| Age | 2018 | 2 | 465 | 4.30 | 31 | 20118 | 1.54 |
| Age | 2019 | 2 | 530 | 3.77 | 10 | 6019 | 1.66 |
| Australian | 2009 | 2 | 1289 | 1.55 | 50 | 22444 | 2.23 |
| Australian | 2013 | 1 | 361 | 2.77 | 13 | 7129 | 1.82 |
| Australian | 2016 | 1 | 758 | 1.32 | 15 | 9868 | 1.52 |
| Australian | 2017 | 3 | 1188 | 2.53 | 9 | 5008 | 1.80 |
| Australian | 2018 | 1 | 1357 | 0.74 | 14 | 11260 | 1.24 |
| Australian | 2019 | 2 | 1415 | 1.41 | 11 | 6369 | 1.73 |
| BrisTimes | 2016 | 1 | 1101 | 0.91 | 3 | 1661 | 1.81 |
| CanTimes | 2009 | 1 | 246 | 4.07 | 23 | 8318 | 2.77 |
| CanTimes | 2012 | 1 | 539 | 1.86 | 32 | 14363 | 2.23 |
| CanTimes | 2013 | 2 | 1764 | 1.13 | 29 | 10193 | 2.85 |
| CanTimes | 2015 | 1 | 573 | 1.75 | 48 | 12869 | 3.73 |
| CanTimes | 2017 | 1 | 520 | 1.92 | 21 | 6585 | 3.19 |
| CanTimes | 2019 | 11 | 1132 | 9.72 | 8 | 2428 | 3.29 |
| CourierMail | 2008 | 1 | 580 | 1.72 | 92 | 44008 | 2.09 |
| CourierMail | 2010 | 1 | 543 | 1.84 | 72 | 29876 | 2.41 |
| CourierMail | 2013 | 1 | 425 | 2.35 | 60 | 24932 | 2.41 |
| CourierMail | 2014 | 2 | 550 | 3.64 | 51 | 13082 | 3.90 |
| CourierMail | 2015 | 1 | 293 | 3.41 | 50 | 15976 | 3.13 |
| CourierMail | 2016 | 4 | 1292 | 3.10 | 41 | 10375 | 3.95 |
| CourierMail | 2017 | 1 | 1077 | 0.93 | 44 | 11410 | 3.86 |
| CourierMail | 2018 | 1 | 258 | 3.88 | 38 | 18314 | 2.07 |
| HeraldSun | 2008 | 1 | 1815 | 0.55 | 90 | 31663 | 2.84 |
| HeraldSun | 2011 | 2 | 946 | 2.11 | 94 | 30705 | 3.06 |
| HeraldSun | 2013 | 4 | 975 | 4.10 | 62 | 22042 | 2.81 |
| HeraldSun | 2014 | 2 | 671 | 2.98 | 52 | 16593 | 3.13 |
| HeraldSun | 2016 | 2 | 1049 | 1.91 | 46 | 17646 | 2.61 |
| HeraldSun | 2017 | 1 | 1079 | 0.93 | 43 | 13259 | 3.24 |
| HeraldSun | 2018 | 1 | 358 | 2.79 | 31 | 10046 | 3.09 |
| HeraldSun | 2019 | 3 | 1137 | 2.64 | 14 | 4331 | 3.23 |
| HobMercury | 2009 | 1 | 405 | 2.47 | 23 | 7858 | 2.93 |
| HobMercury | 2017 | 1 | 268 | 3.73 | 9 | 2758 | 3.26 |
| NorthernT | 2014 | 1 | 92 | 10.87 | 11 | 1968 | 5.59 |
| NorthernT | 2015 | 1 | 429 | 2.33 | 9 | 1351 | 6.66 |
| SydHerald | 2008 | 1 | 432 | 2.31 | 71 | 29391 | 2.42 |
| SydHerald | 2009 | 2 | 2220 | 0.90 | 90 | 34411 | 2.62 |
| SydHerald | 2010 | 1 | 586 | 1.71 | 66 | 30403 | 2.17 |
| SydHerald | 2011 | 1 | 568 | 1.76 | 61 | 23340 | 2.61 |
| SydHerald | 2012 | 1 | 866 | 1.15 | 56 | 32263 | 1.74 |
| SydHerald | 2013 | 2 | 1994 | 1.00 | 52 | 29797 | 1.75 |
| SydHerald | 2014 | 1 | 622 | 1.61 | 45 | 24530 | 1.83 |
| SydHerald | 2015 | 3 | 1671 | 1.80 | 80 | 26517 | 3.02 |
| SydHerald | 2016 | 1 | 1101 | 0.91 | 68 | 26221 | 2.59 |
| SydHerald | 2017 | 3 | 1340 | 2.24 | 37 | 16969 | 2.18 |
| SydHerald | 2018 | 3 | 792 | 3.79 | 46 | 31308 | 1.47 |
| SydHerald | 2019 | 18 | 5919 | 3.04 | 25 | 14348 | 1.74 |
| Telegraph | 2014 | 1 | 452 | 2.21 | 45 | 17477 | 2.57 |
| Telegraph | 2016 | 2 | 967 | 2.07 | 36 | 13829 | 2.60 |
| Telegraph | 2017 | 2 | 1987 | 1.01 | 54 | 13954 | 3.87 |
| Telegraph | 2018 | 1 | 230 | 4.35 | 27 | 13466 | 2.01 |
| WestAus | 2010 | 1 | 488 | 2.05 | 39 | 10600 | 3.68 |
| WestAus | 2016 | 1 | 296 | 3.38 | 31 | 11382 | 2.72 |
| WestAus | 2017 | 1 | 391 | 2.56 | 16 | 3727 | 4.29 |
Code
rbind({
  freq_per_subcorpus(person_first_annotated, group_vars = vars(source, year)) %>%
    mutate(language = "Person-first")
}, {
  freq_per_subcorpus(condition_first_annotated, group_vars = vars(source, year)) %>%
    mutate(language = "Condition-first")
}) %>%
  ggplot(aes(x = as.factor(year),
             y = instances_per_1000,
             col = language)) +
  geom_jitter() +
  facet_wrap(~source) +
  labs(x = "Year",
       y = "Frequency per 1000 words",
       caption = "Instances per 1000 words across all sources & years that use language type") +
  theme(axis.text.x = element_text(angle = 90))
Note that above, the denominator included the word count of only those articles that featured the particular language type.
Next, we conduct the same analysis, but including all articles from that particular source, year or both (irrespective of whether they feature the language type).
Code
freq_entire_corpus <- function(df, group_vars) {
  df %>%
    group_by(!!!group_vars) %>%
    summarise(
      person_first_instances = sum(person_first),
      condition_first_instances = sum(condition_first),
      total_wc = sum(wordcount_total),
      person_first_per_1000 = 1000 * person_first_instances / total_wc,
      cond_first_per_1000 = 1000 * condition_first_instances / total_wc)
}

condition_person_comparison_together <- full_join(
  full_join({
    person_first_annotated %>%
      select(article_id, no_hits_in_text) %>%
      rename(person_first = no_hits_in_text)
  }, {
    condition_first_annotated %>%
      select(article_id, no_hits_in_text) %>%
      rename(condition_first = no_hits_in_text)
  }),
  {
    metadata_full %>%
      select(article_id, source, year, wordcount_total)
  }) %>%
  # fill NAs with 0
  mutate_if(is.numeric, coalesce, 0)

# now generate the instances
freq_entire_corpus(condition_person_comparison_together, vars(year)) %>%
  kable()
Table 17: Number of instances of the two language types by year, total word count of articles published that year and the relative frequency of instances per 1000 words across all articles from that source.
| year | person_first_instances | condition_first_instances | total_wc | person_first_per_1000 | cond_first_per_1000 |
|---|---|---|---|---|---|
| 2008 | 6 | 553 | 1863018 | 0.0032206 | 0.2968302 |
| 2009 | 7 | 507 | 1524207 | 0.0045926 | 0.3326320 |
| 2010 | 4 | 452 | 1454853 | 0.0027494 | 0.3106843 |
| 2011 | 5 | 447 | 1318579 | 0.0037920 | 0.3390013 |
| 2012 | 5 | 385 | 1280045 | 0.0039061 | 0.3007707 |
| 2013 | 12 | 382 | 1625785 | 0.0073810 | 0.2349634 |
| 2014 | 9 | 408 | 1271294 | 0.0070794 | 0.3209328 |
| 2015 | 9 | 436 | 1596597 | 0.0056370 | 0.2730808 |
| 2016 | 15 | 388 | 1156387 | 0.0129714 | 0.3355278 |
| 2017 | 18 | 309 | 1121553 | 0.0160492 | 0.2755108 |
| 2018 | 9 | 273 | 1211129 | 0.0074311 | 0.2254095 |
| 2019 | 37 | 137 | 1012594 | 0.0365398 | 0.1352961 |
Code
freq_entire_corpus(condition_person_comparison_together, vars(year)) %>%
  select(year, ends_with("per_1000")) %>%
  pivot_longer(cols = ends_with("1000"), names_to = "type", values_to = "value") %>%
  mutate(type = stringr::str_replace_all(type, "_per_1000", ""),
         type = case_when(type == "cond_first" ~ "Condition-first",
                          TRUE ~ "Person-first")) %>%
  ggplot(aes(x = as.factor(year), y = value, col = type)) +
  geom_point() +
  labs(x = "Year",
       y = "Instances per 1000 words in corpus by year",
       col = "")
We can see that if we consider the entire corpus, instances of condition-first language seem somewhat lower from 2017 onward.
Let’s look at the numbers by source, across all years:
Table 18: Number of instances of the two language types by source, total word count of articles published in that source in the corpus and the relative frequency of instances per 1000 words across all articles from that source.
| source | person_first_instances | condition_first_instances | total_wc | person_first_per_1000 | cond_first_per_1000 |
|---|---|---|---|---|---|
| Advertiser | 9 | 636 | 1750948 | 0.0051401 | 0.3632318 |
| Age | 21 | 509 | 2399436 | 0.0087521 | 0.2121332 |
| Australian | 10 | 283 | 1722950 | 0.0058040 | 0.1642532 |
| CanTimes | 17 | 342 | 1431478 | 0.0118758 | 0.2389139 |
| CourierMail | 12 | 607 | 1684101 | 0.0071255 | 0.3604297 |
| HeraldSun | 16 | 705 | 1867259 | 0.0085687 | 0.3775588 |
| HobMercury | 2 | 240 | 690782 | 0.0028953 | 0.3474323 |
| NorthernT | 2 | 120 | 301923 | 0.0066242 | 0.3974523 |
| SydHerald | 37 | 697 | 2901109 | 0.0127537 | 0.2402530 |
| WestAus | 3 | 317 | 892280 | 0.0033622 | 0.3552696 |
Code
freq_entire_corpus({
  condition_person_comparison_together %>%
    filter(!(source %in% c("Telegraph", "BrisTimes")))
}, vars(source)) %>%
  select(source, ends_with("per_1000")) %>%
  pivot_longer(cols = ends_with("1000"), names_to = "type", values_to = "value") %>%
  mutate(type = stringr::str_replace_all(type, "_per_1000", ""),
         type = case_when(type == "cond_first" ~ "Condition-first",
                          TRUE ~ "Person-first")) %>%
  ggplot(aes(x = source, y = value, col = type)) +
  geom_point() +
  labs(x = "Source",
       y = "Instances per 1000 words in corpus by source",
       col = "") +
  theme(axis.text.x = element_text(angle = 90))
We can see that usage of condition-first language is quite varied by source.
Table 19: Relative frequency per 1000 words of person-first and condition-first language by source and year, considering the word count of all articles in the corpus from that source in that year.
We can investigate the prevalence of condition-first language using goodness of fit tests, comparing their distribution in:
tabloids vs broadsheets
left and right leaning publications
We can do this by looking at:
the total number of instances in the subcorpus (a minimal sketch of this test follows after this list)
the number of articles that feature this language type
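The helper behind the result tables below is not shown in this notebook; the following is a minimal sketch of such a goodness-of-fit test, using the condition-first counts and expected values reported in Table 20 (the expected proportions correspond to the word-count split of the two subcorpora).
Code
# Goodness-of-fit test: are condition-first hits split across broadsheets and
# tabloids in proportion to the word counts of the two subcorpora?
observed <- c(broadsheet = 1866, tabloid = 2811)  # hits per subcorpus (Table 20)
# expected proportions, here recovered from the expected counts in Table 20;
# in the notebook they would come directly from the subcorpus word counts
expected_p <- c(2465.35, 2211.65) / (2465.35 + 2211.65)
chisq.test(observed, p = expected_p)              # X-squared ~308, p < 0.001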
Tabloids vs broadsheets
The total number of uses of condition-first language we observe is higher in tabloids and lower in broadsheets than we would expect based on the word count in these subcorpora (p < 0.001).
Table 20: Results of goodness of fit test of the number of uses of condition-first language we observe in tabloids and broadsheets relative to the total word count across these two types of sources.
| variable | value |
|---|---|
| method | Chi-squared test for given probabilities |
| parameter | 1 |
| statistic | 308.1251 |
| p.value | 5.593314e-69 |
| broadsheet_observed | 1866 |
| broadsheet_expected | 2465.34566596664 |
| tabloid_observed | 2811 |
| tabloid_expected | 2211.65433403336 |
The number of articles with condition-first language we observe is also higher in tabloids and lower in broadsheets than we would expect based on the total article count in these subcorpora (p < 0.001).
Table 21: Results of goodness of fit test of the number of articles that use condition-first language we observe in tabloids and broadsheets relative to the total word count across these two types of sources.
| variable | value |
|---|---|
| method | Chi-squared test for given probabilities |
| parameter | 1 |
| statistic | 25.73612 |
| p.value | 3.914328e-07 |
| broadsheet_observed | 1170 |
| broadsheet_expected | 1311.25451974162 |
| tabloid_observed | 2038 |
| tabloid_expected | 1896.74548025838 |
Left vs right-leaning publications
The total number of uses of condition-first language we observe is higher in right and lower in left-leaning publications than we would expect based on the word count in these subcorpora (p < 0.001).
Table 22: Results of goodness of fit test of the number of uses of condition-first language we observe in right and left leaning publications relative to the total word count across these two types.
| variable | value |
|---|---|
| method | Chi-squared test for given probabilities |
| parameter | 1 |
| statistic | 134.72 |
| p.value | 3.801809e-31 |
| left_observed | 1583 |
| left_expected | 1975.0671889295 |
| right_observed | 3094 |
| right_expected | 2701.9328110705 |
The number of articles with condition-first language we observe is also higher in right- and lower in left-leaning publications than we would expect based on the total article count in these subcorpora (p < 0.001).
Table 23: Results of goodness of fit test of the number of articles that use condition-first language we observe in left and right leaning publications relative to the total word count across these two types.
| variable | value |
|---|---|
| method | Chi-squared test for given probabilities |
| parameter | 1 |
| statistic | 11.84526 |
| p.value | 0.0005780843 |
| left_observed | 979 |
| left_expected | 1070.92734013683 |
| right_observed | 2229 |
| right_expected | 2137.07265986317 |
Person-first language use
Tabloids vs broadsheets
The total number of uses of person-first language we observe is somewhat higher in tabloids and lower in broadsheets than we would expect based on the word count in these subcorpora, but this result is not strongly significant (p < 0.05).
Table 24: Results of goodness of fit test of the number of uses of person-first language we observe in tabloids and broadsheets relative to the total word count across these two types of sources.
| variable | value |
|---|---|
| method | Chi-squared test for given probabilities |
| parameter | 1 |
| statistic | 6.041885 |
| p.value | 0.01397036 |
| broadsheet_observed | 86 |
| broadsheet_expected | 71.6884777788033 |
| tabloid_observed | 50 |
| tabloid_expected | 64.3115222211967 |
The number of articles with person-first language we observe is, in contrast, lower in tabloids and higher in broadsheets than we would expect based on the total article count in these subcorpora (p < 0.002).
Table 25: Results of goodness of fit test of the number of articles that use person-first language we observe in tabloids and broadsheets relative to the total word count across these two types of sources.
| variable | value |
|---|---|
| method | Chi-squared test for given probabilities |
| parameter | 1 |
| statistic | 9.588964 |
| p.value | 0.001957503 |
| broadsheet_observed | 59 |
| broadsheet_expected | 43.3269884952032 |
| tabloid_observed | 47 |
| tabloid_expected | 62.6730115047968 |
Left vs right-leaning publications
The total number of uses of person-first language we observe is higher in left and lower in right-leaning publications than we would expect based on the word count in these subcorpora (p < 0.002).
Table 26: Results of goodness of fit test of the number of uses of person-first language we observe in right and left leaning publications relative to the total word count across these two types.
| variable | value |
|---|---|
| method | Chi-squared test for given probabilities |
| parameter | 1 |
| statistic | 10.39137 |
| p.value | 0.001266055 |
| left_observed | 76 |
| left_expected | 57.4319302318606 |
| right_observed | 60 |
| right_expected | 78.5680697681394 |
The number of articles with person-first language we observe is also higher in left- and lower in right-leaning publications than we would expect based on the total article count in these subcorpora (p < 0.002).
Table 27: Results of goodness of fit test of the number of articles that use person-first language we observe in left and right leaning publications relative to the total word count across these two types.
| variable | value |
|---|---|
| method | Chi-squared test for given probabilities |
| parameter | 1 |
| statistic | 10.34217 |
| p.value | 0.00130025 |
| left_observed | 51 |
| left_expected | 35.3860031341972 |
| right_observed | 55 |
| right_expected | 70.6139968658029 |
Condition-first language across time
As we discussed, we have sufficient data to explore the use of condition-first language across time and by type of publication, except for the Brisbane Times and Daily Telegraph, for which we are missing data from 2008-2013:
Code
assess_year_source(condition_first_annotated)
Table 28: Number of articles that use condition-first language by year and source.
Let’s look at the number of articles by publication and year. We can see that this number is declining; however, this is likely to be attributable to the overall decline in the number of articles featuring obes*, as discussed in the exploratory data analysis section.
Let’s look at the difference in frequency across time (only the variability of the frequency, not its absolute values, should be sensitive to the number of articles per year):
We can start by using a jitter plot:
Code
condition_first_annotated %>%
  ggplot(aes(x = as.factor(year),
             y = log(frequency),
             fill = year)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth(aes(group = source), col = "blue", method = "loess") +
  geom_hline(yintercept = 1, col = "red", lty = 3) +
  facet_wrap(~source) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = "NA") +
  labs(
    x = "Year",
    y = "log(frequency per 1000 words)")
Note that the dashed red line is always at the same position (with a value of exp(1) = 2.72). Comparing it with the blue line of best fit for each source for which we have complete data suggests that visually we cannot discern strong trends in the use of condition-first language across the study time period, so using variability-based neighbor clustering (VNC) is unlikely to provide meaningful results for this research question.
We can see that the Advertiser seems to have higher median frequencies than others, as does the Northern Territorian. Let’s look at it grouped as tabloid vs broadsheet (with outliers not shown):
Code
condition_first_annotated %>%
  select(year, source, source_type, frequency) %>%
  mutate(year = as.factor(year)) %>%
  group_by(year, source_type) %>%
  ggplot(aes(x = year, y = frequency, fill = source_type)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = quantile(condition_first_annotated$frequency, c(0.05, 0.95))) +
  labs(
    x = "",
    y = "Frequency per thousand words",
    fill = "Source type")
It appears that median frequency in tabloids is somewhat higher, although the intervals do overlap across all years.
Code
condition_first_annotated %>%
  ggplot(aes(x = as.factor(year),
             y = log(frequency),
             fill = year)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth(aes(group = source_type), col = "blue", method = "lm") +
  geom_hline(yintercept = 1, col = "red", lty = 3) +
  facet_wrap(~source_type) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = "NA") +
  labs(
    x = "Year",
    y = "log(frequency per 1000 words)")
We can see that the frequency seems to decrease in broadsheets but not in tabloids across years.
Let’s quickly look at differences by month:
Code
condition_first_annotated_for_modelling %>%
  select(month_metadata, source, frequency) %>%
  ggplot(aes(y = frequency, x = month_metadata)) +
  geom_violin()
The frequency doesn’t seem to differ month to month when visualised using violin or box plots.
We will use a linear mixed effects model to consider whether there are differences in the frequency of condition-first language use in broadsheets and tabloids across years, including whether there are differences in specific publications. We will also fit simple linear models for comparison.
When constructing the model we will:
Use log(frequency) as the dependent variable, as this is normally distributed
Center and scale the date
Code
condition_first_annotated_for_modelling$scaled_year <-
  scale(condition_first_annotated_for_modelling$year, scale = F)

library(broom.mixed)

# base model
m_0_base <- glm(log(frequency) ~ 1, family = gaussian,
                data = condition_first_annotated_for_modelling)
# with year
m_0_year <- glm(log(frequency) ~ scaled_year, family = gaussian,
                data = condition_first_annotated_for_modelling)
# with year and source type
m_0_yearsourcetype <- glm(log(frequency) ~ scaled_year + source_type, family = gaussian,
                          data = condition_first_annotated_for_modelling)
# with year and source
m_0_yearsource <- glm(log(frequency) ~ scaled_year + source, family = gaussian,
                      data = condition_first_annotated_for_modelling)
# with source as a random intercept
m_0_source <- lmer(log(frequency) ~ 1 + (1 | source), REML = T,
                   data = condition_first_annotated_for_modelling)
Does including a random intercept for each source improve our model?
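Table 29 looks like one broom::glance() row per model; a minimal sketch of how it might be assembled, using the model objects fitted above (labels paraphrased):
Code
# Collect fit statistics for each candidate model and sort by AIC
dplyr::bind_rows(
  broom::glance(m_0_year) %>% mutate(model = "With year"),
  broom::glance(m_0_yearsourcetype) %>% mutate(model = "With year & source type"),
  broom::glance(m_0_yearsource) %>% mutate(model = "With year & source"),
  broom::glance(m_0_base) %>% mutate(model = "Base"),
  broom.mixed::glance(m_0_source) %>% mutate(model = "With source")
) %>%
  select(logLik, AIC, BIC, df.residual, nobs, model) %>%
  arrange(AIC) %>%
  kable()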
Table 29: Log-likelihood, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), residual number of degrees of freedom and number of observations used to fit a range of general linear models.
| logLik | AIC | BIC | df.residual | nobs | model |
|---|---|---|---|---|---|
| -3749.974 | 7523.948 | 7596.191 | 3031 | 3042 | With year & source |
| -3766.832 | 7541.663 | 7565.744 | 3039 | 3042 | With year & source type |
| -3774.515 | 7555.030 | 7573.091 | 3039 | 3042 | With source |
| -3876.393 | 7758.785 | 7776.846 | 3040 | 3042 | With year |
| -3878.926 | 7761.851 | 7773.892 | 3041 | 3042 | Base |
Yes, it seems that the AIC and BIC are reduced while the logLik is higher for the model that includes source and year. So, yes, it seems using a random effects model for source may be an option.
Now let’s build several different random effects models:
Including year as a fixed effect
Including each specific source (random effect) individually and year
Code
# library(afex)
m_1_base <- lmer(log(frequency) ~ 1 + (1 | source),
                 data = condition_first_annotated_for_modelling, REML = FALSE,
                 control = lmerControl(optimizer = 'optimx', optCtrl = list(method = 'nlminb')))
# random intercept for each source
m_1_year <- lmer(log(frequency) ~ scaled_year + (1 | source),
                 data = condition_first_annotated_for_modelling, REML = FALSE,
                 control = lmerControl(optimizer = 'optimx', optCtrl = list(method = 'nlminb')))
# random intercept for each source, with source type as a fixed effect
m_1_year_sourcetype <- lmer(log(frequency) ~ scaled_year + source_type + (1 | source),
                            data = condition_first_annotated_for_modelling, REML = FALSE,
                            control = lmerControl(optimizer = 'optimx', optCtrl = list(method = 'nlminb')))
# random slope and intercept for each source
m_1_yearsource <- lmer(log(frequency) ~ scaled_year + (scaled_year | source),
                       data = condition_first_annotated_for_modelling, REML = FALSE,
                       control = lmerControl(optimizer = 'optimx', optCtrl = list(method = 'nlminb')))
# random slope and intercept for each source, with source type as a fixed effect
m_1_full <- lmer(log(frequency) ~ scaled_year + source_type + (scaled_year | source),
                 data = condition_first_annotated_for_modelling, REML = FALSE,
                 control = lmerControl(optimizer = 'optimx', optCtrl = list(method = 'nlminb')))
# random intercept for each source type
m_1_year_sourcetype_nosource <- lmer(log(frequency) ~ scaled_year + (1 | source_type),
                                     data = condition_first_annotated_for_modelling, REML = FALSE,
                                     control = lmerControl(optimizer = 'optimx', optCtrl = list(method = 'nlminb')))

# use the all_fit function to assess which optimisers work
# all_fit(m_1_yearsource)
# m_1_yearsource_apex <-
#   mixed(log(frequency) ~ scaled_year + (scaled_year|source),
#         data = condition_first_annotated_for_modelling,
#         method = "PB",
#         REML = FALSE,
#         control = lmerControl(optimizer = 'optimx', optCtrl = list(method = 'nlminb')))
# m_1_yearsource_apex
We end up needing to use the nlminb optimiser from the optimx library (originally used by lme4), as the fit with the default optimiser fails to converge for the most complex model.
Table 30: Log-likelihood, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), residual number of degrees of freedom and number of observations used to fit a range of random effects models.
| nobs | sigma | logLik | AIC | BIC | deviance | df.residual | model |
|---|---|---|---|---|---|---|---|
| 3042 | 0.8284958 | -3755.568 | 7525.136 | 7567.277 | 7511.136 | 3035 | scaled_year + source_type + (scaled_year|source) |
| 3042 | 0.8315654 | -3761.807 | 7533.613 | 7563.715 | 7523.613 | 3037 | scaled_year + source_type + (1|source) |
| 3042 | 0.8281507 | -3764.163 | 7540.327 | 7576.448 | 7528.327 | 3036 | scaled_year + (scaled_year|source) |
| 3042 | 0.8314829 | -3771.826 | 7551.651 | 7575.732 | 7543.651 | 3038 | scaled_year + (1|source) |
| 3042 | 0.8317932 | -3773.010 | 7552.021 | 7570.082 | 7546.021 | 3039 | 1 + (1|source) |
| 3042 | 0.8349904 | -3772.598 | 7553.197 | 7577.278 | 7545.197 | 3038 | scaled_year + (1|source_type) |
The full model (scaled_year + source_type + (scaled_year|source)) has the lowest AIC and the highest log-likelihood among the mixed effects models. However, its AIC is not that different (7525 vs 7524) from that of the simpler model scaled_year + source, which has a slightly lower AIC and a higher logLik, but also a higher BIC.
Let’s compare the two models: the full mixed effects model and the simple scaled_year + source model.
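The narrated summaries below are in the style of the report package; a minimal sketch of the comparison, using the model objects fitted above:
Code
report::report(m_0_yearsource)  # simple linear model: scaled_year + source
report::report(m_1_full)        # full mixed model: scaled_year + source_type + (scaled_year|source)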
We fitted a linear model (estimated using ML) to predict frequency with scaled_year and source (formula: log(frequency) ~ scaled_year + source). The model’s intercept, corresponding to scaled_year = 0 and source = Advertiser, is at 1.28 (95% CI [1.21, 1.36], t(3031) = 32.98, p < .001). Within this model:
The effect of scaled year is statistically non-significant and negative (beta = -7.05e-03, 95% CI [-0.02, 2.13e-03], t(3031) = -1.50, p = 0.132; Std. beta = -9.82e-03, 95% CI [-0.02, 3.54e-03])
The effect of source [Age] is statistically significant and negative (beta = -0.45, 95% CI [-0.57, -0.33], t(3031) = -7.43, p < .001; Std. beta = -0.19, 95% CI [-0.25, -0.14])
The effect of source [Australian] is statistically significant and negative (beta = -0.65, 95% CI [-0.79, -0.51], t(3031) = -9.12, p < .001; Std. beta = -0.27, 95% CI [-0.33, -0.21])
The effect of source [CanTimes] is statistically significant and negative (beta = -0.37, 95% CI [-0.50, -0.23], t(3031) = -5.33, p < .001; Std. beta = -0.18, 95% CI [-0.24, -0.12])
The effect of source [CourierMail] is statistically significant and negative (beta = -0.14, 95% CI [-0.25, -0.03], t(3031) = -2.48, p = 0.013; Std. beta = -0.07, 95% CI [-0.12, -0.02])
The effect of source [HeraldSun] is statistically non-significant and negative (beta = -0.06, 95% CI [-0.17, 0.04], t(3031) = -1.18, p = 0.240; Std. beta = -0.05, 95% CI [-0.09, -2.66e-04])
The effect of source [HobMercury] is statistically non-significant and positive (beta = 0.13, 95% CI [-0.02, 0.27], t(3031) = 1.68, p = 0.092; Std. beta = 0.05, 95% CI [-0.01, 0.12])
The effect of source [NorthernT] is statistically non-significant and positive (beta = 0.16, 95% CI [-0.02, 0.35], t(3031) = 1.72, p = 0.086; Std. beta = 0.04, 95% CI [-0.04, 0.12])
The effect of source [SydHerald] is statistically significant and negative (beta = -0.53, 95% CI [-0.64, -0.42], t(3031) = -9.42, p < .001; Std. beta = -0.23, 95% CI [-0.28, -0.19])
The effect of source [WestAus] is statistically non-significant and positive (beta = 0.02, 95% CI [-0.12, 0.15], t(3031) = 0.24, p = 0.808; Std. beta = -0.02, 95% CI [-0.08, 0.04])
Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using a Wald t-distribution approximation.
If we use the BIC as our model selection criterion instead, the model of the form scaled_year + source_type + (1|source) has the lowest BIC:
Table 31: Log-likelihood, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), residual number of degrees of freedom and number of observations used to fit a range of random effects models.
| nobs | sigma | logLik | AIC | BIC | deviance | df.residual | model |
|---|---|---|---|---|---|---|---|
| 3042 | 0.8315654 | -3761.807 | 7533.613 | 7563.715 | 7523.613 | 3037 | scaled_year + source_type + (1|source) |
| 3042 | 0.8284958 | -3755.568 | 7525.136 | 7567.277 | 7511.136 | 3035 | scaled_year + source_type + (scaled_year|source) |
| 3042 | 0.8317932 | -3773.010 | 7552.021 | 7570.082 | 7546.021 | 3039 | 1 + (1|source) |
| 3042 | 0.8314829 | -3771.826 | 7551.651 | 7575.732 | 7543.651 | 3038 | scaled_year + (1|source) |
| 3042 | 0.8281507 | -3764.163 | 7540.327 | 7576.448 | 7528.327 | 3036 | scaled_year + (scaled_year|source) |
| 3042 | 0.8349904 | -3772.598 | 7553.197 | 7577.278 | 7545.197 | 3038 | scaled_year + (1|source_type) |
We obtain a result similar to that of the simpler model:
We fitted a linear mixed model (estimated using ML and optimx optimizer) to predict frequency with scaled_year and source_type (formula: log(frequency) ~ scaled_year + source_type). The model included source as random effect (formula: ~1 | source). The model’s total explanatory power is weak (conditional R2 = 0.09) and the part related to the fixed effects alone (marginal R2) is of 0.08. The model’s intercept, corresponding to scaled_year = 0 and source_type = broadsheet, is at 0.79 (95% CI [0.69, 0.88], t(3037) = 16.43, p < .001). Within this model:
The effect of scaled year is statistically non-significant and negative (beta = -6.94e-03, 95% CI [-0.02, 2.23e-03], t(3037) = -1.48, p = 0.138; Std. beta = -9.60e-03, 95% CI [-0.02, 3.73e-03])
The effect of source type [tabloid] is statistically significant and positive (beta = 0.50, 95% CI [0.38, 0.62], t(3037) = 8.06, p < .001; Std. beta = 0.20, 95% CI [0.16, 0.25])
Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using a Wald t-distribution approximation.
To summarise:
The effect of year was not found to be significant.
Relative to the Advertiser, the Age, Australian, Canberra Times, Courier Mail and Sydney Morning Herald had a lower frequency of condition-first language.