Exploring usage of FAT with a negative connotation

Author

Darya Vanichkina

Exploring usage of the lemma FAT

There is interest in exploring statistics around the use of the word “fat” with a negative connotation around the following research questions:

  • Do tabloids use “fat” with a negative connotation more than broadsheets?
  • Or do broadsheets use “fat” with a negative connotation more than tabloids?
  • Or is there no discernible pattern of use?
  • Is there any newspaper that uses fat with a negative connotation more than others?
  • Any year where fat with a negative connotation is the most frequent/least frequent?

Additional question:

  • Is there a difference of use by primary topic?

Executive summary

1. Tabloids use the word “fat” in a negative context more frequently than broadsheets.

  • More specifically, a Welch Two Sample t-test testing the difference between the frequency in broadsheets per 1000 words and frequency in tabloids (mean of broadsheets = 2.62, mean of tabloids = 4.49) suggests that the effect is negative, statistically significant, and small (difference = -1.87, 95% CI [-2.18, -1.56], t(2251.24) = -11.73, p < .001; Cohen’s d = -0.47, 95% CI [-0.55, -0.39])

  • The total number of uses of “fat” we observe is higher in tabloids and lower in broadsheets than we would expect based on the word count in these subcorpora (p < 0.001).

  • The number of articles with use of “fat” we observe is not different between tabloids and broadsheets.

2. Tabloids have shorter article lengths than broadsheets, as demonstrated in the analysis of obese.

3. In this case (but not that of obese or overweight), bootstrap re-sampling of the counts of use of the word “fat” in a negative connotation revealed that raw counts were higher in broadsheets than in tabloids (so the opposite of the above result), in part due to more articles in broadsheets having 7+ uses vs only tabloids (23 vs 10 articles, respectively). So the result described above is in part attributable to the longer length of articles in broadsheets vs tabloids.

4. Investigating the data by source and year revealed that these variables explained a small amount of variance in the data, with the word fat being used less frequently in the Age, Australian, Canberra Times and Sydney Morning Herald had a lower frequency of use of the word “fat” relative to the Advertiser, while in the Northern Territorian the word “fat” was used somewhat more frequently than in the Advertiser. The word “fat” was not used differently in the corpus across the time period.

4. Significant differences in the use of “fat” were observed in articles on different topics, with articles annotated as “Awards”, “Women and Pregnancy” and “Bio-medical Research” using more instances per 1000 words than articles discussing “Students and teachers”, “Politics” and “Food”.

5. Significant differences in article topics were also observed between tabloids and broadsheets, as discussed in the analysis of obese.

Data source

CQPweb data was provided. To calculate normalised frequency, we divide the number of observations from CQPweb by the word count as calculated in Python, and multiple by 1000.

Code
adj_fat <- read_csv(here("200_data_clean", "fat_annotated.csv"))
topic_labels <- read_csv(here("100_data_raw", "topic_labels.csv"))
fat_annotated <- inner_join(
  adj_fat, topic_labels, by = c("dominant_topic" = "topic_number")) %>% 
  mutate(frequency = 10^3*no_hits_in_text/wordcount_total) 
metadata <- read_csv(here("100_data_raw", "corpus_cqpweb_metadata.csv"))
additional_source_metadata <- read_csv(here("100_data_raw", "addition_source_metadata.csv"))
topic_labels <- read_csv(here("100_data_raw", "topic_labels.csv"))
metadata_full <- inner_join(inner_join(metadata,
                                       topic_labels,
                                       by = c("dominant_topic" = "topic_number")),
                            additional_source_metadata)

We group articles into tabloids and broadsheets, and by orientation, in the following manner:

Code
metadata_full |>
  select(source, source_type, orientation) |>
  distinct() |>
  kable()
Table 1: Classification of sources into types and by orientation.
source source_type orientation
Advertiser tabloid right
Australian broadsheet right
NorthernT tabloid right
CourierMail tabloid right
Age broadsheet left
SydHerald broadsheet left
Telegraph tabloid right
WestAus tabloid right
CanTimes broadsheet left
HeraldSun tabloid right
HobMercury tabloid right
BrisTimes broadsheet left

Tabloid vs broadsheet

What is the distribution of “fat” used with a negative connotation in articles?

Code
fat_annotated %>%
  ggplot(aes(x = no_hits_in_text, fill = source_type)) +
  geom_bar(position = "dodge2") +
  labs(
    x = "Number of hits in article, CQPWeb",
    y = "Number of articles in corpus",
    fill = "Source type"
  )
no_articles_broadsheet <- metadata_full %>% 
  filter(source_type == "broadsheet") %>% nrow()
no_articles_tabloid <- metadata_full %>% 
  filter(source_type == "tabloid") %>% nrow()

Figure 1: Number of uses of FAT across articles from broadsheets and tabloids.

It seems like there is more usage in tabloids than broadsheets. Let’s divide the numbers by the total number of articles in tabloids/broadsheets, respectively.

Code
fat_annotated %>%
  group_by(source_type, no_hits_in_text) %>%
  count() %>%
  pivot_wider(names_from = source_type, values_from = n, values_fill = 0) %>%
  mutate(broadsheet = 1000*broadsheet/no_articles_broadsheet,
         tabloid = 1000*tabloid/no_articles_tabloid) %>%
  pivot_longer(cols = c(broadsheet, tabloid)) %>%
  ggplot(aes(x = no_hits_in_text, y=value, fill = name)) +
  geom_bar(position = "dodge2", stat="identity") +
  labs(
    x = "Number of hits in article, CQPWeb",
    y = "Normalised number of articles, per 1000, in corpus",
    fill = "Source type"
  )

Figure 2: Number of uses of FAT across articles from broadsheets and tabloids, normalised to the number of articles in tabloids and broadsheets in which FAT is used.

We can see that the trend of higher usage holds true for tabloids even if we consider that tabloids contribute more articles to the corpus than broadsheets.

How is this usage distributed by year (number of articles in corpus)?

Code
fat_annotated_filtered <- fat_annotated %>%
  # these sources are already not present in the data, but we will
  # retain this code chunk as it keeps consistency of variable names with
  # other files in work, including for obese and overweight
  filter(!(source %in% c("BrisTimes", "Telegraph")))
assess_year_source(fat_annotated)
Table 2: Number of articles that use FAT by year and source.
source 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 Total
Advertiser 34 39 27 28 29 39 25 28 17 18 11 9 304
Age 53 24 17 16 24 22 23 24 18 15 13 3 252
Australian 27 21 18 19 11 10 19 6 10 13 9 9 172
CanTimes 19 12 11 19 13 15 20 14 10 10 12 1 156
CourierMail 48 43 26 30 27 32 29 27 24 22 17 14 339
HeraldSun 65 54 50 50 36 35 26 18 24 18 9 6 391
HobMercury 19 14 12 12 9 11 5 3 6 9 8 4 112
NorthernT 19 14 9 4 6 7 10 0 5 7 2 2 85
SydHerald 41 43 38 27 31 29 27 27 25 22 18 10 338
WestAus 26 24 17 20 14 13 15 5 11 3 1 2 151
Total 351 288 225 225 200 213 199 152 150 137 100 60 2300

How is the frequency (per thousand words) of the usage of fat distributed by tabloids/broadsheets?

Code
fat_annotated_filtered %>%
  ggplot(aes(x = source_type, y = log(frequency))) + 
  geom_boxplot() + 
  labs(
    x = "Source type",
    y = "log(frequency per 1000 words)"
  )

Figure 3: Frequency per 1000 words of usage of FAT in tabloids and broadsheets.

We can see that the frequency is slightly higher in tabloids than in broadsheets.

And let’s also use a histogram to look at the distribution:

Code
fat_annotated_filtered %>%
  ggplot(aes(fill = source_type,x = log(frequency))) + 
  geom_histogram() + 
  labs(
    fill = "Source type",
    x = "log(frequency per 1000 words)",
    y = "Number of articles"
  )

Figure 4: Histogram of the distribution of usage of FAT in tabloids and broadsheets.

Note that broadsheets have somewhat longer texts than tabloids:

Code
fat_annotated_filtered %>%
  ggplot(aes(x = source_type, y = log(wordcount_total))) + 
  geom_boxplot() + 
  labs(
    x = "Source type",
    y = "log(Python word count)"
  )

Figure 5: Boxplot length of articles in tabloids and broadsheets that use FAT.

Let’s use a histogram to look at the distribution in more detail:

Code
fat_annotated_filtered %>%
  ggplot(aes(fill = source_type, x = log(wordcount_total))) + 
  geom_histogram(binwidth = 0.1)+ 
  labs(
    y = "Number of articles",
    x = "log(Python word count)"
  )

Figure 6: Histogram of article lengths in tabloids and broadsheets that use the lemma FAT.

The log-transformed word count data is approximately normally distributed.

Let’s see if the difference in length of articles using the word “fat” in tabloids and broadsheets is significant?

A higher frequency per 1000 words is detected for usage of the word “fat” in tabloids than broadsheets.

Code
frequency_broadsheet <- fat_annotated_filtered[fat_annotated_filtered$source_type == "broadsheet","frequency"] %>% pull()
frequency_tabloid <- fat_annotated_filtered[fat_annotated_filtered$source_type == "tabloid","frequency"] %>% pull()
report::report(t.test(frequency_broadsheet, frequency_tabloid))

The Welch Two Sample t-test testing the difference between frequency_broadsheet and frequency_tabloid (mean of x = 2.62, mean of y = 4.49) suggests that the effect is negative, statistically significant, and small (difference = -1.87, 95% CI [-2.18, -1.56], t(2251.24) = -11.73, p < .001; Cohen’s d = -0.47, 95% CI [-0.55, -0.39])

The non-parametric FP test supports this observation:

Code
fp_test(
  wc1 = frequency_broadsheet,
    wc2 = frequency_tabloid,
    label1 = "broadsheet",
    label2 = "tabloid",
    dist = mydistribution
)

    Approximative Two-Sample Fisher-Pitman Permutation Test
data:  wc by label (broadsheet, tabloid)
Z = -10.336, p-value < 1e-04
alternative hypothesis: true mu is not equal to 0

So, yes, the frequency of use of the word “fat” is lower in broadsheets than in tabloids.

Let’s see if the difference in length of articles using the word “fat” in tabloids and broadsheets is significant?

Code
wordcount_broadsheet <- fat_annotated_filtered[fat_annotated_filtered$source_type == "broadsheet","wordcount_total", drop = T]
wordcount_tabloid <- fat_annotated_filtered[fat_annotated_filtered$source_type == "tabloid","wordcount_total", drop = T]
report::report(t.test(wordcount_broadsheet,wordcount_tabloid))

The Welch Two Sample t-test testing the difference between wordcount_broadsheet and wordcount_tabloid (mean of x = 922.40, mean of y = 573.96) suggests that the effect is positive, statistically significant, and medium (difference = 348.45, 95% CI [301.88, 395.01], t(1740.32) = 14.68, p < .001; Cohen’s d = 0.64, 95% CI [0.55, 0.72])

Yes, overall articles that use the word “fat” in tabloids are significantly shorter than in broadsheets.

The non-parametric FP test supports this observation:

Code
fp_test(
  wc1 = wordcount_broadsheet,
    wc2 = wordcount_tabloid,
    label1 = "broadsheet",
    label2 = "tabloid",
    dist = mydistribution
)

    Approximative Two-Sample Fisher-Pitman Permutation Test
data:  wc by label (broadsheet, tabloid)
Z = 14.462, p-value < 1e-04
alternative hypothesis: true mu is not equal to 0

Let’s use bootstrapping to see if the raw frequencies of usage of the word “fat” are different?

Code
broadsheet_counts <- NULL
tabloid_counts <- NULL
for (i in 1:10000) {
  x <- mean(sample({fat_annotated_filtered %>% 
        filter(source_type == "broadsheet") %>%
      # note sampling 450 here as this is approx 1/2 of broadsheet
        pull(no_hits_in_text)}, 450, replace = FALSE))
  y <- mean(sample({fat_annotated_filtered %>% 
        filter(source_type == "tabloid") %>%
        pull(no_hits_in_text)}, 450, replace = FALSE))
  broadsheet_counts <- c(broadsheet_counts, x)
  tabloid_counts <- c(tabloid_counts, y)
}
counts_comparison <- data.frame(
  mean_sample = c(broadsheet_counts, tabloid_counts),
  source_type = c(
    rep("broadsheet", length(broadsheet_counts)),
    rep("tabloid", length(tabloid_counts))))
counts_comparison %>%
  ggplot(aes(x = mean_sample, fill = source_type)) + 
  geom_histogram()

Figure 7: Histogram of sampling of counts per article of the raw frequency of use of FAT in tabloids and broadsheets.

It is interesting how these two distributions do not overlap, and the raw frequencies of the use of the words “fat” in a negative connotation are higher in broadsheets than tabloids. However, the longer length of the articles from broadsheets most likely results in a lower frequency per 1000 words of counts from broadsheets than tabloids.

Code
report::report(
  t.test(
    broadsheet_counts,
    tabloid_counts
  )
)

The Welch Two Sample t-test testing the difference between broadsheet_counts and tabloid_counts (mean of x = 1.85, mean of y = 1.48) suggests that the effect is positive, statistically significant, and large (difference = 0.37, 95% CI [0.37, 0.37], t(17591.34) = 493.31, p < .001; Cohen’s d = 6.98, 95% CI [7.67, 7.67])

The non-parametric FP test supports this observation:

Code
fp_test(
  wc1 = broadsheet_counts,
    wc2 = tabloid_counts,
    label1 = "broadsheet",
    label2 = "tabloid",
    dist = mydistribution
)

    Approximative Two-Sample Fisher-Pitman Permutation Test
data:  wc by label (broadsheet, tabloid)
Z = 135.94, p-value < 1e-04
alternative hypothesis: true mu is not equal to 0

Let’s look at the jittered plot to explore this as well:

Code
fat_annotated_filtered %>%
#| fig-cap: Raw and smoothed frequency of use of FAT in the corpus in tabloids and broadsheets, by year and publication type.
#| label: fig-l10
  ggplot(aes(x = as.factor(year), 
             y = log(frequency), 
             fill = year)) + 
  geom_jitter(alpha = 0.2) +
  geom_smooth(aes(group = source_type), col = "blue", method = "loess") +
  geom_hline(yintercept = 1, col = "red", lty = 3, size = 3) + 
  facet_wrap(~source_type) + 
  theme(axis.text.x = 
          element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "NA") + 
  labs(
    x = "Year",
    y = "log(frequency per 1000 words)"
  )

We can see that:

  • Broadsheets do have a lower frequency per 1000 words than tabloids.
  • There may be an effect of year for broadsheets but not tabloids with time, with relative frequency decreasing by year.

Comparing observed to normalised to subcorpus size data

We can investigate the prevalence of the use of “fat” using goodness of fit tests, comparing the distribution in:

  • tabloids vs broadsheets
  • left and right leaning publications

We can do this by looking at:

  • the total number of instances in each subcorpus, normalised to the total number of words in each subcorpus
  • the number of articles that feature this language type, normalised to the total number of articles in each subcorpus

Tabloids vs broadsheets

The total number of uses of “fat” we observe is higher in tabloids and lower in broadsheets than we would expect based on the word count in these subcorpora (p < 0.001).

Code
chisq_instances_wc_normalised(fat_annotated_filtered, metadata_full, source_type)|> kable()
Table 3: Chi-Squared test for number of uses of FAT in tabloids and broadsheets vs what would be expected given word count across all articles in the publication types.
variable value
method Chi-squared test for given probabilities
parameter 1
statistic 80.19926
p.value 3.384939e-19
broadsheet_observed 1698
broadsheet_expected 1971.43313891709
tabloid_observed 2042
tabloid_expected 1768.56686108291

The number of articles with use of “fat” we observe is not different between tabloids and broadsheets.

Code
chisq_articles_totalart_normalised(fat_annotated_filtered, metadata_full, source_type)|> kable()
Table 4: Chi-Squared test for number of articles that use FAT in tabloids and broadsheets vs what would be expected given total number of articles in the publication types.
variable value
method Chi-squared test for given probabilities
parameter 1
statistic 0.879783
p.value 0.3482611
broadsheet_observed 918
broadsheet_expected 940.113901311012
tabloid_observed 1382
tabloid_expected 1359.88609868899

Left vs right-leaning publications

The total number of uses of “fat” we observe is higher in right and lower in left-leaning publications than we would expect based on the word count in these subcorpora (p < 0.001).

Code
chisq_instances_wc_normalised(fat_annotated_filtered, metadata_full, orientation)|> kable()
Table 5: Chi-Squared test for number of uses of FAT in left and right leaning publications vs what would be expected given word count across all articles in the publication types.
variable value
method Chi-squared test for given probabilities
parameter 1
statistic 29.25461
p.value 6.346508e-08
left_observed 1416
left_expected 1579.37808137617
right_observed 2324
right_expected 2160.62191862383

The number of articles that use “fat” we observe is not different between left and right leaning publications.

Code
chisq_articles_totalart_normalised(fat_annotated_filtered, metadata_full, orientation)|> kable()
Table 6: Chi-Squared test for number of articles that use FAT left and right leaning publications vs what would be expected given total number of articles in the publication types.
variable value
method Chi-squared test for given probabilities
parameter 1
statistic 0.9299361
p.value 0.3348795
left_observed 746
left_expected 767.809501968429
right_observed 1554
right_expected 1532.19049803157

Differences in usage by source

Is there a difference in the usage of fat by source?

Code
fat_annotated_filtered %>%
  ggplot(aes(x = reorder(source,frequency), 
             y = log(frequency), 
             fill = source_type)) + 
  geom_boxplot() +
  theme(axis.text.x = 
          element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "NA")

Figure 8: Box plot of usage of FAT by source.

The visualisation suggests there may be some differences, and definitely between tabloids vs broadsheets.

What are the means and standard deviations of the frequency by source?

Code
fat_annotated_filtered %>% 
  group_by(source) %>%
  summarise(
    mean = mean(frequency),
    median = median(frequency),
    sd = sd(frequency),
    type = (source_type)
  ) %>%
  distinct() %>%
  arrange(mean) %>%
  kable()
Table 7: Mean, median and standard deviation of frequency of use of FAT by source.
source mean median sd type
Australian 2.281607 1.524404 2.481898 broadsheet
Age 2.614527 1.787311 3.036150 broadsheet
SydHerald 2.659197 1.826485 2.634891 broadsheet
CanTimes 2.944274 1.996294 2.861244 broadsheet
CourierMail 4.074611 2.816901 4.354308 tabloid
WestAus 4.155035 2.808989 4.476407 tabloid
Advertiser 4.314423 2.645503 4.884672 tabloid
HeraldSun 4.616753 2.976190 5.133447 tabloid
HobMercury 4.908499 3.017147 5.403502 tabloid
NorthernT 6.287128 4.878049 4.825449 tabloid

It seems that within the different sources among broadsheets there is not much difference among the frequency of use of the word “fat”. Among tabloids, the Northern Territorian seems to have higher frequency of use of “fat” with a negative connotation than other tabloids.

Differences in usage by year

Is there any year when “fat” is used more frequently than others?

Code
fat_annotated_filtered %>%
  ggplot(aes(x = reorder(year,frequency), 
             y = log(frequency), 
             fill = year)) + 
  geom_boxplot() +
  theme(axis.text.x = 
          element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "NA")

Figure 9: Use of FAT by year, sorted by median use per year from least frequent to most frequent.

Based on the visualisation it seems not. If we separate out by source there also doesn’t seem to be much difference. If we use a jitter plot to visualise the data, then fit a smoothing line and compare with the line of “no change” (dashed red line), we can see that there really isn’t much of a difference by source and year:

Code
fat_annotated_filtered %>%
  ggplot(aes(x = as.factor(year), 
             y = log(frequency), 
             fill = year)) + 
  geom_jitter(alpha = 0.2) +
  geom_smooth(aes(group = source), col = "blue", method = "loess") +
  geom_hline(yintercept = 1, col = "red", lty = 3, size = 3) + 
  facet_wrap(~source) + 
  theme(axis.text.x = 
          element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "NA") + 
  labs(
    x = "Year",
    y = "log(frequency per 1000 words)"
  )

Figure 10: Use of FAT by year and source, split by publication.

There may be a subtle decrease over time in the Advertiser, Age, Canberra Times and Sydney Morning Herald, however, it is unclear whether this trend is very strong.

The distributions each year also look quite similar:

Code
fat_annotated_filtered %>%
  ggplot(aes(x = reorder(year,frequency), 
             y = log(frequency), 
             fill = year)) + 
  geom_violin() +
  theme(axis.text.x = 
          element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "NA")

Figure 11: Violin plot of the distributions of use of FAT by year.

Code
fat_annotated_filtered %>% 
  group_by(year) %>%
  summarise(
    mean = mean(frequency),
    median = median(frequency),
    sd = sd(frequency),
    year = (year)
  ) %>%
  distinct() %>%
  arrange(mean) %>%
  kable()
Table 8: Descriptive statistics of the use of FAT by year across the dataset.
year mean median sd
2018 2.602021 1.883246 2.687849
2019 2.793541 1.908397 2.386139
2015 3.142368 2.132351 2.599177
2010 3.347193 2.493766 3.009772
2012 3.542590 2.366867 3.768130
2016 3.559982 2.425721 3.288919
2017 3.574660 2.538071 3.578586
2008 3.961569 2.597403 4.442870
2013 4.038693 2.673797 4.032539
2009 4.125576 2.496879 5.879841
2014 4.186377 2.500000 5.015075
2011 4.249711 2.403846 4.927228

Differences in usage by source type, source and year

We can also try to simultaneously model differences by source type, source and year.

As expected, a model that includes the source type (tabloid vs broadsheet) gives a better fit than one that does not:

Code
fat_annotated_filtered$scaled_year <- scale(
  fat_annotated_filtered$year, scale = F)
library(lme4)
# base model
m_0_base <- lm(log(frequency) ~ 1, 
                data = fat_annotated_filtered)
# with source type
m_0_sourcetype <- lm(log(frequency) ~ source_type,  
                data = fat_annotated_filtered)
# with year
m_0_year <- lm(log(frequency) ~ scaled_year,  
                data = fat_annotated_filtered)
# with source 
m_0_source <- lm(log(frequency) ~ source,  
                data = fat_annotated_filtered)
# with source and year
m_0_sourceyear <- lm(log(frequency) ~ source + scaled_year,  
                data = fat_annotated_filtered)
# compare
rbind({broom::glance(m_0_base) %>% 
    dplyr::select(-df.residual,- deviance, -nobs) %>%
    mutate(model = "1")},
    {broom::glance(m_0_sourcetype)%>% 
          dplyr::select(-df.residual,- deviance, -nobs) %>%
        mutate(model = "source_type")},
    {broom::glance(m_0_year) %>% 
    dplyr::select(-df.residual,- deviance, -nobs) %>%
    mutate(model = "scaled_year")},
    {broom::glance(m_0_source) %>% 
    dplyr::select(-df.residual,- deviance, -nobs) %>%
    mutate(model = "source")},
    {broom::glance(m_0_sourceyear)%>% 
          dplyr::select(-df.residual,- deviance, -nobs) %>%
        mutate(model = "source_year")}
) %>% 
  dplyr::select(model, everything()) %>%
  arrange(AIC) %>%
  kable()
Table 9: Comparison of model summary statistics for simple linear models that incorporate source, scaled year and/or source type.
model r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
source_year 0.0957258 0.0917753 0.8221521 24.231192 0.0000000 10 -2807.637 5639.274 5708.162
source 0.0945436 0.0909850 0.8225097 26.567921 0.0000000 9 -2809.139 5640.279 5703.426
source_type 0.0753951 0.0749928 0.8297134 187.385926 0.0000000 1 -2833.206 5672.412 5689.634
scaled_year 0.0029755 0.0025416 0.8615945 6.858039 0.0088823 1 -2919.926 5845.852 5863.074
1 0.0000000 0.0000000 0.8626915 NA NA NA -2923.353 5850.706 5862.187

We can see that the model incorporating source and scaled year provides the best fit for the data, explaining somewhat more variability than that which includes only source or source type.

Code
anova(m_0_base, m_0_sourceyear)
Analysis of Variance Table
Model 1: log(frequency) ~ 1
Model 2: log(frequency) ~ source + scaled_year
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1   2299 1711.0                                  
2   2289 1547.2 10    163.79 24.231 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
source(here::here("400_analysis","model_diagnostics.R"))
df_model_resid <- get_model_resid_df(m_0_sourceyear)
diagnostic_lm_plot(df_model_resid)

Figure 12: Plot of model residuals for the fixed effects model log(frequency) ~ source + scaled_year.

Proportion of data points with:

  • abs(standardized residuals) > 3.29: 0.13%
  • abs(standardized residuals) > 2.58: 1.09%
  • abs(standardized residuals) > 1.96: 5.09%

All of these indicate the model is performing reasonably well on the data.

Let’s summarise the model (note that the intercept corresponds to the “first” source, i.e. the Advertiser):

Code
sjPlot::tab_model(m_0_sourceyear) 
Table 10: Coefficients and confidence intervals for the fixed effects model log(frequency) ~ source + scaled_year.
  log(frequency)
Predictors Estimates CI p
(Intercept) 1.02 0.92 – 1.11 <0.001
source [Age] -0.39 -0.53 – -0.26 <0.001
source [Australian] -0.52 -0.68 – -0.37 <0.001
source [CanTimes] -0.25 -0.41 – -0.09 0.002
source [CourierMail] 0.05 -0.08 – 0.17 0.483
source [HeraldSun] 0.12 -0.01 – 0.24 0.066
source [HobMercury] 0.18 -0.00 – 0.35 0.052
source [NorthernT] 0.59 0.40 – 0.79 <0.001
source [SydHerald] -0.35 -0.48 – -0.23 <0.001
source [WestAus] 0.06 -0.10 – 0.22 0.489
scaled year -0.01 -0.02 – 0.00 0.084
Observations 2300
R2 / R2 adjusted 0.096 / 0.092

This model indicates that the Age, Australian, Canberra Times and Sydney Morning Herald (all broadsheets in the dataset) have lower frequency of usage of the word “fat” with a negative connotation than the Advertiser, whereas the Northern Territorian has a higher usage of the language type.

Code
report::report(m_0_sourceyear)

We fitted a linear model (estimated using OLS) to predict frequency with source and scaled_year (formula: log(frequency) ~ source + scaled_year). The model explains a statistically significant and weak proportion of variance (R2 = 0.10, F(10, 2289) = 24.23, p < .001, adj. R2 = 0.09). The model’s intercept, corresponding to source = Advertiser and scaled_year = 0, is at 1.02 (95% CI [0.92, 1.11], t(2289) = 21.55, p < .001). Within this model:

  • The effect of source [Age] is statistically significant and negative (beta = -0.39, 95% CI [-0.53, -0.26], t(2289) = -5.62, p < .001; Std. beta = -0.18, 95% CI [-0.24, -0.12])
  • The effect of source [Australian] is statistically significant and negative (beta = -0.52, 95% CI [-0.68, -0.37], t(2289) = -6.66, p < .001; Std. beta = -0.22, 95% CI [-0.29, -0.15])
  • The effect of source [CanTimes] is statistically significant and negative (beta = -0.25, 95% CI [-0.41, -0.09], t(2289) = -3.07, p = 0.002; Std. beta = -0.13, 95% CI [-0.20, -0.06])
  • The effect of source [CourierMail] is statistically non-significant and positive (beta = 0.05, 95% CI [-0.08, 0.17], t(2289) = 0.70, p = 0.483; Std. beta = -6.33e-03, 95% CI [-0.06, 0.05])
  • The effect of source [HeraldSun] is statistically non-significant and positive (beta = 0.12, 95% CI [-7.55e-03, 0.24], t(2289) = 1.84, p = 0.066; Std. beta = 0.03, 95% CI [-0.02, 0.09])
  • The effect of source [HobMercury] is statistically non-significant and positive (beta = 0.18, 95% CI [-1.69e-03, 0.35], t(2289) = 1.94, p = 0.052; Std. beta = 0.06, 95% CI [-0.02, 0.14])
  • The effect of source [NorthernT] is statistically significant and positive (beta = 0.59, 95% CI [0.40, 0.79], t(2289) = 5.88, p < .001; Std. beta = 0.23, 95% CI [0.15, 0.32])
  • The effect of source [SydHerald] is statistically significant and negative (beta = -0.35, 95% CI [-0.48, -0.23], t(2289) = -5.46, p < .001; Std. beta = -0.17, 95% CI [-0.22, -0.11])
  • The effect of source [WestAus] is statistically non-significant and positive (beta = 0.06, 95% CI [-0.10, 0.22], t(2289) = 0.69, p = 0.489; Std. beta = -8.61e-03, 95% CI [-0.08, 0.06])
  • The effect of scaled year is statistically non-significant and negative (beta = -9.25e-03, 95% CI [-0.02, 1.24e-03], t(2289) = -1.73, p = 0.084; Std. beta = -0.02, 95% CI [-0.03, -9.21e-04])

Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using the Wald approximation.

Next, let’s compare if using a mixed model results in improved fit.

Code
m_1_source <- lmer(log(frequency) ~  1 + (1|source), 
                data = fat_annotated_filtered, REML = T)
m_1_source_type <- lmer(log(frequency) ~  source_type + (1|source), 
                data = fat_annotated_filtered, REML = T)
m_1_source_year_type <- lmer(log(frequency) ~  source_type + scaled_year + (1|source), 
                data = fat_annotated_filtered, REML = T)
m_1_source_year <- lmer(log(frequency) ~  scaled_year + (1|source), 
                data = fat_annotated_filtered, REML = T)
m_1_source_year2 <- lmer(log(frequency) ~  scaled_year + (scaled_year|source), 
                data = fat_annotated_filtered, REML = T)
m_1_source_year_type2 <- lmer(log(frequency) ~  scaled_year + source_type+ (scaled_year|source), 
                data = fat_annotated_filtered, REML = T)
rbind(
    {broom::glance(m_0_source)%>% 
          dplyr::select(AIC, BIC, logLik) %>%
        mutate(model = "1 + source")},
    {broom.mixed::glance(m_1_source_year_type)%>% 
          dplyr::select(AIC, BIC, logLik) %>%
        mutate(model = "source_type + scaled_year + (1/source)")},
    {broom::glance(m_0_sourceyear)%>% 
          dplyr::select(AIC, BIC, logLik) %>%
        mutate(model = "scaled_year + source")},
    {broom.mixed::glance(m_1_source) %>% 
    dplyr::select(AIC, BIC, logLik) %>%
    mutate(model = "1 + (1/source)")
    },
    {broom.mixed::glance(m_1_source_year_type2) %>% 
    dplyr::select(AIC, BIC, logLik) %>%
    mutate(model = "source_type + scaled_year + (1/source)")
    },
    {broom.mixed::glance(m_1_source_year) %>% 
    dplyr::select(AIC, BIC, logLik) %>%
    mutate(model = "scaled_year + (1/source)")},
    {broom.mixed::glance(m_1_source_year2) %>% 
    dplyr::select(AIC, BIC, logLik) %>%
    mutate(model = "scaled_year + (scaled_year/source)")},
    {broom.mixed::glance(m_1_source_type) %>% 
    dplyr::select(AIC, BIC, logLik) %>%
    mutate(model = "source_type + (1/source)")
    }
) %>% arrange(BIC) %>%
  kable()
Table 11: Comparison of model summary statistics for linear mixed effects models that incorporate source, scaled year and/or source type.
AIC BIC logLik model
5663.524 5686.486 -2827.762 source_type + (1/source)
5671.055 5688.277 -2832.527 1 + (1/source)
5671.065 5699.768 -2830.532 source_type + scaled_year + (1/source)
5678.552 5701.515 -2835.276 scaled_year + (1/source)
5640.279 5703.426 -2809.139 1 + source
5639.274 5708.162 -2807.637 scaled_year + source
5668.853 5709.037 -2827.426 source_type + scaled_year + (1/source)
5676.593 5711.037 -2832.296 scaled_year + (scaled_year/source)

The model that includes a random intercept for each of the sources and considers source type as a fixed effect provides the best fit based on the Bayesian Information Criterion. Tabloids have a higher usage of the language type.

Code
sjPlot::tab_model(m_1_source_type) 
Table 12: Coefficients and confidence intervals for the mixed effects model log(frequency) ~ source_type + 1/source
  log(frequency)
Predictors Estimates CI p
(Intercept) 0.63 0.47 – 0.80 <0.001
source type [tabloid] 0.54 0.32 – 0.76 <0.001
Random Effects
σ2 0.68
τ00source 0.03
ICC 0.04
N source 10
Observations 2300
Marginal R2 / Conditional R2 0.090 / 0.124

Let’s summarise this model:

Code
report::report(m_1_source_type)

We fitted a linear mixed model (estimated using REML and nloptwrap optimizer) to predict frequency with source_type (formula: log(frequency) ~ source_type). The model included source as random effect (formula: ~1 | source). The model’s total explanatory power is weak (conditional R2 = 0.12) and the part related to the fixed effects alone (marginal R2) is of 0.09. The model’s intercept, corresponding to source_type = broadsheet, is at 0.63 (95% CI [0.47, 0.80], t(2296) = 7.41, p < .001). Within this model:

  • The effect of source type [tabloid] is statistically significant and positive (beta = 0.54, 95% CI [0.32, 0.76], t(2296) = 4.84, p < .001; Std. beta = 0.22, 95% CI [0.14, 0.31])

Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using

The intercepts for for each of the sources were:

Code
coef(m_1_source_type)$source %>% 
  arrange(`(Intercept)`) %>% 
  kable()
Table 13: Intercepts and coefficients for the model log(frequency) ~ source_type + 1/source.
(Intercept) source_typetabloid
Advertiser 0.4894885 0.5374269
Australian 0.5108950 0.5374269
CourierMail 0.5297357 0.5374269
WestAus 0.5567112 0.5374269
HeraldSun 0.6025750 0.5374269
Age 0.6235424 0.5374269
HobMercury 0.6516703 0.5374269
SydHerald 0.6575644 0.5374269
CanTimes 0.7443664 0.5374269
NorthernT 0.9743715 0.5374269

Differences in usage by topic

Code
fat_annotated_filtered %>%
  ggplot(aes(x = as.factor(
    reorder(topic_label,frequency)), 
             y = log(frequency), 
             fill = topic_label)) + 
  geom_boxplot() +
  theme(axis.text.x = 
          element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "NA") +
  labs(x = "Topic label",
       y = "log(frequency per 1000 words)")

Figure 13: Frequency of use of FAT per 1000 words in articles annotated with specific topic lables.

It seems there are some topics that use fat more than others.

We use a simple linear model with post-hoc comparisons and Bonferroni multiple testing correction:

Code
fat_by_topic <- lm(
  frequency ~ as.factor(topic_label), 
  data = fat_annotated_filtered)
library(emmeans)
fat_by_topic_comp <- emmeans(fat_by_topic, pairwise ~ as.factor(topic_label), adjust = "bonferroni")
fat_by_topic_comp$contrasts %>%
    summary(infer = TRUE) %>%
  filter(p.value < 0.01) %>% 
  kable()
Table 14: Differences in use of FAT in articles annoated with specific topics, with topics compared pairwise with each other.
contrast estimate SE df lower.CL upper.CL t.ratio p.value
Awards - FastFood&Drinks 7.973995 1.6993808 2283 1.9112363 14.0367536 4.692294 0.0003891
Awards - FitnessExercise 7.631596 1.6893761 2283 1.6045303 13.6586615 4.517405 0.0008948
Awards - Food 9.314549 1.7208689 2283 3.1751289 15.4539691 5.412701 0.0000093
Awards - MusicMovies 8.720687 1.6878559 2283 2.6990449 14.7423289 5.166725 0.0000352
Awards - NutritionStudy 7.918747 1.6962952 2283 1.8669970 13.9704979 4.668260 0.0004371
Awards - Politics 9.122837 1.6967161 2283 3.0695851 15.1760889 5.376761 0.0000114
Awards - PublicHealthReport 7.463489 1.7046693 2283 1.3818625 13.5451149 4.378262 0.0017003
Awards - Students&Teachers 9.378821 1.7908695 2283 2.9896643 15.7679773 5.237021 0.0000242
Awards - Transport&Commuting 9.041890 1.8113633 2283 2.5796197 15.5041610 4.991760 0.0000875
Awards - WomenGirls 7.929696 1.6876047 2283 1.9089498 13.9504419 4.698787 0.0003771
BiomedResearch - FastFood&Drinks 1.892684 0.3814379 2283 0.5318552 3.2535126 4.961971 0.0001019
BiomedResearch - FitnessExercise 1.550285 0.3340530 2283 0.3585078 2.7420618 4.640834 0.0004987
BiomedResearch - Food 3.233238 0.4679631 2283 1.5637195 4.9027563 6.909172 0.0000000
BiomedResearch - MusicMovies 2.639376 0.3262780 2283 1.4753374 3.8034142 8.089348 0.0000000
BiomedResearch - NutritionStudy 1.837436 0.3674470 2283 0.5265222 3.1483505 5.000548 0.0000837
BiomedResearch - Politics 3.041526 0.3693850 2283 1.7236976 4.3593542 8.234027 0.0000000
BiomedResearch - Students&Teachers 3.297510 0.6817724 2283 0.8651994 5.7298200 4.836672 0.0001915
BiomedResearch - Transport&Commuting 2.960579 0.7339189 2283 0.3422297 5.5789289 4.033933 0.0077049
BiomedResearch - WomenGirls 1.848385 0.3249763 2283 0.6889902 3.0077793 5.687753 0.0000020
ChildrenParents - FastFood&Drinks 1.828524 0.4436993 2283 0.2455704 3.4114783 4.121090 0.0053114
ChildrenParents - Food 3.169078 0.5199650 2283 1.3140366 5.0241202 6.094792 0.0000002
ChildrenParents - MusicMovies 2.575216 0.3972800 2283 1.1578688 3.9925638 6.482118 0.0000000
ChildrenParents - NutritionStudy 1.773277 0.4317307 2283 0.2330222 3.3135315 4.107368 0.0056344
ChildrenParents - Politics 2.977366 0.4333814 2283 1.4312228 4.5235099 6.870084 0.0000000
ChildrenParents - Students&Teachers 3.233350 0.7184620 2283 0.6701448 5.7965556 4.500377 0.0009689
ChildrenParents - WomenGirls 1.784225 0.3962117 2283 0.3706892 3.1977613 4.503212 0.0009561
FastFood&Drinks - WomenPregnancy -2.730627 0.6110072 2283 -4.9104733 -0.5507806 -4.469059 0.0011207
FitnessExercise - WomenPregnancy -2.388228 0.5826030 2283 -4.4667388 -0.3097170 -4.099237 0.0058345
Food - WomenPregnancy -4.071181 0.6684492 2283 -6.4559592 -1.6864028 -6.090487 0.0000002
MusicMovies - WomenPregnancy -3.477319 0.5781800 2283 -5.5400502 -1.4145875 -6.014249 0.0000003
NutritionStudy - WomenPregnancy -2.675379 0.6023721 2283 -4.8244190 -0.5263398 -4.441407 0.0012734
Politics - WomenPregnancy -3.879469 0.6035562 2283 -6.0327332 -1.7262048 -6.427684 0.0000000
Students&Teachers - WomenPregnancy -4.135453 0.8322550 2283 -7.1046292 -1.1662763 -4.968973 0.0000983
Transport&Commuting - WomenPregnancy -3.798522 0.8754837 2283 -6.9219227 -0.6751220 -4.338770 0.0020335
WomenGirls - WomenPregnancy -2.686328 0.5774465 2283 -4.7464422 -0.6262135 -4.652081 0.0004725

We can see that there are differences in the frequency of use of the word “fat” by topic, with articles annotated as “Awards”, “Women and Pregnancy” and “Bio-medical Research” using more instances per 1000 words than articles discussing “Students and teachers”, “Politics” and “Food”.