Fertility

code
Author

Nikhita Purohit

Published

September 28, 2025

Fertility

Studying Fertility Decisions

EXPERIMENT OBJECTIVE:

The aim of this experiment could be to understand the characteristics of women of different ethnicity, especially Hispanic and African American women, in order to study their fertility decisions.

1. Setting up R Packages

# SETUP CHUNK- LIBRARIES
#| label: setup
#| echo: false
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mosaic) # Our all-in-one package
Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: 'mosaic'

The following object is masked from 'package:Matrix':

    mean

The following objects are masked from 'package:dplyr':

    count, do, tally

The following object is masked from 'package:purrr':

    cross

The following object is masked from 'package:ggplot2':

    stat

The following objects are masked from 'package:stats':

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var

The following objects are masked from 'package:base':

    max, mean, min, prod, range, sample, sum
library(skimr) # Looking at data

Attaching package: 'skimr'

The following object is masked from 'package:mosaic':

    n_missing
library(janitor) # Clean the data

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(naniar) # Handle missing data

Attaching package: 'naniar'

The following object is masked from 'package:skimr':

    n_complete
library(visdat) # Visualise missing data
library(tinytable) # Printing Static Tables for our data

Attaching package: 'tinytable'

The following object is masked from 'package:ggplot2':

    theme_void
library(DT) # Interactive Tables for our data
library(crosstable) # Multiple variable summaries

Attaching package: 'crosstable'

The following object is masked from 'package:purrr':

    compact

2. Read Data

fertility_modified <- fertility <- readr::read_csv("../data/Fertility.csv")%>%
  # Clean variable names
  janitor::clean_names(case="snake")
Rows: 254654 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): morekids, gender1, gender2, afam, hispanic, other
dbl (3): rownames, age, work

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fertility_modified
# A tibble: 254,654 × 9
   rownames morekids gender1 gender2   age afam  hispanic other  work
      <dbl> <chr>    <chr>   <chr>   <dbl> <chr> <chr>    <chr> <dbl>
 1        1 no       male    female     27 no    no       no        0
 2        2 no       female  male       30 no    no       no       30
 3        3 no       male    female     27 no    no       no        0
 4        4 no       male    female     35 yes   no       no        0
 5        5 no       female  female     30 no    no       no       22
 6        6 no       male    female     26 no    no       no       40
 7        7 no       female  male       29 no    no       no        0
 8        8 no       male    male       33 no    no       no       52
 9        9 no       female  male       29 no    no       no        0
10       10 no       male    female     27 no    no       no        0
# ℹ 254,644 more rows

3. Examine Data

dplyr::glimpse(fertility_modified)
Rows: 254,654
Columns: 9
$ rownames <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ morekids <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
$ gender1  <chr> "male", "female", "male", "male", "female", "male", "female",…
$ gender2  <chr> "female", "male", "female", "female", "female", "female", "ma…
$ age      <dbl> 27, 30, 27, 35, 30, 26, 29, 33, 29, 27, 28, 28, 35, 34, 32, 2…
$ afam     <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no", …
$ hispanic <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
$ other    <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
$ work     <dbl> 0, 30, 0, 0, 22, 40, 0, 52, 0, 0, 0, 52, 52, 52, 8, 7, 0, 40,…
skimr::skim(fertility_modified)
Data summary
Name fertility_modified
Number of rows 254654
Number of columns 9
_______________________
Column type frequency:
character 6
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
morekids 0 1 2 3 0 2 0
gender1 0 1 4 6 0 2 0
gender2 0 1 4 6 0 2 0
afam 0 1 2 3 0 2 0
hispanic 0 1 2 3 0 2 0
other 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rownames 0 1 127327.50 73512.42 1 63664.25 127327.5 190990.8 254654 ▇▇▇▇▇
age 0 1 30.39 3.39 21 28.00 31.0 33.0 35 ▁▃▅▇▇
work 0 1 19.02 21.87 0 0.00 5.0 44.0 52 ▇▁▁▁▃
crosstable(work + age ~ hispanic + afam + other,
           data = fertility_modified
) %>%
    crosstable::as_flextable()

other

no

yes

afam

no

yes

no

yes

hispanic

no

yes

no

yes

no

yes

no

yes

work

Min / Max

0 / 52.0

0 / 52.0

0 / 52.0

0 / 52.0

0 / 52.0

0 / 52.0

NA / NA

NA / NA

Med [IQR]

3.0 [0;41.0]

0 [0;42.0]

36.0 [0;52.0]

12.0 [0;50.2]

10.0 [0;50.0]

4.0 [0;40.0]

NA [NA;NA]

NA [NA;NA]

Mean (std)

18.4 (21.7)

18.1 (21.7)

29.2 (22.4)

22.4 (23.2)

21.4 (22.8)

18.3 (21.5)

NA (NA)

NA (NA)

N (NA)

216033 (0)

11117 (0)

12960 (0)

196 (0)

6764 (0)

7584 (0)

0 (0)

0 (0)

age

Min / Max

21.0 / 35.0

21.0 / 35.0

21.0 / 35.0

22.0 / 35.0

21.0 / 35.0

21.0 / 35.0

NA / NA

NA / NA

Med [IQR]

31.0 [28.0;33.0]

30.0 [27.0;33.0]

30.0 [27.0;33.0]

30.0 [27.0;33.0]

31.0 [29.0;33.0]

30.0 [26.0;32.0]

NA [NA;NA]

NA [NA;NA]

Mean (std)

30.5 (3.3)

29.9 (3.6)

29.9 (3.5)

30.0 (3.6)

30.7 (3.3)

29.2 (3.7)

NA (NA)

NA (NA)

N (NA)

216033 (0)

11117 (0)

12960 (0)

196 (0)

6764 (0)

7584 (0)

0 (0)

0 (0)

4. Data Dictionary

Quantitative Data

  1. rownames(dbl): Number of observations
  2. age(dbl): Age of the mother (years)
  3. work(dbl): Number of hours worked in a week

Qualitative Data

  1. morekids(fct): Does the mother want more kids? (yes or no)
  2. gender1(fct): Gender of the first born child
  3. gender2(fct): Gender of the second born child
  4. afam(fct): If the mother is African American (yes or no)
  5. hispanic(fct): If the mother is Hispanic (yes or no)
  6. other(fct): If the mother is neither African American nor Hispanic (yes or no)
fertility_modified %>%
  dplyr::count(morekids,gender1,gender2,afam,hispanic,other) %>%
  tt()
morekids gender1 gender2 afam hispanic other n
no female female no no no 30506
no female female no no yes 846
no female female no yes no 1252
no female female no yes yes 812
no female female yes no no 1619
no female female yes yes no 22
no female male no no no 35651
no female male no no yes 1110
no female male no yes no 1451
no female male no yes yes 940
no female male yes no no 1812
no female male yes yes no 23
no male female no no no 35926
no male female no no yes 1104
no male female no yes no 1480
no male female no yes yes 894
no male female yes no no 1866
no male female yes yes no 34
no male male no no no 35261
no male male no no yes 1123
no male male no yes no 1379
no male male no yes yes 876
no male male yes no no 1730
no male male yes yes no 25
yes female female no no no 21132
yes female female no no yes 726
yes female female no yes no 1466
yes female female no yes yes 1018
yes female female yes no no 1522
yes female female yes yes no 25
yes female male no no no 17564
yes female male no no yes 566
yes female male no yes no 1257
yes female male no yes yes 923
yes female male yes no no 1411
yes female male yes yes no 16
yes male female no no no 17539
yes male female no no yes 609
yes male female no yes no 1326
yes male female no yes yes 986
yes male female yes no no 1401
yes male female yes yes no 20
yes male male no no no 22454
yes male male no no yes 680
yes male male no yes no 1506
yes male male no yes yes 1135
yes male male yes no no 1599
yes male male yes yes no 31
fertility_modified %>%
  dplyr::summarise(across(
    .cols = c(age,work), # select columns

    .fns = list(
      mean = ~ mean(., na.rm = T),
      sd = sd,
      min = min, max = max
    )
  ))
# A tibble: 1 × 8
  age_mean age_sd age_min age_max work_mean work_sd work_min work_max
     <dbl>  <dbl>   <dbl>   <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1     30.4   3.39      21      35      19.0    21.9        0       52
fertility_modified <- fertility %>% tidyr::drop_na()
fertility_modified
# A tibble: 254,654 × 9
   rownames morekids gender1 gender2   age afam  hispanic other  work
      <dbl> <chr>    <chr>   <chr>   <dbl> <chr> <chr>    <chr> <dbl>
 1        1 no       male    female     27 no    no       no        0
 2        2 no       female  male       30 no    no       no       30
 3        3 no       male    female     27 no    no       no        0
 4        4 no       male    female     35 yes   no       no        0
 5        5 no       female  female     30 no    no       no       22
 6        6 no       male    female     26 no    no       no       40
 7        7 no       female  male       29 no    no       no        0
 8        8 no       male    male       33 no    no       no       52
 9        9 no       female  male       29 no    no       no        0
10       10 no       male    female     27 no    no       no        0
# ℹ 254,644 more rows
fertility_modified <- fertility %>%
  dplyr::mutate(across(where(is.character), as.factor)) %>% 
  # arrange the Qual variables first, Quant next
  dplyr::relocate(where(is.factor), .after = rownames)

glimpse(fertility_modified)
Rows: 254,654
Columns: 9
$ rownames <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ morekids <fct> no, no, no, no, no, no, no, no, no, no, yes, no, no, no, no, …
$ gender1  <fct> male, female, male, male, female, male, female, male, female,…
$ gender2  <fct> female, male, female, female, female, female, male, male, mal…
$ afam     <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, no, …
$ hispanic <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n…
$ other    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n…
$ age      <dbl> 27, 30, 27, 35, 30, 26, 29, 33, 29, 27, 28, 28, 35, 34, 32, 2…
$ work     <dbl> 0, 30, 0, 0, 22, 40, 0, 52, 0, 0, 0, 52, 52, 52, 8, 7, 0, 40,…
fertility_modified2 <- fertility_modified %>%
  stats::setNames(c("Row ID","Want More Kids","Gender of First Born","Gender of Second Born","African American","Hispanic","Other","Age","Work"))

fertility_modified2 %>%
  dplyr::slice_sample(n = 500) %>%
  DT::datatable(
    style = "default",
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: left; color: black; font-size: 100%;", "Fertility Dataset (Clean)"
    ),
    options = list(pageLength = 10, autoWidth = TRUE)
  ) %>%
  DT::formatStyle(
    columns = names(fertility_modified2),
    fontFamily = "Roboto Condensed",
    fontSize = "12px",
  )

5. Graphs

1. Which Ethnic Groups are More Likely to Want More Children?

fertility_modified  %>%
  dplyr::slice_sample(n = 500) %>%
  dplyr::count(morekids,afam,hispanic,other) %>%
  tt()
morekids afam hispanic other n
no no no no 270
no no no yes 5
no no yes no 12
no no yes yes 10
no yes no no 16
yes no no no 148
yes no no yes 15
yes no yes no 8
yes no yes yes 10
yes yes no no 5
yes yes yes no 1
fertility_chunk <- tibble::tribble(
  ~morekids, ~afam, ~hispanic, ~other, ~n,
  "no",  "no",  "no",  "no",  270,
  "no",  "no",  "no",  "yes", 6,
  "no",  "no",  "yes", "no", 12,
  "no",  "no",  "yes", "yes", 7,
  "no",  "yes", "no",  "no", 18,
  "yes", "no",  "no",  "no", 155,
  "yes", "no",  "no",  "yes", 4,
  "yes", "no",  "yes", "no", 5,
  "yes", "no",  "yes", "yes", 8,
  "yes", "yes", "no", "no", 15
)

fertility_chunk_modified <- fertility_chunk %>%
  filter((afam == "yes" & hispanic == "no" & other == "no") |
         (afam == "no" & hispanic == "yes" & other == "no") |
         (afam == "no" & hispanic == "no" & other == "yes"))

fertility_chunk_plot <- fertility_chunk_modified %>%
  mutate(ethnicity = case_when(
    afam == "yes" ~ "afam",
    hispanic == "yes" ~ "hispanic",
    other == "yes" ~ "other"
  )) %>%
  group_by(ethnicity) %>%
  summarise(count = sum(n))

gf_col(count ~ ethnicity, data = fertility_chunk_plot, fill = ~ ethnicity) %>%
  gf_labs(title = "Which Ethnic Groups are More Likely to Want More Children?",
          x = "Ethnicity",
          y = "Number of Mothers")%>%
  gf_refine(scale_fill_brewer(name = "Legend = Ethnic Group",palette = "Pastel1"))

2. Does a Mother’s Age Relate to How Many Hours She Works, and Does Wanting More Children Make a Difference?

fertility_modified3 <- fertility_modified %>%
  group_by(age, morekids) %>%
  summarize(avg_work = mean(work, na.rm = TRUE))
`summarise()` has grouped output by 'age'. You can override using the `.groups`
argument.
gf_col(avg_work ~ age, fill = ~morekids, data = fertility_modified3, position = "dodge") %>%
  gf_labs(title = "Average Work Hours vs Age by Desire for More Kids",
          x = "Age of Mother",
          y = "Average Hours Worked per Week") %>%
  gf_refine(scale_fill_brewer(name = "Legend = Want More Kids", palette = "Paired"))

3. Are Mothers Who Want More Kids Generally Younger or Older Than Those Who Don’t?

fertility_chunk_modified2 <- fertility_modified %>%
  gf_boxplot(age ~ morekids,fill = ~morekids) %>%
  gf_refine(scale_fill_brewer(name = "Legend = Desire For More Kids", palette = "PiYG")) %>%
  gf_labs(
    title = "Are Mothers Who Want More Kids Younger or Older?",
    x = "Desire For More Kids", y = "Age of Mother"
  )
fertility_chunk_modified2

4. How Does The Distribution Of Weekly Work Hours Vary Based On The Gender Of The First-Born Child?

gf_histogram(~ work | gender1, data = fertility_modified, binwidth = 5, fill = "#c885da") %>%
  gf_labs(title = "Work Hours Distribution by Gender of First Born",
          x = "Hours Worked per Week",
          y = "Count")

6. Summary of Findings

From the data and the graphs plotted, it is evident that

  1. African American women are most likely to want more children, compared to women of other ethnicity.

  2. Younger mothers work for fewer hours than older women, perhaps because they have just started working, or they are prioritizing family.

  3. Women who work for longer hours in a week tend not to want more children.

  4. In the box plot, we see that the IQR for the ‘no’ plot has wider variation as compared to the ‘yes’, suggesting that women of a wide range of ages do not prefer having children.

  5. Most mothers either do not work at all, or work around 50 hours per week.

  6. Whether the first born child is male or female, it does not seem to effect the working hours of the mother.

7. Surprising Aspects

  1. According to the U.S. Census and CDC fertility data, Hispanic families are usually bigger than African American families, both in terms of fertility rates and household size. So, I was surprised to see that African American mothers desire having more kids than Hispanic mothers.

  2. I was not expecting such a gradual rise for the 2nd chart. I thought that older women who work over 15 hours a week would prefer not having more children.

  3. There are very few young mothers, under the age of 24 who want more children. It is evident from the outliers in the box plot.

  4. Only few mothers are into part-time jobs. I assumed that most of the mothers would prefer part-time over full-time jobs.