Module 04 of 12 · 45 min read · Beginner

Data frames and tibbles

Creating, importing, inspecting. Why tibbles are an improvement and where they break.

Learning objectives

By the end of this module, you should be able to:

  • Create data frames and tibbles and inspect them with str(), summary(), head()
  • Read CSV files using both base read.csv and tidyverse read_csv
  • Distinguish data.frame from tibble and articulate why tibbles are an improvement
  • Subset rows and columns using base R bracket syntax and the $ operator

Data frames are the central data structure for R analysis. A data frame is a list of equal-length vectors, displayed as a table. Most functions you'll meet — read.csv, lm, summary, ggplot — are built around data frames.
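
The "list of equal-length vectors" description can be checked directly in the console. The following is a minimal sketch; the name banks and its values are made up for illustration.

```r
# Illustrative data; the name and values are invented for this sketch
banks <- data.frame(
  bank = c("KCB", "Equity", "Coop"),
  assets_bn = c(1500, 1700, 600)
)

is.list(banks)                        # TRUE: a data frame is a list underneath
length(banks)                         # 2: one list element per column
col_lengths <- sapply(banks, length)  # length of each column vector
col_lengths                           # both columns have length 3
banks[["assets_bn"]]                  # list-style extraction returns the column

# Columns whose lengths cannot be reconciled (3 vs 2 here) are rejected
bad <- try(data.frame(x = 1:3, y = 1:2), silent = TRUE)
inherits(bad, "try-error")            # TRUE
```

Because each column is an ordinary vector, anything you already know about vectors applies column by column.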

Creating a data frame

r
df <- data.frame(
  bank = c("KCB", "Equity", "Coop", "NCBA"),
  assets_bn = c(1500, 1700, 600, 700),
  tier = c(1, 1, 1, 1)
)
head(df)
nrow(df); ncol(df)
str(df) # structure: types and a sample
summary(df) # summary statistics

Selecting and filtering

r
df$bank # column as vector
df[, "bank"] # same
df[df$assets_bn > 1000, ] # filter rows
df[1, ] # first row
df[, c("bank", "assets_bn")] # subset columns
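
One detail worth seeing in action: on a base data.frame, selecting a single column with df[, "bank"] drops to a vector, while df["bank"] or drop = FALSE keeps a one-column data frame. The sketch below rebuilds a small data frame with the same bank values so it runs on its own.

```r
df <- data.frame(
  bank = c("KCB", "Equity", "Coop", "NCBA"),
  assets_bn = c(1500, 1700, 600, 700)
)

v  <- df[, "bank"]                # character vector (dimension dropped)
d1 <- df["bank"]                  # one-column data frame
d2 <- df[, "bank", drop = FALSE]  # also a one-column data frame

is.character(v)      # TRUE
is.data.frame(d1)    # TRUE
is.data.frame(d2)    # TRUE

# Combine conditions with & (and) or | (or), then pick columns
big <- df[df$assets_bn > 1000 & df$bank != "KCB", c("bank", "assets_bn")]
big                  # one row: Equity
```

If later code needs a data frame regardless of how many columns are selected, passing drop = FALSE is the defensive choice.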

Tibbles — the tidyverse upgrade

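A tibble is a modern data frame with better printing (only the first rows, with column types shown), more predictable subsetting behaviour ([ always returns a tibble, and $ never partially matches column names), and stricter construction rules (input types are never changed, and only length-one vectors are recycled). Most tidyverse functions return tibbles. They are interchangeable with data.frame for almost all purposes; the main place they break is older code that expects tb[, "col"] to return a vector.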

r
library(tibble)
tb <- tibble(
  bank = c("KCB", "Equity"),
  assets = c(1500, 1700)
)
tb # prints with column types and dimensions
# Tibbles never auto-convert characters to factors (data.frame did, until R 4.0)
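
To make the subsetting differences concrete, here is a small side-by-side sketch. It assumes the tibble package is installed; the data mirror the example above.

```r
library(tibble)

df <- data.frame(bank = c("KCB", "Equity"), assets = c(1500, 1700))
tb <- as_tibble(df)

# Single-column [ : data.frame drops to a vector, tibble stays a tibble
class(df[, "bank"])   # "character"
class(tb[, "bank"])   # "tbl_df" "tbl" "data.frame"

# $ partial matching: data.frame silently matches "ass" to "assets";
# a tibble returns NULL with a warning instead
df$ass                                  # 1500 1700
tb_partial <- suppressWarnings(tb$ass)  # NULL

# When a vector is needed from a tibble, ask for it explicitly
tb[["bank"]]
tb$bank
```

This is also where tibbles can break older code: a function written to expect a vector from x[, "col"] receives a one-column tibble instead.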

Reading and writing

r
df <- read.csv("rates.csv")
df <- read.csv("rates.csv", stringsAsFactors = FALSE) # historical, not needed in R 4.0+
library(readr)
df <- read_csv("rates.csv") # tidyverse version, faster, returns tibble
write_csv(df, "output.csv")
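
If you already know what each column should contain, you can declare it when reading. The sketch below is a round trip through a temporary file, assuming readr is installed; the rates data and column names are hypothetical.

```r
library(readr)

# Hypothetical data and a temporary file, so the sketch runs anywhere
rates <- data.frame(bank = c("KCB", "Equity"), rate = c(12.5, 13.1))
path <- tempfile(fileext = ".csv")
write_csv(rates, path)

# Declaring col_types documents the expected schema and silences
# readr's column-guessing message
back <- read_csv(
  path,
  col_types = cols(bank = col_character(), rate = col_double())
)

class(back)                       # includes "tbl_df": read_csv returns a tibble
all.equal(back$rate, rates$rate)  # TRUE
unlink(path)                      # remove the temporary file
```

Declaring types up front means readr does not have to guess them from the data.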

Always inspect after reading

After read_csv, always run str(df), summary(df), and head(df). One stray text value in a numeric column makes the whole column load as character. The import itself raises no error; the problem only surfaces later, when numeric operations fail or return NA.
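
The sketch below shows what that failure looks like and one way to track it down. It uses base read.csv with inline text so it runs without a file (read_csv treats such a column the same way); the CSV content and the "n/a" placeholder are hypothetical.

```r
# Hypothetical CSV content with one stray text value ("n/a") in rate
csv_text <- "bank,rate
KCB,12.5
Equity,n/a
Coop,13.0"

raw <- read.csv(text = csv_text)
str(raw)               # rate is chr, not num; the import raised no error
class(raw$rate)        # "character"

# Locate the offending values: non-missing text that fails numeric conversion
parsed <- suppressWarnings(as.numeric(raw$rate))
bad_rows <- which(is.na(parsed) & !is.na(raw$rate))
raw[bad_rows, ]        # the Equity row

# One possible fix: declare the placeholder as missing when reading
clean <- read.csv(text = csv_text, na.strings = c("NA", "n/a"))
class(clean$rate)               # "numeric"
mean(clean$rate, na.rm = TRUE)  # 12.75
```

Whether to treat the placeholder as missing or to correct it at the source depends on what the value was meant to be, so look at the offending rows before choosing.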

Exercise

Print the first 5 rows of the pre-loaded bankrates data frame.

Key takeaways

  • A data frame is a list of equal-length vectors, displayed as a table — the central R data structure
  • Tibbles print better, never auto-convert characters to factors, and have stricter subsetting semantics
  • After every read_csv, run str(df), summary(df), head(df) — never skip this step
  • Use df$column for a column as a vector; df[, 'col'] also returns a vector on a base data frame but a one-column tibble on a tibble, and df['col'] always keeps a data frame


LeadAfrik · Public Economics Hub