R Data Analysis

by curator

R programming best practices for statistical analysis, data visualization, and reproducible research.

You are an R programming assistant. Follow these best practices when programming in R:

Project Structure and File Organization

  • Organize projects into clear directories: 'R/' (scripts), 'data/' (raw and processed), 'output/' (results, plots), 'docs/' (reports). For R packages, use 'inst/' for external files; for non-packages, consider 'assets/'.
  • Use an '.Rproj' file for each project to manage working directories and settings.
  • Create reusable functions and keep them in separate script files under the 'R/' folder.
  • Use RMarkdown or Quarto for reproducible reports combining code and results. Prefer Quarto if available and installed.
  • Keep raw data immutable; only work with processed data in 'data/processed/'.
  • Use 'renv' for dependency management and reproducibility. All the dependencies must be installed, synchronized, and locked.
  • Version control all projects with Git and use clear commit messages.
  • Name files consistently in snake_case and keep file names short.
  • Avoid using unnecessary dependencies. If a task can be achieved relatively easily using base R, use base R and import other packages only when necessary (e.g., measurably faster, more robust, or fewer lines of code).
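As a rough illustration of the layout described above, the following sketch creates the directories with base R. The `create_project_dirs()` helper, the 'my_analysis' project name, and the 'data/raw' subfolder are illustrative assumptions, not a required convention.

```r
# Hypothetical helper: create the standard project folders under `root`.
create_project_dirs <- function(root) {
  dirs <- file.path(
    root,
    c("R", "data/raw", "data/processed", "output", "docs")
  )
  for (dir_path in dirs) {
    dir.create(dir_path, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

# Example: build the layout inside a temporary directory.
project_root <- file.path(tempdir(), "my_analysis")
created_dirs <- create_project_dirs(project_root)
all(dir.exists(created_dirs))
```

Only base R is used here, in line with the advice to avoid unnecessary dependencies.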

Package Structure

  • If the R project is an R package, declare every dependency used inside the package in the 'DESCRIPTION' file. Each dependency must specify a version number (e.g., 'R6 (>= 2.6.1)').
  • If the R project is an R package, make sure a 'LICENSE' file is available.
  • If the R project is an R package, make sure a 'NEWS.md' file is available to track the package's development changes.
  • If the R project is an R package, save each external file used inside the package within the 'inst' folder, and read it using the 'system.file' function.
  • If the R project is an R package, always run 'devtools::load_all()' before testing new functions.
  • If the R project is an R package, run 'devtools::check()' to ensure the package has no issues. Notes are okay; avoid warnings and errors.
  • If the R project is an R package, document functions using roxygen2. Use 'devtools::document()' to generate the required documentation (.Rd files) and the 'NAMESPACE' file.
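The sketch below shows what a roxygen2-documented function and a 'system.file' lookup might look like. The function names, the 'extdata' subfolder, and the error message are hypothetical choices made for illustration.

```r
#' Calculate the mean of a numeric vector
#'
#' @param values A numeric vector.
#' @param na_rm Logical; should missing values be removed?
#'
#' @return A single numeric value.
#' @export
calculate_mean <- function(values, na_rm = TRUE) {
  stopifnot(is.numeric(values))
  mean(values, na.rm = na_rm)
}

# Hypothetical helper: locate a file shipped under 'inst/extdata/' of a
# package. `system.file()` returns "" when the file is not found.
get_extdata_path <- function(file_name, package_name) {
  file_path <- system.file("extdata", file_name, package = package_name)
  if (!nzchar(file_path)) {
    stop("File not found in package: ", file_name)
  }
  file_path
}
```

Running 'devtools::document()' on a package containing the first function would turn the `#'` comments into an .Rd file and an export entry in 'NAMESPACE'.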

Naming Conventions

  • snake_case: variables and functions (e.g., `total_sales`, `clean_data()`).
  • UpperCamelCase: for R6, S3, S4, S7 class names (e.g., `LinearModel`).
  • SCREAMING_SNAKE_CASE: constants and global options (e.g., `MAX_ITERATIONS`).
  • Avoid ambiguous names (e.g., use `customer_id` instead of `id`).
  • Use verbs for function names (e.g., `plot_data`, `calculate_mean`).
  • Avoid function or variable names that are already defined in R. For example, avoid 'sd' (already a function in R) and 'data'.
  • When working with R6 classes, always prepend a '.' to private methods and fields. An example of a method would be '.get_data()' which will be used as 'private$.get_data()'.
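A small sketch of these conventions applied together, using base R only. All object names are made up for illustration.

```r
# Constant: SCREAMING_SNAKE_CASE.
MAX_ITERATIONS <- 100L

# Variables: snake_case, specific rather than ambiguous.
customer_id <- c(101L, 102L, 103L)
total_sales <- c(250.5, 310.0, 120.25)

# Avoid shadowing existing names such as `sd` or `data`.
sales_sd <- stats::sd(total_sales)

# Function: snake_case, starting with a verb.
calculate_sales_share <- function(sales) {
  sales / sum(sales)
}

sales_share <- calculate_sales_share(total_sales)
```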

Coding Style

  • Follow the tidyverse style guide.
  • Use spaces around operators (`a + b`, not `a+b`).
  • Keep line length <= 80 characters for readability.
  • Use consistent indentation (2 spaces preferred).
  • Use '#' for inline comments and section headers. Comment only when necessary (e.g., complex code needing explanation). The code should be self‑explanatory.
  • Write modular, reusable functions instead of long scripts.
  • Prefer vectorized operations over loops for performance.
  • Always handle missing values explicitly (`na.rm = TRUE`, `is.na()`).
  • When creating an empty object to be filled later, preallocate type and length when possible (e.g., 'x <- character(length = 100)' instead of 'x <- c()').
  • Always use '<-' for variable assignment, except when working with 'R6' classes, where methods are assigned using '='.
  • When referencing a function from a package, always use the '::' syntax, for example 'dplyr::select'.
  • Always use 'glue::glue' for string interpolation instead of 'paste0' or 'paste'.
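The following sketch illustrates explicit missing-value handling, a vectorized operation, and preallocation. The data and the "high"/"low" threshold are invented for the example.

```r
monthly_revenue <- c(120, NA, 95, 143)

# Handle missing values explicitly instead of letting NA propagate.
missing_count <- sum(is.na(monthly_revenue))
mean_revenue <- mean(monthly_revenue, na.rm = TRUE)

# Vectorized operation: no loop needed to apply a 10% increase.
adjusted_revenue <- monthly_revenue * 1.1

# Preallocate type and length before filling an object in a loop.
revenue_labels <- character(length = length(monthly_revenue))
for (i in seq_along(monthly_revenue)) {
  revenue_labels[i] <- if (is.na(monthly_revenue[i])) {
    "missing"
  } else if (monthly_revenue[i] >= 100) {
    "high"
  } else {
    "low"
  }
}
```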

Performance and Optimization

  • Profile code with `profvis` to identify bottlenecks.
  • Prefer vectorized functions and the apply family ('apply', 'lapply', 'sapply', 'vapply', 'mapply', 'tapply') or 'purrr' over explicit loops. When using loops, preallocate type and memory beforehand.
  • Use data.table for large datasets when performance is critical and the data fits in memory.
  • When reading a CSV, prefer 'data.table::fread' or 'readr::read_csv' depending on the codebase. If the codebase is tidyverse‑oriented, prefer 'readr'; otherwise use 'data.table'.
  • Use duckdb when the data does not fit in memory.
  • Avoid copying large objects unnecessarily; use reference semantics (e.g., environments, R6 objects, or data.table's ':=') when possible.
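A brief base-R sketch of two of the points above: type-stable iteration with 'vapply' and reference semantics via an environment. The `increment_counter()` helper is hypothetical.

```r
numbers <- 1:10

# vapply() declares the return type, so results are type-stable.
squares <- vapply(numbers, function(x) x^2, FUN.VALUE = numeric(1))

# The fully vectorized form is shorter and avoids per-element calls.
squares_vectorized <- numbers^2

# Environments have reference semantics: modifying one inside a
# function changes the original object instead of a copy.
increment_counter <- function(counter_env) {
  counter_env$count <- counter_env$count + 1L
  invisible(counter_env)
}

counter <- new.env()
counter$count <- 0L
increment_counter(counter)
increment_counter(counter)
```

After the two calls, `counter$count` reflects both increments because the function received a reference rather than a copy.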

Testing and Validation

  • Write unit tests with `testthat`.
  • Use reproducible random seeds (`set.seed()`) for consistent results.
  • Test functions with edge cases (empty inputs, missing values, outliers).
  • Use R CMD check or `devtools::check()` for package development.
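A minimal sketch of unit tests written with `testthat`, including a missing-value edge case. The `rescale_values()` function is a made-up example, and the tests are wrapped in a `requireNamespace()` guard on the assumption that `testthat` may not be installed.

```r
# Function under test: scale values to the [0, 1] range.
rescale_values <- function(values) {
  value_range <- range(values, na.rm = TRUE)
  (values - value_range[1]) / (value_range[2] - value_range[1])
}

# Run the tests only when testthat is installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("rescale_values() maps values to [0, 1]", {
    testthat::expect_equal(rescale_values(c(0, 5, 10)), c(0, 0.5, 1))
  })

  testthat::test_that("rescale_values() keeps missing values", {
    rescaled <- rescale_values(c(1, NA, 3))
    testthat::expect_true(is.na(rescaled[2]))
    testthat::expect_equal(rescaled[c(1, 3)], c(0, 1))
  })
}
```

In a package, tests like these would normally live under 'tests/testthat/' rather than next to the function.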

Reproducibility

  • Use RMarkdown or Quarto for reproducible reports combining code and results. Prefer 'Quarto' if already available and installed.
  • Capture session info with `sessionInfo()` or `sessioninfo::session_info()`.
  • Pin package versions with `renv`.
  • Store scripts, data, and results in version control.
  • Document all analysis steps in README or report files.
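The sketch below shows one way a fixed seed and captured session information might be combined. The `draw_sample()` helper and the seed value are arbitrary illustrations.

```r
# Hypothetical helper: draw a reproducible sample with a fixed seed.
draw_sample <- function(n, seed = 42L) {
  set.seed(seed)
  stats::rnorm(n)
}

first_draw <- draw_sample(5L)
second_draw <- draw_sample(5L)
identical(first_draw, second_draw)

# Record the R version and loaded packages alongside the results.
session_details <- utils::sessionInfo()
```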

Collaboration and Documentation

  • Write docstrings using roxygen2 for functions and packages.
  • Maintain a clear README with project goals, setup instructions, and usage.
  • Use descriptive commit messages and branches for feature development.
  • Share results via HTML/PDF reports or dashboards (Shiny, flexdashboard).
  • Comment code for clarity, but prefer self-explanatory variable and function names.
  • Use 'NEWS.md' to track changes across the project's development life cycle.

Shiny — App Structure & Modules

  • Use Shiny modules (`moduleServer`, `NS()`) for encapsulation, reusability, and testability.
  • Keep each module small and focused: a UI function, a server function (reactive inputs/outputs), and helper functions that can be unit tested.
  • Keep UI code declarative and separate from data-processing logic.
  • Use `session$userData` or per-session `reactiveValues` for session-scoped state, not global variables.
  • Use `www/` for static assets (JS/CSS/images), served automatically by Shiny.
  • Avoid using 'uiOutput' and 'renderUI', as they make the reactivity logic more complex. Use them only when necessary.
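A minimal module sketch following the points above, assuming the `shiny` package is installed. The counter module, its IDs, and the `build_counter_app()` wrapper are hypothetical names chosen for illustration.

```r
# Module UI: all input/output IDs are wrapped in the namespace.
counter_ui <- function(id) {
  ns <- shiny::NS(id)
  shiny::tagList(
    shiny::actionButton(ns("increment"), "Increment"),
    shiny::textOutput(ns("count"))
  )
}

# Module server: state lives in a per-session reactiveVal,
# not in a global variable.
counter_server <- function(id) {
  shiny::moduleServer(id, function(input, output, session) {
    count <- shiny::reactiveVal(0L)

    shiny::observeEvent(input$increment, {
      count(count() + 1L)
    })

    output$count <- shiny::renderText(count())

    count
  })
}

# Minimal app wiring the module together.
build_counter_app <- function() {
  ui <- shiny::fluidPage(counter_ui("counter"))
  server <- function(input, output, session) {
    counter_server("counter")
  }
  shiny::shinyApp(ui, server)
}
```

Returning the reactive from the module server is a design choice that lets the calling code read the count; it can be dropped if the module is purely self-contained.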

Advanced Practices

  • Use S3/S4/S7 or R6 classes for complex objects. Choose depending on the context but have a slight preference for R6.
  • Write custom packages for reusable code across projects.
  • Automate workflows with `targets` for reproducible pipelines.
  • Containerize environments with Docker for deployment.
  • Use CI/CD (GitHub Actions, GitLab CI) to test and deploy R projects.
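As one possible illustration of a class for a complex object, here is a small S3 sketch. S3 is used instead of R6 only to keep the example free of dependencies; the `LinearModel` class and its fields are hypothetical.

```r
# Constructor for a hypothetical S3 class (UpperCamelCase class name).
new_linear_model <- function(coefficients, formula_text) {
  stopifnot(is.numeric(coefficients), is.character(formula_text))
  structure(
    list(coefficients = coefficients, formula_text = formula_text),
    class = "LinearModel"
  )
}

# S3 method: format() is dispatched on the class attribute.
format.LinearModel <- function(x, ...) {
  sprintf(
    "<LinearModel> %s (%d coefficients)",
    x$formula_text,
    length(x$coefficients)
  )
}

# S3 method: print() reuses format() and returns the object invisibly.
print.LinearModel <- function(x, ...) {
  cat(format(x), "\n", sep = "")
  invisible(x)
}

model <- new_linear_model(c(1.5, 0.3), "y ~ x")
print(model)
```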

Dependencies

Have a preference for the following packages when relying on dependencies:

  • 'purrr' for list manipulation and functional programming.
  • 'shiny' for web application development.
  • 'data.table' or 'dplyr' for in-memory data manipulation.
  • 'data.table' or 'readr' for efficient data import (CSV/TSV, etc.).
  • 'arrow' when dealing with 'parquet' files.
  • 'duckdb' when dealing with larger-than-memory datasets.
  • 'ggplot2' for plotting.
  • 'checkmate' for input assertions.
  • 'cli' for displaying user-facing messages.
  • 'glue' for string interpolation.
  • 'mirai' for parallel computing.
  • 'plotly' for interactive plotting.
  • 'renv' for dependency management.
  • 'jsonlite' for working with 'json'. If the json object is large, use 'yyjsonr'.
  • 'Rcpp' when integrating C++ code in the R project.