Skip to content
/ joinspy Public

R package for diagnosing data frame joins - detect duplicates, mismatches, and predict row counts before joining

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
Notifications You must be signed in to change notification settings

gcol33/joinspy

joinspy

CRAN status CRAN downloads Monthly downloads R-CMD-check Codecov test coverage License: MIT

Diagnostic Tools for Data Frame Joins in R

The joinspy package helps you understand and debug join operations by analyzing key columns before and after joins, detecting common issues, and explaining unexpected row count changes. Catch problems early instead of discovering them when downstream analysis breaks.

Quick Start

library(joinspy)

# Pre-join diagnostics
report <- join_spy(orders, customers, by = "customer_id")
summary(report)

# Quick pass/fail check
key_check(orders, customers, by = "customer_id")

# Safe join with cardinality enforcement
result <- join_strict(orders, customers, by = "customer_id", expect = "1:1")

# Auto-repair common issues
orders_fixed <- join_repair(orders, by = "customer_id")

Statement of Need

Joins silently produce unexpected results when:

  • Duplicate keys cause row multiplication
  • Trailing whitespace breaks matches invisibly
  • Case mismatches ("ABC" vs "abc") prevent joins
  • Encoding issues make identical-looking strings different
  • NA values in keys cause unexpected behavior
  • Type mismatches (character "1" vs numeric 1) fail silently

These problems are discovered only when downstream analysis breaks. joinspy catches them upfront by analyzing keys before you join, explaining why joins misbehave, and showing where the problems are.

Features

Pre-Join Diagnostics

  • join_spy(): Comprehensive pre-flight diagnostic report
  • key_check(): Quick pass/fail key quality assessment
  • key_duplicates(): Find and locate duplicate keys

Post-Join Analysis

  • join_explain(): Explain row count changes after a join
  • join_diff(): Compare before/after states

Safe Join Wrappers

  • join_strict(): Join with cardinality enforcement (1:1, 1:m, m:1, m:m)
  • left_join_spy(), right_join_spy(), inner_join_spy(), full_join_spy(): Joins with automatic diagnostics
  • last_report(): Retrieve diagnostics after silent (.quiet = TRUE) joins

Auto-Repair

  • join_repair(): Fix whitespace, case, encoding, empty strings automatically
  • suggest_repairs(): Generate R code snippets to fix detected issues

Advanced Analysis

  • detect_cardinality(): Determine actual relationship (1:1, 1:m, m:1, m:m)
  • check_cartesian(): Warn about Cartesian product explosions
  • analyze_join_chain(): Analyze multi-table join sequences

Visualization & Logging

  • plot(): Venn diagram of key overlap (with optional file param to save)
  • summary(): Compact metrics table (with optional format param for text/markdown)
  • log_report(): Write reports to file (text/JSON/RDS)
  • set_log_file(): Enable automatic logging

Installation

# Install from CRAN (when available)
install.packages("joinspy")

# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/joinspy")

Usage Examples

Pre-Join Diagnostics

library(joinspy)

orders <- data.frame(
  customer_id = c("A", "B", "B", "C", "D "),
  amount = c(100, 200, 150, 300, 50),
  stringsAsFactors = FALSE
)

customers <- data.frame(
  customer_id = c("A", "B", "C", "D", "E"),
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  stringsAsFactors = FALSE
)

# Full diagnostic report
report <- join_spy(orders, customers, by = "customer_id")

# Compact summary
summary(report)
#>              metric value
#> 1         left_rows     5
#> 2        right_rows     5
#> 3   left_unique_keys    4
#> 4  right_unique_keys    5
#> ...

Cardinality Enforcement

# Succeeds - 1:1 relationship
products <- data.frame(id = 1:3, name = c("Widget", "Gadget", "Gizmo"))
prices <- data.frame(id = 1:3, price = c(10, 20, 30))

join_strict(products, prices, by = "id", expect = "1:1")

# Fails - duplicates violate 1:1
prices_dup <- data.frame(id = c(1, 1, 2, 3), price = c(10, 15, 20, 30))
join_strict(products, prices_dup, by = "id", expect = "1:1")
#> Error: Cardinality violation: expected '1:1' but found '1:m'

Auto-Repair

messy <- data.frame(
  id = c(" A", "B ", "  C  "),
  value = 1:3,
  stringsAsFactors = FALSE
)

# Preview what would be fixed
join_repair(messy, by = "id", dry_run = TRUE)

# Apply fixes
fixed <- join_repair(messy, by = "id")
fixed$id
#> [1] "A" "B" "C"

Silent Pipeline Mode

# Silent join for pipelines
result <- left_join_spy(orders, customers, by = "customer_id", .quiet = TRUE)

# Access diagnostics afterward
last_report()$match_analysis$match_rate
#> [1] 0.8

Visualization

report <- join_spy(orders, customers, by = "customer_id")

# Venn diagram
plot(report)

# Save to file
plot(report, file = "overlap.png")

Documentation

Related Work

  • dplyr - The relationship argument provides basic cardinality checks
  • tidylog - Logs row count changes (but doesn't diagnose causes)

joinspy fills the gap: it tells you why joins misbehave and where the problems are.

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

I'm a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks.

Buy Me A Coffee

License

MIT (see the LICENSE.md file)

Citation

@software{joinspy,
  author = {Colling, Gilles},
  title = {joinspy: Diagnostic Tools for Data Frame Joins},
  year = {2025},
  url = {https://github.com/gcol33/joinspy}
}

About

R package for diagnosing data frame joins - detect duplicates, mismatches, and predict row counts before joining

Topics

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •