---
title: "vtable Bonus Functions"
author: "Nick Huntington-Klein"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{vtable Bonus Functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The `vtable` package serves the purpose of outputting automatic variable documentation that can be easily viewed while continuing to work with data.

`vtable` contains four main functions: `vtable()` (or `vt()`), `sumtable()` (or `st()`), `labeltable()`, and `dftoHTML()`/`dftoLaTeX()`.

This vignette focuses on some bonus helper functions that come with `vtable` and have been exported because they may be handy to you. They can save a little time, and can help you avoid having to write an anonymous function when a single function call is needed.

-----

# Shortcut Helper Functions

`vtable` includes five shortcut functions. These are generally intended for use with the `summ` option in `vtable` and `sumtable` because nested functions don't look very nice in a `vtable`, or in a `sumtable` unless you explicitly set the `summ.names`.

## `nuniq`

`nuniq(x)` returns `length(unique(x))`, the number of unique values in the vector.

## `countNA`, `propNA`, and `notNA`

These three functions are shortcuts for dealing with missing data. You have probably written out the nested versions of these many times!

| Function     | Short For         |
|--------------|-------------------|
| `countNA(x)` | `sum(is.na(x))`   |
| `propNA(x)`  | `mean(is.na(x))`  |
| `notNA(x)`   | `sum(!is.na(x))`  |

Note that `notNA()` also has some additional formatting options, which you would probably ignore if using it interactively.

## `is.round`

This function is a shortcut for `!any(!(x == round(x,digits)))`.

It takes two arguments: a vector `x` and a number of `digits` (0 by default). It checks whether you can round `x` to `digits` digits without losing any information.

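As a quick check of the definitions above, the sketch below compares each shortcut with the nested base R expression it stands for. The vector `x` is made-up illustrative data, not something shipped with the package.

```{r}
library(vtable)

# Made-up illustrative vector with one missing value
x <- c(1, 2, 2, NA, 3.5)

# Each shortcut should agree with its nested base R equivalent
nuniq(x) == length(unique(x))
countNA(x) == sum(is.na(x))
propNA(x) == mean(is.na(x))
notNA(x) == sum(!is.na(x))

# is.round() with the default digits = 0, then with one digit allowed
is.round(c(1, 2, 3))
is.round(c(1.5, 2))
is.round(c(1.5, 2), digits = 1)
```

Note that `unique()` treats `NA` as a value of its own, so `nuniq()` counts it too.
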

-----

# Other Helper Functions

## `formatfunc`

`formatfunc()` is a function that returns a function, which itself helps format numbers using the `format()` function, in the same spirit as the `label_` functions in the scales package. It is largely used for the `numformat` argument of `sumtable()`.

`formatfunc()` for the most part takes the same arguments as `format()`, so `help(format)` can be a guide for using it. However, there are some differences.

Some defaults are changed. By default, `scientific = FALSE, trim = TRUE`.

There are four new arguments as well. `percent = TRUE` will format the number as a percentage by multiplying it by 100 and adding a % at the end. You can instead set `percent` to a number, in which case that number, rather than 1, is treated as 100%. So `percent = 100`, for example, will just add a % at the end without doing any multiplying.

`prefix` and `suffix` will, naturally, add prefixes or suffixes to the formatted number. So `prefix = '$', suffix = 'M'`, for example, will produce a function that will turn `3` into `$3M`.

`scale` will multiply the number by `scale` before formatting it. So `prefix = '$', suffix = 'M', scale = 1/1000000` will turn `3000000` into `$3M`.

```{r}
library(vtable)
my_formatter_func <- formatfunc(percent = TRUE, digits = 3, nsmall = 2, big.mark = ',')
my_formatter_func(523.2355987)
```

## `pctile`

`pctile(x)` is short for `quantile(x,1:100/100)`. So in one sense this is another shortcut function. But it lets you interact with percentiles a bit differently.

While `quantile()` has you specify which percentile you want in the function call, `pctile()` returns an object with all integer percentiles, and you can pull out the ones you want afterwards. `pctile(x)[50]` is the 50th percentile, etc. This can be convenient in several applications, an obvious one being in `sumtable`.


```{r}
library(vtable)
#Some random normal data, and its percentiles
d <- rnorm(1000)
pc <- pctile(d)

#25th, 50th, 75th percentile
pc[c(25,50,75)]
```

```{r}
#Inverse normal CDF with 100 points of articulation
plot(pc)
```

## `weighted.sd`

`weighted.sd(x, w)` is a function to calculate a weighted standard deviation of `x` using `w` as weights, much like the base `weighted.mean()` does for means. It is mostly used as a helper function for `sumtable()` when `group.weights` is specified. However, you can use it on its own if you like. Unlike `weighted.mean()`, setting `na.rm = TRUE` will account for missing values in both `x` and `w`.

The weighted standard deviation is calculated as

$$
\sqrt{\frac{\sum_i(w_i*(x_i-\bar{x}_w)^2)}{\frac{N_w-1}{N_w}\sum_iw_i}}
$$

where $\bar{x}_w$ is the weighted mean of $x$, and $N_w$ is the number of observations with a nonzero weight.

```{r}
x <- 1:100
w <- 1:100
weighted.mean(x, w)
sd(x)
weighted.sd(x, w)
```

# `independence.test`

`independence.test` is a helper function for `sumtable(group.test=TRUE)` that tests for independence between a categorical variable `x` and another variable `y` that may be categorical or numerical.

It then returns a *formatted string*, with significance stars, for printing.

The function takes the format

```{r, eval = FALSE}
independence.test(x,y,w=NA,
                  factor.test = NA,
                  numeric.test = NA,
                  star.cutoffs = c(.01,.05,.1),
                  star.markers = c('***','**','*'),
                  digits = 3,
                  fixed.digits = FALSE,
                  format = '{name}={stat}{stars}',
                  opts = list())
```

## `factor.test` and `numeric.test`

These are functions that actually perform the independence test. `numeric.test` is used when `y` is numeric, and `factor.test` is used in all other instances.

Specifically, these functions should take only `x`, `y`, and `w=NULL` as arguments, and should return a list with three elements: the name of the test statistic, the test statistic itself, and the p-value of the test.

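A custom test function following this contract might look like the sketch below. `kw.it` is a hypothetical name, and the Kruskal-Wallis test is chosen only for illustration. The sketch ignores weights, and returning the three elements as an unnamed list in the order listed above is an assumption about what `independence.test` expects.

```{r}
# Hypothetical custom numeric.test: Kruskal-Wallis rank-sum test of y across the groups in x.
# w is accepted to match the expected arguments but is ignored in this unweighted sketch.
# Assumption: the returned list holds (name, statistic, p-value) in that order.
kw.it <- function(x, y, w = NULL) {
  result <- stats::kruskal.test(x = y, g = as.factor(x))
  list('K-W chi2',
       unname(result$statistic),
       result$p.value)
}

# Calling it directly on the built-in iris data
kw_result <- kw.it(iris$Species, iris$Sepal.Length)
kw_result
```

In principle, a function like this could then be supplied as `numeric.test = kw.it` in a call to `independence.test`.
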

By default, these are the internal functions `vtable:::chisq.it` for `factor.test` and `vtable:::groupf.it` for `numeric.test`, so you can take a look at those (just enter `vtable:::chisq.it` in the R console and it will show you the function's code) if you'd like to make your own test functions.

## `star.cutoffs` and `star.markers`

These are numeric and character vectors, respectively, used for p-value cutoffs and to create significance markers.

`star.cutoffs` indicates the cutoffs, and `star.markers` indicates the markers to be used with each cutoff, in the same order. So with `star.cutoffs = c(.01,.05,.1)` and `star.markers = c('***','**','*')`, each p-value below .01 will get marked with `'***'`, each from .01 to .05 will get `'**'`, and each from .05 to .1 will get `'*'`.

Defaults are set to "economics defaults" (.1, .05, .01), but these are of course easy to change.

```{r}
data(iris)
independence.test(iris$Species,
                  iris$Sepal.Length,
                  star.cutoffs = c(.05,.01,.001))
```

## `digits` and `fixed.digits`

`digits` indicates how many digits after the decimal place should be displayed for the test statistic and p-value. `fixed.digits` determines whether trailing zeros are maintained.

```{r}
independence.test(iris$Species,
                  iris$Sepal.Width,
                  digits=1)
```

```{r}
independence.test(iris$Species,
                  iris$Sepal.Width,
                  digits=4,
                  fixed.digits = TRUE)
```

## `format`

This is the printing format that the output will follow, incorporating the name of the test statistic `{name}`, the test statistic `{stat}`, the significance markers `{stars}`, and the p-value `{pval}`.

If your `independence.test` output is headed somewhere other than the R console, you may want to add additional markup like `'{name}$={stat}^{stars}$'` in LaTeX or `'{name}={stat}<sup>{stars}</sup>'` in HTML. If you do this, be sure to think carefully about escaping or not escaping characters as appropriate when you print!


```{r}
independence.test(iris$Species,
                  iris$Sepal.Width,
                  format = 'Pr(>{name}): {pval}{stars}')
```

## `opts`

You can create a named list where the names are the above options and the values are the settings for those options, and input it into `independence.test` using `opts=`. This is an easy way to set the same options for many calls to `independence.test`.

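
As a minimal sketch of this pattern, the options can be stored once and reused across calls. The list name `my_opts` is made up for illustration, and the particular settings are arbitrary choices among the options described above.

```{r}
library(vtable)

# Hypothetical shared settings; the names match independence.test arguments
my_opts <- list(digits = 2,
                format = '{name}={stat}{stars} (p={pval})')

# The same options applied to two different tests
out_length <- independence.test(iris$Species, iris$Sepal.Length, opts = my_opts)
out_width <- independence.test(iris$Species, iris$Petal.Width, opts = my_opts)
out_length
out_width
```

Keeping the settings in one list means a change to, say, `digits` only has to be made in one place.
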