boxplot

Boxplots in fivecentplots are modeled after the “Variability Chart” in JMP which provides convenient, multi-level group labels automatically along the x-axis. Data can be broken into multiple subsets for easy visualization by simply listing the DataFrame column names of interest in the groups keyword. At a minimum, the boxplot function requires the following keywords:

  • df: a pandas DataFrame

  • y: the name of the DataFrame column containing the y-axis data

Setup

Imports

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import fivecentplots as fcp
import pandas as pd
import numpy as np
import os, sys, pdb
osjoin = os.path.join
db = pdb.set_trace

Sample data

Read some fake boxplot data

In [2]:
df = pd.read_csv(osjoin(os.path.dirname(fcp.__file__), 'tests', 'fake_data_box.csv'))
df.head()
Out[2]:
Batch Sample Region Value ID
0 101 1 Alpha123 3.5 ID701223A
1 101 1 Alpha123 2.1 ID7700-1222B
2 101 1 Alpha123 3.3 ID701223A
3 101 1 Alpha123 3.2 ID7700-1222B
4 101 1 Alpha123 4.0 ID701223A

Set theme

Optionally set the design theme

In [3]:
#fcp.set_theme('gray')
#fcp.set_theme('white')

Other

In [4]:
SHOW = False

Groups

Consider the following boxplot of made-up data:

In [5]:
fcp.boxplot(df=df, y='Value', show=SHOW, tick_labels_minor=True, grid_minor=True)
_images/boxplot_15_0.png

Single group

Rather than lumping the data into a single box, we can separate them into categories to get more information. First, set a single group column of “Batch”:

In [6]:
fcp.boxplot(df=df, y='Value', groups='Batch', show=SHOW, sort=False)
_images/boxplot_18_0.png

Multiple groups

We can dive deeper by specifying more than one value for groups:

In [7]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW)
_images/boxplot_21_0.png

By default, the groups are sorted alphanumerically. To preserve the order of the input DataFrame, add the keyword sort=False:

In [8]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], sort=False, show=SHOW)
_images/boxplot_23_0.png

Groups + legend

Boxplots also support legending for another level of visualization:

In [9]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], legend='Region', show=SHOW)
_images/boxplot_26_0.png

Note: if there are a lot of legend items, the position of the legend will be automatically adjusted to avoid rendering over the box group titles.

In [10]:
df['Row'] = [int(f) for f in df.index / 4]
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], legend='Row', show=SHOW)
_images/boxplot_28_0.png

Grid plots

Like the plot function, boxplots can be broken into subplots based on “row” and/or “col” values or “wrap” values.

Column plots

In [11]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], col='Region', show=SHOW, ax_size=[300, 300])
_images/boxplot_32_0.png

Row plots

In [12]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], row='Region', show=SHOW, ax_size=[300, 300])
_images/boxplot_34_0.png

Wrap plots

In [13]:
fcp.boxplot(df=df, y='Value', groups=['Sample', 'Region'], wrap='Batch', show=SHOW, ax_size=[300, 300])
_images/boxplot_36_0.png

Alternatively, we can wrap multiple y column values and create a unique subplot for each column:

In [14]:
# Make a new y column
df['Value*2'] = 2*df['Value']

# Plot
fcp.boxplot(df=df, y=['Value', 'Value*2'], groups=['Batch', 'Sample', 'Region'], wrap='y', show=SHOW,
            ax_size=[300, 300])
_images/boxplot_38_0.png

Or if we disable y-axis range sharing:

In [15]:
fcp.boxplot(df=df, y=['Value', 'Value*2'], groups=['Batch', 'Sample', 'Region'], wrap='y', show=SHOW,
            ax_size=[300, 300], share_y=False)
_images/boxplot_40_0.png

Other options

Grand Mean/Median

The “grand mean” or “grand median” is the mean/median value for the entire data set in a given plot window. By default, the “grand mean” line is a dashed gray line and the “grand median” is a dashed blue line. Individual line color, styles, and widths can be controlled via the typically-named keywords.

In [16]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, grand_mean=True, grand_median=True)
_images/boxplot_44_0.png

Both long form and short form keywords are available: i.e., box_grand_mean_ATTRIBUTENAME or grand_mean_ATTRIBUTENAME

In [17]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, grand_mean=True,
            grand_mean_style=':', grand_mean_color='#FF0000', box_grand_mean_width=0.5)
_images/boxplot_46_0.png

Group Means

Group means that correspond to the first level of grouping (i.e., same as the vertical divider lines). By default, the mean values are depicted with horizontal dashed magenta lines. Style are controlled by box_group_means_ATTRIBUTENAME or group_means_ATTRIBUTENAME.

In [18]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, group_means=True)
_images/boxplot_49_0.png

Mean Diamonds

The box_mean_diamonds or mean_diamonds keyword allows you to overlay a diamond on the box which shows vertically the span of the data for a given confidence interval (default = 95%) and a horizontal line for the mean value of each group. The following keywords are available to modify the diamond from its default style:

  • box_mean_diamonds_alpha | mean_diamonds_alpha: transparency of the diamond 0 to 1 (default: 1)
  • conf_coeff: confidence interval from 0 to 1 (default: 0.95)
  • box_mean_diamonds_edge_color | mean_diamonds_edge_color: edge color of the diamond (default: ‘#FF0000’)
  • box_mean_diamonds_edge_style | mean_diamonds_edge_style: edge style of the diamond (default: ‘-‘)
  • box_mean_diamonds_edge_width | mean_diamonds_edge_width: edge width of the diamond (default: 0.6)
  • box_mean_diamonds_fill_color | mean_diamonds_fill_color: fill color of the diamond (default: None)
  • box_mean_diamonds_width | mean_diamonds_width: width of the diamond from 0 to 1 (default: 0.8)
In [19]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, mean_diamonds=True, conf_coeff=0.95)
_images/boxplot_52_0.png

Violins

We can also plot distributions with violin plots that show kernal density estimates of the data. By default, these violin plots also contain a small boxes with whiskers to indicate Q1, Q3, 1.5 * IQR and the median of the distribution. Discrete data points are disabled by default but can be turned on with the keyword violin_markers=True (box style shamelessly appropriated from seaborn).

In [20]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, violin=True)
_images/boxplot_55_0.png

We can change the style of the violin density profiles and the associated boxplot using keywords starting with violin. Note that the standard box styling attributes are ignored when adding the violin plot. The reason for this is to make it possilbe to maintain different default settings for regular box plots and violin plots in the same theme file.

In [21]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, violin=True,
            violin_fill_color='#eaef1a', violin_fill_alpha=1, violin_edge_color='#555555', violin_edge_width=2,
            violin_box_color='#ffffff', violin_whisker_color='#ff0000',
            violin_median_marker='+', violin_median_color='#00ffff', violin_median_size=10)
_images/boxplot_57_0.png

We can also disable the box overlay on the violin plot as follows:

In [22]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, violin=True, violin_box_on=False, violin_markers=True, jitter=False)
_images/boxplot_59_0.png

Stat line

In addition to displaying boxes with a median line and interquartile ranges, a connecting line can be drawn between boxes at some statistical value. By default, the line connects the mean value of each distribution but other DataFrame stat values can be selected. The stat line accepts the typical styling keywords of any line object with the prefix box_stat_line_ (i.e., box_stat_line_color or box_stat_line_width)

Mean

In [23]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, box_stat_line='mean', ax_size=[300, 300])
_images/boxplot_63_0.png

Median

In [24]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, box_stat_line='median', ax_size=[300, 300])
_images/boxplot_65_0.png

Std dev

In [25]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, box_stat_line='std', ax_size=[300, 300])
_images/boxplot_67_0.png

Dividers

Using the keyword box_divider, lines can be drawn on the boxplot to visually segrate main groups of boxes. These lines are enabled by default but can be turned off easily:

In [26]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, box_divider=False, ax_size=[300, 300])
_images/boxplot_70_0.png

Range lines

Because outlier points by definition fall outside of the span of the box, we can draw lines that span the entire range of the data. This is particularly useful to indicate when there are data points that fall outside of the limits of the y-axis. These lines are enabled by default but can be disabled or styled through keywords with the prefix box_range_lines_:

In [27]:
fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'], show=SHOW, box_range_lines=False, ax_size=[300, 300])
_images/boxplot_73_0.png