boxplot

This section describes the options available for boxplots in fivecentplots.

See the full API

Setup

Import packages:


import fivecentplots as fcp
import pandas as pd
from pathlib import Path

Read some fake boxplot data with no real meaning:


df = pd.read_csv(Path(fcp.__file__).parent / 'test_data/fake_data_box.csv')
df.head()

   Batch  Sample    Region  Value            ID
0    101       1  Alpha123    3.5     ID701223A
1    101       1  Alpha123    0.0  ID7700-1222B
2    101       1  Alpha123    3.3     ID701223A
3    101       1  Alpha123    3.2  ID7700-1222B
4    101       1  Alpha123    4.0     ID701223A

Optionally set the design theme (skipping here and using default):


#fcp.set_theme('gray')
#fcp.set_theme('white')

Groups

One of the most powerful features of JMP is its variability chart, which partitions data into separate boxplots based on one or more grouping criteria. fivecentplots provides the same capability via the keyword groups. The value of this keyword is one or more column names from the DataFrame that differentiate the y values being plotted.

Single group

Using our fake dataset, we first plot our data grouped by a single column named “Batch”. Because our dataset has 3 unique values in the “Batch” column, we get three boxplots, each with a descriptive label.


df.Batch.unique()

array([101, 106, 103])

fcp.boxplot(df, y='Value', groups='Batch')
_images/boxplot_15_0.png

Multiple groups

We can dive deeper into the dataset by specifying a list of column names for groups. In this example we use columns “Batch” and “Sample” for which there are 7 unique groups. Notice that the order of the values passed to groups determines the grouping hierarchy, with the first grouping column values on the bottom of the plot.

Note that for purposes of the boxplot API, the values in the white rectangles under the plotting area are styled via keywords prefixed with box_group_label, while the column name strings are styled via keywords prefixed with box_group_title.


df[['Batch', 'Sample']].drop_duplicates()

    Batch  Sample
0     101       1
10    101       2
22    106       1
31    106       2
41    103       1
50    103       2
60    103       3

fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'])
_images/boxplot_19_0.png

By default, the group values are sorted alphanumerically. To preserve the order of the input DataFrame, add the keyword sort=False:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], sort=False)
_images/boxplot_21_0.png
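
To make the effect of sort concrete, here is a minimal sketch in plain Python of how the order of the boxes could be derived from the grouping columns. The helper group_order is hypothetical and is not part of fivecentplots; the (Batch, Sample) pairs mirror the table above, and tuple sorting stands in for the library's alphanumeric sort.

```python
# Hypothetical (Batch, Sample) pairs in the order they appear in the DataFrame.
rows = [
    (101, 1), (101, 2),
    (106, 1), (106, 2),
    (103, 1), (103, 2), (103, 3),
]


def group_order(pairs, sort=True):
    """Return the unique group keys in the order the boxes would be drawn."""
    # dict.fromkeys drops duplicates while keeping first-appearance order
    unique = list(dict.fromkeys(pairs))
    return sorted(unique) if sort else unique


sorted_order = group_order(rows, sort=True)   # Batch 103 moves ahead of 106
input_order = group_order(rows, sort=False)   # Batch 106 stays ahead of 103
print(sorted_order)
print(input_order)
```

With sort=True the third box belongs to Batch 103; with sort=False it belongs to Batch 106, matching the order of the input DataFrame.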

No groups

If no groups are specified, no grouping labels are added to the boxplot.


fcp.boxplot(df, y='Value', tick_labels_minor=True, grid_minor=True)
_images/boxplot_24_0.png

Box elements

Several descriptive elements are available within the boxplot to better visualize the dataset. These features are illustrated below.

Dividers

By default, when multiple groups are specified, gray divider lines are drawn between the bottom-level groups.


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'])
_images/boxplot_29_0.png

These divider lines can be disabled or styled using the appropriate keywords.

Disabled:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], box_divider=False)
_images/boxplot_31_0.png

Styled:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'],
            box_divider_color='#d42269', box_divider_width=10, box_divider_style='--')
_images/boxplot_33_0.png

Whiskers

The “whiskers” on a boxplot extend from Q1 down to the smallest non-outlier value and from Q3 up to the largest non-outlier value. Outliers are defined as any points below (Q1 - 1.5 * IQR) on the lower end or above (Q3 + 1.5 * IQR) on the upper end, where IQR = Q3 - Q1. In the example below, notice that the first and second boxes each have an outlier point outside of the blue whisker lines. Whisker lines can be styled using the keywords prefixed by box_whisker. Note that the horizontal “cap” at the end of each whisker line shares the same style as the whisker itself. Default behavior is a solid gray line 0.5 pixel in width.

The dashed lines that span to the outlier points are controlled by box_range_lines described in the next section.


fcp.boxplot(df=df, y='Value', groups=['Batch', 'Sample'],
            box_whisker_color='#0000FF', box_whisker_width=3)
_images/boxplot_36_0.png
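
The whisker and outlier rules above can be sketched in plain Python. This is an illustration only: the sample values are hypothetical, and the quartile interpolation method is an assumption that may differ from what fivecentplots uses internally.

```python
import statistics

# Hypothetical sample with one low outlier (not taken from the dataset above).
values = [0.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.0]

# Quartiles via linear interpolation (assumed method).
q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are outliers; whiskers stop at the extreme inliers.
outliers = [v for v in values if v < lower_fence or v > upper_fence]
inliers = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inliers), max(inliers)

# Range lines span the full data, including outliers.
range_low, range_high = min(values), max(values)

print(outliers, (whisker_low, whisker_high), (range_low, range_high))
```

Here the lower whisker stops at 3.1 while the range line continues down to the outlier at 0.0.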

Whiskers can be disabled with box_whisker=False (note that this keyword is singular, unlike the plural box_range_lines). Beware that if the range lines are still enabled, you will still see similar lines and horizontal caps.


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], box_whisker=False, box_range_lines=False)
_images/boxplot_38_0.png

Range lines

Outlier points by definition fall outside of the box whiskers, but with range lines we can span the entire range of the data (from absolute minimum to absolute maximum). This is particularly useful to indicate when there are outlier data points that fall outside of the limits of the visible y-axis. These range lines are enabled by default but can be disabled or styled through keywords with the prefix box_range_lines.

Default behavior is a dashed gray line 0.5 pixel in width:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], box_whisker=False)
_images/boxplot_41_0.png

If we disable range lines and leave whiskers enabled we get this:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], box_range_lines=False)
_images/boxplot_43_0.png

Styled:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], box_whisker=False,
            box_range_lines_color='#d42269', box_range_lines_width=10, box_range_lines_style='-.')
_images/boxplot_45_0.png

Markers

Comment on box_ prefix

Jitter

To improve visibility of the actual data points in the boxplots, fivecentplots automatically jitters the data points (i.e., adds some random noise along the x-axis). This can be disabled using the keyword jitter.


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], jitter=False)
_images/boxplot_50_0.png
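
As a rough illustration of what jitter means, the sketch below spreads points around the x position of a box by adding uniform random noise. The function jitter_x, its width parameter, and the choice of a uniform distribution are assumptions for illustration; the noise model fivecentplots actually uses is not described here.

```python
import random


def jitter_x(x_center, count, width=0.2, seed=0):
    """Return `count` x positions scattered uniformly around x_center."""
    rng = random.Random(seed)  # seeded so the scatter is reproducible
    return [x_center + rng.uniform(-width, width) for _ in range(count)]


# Hypothetical y values for one box drawn at x = 1.
y_values = [3.5, 0.0, 3.3, 3.2, 4.0]
x_values = jitter_x(x_center=1, count=len(y_values))

# Only x changes; the y values are plotted as-is.
for x, y in zip(x_values, y_values):
    print(f"x={x:.3f}, y={y}")
```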

More grouping

Legend

Boxplots also support legending for another level of data visualization:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], legend='Region')
_images/boxplot_54_0.png

Note: if there are a lot of legend items, the position of the legend will be automatically adjusted to avoid rendering over the box group titles.


df['Row'] = [int(f) for f in df.index / 4]
fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], legend='Row')
_images/boxplot_56_0.png

Column plots

Boxplots can also be broken into subplots using the row, col, or wrap keywords. In each case, a column name in the DataFrame is supplied as the keyword value.


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], col='Region', ax_size=[300, 300])
_images/boxplot_59_0.png

Row plots


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], row='Region', ax_size=[300, 300])
_images/boxplot_61_0.png

Wrap plots


fcp.boxplot(df, y='Value', groups=['Sample', 'Region'], wrap='Batch', ax_size=[300, 300])
_images/boxplot_63_0.png

Alternatively, we can wrap multiple y column values and create a unique subplot for each column:


# Make a new y column
df['Value*2'] = 2*df['Value']

# Plot
fcp.boxplot(df, y=['Value', 'Value*2'], groups=['Batch', 'Sample', 'Region'], wrap='y',
            ax_size=[300, 300])
_images/boxplot_65_0.png

Or if we disable y-axis range sharing:


fcp.boxplot(df, y=['Value', 'Value*2'], groups=['Batch', 'Sample', 'Region'], wrap='y',
            ax_size=[300, 300], share_y=False)
_images/boxplot_67_0.png

Stats

Grand Mean/Median

The “grand mean” or “grand median” is the mean/median value for the entire data set in a given plot window. By default, the “grand mean” line is a dashed gray line and the “grand median” is a dashed blue line. Individual line color, styles, and widths can be controlled via the typically-named keywords prefixed by box_grand_mean or box_grand_median.


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], grand_mean=True, grand_median=True)
_images/boxplot_71_0.png

Both long form and short form keywords are available: i.e., box_grand_mean_ATTRIBUTE or grand_mean_ATTRIBUTE


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], grand_mean=True,
            grand_mean_style=':', grand_mean_color='#FF0000', box_grand_mean_width=0.5)
_images/boxplot_73_0.png
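
For reference, the grand statistics are computed over all values in the plot window, ignoring the grouping. A minimal sketch with hypothetical values (the group labels are made up for illustration):

```python
import statistics

# Hypothetical values for two boxes in the same plot window.
groups = {
    "101 / 1": [3.5, 0.0, 3.3, 3.2, 4.0],
    "101 / 2": [3.6, 3.4, 3.8],
}

# Pool every value regardless of group before computing the statistics.
all_values = [v for vals in groups.values() for v in vals]
grand_mean = statistics.mean(all_values)
grand_median = statistics.median(all_values)

print(grand_mean, grand_median)
```

In this sample the single low value at 0.0 pulls the grand mean below the grand median, which is why it can be useful to show both lines.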

Group Means

Group means are calculated for the first level of grouping (i.e., the same groups separated by the vertical divider lines). By default, the mean values are depicted with horizontal dashed magenta lines. Styles are controlled by box_group_means_ATTRIBUTE or group_means_ATTRIBUTE.


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], group_means=True)
_images/boxplot_76_0.png
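
The calculation behind the group mean lines can be sketched as follows: values are pooled by the first grouping column only, so every box within a given Batch shares one mean line. The rows below are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical (Batch, Sample, Value) rows.
rows = [
    (101, 1, 3.5), (101, 1, 3.3), (101, 2, 3.7),
    (106, 1, 4.1), (106, 2, 4.5),
]

# Pool values by the first-level group (Batch), ignoring Sample.
by_batch = defaultdict(list)
for batch, _sample, value in rows:
    by_batch[batch].append(value)

group_means = {batch: statistics.mean(vals) for batch, vals in by_batch.items()}
print(group_means)
```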

Mean Diamonds

The box_mean_diamonds or mean_diamonds keyword overlays a diamond on each box. The vertical extent of the diamond shows the confidence interval of the group mean (set by conf_coeff, default = 95%) and the horizontal line shows the mean value of each group. Using default parameters, the diamonds are green (like the program that inspired them :) ).


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], mean_diamonds=True, conf_coeff=0.95)
_images/boxplot_79_0.png
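
To give a feel for what the diamond height represents, the sketch below computes a confidence interval for a group mean using a normal approximation. This is an assumption made to keep the example dependency-free; fivecentplots may compute the interval differently (for example, with a t-distribution), and the helper mean_interval is hypothetical.

```python
import math
import statistics


def mean_interval(values, conf_coeff=0.95):
    """Return (low, mean, high) for a normal-approximation interval of the mean."""
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    # Two-sided critical value, about 1.96 for conf_coeff=0.95
    z = statistics.NormalDist().inv_cdf(0.5 + conf_coeff / 2)
    return mean - z * std_err, mean, mean + z * std_err


# Hypothetical values for a single box.
sample = [3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8]
low, mean, high = mean_interval(sample, conf_coeff=0.95)
print(f"diamond spans {low:.3f} to {high:.3f}, mean line at {mean:.3f}")
```

A larger conf_coeff gives a taller diamond, since a wider interval is needed to cover the mean with higher confidence.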

Stat line

In addition to displaying boxes with a median line and interquartile ranges, a connecting line can be drawn between boxes at some statistical value. By default, the line connects the mean value of the distribution for each box, but other DataFrame stat values can be selected. The stat line accepts the typical styling keywords of any line object with the prefix box_stat_line (e.g., box_stat_line_color or box_stat_line_width).

Mean


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], box_stat_line='mean')
_images/boxplot_83_0.png

Median


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], box_stat_line='median')
_images/boxplot_85_0.png

Std dev


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], box_stat_line='std')
_images/boxplot_87_0.png

Quantile

To define a quantile, use the convention “q{number between 0-100}”:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], box_stat_line='q95')
_images/boxplot_90_0.png
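
The stat names used in this section, including the “q” convention, could be interpreted along the lines of the sketch below. The functions stat_value and quantile are hypothetical helpers written for illustration, not the parsing code used by fivecentplots, and the quantile uses linear interpolation as an assumption.

```python
import statistics


def quantile(values, fraction):
    """Linear-interpolation quantile for a fraction between 0 and 1."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * fraction
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    weight = pos - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def stat_value(name, values):
    """Evaluate a stat name such as 'mean', 'median', 'std', or 'q95'."""
    simple = {"mean": statistics.mean, "median": statistics.median,
              "std": statistics.stdev}
    if name in simple:
        return simple[name](values)
    if name.startswith("q"):
        percent = float(name[1:])  # 'q95' -> 95.0
        if not 0 <= percent <= 100:
            raise ValueError(f"quantile out of range: {name}")
        return quantile(values, percent / 100)
    raise ValueError(f"unsupported stat: {name}")


data = list(range(101))  # hypothetical values 0..100
print(stat_value("q95", data), stat_value("median", data))
```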

Violins

We can also plot distributions with violin plots that show kernel density estimates of the data. By default, these violin plots also contain small boxes with whiskers to indicate Q1, Q3, 1.5 * IQR, and the median of the distribution (the white point). Discrete data points are disabled by default but can be turned on with the keyword violin_markers=True (default box style shamelessly appropriated from seaborn).


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], violin=True)
_images/boxplot_93_0.png
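
For readers unfamiliar with kernel density estimates, the outline of a violin is a smoothed version of the data's histogram. Below is a minimal sketch of a Gaussian kernel density estimate in plain Python. The fixed bandwidth and the sample values are assumptions for illustration; plotting libraries typically choose the bandwidth automatically.

```python
import math


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


# Hypothetical values for a single violin.
values = [3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8]
step = 0.01
grid = [2.0 + i * step for i in range(301)]  # 2.00 to 5.00
density = gaussian_kde(values, grid, bandwidth=0.2)

# The density should integrate to approximately 1 over a wide enough grid.
area = sum(density) * step
print(f"area under curve: {area:.3f}")
```

The violin's half-width at each y value is proportional to this density, mirrored about the center line.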

We can change the style of the violin density profiles and the associated boxplot using keywords starting with violin. Note that the standard box styling attributes are ignored when adding the violin plot. The reason for this is to make it possible to maintain different default settings for regular box plots and violin plots in the same theme file.


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], violin=True,
            violin_fill_color='#eaef1a', violin_fill_alpha=1, violin_edge_color='#555555', violin_edge_width=2,
            violin_box_color='#ffffff', violin_whisker_color='#ff0000',
            violin_median_marker='+', violin_median_color='#00ffff', violin_median_size=10)
_images/boxplot_95_0.png

We can also disable the box overlay on the violin plot as follows:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], violin=True, violin_box_on=False, violin_markers=True, jitter=False)
_images/boxplot_97_0.png

Notch

Notch-style boxes can also be created using the notch keyword:


fcp.boxplot(df, y='Value', groups=['Batch', 'Sample'], notch=True)
_images/boxplot_100_0.png
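
A notch conventionally marks an approximate confidence interval around the median. The sketch below uses the common median ± 1.57 * IQR / sqrt(n) rule; whether fivecentplots uses exactly this rule is an assumption, and both the helper notch_limits and the sample values are hypothetical.

```python
import math
import statistics


def notch_limits(values):
    """Return (low, median, high) using the 1.57 * IQR / sqrt(n) rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median, median + half_width


# Hypothetical values for a single box.
sample = [0.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.0]
notch_low, notch_median, notch_high = notch_limits(sample)
print(f"notch spans {notch_low:.3f} to {notch_high:.3f}")
```

When the notches of two boxes do not overlap, this is commonly read as informal evidence that their medians differ.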

A Note on Size

Depending on the number of group labels in a boxplot, it is possible that the box tick labels will be squished together and become unreadable. fivecentplots offers three options to deal with this:

  1. Auto-rotation of long labels (by default)


df2 = df.copy()
df2['Identification is a longer way to say ID'] = df['ID'] * 2
fcp.boxplot(df2, y='Value', groups=['Batch', 'Identification is a longer way to say ID', 'Sample'])
_images/boxplot_103_0.png
  2. Manual increase of ax_size parameter:

  • Default:


    df2 = df.copy()
    df2.Value *= 2
    df2.loc[df2.Sample == 1, 'Sample'] = 4
    df2.loc[df2.Sample == 2, 'Sample'] = 5
    df2.loc[df2.Sample == 3, 'Sample'] = 6
    df3 = df.copy()
    df3.Value *= 3
    df3.loc[df3.Sample == 1, 'Sample'] = 7
    df3.loc[df3.Sample == 2, 'Sample'] = 8
    df3.loc[df3.Sample == 3, 'Sample'] = 9
    df4 = df.copy()
    df4.Value *= 4
    df4.loc[df4.Sample == 1, 'Sample'] = 10
    df4.loc[df4.Sample == 2, 'Sample'] = 11
    df4 = pd.concat([df4, df3, df2, df])
    fcp.boxplot(df4, y='Value', groups=['Batch', 'ID', 'Sample'])
_images/boxplot_106_0.png
  • With twice the width:


    fcp.boxplot(df4, y='Value', groups=['Batch', 'ID', 'Sample'], ax_size=[800, 400])
_images/boxplot_108_0.png
  3. Auto-width scaling. This feature is new in fivecentplots v0.5.0 and is still experimental, so your mileage may vary. To enable it, set ax_size='auto':


    fcp.boxplot(df4, y='Value', groups=['Batch', 'ID', 'Sample'], ax_size='auto')
_images/boxplot_110_0.png