Grouping data

A single pandas DataFrame can often be decomposed into smaller subsets of data that need to be visualized. One of the key advantages of fivecentplots is its incredibly easy syntax to isolate data sets based on other columns in a DataFrame. The following grouping options are available:

  • filter: simple string which a more convenient syntax than standard numpy filtering used to select only DataFrame rows which meet certain criteria

  • legend: color lines and markers according to unique values in a DataFrame column

  • groups:

    • for xy plots: separates unique subsets of data so that plot lines are not continuously looped back to the origin (useful for replicates of similar data)

    • for boxplots: groups boxes by unique values in one or more DataFrame columns

  • row: makes a grid of subplots with each unique value of a single DataFrame column grouped in rows in a single column; group labels are on the right side of each plot

  • col: makes a grid of subplots with each unique value of a single DataFrame column grouped in columns in a single row; group labels are on the top of each plot

  • wrap:

    • Option 1: similar to row and column grouping, makes a grid of subplots for each unique value of one or more the DataFrame column names; group labels and values are on top of each plot

    • Option 2: wrap by x or y to create a uniqe subplots for each column name listed

  • figure: makes a unique figure for each unique value of a DataFrame column

Note

We can achieve an extra grouping dimension using the kwarg filter followed by a string matching DataFrame column names with values (filter='Column1==5 & Column2>3')

Setup

Import packages:


import fivecentplots as fcp
import pandas as pd
from pathlib import Path

Read some dummy data for examples:


df1 = pd.read_csv(Path(fcp.__file__).parent / 'test_data/fake_data.csv')
df2 = pd.read_csv(Path(fcp.__file__).parent / 'test_data/fake_data_box.csv')

Optionally set the design theme (skipping here and using default):


#fcp.set_theme('gray')
#fcp.set_theme('white')

filter

Consider the following set of fake current vs voltage data, which comes from multiple samples at multiple conditions:


fcp.plot(df1, x='Voltage', y='I [A]', lines=False)
_images/grouping_11_0.png

To isolate these data to particular subsets of interest we need to filter the original data set. Because our data comes from pandas DataFrames, this is fairly straight forward as shown below. However, depending on how our DataFrame columns are named, the syntax could get ugly quickly:


fcp.plot(df1[(df1.Substrate=='Si')&(df1['Target Wavelength']==450)&(df1['Boost Level']==0.2)&(df1['Temperature [C]']==25)&(df1['Die']=='(1,1)')],
         x='Voltage', y='I [A]')
_images/grouping_13_0.png

For convenience, fivecentplots provides an alternate filtering syntax which allows the user to specify matching criteria is a more human-readable way. Here we apply the same filter as above using this option:


fcp.plot(df1, x='Voltage', y='I [A]',
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2 & Temperature [C]==25 & Die=="(1,1)"')
_images/grouping_15_0.png

This new syntax also allows for more complicated filter schemes to be easily defined:


fcp.plot(df1, x='Voltage', y='I [A]',
         filter='Substrate not in ["GaAs", "InP"] & Target Wavelength <= 455 & Boost Level==0.2 & Temperature [C]==25 & Die in ["(1,1)"]')
_images/grouping_17_0.png

legend

The legend keyword can be a single DataFrame column name or a list of column names. The data set will then be grouped according to each unique value of the legend column(s) and a separate plot line will be added to the figure for each group. A different color and marker type will be used for better visualization.

Single legend column

In our sample data set, we have fake IV data for various devices, each labeled with a unique “Die” address. By setting the kwarg legend='Die', we can separate the data set for each test site and display it clearly:


fcp.plot(df1, x='Voltage', y='I [A]', legend='Die',
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2 & Temperature [C]==25')
_images/grouping_22_0.png

Multiple legend columns

We can also legend by multiple DataFrame columns. In this case, we pass a list of column names is passed to the legend keyword and get a result where the unique combination of values of each legend column is listed, separated by a pipe character (|):


fcp.plot(df1, x='Voltage', y='I [A]', legend=['Die', 'Substrate'],
         filter='Target Wavelength==450 & Boost Level==0.2 & Temperature [C]==25')
_images/grouping_25_0.png

Multiple x & y values

Legends are also useful when plotting more than one DataFrame column on the y axis without a specific grouping column. The legend shows automatically in this case to identify which data set belows to which DataFrame column, but can be disabled by setting legend=False:


fcp.plot(df1, x='Voltage', y=['I [A]', 'Voltage'], lines=False,
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2')
_images/grouping_28_0.png

Secondary x|y plots

When plotting with a secondary axis (i.e., twinning), there are three options for the legend: (1) no legend; (2) legend based on the values of the primary/secondary axes; or (3) legend based on another column.

No legend


fcp.plot(df1, x='Voltage', y=['Voltage', 'I [A]'], twin_x=True,
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2 & Temperature [C]==25 & Die=="(-1,2)"')
_images/grouping_32_0.png

With legend

Force the legend to show with kwarg legend=True:


fcp.plot(df1, x='Voltage', y=['Voltage', 'I [A]'], twin_x=True, legend=True,
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2 & Temperature [C]==25 & Die=="(-1,2)"')
_images/grouping_35_0.png

Legend by another column

Legend by the name of another DataFrame column (each value in the legend will also state if it is linked to the primary or secondary axis):


fcp.plot(df1, x='Voltage', y=['Voltage', 'I [A]'], twin_x=True, legend='Die',
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2 & Temperature [C]==25 & Die=="(-1,2)"')
_images/grouping_38_0.png

Location

By default, fivecentplots places the legend in the upper right corner of a figure, outside of the plot axes as this is the only way to be certain the legend will never obscure the actual data. However, the legend can be placed inside the axes area by specifying the keyword legend_location with one of the following values (which align roughly with matplotlib syntax):

  • ‘outside’ or 0

  • ‘upper right’ or 1

  • ‘upper left’ or 2

  • ‘lower left’ or 3

  • ‘lower right’ or 4

  • ‘right’ or 5

  • ‘center left’ or 6

  • ‘center right’ or 7

  • ‘lower center’ or 8

  • ‘upper center’ or 9

  • ‘center’ or 10

  • ‘below’ or 11

Note

This feature is not currently supported with the bokeh plotting engine

Outside right

No legend_location is required to place the legend to the right of the axes area. This is the default for fivecentplots, completely eliminating the need for the complex syntax required to do this with other plotting solutions (like bbox_to_anchor with pure matplotlib code). “Everything… in its right place”


fcp.plot(df1, x='Voltage', y='I [A]', legend='Die',
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2 & Temperature [C]==25')
_images/grouping_43_0.png

Inside the plot

Locations 1 through 10 will place the legend within the plot window. While this can result in a more compact figure, there is a also a significant risk that the legend will obscure the data in the plot. Care should be exercised when placing the legend inside the plot.


fcp.plot(df1, x='Voltage', y='I [A]', legend='Die',
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2 & Temperature [C]==25', legend_location=2)
_images/grouping_46_0.png

Below the plot

Long legend values can by annoying to deal with. If the legend is placed in the default location outside the axis, the figure size will have a large width. If the legend is placed within the axis window, it can extend longer than the axis itself. fivecentplots also provides the option of placing the legend “below” the plot to deal with this case:


df1.loc[df1.Die=='(1,1)', 'Long Legend'] = 'Sample #ABCDEFGHIJKLMNOPQRSTUVWXYZ'
df1.loc[df1.Die=='(2,-1)', 'Long Legend'] = 'Sample #RUNFORYOURLIFEWITHME'
df1.loc[df1.Die=='(-1,2)', 'Long Legend'] = 'Sample #THESKYISANEIGHBORHOOD!!!!!!!!!'
fcp.plot(df1, x='Voltage', y='I [A]', legend='Long Legend',
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2 & Temperature [C]==25', legend_location='below')
_images/grouping_49_0.png

Note on location

Note

Relocating the legend is straightforward when dealing with a single plot axes, but the ideal situation is a bit murky for grid plots with legends. For the default legend_location on the outside right of the axes area there is no issue: one legend will be placed in this space for all subplots. When placing the legend inside the axes area however, a duplicate of the legend will be placed in each subplot window unless keyword nleg=1 is included, which will enable the legend only

within the upper right subplot.

groups

Kwarg groups adds another level of grouping to a plot. This keyword requires a DataFrame column name or list of column names which are used to identify subsets in the data and make sure they are properly distinguished in the plot. This option behaves differently for xy plots and boxplots as described below.

xy plots

Some data sets contain multiple sets of similar data (think replicates of a measurement or a large quantity of parts in a production line). Consider the following example where we plot fake measurement data from a large sampling of parts with lines enabled. Notice how the line loops from the end of the data back to the beginning repeatedly. This occurs because the arrays used for the plot come from a DataFrame which has results for multiple parts concatenated together. Without any additional information provided, it is not possible to know if these data are from one part or multiple parts.


fcp.plot(df1, x='Voltage', y='I [A]', legend='Temperature [C]',
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2')
_images/grouping_56_0.png

To handle cases like this, we can add the keyword groups and specify another DataFrame column name that tells fivecentplots how the data are separated within the table (in this case we use the column “Die” which has an unique x,y address for each part). Now we get distinct lines for each instance of the measurement data and we can clearly tell one part from another. Unlike with legend however we do not know which part is which.

Note

The groups keyword can also be combined with a legend from a different DataFrame column


fcp.plot(df1, x='Voltage', y='I [A]', groups='Die', legend='Temperature [C]',
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2')
_images/grouping_58_0.png

groups also supports multiple column names. Here we remove “Temperature [C]” from the legend and add it to the list of groups:


fcp.plot(df1, x='Voltage', y='I [A]', groups=['Die', 'Temperature [C]'],
         filter='Substrate=="Si" & Target Wavelength==450 & Boost Level==0.2')
_images/grouping_60_0.png

boxplots

Like x-y plots, the groups keyword is used to break the data set into subsets. However, for boxplots the group column names and values are actually displayed along the x-axis. This style mimics the incredibly useful variability gauge charts found in JMP:


fcp.boxplot(df2, y='Value', groups=['Batch', 'Sample'], legend='Region')
_images/grouping_63_0.png

row | col subplots

By unique values

For deeper analysis of our data, we can make a grid of subplots based on the unique values within DataFrame columns other than the x and y columns we are plotting. In this case, we remove the “Temperature [C]” and “Boost Level” columns from our filter and add them to the row and column kwargs, respectively. For our example data, there are three unique values of “Boost Level” and three unique values of “Temperature [C]” which leads to a 3x3 grid of subplots. Each row in the plot corresponds to one of the 3 unique values of “Temperature [C]”. Likewise, each column in the plot corresponds to one of the 3 unique values of “Boost Level”. The specific value in each row and column are shown in labels to the right and above of the subplots, respectively.


fcp.plot(df1, x='Voltage', y='I [A]', legend='Die', col='Boost Level', row='Temperature [C]',
         ax_size=[225, 225], filter='Substrate=="Si" & Target Wavelength==450', label_rc_font_size=14)
_images/grouping_67_0.png

By x or y

Alternatively, we can use row and column grids to compare different x or y values. For example, we can plot two different DataFrame columns on our y-axis (one per row) by adding the keyword row='y'. Each column in this plot shares the same x-axis variable:


fcp.plot(df1, x='Voltage', y=['Voltage', 'I [A]'], legend='Die', col='Boost Level', row='y', ax_size=[225, 225],
         filter='Substrate=="Si" & Target Wavelength==450 & Temperature [C]==75', label_rc_font_size=14)
_images/grouping_70_0.png

We can make a similar plot with common row values and different x-axes by setting col='x':


fcp.plot(df1, x=['Voltage', 'I [A]'], y='Voltage', legend='Die', row='Boost Level', col='x',
         ax_size=[225, 225], filter='Substrate=="Si" & Target Wavelength==450 & Temperature [C]==75', label_rc_font_size=14)
_images/grouping_72_0.png

wrap subplots

By unique values

The wrap keyword also allows us to create a grid of subplots. Unlike row and col plots, wrap plots simply “wrap” a series of data subsets, each with the unique values of one or more DataFrame, into a grid. There is no alignment of a common value across rows or column. Additionally, wrap plots remove the spacing and ticks between subplots by default for a tighter grid (this can be changed via keywords), axis sharing is forced (this cannot be changed), and the row|column value labels are condensed to a single label above each subplot:


fcp.plot(df1, x='Voltage', y='I [A]', legend='Die', wrap=['Temperature [C]', 'Boost Level'],
         ax_size=[225, 225], filter='Substrate=="Si" & Target Wavelength==450')
_images/grouping_76_0.png

By default, wrap subplots will be arranged in a the closest thing to a square grid as possible. If the number of subplots does not match a perfectly square grid, the last row will have one or more empty spaces in the grid. To override the default grid size, use the keyword ncol to sets a custom number of columns. The number of rows will be automatically determined based on the total number of subplots:


fcp.plot(df1, x='Voltage', y='I [A]', legend='Die', wrap=['Temperature [C]', 'Boost Level'],
         ax_size=[225, 225], filter='Substrate=="Si" & Target Wavelength==450', ncol=2)
_images/grouping_78_0.png

By x and y

We can also “wrap” by column names instead of unique column values. In this case we list the columns to plot in the x or y keywords as usual but we add wrap='x' or wrap='y' to the function call. This will create a unique subplot in a grid for each wrap value. Unlike the case above where we wrapped by unique values in a DataFrame column, when we wrap by “x” or “y” the plot will not have wrap labels and axis sharing can be overridden. As before, use ncol to override the default square grid.


fcp.plot(df1, x='Voltage', y=['I Set', 'I [A]'], legend='Die', wrap='y', groups=['Boost Level', 'Temperature [C]'],
         ax_size=[325, 325], filter='Substrate=="Si" & Target Wavelength==450')
_images/grouping_81_0.png

Next, we replot the above data in an alternative style, with tick and label sharing disabled:


fcp.plot(df1, x='Voltage', y=['I Set', 'I [A]'], legend='Die', wrap='y', groups=['Boost Level', 'Temperature [C]'],
         ax_size=[525, 170], filter='Substrate=="Si" & Target Wavelength==450', ncol=1, ws_row=0,
         separate_labels=False, separate_ticks=False)
_images/grouping_83_0.png

figure plots

To add another dimension of grouping, fivecentplots supports grouping by “figure”. This means a separate figure (i.e., a separate png) is created for each unique value in the DataFrame column(s) listed in kwarg fig_groups. Below, we will plot a unique figure for each value of the “Die” column. We also enable the save command and the debug kwarg print_filename to show that we actually are getting a separate image for each plot (note we also explicitly call the inline keyword as well just so we can render the plots in this tutorial notebook):


fcp.plot(df1, x='Voltage', y='I [A]', fig_groups='Die', wrap=['Temperature [C]', 'Boost Level'], save=True, inline=True,
         ax_size=[225, 225], filter='Substrate=="Si" & Target Wavelength==450', print_filename=True)
I [A] vs Voltage by Temperature [C] & Boost Level where Die=(-1,2).png
_images/grouping_86_1.png
I [A] vs Voltage by Temperature [C] & Boost Level where Die=(1,1).png
_images/grouping_86_3.png
I [A] vs Voltage by Temperature [C] & Boost Level where Die=(2,-1).png
_images/grouping_86_5.png