Stemgraphic Modules

stemgraphic

stem_graphic

Package implementing a complete toolkit for text and graphical stem-and-leaf plots, and for other visualizations adapted to stem-and-leaf pair values, such as heatmaps and sunburst charts.

It also handles very large data sets through scaling, sampling, trimming and other techniques.

See research paper ( http://artchiv.es/pydata2016/stemgraphic ) for more technical details.

A command-line utility is installed along with the package; it can process Excel or CSV files. See: stem -h

`aliases`

Handy aliases for stem_graphic options.

stemgraphic.aliases.stem_hist(x, aggregation=False, alpha=1, asc=True, column=None, color='b', delimiter_color='r', display=300, flip_axes=True, legend_pos='short', outliers=False, trim=False)¶

stem_hist builds a histogram matching the stem-and-leaf plot, with the numbers hidden, as shown on the cover of the companion brochure.

Parameters:

legend_pos –
x – list, numpy array, time series, pandas or dask dataframe
aggregation – Boolean for sum, else specify function
alpha – opacity of the bars, median and outliers, defaults to 1 (fully opaque) per the signature above
asc – stem sorted in ascending order, defaults to True
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
color – the bar facecolor
delimiter_color – color of the line between aggregate and stem and stem and leaf
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – X becomes Y and Y becomes X
outliers – this is NOP, for compatibility
trim – this is NOP, for compatibility

Returns:

matplotlib figure and axes instance

stemgraphic.aliases.stem_kde(x, **kw_args)¶

stem_kde builds a stem-and-leaf plot and adds an overlaid kde as a secondary plot.

Parameters:

x – list, numpy array, time series, pandas or dask dataframe
kw_args –

Returns:

matplotlib figure and axes instance

stemgraphic.aliases.stem_line(x, aggregation=False, alpha=0, asc=True, column=None, color='k', delimiter_color='r', display=300, flip_axes=True, outliers=False, secondary_plot=None, trim=False)¶

stem_line builds a stem-and-leaf plot with lines instead of bars.

Parameters:

x – list, numpy array, time series, pandas or dask dataframe
aggregation – Boolean for sum, else specify function
alpha – opacity of the bars, median and outliers, defaults to 0 per the signature above
asc – stem sorted in ascending order, defaults to True
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
color – the color of the line
delimiter_color – color of the line between aggregate and stem and stem and leaf
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – X becomes Y and Y becomes X
outliers –
secondary_plot – One or more of ‘dot’, ‘kde’, ‘margin_kde’, ‘rug’ in a comma delimited string or None
trim – this is NOP, for compatibility

Returns:

matplotlib figure and axes instance

`alpha`

stemgraphic.alpha.

BRAND NEW in V.0.5.0!

Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. alpha is a module of the stemgraphic package to add support for categorical and text variables.

The module also adds functionality to handle whole words, beside stem-and-leaf bigrams and n-grams.

For example, for the word “alabaster”:

With word_ functions, we can look at the word frequency in a text, or compare the word to other words in a corpus through a distance function (defaults to Levenshtein).

With stem_ functions, we can look at the fundamental stem-and-leaf, stem would be ‘a’ and leaf would be ‘l’, for a bigram ‘al’. With a stem_order of 1 and a leaf_order of 2, we would have ‘a’ and ‘la’, for a trigram ‘ala’, so on and so forth.
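The stem and leaf split described above can be illustrated with a few lines of plain Python. This is a minimal sketch, not the code stemgraphic uses internally; the helper name split_ngram is hypothetical, and it only models stem_order and leaf_order (no skipping, filtering or case handling).

```python
def split_ngram(word, stem_order=1, leaf_order=1):
    """Split a word into (stem, leaf) from its first stem_order + leaf_order characters."""
    stem = word[:stem_order]
    leaf = word[stem_order:stem_order + leaf_order]
    return stem, leaf


# Bigram (1 + 1) and trigram (1 + 2) views of the same word.
for stem_order, leaf_order in [(1, 1), (1, 2)]:
    stem, leaf = split_ngram("alabaster", stem_order, leaf_order)
    print(f"stem={stem!r} leaf={leaf!r} ngram={stem + leaf!r}")
```

With the defaults the word "alabaster" yields the bigram 'al'; raising leaf_order to 2 yields the trigram 'ala', matching the description above.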

stemgraphic.alpha.add_missing_letters(mat, stem_order, leaf_order, letters=None)¶

Add missing stems based on LETTERS, which defaults to the a-z alphabet.

Parameters:

mat – matrix to modify
stem_order – how many stem characters per data point to display, defaults to 1
leaf_order – how many leaf characters per data point to display, defaults to 1
letters – letters that must be present as stems

Returns:

the modified matrix

stemgraphic.alpha.heatmap(src, alpha_only=False, annotate=False, asFigure=False, ax=None, caps=False, compact=True, display=None, interactive=True, leaf_order=1, leaf_skip=0, random_state=None, stem_order=1, stem_skip=0, stop_words=None)¶

The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘a’ and Column ‘b’ would have the count of words starting with ‘ab’. The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.
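To make the row and column counts concrete, here is a small sketch that builds the cell counts behind such a heatmap with the standard library only. It is an illustration under simple assumptions (one stem character, one leaf character, case-insensitive), not the library's implementation; bigram_counts is a hypothetical helper name.

```python
from collections import Counter


def bigram_counts(words):
    """Count words by (first letter, second letter); shorter words are skipped."""
    counts = Counter()
    for word in words:
        word = word.lower()  # case-insensitive, similar in spirit to caps=False
        if len(word) >= 2:
            counts[(word[0], word[1])] += 1
    return counts


words = ["about", "able", "above", "acid", "bat", "Abacus"]
counts = bigram_counts(words)

# Row 'a', column 'b' holds the number of words starting with 'ab'.
print(counts[("a", "b")])
```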

Parameters:

src – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet
annotate – display annotations (Z) on heatmap
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
caps – bool, True to be case sensitive
compact – remove empty stems
display – maximum number of data points to display, forces sampling if smaller than len(df)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
random_state – initial random seed for the sampling process, for reproducible research
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

Returns:

stemgraphic.alpha.heatmap_grid(src1, src2, src3=None, src4=None, alpha_only=True, annot=False, caps=False, center=0, cmap=None, display=1000, leaf_order=1, leaf_skip=0, random_state=None, reverse=False, robust=False, stem_order=1, stem_skip=0, stop_words=None, threshold=0)¶

heatmap_grid.

With stem_graphic, it is possible to directly compare two different sources. In the case of a heatmap, two different data sets cannot be visualized directly on a single heatmap. For this task, we designed heatmap_grid to adapt to the number of sources to build a layout. It can take from 2 to 4 different sources.

With 2 sources, a square grid will be generated, allowing for horizontal and vertical comparisons, with an extra heatmap showing the difference between the two matrices. It also computes a norm for that difference matrix. The smaller the value, the closer the two heatmaps are.

With 3 sources, it builds a triangular grid, with each source heatmap in a corner and the difference between each pair in between.

Finally, with 4 sources, a 3 x 3 grid is built, each source in a corner and the difference between each pair in between, with the center expressing the difference between top left and bottom right diagonal.

Parameters:

src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (required)
src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (required)
src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional)
src4 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional)
alpha_only – only use stems from a-z alphabet
annot – display annotations (Z) on heatmap
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.
center – the center of the divergent color map for the difference heatmaps
cmap – color map for difference heatmap or None (default) to use the builtin red / blue divergent map
display – maximum number of data points to display, forces sampling if smaller than len(df)
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
robust – reduce effect of outliers on difference heatmap
random_state – initial random seed for the sampling process, for reproducible research
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
threshold – absolute value minimum count difference for a difference heatmap element to be visible

Returns:

stemgraphic.alpha.matrix_difference(mat1, mat2, thresh=0, ord=None)¶

Parameters:

mat1 – first heatmap dataframe
mat2 – second heatmap dataframe
thresh – absolute value minimum count difference for a difference heatmap element to be visible

Returns:

difference matrix, norm and ratio of the sum of the first matrix over the second
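The three returned values can be sketched in plain Python on nested lists. Several points here are assumptions rather than documented behavior: differences whose absolute value is below thresh are set to 0, the norm is the Frobenius norm (which is what numpy.linalg.norm computes for a matrix when ord is None), and the norm is taken after thresholding. The name matrix_difference_sketch is hypothetical.

```python
import math


def matrix_difference_sketch(mat1, mat2, thresh=0):
    """Return (difference matrix, Frobenius norm, sum(mat1) / sum(mat2))."""
    diff = []
    for row1, row2 in zip(mat1, mat2):
        diff_row = []
        for a, b in zip(row1, row2):
            d = a - b
            # Assumption: differences smaller than thresh are hidden as 0.
            diff_row.append(d if abs(d) >= thresh else 0)
        diff.append(diff_row)
    norm = math.sqrt(sum(d * d for row in diff for d in row))
    total2 = sum(sum(row) for row in mat2)
    ratio = sum(sum(row) for row in mat1) / total2 if total2 else float("nan")
    return diff, norm, ratio


mat1 = [[4, 1], [0, 2]]
mat2 = [[1, 1], [1, 0]]
diff, norm, ratio = matrix_difference_sketch(mat1, mat2, thresh=2)
print(diff, round(norm, 3), round(ratio, 3))
```

A smaller norm means the two count matrices are closer, which is how the value is described for heatmap_grid.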

stemgraphic.alpha.ngram_data(df, alpha_only=False, ascending=True, binary=False, break_on=None, caps=False, char_filter=None, column=None, compact=False, display=750, leaf_order=1, leaf_skip=0, persistence=None, random_state=None, remove_accents=False, reverse=False, rows_only=True, sort_by='len', stem_order=1, stem_skip=0, stop_words=None)¶

This is the main text ingestion function for stemgraphic.alpha. It is used by most of the visualizations. It can also be used directly, to feed a pipeline, for example.

If selected (rows_only=False), the returned dataframe includes in each row a single word, the stem, the leaf and the ngram (stem + leaf) - the index is the ‘token’ position in the original source:

      word    stem  leaf  ngram
12    salut   s     a     sa
13    chéri   c     h     ch

Parameters:

df – list, numpy array, series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – bool if the sort is ascending
binary – bool if True forces counts to 1 for anything greater than 0
break_on – letter on which to break a row, or None (default)
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
char_filter – list of characters to ignore. If None (default) CHAR_FILTER list will be used
column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.
compact – remove empty stems
display – maximum number of data points to display, forces sampling if smaller than len(df)
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
persistence – will save the sampled dataframe to filename (with csv or pkl extension) or None
random_state – initial random seed for the sampling process, for reproducible research
remove_accents – bool if True strips accents (NA on dataframe)
rows_only – bool by default returns only the stem and leaf rows. If false, also the matrix and dataframe
sort_by – default to ‘len’, can also be ‘alpha’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

Returns:

ordered rows if rows_only, else also returns the matrix and dataframe
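The per-word rows described for rows_only=False can be approximated with a short standard-library sketch. It assumes whitespace tokenization, one stem and one leaf character, and that removed stop words still consume a token position; none of these details are confirmed by the text above. The helpers strip_accents and ngram_rows are hypothetical names, not part of stemgraphic.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after Unicode decomposition (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngram_rows(text, stop_words=(), remove_accents=False, caps=False):
    """Map token position to a dict with word, stem, leaf and ngram."""
    rows = {}
    for position, word in enumerate(text.split()):
        if remove_accents:
            word = strip_accents(word)
        if not caps:
            word = word.lower()
        if word in stop_words or len(word) < 2:
            continue  # filtered tokens keep their position but produce no row
        rows[position] = {
            "word": word,
            "stem": word[0],
            "leaf": word[1],
            "ngram": word[:2],
        }
    return rows


rows = ngram_rows("Salut ma chéri", stop_words={"ma"}, remove_accents=True)
for position, row in rows.items():
    print(position, row)
```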

stemgraphic.alpha.plot_sunburst_level(normalized, ax, label=True, level=0, offset=0, ngram=False, plot=True, stem=None, vis=0)¶

utility function for sunburst function.

Parameters:

normalized –
ax –
label –
level –
ngram –
offset –
plot –
stem –
vis –

Returns:

stemgraphic.alpha.polar_word_plot(ax, word, words, label, min_dist, max_dist, metric, offset, step)¶

Utility function for radar plot.

Parameters:

ax – matplotlib ax
word – string, the reference word that will be placed in the middle
words – list of words to compare
label – bool if True display words centered at coordinate
min_dist – minimum distance based on metric to include a word for display
max_dist – maximum distance for a given section
metric – any metric function accepting two values and returning that metric in a range from 0 to x
offset – where to start plotting in degrees
step – how many degrees to step between plots

Returns:

stemgraphic.alpha.radar(word, comparisons, ascending=True, display=100, label=True, metric=None, min_distance=1, max_distance=None, random_state=None, sort_by='alpha')¶

The radar plot compares a reference word with a corpus. By default, it calculates the Levenshtein distance between the reference word and each word in the corpus. An alternate distance or metric function can be provided. Each word is then plotted around the center based on 3 criteria.

If the word length is longer, it is plotted on the left side, else on the right side.
Distance from center is based on the distance function.
The words are equidistant, and their order is defined alphabetically or by count (only applicable if the corpus is a text and not a list of unique words, such as a password dictionary).
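The three criteria above can be sketched without any plotting. The code below implements the classic dynamic-programming Levenshtein distance and derives, for each comparison word, its distance and side. It assumes that "longer" means the comparison word is longer than the reference word and that words are ordered alphabetically; radar_layout is a hypothetical helper, not the library's function.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def radar_layout(word, comparisons, min_distance=1, max_distance=None):
    """Return (candidate, distance, side) tuples in alphabetical order."""
    placed = []
    for candidate in sorted(set(comparisons)):
        dist = levenshtein(word, candidate)
        if dist < min_distance:
            continue
        if max_distance is not None and dist > max_distance:
            continue
        # Assumption: longer comparison words go left, all others go right.
        side = "left" if len(candidate) > len(word) else "right"
        placed.append((candidate, dist, side))
    return placed


layout = radar_layout("stem", ["stem", "step", "steam"])
print(layout)
```

With min_distance=1 the reference word itself (distance 0) is left out, which is consistent with the default shown in the signature.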

Stem-and-leaf support is upcoming.

Parameters:

word – string, the reference word that will be placed in the middle
comparisons – external file, list or string or dataframe of words
ascending – bool if the sort is ascending
display – maximum number of data points to display, forces sampling if smaller than len(df)
label – bool if True display words centered at coordinate
metric – Levenshtein (default), or any metric function accepting two values and returning that metric
min_distance – minimum distance based on metric to include a word for display
max_distance – maximum distance based on metric to include a word for display
random_state – initial random seed for the sampling process, for reproducible research
sort_by – default to ‘alpha’, can also be ‘len’

Returns:

stemgraphic.alpha.scatter(src1, src2, src3=None, alpha=0.5, alpha_only=True, ascending=True, asFigure=False, ax=None, caps=False, compact=True, display=None, fig_xy=None, interactive=True, jitter=False, label=False, leaf_order=1, leaf_skip=0, log_scale=True, normalize=None, percentage=None, project=False, project_only=False, random_state=None, sort_by='alpha', stem_order=1, stem_skip=0, stop_words=None, whole=False)¶

With 2 sources:

Scatter compares the word frequency of two sources, on each axis. Each data point Z value is the word or stem-and-leaf value, while the X axis reflects that word/ngram count in one source and the Y axis reflects the same word/ngram count in the other source, in two different colors. If one word/ngram is more common in the first source it will be displayed in one color, and if it is more common in the second source, it will be displayed in a different color. The values that are the same for both sources will be displayed in a third color (default colors are blue, black and pink).

With 3 sources:

The scatter will compare in 3d the word frequency of three sources, on each axis. Each data point hover value is the word or stem-and-leaf value, while the X axis reflects that word/ngram count in the 1st source, the Y axis reflects the same word/ngram count in the 2nd source, and the Z axis reflects it in the 3rd source, each in a different color. If one word/ngram is more common in the 1st source it will be displayed in one color, if it is more common in the 2nd source in a second color, and if it is more common in the 3rd source in a third color. The values that are the same for all sources will be displayed in a 4th color (default colors are blue, black, purple and pink).

In interactive mode, hovering the data point will give the precise counts on each axis along with the word itself, and filtering by category is done by clicking on the category in the legend. Double clicking a category will show only that category.
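For the two-source case, the data behind each point (the word, its count in each source and its color category) can be sketched with collections.Counter. This is only an illustration of the comparison logic described above, with hypothetical names (compare_frequencies and the category labels); it does no plotting, sampling or normalization.

```python
from collections import Counter


def compare_frequencies(words1, words2):
    """Return (word, count in source 1, count in source 2, category) tuples."""
    c1, c2 = Counter(words1), Counter(words2)
    rows = []
    for word in sorted(set(c1) | set(c2)):
        x, y = c1[word], c2[word]  # a Counter returns 0 for missing words
        if x > y:
            category = "more in src1"
        elif y > x:
            category = "more in src2"
        else:
            category = "same"
        rows.append((word, x, y, category))
    return rows


src1 = "the cat sat on the mat".split()
src2 = "the dog sat".split()
for row in compare_frequencies(src1, src2):
    print(row)
```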

Parameters:

src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe, optional
alpha – opacity of the dots, defaults to 50%
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – word/stem count sorted in ascending order, defaults to True
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
fig_xy – tuple for matplotlib figsize, defaults to (20,20)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
jitter – random noise added to help see multiple data points sharing the same coordinate
label – bool if True display words centered at coordinate
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
log_scale – bool if True (default) uses log scale axes (NA in 3d due to open issues with mpl, cufflinks)
normalize – bool if True normalize frequencies in src2 and src3 relative to src1 length
percentage – coordinates in percentage of maximum word/ngram count (in non interactive mode)
project – project src1/src2 and src1/src3 comparisons on X=0 and Z=0 planes
project_only – only show the projection (NA if project is False)
random_state – initial random seed for the sampling process, for reproducible research
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
whole – for normalized or percentage, use whole integer values (round)

Returns:

matplotlib ax, dataframe with categories

stemgraphic.alpha.stem_freq_plot(df, alpha_only=False, asFigure=False, column=None, compact=True, caps=False, display=2600, interactive=True, kind='barh', leaf_order=1, leaf_skip=0, random_state=None, stem_order=1, stem_skip=0, stop_words=None)¶

Word frequency plot is the most common visualization in NLP. In this version it supports stem-and-leaf / n-grams.

Each row is the stem, and similar leaves are grouped together and each different group is stacked in bar charts.

Default is horizontal bar chart, but vertical, histograms, area charts and even pie charts are supported by this one visualization.

Parameters:

df – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
asFigure – return plot as plotly figure (for web applications)
column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.
compact – do not display empty stem rows (with no leaves), defaults to True per the signature above
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
display – maximum number of data points to display, forces sampling if smaller than len(df)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
kind – defaults to ‘barh’. One of ‘bar’,’barh’,’area’,’hist’. Non-interactive also supports ‘pie’
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
random_state – initial random seed for the sampling process, for reproducible research
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

Returns:

stemgraphic.alpha.stem_graphic(df, df2=None, aggregation=True, alpha=0.1, alpha_only=True, ascending=False, ax=None, ax2=None, bar_color='C0', bar_outline=None, break_on=None, caps=True, column=None, combined=None, compact=False, delimiter_color='C3', display=750, figure_only=True, flip_axes=False, font_kw=None, leaf_color='k', leaf_order=1, leaf_skip=0, legend_pos='best', median_color='C4', mirror=False, persistence=None, primary_kw=None, random_state=None, remove_accents=False, reverse=False, secondary=False, show_stem=True, sort_by='len', stop_words=None, stem_order=1, stem_skip=0, title=None, trim_blank=False, underline_color=None)¶

The principal visualization of stemgraphic.alpha is stem_graphic. It offers all the options of stem_text (3.1) and adds automatic title, mirroring, flipping of axes, export (to pdf, svg, png, through fig.savefig) and many more options to change the visual appearance of the plot (font size, color, background color, underlining and more).

By providing a secondary text source, the plot will enable comparison through a back-to-back display.

Parameters:

df – string, filename, url, list, numpy array, time series, pandas or dask dataframe
df2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional), for back-to-back stem-and-leaf plots
aggregation – Boolean for sum, else specify function
alpha – opacity of the bars, median and outliers, defaults to 10%
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to False per the signature above
ax – matplotlib axes instance, usually from a figure or other plot
ax2 – matplotlib axes instance, usually from a figure or other plot for back to back
bar_color – the fill color of the bar representing the leaves
bar_outline – the outline color of the bar representing the leaves
break_on – force a break of the leaves at that letter, the rest of the leaves will appear on the next line
caps – bool, True to be case sensitive (the default per the signature above); False is recommended for comparisons. (NA on dataframe)
column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.
combined – list (specific subset to automatically include, say, for comparisons), or None
compact – do not display empty stem rows (with no leaves), defaults to False
delimiter_color – color of the line between aggregate and stem and stem and leaf
display – maximum number of data points to display, forces sampling if smaller than len(df)
figure_only – bool if True (default) returns matplotlib (fig,ax), False returns (fig,ax,df)
flip_axes – X becomes Y and Y becomes X
font_kw – keyword dictionary, font parameters
leaf_color – font color of the leaves
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
median_color – color of the box representing the median
mirror – mirror the plot in the axis of the delimiters
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
primary_kw – stem-and-leaf plot additional arguments
random_state – initial random seed for the sampling process, for reproducible research
remove_accents – bool if True strips accents (NA on dataframe)
reverse – bool if True look at words from right to left
secondary – bool if True, this is a secondary plot - mostly used for back-to-back plots
show_stem – bool if True (default) displays the stems
sort_by – default to ‘len’, can also be ‘alpha’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
title – string, or None. When None and source is a file, filename will be used.
trim_blank – remove the blank between the delimiter and the first leaf, defaults to False per the signature above
underline_color – color of the horizontal line under the leaves, None for no display

Returns:

matplotlib figure and axes instance, and dataframe if figure_only is False
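The interaction between display and random_state documented above (sampling is forced when display is smaller than the data, and a seed makes it reproducible) can be sketched with the standard library. This is an assumption-laden illustration, not the sampling code stemgraphic runs; sample_for_display is a hypothetical helper.

```python
import random


def sample_for_display(values, display=750, random_state=None):
    """Return all values if they fit, else a seeded random sample of size display."""
    values = list(values)
    if display is None or display >= len(values):
        return values
    rng = random.Random(random_state)  # a fixed seed gives a reproducible sample
    return rng.sample(values, display)


data = list(range(1000))
first = sample_for_display(data, display=300, random_state=42)
second = sample_for_display(data, display=300, random_state=42)
print(len(first), first == second)
```

Passing the same random_state twice returns the same subset, which is the property the parameter description calls "reproducible research".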

stemgraphic.alpha.stem_scatter(src1, src2, src3=None, alpha=0.5, alpha_only=True, ascending=True, asFigure=False, ax=None, caps=False, compact=True, display=None, fig_xy=None, interactive=True, jitter=False, label=False, leaf_order=1, leaf_skip=0, log_scale=True, normalize=None, percentage=None, project=False, project_only=False, random_state=None, sort_by='alpha', stem_order=1, stem_skip=0, stop_words=None, whole=False)¶

stem_scatter compares the word frequency of two sources, on each axis. Each data point Z value is the word or stem-and-leaf value, while the X axis reflects that word/ngram count in one source and the Y axis reflects the same word/ngram count in the other source, in two different colors. If one word/ngram is more common in the first source it will be displayed in one color, and if it is more common in the second source, it will be displayed in a different color. The values that are the same for both sources will be displayed in a third color (default colors are blue, black and pink). In interactive mode, hovering the data point will give the precise counts on each axis along with the word itself, and filtering by category is done by clicking on the category in the legend.

Parameters:

src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe, optional
alpha – opacity of the dots, defaults to 50%
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to True
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
fig_xy – tuple for matplotlib figsize, defaults to (20,20)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
jitter – random noise added to help see multiple data points sharing the same coordinate
label – bool if True display words centered at coordinate
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
log_scale – bool if True (default) uses log scale axes (NA in 3d due to open issues with mpl, cufflinks)
normalize – bool if True normalize frequencies in src2 and src3 relative to src1 length
percentage – coordinates in percentage of maximum word/ngram count
random_state – initial random seed for the sampling process, for reproducible research
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
whole – for normalized or percentage, use whole integer values (round)

Returns:

matplotlib polar ax, dataframe

stemgraphic.alpha.stem_sunburst(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=1, leaf_skip=0, median=True, ngram=False, random_state=None, sort_by='alpha', statistics=True, stem_order=1, stem_skip=0, stop_words=None, top=0)¶

Stem-and-leaf based sunburst. See sunburst for details.

Parameters:

words – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to False per the signature above
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True per the signature above
display – maximum number of data points to display, forces sampling if smaller than len(df)
hole – bool if True (default) leave space in middle for statistics
label – bool if True display words centered at coordinate
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
median – bool if True (default) display an origin and a median mark
ngram – bool if True display full n-gram as leaf label
random_state – initial random seed for the sampling process, for reproducible research
sort_by – sort by ‘alpha’ (default) or ‘count’
statistics – bool if True (default) displays statistics in center - hole has to be True
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
top – how many different words to count by order frequency. If negative, this will be the least frequent

Returns:

stemgraphic.alpha.stem_text(df, aggr=False, alpha_only=True, ascending=True, binary=False, break_on=None, caps=True, column=None, compact=False, display=750, legend_pos='top', leaf_order=1, leaf_skip=0, persistence=None, remove_accents=False, reverse=False, rows_only=False, sort_by='len', stem_order=1, stem_skip=0, stop_words=None, random_state=None)¶

Tukey’s original stem-and-leaf plot was text, with a vertical delimiter to separate stem from leaves. Just as stemgraphic implements a text version of the plot for numbers, stemgraphic.alpha implements a text version for words. This type of plot serves a similar purpose as a stacked bar chart with each data point annotated.

It also displays some basic statistics on the whole text (or subset if using column).
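A text stem-and-leaf display for words can be sketched in a few lines. The layout below (stem, a vertical bar, then sorted leaves, with an optional aggregated count in front) is an assumption about the general shape of such a plot, not a reproduction of stem_text output; stem_text_sketch is a hypothetical helper limited to one stem and one leaf character.

```python
from collections import defaultdict


def stem_text_sketch(words, aggr=False):
    """Return text rows of the form 'stem | leaf leaf ...', one per stem."""
    rows = defaultdict(list)
    for word in words:
        word = word.lower()
        if len(word) >= 2:
            rows[word[0]].append(word[1])
    lines = []
    for stem in sorted(rows):
        leaves = " ".join(sorted(rows[stem]))
        # Optional count of leaves per row, similar in spirit to aggr=True.
        prefix = f"{len(rows[stem]):>3} " if aggr else ""
        lines.append(f"{prefix}{stem} | {leaves}")
    return lines


for line in stem_text_sketch(["apple", "ant", "bee", "axe"], aggr=True):
    print(line)
```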

Parameters:

df – list, numpy array, time series, pandas or dask dataframe
aggr – bool if True display the aggregated count of leaves by row
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – bool if the sort is ascending
binary – bool if True forces counts to 1 for anything greater than 0
break_on – force a break of the leaves at that letter, the rest of the leaves will appear on the next line
caps – bool, True to be case sensitive (the default per the signature above); False is recommended for comparisons. (NA on dataframe)
column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
legend_pos – where to put the legend: ‘top’ (default), ‘bottom’ or None
persistence – will save the sampled dataframe to filename (with csv or pkl extension) or None
random_state – initial random seed for the sampling process, for reproducible research
remove_accents – bool if True strips accents (NA on dataframe)
reverse – bool if True look at words from right to left
rows_only – by default returns only the stem and leaf rows. If false, also return the matrix and dataframe
sort_by – default to ‘len’, can also be ‘alpha’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
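As a rough illustration of how stem_order, leaf_order, stem_skip and leaf_skip shape the rows, the sketch below groups words by their leading characters. It is not stemgraphic's implementation: `word_stem_leaf` is a hypothetical helper, the skip semantics are assumed from the parameter descriptions above, and sorting options, sampling and statistics are left out.

```python
from collections import defaultdict


def word_stem_leaf(words, stem_order=1, leaf_order=1, stem_skip=0, leaf_skip=0):
    """Group words into {stem: [leaves]} (illustrative helper, not the library API)."""
    rows = defaultdict(list)
    for word in words:
        word = word.lower()
        stem_start = stem_skip
        stem_end = stem_start + stem_order
        leaf_start = stem_end + leaf_skip  # assumed: leaf_skip counts from the end of the stem
        leaf_end = leaf_start + leaf_order
        stem = word[stem_start:stem_end]
        leaf = word[leaf_start:leaf_end]
        if stem and leaf:  # skip words too short to have both parts
            rows[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(rows.items())}


rows = word_stem_leaf(["wolf", "word", "wood", "apple", "ant"])
for stem, leaves in rows.items():
    print(f"{stem} | {' '.join(leaves)}")
```

With leaf_skip=1, the words 'wolf', 'word' and 'wood' stop sharing the leaf 'o' and are told apart by their third letter instead.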

stemgraphic.alpha.sunburst(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=1, leaf_skip=0, median=True, ngram=True, random_state=None, sort_by='alpha', statistics=True, stem_order=1, stem_skip=0, stop_words=None, top=40)¶

Word sunburst charts are similar to pie or donut charts, but add some statistics in the middle of the chart, including the percentage of total words targeted for a given number of unique words (e.g. top 50 words, 48% coverage).

With stem-and-leaf, the first level of the sunburst represents the stem and the second level subdivides each stem by leaves.

Parameters:

words – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to False
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
hole – bool if True (default) leave space in middle for statistics
label – bool if True display words centered at coordinate
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
median – bool if True (default) display an origin and a median mark
ngram – bool if True (default) display full n-gram as leaf label
random_state – initial random seed for the sampling process, for reproducible research
statistics – bool if True (default) displays statistics in center - hole has to be True
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
top – how many different words to count by order frequency. If negative, this will be the least frequent

Returns:

matplotlib polar ax, dataframe
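The coverage statistic shown in the center of the chart can be approximated as follows. This is a minimal sketch assuming coverage means the share of all tokens accounted for by the `top` most frequent words; `top_coverage` is a hypothetical name, not a stemgraphic function.

```python
from collections import Counter


def top_coverage(words, top=3):
    """Return the `top` most common words and their share (in %) of all tokens."""
    counts = Counter(words)
    most_common = counts.most_common(top)
    covered = sum(count for _, count in most_common)
    return most_common, 100.0 * covered / len(words)


tokens = "the cat and the dog and the bird saw a fish".split()
words_kept, coverage = top_coverage(tokens, top=2)
print(words_kept, f"{coverage:.0f}% coverage")
```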

stemgraphic.alpha.word_freq_plot(src, alpha_only=False, ascending=False, asFigure=False, caps=False, display=None, interactive=True, kind='barh', random_state=None, sort_by='count', stop_words=None, top=100)¶

Word frequency bar chart.

This function creates a classical word frequency bar chart.

Parameters:

src – Either a filename including path, a url or a ready to process text in a dataframe or a tokenized format.
alpha_only – words only if True, words and numbers if False
ascending – sorted in ascending order, defaults to False
asFigure – if interactive, the function will return a plotly figure instead of a matplotlib ax
caps – keep capitalization (True, False)
display – if specified, sample that quantity of words
interactive – interactive graphic (True, False)
kind – horizontal bar chart (barh) - also ‘bar’, ‘area’, ‘hist’ and non interactive ‘kde’ and ‘pie’
random_state – initial random seed for the sampling process, for reproducible research
sort_by – default to ‘count’, can also be ‘alpha’
stop_words – a list of words to ignore
top – how many different words to count by order frequency. If negative, this will be the least frequent

Returns:

text as dataframe and plotly figure or matplotlib ax
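The counting step behind such a chart can be sketched in a few lines. The example assumes that stop words are removed before counting and that a negative top selects from the least frequent end, as the parameter description suggests; `word_counts` is a hypothetical helper and does no plotting.

```python
from collections import Counter


def word_counts(tokens, stop_words=None, top=100):
    """Count tokens, drop stop words, keep the `top` most (or, if negative, least) frequent."""
    stop = set(stop_words or [])
    counts = Counter(t.lower() for t in tokens if t.lower() not in stop)
    # Sort by descending count, then alphabetically to make ties deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if top < 0:
        return ranked[top:]  # tail of the ranking: least frequent words
    return ranked[:top]


sample = "the cat and the dog and the bird".split()
print(word_counts(sample, stop_words=["the", "and"], top=2))
```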

stemgraphic.alpha.word_radar(word, comparisons, ascending=True, display=100, label=True, metric=None, min_distance=1, max_distance=None, random_state=None, sort_by='alpha')¶

Radar plot based on words. Currently the only type of radar plot supported. See radar for more detail.

Parameters:

word – string, the reference word that will be placed in the middle
comparisons – external file, list or string or dataframe of words
ascending – bool if the sort is ascending
display – maximum number of data points to display, forces sampling if smaller than len(df)
label – bool if True display words centered at coordinate
metric – any metric function accepting two values and returning that metric in a range from 0 to x
min_distance – minimum distance based on metric to include a word for display
max_distance – maximum distance based on metric to include a word for display
random_state – initial random seed for the sampling process, for reproducible research
sort_by – default to ‘alpha’, can also be ‘len’
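To make the metric, min_distance and max_distance parameters concrete, here is a sketch of a metric function and of the filtering it drives. Levenshtein edit distance is used purely as an example of "a metric function accepting two values"; the document does not state which metric the library uses by default, and `within_distance` is a hypothetical helper.

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def within_distance(word, comparisons, metric, min_distance=1, max_distance=None):
    """Keep (other, distance) pairs whose distance to `word` is inside the bounds."""
    kept = []
    for other in comparisons:
        distance = metric(word, other)
        if distance >= min_distance and (max_distance is None or distance <= max_distance):
            kept.append((other, distance))
    return kept


print(within_distance("word", ["word", "ward", "world", "banana"], levenshtein, max_distance=2))
```

With min_distance=1 the reference word itself (distance 0) is excluded, which matches the idea of placing it in the middle of the radar.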

Returns:

stemgraphic.alpha.word_scatter(src1, src2, src3=None, alpha=0.5, alpha_only=True, ascending=True, asFigure=False, ax=None, caps=False, compact=True, display=None, fig_xy=None, interactive=True, jitter=False, label=False, leaf_order=None, leaf_skip=0, log_scale=True, normalize=None, percentage=None, random_state=None, sort_by='alpha', stem_order=None, stem_skip=0, stop_words=None, whole=False)¶

Scatter compares the word frequency of two sources, one on each axis. Each data point's Z value is the word or stem-and-leaf value, while the X axis reflects that word's count in one source and the Y axis reflects the same word's count in the other source. If a word is more common in the first source it is displayed in one color, and if it is more common in the second source it is displayed in a different color. Words with the same count in both sources are displayed in a third color (default colors are blue, black and pink). In interactive mode, hovering over a data point gives the precise counts on each axis along with the word itself, and filtering by category is done by clicking on the category in the legend.

Parameters:

src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe, optional
alpha – opacity of the bars, median and outliers, defaults to 50%
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to True
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
fig_xy – tuple for matplotlib figsize, defaults to (20,20)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
jitter – random noise added to help see multiple data points sharing the same coordinate
label – bool if True display words centered at coordinate
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
log_scale – bool if True (default) uses log scale axes
random_state – initial random seed for the sampling process, for reproducible research
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
whole – for normalized or percentage, use whole integer values (round)

Returns:

matplotlib polar ax, dataframe
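The comparison that word_scatter visualizes can be sketched without any plotting. The snippet assumes each word becomes a point (count in source 1, count in source 2) tagged with one of three categories; `compare_sources` and the category labels are hypothetical.

```python
from collections import Counter


def compare_sources(tokens1, tokens2):
    """Map each word to (x, y, category), where x and y are its counts in each source."""
    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    points = {}
    for word in sorted(set(counts1) | set(counts2)):
        x, y = counts1[word], counts2[word]  # a Counter returns 0 for missing words
        if x > y:
            category = "more in source 1"
        elif y > x:
            category = "more in source 2"
        else:
            category = "same"
        points[word] = (x, y, category)
    return points


points = compare_sources("a a b c".split(), "a b b c".split())
for word, (x, y, category) in points.items():
    print(word, x, y, category)
```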

stemgraphic.alpha.word_sunburst(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=None, leaf_skip=0, median=True, ngram=True, random_state=None, sort_by='alpha', statistics=True, stem_order=None, stem_skip=0, stop_words=None, top=40)¶

Word-based sunburst. See sunburst for details.

Parameters:

words – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to False
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
hole – bool if True (default) leave space in middle for statistics
label – bool if True display words centered at coordinate
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
median – bool if True (default) display an origin and a median mark
ngram – bool if True (default) display full n-gram as leaf label
random_state – initial random seed for the sampling process, for reproducible research
statistics – bool if True (default) displays statistics in center - hole has to be True
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
top – how many different words to count by order frequency. If negative, this will be the least frequent

Returns:

`graphic`¶

Stemgraphic.graphic

Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. Stemgraphic.graphic is a module implementing a graphical stem-and-leaf plot function and a stem-and-leaf heatmap plot function for numerical data.

stemgraphic.graphic.density_plot(df, var=None, ax=None, bins=None, box=None, density=True, density_fill=True, display=1000, fig_only=True, fit=None, hist=None, hues=None, hue_labels=None, jitter=None, kind=None, leaf_order=1, legend=True, limit_var=False, norm_hist=None, random_state=None, rug=None, scale=None, singular=True, strip=None, swarm=None, title=None, violin=None, x_min=0, x_max=None, y_axis_label=True)¶

density_plot.

Various density and distribution plots conveniently packaged into one function. Density plot normally forces tails at each end which might go beyond the data. To force min/max to be driven by the data, use limit_var. To specify min and max use x_min and x_max instead. Nota Bene: defaults to _decimation_ and _quantization_ mode.

See density_plot notebook for examples of the different combinations of plots.

Why this instead of seaborn:

Stem-and-leaf plots naturally quantize data. The amount of loss is based on scale and leaf_order and on the data itself. This function, which wraps several seaborn distribution plots, was added in order to compare various measures of density and distributions based on various levels of decimation (sampling, set through display) and of quantization (set through scale and leaf_order). Also, there is no option in seaborn to fill the area under the curve…

Parameters:

df – list, numpy array, time series, pandas or dask dataframe
var – variable to plot, required if df is a dataframe
ax – matplotlib axes instance, usually from a figure or other plot
bins – Specification of hist bins, or None to use Freedman-Diaconis rule
box – bool, if True plots a box plot. Similar to using violin, use one or the other
density – bool, if True (default) plots a density plot
density_fill – bool, if True (default) fill the area under the density curve
display – maximum number of rows to use (1000 default) for calculations, forces sampling if < len(df)
fig_only – bool, if True (default) returns fig, ax, else returns fig, ax, max_peak, true_min, true_max
fit – object with fit method, returning a tuple that can be passed to a pdf method
hist – bool, if True plot a histogram
hues – optional, a categorical variable for multiple plots
hue_labels – optional, if using a column that is an object and/or categorical needing translation
jitter – for strip plots only, add jitter. strip + jitter is similar to using swarm, use one or the other
leaf_order – the order of magnitude of the leaf. The higher the order, the less quantization.
legend – bool, if True plots a legend
limit_var – use min / max from the data, not density plot
norm_hist – bool, if True histogram will be normed
random_state – initial random seed for the sampling process, for reproducible research
rug – bool, if True plot a rug plot
scale – force a specific scale for building the plot. Defaults to None (automatic).
singular – force display of a density plot using a singular value, by simulating values on each side
strip – bool, if True displays a strip plot
swarm – swarm plot, similar to strip plot. use one or the other
title – if present, adds a title to the plot
violin – bool, if True plots a violin plot. Similar to using box, use one or the other
x_min – force X axis minimum value. See also limit_var
x_max – force X axis maximum value. See also limit_var
y_axis_label – bool, if True displays y axis ticks and label

Returns:

see fig_only

stemgraphic.graphic.heatmap(df, annotate=False, asFigure=False, ax=None, column=None, compact=False, display=900, interactive=True, leaf_order=1, persistence=None, random_state=None, scale=None, trim=False, trim_blank=True, unit='', zoom=None)¶

The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘42’ and Column ‘7’ would have the count of numbers starting with ‘427’ of the given scale.

The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.

Parameters:

df – list, numpy array, time series, pandas or dask dataframe
annotate – display annotations (Z) on heatmap
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
leaf_order – how many leaf digits per data point to display, defaults to 1
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
trim_blank – remove the blank between the delimiter and the first leaf, defaults to True
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns:

count matrix, scale and matplotlib ax or figure if interactive and asFigure are True
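The count matrix can be pictured as a tally of (stem, leaf) pairs. The sketch below assumes non-negative values and that scale is the value of one stem unit (so with scale=10, 427 has stem 42 and leaf 7); `stem_leaf_counts` is a hypothetical helper, not the library's code.

```python
from collections import Counter


def stem_leaf_counts(values, scale=10):
    """Count non-negative values by (stem, leaf) pair, truncating below the leaf digit."""
    counts = Counter()
    for value in values:
        digits = int(value * 10 / scale)  # stem digits followed by one leaf digit
        stem, leaf = divmod(digits, 10)
        counts[(stem, leaf)] += 1
    return counts


counts = stem_leaf_counts([427, 427.9, 428, 431])
print(counts[(42, 7)])  # cell at row '42', column '7'
```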

stemgraphic.graphic.stem_graphic(df, df2=None, aggregation=True, alpha=0.1, asc=True, ax=None, ax2=None, bar_color='C0', bar_outline=None, break_on=None, column=None, combined=None, compact=False, delimiter_color='C3', display=900, figure_only=True, flip_axes=False, font_kw=None, leaf_color='k', leaf_order=1, legend_pos='best', median_alpha=0.25, median_color='C4', mirror=False, outliers=None, outliers_color='C3', persistence=None, primary_kw=None, random_state=None, scale=None, secondary=False, secondary_kw=None, secondary_plot=None, show_stem=True, title=None, trim=False, trim_blank=True, underline_color=None, unit='', zoom=None)¶

A graphical stem and leaf plot. stem_graphic provides horizontal, vertical or mirrored layouts, sorted in ascending or descending order, with sane default settings for the visuals, legend, median and outliers.

Parameters:

df – list, numpy array, time series, pandas or dask dataframe
df2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional), for back-to-back stem-and-leaf plots
aggregation – Boolean for sum, else specify function
alpha – opacity of the bars, median and outliers, defaults to 10%
asc – stem sorted in ascending order, defaults to True
ax – matplotlib axes instance, usually from a figure or other plot
ax2 – matplotlib axes instance, usually from a figure or other plot for back to back
bar_color – the fill color of the bar representing the leaves
bar_outline – the outline color of the bar representing the leaves
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
combined – list (specific subset to automatically include, say, for comparisons), or None
compact – do not display empty stem rows (with no leaves), defaults to False
delimiter_color – color of the line between aggregate and stem and stem and leaf
display – maximum number of data points to display, forces sampling if smaller than len(df)
figure_only – bool if True (default) returns matplotlib (fig,ax), False returns (fig,ax,df)
flip_axes – X becomes Y and Y becomes X
font_kw – keyword dictionary, font parameters
leaf_color – font color of the leaves
leaf_order – how many leaf digits per data point to display, defaults to 1
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
median_alpha – opacity of median and outliers, defaults to 25%
median_color – color of the box representing the median
mirror – mirror the plot in the axis of the delimiters
outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto
outliers_color – background color for the outlier boxes
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
primary_kw – stem-and-leaf plot additional arguments
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
secondary – bool if True, this is a secondary plot - mostly used for back-to-back plots
secondary_kw – any matplotlib keyword supported by .plot(), for the secondary plot
secondary_plot – One or more of ‘dot’, ‘kde’, ‘margin_kde’, ‘rug’ in a comma delimited string or None
show_stem – bool if True (default) displays the stems
title – string to display as title
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
trim_blank – remove the blank between the delimiter and the first leaf, defaults to True
underline_color – color of the horizontal line under the leaves, None for no display
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns:

matplotlib figure and axes instance

`helpers`¶

helpers.py

Helper functions for stemgraphic.

stemgraphic.helpers.APOSTROPHE = '’'¶: Typographical apostrophe - ex: I’m, l’arbre

stemgraphic.helpers.CHAR_FILTER = ['\t', '\n', '\\', '/', '`', '*', '_', '{', '}', '[', ']', '(', ')', '<', '>', '#', '=', '+', '- ', '–', '.', ';', ':', '!', '?', '|', '$', "'", '"', '…']¶: Characters to filter. Does a relatively good job on a majority of texts. ‘- ’ and ‘–’ are there to skip quotes in many plays and dialogues in books, especially French ones.

stemgraphic.helpers.DOUBLE_QUOTE = '"'¶: Double straight quote mark

stemgraphic.helpers.EMPTY = b' '¶: empty

stemgraphic.helpers.LETTERS = 'abcdefghijklmnopqrstuvwxyz'¶: Default definition of standard letters. remove_accent has to be called explicitly for any of these letters to match their accented counterparts.

stemgraphic.helpers.NON_ALPHA = ['-', '+', '/', '[', ']', '_', '£', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', ';', "'", '"', '’', b' ', b'\xd6\xb1', '?', '¡', '¿', '«', '»', '“', '”', '-', '—']¶: List of non alpha characters. Temporary - I want to balance flexibility with convenience, but still looking at options.

stemgraphic.helpers.NO_PERIOD_FILTER = ['\t', '\n', '\\', '/', '`', '*', '_', '{', '}', '[', ']', '(', ')', '<', '>', '#', '=', '+', '- ', '–', ';', ':', '!', '?', '|', '$', "'", '"']¶: Similar purpose to CHAR_FILTER, but keeps the period. The last word of each sentence will end with a ‘.’. Useful for manipulating the dataframe returned by the various visualizations and ngram_data, to break down frequencies by sentence instead of the full text or list.
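The difference between the two filter lists can be seen with a small tokenizer. This is a sketch under stated assumptions: filtered characters are replaced by spaces before splitting, only a short subset of the documented list is used, and `tokenize` is a hypothetical helper rather than the library's tokenizer.

```python
# A short subset of the documented CHAR_FILTER list, copied here for illustration.
CHAR_FILTER_SUBSET = ["\t", "\n", "(", ")", ".", ";", ":", "!", "?", '"']
# Same subset without the period, mimicking NO_PERIOD_FILTER.
NO_PERIOD_SUBSET = [char for char in CHAR_FILTER_SUBSET if char != "."]


def tokenize(text, char_filter, caps=False):
    """Replace filtered characters with spaces, then split on whitespace."""
    for char in char_filter:
        text = text.replace(char, " ")
    if not caps:
        text = text.lower()
    return text.split()


text = "Hello world. Is it (really) you?"
print(tokenize(text, CHAR_FILTER_SUBSET))
print(tokenize(text, NO_PERIOD_SUBSET))
```

With the period kept, the token 'world.' marks the end of the first sentence, which is what makes per-sentence breakdowns possible.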

stemgraphic.helpers.OVER = b'\xd6\xb1'¶: for typesetting overlap

stemgraphic.helpers.QUOTE = "'"¶: Straight quote mark - ex: ‘INCONCEIVABLE’

stemgraphic.helpers.key_calc(stem, leaf, scale)¶

Calculates a value from a stem, a leaf and a scale.

Parameters:	stem – leaf – scale –
Returns:	calculated values
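One plausible reading of this calculation, assuming a single-digit leaf and a scale equal to the value of one stem unit, is sketched below. `key_value` is a hypothetical stand-in and may differ from what key_calc actually does.

```python
def key_value(stem, leaf, scale):
    """Rebuild a value from a stem, a single-digit leaf and a scale (hypothetical helper)."""
    # Work on integers first so that the only floating point step is the final division.
    return (int(stem) * 10 + int(leaf)) * scale / 10


print(key_value(42, 7, 10))
print(key_value("4", "2", 1))
```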

stemgraphic.helpers.legend(ax, x, y, asc, flip_axes, mirror, stem, leaf, scale, delimiter_color, aggregation=True, cur_font=None, display=10, pos='best', unit='')¶

Builds a graphical legend for numerical stem-and-leaf plots.

Parameters:	display – cur_font – ax – x – y – pos – asc – flip_axes – mirror – stem – leaf – scale – delimiter_color – unit – aggregation –

stemgraphic.helpers.min_max_count(x, column=0)¶

Handles min, max and count. This works on numpy, lists, pandas and dask dataframes.

Parameters:	column – x – list, numpy array, series, pandas or dask dataframe
Returns:	min, max and count

stemgraphic.helpers.percentile(data, alpha)¶

Parameters:	data – list, numpy array, time series or pandas dataframe alpha – between 0 and 0.5 proportion to select on each side of the distribution
Returns:	the actual value at that percentile
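The alpha proportion has the same 0 to 0.5 range as the trim parameter used by the plotting functions. As a sketch of what removing a proportion from each end could look like, assuming a simple count-based cut on the sorted data (`trim_values` is a hypothetical helper, and the library may compute its cut points differently):

```python
def trim_values(data, alpha):
    """Drop the proportion `alpha` of points from each end of the sorted data."""
    if not 0 <= alpha <= 0.5:
        raise ValueError("alpha must be between 0 and 0.5")
    ordered = sorted(data)
    cut = int(len(ordered) * alpha)  # number of points removed on each side
    return ordered[cut:len(ordered) - cut]


trimmed = trim_values([9, 1, 100, 4, 5, 6, 3, 2, 8, 7], 0.1)
print(trimmed)
```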

stemgraphic.helpers.stack_columns(row)¶

Stack multiple columns into a single stacked value.

Parameters:	row – a row of letters
Returns:	stacked string

`num`¶

stemgraphic.num.

BRAND NEW in V.0.5.0!

Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. num is a module of the stemgraphic package to handle numerical variables.

This module structure is new as of v.0.5.0 to match the addition of stemgraphic.alpha.

The shorthand from previous versions of stemgraphic is still available and defaults to the numerical functions:

from stemgraphic import stem_graphic, stem_text, heatmap

`stopwords`¶

stopwords

This module includes 4 lists of stop words: EN (main English list), ALT_EN (alternate English list), FR (French) and ES (Spanish).

A PT (Portuguese) list is in the works.

stemgraphic.stopwords.ALT_EN = ['a', 'am', 'an', 'and', 'are', 'as', 'at', 'been', 'for', 'from', 'in', 'is', 'of', 'on', 'or', 'out', 'so', 'such', 'that', 'the', 'these', 'this', 'those', 'to', 'upon', 'was', 'were']¶: ALT_ENglish stopwords

stemgraphic.stopwords.EN = ['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'inside', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 
'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']¶: ENglish stop words

stemgraphic.stopwords.ES = ['a', 'alguna', 'algunas', 'alguno', 'algunos', 'algún', 'ambas', 'ambos', 'ampleamos', 'ante', 'antes', 'aquel', 'aquellas', 'aquellos', 'aqui', 'arriba', 'atras', 'aun', 'bajo', 'bastante', 'bien', 'cada', 'cierta', 'ciertas', 'cierto', 'ciertos', 'como', 'con', 'conseguimos', 'conseguir', 'consigo', 'consigue', 'consiguen', 'consigues', 'cual', 'cuando', 'dentro', 'desde', 'donde', 'dos', 'el', 'ella', 'ellas', 'ellos', 'empleais', 'emplean', 'emplear', 'empleas', 'empleo', 'en', 'encima', 'entonces', 'entre', 'era', 'eramos', 'eran', 'eras', 'eres', 'es', 'esta', 'estaba', 'estado', 'estais', 'estamos', 'estan', 'estoy', 'fin', 'fue', 'fueron', 'fui', 'fuimos', 'ha', 'hace', 'haceis', 'hacemos', 'hacen', 'hacer', 'haces', 'hago', 'incluso', 'intenta', 'intentais', 'intentamos', 'intentan', 'intentar', 'intentas', 'intento', 'ir', 'la', 'largo', 'las', 'lo', 'los', 'mientras', 'mio', 'modo', 'muchos', 'muy', 'nos', 'nosotros', 'otro', 'para', 'pero', 'podeis', 'podemos', 'poder', 'podria', 'podriais', 'podriamos', 'podrian', 'podrias', 'por', 'porque', 'primero', 'puede', 'pueden', 'puedo', 'quien', 'sabe', 'sabeis', 'sabemos', 'saben', 'saber', 'sabes', 'ser', 'sea', 'sean', 'si', 'siendo', 'sin', 'sobre', 'sois', 'solamente', 'solo', 'somos', 'soy', 'su', 'sus', 'también', 'teneis', 'tenemos', 'tener', 'tengo', 'tiempo', 'tiene', 'tienen', 'todo', 'trabaja', 'trabajais', 'trabajamos', 'trabajan', 'trabajar', 'trabajas', 'trabajo', 'tras', 'tu', 'tuyo', 'ultimo', 'un', 'una', 'unas', 'uno', 'unos', 'usa', 'usais', 'usamos', 'usan', 'usar', 'usas', 'uso', 'va', 'vais', 'valor', 'vamos', 'van', 'vaya', 'verdad', 'verdadera', 'verdadero', 'vosotras', 'vosotros', 'voy', 'yo']¶: Spanish (ESpanol) stop words

stemgraphic.stopwords.FR = ['a', 'alors', 'au', 'aucuns', 'aussi', 'autre', 'autres', 'aux', 'avant', 'avec', 'avoir', 'bon', 'car', 'ce', 'cela', 'ces', 'ceux', 'chacun', 'chacune', 'chaque', 'ci', 'comme', 'comment', 'dans', 'de', 'dedans', 'dehors', 'depuis', 'derrière', 'des', 'dessus', 'devant', 'devrait', 'doit', 'donc', 'dos', 'du', 'début', 'elle', 'elles', 'en', 'encore', 'essai', 'est', 'et', 'eu', 'fait', 'faites', 'fois', 'font', 'hors', 'ici', 'il', 'ils', 'je', 'juste', 'la', 'le', 'les', 'leur', 'là', 'ma', 'maintenant', 'mais', 'mes', 'mine', 'moins', 'mon', 'mot', 'même', 'ni', 'nommés', 'notre', 'nous', 'ou', 'où', 'par', 'parce', 'pas', 'peu', 'peut', 'plupart', 'pour', 'pourquoi', 'quand', 'que', 'quel', 'quelle', 'quelles', 'quelque', 'quelques', 'quels', 'qui', 'sa', 'sans', 'ses', 'seulement', 'si', 'sien', 'son', 'sont', 'sous', 'soyez', 'sujet', 'sur', 'ta', 'tandis', 'tellement', 'tels', 'tes', 'ton', 'tous', 'tout', 'trop', 'très', 'tu', 'voient', 'vont', 'votre', 'vous', 'vu', 'ça', 'étaient', 'état', 'étions', 'été', 'être']¶: French (FRancais) stop words

stemgraphic.stopwords.VOCALES = ['a', 'á', 'e', 'é', 'i', 'í', 'o', 'ó', 'u', 'ú', 'ü']¶: Spanish vowels

stemgraphic.stopwords.VOWELS = ['a', 'e', 'i', 'o', 'u']¶: English vowels

stemgraphic.stopwords.VOYELLES = ['a', 'â', 'ä', 'à', 'æ', 'e', 'ê', 'ë', 'é', 'è', 'i', 'î', 'ï', 'o', 'ô', 'ö', 'œ', 'u', 'û', 'ü', 'ù', 'y']¶: French vowels

`text`¶

stemgraphic.text.quantize(df, column=None, display=750, leaf_order=1, random_state=None, scale=None, trim=None, zoom=None)¶

Converts a series into stem-and-leaf and back into decimal. This has the potential effect of decimating (or truncating) values in a lossy way.

Parameters:

df – list, numpy array, time series, pandas or dask dataframe
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
display – maximum number of data points to display, forces sampling if smaller than len(df)
leaf_order – how many leaf digits per data point to display, defaults to 1
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns:

decimated df
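The lossy round trip can be illustrated on single values. The sketch assumes non-negative input, one leaf digit, and a scale equal to the value of one stem unit, so everything below the leaf digit is truncated; `quantize_value` is a hypothetical helper that ignores sampling, trim and zoom.

```python
def quantize_value(value, scale=10):
    """Truncate a non-negative value to its stem plus one leaf digit at the given scale."""
    step = scale / 10  # size of one leaf unit
    return int(value / step) * step


print([quantize_value(v, scale=10) for v in [427.96, 431.2, 58]])
print(quantize_value(1234, scale=1000))
```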

stemgraphic.text.stem_data(x, break_on=None, column=None, compact=False, display=300, full=False, leaf_order=1, omin=None, omax=None, outliers=False, persistence=None, random_state=None, scale=None, total_rows=None, trim=False, zoom=None)¶

Returns scale factor, key label and list of rows.

Parameters:

x – list, numpy array, time series, pandas or dask dataframe
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
full – bool, if True returns all interim results including sorted data and stems
leaf_order – how many leaf digits per data point to display, defaults to 1
outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto
omin – float, if already calculated, helps speed up the process for large data sets
omax – float, if already calculated, helps speed up the process for large data sets
persistence – persist sampled dataframe
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic)
total_rows – int, if already calculated, helps speed up the process for large data sets
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to False
zoom – zoom level, on top of calculated scale (+1, -1 etc)
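
To make the effect of break_on and compact on the returned rows concrete, here is a minimal pure-Python sketch. It does not call stemgraphic and does not reproduce the exact row format returned by stem_data: rows_sketch is a hypothetical helper, it handles only non-negative integers, and it assumes scale is a multiple of 10.

```python
from collections import defaultdict

def rows_sketch(values, scale=10, break_on=10, compact=False):
    """Hypothetical illustration: group leaves by stem into (stem, leaves) rows."""
    leaf_unit = scale // 10                       # assumes scale is a multiple of 10
    groups = defaultdict(str)
    for v in sorted(values):
        stem, leaf = divmod(int(v) // leaf_unit, 10)
        half = 1 if leaf >= break_on else 0       # break_on=10 never splits a stem
        groups[(stem, half)] += str(leaf)
    stems = [s for s, _ in groups]
    halves = 2 if break_on == 5 else 1
    rows = []
    for stem in range(min(stems), max(stems) + 1):
        for half in range(halves):
            leaves = groups.get((stem, half), "")
            if compact and not leaves:
                continue                          # compact: skip rows with no leaves
            rows.append((stem, leaves))
    return rows

data = [10, 11, 15, 22, 27, 29, 41, 46]
full_rows = rows_sketch(data)
# [(1, '015'), (2, '279'), (3, ''), (4, '16')]
split_rows = rows_sketch(data, break_on=5, compact=True)
# [(1, '01'), (1, '5'), (2, '2'), (2, '79'), (4, '1'), (4, '6')]
```

With break_on=5 each stem is split into a low row (leaves 0 to 4) and a high row (leaves 5 to 9); with compact=True the empty stem 3 disappears from the output.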

stemgraphic.text.stem_dot(df, asc=True, break_on=None, column=None, compact=False, display=300, leaf_order=1, legend_pos='best', marker=None, outliers=True, random_state=None, scale=None, trim=False, unit='', zoom=None)¶

Parameters:

df – list, numpy array, time series, pandas or dask dataframe
asc – stem sorted in ascending order, defaults to True
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
marker – char, symbol to use as marker. ‘O’ is default. Suggested alternatives: ‘*’, ‘+’, ‘x’, ‘.’, ‘o’
outliers – display outliers - these are from the full data set, not the sample. Defaults to True
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to False
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)
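
The idea of a text dot plot, where each data point is drawn as a marker instead of its leaf digit, can be sketched as follows. This is an illustration only, not the output format of stem_dot: dot_rows_sketch is a hypothetical helper that assumes non-negative integer data and an integer scale.

```python
from collections import Counter

def dot_rows_sketch(values, scale=10, marker="O", asc=True):
    """Hypothetical illustration: one marker per data point, one row per stem."""
    counts = Counter(int(v) // scale for v in values)
    stems = range(min(counts), max(counts) + 1)
    ordered = stems if asc else reversed(stems)
    # A Counter returns 0 for missing stems, so empty rows print no marker.
    return [f"{stem:>3} | {marker * counts[stem]}".rstrip() for stem in ordered]

data = [10, 11, 15, 22, 27, 29, 41, 46]
for row in dot_rows_sketch(data):
    print(row)
#   1 | OOO
#   2 | OOO
#   3 |
#   4 | OO
```

Passing marker="*" and asc=False in this sketch would list the same counts with a different symbol and the stems in descending order.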

stemgraphic.text.stem_text(df, asc=True, break_on=None, column=None, compact=False, display=300, legend_pos='best', outliers=True, persistence=None, random_state=None, scale=None, trim=False, unit='', zoom=None)¶

Parameters:

df – list, numpy array, time series, pandas or dask dataframe
asc – stem sorted in ascending order, defaults to True
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
outliers – display outliers - these are from the full data set, not the sample. Defaults to True
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to False
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)
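
The interaction of trim, display and random_state described in the parameter lists can be sketched in plain Python. This is an assumption-laden illustration, not the library's code: trim_and_sample is a hypothetical helper, and the order of operations (trim first, then sample) is chosen here for simplicity and may differ from what stemgraphic does.

```python
import random

def trim_and_sample(values, trim=0.0, display=300, random_state=None):
    """Hypothetical illustration: trim both ends, then sample down to `display`."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)                # points dropped at each end
    kept = ordered[cut:len(ordered) - cut] if cut else ordered
    if display is not None and display < len(kept):
        rng = random.Random(random_state)         # seeded for reproducibility
        kept = sorted(rng.sample(kept, display))
    return kept

data = list(range(100))
trimmed = trim_and_sample(data, trim=0.1)         # drops 10 points from each end
sample_a = trim_and_sample(data, display=20, random_state=42)
sample_b = trim_and_sample(data, display=20, random_state=42)
```

In this sketch, two calls with the same random_state return the same sample, which is the property the random_state parameter is documented to provide for reproducible research.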
