
Wrapping a Pandas DataFrame into a CubedPandas Cube

Using the 'cubed()' Function

The cubed() function is the most convenient way to wrap a Pandas DataFrame in a CubedPandas cube. In the examples, cdf is short for 'cubed data frame', following the Pandas convention of df for 'data frame'.

If no schema is provided when calling cubed(), a schema is inferred automatically from the DataFrame. By default, all numeric columns are treated as measures and all other columns as dimensions of the cube.
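The inference rule can be pictured with plain Pandas. The sketch below is not CubedPandas' implementation: `infer_schema_sketch` is a hypothetical helper that only mirrors the rule stated above (numeric columns become measures, everything else dimensions) and returns a dictionary in the schema layout used in this document.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_schema_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical helper: numeric columns -> measures, all others -> dimensions."""
    schema = {"dimensions": [], "measures": []}
    for column in df.columns:
        kind = "measures" if is_numeric_dtype(df[column]) else "dimensions"
        schema[kind].append({"column": column})
    return schema


df = pd.DataFrame({"channel": ["Online", "Retail"],
                   "product": ["Apple", "Pear"],
                   "sales": [100, 200]})
print(infer_schema_sketch(df))
```

Note that Pandas' `is_numeric_dtype` also reports boolean columns as numeric; whether CubedPandas does the same is not covered here.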

import pandas as pd
from cubedpandas import cubed

df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple",  "Pear",   "Banana", "Apple",  "Pear",   "Banana"],
                   "sales":   [100,      150,      300,      200,      250,      350     ],})
cdf = cubed(df)
print(cdf.Online)  # returns 550 = 100 + 150 + 300
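For orientation, the same figure can be obtained with plain Pandas by filtering on the member and summing the measure. This sketch assumes that `cdf.Online` aggregates the single measure `sales` with a sum, as the comment in the example above indicates.

```python
import pandas as pd

df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple",  "Pear",   "Banana", "Apple",  "Pear",   "Banana"],
                   "sales":   [100,      150,      300,      200,      250,      350]})

# Keep only the rows of the member "Online", then sum the measure column.
online_sales = df.loc[df["channel"] == "Online", "sales"].sum()
print(online_sales)  # 550
```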

Sometimes you need to provide a schema, e.g. if an integer column should be treated as a dimension rather than as a measure. Here is a simple example of how to define and use a schema; in this case it is identical to the schema that would be inferred automatically. For more information please refer to the Schema documentation.

import pandas as pd
from cubedpandas import cubed

df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple",  "Pear",   "Banana", "Apple",  "Pear",   "Banana"],
                   "sales":   [100,      150,      300,      200,      250,      350     ],})
schema = {"dimensions": [{"column":"channel"}, {"column": "product"}],
          "measures":   [{"column":"sales"}]}
cdf = cubed(df, schema=schema)
print(cdf.Online)  # returns 550 = 100 + 150 + 300
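The case mentioned above, an integer column that should act as a dimension, can be sketched as follows. The `year` column and the helper `make_schema` are illustrative assumptions and not part of CubedPandas; the helper only builds a dictionary in the layout shown above and checks that every referenced column exists. The result could then be passed as `schema=` to `cubed()`.

```python
import pandas as pd


def make_schema(df, dimensions, measures):
    """Hypothetical helper: build a schema dict and verify that the columns exist."""
    missing = [c for c in [*dimensions, *measures] if c not in df.columns]
    if missing:
        raise ValueError(f"Columns not found in dataframe: {missing}")
    return {"dimensions": [{"column": c} for c in dimensions],
            "measures": [{"column": c} for c in measures]}


df = pd.DataFrame({"year": [2023, 2023, 2024],
                   "product": ["Apple", "Pear", "Apple"],
                   "sales": [100, 150, 300]})

# "year" is numeric, so inference would treat it as a measure;
# the explicit schema declares it as a dimension instead.
schema = make_schema(df, dimensions=["year", "product"], measures=["sales"])
print(schema)
```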

Using the 'cubed' extension for Pandas

After CubedPandas has been loaded, e.g. by import cubedpandas, you can also use the cubed extension for Pandas directly. The only difference from the cubed() function is that you use the cubed attribute of the Pandas DataFrame and either slice it with the [] operator or access the cube or any context with the . operator.

import pandas as pd
import cubedpandas

df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple",  "Pear",   "Banana", "Apple",  "Pear",   "Banana"],
                   "sales":   [100,      150,      300,      200,      250,      350     ],})

cdf = df.cubed.cube  # returns a reference to the cube; 'df.cubed' alone will not work.
# or directly access any context of the cube, either by slicing with the [] operator
x = df.cubed["Online", "Apple", "sales"]
# or by using the . operator
y = df.cubed.Online.Apple.sales

assert(x == y == 100)
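To make the addressing in the example concrete, the sketch below resolves a list of keys against a dataframe with plain Pandas. It is an illustration under simplifying assumptions, not how CubedPandas works internally: `resolve` is a hypothetical helper, a key that names a numeric column is taken as the measure, any other key is looked up as a member in the non-numeric columns, and the result is aggregated by sum.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def resolve(df, *keys):
    """Hypothetical helper: filter by member keys, then sum the selected measure."""
    measures = [c for c in df.columns if is_numeric_dtype(df[c])]
    dimensions = [c for c in df.columns if c not in measures]
    mask = pd.Series(True, index=df.index)
    measure = measures[0]  # default to the first measure if none is named
    for key in keys:
        if key in measures:
            measure = key
            continue
        # Find the first dimension column that contains this member.
        column = next((c for c in dimensions if (df[c] == key).any()), None)
        if column is None:
            raise KeyError(f"Unknown member or measure: {key!r}")
        mask &= df[column] == key
    return df.loc[mask, measure].sum()


df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple",  "Pear",   "Banana", "Apple",  "Pear",   "Banana"],
                   "sales":   [100,      150,      300,      200,      250,      350]})

print(resolve(df, "Online", "Apple", "sales"))  # 100
print(resolve(df, "Online"))                    # 550
```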

cubed(df, schema=None, infer_schema=True, exclude=None, caching=CachingStrategy.LAZY, caching_threshold=EAGER_CACHING_THRESHOLD, read_only=True)

Wraps a Pandas dataframe into a cube to provide convenient multi-dimensional access to the underlying dataframe for easy aggregation, filtering, slicing, reporting, data manipulation and write back.

Parameters:

  • df (DataFrame) –

    The Pandas dataframe to be wrapped into the CubedPandas Cube object.

  • schema

    (optional) A schema that defines the dimensions and measures of the Cube. If not provided, the schema will be inferred from the dataframe if parameter infer_schema is set to True. For further details please refer to the documentation of the Schema class. Default value is None.

  • infer_schema (bool, default: True ) –

    (optional) If no schema is provided and infer_schema is set to True, a suitable schema will be inferred from the underlying dataframe. All numerical columns will be treated as measures, all other columns as dimensions. If this behaviour is not desired, a schema must be provided. Default value is True.

  • exclude (str | list | tuple | None, default: None ) –

    (optional) Defines the columns that should be excluded from the cube if no schema is provided. If a column is excluded, it will not be part of the schema and cannot be accessed through the cube. Excluded columns will be ignored during schema inference. Default value is None.

  • caching (CachingStrategy, default: LAZY ) –

    (optional) A caching strategy to be applied for accessing the cube. The recommended value for almost all use cases is CachingStrategy.LAZY, which caches dimension members on first access. Caching can be beneficial for performance, but may also consume more memory. To cache all dimension members eagerly (on initialization of the cube), set this parameter to CachingStrategy.EAGER. Please refer to the documentation of 'CachingStrategy' for more information. Default value is CachingStrategy.LAZY.

  • caching_threshold (int, default: EAGER_CACHING_THRESHOLD ) –

    (optional) The threshold, as a number of members, for EAGER caching only. If the number of distinct members in a dimension is below this threshold and caching is set to CachingStrategy.EAGER or CachingStrategy.FULL, the dimension will be cached eagerly. Above this threshold, the dimension will be cached lazily. Default value is EAGER_CACHING_THRESHOLD, equivalent to 256 unique members per dimension.

  • read_only (bool, default: True ) –

    (optional) Defines if write backs to the underlying dataframe are permitted. If read_only is set to True, write back attempts will raise a PermissionError. If read_only is set to False, write backs are permitted and will be pushed back to the underlying dataframe. Default value is True.
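The interplay of the caching and caching_threshold parameters described above can be sketched as a small decision function. Everything here is illustrative: the `Strategy` enum and `cache_eagerly` are hypothetical stand-ins rather than CubedPandas' own classes, and the threshold of 256 is taken from the parameter description.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical stand-in for a caching strategy enum."""
    LAZY = 1
    EAGER = 2
    FULL = 3


THRESHOLD = 256  # value stated in the parameter description above


def cache_eagerly(strategy, distinct_members, threshold=THRESHOLD):
    """Return True if a dimension would be cached when the cube is created."""
    wants_eager = strategy in (Strategy.EAGER, Strategy.FULL)
    return wants_eager and distinct_members < threshold


print(cache_eagerly(Strategy.EAGER, 12))    # True: small dimension
print(cache_eagerly(Strategy.EAGER, 5000))  # False: falls back to lazy caching
print(cache_eagerly(Strategy.LAZY, 12))     # False: cached on first access
```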

Returns:

  • A new Cube object that wraps the dataframe.

Raises:

  • PermissionError

    If writeback is attempted on a read-only Cube.

  • ValueError

    If the schema is not valid or does not match the dataframe, or if invalid dimension, member, measure or address arguments are provided.
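The read_only behaviour and the PermissionError listed above can be pictured with a small wrapper. `GuardedFrame` and its `write` method are hypothetical and much simpler than a cube; the sketch only shows the guard pattern of refusing write back unless it was explicitly enabled.

```python
import pandas as pd


class GuardedFrame:
    """Hypothetical wrapper that permits writes only when read_only is False."""

    def __init__(self, df, read_only=True):
        self._df = df
        self._read_only = read_only

    def write(self, column, member, measure, value):
        if self._read_only:
            raise PermissionError("Write back is not permitted on a read-only object.")
        # Write the value into all rows that match the member.
        self._df.loc[self._df[column] == member, measure] = value


df = pd.DataFrame({"product": ["A", "B"], "value": [1, 2]})

try:
    GuardedFrame(df).write("product", "B", "value", 20)  # read_only defaults to True
except PermissionError as err:
    print(err)

GuardedFrame(df, read_only=False).write("product", "B", "value", 20)
print(df["value"].tolist())  # [1, 20]
```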

Examples:

>>> df = pd.DataFrame({"product": ["A", "B", "C"], "value": [1, 2, 3]})
>>> cdf = cubed(df)
>>> cdf["product:B"]
2

CubedPandasAccessor

A Pandas extension that provides the CubedPandas 'cubed' accessor for Pandas dataframes.
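Pandas offers a public hook for such extensions, `pandas.api.extensions.register_dataframe_accessor`. The sketch below registers a toy accessor to show the mechanism. The accessor name `demo_totals` and its `total` method are made up for illustration, and the way CubedPandas implements its own accessor is not shown here. Because registration happens when the defining module is executed, an accessor only becomes available after that module has been imported.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("demo_totals")
class DemoTotalsAccessor:
    """Hypothetical accessor, reachable as `df.demo_totals` on any DataFrame."""

    def __init__(self, pandas_obj):
        # Pandas passes in the DataFrame the accessor is attached to.
        self._df = pandas_obj

    def total(self, column, **filters):
        """Sum `column` over the rows matching all `name=member` filters."""
        mask = pd.Series(True, index=self._df.index)
        for name, member in filters.items():
            mask &= self._df[name] == member
        return self._df.loc[mask, column].sum()


df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple",  "Pear",   "Banana", "Apple",  "Pear",   "Banana"],
                   "sales":   [100,      150,      300,      200,      250,      350]})

print(df.demo_totals.total("sales", channel="Online"))  # 550
```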

cube property

Wraps a Pandas dataframe into a cube to provide convenient multi-dimensional access to the underlying dataframe for easy aggregation, filtering, slicing, reporting, data manipulation and write back.

Parameters:

  • df

    The Pandas dataframe to be wrapped into the CubedPandas Cube object.

  • schema

    (optional) A schema that defines the dimensions and measures of the Cube. If not provided, the schema will be inferred from the dataframe if parameter infer_schema is set to True. For further details please refer to the documentation of the Schema class. Default value is None.

  • infer_schema

    (optional) If no schema is provided and infer_schema is set to True, a suitable schema will be inferred from the underlying dataframe. All numerical columns will be treated as measures, all other columns as dimensions. If this behaviour is not desired, a schema must be provided. Default value is True.

  • exclude

    (optional) Defines the columns that should be excluded from the cube if no schema is provided. If a column is excluded, it will not be part of the schema and cannot be accessed through the cube. Excluded columns will be ignored during schema inference. Default value is None.

  • read_only

    (optional) Defines if write backs to the underlying dataframe are permitted. If read_only is set to True, write back attempts will raise a PermissionError. If read_only is set to False, write backs are permitted and will be pushed back to the underlying dataframe. Default value is True.

  • ignore_case

    (optional) If set to True, the case of member names will be ignored: 'Apple' and 'apple' will be treated as the same member. If set to False, member names are case-sensitive: 'Apple' and 'apple' will be treated as different members. Default value is True.

  • ignore_key_errors

    (optional) If set to True, key errors for members of dimensions will be ignored and cell values will return 0.0 or None if no matching record exists. If set to False, key errors will be raised as exceptions when accessing cell values for non-existing members. Default value is True.

  • caching

    (optional) A caching strategy to be applied for accessing the cube. The recommended value for almost all use cases is CachingStrategy.LAZY, which caches dimension members on first access. Caching can be beneficial for performance, but may also consume more memory. To cache all dimension members eagerly (on initialization of the cube), set this parameter to CachingStrategy.EAGER. Please refer to the documentation of 'CachingStrategy' for more information. Default value is CachingStrategy.LAZY.

  • caching_threshold

    (optional) The threshold, as a number of members, for EAGER caching only. If the number of distinct members in a dimension is below this threshold and caching is set to CachingStrategy.EAGER or CachingStrategy.FULL, the dimension will be cached eagerly. Above this threshold, the dimension will be cached lazily. Default value is EAGER_CACHING_THRESHOLD, equivalent to 256 unique members per dimension.

  • eager_evaluation

    (optional) If set to True, the cube will evaluate the context eagerly, i.e. when the context is created. Eager evaluation is recommended for most use cases, as it simplifies debugging and error handling. If set to False, the cube will evaluate the context lazily, i.e. only when the value of a context is accessed/requested.
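How ignore_case and ignore_key_errors change a member lookup can be sketched in plain Python. The function `lookup` and the dictionary of totals are hypothetical; for unknown members the sketch returns 0.0 when key errors are ignored, which is one of the two fallbacks mentioned above.

```python
def lookup(totals, member, ignore_case=True, ignore_key_errors=True):
    """Hypothetical member lookup over a dict of member -> aggregated value."""
    if ignore_case:
        # Compare case-insensitively by normalising both sides.
        totals = {name.casefold(): value for name, value in totals.items()}
        member = member.casefold()
    if member in totals:
        return totals[member]
    if ignore_key_errors:
        return 0.0  # no matching record: fall back to a neutral value
    raise KeyError(member)


totals = {"Apple": 300, "Pear": 400, "Banana": 650}
print(lookup(totals, "apple"))   # 300
print(lookup(totals, "Cherry"))  # 0.0
```

With `ignore_case=False` the key `"apple"` no longer matches `"Apple"`, and with `ignore_key_errors=False` an unknown member raises a `KeyError` instead of returning the fallback.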

Returns:

  • A new Cube object that wraps the dataframe.

Raises:

  • PermissionError

    If writeback is attempted on a read-only Cube.

  • ValueError

    If the schema is not valid or does not match the dataframe, or if invalid dimension, member, measure or address arguments are provided.

Examples:

>>> df = pd.DataFrame({"product": ["A", "B", "C"], "value": [1, 2, 3]})
>>> cdf = cubed(df)
>>> cdf["product:B"]