Wrapping a Pandas DataFrame into a CubedPandas Cube¶
Using the 'cubed()' Method¶
The cubed() function is the most convenient way to wrap a Pandas dataframe into a CubedPandas cube.
By the way, cdf is short for 'cubed data frame', following the Pandas convention of df for 'data frame'.
If no schema is provided when calling cubed(), a schema is inferred automatically from the DataFrame.
By default, all numeric columns are treated as measures and all other columns as dimensions of the cube.
import pandas as pd
from cubedpandas import cubed
df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple", "Pear", "Banana", "Apple", "Pear", "Banana"],
                   "sales": [100, 150, 300, 200, 250, 350]})
cdf = cubed(df)
print(cdf.Online) # returns 550 = 100 + 150 + 300
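The default inference rule described above (numeric columns become measures, all other columns become dimensions) can be approximated with plain Pandas. The following is a minimal sketch, not the CubedPandas implementation: the helper name infer_schema_sketch is hypothetical, and the returned dictionary simply follows the "dimensions"/"measures" layout used for schemas on this page.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_schema_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical helper: classify columns the way the text describes.

    Numeric columns are listed as measures, everything else as dimensions.
    This is an illustration only, not the library's own inference code.
    """
    schema = {"dimensions": [], "measures": []}
    for column in df.columns:
        kind = "measures" if is_numeric_dtype(df[column]) else "dimensions"
        schema[kind].append({"column": column})
    return schema


df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple", "Pear", "Banana", "Apple", "Pear", "Banana"],
                   "sales": [100, 150, 300, 200, 250, 350]})

inferred = infer_schema_sketch(df)
print(inferred)
```

For this dataframe the sketch lists channel and product as dimensions and sales as the only measure.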
Sometimes you need to provide a schema, e.g. if you want an integer column to be treated as a dimension rather than as a measure.
Here's a simple example of how to define and use a schema; in this case it is identical to the schema that would be inferred automatically.
For more information please refer to the Schema documentation.
import pandas as pd
from cubedpandas import cubed
df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple", "Pear", "Banana", "Apple", "Pear", "Banana"],
                   "sales": [100, 150, 300, 200, 250, 350]})
schema = {"dimensions": [{"column": "channel"}, {"column": "product"}],
          "measures": [{"column": "sales"}]}
cdf = cubed(df, schema=schema)
print(cdf.Online) # returns 550 = 100 + 150 + 300
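To make the integer-column case mentioned above concrete, here is a small sketch using plain Pandas only. The year column and its values are made up for illustration. The schema dictionary uses the same layout as the example above and would be passed as cubed(df, schema=schema); the rest only shows why summing such a column as a measure is rarely what you want.

```python
import pandas as pd

# Illustrative data: 'year' is an integer column but is meant as a dimension.
df = pd.DataFrame({"year": [2023, 2023, 2024, 2024],
                   "product": ["Apple", "Pear", "Apple", "Pear"],
                   "sales": [100, 150, 200, 250]})

# Default inference would classify 'year' as a measure because it is numeric.
# Listing it under "dimensions" states the intent explicitly.
schema = {"dimensions": [{"column": "year"}, {"column": "product"}],
          "measures": [{"column": "sales"}]}

# Treated as a measure, 'year' would be aggregated, which is meaningless here.
year_as_measure = df["year"].sum()

# Treated as a dimension, 'year' selects rows and 'sales' is aggregated.
sales_by_year = df.groupby("year")["sales"].sum().to_dict()

print(year_as_measure)
print(sales_by_year)
```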
Using the 'cubed' extension for Pandas¶
After CubedPandas has been loaded, e.g. by import cubedpandas, you can also use the cubed extension for Pandas directly.
The only difference to the cubed() function is that you need to use the cubed attribute of the Pandas DataFrame and either slice it with the [] operator or access the cube or any context using the . operator.
import pandas as pd
import cubedpandas
df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple", "Pear", "Banana", "Apple", "Pear", "Banana"],
                   "sales": [100, 150, 300, 200, 250, 350]})
cdf = df.cubed.cube  # returns a reference to the cube; 'df.cubed' alone will not work
# or directly access any context of the cube, either by slicing with the [] operator
x = df.cubed["Online", "Apple", "sales"]
# or by using the . operator
y = df.cubed.Online.Apple.sales
assert x == y == 100
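For readers who want to check the numbers in the examples above, the same values can be computed with plain Pandas filtering. This sketch does not show how CubedPandas evaluates a context internally; it only reproduces the aggregations that the cube addresses stand for, assuming the default aggregation is a sum over the sales measure as the comments above indicate.

```python
import pandas as pd

df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple", "Pear", "Banana", "Apple", "Pear", "Banana"],
                   "sales": [100, 150, 300, 200, 250, 350]})

# Counterpart of cdf.Online: sum of 'sales' over all rows with channel == "Online".
online_total = df.loc[df["channel"] == "Online", "sales"].sum()

# Counterpart of df.cubed["Online", "Apple", "sales"]: filter on both dimensions.
mask = (df["channel"] == "Online") & (df["product"] == "Apple")
online_apple_sales = df.loc[mask, "sales"].sum()

print(online_total, online_apple_sales)
```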
cubed(df, schema=None, infer_schema=True, exclude=None, caching=CachingStrategy.LAZY, caching_threshold=EAGER_CACHING_THRESHOLD, read_only=True)¶
Wraps a Pandas dataframe into a cube to provide convenient multi-dimensional access to the underlying dataframe for easy aggregation, filtering, slicing, reporting, data manipulation and write back.
Parameters:
- df (DataFrame) – The Pandas dataframe to be wrapped into the CubedPandas Cube object.
- schema – (optional) A schema that defines the dimensions and measures of the Cube. If not provided, the schema will be inferred from the dataframe if parameter infer_schema is set to True. For further details please refer to the documentation of the Schema class. Default value is None.
- infer_schema (bool, default: True) – (optional) If no schema is provided and infer_schema is set to True, a suitable schema will be inferred from the underlying dataframe. All numerical columns will be treated as measures, all other columns as dimensions. If this behaviour is not desired, a schema must be provided. Default value is True.
- exclude (str | list | tuple | None, default: None) – (optional) Defines the columns that should be excluded from the cube if no schema is provided. If a column is excluded, it will not be part of the schema and cannot be accessed through the cube. Excluded columns will be ignored during schema inference. Default value is None.
- caching (CachingStrategy, default: LAZY) – (optional) A caching strategy to be applied for accessing the cube. The recommended value for almost all use cases is CachingStrategy.LAZY, which caches dimension members on first access. Caching can be beneficial for performance, but may also consume more memory. To cache all dimension members eagerly (on initialization of the cube), set this parameter to CachingStrategy.EAGER. Please refer to the documentation of 'CachingStrategy' for more information. Default value is CachingStrategy.LAZY.
- caching_threshold (int, default: EAGER_CACHING_THRESHOLD) – (optional) The threshold as 'number of members' for EAGER caching only. If the number of distinct members in a dimension is below this threshold, the dimension will be cached eagerly if caching is set to CachingStrategy.EAGER or CachingStrategy.FULL. Above this threshold, the dimension will be cached lazily. Default value is EAGER_CACHING_THRESHOLD, equivalent to 256 unique members per dimension.
- read_only (bool, default: True) – (optional) Defines if write backs to the underlying dataframe are permitted. If read_only is set to True, write back attempts will raise a PermissionError. If read_only is set to False, write backs are permitted and will be pushed back to the underlying dataframe. Default value is True.
Returns:
- A new Cube object that wraps the dataframe.
Raises:
- PermissionError – If write back is attempted on a read-only Cube.
- ValueError – If the schema is not valid or does not match the dataframe, or if invalid dimension, member, measure or address arguments are provided.
Examples:
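The caching_threshold rule described in the parameter list can be pictured with a short plain-Pandas sketch. It is an illustration under stated assumptions, not library code: the helper cached_eagerly is hypothetical, and the constant below only mirrors the documented default of 256 unique members per dimension.

```python
import pandas as pd

# Assumption: mirrors the documented default of 256 unique members per dimension.
EAGER_CACHING_THRESHOLD = 256


def cached_eagerly(df: pd.DataFrame, column: str,
                   threshold: int = EAGER_CACHING_THRESHOLD) -> bool:
    """Hypothetical helper: True if the column has fewer distinct
    members than the threshold, i.e. it would qualify for eager caching."""
    return df[column].nunique() < threshold


df = pd.DataFrame({"channel": ["Online", "Retail"] * 300,  # 2 distinct members
                   "order_id": range(600)})                # 600 distinct members

decisions = {column: cached_eagerly(df, column) for column in df.columns}
print(decisions)
```

In this sketch the small channel dimension qualifies for eager caching, while the high-cardinality order_id column would fall back to lazy caching.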
CubedPandasAccessor¶
A Pandas extension that provides the CubedPandas 'cubed' accessor for Pandas dataframes.
cube (property)¶
Wraps a Pandas dataframe into a cube to provide convenient multi-dimensional access to the underlying dataframe for easy aggregation, filtering, slicing, reporting, data manipulation and write back.
Parameters:
- df – The Pandas dataframe to be wrapped into the CubedPandas Cube object.
- schema – (optional) A schema that defines the dimensions and measures of the Cube. If not provided, the schema will be inferred from the dataframe if parameter infer_schema is set to True. For further details please refer to the documentation of the Schema class. Default value is None.
- infer_schema – (optional) If no schema is provided and infer_schema is set to True, a suitable schema will be inferred from the underlying dataframe. All numerical columns will be treated as measures, all other columns as dimensions. If this behaviour is not desired, a schema must be provided. Default value is True.
- exclude – (optional) Defines the columns that should be excluded from the cube if no schema is provided. If a column is excluded, it will not be part of the schema and cannot be accessed through the cube. Excluded columns will be ignored during schema inference. Default value is None.
- read_only – (optional) Defines if write backs to the underlying dataframe are permitted. If read_only is set to True, write back attempts will raise a PermissionError. If read_only is set to False, write backs are permitted and will be pushed back to the underlying dataframe. Default value is True.
- ignore_case – (optional) If set to True, the case of member names will be ignored; 'Apple' and 'apple' will be treated as the same member. If set to False, member names are case-sensitive; 'Apple' and 'apple' will be treated as different members. Default value is True.
- ignore_key_errors – (optional) If set to True, key errors for members of dimensions will be ignored and cell values will return 0.0 or None if no matching record exists. If set to False, key errors will be raised as exceptions when accessing cell values for non-existing members. Default value is True.
- caching – (optional) A caching strategy to be applied for accessing the cube. The recommended value for almost all use cases is CachingStrategy.LAZY, which caches dimension members on first access. Caching can be beneficial for performance, but may also consume more memory. To cache all dimension members eagerly (on initialization of the cube), set this parameter to CachingStrategy.EAGER. Please refer to the documentation of 'CachingStrategy' for more information. Default value is CachingStrategy.LAZY.
- caching_threshold – (optional) The threshold as 'number of members' for EAGER caching only. If the number of distinct members in a dimension is below this threshold, the dimension will be cached eagerly if caching is set to CachingStrategy.EAGER or CachingStrategy.FULL. Above this threshold, the dimension will be cached lazily. Default value is EAGER_CACHING_THRESHOLD, equivalent to 256 unique members per dimension.
- eager_evaluation – (optional) If set to True, the cube will evaluate the context eagerly, i.e. when the context is created. Eager evaluation is recommended for most use cases, as it simplifies debugging and error handling. If set to False, the cube will evaluate the context lazily, i.e. only when the value of a context is accessed/requested.
Returns:
- A new Cube object that wraps the dataframe.
Raises:
- PermissionError – If write back is attempted on a read-only Cube.
- ValueError – If the schema is not valid or does not match the dataframe, or if invalid dimension, member, measure or address arguments are provided.
Examples:
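The documented meaning of ignore_case and ignore_key_errors can be sketched in plain Pandas. The helper member_total below is hypothetical and is not part of CubedPandas; it only mimics the described behaviour for a single dimension and measure, and it returns 0.0 for a missing member where the text says the library returns 0.0 or None.

```python
import pandas as pd


def member_total(df, dimension, member, measure,
                 ignore_case=True, ignore_key_errors=True):
    """Hypothetical helper: sum `measure` over rows where `dimension`
    equals `member`, following the documented flag semantics."""
    values = df[dimension].astype(str)
    if ignore_case:
        mask = values.str.lower() == str(member).lower()
    else:
        mask = values == str(member)
    if not mask.any():
        if ignore_key_errors:
            return 0.0  # the documentation states 0.0 or None; 0.0 is chosen here
        raise KeyError(f"Member {member!r} not found in dimension {dimension!r}")
    return df.loc[mask, measure].sum()


df = pd.DataFrame({"channel": ["Online", "Online", "Online", "Retail", "Retail", "Retail"],
                   "product": ["Apple", "Pear", "Banana", "Apple", "Pear", "Banana"],
                   "sales": [100, 150, 300, 200, 250, 350]})

case_insensitive = member_total(df, "product", "apple", "sales")
case_sensitive = member_total(df, "product", "apple", "sales", ignore_case=False)
print(case_insensitive, case_sensitive)
```

With ignore_key_errors=False the same lookup for a non-existing member raises a KeyError instead of returning 0.0.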