Dataframe

Package

abeja_dsf.core.calculation.data_type.dataframe

Description

In DSF, DataFrame is the most common data type.

DataFrame generally refers to “tabular” data: a data structure representing cases (rows), each of which consists of a number of observations or measurements (columns).

Please see the wiki for information about the concept of a dataframe.

The actual class is DSFDataFrame, which inherits from DSFData.

DSFColumnType

DSFColumnType annotates a DSFColumn with the kind of data the column stores.

This type is not a data type such as int or float; it is the column's level of measurement.

variable name                             string name  description
DSFColumnType.UNKNOWN_COLUMN              unknown      unknown data
DSFColumnType.QUANTITY_COLUMN             quantity     quantity data
DSFColumnType.RATIO_QUANTITY_COLUMN       ratio        quantity on an absolute scale
DSFColumnType.INTERVAL_QUANTITY_COLUMN    interval     quantity on a relative scale
DSFColumnType.CATEGORY_COLUMN             category     label data
DSFColumnType.NOMINAL_CATEGORY_COLUMN     nominal      labels whose numeric value carries no meaning
DSFColumnType.ORDINAL_CATEGORY_COLUMN     ordinal      labels whose order is meaningful

from abeja_dsf.core.calculation.data_type.dataframe import *

DSFColumnType.UNKNOWN_COLUMN
DSFColumnType.QUANTITY_COLUMN
DSFColumnType.RATIO_QUANTITY_COLUMN
DSFColumnType.INTERVAL_QUANTITY_COLUMN
DSFColumnType.CATEGORY_COLUMN
DSFColumnType.NOMINAL_CATEGORY_COLUMN
DSFColumnType.ORDINAL_CATEGORY_COLUMN
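
The difference between a data type and a level of measurement can be sketched with a stand-in enum. ColumnType, QUANTITY_TYPES, CATEGORY_TYPES and mean_is_meaningful below are hypothetical names for illustration only, not part of abeja_dsf; the grouping of ratio/interval under quantity and nominal/ordinal under category is an assumption based on the member names.

```python
from enum import Enum


class ColumnType(Enum):
    """Hypothetical stand-in for DSFColumnType; values are the string names from the table."""
    UNKNOWN = "unknown"
    QUANTITY = "quantity"
    RATIO_QUANTITY = "ratio"
    INTERVAL_QUANTITY = "interval"
    CATEGORY = "category"
    NOMINAL_CATEGORY = "nominal"
    ORDINAL_CATEGORY = "ordinal"


# Assumed grouping: ratio and interval refine quantity; nominal and ordinal refine category.
QUANTITY_TYPES = {ColumnType.QUANTITY, ColumnType.RATIO_QUANTITY, ColumnType.INTERVAL_QUANTITY}
CATEGORY_TYPES = {ColumnType.CATEGORY, ColumnType.NOMINAL_CATEGORY, ColumnType.ORDINAL_CATEGORY}


def mean_is_meaningful(ctype: ColumnType) -> bool:
    """Averaging only makes sense for quantity-like columns."""
    return ctype in QUANTITY_TYPES


# Both columns could be stored as int, yet only one can sensibly be averaged.
print(mean_is_meaningful(ColumnType("ratio")))    # True
print(mean_is_meaningful(ColumnType("nominal")))  # False
```

A user ID and a price may share the same dtype, but only the column type tells a component whether operations such as averaging are appropriate.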

DSFColumn

DSFColumn holds the schema information for one column in a DSFDataFrame.

It is very similar to a column of a relational database (RDB) table.

def __init__(self,
             name: str,
             dtype: Type[dsf_T],
             ctype: DSFColumnType,
             nullable: bool = True,
             default: Optional[dsf_T] = None)

Creates a DSFColumn instance. Information about dsf_T is here.


def as_name(self, name: str)

Creates a copy with a new name.


@classmethod
def from_definition(cls, 
                    definition: DSFColumnDefinition)

Creates a DSFColumn instance from a DSFColumnDefinition.

from abeja_dsf.core.calculation.data_type.dtypes import *
from abeja_dsf.core.calculation.data_type.dataframe import *

user_id_column = DSFColumn('user_id', dsf_int, DSFColumnType.NOMINAL_CATEGORY_COLUMN, True, -1)
renamed_column = user_id_column.as_name('copied_user_id')
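
The copy semantics of as_name can be sketched without the library. The Column dataclass below is a hypothetical stand-in that carries the same five fields as the constructor above; how DSFColumn implements the copy internally is not stated in this page, so dataclasses.replace is only one plausible way to do it.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for DSFColumn, holding the same five fields."""
    name: str
    dtype: type
    ctype: str
    nullable: bool = True
    default: Optional[Any] = None

    def as_name(self, name: str) -> "Column":
        # Return a new object; the original column is left untouched.
        return replace(self, name=name)


user_id = Column("user_id", int, "nominal", True, -1)
copied = user_id.as_name("copied_user_id")

print(user_id.name)    # user_id
print(copied.name)     # copied_user_id
print(copied.default)  # -1
```

Everything except the name carries over to the copy, which is what makes as_name convenient for deriving a new column from an existing schema entry.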

Aliases for DSFColumn

from abeja_dsf.core.calculation.data_type.dataframe import *

IntIdSeries(name='int_category_data', nullable=True, default=None)

IntValueSeries(name='int_quantity_data', nullable=True, default=None)

StrIdSeries(name='str_category_data', nullable=True, default=None)

ValueSeries(name='float_quantity_data', nullable=True, default=None)

BinarySeries(name='bool_quantity_data', nullable=True, default=None)

TimestampSeries(name='timestamp_quantity_data', nullable=True, default=None)

IntervalSeries(name='interval_quantity_data', nullable=True, default=None)

UnknownSeries(name='unknown_data', nullable=True, default=None)
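
Each alias takes only name, nullable and default, so it presumably fixes dtype and ctype for you. The sketch below spells out one possible reading. The (dtype, ctype) pairs are assumptions read off the example names above (for instance 'int_category_data'), the 'object' dtype for UnknownSeries is borrowed from the DSFColumnDefinition default, and ALIAS_HINTS and expand_alias are hypothetical names, not library API.

```python
from typing import Any, Dict, Optional, Tuple

# Assumed (dtype, ctype) pairs, read off the example names above, e.g.
# 'int_category_data' -> ("int", "category"). Not confirmed by the library.
ALIAS_HINTS: Dict[str, Tuple[str, str]] = {
    "IntIdSeries": ("int", "category"),
    "IntValueSeries": ("int", "quantity"),
    "StrIdSeries": ("str", "category"),
    "ValueSeries": ("float", "quantity"),
    "BinarySeries": ("bool", "quantity"),
    "TimestampSeries": ("timestamp", "quantity"),
    "IntervalSeries": ("interval", "quantity"),
    "UnknownSeries": ("object", "unknown"),  # dtype assumed from the ColDef default
}


def expand_alias(alias: str, name: str, nullable: bool = True,
                 default: Optional[Any] = None) -> Dict[str, Any]:
    """Spell out the fields an alias call would presumably fill in."""
    dtype, ctype = ALIAS_HINTS[alias]
    return {"name": name, "dtype": dtype, "ctype": ctype,
            "nullable": nullable, "default": default}


print(expand_alias("IntIdSeries", "user_id"))
```

Check the actual alias definitions in the package before relying on any of these pairs, in particular which specific category or quantity subtype each alias uses.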

DSFColumnDefinition

DSFColumnDefinition is used to specify column information in DSFComponent.Arguments. You can also use ColDef instead of DSFColumnDefinition.

def __init__(self,
             name: str,
             dtype: str = 'object',
             ctype: str = 'unknown',
             nullable: bool = True,
             default: Optional[dsf_T] = None)

Creates a DSFColumnDefinition instance.

Set dtype to one of the dtype string names listed here.

Set ctype to one of the string names listed in the DSFColumnType table.

from abeja_dsf.core.calculation.data_type.dtypes import *
from abeja_dsf.core.calculation.data_type.dataframe import *

user_id_column = DSFColumn.from_definition(DSFColumnDefinition('user_id', 'int', 'nominal', True, -1))
user_id_column = DSFColumn.from_definition(ColDef('user_id', 'int', 'nominal', True, -1))
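
A definition holds only strings and plain values, and from_definition turns it into a typed column. The sketch below shows one way such a lookup could work. ColDefSketch, DTYPES, CTYPES and resolve are hypothetical names, the dtype table contains only the two names that appear in this page, and the real library's validation behavior is not documented here.

```python
from typing import Any, Dict, NamedTuple, Optional


class ColDefSketch(NamedTuple):
    """Hypothetical stand-in for DSFColumnDefinition: plain strings and values only."""
    name: str
    dtype: str = "object"
    ctype: str = "unknown"
    nullable: bool = True
    default: Optional[Any] = None


# Assumed lookup table; only 'int' and 'object' appear in the text above.
DTYPES: Dict[str, type] = {"int": int, "object": object}
# ctype string names, copied from the DSFColumnType table.
CTYPES = {"unknown", "quantity", "ratio", "interval", "category", "nominal", "ordinal"}


def resolve(definition: ColDefSketch) -> Dict[str, Any]:
    """Turn string names into concrete objects, failing early on unknown names."""
    if definition.dtype not in DTYPES:
        raise ValueError(f"unknown dtype name: {definition.dtype!r}")
    if definition.ctype not in CTYPES:
        raise ValueError(f"unknown ctype name: {definition.ctype!r}")
    return {**definition._asdict(), "dtype": DTYPES[definition.dtype]}


resolved = resolve(ColDefSketch("user_id", "int", "nominal", True, -1))
print(resolved["dtype"])  # <class 'int'>
```

Keeping the definition string-based means it can be written wherever component arguments are specified, with the conversion to real types deferred to a single place.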

DSFDataFrame

def __init__(self, 
             schema: Union[Dict[str, DSFColumn], List[DSFColumn]],
             data: Union[pandas.DataFrame, Dict[str, Iterable]])

Creates a DSFDataFrame instance.


def column(self, name: str) -> Optional[DSFColumn]

Gets the DSFColumn with the given name.


def columns(self) -> List[DSFColumn]

Gets all DSFColumns.


def add_column(self,
               column: DSFColumn,
               series: Optional[pandas.Series] = None)

Adds a new DSFColumn to the DSFDataFrame, together with its data.


def remove_column(self, name: str)

Deletes a DSFColumn from the DSFDataFrame.


def union(self, dfs: Iterable[DSFDataFrame]) -> DSFDataFrame

Creates a new DSFDataFrame by combining self with the data frames in dfs.


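from abeja_dsf.core.calculation.data_type.dataframe import *

# Build a frame with an integer ID column 'a' and an integer value column 'b'.
df = DSFDataFrame([IntIdSeries('a'), IntValueSeries('b')],
                  {'a': [1, 2, 3, 4, 5], 'b': [4, 5, 6, 7, 8]})

# Add a float column 'c'.
df[ValueSeries('c')] = [0.0, 0.1, 0.2, 0.3, 0.4]

# Add column 'd', reusing the schema of 'b' under a new name.
df[df.column('b').as_name('d')] = [b * a for b, a in zip(df['b'], df['a'])]

# Remove column 'b'.
del df['b']

len(df)

# Iterate over the frame and read values by column name.
for row in df:
    print(row['a'], row['c'], row['d'])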
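
The usage above relies on Python's container protocol (item assignment, del, len and iteration). The stand-in class below is a minimal sketch of how that protocol can map onto column-wise storage. MiniFrame is a hypothetical name, it uses plain strings as keys where DSFDataFrame accepts a DSFColumn, and it says nothing about how DSFDataFrame is implemented internally.

```python
from typing import Any, Dict, Iterator, List


class MiniFrame:
    """Hypothetical stand-in for DSFDataFrame: column names plus column-wise data."""

    def __init__(self, schema: List[str], data: Dict[str, List[Any]]) -> None:
        self._data = {name: list(data[name]) for name in schema}

    def __getitem__(self, name: str) -> List[Any]:
        return self._data[name]

    def __setitem__(self, name: str, values: List[Any]) -> None:
        # Plays the role of add_column: register the column and store its data.
        self._data[name] = list(values)

    def __delitem__(self, name: str) -> None:
        # Plays the role of remove_column.
        del self._data[name]

    def __len__(self) -> int:
        # Number of rows (assumes all columns have equal length).
        return len(next(iter(self._data.values()), []))

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # Yield one dict per row, keyed by column name.
        for i in range(len(self)):
            yield {name: col[i] for name, col in self._data.items()}


df = MiniFrame(["a", "b"], {"a": [1, 2, 3], "b": [4, 5, 6]})
df["d"] = [b * a for b, a in zip(df["b"], df["a"])]
del df["b"]
rows = list(df)
print(len(df))  # 3
print(rows[0])  # {'a': 1, 'd': 4}
```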