abeja_dsf.core.calculation.data_type.dataframe
In DSF, DataFrame is the most common data type.
DataFrame generally refers to “tabular” data: a data structure representing cases (rows), each of which consists of a number of observations or measurements (columns).
Please see the wiki for background on the concept of a dataframe.
The actual class is DSFDataFrame, which inherits from DSFData.
DSFColumnType annotates a DSFColumn with the kind of data the column stores.
This is not a data type such as int or float; it is the column's level of measurement.
| variable name | string name | description |
|---|---|---|
| DSFColumnType.UNKNOWN_COLUMN | unknown | unknown data |
| DSFColumnType.QUANTITY_COLUMN | quantity | quantity data |
| DSFColumnType.RATIO_QUANTITY_COLUMN | ratio | quantity on an absolute (ratio) scale |
| DSFColumnType.INTERVAL_QUANTITY_COLUMN | interval | quantity on a relative (interval) scale |
| DSFColumnType.CATEGORY_COLUMN | category | label data |
| DSFColumnType.NOMINAL_CATEGORY_COLUMN | nominal | labels whose numbers carry no meaning or order |
| DSFColumnType.ORDINAL_CATEGORY_COLUMN | ordinal | labels whose numbers are ordered, so a larger number ranks higher |
from abeja_dsf.core.calculation.data_type.dataframe import *
DSFColumnType.UNKNOWN_COLUMN
DSFColumnType.QUANTITY_COLUMN
DSFColumnType.RATIO_QUANTITY_COLUMN
DSFColumnType.INTERVAL_QUANTITY_COLUMN
DSFColumnType.CATEGORY_COLUMN
DSFColumnType.NOMINAL_CATEGORY_COLUMN
DSFColumnType.ORDINAL_CATEGORY_COLUMN
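The difference between a data type and a level of measurement can be sketched without DSF. The `ColumnType` enum below is a hypothetical stand-in, not the real DSFColumnType; its values are the string names from the table, and the quantity/category grouping is an assumption read off the member names.

```python
from enum import Enum


class ColumnType(Enum):
    """Hypothetical stand-in for DSFColumnType (values = string names)."""
    UNKNOWN = "unknown"
    QUANTITY = "quantity"
    RATIO = "ratio"
    INTERVAL = "interval"
    CATEGORY = "category"
    NOMINAL = "nominal"
    ORDINAL = "ordinal"


# Assumed grouping, inferred from names such as RATIO_QUANTITY_COLUMN.
QUANTITY_TYPES = {ColumnType.QUANTITY, ColumnType.RATIO, ColumnType.INTERVAL}
CATEGORY_TYPES = {ColumnType.CATEGORY, ColumnType.NOMINAL, ColumnType.ORDINAL}


def supports_mean(ctype: ColumnType) -> bool:
    """Averaging is meaningful for quantities, not for labels."""
    return ctype in QUANTITY_TYPES


# Both columns would store Python ints, yet they are measured differently.
user_id = (int, ColumnType.NOMINAL)  # 42 is a label, not an amount
price = (int, ColumnType.RATIO)      # 42 is an amount with a true zero

print(supports_mean(user_id[1]))  # False
print(supports_mean(price[1]))    # True
```

Keeping the level of measurement separate from the storage type lets code reject operations, such as averaging user IDs, that the storage type alone would allow.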
DSFColumn holds the schema information for one column of a DSFDataFrame.
It is very similar to a column of an RDB table.
def __init__(self, name: str, dtype: Type[dsf_T], ctype: DSFColumnType, nullable: bool = True, default: Optional[dsf_T] = None)
Create a DSFColumn instance. Information about dsf_T is here.
def as_name(self, name: str)
Create a copy with a new name.
@classmethod def from_definition(cls, definition: DSFColumnDefinition)
Create a DSFColumn instance from a DSFColumnDefinition.
from abeja_dsf.core.calculation.data_type.dtypes import *
from abeja_dsf.core.calculation.data_type.dataframe import *
user_id_column = DSFColumn('user_id', dsf_int, DSFColumnType.NOMINAL_CATEGORY_COLUMN, True, -1)
renamed_column = user_id_column.as_name('copied_user_id')
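The copy-with-a-new-name behaviour of as_name can be illustrated with a small frozen dataclass. `Column` here is a hypothetical stand-in with the same fields as the DSFColumn constructor; how DSF implements the copy internally is not documented here, so treat this only as a sketch of the observable effect.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for DSFColumn: schema only, no data."""
    name: str
    dtype: type
    ctype: str
    nullable: bool = True
    default: Optional[Any] = None

    def as_name(self, name: str) -> "Column":
        # Return a copy that differs only in its name; self is unchanged.
        return replace(self, name=name)


user_id = Column("user_id", int, "nominal", True, -1)
renamed = user_id.as_name("copied_user_id")

print(user_id.name)     # user_id
print(renamed.name)     # copied_user_id
print(renamed.default)  # -1
```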
from abeja_dsf.core.calculation.data_type.dataframe import *
IntIdSeries(name='int_category_data', nullable=True, default=None)
IntValueSeries(name='int_quantity_data', nullable=True, default=None)
StrIdSeries(name='str_category_data', nullable=True, default=None)
ValueSeries(name='float_quantity_data', nullable=True, default=None)
BinarySeries(name='bool_quantity_data', nullable=True, default=None)
TimestampSeries(name='timestamp_quantity_data', nullable=True, default=None)
IntervalSeries(name='interval_quantity_data', nullable=True, default=None)
UnknownSeries(name='unknown_data', nullable=True, default=None)
DSFColumnDefinition is used to specify column information in DSFComponent.Arguments.
You can also use ColDef instead of DSFColumnDefinition.
def __init__(self, name: str, dtype: str = 'object', ctype: str = 'unknown', nullable: bool = True, default: Optional[dsf_T] = None)
Create a DSFColumnDefinition instance. Fill dtype with one of the string names listed here, and fill ctype with one of the string names listed here.
from abeja_dsf.core.calculation.data_type.dtypes import *
from abeja_dsf.core.calculation.data_type.dataframe import *
user_id_column = DSFColumn.from_definition(DSFColumnDefinition('user_id', 'int', 'nominal', True, -1))
user_id_column = DSFColumn.from_definition(ColDef('user_id', 'int', 'nominal', True, -1))
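A definition carries only strings and plain values, and from_definition turns it into a typed column. The sketch below shows one way such a lookup could work. `ColumnDefinition`, `Column`, `DTYPES` and `CTYPES` are all hypothetical; of the dtype names, only 'int' and 'object' are confirmed by this page, and the rest are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumed lookup tables; only 'int' and 'object' are confirmed dtype names.
DTYPES = {"object": object, "int": int, "float": float, "str": str, "bool": bool}
CTYPES = {"unknown", "quantity", "ratio", "interval",
          "category", "nominal", "ordinal"}


@dataclass(frozen=True)
class ColumnDefinition:
    """Hypothetical stand-in for DSFColumnDefinition: plain data only."""
    name: str
    dtype: str = "object"
    ctype: str = "unknown"
    nullable: bool = True
    default: Optional[Any] = None


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for DSFColumn: dtype is a real Python type."""
    name: str
    dtype: type
    ctype: str
    nullable: bool
    default: Optional[Any]

    @classmethod
    def from_definition(cls, d: ColumnDefinition) -> "Column":
        # Reject names that are not in the lookup tables.
        if d.dtype not in DTYPES:
            raise ValueError(f"unknown dtype name: {d.dtype!r}")
        if d.ctype not in CTYPES:
            raise ValueError(f"unknown ctype name: {d.ctype!r}")
        return cls(d.name, DTYPES[d.dtype], d.ctype, d.nullable, d.default)


user_id = Column.from_definition(
    ColumnDefinition("user_id", "int", "nominal", True, -1))
print(user_id.dtype)  # <class 'int'>
```

Validating the names at conversion time means a misspelled dtype or ctype fails early instead of surfacing later during a calculation.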
def __init__(self, schema: Union[Dict[str, DSFColumn], List[DSFColumn]], data: Union[pandas.DataFrame, Dict[str, Iterable]])
Create a DSFDataFrame instance.
def column(self, name: str) -> Optional[DSFColumn]
Get a DSFColumn by name.
def columns(self) -> List[DSFColumn]
Get all DSFColumns.
def add_column(self, column: DSFColumn, series: Optional[pandas.Series] = None)
Add a new DSFColumn to the DSFDataFrame, together with its data.
def remove_column(self, name: str)
Delete a DSFColumn from the DSFDataFrame.
def union(self, dfs: Iterable[DSFDataFrame]) -> DSFDataFrame
Create a new DSFDataFrame by unifying self and dfs.
from abeja_dsf.core.calculation.data_type.dataframe import *
df = DSFDataFrame([IntIdSeries('a'), IntValueSeries('b')],
{'a': [1,2,3,4,5], 'b': [4,5,6,7,8]})
df[ValueSeries('c')] = [0.0, 0.1, 0.2, 0.3, 0.4]
df[df.column('b').as_name('d')] = [b * a for b, a in zip(df['b'], df['a'])]
del df['b']
len(df)
for row in df:
    print(row['a'], row['c'], row['d'])
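To make the values in this walkthrough concrete, the same steps can be traced with plain Python, using a dict of lists in place of the frame. This is not DSF code. It assumes that len(df) counts rows and that iterating a DSFDataFrame yields one row at a time, as the loop above suggests.

```python
# Plain-Python trace of the walkthrough; a dict of lists stands in
# for the DSFDataFrame (no schema is tracked here).
data = {"a": [1, 2, 3, 4, 5], "b": [4, 5, 6, 7, 8]}

data["c"] = [0.0, 0.1, 0.2, 0.3, 0.4]                      # add column c
data["d"] = [b * a for b, a in zip(data["b"], data["a"])]  # d = b * a
del data["b"]                                              # drop column b

# Assumed meaning of len(df): the number of rows.
n_rows = len(data["a"])

# Assumed meaning of iteration: one mapping per row.
rows = [{name: values[i] for name, values in data.items()}
        for i in range(n_rows)]

print(n_rows)     # 5
print(data["d"])  # [4, 10, 18, 28, 40]
for row in rows:
    print(row["a"], row["c"], row["d"])
```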