ABEJA DSF

Date: 2017-10-12

Version: 1.0.0

Source Repository:

core library: https://github.com/abeja-inc/dsf_core.git
io library: https://github.com/abeja-inc/dsf_io_tools.git
calculation modules:

What is DSF

DSF (Data Science Framework) is a Python-based framework for composing and executing data analysis logic.

Let’s get started by installing DSF and creating a project.

Purpose

The purpose of DSF is to let data scientists focus more on logic by supporting the points below:

Quick prototyping
Clean and reusable code
System building

Design concepts

1. Complex logic with simple syntax

In DSF, all logic is written as a pipeline. A pipeline starts with data loading tasks, runs any number of calculation tasks, and ends with data writing tasks. This structure makes the logic easier for readers to follow.
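
The load, calculate, write shape described above can be sketched in plain Python. This is only an illustration of the idea, not DSF's API: the Pipeline class, its add and run methods, and the task functions are all hypothetical names invented for this example.

```python
# Hypothetical illustration only; none of these names come from DSF.
from typing import Any, Callable, List


class Pipeline:
    """Toy pipeline: each task receives the previous task's output."""

    def __init__(self) -> None:
        self.tasks: List[Callable[[Any], Any]] = []

    def add(self, task: Callable[[Any], Any]) -> "Pipeline":
        self.tasks.append(task)
        return self  # returning self allows chained calls

    def run(self, initial: Any = None) -> Any:
        data = initial
        for task in self.tasks:
            data = task(data)
        return data


written: List[int] = []  # stands in for an external data store


def load(_: Any) -> List[int]:
    """Data loading task: ignores its input and produces the raw data."""
    return [1, 2, 3, 4]


def double(values: List[int]) -> List[int]:
    """Calculation task."""
    return [v * 2 for v in values]


def total(values: List[int]) -> int:
    """Calculation task."""
    return sum(values)


def write(value: int) -> int:
    """Data writing task: appends the result to the stand-in store."""
    written.append(value)
    return value


result = Pipeline().add(load).add(double).add(total).add(write).run()
```

Reading the last line from left to right gives the whole flow of the logic, which is the readability benefit the pipeline structure is meant to provide.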

2. Built-in calculation without implementation

Calculation components define how calculations behave. All you need to do with DSF is choose calculation components and connect them into a pipeline.
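
As a rough sketch of "choose components and connect them", the example below selects ready-made components by name from a registry. The registry, the component names, and the connect function are assumptions made for illustration; DSF's real components and how they are selected may look quite different.

```python
# Hypothetical illustration only; BUILT_IN and connect are not DSF names.
from itertools import accumulate
from typing import Callable, Dict, List

Component = Callable[[List[int]], List[int]]

# Ready-made components: the user picks them instead of implementing them.
BUILT_IN: Dict[str, Component] = {
    "drop_negative": lambda xs: [x for x in xs if x >= 0],
    "square": lambda xs: [x * x for x in xs],
    "cumulative_sum": lambda xs: list(accumulate(xs)),
}


def connect(names: List[str]) -> Component:
    """Look up each named component and chain them in the given order."""
    steps = [BUILT_IN[name] for name in names]

    def pipeline(values: List[int]) -> List[int]:
        for step in steps:
            values = step(values)
        return values

    return pipeline


pipeline = connect(["drop_negative", "square", "cumulative_sum"])
output = pipeline([3, -1, 2])
```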

3. Reusable packaged logic

You can define a new calculation component from a large pipeline by packaging it as a calculation component set. Calculation component sets make it easy to manage, accumulate, and reuse logic.
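
One way to picture a calculation component set is as function composition: several components are packaged into one callable that can itself be used as a component. The component_set helper and the small components below are hypothetical and only illustrate the idea, not how DSF implements it.

```python
# Hypothetical illustration only; component_set is not a DSF function.
from functools import reduce
from typing import Callable, List

Component = Callable[[List[int]], List[int]]


def component_set(*components: Component) -> Component:
    """Package several components into one reusable component."""

    def packaged(values: List[int]) -> List[int]:
        # Apply the components left to right.
        return reduce(lambda acc, component: component(acc), components, values)

    return packaged


def drop_zeros(values: List[int]) -> List[int]:
    return [v for v in values if v != 0]


def to_abs(values: List[int]) -> List[int]:
    return [abs(v) for v in values]


def sort_desc(values: List[int]) -> List[int]:
    return sorted(values, reverse=True)


# A set behaves like a single component, so sets can be nested and reused.
clean = component_set(drop_zeros, to_abs)
ranked = component_set(clean, sort_desc)

result = ranked([0, -3, 1, 0, -2])
```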

4. Data store interface without implementation

DSF has many types of data warehousing rules. One of them will probably fit where you want to store your data and the format of that data.
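
The idea of a common data store interface with interchangeable backends can be sketched as follows. The DataStore, MemoryStore, and CsvStore classes are assumptions invented for this example and are not taken from the DSF io library; they only show how pipeline code can stay the same while the storage location and format change.

```python
# Hypothetical illustration only; these classes are not part of DSF.
import csv
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, str]


class DataStore(ABC):
    """Common interface: callers do not care where or how rows are stored."""

    @abstractmethod
    def write(self, rows: List[Row]) -> None: ...

    @abstractmethod
    def read(self) -> List[Row]: ...


class MemoryStore(DataStore):
    def __init__(self) -> None:
        self._rows: List[Row] = []

    def write(self, rows: List[Row]) -> None:
        self._rows = list(rows)

    def read(self) -> List[Row]:
        return list(self._rows)


class CsvStore(DataStore):
    def __init__(self, path: str) -> None:
        self.path = path

    def write(self, rows: List[Row]) -> None:
        # Assumes at least one row; the header is taken from the first row.
        with open(self.path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)

    def read(self) -> List[Row]:
        with open(self.path, newline="") as f:
            return [dict(row) for row in csv.DictReader(f)]


def round_trip(store: DataStore, rows: List[Row]) -> List[Row]:
    """Works with any store because it relies only on the interface."""
    store.write(rows)
    return store.read()


rows = [{"id": "1", "value": "10"}, {"id": "2", "value": "20"}]
csv_path = os.path.join(tempfile.mkdtemp(), "rows.csv")
from_memory = round_trip(MemoryStore(), rows)
from_csv = round_trip(CsvStore(csv_path), rows)
```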

5. Automatic scheduling of logic

DSF has a batch scheduler and executor. This feature automatically schedules and executes your pipelines according to their dependency relationships. The schedule interval can be set in cron style.
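
Ordering work by dependency relationship is commonly done with a topological sort. The sketch below uses Kahn's algorithm on a hypothetical set of pipeline names; it is not DSF's scheduler, and the execution_order function, the pipeline names, and the SCHEDULE constant are assumptions made for illustration.

```python
# Hypothetical illustration only; this is not DSF's scheduler.
from collections import deque
from typing import Dict, List

# A cron-style interval has five fields: minute, hour, day of month,
# month, day of week. This value means "every day at 03:00".
SCHEDULE = "0 3 * * *"


def execution_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return an order in which every pipeline runs after its dependencies.

    `dependencies` maps each pipeline to the pipelines it depends on.
    Every pipeline must appear as a key (Kahn's algorithm).
    """
    remaining = {name: set(deps) for name, deps in dependencies.items()}
    ready = deque(sorted(name for name, deps in remaining.items() if not deps))
    order: List[str] = []
    while ready:
        done = ready.popleft()
        order.append(done)
        for name in sorted(remaining):
            deps = remaining[name]
            if done in deps:
                deps.remove(done)
                if not deps:  # all dependencies have finished
                    ready.append(name)
    if len(order) != len(remaining):
        raise ValueError("cyclic or unknown dependency")
    return order


dependencies = {
    "load": [],
    "clean": ["load"],
    "aggregate": ["clean"],
    "report": ["aggregate", "clean"],
    "write": ["report"],
}
order = execution_order(dependencies)
```

A scheduler built along these lines would trigger the run at the times described by the cron-style interval and execute the pipelines in the computed order.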

6. High capacity for big data

It is hard to manipulate very big data on a single computer, but DSF can partition the data and execute calculations on the partitions. However, this is slow because the current version has no parallel processing feature.
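
The trade-off described above can be illustrated with a minimal sketch of partitioned, sequential processing: only one partition is held at a time, and partial results are combined at the end. The partitions and partitioned_mean functions are hypothetical and do not reflect how DSF actually partitions data.

```python
# Hypothetical illustration only; not DSF's partitioning mechanism.
from typing import Iterable, Iterator, List


def partitions(values: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `size` items."""
    chunk: List[int] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # last, possibly smaller, partition
        yield chunk


def partitioned_mean(values: Iterable[int], size: int) -> float:
    """Compute a mean without holding all values in memory at once."""
    total = 0
    count = 0
    # Partitions are processed one after another, not in parallel.
    for chunk in partitions(values, size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


mean = partitioned_mean(range(1, 101), size=10)
```

Because each partition is handled in turn, memory use stays bounded by the partition size, while the total run time grows with the number of partitions.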