Reading and Writing Data

A ts.flint.FlintContext is similar to a pyspark.sql.SQLContext in that it is the main entry point for reading Two Sigma data sources into a ts.flint.TimeSeriesDataFrame.
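
For example, a FlintContext wraps an existing SQLContext. A minimal sketch, assuming a sqlContext is already available in your session:

>>> from ts.flint import FlintContext
>>> flintContext = FlintContext(sqlContext)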

Converting other data sources to TimeSeriesDataFrame

You can also use a ts.flint.FlintContext to convert an existing pandas.DataFrame or pyspark.sql.DataFrame to a ts.flint.TimeSeriesDataFrame so that you can take advantage of Flint's time-aware functionality:

>>> import pandas as pd
>>> df1 = flintContext.read.pandas(pd.read_csv(path))
>>> df2 = (flintContext.read
...        .option('isSorted', False)
...        .dataframe(sqlContext.read.parquet(hdfs_path)))
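
Once converted, the resulting TimeSeriesDataFrame supports Flint's time-aware operations. As a sketch, assuming the data has a hypothetical 'price' column, you could compute a simple summary:

>>> from ts.flint import summarizers
>>> df2.summarize(summarizers.mean('price'))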

Writing temporary data to HDFS

You can materialize a pyspark.sql.DataFrame to HDFS and read it back later, either to save data between sessions or to cache the result of some preprocessing.

>>> import getpass
>>> filename = 'hdfs:///user/{}/filename.parquet'.format(getpass.getuser())
>>> df.write.parquet(filename)
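
By default, Spark raises an error if the target path already exists. To overwrite the output of a previous run, pass a save mode (standard pyspark.sql writer API):

>>> df.write.mode('overwrite').parquet(filename)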

The Apache Parquet format is a good fit for most tabular data sets that we work with in Flint.

To read a sequence of Parquet files, use the flintContext.read.parquet method. This method assumes the Parquet data is sorted by time; if the underlying data is not sorted on time, pass .option('isSorted', False) to the reader:

>>> ts_df1 = flintContext.read.parquet(hdfs_path)  # assumes sorted by time
>>> ts_df2 = (flintContext.read
...           .option('isSorted', False)
...           .parquet(hdfs_path))  # sorts the data by time during the read
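
Sorting unsorted input adds an extra shuffle, so when you control the data it is usually cheaper to write it already sorted by time and let the reader skip that step.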