Reading and Writing Data

A ts.flint.FlintContext is similar to a pyspark.sql.SQLContext in that it is the main entry point for reading Two Sigma data sources into a ts.flint.TimeSeriesDataFrame.
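
For example, a FlintContext wraps an existing SQLContext. A minimal sketch, assuming a sqlContext is already available in your session:

>>> from ts.flint import FlintContext
>>> flintContext = FlintContext(sqlContext)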

Converting other data sources to TimeSeriesDataFrame

You can also use a ts.flint.FlintContext to convert an existing pandas.DataFrame or pyspark.sql.DataFrame to a ts.flint.TimeSeriesDataFrame so that you can take advantage of Flint's time-aware functionality:

>>> import pandas as pd
>>> df1 = flintContext.read.pandas(pd.read_csv(path))
>>> df2 = (flintContext.read
...        .option('isSorted', False)
...        .dataframe(sqlContext.read.parquet(hdfs_path)))
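
Once converted, the resulting TimeSeriesDataFrame supports Flint's time-aware operations. As a sketch, assuming the data has a hypothetical 'price' column, you could compute a simple summary:

>>> from ts.flint import summarizers
>>> df2.summarize(summarizers.mean('price'))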

Writing temporary data to HDFS

You can materialize a pyspark.sql.DataFrame to HDFS and read it back later, either to save data between sessions or to cache the result of some preprocessing.

>>> import getpass
>>> filename = 'hdfs:///user/{}/filename.parquet'.format(getpass.getuser())
>>> df.write.parquet(filename)
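
By default, Spark raises an error if the target path already exists. To overwrite the output of a previous run, pass a save mode (standard pyspark.sql writer API):

>>> df.write.mode('overwrite').parquet(filename)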

The Apache Parquet format is a good fit for most tabular data sets that we work with in Flint.

To read a sequence of Parquet files, use the flintContext.read.parquet method. This method assumes the Parquet data is sorted by time; if the underlying data is not sorted on time, pass .option('isSorted', False) to the reader:

>>> ts_df1 = flintContext.read.parquet(hdfs_path)  # assumes sorted by time
>>> ts_df2 = (flintContext.read
...           .option('isSorted', False)
...           .parquet(hdfs_path))  # sorts the data by time during the read
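
Sorting unsorted input adds an extra shuffle, so when you control the data it is usually cheaper to write it already sorted by time and let the reader skip that step.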