Reading and Writing Data
A ts.flint.FlintContext is similar to a pyspark.sql.SQLContext in that it is the main entry point for reading Two Sigma data sources into a ts.flint.TimeSeriesDataFrame.
Converting other data sources to TimeSeriesDataFrame
You can also use a ts.flint.FlintContext to convert an existing pandas.DataFrame or pyspark.sql.DataFrame to a ts.flint.TimeSeriesDataFrame in order to take advantage of its time-aware functionality:
>>> df1 = flintContext.read.pandas(pd.read_csv(path))
>>> df2 = (flintContext.read
... .option('isSorted', False)
... .dataframe(sqlContext.read.parquet(hdfs_path)))
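Before calling flintContext.read.pandas, the pandas frame needs a time column for Flint to key on. A minimal local sketch of preparing one, assuming Flint's default time column name 'time' and that the reader expects rows sorted ascending unless .option('isSorted', False) is set:

```python
import pandas as pd

# Build a small frame with a 'time' column (assumed to be Flint's default
# time column name) and sort it ascending by time.
raw = pd.DataFrame({
    'time': pd.to_datetime(['2017-01-03', '2017-01-01', '2017-01-02']),
    'price': [101.0, 100.0, 100.5],
})
prepared = raw.sort_values('time').reset_index(drop=True)
assert prepared['time'].is_monotonic_increasing

# With a live flintContext this could then become a TimeSeriesDataFrame:
#   ts_df = flintContext.read.pandas(prepared)
```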
Writing temporary data to HDFS
You can materialize a pyspark.sql.DataFrame to HDFS and read it back later, to save data between sessions or to cache the result of some preprocessing.
>>> import getpass
>>> filename = 'hdfs:///user/{}/filename.parquet'.format(getpass.getuser())
>>> df.write.parquet(filename)
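In a later session, the same per-user path can be rebuilt to read the data back. A minimal sketch; the Spark calls are left as comments because they require a live sqlContext and flintContext:

```python
import getpass

# Rebuild the per-user scratch path used above.
filename = 'hdfs:///user/{}/filename.parquet'.format(getpass.getuser())

# Read the file back and, if time-aware operations are needed, convert it
# (sketch; requires a live Spark session):
#   df = sqlContext.read.parquet(filename)
#   ts_df = flintContext.read.option('isSorted', False).dataframe(df)
```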
The Apache Parquet format is a good fit for most tabular data sets that we work with in Flint. To read a sequence of Parquet files, use the flintContext.read.parquet method. This method assumes the Parquet data is sorted by time; you can pass the .option('isSorted', False) option to the reader if the underlying data is not sorted on time:
>>> ts_df1 = flintContext.read.parquet(hdfs_path) # assumes sorted by time
>>> ts_df2 = (flintContext.read
... .option('isSorted', False)
... .parquet(hdfs_path)) # Flint sorts the data by time on load