https://www.kaggle.com/dedecu/cross-correlation-time-lag-with-pandas
-
Analysis with Pandas
-
DataFrame Slice
df_new = df['col1']
DataFrame loc
df_new = df.loc[:, 'col1']
df.loc[1:3, 'col2'] = 5  # change data in the dataframe
- Better performance compared to the Slice method.
- Modifies the original dataframe (see the sketch below).
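A minimal sketch of the loc behavior above, with a made-up dataframe; note that label slices in loc are inclusive of the end label:
import pandas as pd

# toy dataframe (made-up data)
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [10, 20, 30, 40]})

df_new = df['col1']          # slice: selects a column
df_new = df.loc[:, 'col1']   # loc: label-based selection of the same column

# loc assignment modifies the original dataframe in place
# (rows 1, 2 and 3 change, since label slices include the end label)
df.loc[1:3, 'col2'] = 5
print(df)
#    col1  col2
# 0     1    10
# 1     2     5
# 2     3     5
# 3     4     5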
-
Boxplot chart
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary:
- median (Q2/50th percentile): the middle value of the dataset.
- first quartile (Q1/25th percentile): the middle number between the smallest value (not the “minimum”) and the median of the dataset.
- third quartile (Q3/75th percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.
- interquartile range (IQR): the range from the 25th to the 75th percentile.
- whiskers: the lines extending from the box out to the “minimum” and “maximum”.
- outliers: points plotted beyond the whiskers.
- “maximum”: Q3 + 1.5*IQR
- “minimum”: Q1 - 1.5*IQR
https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
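A quick sketch of drawing one with pandas' built-in plotting (toy data; the column names are made up):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# toy data: one symmetric column, one skewed column with outliers
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'normal': rng.normal(0, 1, 200),
    'skewed': rng.exponential(1, 200),
})

df.boxplot()  # pandas wraps matplotlib's boxplot
plt.show()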
-
Chart Libraries
-
What is a regression coefficient?
Linear relationships, i.e. lines, are easier to work with, and many phenomena are approximately linearly related. If variables aren’t linearly related, some math can transform that relationship into a linear one, so that it’s easier for the researcher (i.e. you) to understand.
Regression analysis is used to find equations that fit data. Once we have the regression equation, we can use the model to make predictions. One type of regression analysis is linear analysis. When a correlation coefficient shows that the data is likely to be able to predict future outcomes, and a scatter plot of the data appears to form a straight line, you can use simple linear regression to find a predictive function. If you recall from elementary algebra, the equation for a line is y = mx + b. This article shows you how to take data, calculate linear regression, and find the equation y’ = a + bx.
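As a minimal sketch of that workflow with numpy (the data points below are invented): np.polyfit with degree 1 returns the slope and intercept of the least-squares line.
import numpy as np

# toy data with a roughly linear relationship
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

# fit y' = a + b*x; polyfit returns coefficients highest degree first
b, a = np.polyfit(x, y, 1)  # b = slope, a = intercept
print(f"y' = {a:.2f} + {b:.2f}x")

# use the fitted model to make a prediction
x_new = 6.0
print(a + b * x_new)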
-
from tqdm import tqdm_notebook

# register progress_map/progress_apply on pandas objects
tqdm_notebook().pandas()

# like map(), but with a progress bar (assumes a string column)
data['column_1'].progress_map(lambda x: x.count('e'))
-
Different ways to iterate over rows in a Pandas Dataframe — performance comparison
- Column operations and apply() are both relatively fast
- Selection using at() and iat() is faster than loc()
- Location-based indexing on a numpy array is faster than location-based indexing on a pandas dataframe
- zip() is relatively fast for small datasets - even faster than apply() for N < 1000
- iat() and at() indexing can be 30 times faster than loc()
- loc() is slower than expected, even when accessing by index
- If I cannot achieve what I want using column operations or apply(), I will use zip() instead (not iterrows()!)
- I will avoid using loc() for updating or accessing a single value; I'll use iat() and at() instead
- Consider extracting the underlying values as a numpy array, then performing the processing/analysis (see the sketch below)
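A rough timing sketch of these options on a toy dataframe (size and columns made up); the comments note the typical speed ordering, but measure on your own data:
import numpy as np
import pandas as pd

n = 1_000
df = pd.DataFrame({'a': np.random.rand(n), 'b': np.random.rand(n)})

# 1. Column operation: vectorized, usually fastest
out = df['a'] + df['b']

# 2. apply() row-wise: convenient but slower
out = df.apply(lambda row: row['a'] + row['b'], axis=1)

# 3. zip() over columns: often beats apply() for small N
out = [a + b for a, b in zip(df['a'], df['b'])]

# 4. iterrows(): slowest of the lot; avoid in hot loops
out = [row['a'] + row['b'] for _, row in df.iterrows()]

# 5. Drop to the underlying numpy array for location-based access
arr = df.to_numpy()
out = arr[:, 0] + arr[:, 1]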
-
Multi-Processing with Pandas and Dask
import dask.dataframe as dd

df = dd.read_csv(r"C:\temp\yellow_tripdata_2009-01.csv")
It is important to understand that, unlike the pandas read_csv, the above command does not actually load the data. It does some data inference and leaves the other aspects for later.
Using the npartitions attribute, we can see how many partitions the data will be broken into for loading. Viewing the raw df object would give you a shell of the dataframe, with columns and datatypes inferred. The actual data is not loaded yet.
df.npartitions
# The computation is deferred until we explicitly compute it
size = df.size
size, type(size)

%%time
size.compute()  # 48 s
This computation comes back with 25MM rows, and it actually took a while. That is because when we compute size, we are not only calculating the size of the data, we are also actually loading the dataset. Now you might think that is not very efficient. There are a couple of approaches you can take:
If you have access to a (cluster of) computers with large enough RAM, then you can load and persist the data in memory. Subsequent computations will compute in memory and will be a lot faster. This also allows you to do many computations much like using pandas, but in a distributed paradigm.
Another approach is to set up a whole bunch of deferred computations and to compute out of core. Dask will then intelligently load data and process all the computations in one pass by figuring out the various dependencies. This is a great approach if you don't have a lot of RAM available.
# To load the data in memory, use the persist method on the df object
df = df.persist()

%%time
df.size.compute()  # 35 ms
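As a sketch of the deferred, single-pass approach: several lazy results can be evaluated together with dask.compute, sharing the data loading (the 'Fare_Amt' column name below is an assumption about this dataset's schema; check df.columns):
import dask
import dask.dataframe as dd

df = dd.read_csv(r"C:\temp\yellow_tripdata_2009-01.csv")

# Build up several deferred computations; nothing is read yet
total = df.size
fare_mean = df['Fare_Amt'].mean()  # 'Fare_Amt' is an assumed column name

# A single compute() call evaluates both in one pass over the data
total, fare_mean = dask.compute(total, fare_mean)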
http://gouthamanbalaraman.com/blog/distributed-processing-pandas-dask.html
-
Dataframe access (performance)
loc
Only works on the index.
# label based, but we can use position values
# to get the labels from the index object
df.loc[df.index[2], 'ColName'] = 3
iloc
Works on position.
# position based, but we can get the position
# from the columns object via the `get_loc` method
df.iloc[2, df.columns.get_loc('ColName')] = 3
ix
You can get data from the dataframe without it being in the index. (Deprecated since pandas 0.20 and removed in 1.0; use loc/iloc instead.)
at
Gets scalar values. It's a very fast loc.
Works very similarly to loc for scalar indexers. Cannot operate on array indexers. Can assign new indices and columns.
Advantage over loc: it is faster.
Disadvantage: you can't use arrays as indexers.
# label based, but we can use position values
# to get the labels from the index object
df.at[df.index[2], 'ColName'] = 3
iat
Gets scalar values. It's a very fast iloc.
Works similarly to iloc for scalar indexers. Cannot operate on array indexers. Cannot assign new indices and columns.
Advantage over iloc: it is faster.
Disadvantage: you can't use arrays as indexers.
# position based, but we can get the position
# from the columns object via the `get_loc` method
df.iat[2, df.columns.get_loc('ColName')] = 3
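A quick way to check the at-vs-loc speed claim yourself (toy dataframe; the exact ratio will vary by machine and pandas version):
import numpy as np
import pandas as pd
from timeit import timeit

df = pd.DataFrame(np.random.rand(1_000, 3), columns=['a', 'b', 'c'])

# scalar reads: at is typically several times faster than loc
t_loc = timeit(lambda: df.loc[500, 'b'], number=10_000)
t_at = timeit(lambda: df.at[500, 'b'], number=10_000)
print(f"loc: {t_loc:.3f}s   at: {t_at:.3f}s")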
https://stackoverflow.com/questions/28757389/pandas-loc-vs-iloc-vs-ix-vs-at-vs-iat
http://pyciencia.blogspot.com/2015/05/obtener-y-filtrar-datos-de-un-dataframe.html