• ghr

    Multi-Processing with Pandas and Dask

    It is important to understand that, unlike pandas read_csv, the Dask command below does not actually load the data. It infers the column names and data types, and leaves the rest of the work for later.

    import dask.dataframe as dd
    df = dd.read_csv(r"C:\temp\yellow_tripdata_2009-01.csv")

    Using the npartitions attribute, we can see how many partitions the data will be broken into for loading. Viewing the raw df object gives you a shell of the dataframe with the column names and data types inferred. The actual data is not loaded yet.

    # The computation is actually deferred until we call compute().
    size = df.size
    size, type(size)
    size.compute()  #48s

    This computation comes back with 25MM rows, and it took a while (48 seconds). That is because computing size does not only calculate the size of the data; it also actually loads the dataset. You might think that is not very efficient. There are a couple of approaches you can take:

    • If you have access to a computer (or a cluster of computers) with enough RAM, you can load and persist the data in memory. Subsequent computations then run in memory and are a lot faster. This also lets you do many computations much as you would with pandas, but in a distributed paradigm.
    • Another approach is to set up a whole batch of deferred computations and compute out of core. Dask then loads the data and processes all the computations in one pass by working out their dependencies. This is a great approach if you don't have a lot of RAM available.

    # One way to load the data into memory is the persist method on the df object.
    df = df.persist()
    df.size.compute() # 35ms

    posted in Technology
  • ghr

    I tried to run the same code, but with a larger CSV file. I generated one which is 5 times bigger than the previous (with 5 000 000 rows and a size of around 487 MB). I got the following results:

    • csv.DictReader took 9.799003601074219e-05 seconds
    • pd.read_csv took 11.01493215560913 seconds
    • pd.read_csv with chunksize took 11.402302026748657 seconds
    • dask.dataframe took 0.21671509742736816 seconds
    • datatable took 0.7201321125030518 seconds

    I re-ran the test with a CSV file of 10 000 000 rows and a size of around 990 MB. The results are the following:

    • csv.DictReader took 0.00013709068298339844 seconds
    • pd.read_csv took 23.0141019821167 seconds
    • pd.read_csv with chunksize took 24.249807119369507 seconds
    • dask.dataframe took 0.49848103523254395 seconds
    • datatable took 1.45100998878479 seconds

    Again ignoring the csv.DictReader, dask is by far the fastest. However, datatable also performs pretty well.

  • ghr



    (0) const reference parameter:

    Creature(const std::string &name) : m_name{name} { }
    • A passed lvalue binds to name, then is copied into m_name.
    • A passed rvalue binds to name, then is copied into m_name.

    (1) by-value parameter, then move:

    Creature(std::string name) : m_name{std::move(name)} { }
    • A passed lvalue is copied into name, then is moved into m_name.
    • A passed rvalue is moved into name, then is moved into m_name.

    (2) one overload per value category:

    Creature(const std::string &name) : m_name{name} { }
    Creature(std::string &&rname) : m_name{std::move(rname)} { }
    • A passed lvalue binds to name, then is copied into m_name.
    • A passed rvalue binds to rname, then is moved into m_name.

    As move operations are usually faster than copies, (1) is better than (0) if you pass a lot of temporaries. (2) is optimal in terms of copies/moves, but requires code repetition.
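    The copy/move counts claimed for the three constructor styles can be checked with a small counting type. This is only a sketch: Tracked, ByConstRef, ByValue, Overloaded and the helper functions are hypothetical names standing in for std::string and Creature, and the rvalue case uses std::move on a named object (with a true temporary, the compiler may elide the first move in the by-value style).

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for std::string that counts copies and moves.
int g_copies = 0;
int g_moves = 0;

struct Tracked {
    Tracked() = default;
    Tracked(const Tracked&) { ++g_copies; }
    Tracked(Tracked&&) noexcept { ++g_moves; }
};

// const reference parameter
struct ByConstRef {
    explicit ByConstRef(const Tracked& name) : m_name{name} {}
    Tracked m_name;
};

// by-value parameter, then move
struct ByValue {
    explicit ByValue(Tracked name) : m_name{std::move(name)} {}
    Tracked m_name;
};

// one overload per value category
struct Overloaded {
    explicit Overloaded(const Tracked& name) : m_name{name} {}
    explicit Overloaded(Tracked&& rname) : m_name{std::move(rname)} {}
    Tracked m_name;
};

struct Counts {
    int copies;
    int moves;
};

// Construct a C from an lvalue and report the copies/moves it caused.
template <typename C>
Counts from_lvalue() {
    Tracked source;
    g_copies = g_moves = 0;
    C creature(source);
    (void)creature;
    return Counts{g_copies, g_moves};
}

// Construct a C from an rvalue (std::move of a named object) and report the counts.
template <typename C>
Counts from_rvalue() {
    Tracked source;
    g_copies = g_moves = 0;
    C creature(std::move(source));
    (void)creature;
    return Counts{g_copies, g_moves};
}

// The asserts restate the per-style bullet points.
inline void check_counts() {
    Counts c = from_lvalue<ByConstRef>();
    assert(c.copies == 1 && c.moves == 0);
    c = from_rvalue<ByConstRef>();
    assert(c.copies == 1 && c.moves == 0);
    c = from_lvalue<ByValue>();
    assert(c.copies == 1 && c.moves == 1);
    c = from_rvalue<ByValue>();
    assert(c.copies == 0 && c.moves == 2);
    c = from_lvalue<Overloaded>();
    assert(c.copies == 1 && c.moves == 0);
    c = from_rvalue<Overloaded>();
    assert(c.copies == 0 && c.moves == 1);
}
```

    Calling check_counts() from main runs all six cases; the by-value style costs one extra move compared with the overload pair, which is usually cheap for std::string.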

    The code repetition can be avoided with perfect forwarding:


    template <typename T,
        std::enable_if_t<std::is_convertible_v<std::remove_cvref_t<T>, std::string>, int> = 0>
    Creature(T&& name) : m_name{std::forward<T>(name)} { }

    You might optionally want to constrain T in order to restrict the domain of types that this constructor can be instantiated with (as shown above; note that std::remove_cvref_t itself requires C++20). C++20 simplifies this with Concepts.
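    For completeness, here is a self-contained sketch of the constrained forwarding constructor inside a class. It is an illustration under assumptions, not the only way to write it: it swaps std::remove_cvref_t for std::decay_t and std::is_convertible_v for is_convertible<...>::value so that it also compiles before C++20, and the name() accessor and check_creature() helper are hypothetical additions.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

class Creature {
public:
    // Participates in overload resolution only when the decayed T converts to std::string.
    template <typename T,
              std::enable_if_t<
                  std::is_convertible<std::decay_t<T>, std::string>::value, int> = 0>
    explicit Creature(T&& name) : m_name{std::forward<T>(name)} {}

    const std::string& name() const { return m_name; }

private:
    std::string m_name;
};

// The constraint rejects unrelated argument types at compile time.
static_assert(!std::is_constructible<Creature, int>::value,
              "int is not convertible to std::string");

// Exercises the lvalue, rvalue and string-literal call patterns.
inline void check_creature() {
    std::string label = "Goblin";

    Creature from_lvalue{label};             // T = std::string&: label is copied
    assert(label == "Goblin");               // the source is left untouched
    assert(from_lvalue.name() == "Goblin");

    Creature from_rvalue{std::move(label)};  // T = std::string: label is moved from
    assert(from_rvalue.name() == "Goblin");

    Creature from_literal{"Orc"};            // T = const char(&)[4]
    assert(from_literal.name() == "Orc");

    Creature copy(from_lvalue);              // the template is excluded, so the
    assert(copy.name() == "Goblin");         // ordinary copy constructor is used
}
```

    Without the constraint, copying a non-const Creature would select the template instead of the copy constructor and fail to compile, which is one practical reason to restrict T.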

  • ghr

    GCE login authentication via browser

    gcloud auth application-default login

    This command is useful when you are developing code that would normally use a service account but need to run the code in a local development environment where it's easier to provide user credentials. The credentials will apply to all API calls that make use of the Application Default Credentials client library.

  • ghr

    References vs. Pointers

    A reference is an alias: a constant name for the address of an existing variable. You must initialize a reference when you declare it.

    int & iRef;   // Error: 'iRef' declared as reference but not initialized

    Once a reference is established to a variable, you cannot change the reference to reference another variable.

    To get the value pointed to by a pointer, you need to use the dereferencing operator * (e.g., if pNumber is an int pointer, *pNumber returns the value pointed to by pNumber). This is called dereferencing or indirection. To assign the address of a variable to a pointer, you need to use the address-of operator & (e.g., pNumber = &number).

    On the other hand, referencing and dereferencing are done on the references implicitly. For example, if refNumber is a reference (alias) to another int variable, refNumber returns the value of the variable. No explicit dereferencing operator * should be used. Furthermore, to assign an address of a variable to a reference variable, no address-of operator & is needed.

    /* References vs. Pointers (TestReferenceVsPointer.cpp) */
    #include <iostream>
    using namespace std;
    int main() {
       int number1 = 88, number2 = 22;
       // Create a pointer pointing to number1
       int * pNumber1 = &number1;  // Explicit referencing
       *pNumber1 = 99;             // Explicit dereferencing
       cout << *pNumber1 << endl;  // 99
       cout << &number1 << endl;   // 0x22ff18
       cout << pNumber1 << endl;   // 0x22ff18 (content of the pointer variable - same as above)
       cout << &pNumber1 << endl;  // 0x22ff10 (address of the pointer variable)
       pNumber1 = &number2;        // Pointer can be reassigned to store another address
       // Create a reference (alias) to number1
       int & refNumber1 = number1;  // Implicit referencing (NOT &number1)
       refNumber1 = 11;             // Implicit dereferencing (NOT *refNumber1)
       cout << refNumber1 << endl;  // 11
       cout << &number1 << endl;    // 0x22ff18
       cout << &refNumber1 << endl; // 0x22ff18
       //refNumber1 = &number2;     // Error! Reference cannot be re-assigned
                                    // error: invalid conversion from 'int*' to 'int'
       refNumber1 = number2;        // refNumber1 is still an alias to number1.
                                    // Assign value of number2 (22) to refNumber1 (and number1).
       cout << refNumber1 << endl;  // 22
       cout << number1 << endl;     // 22
       cout << number2 << endl;     // 22 (number2 itself was never modified)
       return 0;
    }

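    & operator

    • In a declaration (LHS) - declares a reference variable
    • In an expression (RHS) - address-of a variable

    * operator

    • In a declaration (LHS) - declares a pointer variable
    • In an expression (RHS) - de-referencing (value of the variable pointed to)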
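    The two roles of each operator can be seen side by side in a few lines. This is a minimal sketch; operator_roles is a hypothetical function name, and the variable names follow the earlier examples.

```cpp
#include <cassert>

// Walks through both roles of '&' and '*'; returns the final value of number.
inline int operator_roles() {
    int number = 88;

    int& refNumber = number;    // '&' in a declaration: refNumber is a reference (alias)
    int* pNumber = &number;     // '*' in a declaration: pNumber is a pointer
                                // '&' in an expression: address-of number

    *pNumber = 99;              // '*' in an expression: dereference, writes to number
    assert(number == 99 && refNumber == 99);

    refNumber = 11;             // no operator: the assignment goes through the alias
    assert(*pNumber == 11);
    assert(&refNumber == pNumber);  // address-of a reference yields number's address

    return number;
}
```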

  • ghr

    References (or Aliases) (&)

    /* Test reference declaration and initialization (TestReferenceDeclaration.cpp) */
    #include <iostream>
    using namespace std;
    int main() {
       int number = 88;          // Declare an int variable called number
       int & refNumber = number; // Declare a reference (alias) to the variable number
                                 // Both refNumber and number refer to the same value
       cout << number << endl;    // Print value of variable number (88)
       cout << refNumber << endl; // Print value of reference (88)
       refNumber = 99;            // Re-assign a new value to refNumber
       cout << refNumber << endl;
       cout << number << endl;    // Value of number also changes (99)
       number = 55;               // Re-assign a new value to number
       cout << number << endl;
       cout << refNumber << endl; // Value of refNumber also changes (55)
       return 0;
    }


    A reference works like a pointer that is dereferenced automatically. A reference is declared as an alias of a variable. Compilers typically implement it by storing the address of the variable.


  • ghr

    Initializing Pointers with Address-Of Operator (&)

    int number = 88;     // An int variable with a value
    int * pNumber;       // Declare a pointer variable called pNumber pointing to an int (or int pointer)
    pNumber = &number;   // Assign the address of the variable number to pointer pNumber
    int * pAnother = &number; // Declare another int pointer and init to address of the variable number


    Indirection or Dereferencing Operator (*)

    int number = 88;
    int * pNumber = &number;  // Declare and assign the address of variable number to pointer pNumber (0x22ccec)
    cout << pNumber << endl;  // Print the content of the pointer variable, which contains an address (0x22ccec)
    cout << *pNumber << endl; // Print the value "pointed to" by the pointer, which is an int (88)
    *pNumber = 99;            // Assign a value to where the pointer is pointed to, NOT to the pointer variable
    cout << *pNumber << endl; // Print the new value "pointed to" by the pointer (99)
    cout << number << endl;   // The value of variable number changes as well (99)

    The value of number has changed to 99 because:

    • pointer pNumber has been assigned the address of number
    • then *pNumber (the location pNumber points to) has been assigned the value 99
    • thus the value 99 is stored at the address of number

  • ghr

    constexpr vs inline functions:

    Both aim at performance. inline is a request to the compiler to expand the function body at the call site during compilation, which saves the function call overhead; the expressions inside an inline function are still normally evaluated at run time. constexpr is different: when the arguments are constant expressions, the call can be evaluated at compile time.
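    A short sketch of the difference, using hypothetical functions square, cube and square_plus_cube: the constexpr result can be used where the language requires a compile-time constant, while the inline function cannot.

```cpp
#include <cassert>

// constexpr: usable in constant expressions when the argument is itself constant.
constexpr int square(int x) { return x * x; }

// inline: an ordinary function as far as evaluation is concerned.
inline int cube(int x) { return x * x * x; }

// Checked by the compiler; nothing is computed at run time here.
static_assert(square(4) == 16, "square(4) is a constant expression");
// static_assert(cube(2) == 8, "");  // would not compile: cube is not constexpr

constexpr int kSide = square(5);     // initialized at compile time

// A constexpr function can still be called with run-time values.
inline int square_plus_cube(int n) {
    assert(n >= 0);                  // hypothetical precondition for this sketch
    return square(n) + cube(n);
}
```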

    constexpr with constructors:

    constexpr can also be applied to constructors and objects, subject to restrictions on what such a constructor may contain.

    // C++ program to demonstrate uses of constexpr in a constructor
    #include <iostream>
    using namespace std;
    // A class with a constexpr constructor and member function
    class Rectangle
    {
        int _h, _w;
    public:
        // A constexpr constructor
        constexpr Rectangle (int h, int w) : _h(h), _w(w) {}
        // const is needed (since C++14) so it can be called on a constexpr object
        constexpr int getArea () const { return _h * _w; }
    };
    // driver program to test function
    int main()
    {
        // Below object is initialized at compile time
        constexpr Rectangle obj(10, 20);
        cout << obj.getArea();
        return 0;
    }

    constexpr vs const

    They serve different purposes. constexpr is mainly for compile-time evaluation and optimization, while const is for objects that must not change after initialization, like the value of Pi.

    Both of them can be applied to member methods. Member methods are made const to make sure that there are no accidental changes by the method. On the other hand, the idea of using constexpr is to compute expressions at compile time so that time can be saved when code is run.

    const can only be applied to non-static member functions, whereas constexpr can be used with member and non-member functions, and even with constructors, on the condition that the argument and return types are literal types.
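    The sketch below puts the two keywords next to each other. Circle, kPi, doubled and run_time_area are hypothetical names chosen for illustration: const on a member function promises not to modify the object, while constexpr allows the call to be evaluated during compilation.

```cpp
#include <cassert>

constexpr double kPi = 3.141592653589793;   // compile-time constant

class Circle {
public:
    constexpr explicit Circle(double r) : m_r(r) {}

    // const and constexpr together: does not modify the object,
    // and may be evaluated at compile time.
    constexpr double area() const { return kPi * m_r * m_r; }

    // const only: read-only access, evaluated at run time.
    double radius() const { return m_r; }

    // Neither: modifies the object, so it cannot be called on a const Circle.
    void scale(double factor) { m_r *= factor; }

private:
    double m_r;
};

// constexpr also works on non-member functions; const does not apply here.
constexpr int doubled(int x) { return 2 * x; }

static_assert(doubled(21) == 42, "evaluated during compilation");
static_assert(Circle(1.0).area() > 3.14 && Circle(1.0).area() < 3.15,
              "constexpr constructor and member function");

// With a run-time radius the same member function runs at run time.
inline double run_time_area(double r) {
    assert(r >= 0.0);     // hypothetical precondition for this sketch
    const Circle c(r);    // const: fixed after initialization
    // c.scale(2.0);      // would not compile: scale() is not a const member function
    return c.area();
}
```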

