**Python** is the most widely used programming language among data scientists. **Data scientists** need to deal with complex problems, and the problem-solving process involves four major steps: data collection and cleaning, data exploration, data modeling, and data visualization. **Python** provides all the necessary tools to carry out this process effectively with dedicated libraries. In this article we present the most useful **Python libraries** for data manipulation, data visualization, and data modeling, which are certainly worth learning.

Pandas is the most powerful, flexible, and easy-to-use open-source library for data manipulation and analysis.

The name Pandas is derived from the term “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Wikipedia

It was created by the developer Wes McKinney in 2008 to make his work on financial analysis easier. It was built on top of the NumPy package, which means that much of NumPy's structure is used or replicated in Pandas. Part of the library is written in C for optimized performance.

Pandas provides high-performance data structures, such as **Series** (one-dimensional) and **DataFrames** (two-dimensional), that make working with data easy, fast and intuitive.

- **Multiple file formats supported**: Pandas supports importing or reading data from a wide range of file formats.
- **Input and output tools**: Easily read and write data to and from data structures, databases, and web services.
- **Intelligent label-based slicing, indexing, and sub-setting**: The tabular layout allows an effective organization of the data, and indexing provides a mechanism to label and access the data more efficiently.
- **Data handling**: Pandas offers a fast and efficient way to manipulate and explore data through a large number of functions.
- **Easily overcoming missing data**: Dropping or filling in missing data is one of the most recurrent tasks when working with raw data.
- **Data cleaning**: Pandas allows us to clean raw data by filtering only the data needed for further analysis.
- **Data masking**: Pandas helps us easily correct or change data based on certain criteria.
- **Intuitive merging, joining, reshaping and pivoting of data sets**: In data science it is very common to combine different data sets until reaching a final one.
- **Grouping**: The group-by functionality is very flexible and powerful. It allows us to combine or group data in different ways to derive new, relevant information.
- **Mathematical operations on the data**: Another essential feature of Pandas is the ability to generate calculated columns by applying mathematical operations or functions.
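A few of these features can be sketched in a short session; the DataFrame below is invented purely for illustration:

```python
import numpy as np
import pandas as pd

# A small invented dataset with one missing value.
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon", "Lyon"],
    "units": [10, 12, np.nan, 8],
    "price": [2.5, 2.5, 3.0, 3.0],
})

# Overcoming missing data: fill the gap with the column mean.
df["units"] = df["units"].fillna(df["units"].mean())

# Calculated column via a mathematical operation on two columns.
df["revenue"] = df["units"] * df["price"]

# Grouping: aggregate revenue per city.
per_city = df.groupby("city")["revenue"].sum()
print(per_city)
```

Each step above maps onto one of the listed features: missing-data handling, calculated columns, and grouping.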

Can Pandas be used together with other libraries? The answer is yes, definitely. It can be used in conjunction with libraries such as NumPy, SciPy, Scikit-learn and Matplotlib, among others, to address specific needs in data preparation, analysis and visualization.

Pandas has its limitations when it comes to big data: because of its algorithms and local memory constraints, data must be completely loaded into RAM to be processed. As its author acknowledges, a useful guideline is:

“Pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset”

Wes McKinney

You can learn more about Pandas on its Official Website.

Let’s start with the basics. **What’s NumPy? NumPy** is a Python library designed for efficient numerical computing. As a Data Scientist you often need to perform mathematical operations over entire collections of values, and you want to do it fast. **NumPy** provides high-performance multi-dimensional arrays and matrices and the tools to operate on them. It’s the main package for scientific computation in Python. It can also be used for processing images and sound wave representations, and as an efficient multi-dimensional container for generic data. This allows **NumPy** to integrate with a wide variety of databases seamlessly and speedily.

**NumPy** was created by Travis Oliphant in 2005. The original intention was to unify the community with a single array package, incorporating features of the competing Numarray and Numeric libraries.

- **High-performance N-dimensional homogeneous array object (`ndarray`)**: These arrays are homogeneously typed and can be one-dimensional or multidimensional.
- **Fast computation through vectorization**: Implicit element-by-element operations on multi-dimensional arrays are executed speedily by pre-compiled C code.
- **Tools for integrating code from C/C++ and Fortran**
- **Designed for scientific computation**: It can perform complex operations on arrays, such as trigonometric, statistical, linear algebra, or Fourier transform routines, and it supports random number generation.
- **Array broadcasting**: The term describes how NumPy treats arrays of different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
- **Data type definition capabilities to work with varied databases**
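Vectorization and broadcasting can be illustrated in a few lines (the arrays here are made up for the example):

```python
import numpy as np

# A 3x3 homogeneous ndarray.
a = np.arange(9.0).reshape(3, 3)

# Vectorization: the element-wise loop runs in pre-compiled C code.
doubled = a * 2

# Broadcasting: the 1-D row [10, 20, 30] is "stretched"
# across each row of the 3x3 array.
shifted = a + np.array([10.0, 20.0, 30.0])

print(doubled[0])   # [0. 2. 4.]
print(shifted[0])   # [10. 21. 32.]
```

Note that no explicit Python loop appears anywhere; that is where the speed comes from.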

Basically, NumPy is part of the core of every task that involves working with structured data. Scikit-learn can use NumPy to load, manipulate and summarize data. Matplotlib can use NumPy to plot the information in an attractive way. It can also be used in combination with SciPy to produce better insights.

NumPy requires a contiguous allocation of memory: insertion and deletion become costly because data is stored in contiguous memory locations.

Algorithms that are not expressible as a vectorized operation will typically run slowly, because they must be implemented in “pure Python”.
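The gap between the two styles is easy to see in a quick, illustrative comparison (the array size and repetition count below are arbitrary; `timeit` reports wall-clock seconds):

```python
import timeit
import numpy as np

a = np.arange(100_000, dtype=np.float64)

# Vectorized: the arithmetic loop runs in pre-compiled C code.
def vectorized():
    return a * 2.0 + 1.0

# "Pure Python": the same arithmetic, one element at a time.
def pure_python():
    return [x * 2.0 + 1.0 for x in a]

print("vectorized :", timeit.timeit(vectorized, number=10))
print("pure Python:", timeit.timeit(pure_python, number=10))
```

Both functions compute exactly the same values; only the vectorized one delegates the loop to C.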

If you want to read the complete documentation you can check the Official Website.

**Matplotlib** is the most popular open-source library for data visualization. It is a 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

It can generate many types of plots, from histograms to scatterplots, with just a few lines of code. **Matplotlib** lays down an array of colors, themes, palettes, and other options to customize and personalize our plots. This library also delivers an API for embedding plots into applications.

One of **Matplotlib’s** most important features is its ability to play well with many operating systems and graphics backends.

**Matplotlib** provides an easy but comprehensive visual approach to present our findings. It allows programmers to visualize huge amounts of data and produce high-quality images in a range of formats.

For simple plotting, the `pyplot` module provides a MATLAB-like interface.

**Matplotlib** is useful whether you’re performing data exploration for a machine learning project or simply want to create dazzling and eye-catching charts.

While Matplotlib is an excellent data-plotting library, and ships with some toolkits to plot 3D graphs, it is not the most complete or fastest option in this area.

If you are looking for further resources, in its Official Website you will find all the detailed documentation.

Let’s see some examples of how we can use these three libraries together. We are going to use Jupyter Notebook to run our examples.

First you need to import the libraries in your Jupyter Notebook, following these instructions:
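The import cell is conventionally written with the community-standard aliases `np`, `pd`, and `plt`:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# In a Jupyter Notebook you can also enable inline rendering
# of figures with the magic command:
# %matplotlib inline
```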

In the first example we are going to visualize the results from a Primary Election using a bar plot.

We create the data for our example using NumPy arrays. We plot the data, then decorate the plot by adding a title and axis labels.
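A minimal sketch of such a bar plot; the candidate names and vote counts below are invented for illustration, not the original example’s data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented primary-election results.
candidates = np.array(["Adams", "Baker", "Clark", "Davis"])
votes = np.array([4200, 3100, 2500, 1800])

# Draw the bar plot and decorate it.
plt.bar(candidates, votes, color="steelblue")
plt.title("Primary Election Results")
plt.xlabel("Candidate")
plt.ylabel("Votes")
plt.show()
```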

In the next example we use a scatter plot to visualize the correlation between the years of experience and the corresponding salary earned by the employees of an organization.

We use Pandas to read our data from a CSV file, create the dataset, and select the data we are going to plot.

In our last example we present line plots for time series analysis, based on a dataset containing daily temperature values for the period 1981-1990.

Let’s take only the first 100 values for the line plot; with too many points the graph becomes crowded and we cannot read it properly.

We hope you have enjoyed this article, and just be aware that this list is just the starting point, as the Python ecosystem offers many other tools that can be helpful for data science work.

This article was written in co-participation with *Lautaro Cupaiuoli*, *Andrés Escobar* and *Juan Franco*. Special thanks to the Data Science Hub for their support on writing this article.

Backend Software Designer