**Python** is the most widely used programming language among data scientists. **Data scientists** need to deal with complex problems, and the problem-solving process involves four major steps: data collection and cleaning, data exploration, data modeling, and data visualization. **Python** provides all the necessary tools to carry out this process effectively with dedicated libraries. In this article we present the most useful **Python libraries** for data manipulation, data visualization, and data modeling, which are certainly worth learning.

Pandas is the most powerful, flexible, and easy-to-use open-source library for data manipulation and analysis.

The name Pandas is derived from the term “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Wikipedia

It was created by the developer Wes McKinney in 2008 to make his work on financial analysis easier. It was built on top of the NumPy package, which means that much of NumPy's structure is used or replicated in Pandas. Part of the library is written in C for optimized performance.

Pandas provides high-performance data structures, such as **Series** (one-dimensional) and **DataFrames** (two-dimensional), that make working with data easy, fast and intuitive.

- **Multiple file formats supported**: Pandas supports importing or reading data from a wide range of file formats.
- **Input and output tools**: Easily read and write data to and from data structures, databases, and web services.
- **Intelligent label-based slicing, indexing, and sub-setting**: The tabular layout allows an effective organization of the data, and indexing provides a mechanism to label and access the data more efficiently.
- **Data handling**: Pandas offers a fast and efficient way to manipulate and explore data through a large number of functions.
- **Easily overcoming missing data**: Dropping or filling in missing data is one of the most recurrent tasks when working with raw data.
- **Data cleaning**: Pandas allows us to clean raw data by filtering only the data needed for further analysis.
- **Data masking**: Pandas helps us easily correct or change data based on certain criteria.
- **Intuitive merging, joining, reshaping and pivoting of data sets**: In data science it is very common to combine different data sets until reaching a final one.
- **Grouping**: The group-by functionality is very flexible and powerful. It allows us to combine or group data in different ways to derive new, relevant information.
- **Mathematical operations on the data**: Another essential feature of Pandas is the ability to generate calculated columns by applying mathematical operations or functions.
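A few of these features can be sketched in a short session; the DataFrame below is invented purely for illustration:

```python
import numpy as np
import pandas as pd

# A small invented dataset with one missing value.
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon", "Lyon"],
    "units": [10, 12, np.nan, 8],
    "price": [2.5, 2.5, 3.0, 3.0],
})

# Overcoming missing data: fill the gap with the column mean.
df["units"] = df["units"].fillna(df["units"].mean())

# Calculated column via a mathematical operation on two columns.
df["revenue"] = df["units"] * df["price"]

# Grouping: aggregate revenue per city.
per_city = df.groupby("city")["revenue"].sum()
print(per_city)
```

Each step above maps onto one of the listed features: missing-data handling, calculated columns, and grouping.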

Can Pandas be used together with other libraries? The answer is yes, definitely. It can be used in conjunction with libraries such as NumPy, SciPy, Scikit-learn and Matplotlib, among others, to address specific needs in data preparation, analysis and visualization.

Pandas has its limitations when it comes to big data: because of its algorithms and local memory constraints, data must be completely loaded into RAM to be processed. As its author acknowledges, a useful guideline is:

“Pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset”

Wes McKinney

You can learn more about Pandas on its Official Website.

Let’s start with the basics. **What’s NumPy? NumPy** is a Python library designed for efficient numerical computing. As a Data Scientist you often need to perform mathematical operations over entire collections of values, and you want to do it fast. **NumPy** provides high-performance multi-dimensional arrays and matrices and the tools to operate on them. It’s the main package for scientific computation in Python. It can also be used for processing images and sound wave representations, and as an efficient multi-dimensional container for generic data. This allows **NumPy** to integrate with a wide variety of databases seamlessly and speedily.

**NumPy** was created by Travis Oliphant in 2005. The original intention was to unify the community with a single array package, incorporating features of the competing Numarray and Numeric libraries.

- **High-performance N-dimensional homogeneous array object (`ndarray`)**: These arrays are homogeneously typed and can be one-dimensional or multidimensional.
- **Fast computation through vectorization**: Implicit element-by-element operations on multi-dimensional arrays are executed speedily by pre-compiled C code.
- **Tools for integrating code from C/C++ and Fortran**
- **Designed for scientific computation**: It can perform complex operations on arrays, such as trigonometric, statistical, linear algebra, or Fourier transform routines, and it supports random number generation.
- **Array broadcasting**: The term describes how NumPy treats arrays of different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
- **Data type definition capabilities to work with varied databases**
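Vectorization and broadcasting can be illustrated in a few lines (the arrays here are made up for the example):

```python
import numpy as np

# A 3x3 homogeneous ndarray.
a = np.arange(9.0).reshape(3, 3)

# Vectorization: the element-wise loop runs in pre-compiled C code.
doubled = a * 2

# Broadcasting: the 1-D row [10, 20, 30] is "stretched"
# across each row of the 3x3 array.
shifted = a + np.array([10.0, 20.0, 30.0])

print(doubled[0])   # [0. 2. 4.]
print(shifted[0])   # [10. 21. 32.]
```

Note that no explicit Python loop appears anywhere; that is where the speed comes from.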

Basically, NumPy is part of the core of every task that involves working with structured data. Scikit-learn can use NumPy to load, manipulate and summarize data. Matplotlib can use NumPy to plot the information in an attractive way. It can also be used in combination with SciPy to produce better insights.

NumPy requires a contiguous allocation of memory: insertion and deletion become costly because data is stored in contiguous memory locations.

Algorithms that are not expressible as a vectorized operation will typically run slowly, because they must be implemented in “pure Python”.
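The gap between the two styles is easy to see in a quick, illustrative comparison (the array size and repetition count below are arbitrary; `timeit` reports wall-clock seconds):

```python
import timeit
import numpy as np

a = np.arange(100_000, dtype=np.float64)

# Vectorized: the arithmetic loop runs in pre-compiled C code.
def vectorized():
    return a * 2.0 + 1.0

# "Pure Python": the same arithmetic, one element at a time.
def pure_python():
    return [x * 2.0 + 1.0 for x in a]

print("vectorized :", timeit.timeit(vectorized, number=10))
print("pure Python:", timeit.timeit(pure_python, number=10))
```

Both functions compute exactly the same values; only the vectorized one delegates the loop to C.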

If you want to read the complete documentation you can check the Official Website.

**Matplotlib** is the most popular open-source library for data visualization. It is a 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

It can generate many types of plots, from histograms to scatterplots, with just a few lines of code. **Matplotlib** lays down an array of colors, themes, palettes, and other options to customize and personalize our plots. This library also delivers an API for embedding plots into applications.

One of **Matplotlib’s** most important features is its ability to play well with many operating systems and graphics backends.

**Matplotlib** provides an easy but comprehensive visual approach to present our findings. It allows programmers to visualize huge amounts of data and produce high-quality images in a range of formats.

For simple plotting, the `pyplot` module provides a MATLAB-like interface.

**Matplotlib** is useful whether you’re performing data exploration for a machine learning project or simply want to create dazzling and eye-catching charts.

While Matplotlib is an excellent data-plotting library, and ships with some toolkits to plot 3D graphs, it is not the most complete or fastest option in this area.

If you are looking for further resources, in its Official Website you will find all the detailed documentation.

Let’s see some examples of how we can use these three libraries together. We are going to use Jupyter Notebook to run our examples.

First you need to import the libraries in your Jupyter Notebook, following these instructions:
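The import cell is conventionally written with the community-standard aliases `np`, `pd`, and `plt`:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# In a Jupyter Notebook you can also enable inline rendering
# of figures with the magic command:
# %matplotlib inline
```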

In the first example we are going to visualize the results from a Primary Election using a bar plot.

We create the data for our example using NumPy arrays. We plot the data, then decorate the plot by adding a title and axis labels.
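A minimal sketch of such a bar plot; the candidate names and vote counts below are invented for illustration, not the original example’s data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented primary-election results.
candidates = np.array(["Adams", "Baker", "Clark", "Davis"])
votes = np.array([4200, 3100, 2500, 1800])

# Draw the bar plot and decorate it.
plt.bar(candidates, votes, color="steelblue")
plt.title("Primary Election Results")
plt.xlabel("Candidate")
plt.ylabel("Votes")
plt.show()
```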

In the next example we use a scatter plot to visualize the correlation between the years of experience and the corresponding salary earned by the employees of an organization.

We use Pandas to read our data from a CSV file, create the dataset, and select the data we are going to plot.

In our last example we present line plots for time series analysis, based on a dataset containing daily temperature values for the period 1981-1990.

Let’s take only the first 100 values for the line plot; with too many points the graph becomes crowded and we cannot read it properly.

We hope you have enjoyed this article, and just be aware that this list is just the starting point, as the Python ecosystem offers many other tools that can be helpful for data science work.

This article was written in co-participation with *Lautaro Cupaiuoli*, *Andrés Escobar* and *Juan Franco*. Special thanks to the Data Science Hub for their support on writing this article.

Backend Software Designer