Which programming language should I learn for getting started in data science?
That’s one of the most popular questions from anyone who is getting started in data science. There are several programming languages to choose from. When I got started in the field of data science, I had the same confusion as you, and I wasted several hours browsing to come up with a good choice. I don’t want you to waste your precious time as I did. Before I give you my opinion, it is good to know which languages and platforms are popular in self-selected communities of data science. Every year KDnuggets conducts a poll on “What programming/statistics languages are used for data science work”.
The graph below shows the survey results for 2014.
The popular tools or programming languages mentioned in the survey are R, Python, SAS, MATLAB, SPSS, MySQL and Java. I did not choose MATLAB, SPSS and SAS as they are expensive software products. I also didn’t choose Java as it has a steep learning curve. MySQL and the Hadoop-based languages, on the other hand, are open source database technologies rather than programming languages, so I did not choose them either. That left me with R and Python, and in the end I chose Python over R. There are 3 reasons why I chose Python as a beginner.
Python is easy to learn:
By surfing the internet, I found that Python is generally considered easier to learn than R, which has a steeper learning curve. This reminded me of a quote by Mark Zuckerberg:
“If you do the things that are easier first, then you can actually make a lot of progress”
As a beginner you will be replicating the projects done by other data scientists, for which you have to read the source code of their projects. Python places an emphasis on readability, so you can understand code written by other developers without much pain.
The first code example below is written in C++:
std::cout << "Hello, world!\n";
Here is the code with the same output in Python:
print("Hello, world!")
Excellent tutorials and libraries
There are excellent tutorials available for Python. Many beginner programming classes on MOOCs are taught in Python; even MIT teaches its introductory programming course in Python. There are also excellent Python tutorials focused on data science applications.
Data scientists are often involved with wiring together network applications, programming for the web, scripting and automating data processing jobs. If you are looking for one programming language to do all these tasks, then Python is the answer. Python’s popularity for data science is largely due to the strength of its core libraries.
Real-world data is not clean: it comes from disparate sources and contains mismatched records. Data scientists spend most of their time cleaning up messy data sets. The pandas library makes data munging a lot easier.
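To give a feel for what data munging looks like, here is a minimal sketch using pandas. The records, the column names and the choice of "Unknown" as a placeholder are all made up for illustration; real data will need its own cleaning rules.

```python
import pandas as pd

# Hypothetical messy records: stray whitespace, inconsistent casing,
# a duplicated row, a missing city, and ages stored as text.
raw = pd.DataFrame(
    {
        "name": [" Alice", "bob ", "bob ", "Carol"],
        "city": ["paris", "LONDON", "LONDON", None],
        "age": ["34", "27", "27", "41"],
    }
)

clean = (
    raw.assign(
        # Trim whitespace and normalise capitalisation.
        name=raw["name"].str.strip().str.title(),
        # Normalise capitalisation, then fill the missing value.
        city=raw["city"].str.title().fillna("Unknown"),
        # Convert the text ages to numbers.
        age=pd.to_numeric(raw["age"]),
    )
    .drop_duplicates()        # remove the repeated "bob" row
    .reset_index(drop=True)   # renumber the remaining rows
)

print(clean)
```

Each step is a single method call, which is a large part of why cleaning data with pandas feels lighter than writing the loops by hand.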
A data scientist should possess different skills, and machine learning is one of them.
A data scientist should be capable enough to run machine learning algorithms on a dataset to derive meaningful insights. Take any intro data science course and it will give a brief introduction to machine learning, which shows the importance of machine learning in data science.
Python has an extensive machine learning library called scikit-learn. For deep learning in particular, it has an amazing library called Theano. You can easily run machine learning algorithms such as SVM, linear regression and logistic regression on a dataset using scikit-learn.
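As a rough illustration of how little code this takes, the sketch below fits a logistic regression with scikit-learn. The iris dataset bundled with the library, the 25% test split, the fixed random seed and the max_iter setting are my own choices for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the rows to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the classifier; max_iter is raised so the solver has room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns the fraction of test rows classified correctly.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Other scikit-learn estimators follow the same fit and score pattern, so trying an SVM instead should only mean swapping LogisticRegression for sklearn.svm.SVC.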
R is the standard language for performing statistical analysis, but it has quite a steep learning curve and there are certain areas of data science for which it is not well suited. Python is an extremely coherent, compact, object-oriented language, while R is frankly a hodgepodge of features, which makes it intimidating for a beginner.
“Life is short use Python”