Do you know how Data Scientists use Python, and if it requires data, where I can get sample data to practice with? I'm currently learning the language and plan on applying for Data Scientist jobs, so it would be helpful to know whether there are specific types of tasks where I'd use Python. If it's relevant, I use C# regularly.

Data is most valuable when it can be shared and the results of analysis can be shared. I have seen a number of Python projects make great use of Django. I would recommend that you invest some time getting to grips with the basics of Django. – PenFold Feb 11 '12 at 15:20
I do not agree with the point stated above: analysing data and presenting your results are two different things. – AsTeR Feb 11 '12 at 15:53
I'd assert that "data scientist" is a pretty vague term, straddling about a dozen fields and a whole lot of definitions. Any chance of some more specifics on at least the industry you're looking at? – EpiGrad Feb 11 '12 at 20:38
Good point. The software industry. – user1106278 Feb 11 '12 at 21:13

migrated from stackoverflow.com Feb 11 '12 at 14:53

5 Answers

up vote 2 down vote accepted

Any data you parse will come in one of two forms: ASCII (text) or binary.

I would practice and learn how to 'get' data from multiple sources. These could be files, network connections, the stdout of subprocesses, a serial port, and so on.

For ASCII data, you will most probably be using string methods for splitting and storing the data.
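As a rough illustration of the string-method approach, here is a minimal sketch. The sample text, its column names, and the `parse_records` helper are all made up for the example; real data will need its own delimiter and error handling.

```python
# Hypothetical sample: one comma-separated reading per line, with a header row.
raw_text = """timestamp,sensor,value
2012-02-11T15:20,temp,21.5
2012-02-11T15:21,temp,21.7
"""

def parse_records(text):
    """Split delimited text into a list of dicts keyed by the header row."""
    lines = text.strip().splitlines()
    header = lines[0].split(",")
    records = []
    for line in lines[1:]:
        fields = line.split(",")
        records.append(dict(zip(header, fields)))
    return records

records = parse_records(raw_text)
# Everything arrives as a string, so numeric fields are converted explicitly.
values = [float(r["value"]) for r in records]
```

For anything with quoted fields or embedded delimiters, the standard library's `csv` module is usually a safer choice than a bare `split`.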

Binary data is where you'd really want to practice. You have to worry about the endianness of the data and the specific bit lengths of different datatypes on different platforms, follow the specification for the format of the data, and actually parse the raw data (get familiar with 'struct' and packing and unpacking data).
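To make the 'struct' suggestion concrete, here is a small sketch. The record layout (an unsigned 16-bit ID, a signed 32-bit count, and a 64-bit float) is invented for the example; a real format would come from its specification.

```python
import struct

# Hypothetical record layout: 2-byte unsigned ID, 4-byte signed count,
# 8-byte float. "<" selects little-endian, standard sizes, and no padding,
# so the layout is the same on every platform.
RECORD_FORMAT = "<Hid"

packed = struct.pack(RECORD_FORMAT, 7, -42, 3.5)
record_size = struct.calcsize(RECORD_FORMAT)  # 2 + 4 + 8 bytes

record_id, count, reading = struct.unpack(RECORD_FORMAT, packed)

# The same integer packed big-endian has its bytes in the opposite order.
little = struct.pack("<I", 1)
big = struct.pack(">I", 1)
```

Stating the byte order explicitly in the format string, rather than relying on the native default, is what keeps the platform-specific size and alignment issues mentioned above out of the picture.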

As for examples, search for tutorials on each piece I've mentioned. For most of these you can write a temporary data generator to act as the other side of the process (such as a TCP server, a subprocess writing to stdout, or something that generates files).
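One possible shape for such a practice setup is sketched below: a throwaway child process plays the data generator and the parent reads and parses its stdout. The generator script and its "reading N" line format are hypothetical, chosen only to keep the example short.

```python
import subprocess
import sys

# Hypothetical generator: a child Python process that prints three readings.
GENERATOR_CODE = "for i in range(3): print('reading', i * 10)"

result = subprocess.run(
    [sys.executable, "-c", GENERATOR_CODE],
    capture_output=True,  # collect stdout instead of printing it
    text=True,            # decode bytes to str
    check=True,           # raise if the child exits with an error
)

readings = []
for line in result.stdout.splitlines():
    label, value = line.split()
    readings.append(int(value))
```

The same pattern applies to the other sources: write the small producer first (a file writer, a socket server), then practice consuming it.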

Good luck

Thanks, g19fanatic. That's exactly the type of detailed answer that I wanted. I remember pulling out large chunks of hair dealing with endianness. Thankfully, it'll be less of a problem this time around because I have less hair. So Data Scientists use Python primarily for data parsing? – user1106278 Feb 11 '12 at 19:09
It is a part of it. It all starts with parsing; without parsing, you don't have anything to analyze. – g19fanatic Feb 12 '12 at 18:23

I'm not the most experienced in that field but I use Python for data mining and experimental data analysis and modification.

Why Python for that?

  • Many libraries are designed around science and data analysis (SciPy and NumPy, yes, but also more specialised ones like Pandas and NetworkX, and good visualisation utilities like matplotlib)
  • Python turns my ideas into code in very few lines
  • Python's learning curve lets you dive into the code easily (use a nice interpreter like Reinteract or bpython). For example, file I/O and text processing are trivial, so I frequently use it just to write small format converters.
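As an illustration of the "small format converter" point, here is a minimal sketch that turns CSV text into JSON using only the standard library. The `csv_to_json` helper and the sample input are invented for the example.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), indent=2)

# Hypothetical input, inlined so the example runs without any files.
sample = "name,language\nAda,Python\nLin,C#\n"
converted = csv_to_json(sample)
```

To convert a real file, the same function could be fed the result of reading that file, with the output written back out through `open(..., "w")`.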

I wrote an article in French on "Why Python?" if you can read it.

If you want some data, you can find some on sites like the UCI Machine Learning Repository:

  • some are ready for processing in data mining
  • some others require some preprocessing like this one.
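What "preprocessing" means depends on the data set, but a common first step is dealing with missing values and converting text fields to numbers. The sketch below assumes a headerless, comma-separated file in which "?" marks a missing value; that convention, the sample rows, and the `load_complete_rows` helper are assumptions for illustration, so check the documentation of whichever data set you download.

```python
import csv
import io

# Hypothetical rows: two numeric features followed by a class label,
# with "?" standing in for a missing value (an assumed convention).
raw = "5.1,3.5,setosa\n?,3.0,setosa\n6.2,?,virginica\n5.9,3.0,virginica\n"

def load_complete_rows(text, missing="?"):
    """Parse rows, drop any row with a missing field, and convert numbers."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if missing in fields:
            continue  # skip incomplete rows; imputing a value is another option
        *numbers, label = fields
        rows.append(([float(x) for x in numbers], label))
    return rows

clean = load_complete_rows(raw)
```

Dropping incomplete rows is the simplest choice, not necessarily the best one; for small data sets, filling in a column mean or median may lose less information.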

** EDIT **

A good video about Pandas at PyCon.

Thanks, AsTeR. I forgot about UCI's database. – user1106278 Feb 11 '12 at 19:04
Remembering old times at school, huh? ;) – AsTeR Feb 11 '12 at 19:50
Shudder... The only memories I have of school are memories I'd rather not have. :) – user1106278 Feb 11 '12 at 20:38
+1 for introducing me to NetworkX. Looks very promising from their website. – Buttons840 Feb 16 '12 at 17:10

Why don't you try some scientific computation libraries like these:

  • Scipy: Scientific Tools for Python
  • Numpy: Scientific Computing Tools For Python
  • Matplot: python 2D plotting library
  • Sage: Comprehensive CAS system that integrates the above libraries and a lot more

These are tools for scientific computation in Python.

sagemath is also nice :) It's basically a comprehensive wrapper around all those libraries. – Niklas Feb 11 '12 at 14:47
Matplotlib is actually the name, not just matplot+library. – ldigas Feb 11 '12 at 15:51
Thanks. I'll get a feel for standard Python first, then hit the tools that you mentioned. – user1106278 Feb 11 '12 at 19:09

This video on YouTube explains that in detail: Python for Data Science

Whilst this may theoretically answer the question, it would be preferable to include the essential parts of the answer here, and provide the link for reference. – Yannis Rizos Feb 11 '12 at 15:13
Thanks. It's a longer video, so I'll take a look later today. – user1106278 Feb 11 '12 at 19:14
If anyone else wants to look at it, it'll be faster to read the powerpoint: lanyrd.com/2011/pycon-finland/sgwwx. The presenter gives an outline as follows: 1. Different tactics to gather your data 2. Cleansing, scrubbing, correcting your data 3. Running analysis for your data 4. Bring your data to live with visualizations 5. Publishing your data for rest of us as linked open data – user1106278 Feb 12 '12 at 0:47

I'd recommend this book.

Machine Learning: An Algorithmic Perspective

It uses data sets from the UCI Machine Learning Repository. Even if you don't specifically want to do machine learning and are more interested in statistics, there are a bunch of data sets there that might interest you.

The main library for doing science with Python is Numpy. Enthought has a great Python distribution which includes Numpy, matplotlib, and many other libraries to get you started.
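To give a feel for what working with Numpy looks like, here is a tiny sketch that standardizes the columns of an array. The measurements are made up, and the example assumes Numpy is installed (for instance through a distribution like the one mentioned above).

```python
import numpy as np

# Hypothetical measurements: rows are samples, columns are features.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

column_means = data.mean(axis=0)      # one mean per column
centered = data - column_means        # broadcasting subtracts from every row
scaled = centered / data.std(axis=0)  # each column now has unit spread
```

Note that there are no explicit loops: operations apply to whole arrays at once, which is most of what makes Numpy code short and fast.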

Also check out: Numpy 1.5 Beginners Guide. It's a great introduction and will get you up and running doing simple things quickly. The author's blog also has some good introductory material.

Hope that helps. An aside (noting that you're a C# programmer as well): I still haven't been able to get Numpy to work successfully in IronPython, so I have to use normal CPython. If you happen to get that figured out, please let me know! Having Numpy within .NET would be really nice.

Thanks, Svaha. I'll be sure to let you know. – user1106278 Feb 11 '12 at 21:15
