Summary: A Crash Course in Data Science

This is an educational sharing. I took down the notes during this course so I can absorb the knowledge better and share it with my readers.

I have taken this course through Coursera, an education-technology focused company that offers numerous online courses. They offer many courses and certifications from top universities. My Company has partnered with Coursera and there are plenty of courses that I can learn and to me, that is totally awesome because I can learn new subjects and learn as much as I can. I really love learning, so this entry is for those who love to learn as well. ;)

This is a summary of what I have learned (it is not in detail, but more simple points and notes for basic understanding).

If you are interested in the course, check out https://www.coursera.org/ to learn as many courses as you can from foundation courses to advanced courses. I completed this course for 2 days for a one-week course (depends on how you manage your time and how fast you learn) and I earned my certification from John Hopkins University through Coursera. You can try it too. :D

I will share more educational entry more to come. Stay tuned and hope you learned something out of my data science summary! :)

Sofia's Learning Summary: A Crash Course in Data Science

Credit to Coursera: John Hopkins University

📘 Data science is only useful when the data are used to answer a question
📘 Data science is not data, it is science
📘 Data science is also known as knowledge discovery and data mining (KDD)
📘 Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience

The best machine learning method

1) Interpretable

2) Simple

3) Accurate

4) Fast (to train and test)

5) Scalable

In short, data science is answering specific questions with data

What are statistics good for?

Statistics is the practice of science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

1) Descriptive analysis - brain MRI scan cognitive example

2) Statistical inference - the process of making conclusions about populations from a sample

3) Prediction - stock market

4) Statistical design - clinical trial, AB design

Machine Learning - is a set of algorithms that can take a set of inputs (data) and return a prediction

2 activities of machine learning:

1) Supervised learning - using a collection of predictors and some observed outcomes to build an algorithm to predict the outcome when it is not observed; random forests, boosting, SVMs (support vector machines). E.g: the development of regression

2) Unsupervised learning - trying to uncover unobserved. factors; clustering, mixture models, principal components. E.g: the computation of the g-factor (psychometrics)

📘 Evaluates results via prediction performance
📘 Concern for overfitting but not model complexity per se
📘 Emphasis on performance overpopulation modeling
📘 Generalizability is obtained through performance on novel datasets
📘 Concern over performance and robustness

Traditional statistical analysis

📘 Emphasizes superpopulation inference
📘 Focuses on a-priori hypotheses
📘 Simpler models preferred over complex ones
📘 Emphasis on parameter interpretability
📘 Emphasis on modeling or sampling assumptions
📘 Concern over performance and robustness

1) Both approaches are valuable

2) Amount of tolerable model/algorithm complexity changes dramatically

3) Goals of the approaches are different

Further reading:

1) Rise of the Machines by Larry Wasserman

2) Statistical modeling: The Two Cultures by Leo Breiman

3) Classifier Technology and the Illusion of Progress by David J. Hand

What is software engineering or data science?

Types of Software

📘 Just some code
📘 That you wrote code at all is the first step
📘 Encapsulate automation with a loop or similar
📘 Some sort of function
📘 First level of abstraction; defined 'interface'
📘 Software package
📘 API+convenience for user (documentation)

Rule of Thumb

📘 Do it once: write some code and document it well
📘 Do it twice: Write a function (or equivalent)
📘 Do it three times: Write a package with docs

Structure of a data science project

📘 Question (between the 6 types of questions)
📘 EDA (exploratory data analysis) - is the data suitable for the question? - sketch the solution
📘 Formal Modeling
📘 Interpretation
📘 Communication
📘 Decision

The output of a data science experiment

📘Reports: clearly written, narrative, concise conclusions, omit the unnecessary, reproducible
📘Presentations: clearly presented, narrative, concise conclusions, omit the unnecessary, reproducible
📘Web pages and apps: easy to use, documentation, code commented, version control

Defining Success

1) New knowledge is created

2) Decisions or policies are made based on the outcome of the experiment

3) A report, presentation or app with impact is created

4) It is learned that the data can't answer the question being asked of it

Data Science Toolbox

Collection of tools that are used to store, process, analyze and communicate results of data science experiments.

📘 R programming
📘 Phyton
📘 MongoDB
📘 Hadoop
📘 Spark
📘 Stack Overflow

Separating Hype from Value

📘 What is the question you are trying to answer with data?
📘 Do you have the data to answer that question?
📘 If you could answer the question, could you use the answer?

Do share with me if you know more about Data Science. Would love to learn more :D