Summary: A Crash Course in Data Science


This is an educational post. I took notes during this course so that I could absorb the material better and share it with my readers.
I took the course through Coursera, an education-technology company that offers many online courses and certifications from top universities. My company has partnered with Coursera, so there are plenty of courses available to me, which I think is totally awesome because I can pick up new subjects and learn as much as I can. I really love learning, so this entry is for those who love to learn as well. ;)

This is a summary of what I learned: not a detailed write-up, but simple points and notes for basic understanding.

If you are interested, check out Coursera, which offers everything from foundation courses to advanced courses. I completed this one-week course in 2 days (it depends on how you manage your time and how fast you learn) and earned my certificate from Johns Hopkins University through Coursera. You can try it too. :D

I will share more educational entries in the future. Stay tuned, and I hope you learn something from my data science summary! :)

Sofia's Learning Summary: A Crash Course in Data Science
Credit to Coursera: Johns Hopkins University

📘 Data science is only useful when the data are used to answer a question
📘 Data science is not data, it is science
📘 Data science is also known as knowledge discovery and data mining (KDD)
📘 Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience

The best machine learning method
1) Interpretable
2) Simple
3) Accurate
4) Fast (to train and test)
5) Scalable

In short, data science is answering specific questions with data

What are statistics good for?
Statistics is the practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

1) Descriptive analysis - e.g. the brain MRI scan and cognition example
2) Statistical inference - the process of drawing conclusions about populations from a sample
3) Prediction - e.g. the stock market
4) Statistical design - e.g. clinical trials, A/B testing
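To make the first two points a little more concrete, here is a minimal Python sketch that I put together myself (it is not from the course). The height numbers are made up for illustration, and the 95% interval uses a rough normal approximation, so treat it as a sketch under those assumptions rather than a proper analysis.

```python
import math
import statistics

# Hypothetical sample: heights (cm) of 10 people drawn from a larger population.
sample = [162.0, 175.5, 168.2, 171.0, 159.8, 180.3, 166.7, 173.4, 169.9, 164.2]

# Descriptive analysis: summarize the sample itself.
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Statistical inference: use the sample to say something about the population.
# Rough 95% confidence interval for the population mean, using the normal
# approximation (z = 1.96); with so few points a t-interval would be wider.
standard_error = stdev / math.sqrt(len(sample))
ci_low = mean - 1.96 * standard_error
ci_high = mean + 1.96 * standard_error

print(f"sample mean = {mean:.1f}, sample sd = {stdev:.1f}")
print(f"approx. 95% CI for population mean: ({ci_low:.1f}, {ci_high:.1f})")
```

The first half only describes the ten people in the sample; the second half is where the leap to the wider population happens, which is why it comes with an interval instead of a single number.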

Machine learning is a set of algorithms that can take a set of inputs (data) and return a prediction

2 activities of machine learning:

1) Supervised learning - using a collection of predictors and some observed outcomes to build an algorithm that predicts the outcome when it is not observed. Methods include random forests, boosting and SVMs (support vector machines). E.g. the development of regression

2) Unsupervised learning - trying to uncover unobserved factors. Methods include clustering, mixture models and principal components. E.g. the computation of the g-factor (psychometrics)
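Here is a small Python sketch of the two activities, written by me as an illustration and not taken from the course. The data (hours studied, exam scores, and the six values being clustered) are hypothetical, and `fit_line` and `two_means` are helper names I made up. The supervised part is a plain least-squares line; the unsupervised part is a tiny one-dimensional k-means with two clusters.

```python
from statistics import mean

# --- Supervised learning: predictors AND observed outcomes are available ---
# Hypothetical data: hours studied (predictor) and exam score (outcome).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(hours, scores)
predicted = intercept + slope * 6.0  # predict an outcome that was not observed

# --- Unsupervised learning: no outcomes, only structure to uncover ---
def two_means(values, iterations=10):
    """Tiny 1-D k-means with k = 2, started from the min and max values."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
print("predicted score for 6 hours:", predicted)
print("cluster centers:", centers)
```

Notice that the regression needs the `scores` column to learn from, while the clustering is only handed raw values and has to find the two groups on its own.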

Machine learning

📘 Evaluates results via prediction performance
📘 Concern for overfitting but not model complexity per se
📘 Emphasis on performance over population modeling
📘 Generalizability is obtained through performance on novel datasets
📘 Concern over performance and robustness
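The points about overfitting and novel datasets can be sketched in a few lines of Python. This is my own toy example with made-up numbers, not something from the course: `memorizer` is a deliberately overfit "model" that just looks up the closest training point, and `mean_squared_error` is a helper I wrote for the comparison.

```python
# Hypothetical data: y is roughly 2 * x, plus some hand-written noise.
train = [(1.0, 2.3), (2.0, 3.8), (3.0, 6.4), (4.0, 7.7), (5.0, 10.2)]
novel = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8), (4.6, 9.2)]  # never seen in training

def memorizer(x):
    """Overfit 'model': return the outcome of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mean_squared_error(model, data):
    """Average squared difference between predictions and observed outcomes."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train_error = mean_squared_error(memorizer, train)  # looks perfect on seen data
novel_error = mean_squared_error(memorizer, novel)  # the more honest estimate

print("error on training data:", train_error)
print("error on novel data:", novel_error)
```

The model scores perfectly on the data it memorized and noticeably worse on data it has never seen, which is why performance on novel datasets is the measure that counts.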

Traditional statistical analysis

📘 Emphasizes superpopulation inference
📘 Focuses on a-priori hypotheses
📘 Simpler models preferred over complex ones
📘 Emphasis on parameter interpretability
📘 Emphasis on modeling or sampling assumptions
📘 Concern over assumptions and robustness

1) Both approaches are valuable
2) Amount of tolerable model/algorithm complexity changes dramatically
3) Goals of the approaches are different

Further reading:
1) Rise of the Machines by Larry Wasserman
2) Statistical Modeling: The Two Cultures by Leo Breiman
3) Classifier Technology and the Illusion of Progress by David J. Hand

What is software engineering for data science?

Types of Software

📘 Just some code: that you wrote code at all is the first step; encapsulate automation with a loop or similar
📘 Some sort of function: first level of abstraction, with a defined "interface"
📘 Software package: API plus convenience for the user (documentation)

Rule of Thumb

📘 Do it once: write some code and document it well
📘 Do it twice: write a function (or equivalent)
📘 Do it three times: write a package with docs
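The rule of thumb above can be sketched like this in Python. The temperature conversion and the name `fahrenheit_to_celsius` are just my own hypothetical example to show the progression, not something from the course.

```python
# Do it once: a few lines of documented code are enough.
temperatures_f = [68.0, 72.5, 80.1]
temperatures_c = [(f - 32.0) * 5.0 / 9.0 for f in temperatures_f]

# Do it twice: wrap the logic in a function with a clear interface.
def fahrenheit_to_celsius(values):
    """Convert a list of Fahrenheit temperatures to Celsius."""
    return [(f - 32.0) * 5.0 / 9.0 for f in values]

monday = fahrenheit_to_celsius([68.0, 72.5, 80.1])
tuesday = fahrenheit_to_celsius([32.0, 212.0])

# Do it three times: move the function into a package with documentation,
# so that other people (and future you) can install and reuse it.
print(monday, tuesday)
```

The function is the first point where there is an "interface": callers only need to know what goes in and what comes out, not how the conversion is done.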

Structure of a data science project

📘 Question (one of the 6 types of questions)
📘 EDA (exploratory data analysis) - is the data suitable for the question? - sketch the solution
📘 Formal Modeling
📘 Interpretation
📘 Communication
📘 Decision

The output of a data science experiment

📘 Reports: clearly written, narrative, concise conclusions, omit the unnecessary, reproducible
📘 Presentations: clearly presented, narrative, concise conclusions, omit the unnecessary, reproducible
📘 Web pages and apps: easy to use, documentation, code commented, version control

Defining Success
1) New knowledge is created
2) Decisions or policies are made based on the outcome of the experiment
3) A report, presentation or app with impact is created
4) It is learned that the data can't answer the question being asked of it

Data Science Toolbox
A collection of tools used to store, process, analyze and communicate the results of data science experiments.

📘 R programming
📘 Python
📘 MongoDB
📘 Hadoop
📘 Spark
📘 Stack Overflow

Separating Hype from Value

📘 What is the question you are trying to answer with data?
📘 Do you have the data to answer that question?
📘 If you could answer the question, could you use the answer?

Do share with me if you know more about Data Science. Would love to learn more :D


Best regards,
