Data scientists spend a significant amount of time writing code to answer the following questions:
- What does the data look like? What’s the schema?
- What’s the quality of the data? What’s the severity of missing data?
- How are individual variables distributed? Do I need to do variable transformation?
- How relevant is the data to the machine learning task? How difficult is the machine learning task itself?
- Which variables are most relevant to the machine learning target?
- Is there any specific clustering pattern in the data?
- How will ML models trained on the data perform? Which variables are significant in the models?
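As a rough illustration of how the first few questions (schema, missing data, variable distributions) tend to turn into repeated code, here is a minimal sketch in plain Python. It is not taken from IDEAR or AMAR, which run in R; the function name `profile_columns`, the set of values treated as missing, and the sample records are all assumptions made for this example.

```python
from statistics import mean


def profile_columns(rows, missing=(None, "", "NA")):
    """Summarize each column: inferred type, share of missing values, basic stats.

    `rows` is a list of dicts, one per record. The `missing` markers are an
    assumption for this sketch; real data sets may encode gaps differently.
    """
    # Collect column names in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    profile = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in missing]
        # A column counts as numeric only if every present value is a number.
        numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        summary = {
            "type": "numeric" if numeric else "categorical",
            "missing_fraction": (len(values) - len(present)) / len(values),
        }
        if numeric:
            summary.update(min=min(present), max=max(present), mean=mean(present))
        else:
            summary["distinct"] = len(set(present))
        profile[name] = summary
    return profile


# Hypothetical sample records, invented for illustration only.
rows = [
    {"age": 34, "income": 52000.0, "segment": "A"},
    {"age": None, "income": 61000.0, "segment": "B"},
    {"age": 45, "income": None, "segment": "A"},
    {"age": 29, "income": 48000.0, "segment": "NA"},
]

profile = profile_columns(rows)
for column, summary in profile.items():
    print(column, summary)
```

Almost every project ends up with a helper like this one, written slightly differently each time, which is the kind of duplication a shared utility is meant to remove.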
Much of this code can be generalized into data science utilities that are reusable across projects. Such utilities guide data scientists through specific tasks in a project and help ensure that those tasks are carried out consistently and completely. To help data scientists, Microsoft is releasing two data science utilities:
- Interactive Data Exploration, Analysis and Reporting (IDEAR), and
- Automated Modeling and Reporting (AMAR).
These two utilities, which run in CRAN-R, can be accessed from this GitHub site.
Read more about these utilities here.