As part of a summer job, I had to enter data into Google Forms that populated a sheet. From time to time, the team needed to get an idea of how far we had come, and each time I would reply with the number of rows filled so far.
But I felt this was not enough for my boss to get an idea of the progress. I also wanted a spatial overview of the sheet, instead of scrolling many rows down and across each time to understand the progress and the gaps. Each field had its own rules for the accuracy and completeness of the data entered, and a row count did not convey this quality and accuracy perspective either.
Every data point we were collecting was crucial to the project's progress. The data portal we were building would display each record and its description as a separate entity, so each record can itself have many metrics. Hence, the question "How far have we come?" translates to multiple details that would then help in adjusting our data collection operations. This information was missing when I said we had X entries so far. A few questions that need to be answered are:
1. How many entries have we covered?
2. How many gaps do we have within each entry?
3. How many gaps do we have overall?
4. Where are these gaps highest?
5. Where are the gaps that need immediate attention?
6. How sure are we of the accuracy of each data point?
7. How many of these are below tolerance level?
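The first four questions can be answered by counting blanks per row and per column. Below is a minimal sketch, assuming the sheet has been exported and read into a list of dictionaries (the shape `csv.DictReader` produces). The field names, the sample rows, and the helpers `is_gap` and `gap_report` are hypothetical and only serve as illustration; they are not the actual project data.

```python
# Hypothetical sample rows, shaped like csv.DictReader output from an exported sheet.
rows = [
    {"name": "Site A", "address": "12 Main St", "phone": ""},
    {"name": "Site B", "address": "", "phone": ""},
    {"name": "Site C", "address": "4 Oak Ave", "phone": "555-0100"},
]


def is_gap(value):
    """Treat None or whitespace-only text as a gap (an assumption about the data)."""
    return value is None or str(value).strip() == ""


def gap_report(rows):
    """Count entries, gaps per entry, gaps overall, and gaps per field."""
    fields = list(rows[0].keys()) if rows else []
    gaps_per_entry = [sum(is_gap(r.get(f)) for f in fields) for r in rows]
    gaps_per_field = {f: sum(is_gap(r.get(f)) for r in rows) for f in fields}
    return {
        "entries": len(rows),
        "gaps_per_entry": gaps_per_entry,
        "gaps_overall": sum(gaps_per_entry),
        "gaps_per_field": gaps_per_field,
        # Field with the most blanks, i.e. where the gaps are highest.
        "worst_field": max(gaps_per_field, key=gaps_per_field.get) if fields else None,
    }


report = gap_report(rows)
print(report)
```

Questions about accuracy and tolerance need per-field rules on top of this, so they are left out of the sketch.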
The following factors were crucial for data cleaning.
1. Have as many records as possible
2. Complete information in each record
3. Accuracy of each field. In some fields, leave the answer blank if not 100% sure of it.
4. Tolerate gray area in certain other fields.
5. Are there empty records?
6. Is there gibberish?
7. Are there test entries?
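The last three checks could be automated with simple flags per record. This is only a rough sketch: the "test" pattern and especially the gibberish rule (four or more letters with no vowel) are assumptions made up for illustration, and real data would likely need field-specific rules. The names `flag_record`, `looks_like_gibberish`, and `TEST_PATTERN` are hypothetical.

```python
import re

# Assumption: test entries contain the word "test" or "testing" somewhere.
TEST_PATTERN = re.compile(r"\btest(ing)?\b", re.IGNORECASE)


def is_blank(value):
    """None or whitespace-only text counts as blank."""
    return value is None or str(value).strip() == ""


def looks_like_gibberish(value):
    """Crude heuristic (an assumption): 4+ letters and not a single vowel."""
    letters = re.sub(r"[^A-Za-z]", "", str(value))
    return len(letters) >= 4 and not re.search(r"[aeiouyAEIOUY]", letters)


def flag_record(record):
    """Return a list of flags ('empty', 'test', 'gibberish') for one record."""
    values = list(record.values())
    filled = [v for v in values if not is_blank(v)]
    flags = []
    if not filled:
        flags.append("empty")
    if any(TEST_PATTERN.search(str(v)) for v in filled):
        flags.append("test")
    if any(looks_like_gibberish(v) for v in filled):
        flags.append("gibberish")
    return flags


print(flag_record({"name": "Testing entry", "phone": ""}))
```

Flagged records would still need a manual look, since a heuristic this simple will miss some gibberish and may flag legitimate abbreviations.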
Initially, I made a heatmap of the spreadsheet to show which fields were empty and which had any kind of information.
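The idea behind such a heatmap is a presence matrix: one cell per field per record, 1 if anything was entered and 0 if the cell is empty. The sketch below is not the original heatmap; it is a standard-library approximation that prints the matrix as a text grid, with hypothetical sample rows and helper names (`presence_matrix`, `render`). The same matrix could be handed to a plotting library to colour the cells.

```python
# Hypothetical sample rows standing in for the exported sheet.
rows = [
    {"name": "Site A", "address": "12 Main St", "phone": ""},
    {"name": "Site B", "address": "", "phone": ""},
    {"name": "Site C", "address": "4 Oak Ave", "phone": "555-0100"},
]
fields = ["name", "address", "phone"]


def presence_matrix(rows, fields):
    """Build a rows x fields grid: 1 where a cell has any content, 0 where it is empty."""
    matrix = []
    for record in rows:
        value_row = []
        for field in fields:
            value = record.get(field)
            empty = value is None or str(value).strip() == ""
            value_row.append(0 if empty else 1)
        matrix.append(value_row)
    return matrix


def render(matrix, filled="#", empty="."):
    """Draw the grid as text: one line per record, one character per field."""
    return "\n".join("".join(filled if cell else empty for cell in row) for row in matrix)


matrix = presence_matrix(rows, fields)
grid = render(matrix)
print(grid)
```

Each printed line is one record and each column is one field, so runs of `.` show where the gaps cluster without scrolling through the sheet.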
I will be expanding this into an interactive visualization that can answer some or all of the above questions.