Techniques for Automated Data Cleaning and Analysis
Processing and Understanding Report Files in Directories
Written by Sai Gattupalli
The importance of data integrity and cleanliness cannot be overstated. Every dataset, especially an educational dataset built from live student performance reports, contains noise, discrepancies, or formatting issues. Addressing these problems manually can be a time-consuming and error-prone process. Here, I discuss simple techniques and methods to streamline the process of cleaning, formatting, and understanding data without intensive manual intervention.
The main steps include:
Organizing the directory, handling encoding and parsing, then cleaning the data and saving the processed files to a predefined directory. For demonstration purposes, I supplied a zip archive of an existing dataset containing adult learner assessment records from the Massachusetts Adult Proficiency Tests – College and Career Readiness (MAPT-CCR), a program spearheaded and maintained by UMass Amherst EDUC Distinguished Professor Stephen Sireci and Research Professor April Zenisky of the department of Research, Educational Measurement, & Psychometrics (EPRA).
In simpler terms, these records are pivotal to understanding the proficiency and readiness of adult learners for college and career opportunities, yet they can contain inconsistencies due to the sheer number of entries and the diverse sources from which they are collated. The technique described here helps ensure that the records are accurate, consistent, and ready for in-depth analysis.
First, we will clean the report files so that the column labels parse correctly.
import os
import pandas as pd

def clean_report_files(directory):
    # Create a new "Cleaned" folder inside the specified directory
    cleaned_directory = os.path.join(directory, "Cleaned")
    if not os.path.exists(cleaned_directory):
        os.makedirs(cleaned_directory)

    # List all "Report" .tsv files in the directory
    files = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f)) and "Report" in f and f.endswith('.tsv')]

    # Process each "Report" file
    for file in files:
        file_path = os.path.join(directory, file)

        # Attempt to read with UTF-8 first
        try:
            df = pd.read_csv(file_path, sep="\t", encoding="utf-8", skiprows=5)
        except Exception as utf8_err:
            # If UTF-8 fails, fall back to ISO-8859-1
            try:
                df = pd.read_csv(file_path, sep="\t", encoding="ISO-8859-1", skiprows=5)
            except Exception as iso_err:
                print(f"Error parsing file: {file}. UTF-8 error: {utf8_err}. ISO-8859-1 error: {iso_err}. Skipping...")
                continue

        # Save the cleaned data to the new folder
        cleaned_file_path = os.path.join(cleaned_directory, file)
        df.to_csv(cleaned_file_path, sep="\t", index=False)

directory = "location"
clean_report_files(directory)
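A quick note on the labels themselves: the code above strips the preamble rows so the column labels parse correctly, but it does not rename anything. If the labels also need tidying, a small helper could normalize them right after a successful read. This is a minimal sketch; the specific rules (trimming whitespace, lowercasing, replacing spaces with underscores) are assumptions rather than MAPT-CCR requirements.

def normalize_labels(df):
    # Hypothetical label cleanup: trim whitespace, lowercase,
    # and replace spaces with underscores in column names
    return df.rename(columns=lambda c: str(c).strip().lower().replace(" ", "_"))

# Example use inside the loop above, right after a file is read:
# df = normalize_labels(df)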
Stage 2 is validation, where I list file metadata for manual verification. Once a human has validated the file structure, the next steps of processing can proceed.
def list_files_and_sizes(directory):
    # Make sure the location exists
    if not os.path.exists(directory):
        print(f"The directory {directory} does not exist.")
        return

    # List all files - for manual validation
    files = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]

    # Verify the automated count against a human count
    print(f"Total number of files: {len(files)}\n")

    # Print file names and their sizes
    for file in files:
        file_path = os.path.join(directory, file)
        size = os.path.getsize(file_path)
        print(f"File: {file}\tSize: {size} bytes")

directory = "path/Cleaned"
list_files_and_sizes(directory)
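If the manual check passes, a lightweight structural summary can make the validation a little more concrete. The sketch below assumes every file in the Cleaned folder is a tab-separated file with a single header row; the function name and path are illustrative only.

def summarize_cleaned_tsvs(directory):
    # Print row and column counts for each cleaned TSV so a
    # human reviewer can spot files with an unexpected shape
    for file in sorted(os.listdir(directory)):
        file_path = os.path.join(directory, file)
        if not (os.path.isfile(file_path) and file.endswith(".tsv")):
            continue
        df = pd.read_csv(file_path, sep="\t")
        print(f"File: {file}\tRows: {len(df)}\tColumns: {len(df.columns)}")

summarize_cleaned_tsvs("path/Cleaned")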
Until next time.