This book examines the role of data in modern machine learning, with emphasis on the preparation and analysis of publicly available benchmark datasets. It presents a systematic treatment of data acquisition, inspection, preprocessing, and evaluation within supervised learning frameworks. Key topics include statistical characterization of datasets, data normalization, categorical encoding, treatment of class imbalance, feature selection and extraction, dimensionality reduction, and model evaluation metrics. Computational examples are written in Python, with implementations based on scikit-learn and PyTorch. The text is intended for students and practitioners seeking a rigorous yet practical foundation in data management for machine learning, with applications to both classification and regression problems.
Machine Learning: Introducing Available Datasets with Python