@DietrichEpp disagreed completely.
If you want a machine to learn then you have to understand the difference between data and knowledge. Stats classes don’t normally cover this.
So there are at least two questions here. Firstly, how much do you really have to understand in order to build a machine? As I see it, getting a machine to do something (including learning) counts as engineering rather than science. Engineering requires two kinds of knowledge - practical knowledge (how to reliably, efficiently and safely produce a given outcome) and socio-ethical knowledge (whom shall the technology serve). Engineers are generally not expected to fully understand the scientific principles that underpin all the components, tools and design heuristics that they use, but they have a professional and ethical responsibility to have some awareness of the limitations of these tools and the potential consequences of their work.
In his book on Design Thinking, Peter Rowe links the concept of design heuristic to Gadamer's concept of enabling prejudice. Engineers would not be able to function without taking some things for granted.
So the second question is - which things can/should an engineer trust? Most computer engineers will be familiar with the phrase Garbage In Garbage Out, and this surely entails a professional scepticism about the quality of any input dataset. Meanwhile, statisticians are trained to recognize a variety of potential causes of bias. (Some of these are listed in the Wikipedia entry on statistical bias.) Most of the statistics courses I looked at on Coursera included material on inference.
Looking for relevant material to support my position, I found some good comments by Ariel Guersenzvaig, reported by Derek du Preez.
Unbiased data is an oxymoron. Data is biased from the start. You have to choose categories in order to collect the data. Sometimes even if you don’t choose the categories, they are there ad hoc. Linguistics, sociologists and historians of technology can teach us that categories reveal a lot about the mind, about how people think about stuff, about society.
And arriving too late for this Twitter discussion, two more stories of dataset bias were published in the last few days. Firstly, following an investigation by Vinay Prabhu and Abeba Birhane, MIT has withdrawn a very large image dataset, which has been widely used for machine learning, and asked researchers and developers to delete it. And secondly, FiveThirtyEight has published an excellent essay by Mimi Ọnụọha on the disconnect between data collection and meaningful change, arguing that it is impossible to collect enough data to convince people of structural racism.
So there are indeed some critical questions about data and knowledge that affect the practice of machine learning, and some critical insights from artists and sociologists. As for philosophy, famous philosophers from Plato to Wittgenstein have spent 2500 years exploring a broad range of abstract ideas about the relationship between data and knowledge, so you can probably find a plausible argument to support any position you wish to adopt. So this is hardly going to provide any consistent guidance for machine learning.
Mimi Ọnụọha, When Proof Is Not Enough (FiveThirtyEight, 1 July 2020)
Vinay Uday Prabhu and Abeba Birhane, Large Image Datasets: A pyrrhic win for computer vision? (Preprint, 1 July 2020)
Derek du Preez, AI and ethics - ‘Unbiased data is an oxymoron’ (Diginomica, 31 October 2019)
Katyanna Quach, MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs: Top uni takes action after El Reg highlights concerns by academics (The Register, 1 July 2020)
Peter Rowe, Design Thinking (MIT Press 1987)
Stanford Encyclopedia of Philosophy: Gadamer and the Positivity of Prejudice
Wikipedia: Algorithmic bias, All models are wrong, Bias (statistics), Garbage in garbage out
Further points and links in the following posts: Faithful Representation (August 2008), From Sedimented Principles to Enabling Prejudices (March 2013), Whom does the technology serve? (May 2019), Algorithms and Auditability (July 2019), Algorithms and Governmentality (July 2019), Naive Epistemology (July 2020)