Using Data Mining to glimpse into our brain’s workings

Frequency distribution of words in different languages, based on 10 million words from 30 Wikipedias. The curves are very similar. Image credit: Wikipedia

Back in 1935 George Zipf, an American linguist, made an interesting observation: if we rank the words of American English by how often they occur, we discover that the most frequent word (the) is followed by a set of words with approximately half that frequency (be, and, a), and these in turn are followed by a set of words with a quarter of that frequency, and so on. Of course this is a coarse approximation, but it is still rather puzzling.

This kind of distribution is known as Zipf’s law: the frequency of a word is roughly inversely proportional to its rank.
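To make the idea concrete, here is a minimal sketch of how a rank-frequency table can be built and compared with an idealized curve. The function names (`rank_frequency`, `zipf_prediction`) and the toy sentence are my own illustrations, not taken from any study, and the prediction assumes the simplest form of the law (top count divided by rank). A real check would need a corpus of millions of words.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def zipf_prediction(top_count, rank):
    """Count an idealized Zipf curve (exponent 1, assumed) gives at this rank."""
    return top_count / rank


# Toy input, invented for illustration only.
sample = "the cat and the dog and the bird saw the fox"
table = rank_frequency(sample)
top_count = table[0][2]

for rank, word, count in table[:3]:
    print(rank, word, count, round(zipf_prediction(top_count, rank), 2))
```

On a text this short the match is accidental at best; the point is only to show what "ranking words by frequency" means in practice.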

More recently, a study at the Communication University of China in Beijing used data mining techniques to explore 50 languages with very different roots and, surprisingly, they all show the same distribution. Hence the cause should be something more profound than the structure of a language (different languages have different structures), and the only common factor among those languages is that they are “produced” by brains. This seems to suggest that brains share a common approach to processing (and uttering) language, resulting in this kind of distribution.

Indeed, recent observations of how the brain processes language (which overlaps with how the brain thinks!) show that there are two processing approaches to language (and thinking): a fast one and a slower-paced one. The fast one gives rise to the general structure of sentences, a sort of scaffolding composed of those frequently used words; the other fills in the gaps with more thoughtful words.

This also applies to the way we understand sentences. Take for example the question:

“A bat and a ball cost $1.10. Knowing that the bat costs $1 more than the ball, how much does the ball cost?”

Our fast-processing brain comes up with the answer 10c: why, it is obvious, 10c plus $1 makes $1.10. Unfortunately, this is wrong. With this answer the bat would cost $1, only 90c more than the ball. Our slower-paced brain then steps in and provides the correct answer: 5c. Indeed, if the ball costs 5c, the bat, costing $1 more, costs $1.05, so together they cost $1.10.
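The slow-brain reasoning can be written out as a tiny calculation. This is just a sketch of the arithmetic, working in whole cents to avoid rounding issues; the function name `ball_price_cents` is hypothetical.

```python
def ball_price_cents(total_cents, difference_cents):
    """Solve ball + (ball + difference) = total for the ball's price."""
    # Rearranged: 2 * ball = total - difference.
    # Integer division assumes the result is a whole number of cents.
    return (total_cents - difference_cents) // 2


total = 110       # bat and ball together: $1.10
difference = 100  # the bat costs $1 more than the ball

ball = ball_price_cents(total, difference)
bat = ball + difference
print(ball, bat, ball + bat)

# The intuitive answer fails the same check: a 10c ball implies a 110c bat.
intuitive_ball = 10
intuitive_total = intuitive_ball + (intuitive_ball + difference)
print(intuitive_total)
```

The second check shows where the fast answer goes wrong: with a 10c ball the pair would cost more than the stated total.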

I guess you can remember plenty of situations where the answer to a question just popped up immediately, but it was the wrong one. This is not because we are shallow or don’t think enough. It is because evolution took care of developing a fast brain, able to react immediately to dangerous situations and be right most of the time. Humans, unlike most other species, also developed a slow-thinking brain that can create abstractions and run a sort of simulation on those abstractions, coming up with a more appropriate solution.

It is interesting to see how data mining can cast some light on our brain’s inner workings. It shouldn’t come as a surprise, though. Our language, like our actions, is the result of the brain’s processing activity, and by analysing those results, through a kind of reverse engineering, we can shed more light on those processes.

About Roberto Saracco

Roberto Saracco fell in love with technology and its implications a long time ago. His background is in math and computer science. Until April 2017 he led the EIT Digital Italian Node and then was head of the Industrial Doctoral School of EIT Digital up to September 2018. Previously, up to December 2011, he was the Director of the Telecom Italia Future Centre in Venice, looking at the interplay of technology evolution, economics and society. At the turn of the century he led a World Bank-Infodev project to stimulate entrepreneurship in Latin America. He is a senior member of IEEE, where he leads the New Initiative Committee and co-chairs the Digital Reality Initiative. He is a member of the IEEE in 2050 Ad Hoc Committee. He teaches a Master’s course on Technology Forecasting and Market Impact at the University of Trento. He has published over 100 papers in journals and magazines and 14 books.