Using Data Mining to glimpse our brain's workings

Roberto Saracco, December 28, 2018

Frequency distribution of words in different languages, based on 10 million words from 30 Wikipedias. The distributions are pretty similar. Image credit: Wikipedia

Back in 1935 George Zipf, an American linguist, made an interesting observation: if we rank the words of US English by how often they occur, the most frequent word (the) is followed by words (be, and, a) whose frequency falls off roughly in inverse proportion to their rank. The second most frequent word appears about half as often as the first, the third about a third as often, the fourth about a quarter as often, and so on. Of course this is a coarse approximation, but it is still rather puzzling.

This kind of distribution is known as Zipf’s law.
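To make the idea concrete, here is a minimal Python sketch of how one could check a text against this law. Zipf's law is usually stated as frequency being roughly proportional to 1/rank. The function names (rank_frequency, zipf_prediction), the simple tokenising rule and the toy sentence are my own assumptions for illustration, not something taken from Zipf or from the study mentioned below.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Assumption: a "word" is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts, start=1)]


def zipf_prediction(top_count, rank):
    """Count expected at a given rank if frequency falls off as 1/rank."""
    return top_count / rank


# Toy sample only: a real check needs a corpus of millions of words.
sample = "the cat and the dog and the bird saw the fox"
table = rank_frequency(sample)
top_count = table[0][2]

for rank, word, count in table[:3]:
    # Print the observed count next to the 1/rank prediction.
    print(rank, word, count, zipf_prediction(top_count, rank))
```

On a sentence this short the match is mostly luck. The pattern only becomes meaningful on large collections of text, such as the Wikipedia samples behind the image above.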

More recently, a study at the Communication University of China in Beijing used data mining techniques to explore 50 languages with very different roots and, surprisingly, found that they all show the same distribution. Hence, the cause should be something more profound than the structure of any one language (different languages have different structures), and the only common factor among those languages is that they are “produced” by brains. This seems to indicate that brains share a common approach to processing (and uttering) language, which results in this kind of distribution.

Indeed, recent observations of how the brain processes language (which overlaps with how the brain thinks!) show that there are two processing approaches to language (and thinking): a fast one and a slower-paced one. The fast one gives rise to the general structure of sentences, a sort of scaffolding composed of those frequently used words; the slow one fills in the gaps with more thoughtful words.

This also applies to the way we understand sentences. Take for example the question:

“A bat and a ball cost $1.10 in total. Knowing that the bat costs $1 more than the ball, how much does the ball cost?”

Our fast-processing brain comes up with the answer 10c. Why, it is obvious: 10c plus $1 makes $1.10. Unfortunately, this is wrong. With this proposed solution the bat would cost $1, only 90c more than the ball. Our slower-paced brain then steps in and provides the correct answer: 5c. Indeed, if the ball costs 5c, the bat, costing $1 more, costs $1.05, hence together they cost $1.10.
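For readers who like to see the slow path written out, the arithmetic can be sketched as a small equation. Here b is a symbol introduced just for this illustration: the price of the ball in dollars, so the bat costs b + 1.

```latex
\begin{align*}
b + (b + 1) &= 1.10 \\
2b &= 0.10 \\
b &= 0.05
\end{align*}
```

So the ball costs 5c and the bat $1.05. The tempting answer b = 0.10 fails the same check, since 0.10 + 1.10 = 1.20.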

I guess you can remember plenty of situations where the answer to a question just popped up immediately, but it was the wrong one. This is not because we are shallow or don’t think enough. It is because evolution took care of developing a fast brain, able to react immediately to dangerous situations and be right most of the time. Humans, unlike most other species, also developed a slow-thinking brain that can create abstractions and run a sort of simulation on them, coming up with a more appropriate solution.

It is interesting to see how data mining can cast some light on our brain’s inner workings. It shouldn’t come as a surprise, though. Our language, like our actions, is the result of the brain’s processing activity, and by analysing those results, through a kind of reverse engineering, we can shed more light on those processes.