What can (theoretical) linguistics do for NLP research?

This was the opening question of the round-table I had the pleasure of taking part in, alongside researchers of the calibre of Dr. Carlos Periñán of the Universidad Politècnica de València, founder and director of FunGramKB; Dr. Brian Nolan, head of the Department of Informatics and Creative Digital Media at the Institute of Technology Blanchardstown in Dublin; and Dr. Elke Diedichsen, also of the Institute of Technology Blanchardstown and former speech project manager at Google. The session was hosted by Gianluca Pontrandolfo of the University of Trieste.

Here is the transcript of my intervention which, although it stirred up a bit of controversy, seems to have touched a nerve that many of us share: the lack of cooperation and collaboration between linguists and engineers.

Thank you, Gianluca, for your presentation. I would also like to thank the organisers, and especially Carlos Periñán, for inviting me to participate and share this round-table with such high-level researchers. It is a pleasure.

In order to give you my two cents with regard to the open questions of this round-table, I would like to tell you the brief story of what we do in our company.

As Gianluca said, my name is Francisco Rangel and I am the Chief Technical Officer of Autoritas.

Autoritas is a growing company where experts in different areas – psychologists, sociologists, economists, computer scientists and, occasionally, linguists – help organisations to incorporate into their business, into their intelligence cycle, the social knowledge collected from millions and millions of social media conversations.

My role in the organisation is to provide my colleagues with the appropriate tools for retrieving, processing and analysing these millions of conversations. And, of course, to do it in real time.

As you can imagine, I have to deal with several different problems, but I like to organise them into three main components: big data, machine learning and natural language problems.

Big data, because we have to analyse a huge amount of data – actually, a large amount of heterogeneous and unstructured data – and we have to do it in real time.

Machine learning, because we want machines to work for us, and not the other way around, so we must teach the machine to do things automatically.

And natural language, because conversations in social media are written in natural language – that is, our language, the human language.

Autoritas operates in the UK, Spain and several countries in Latin America, including Brazil. Our target languages are, so far, Spanish, English, Portuguese, Italian, German and French. So the problem of dealing with natural language is multiplied.

You probably know that most natural language processing techniques are based on statistics. At least, they are based on statistically learning a representational model.

But although statistics-based natural language techniques work quite well for many different tasks – such as opinion mining, plagiarism detection or even author profiling – they fall short when we want to obtain deeper knowledge about the meaning of what is happening.
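To make the idea of "statistically learning a representational model" concrete, here is a minimal sketch of a word-count polarity classifier in Python. It is a toy, not the system I describe: the training sentences are hypothetical, and real systems use far richer features, but it shows how the approach relies purely on word statistics, with no notion of meaning.

```python
from collections import Counter

# Toy training data -- hypothetical examples, not real customer data.
train = [
    ("great product love it", "pos"),
    ("awesome service very happy", "pos"),
    ("terrible experience hate it", "neg"),
    ("awful product very disappointed", "neg"),
]

# Count word frequencies per polarity class.
counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(text.split())

def polarity(text):
    """Pick the class whose word statistics best match the input
    (naive Bayes-style scoring with add-one smoothing)."""
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        score = 1.0
        for word in text.split():
            score *= (c[word] + 1) / (total + len(c))
        scores[label] = score
    return max(scores, key=scores.get)

print(polarity("love this awesome product"))  # pos
print(polarity("terrible awful service"))     # neg
```

The classifier only knows which words co-occur with which label; it has no representation of what any sentence actually says, which is exactly where the limitation I describe next comes from.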

Furthermore, there is another factor. Let me highlight a general problem of perception. The quality of a natural language processing approach is perceived very differently depending on the point of view: I like to call it the eternal gap between academia and industry.

For example, a sentiment analysis system – or, more specifically, a polarity detection system – with 80% accuracy is a very good system in academia, because it is at the top of the state of the art. But it is usually perceived as a very poor-quality system in industry.

Why? Because when the customer goes into detail, looking at specific mentions and the polarity the system assigned to them, they usually find errors. Actually, a 20% error rate is high enough to find errors at first sight. And this is the problem: if the customer finds more than one error, or several consecutive errors, they distrust the tool and complain about its lack of credibility.

And this is the problem with statistical approaches: they work very well with big numbers – analysing 1 million documents at 80% accuracy means having 800,000 documents correctly categorised, something impossible to do manually – but when we go into the details, they cannot explain certain results, certain behaviours.

I would like to give you another real example. We analysed the visit of Pope Francis to Brazil. We analysed millions of documents and provided several different insights that were very valuable to our customer. But one of the most important things we had to do was to monitor the possibility of terrorist attacks.

For that, we monitored all kinds of mentions of trigger words – for example, the word bomb. The machine can do this task really well and really fast, and this information is very valuable per se. But without linguistic knowledge, without a linguistic background, it is impossible to distinguish between the meaning of the sentence “it’s going to be a bomb”, referring to the visit of the Pope to Brazil, and the meaning of the sentence “I’m going to put a bomb”.
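The contrast can be sketched in a few lines of Python. This is a deliberately crude illustration, assuming a hypothetical trigger word and verb list (not our real pipeline, and not a real parser): pure keyword spotting flags both sentences, while even a rudimentary syntactic cue – which verb governs the trigger word – separates the idiom from the threat.

```python
# Hypothetical list of verbs that make a "bomb" mention threatening.
THREAT_VERBS = {"put", "plant", "place", "detonate"}

def mentions_trigger(sentence, trigger="bomb"):
    """Fast keyword spotting: flags every mention, threat or not."""
    return trigger in sentence.lower()

def looks_like_threat(sentence, trigger="bomb"):
    """Crude linguistic filter: does a threat verb accompany the trigger?
    A real system would use syntactic analysis, not word lookup."""
    words = sentence.lower().split()
    return trigger in words and any(v in THREAT_VERBS for v in words)

idiom = "it's going to be a bomb"
threat = "I'm going to put a bomb"

print(mentions_trigger(idiom), mentions_trigger(threat))    # True True
print(looks_like_threat(idiom), looks_like_threat(threat))  # False True
```

Keyword spotting alone cannot tell the two sentences apart; only the (here trivially approximated) linguistic analysis can. That approximation is precisely what breaks down without real linguistic expertise.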

So, the answer to the main question of this round-table is clear. YES, for sure: I need you, I need linguists in my daily work.

But then I would like to introduce another question. Why is there not more collaboration between computer scientists and linguists?

I guess this is due to a knowledge gap. I know linguists who do not want to work with computers – even less so if I talk to them about machine learning, features, statistics, representational models and so on. But I know even more computer scientists who are not able to do a simple syntactic analysis. They are not even able to identify the linguistic differences between the two sentences in my example.

So yes, there really is a knowledge gap.

But I believe that new degree and master's programmes are closing this gap. And I strongly believe that conferences and round-tables like this one are a step in the right direction.

But I also think there is another problem: a lack of communication. Incredible in the information age, but true. I think computer scientists and linguists have not – at least not yet – reached an agreement about what we need and what we want from each other.

And to finish, let me ask you two questions.

How many of you are technicians? (there were 18 people in the room and only four – three of them in the round-table – answered positively).

OK, even fewer than I expected.

I guess, then, that you are linguists. So, how many of you usually attend natural language processing, information retrieval or machine learning conferences? (No one apart from the previous four answered positively.)

So, maybe there we have the answer.

Thank you very much.
