Improving your English: reinventing subtitles

1. Intro

- Tatiana L., could we maybe watch this movie with subtitles?
- No, juvenile woodpeckers, we are training your listening comprehension, so you will watch the movies without them! With subtitles you'll just read the text and not listen.
- But Tatiana L., without subtitles we don't understand more than half of it!
- But that is your problem.
Early 2000s, a conversation with a teacher at a specialized French-language school, St. Petersburg

2. What's the problem?
TV shows and movies are a great way to improve your English. You already know the grammar and have a decent vocabulary. It is still too early to hold a free conversation with native speakers, and practice tests and exercises are boring. So you start watching movies and TV shows.

You watch and watch. Everything seems clear and understandable, but then two characters launch into a rapid dialogue, of which you catch only the prepositions. Okay, you turn on the subs. And they solve the problem: you begin to understand what is happening.

However, after watching a few videos with subs, people often notice two things.
  • Instead of training your listening, you become a master of speed-reading subs in a foreign language. You now understand a phrase quickly just by looking at it, but your listening comprehension barely improves. Turn the subs off, and you again stop understanding what is happening in some scenes. Our school teacher Tatiana L. was right to forbid us from watching French movies with subs: the "juvenile woodpeckers" really did not make progress in listening or in thinking in the language.
  • Some parts of the film remain completely incomprehensible because they contain difficult words. «I can not jeopardize my company's success»? I'm sorry, what? Jeopardize? Okay, Google, I'll put the movie on pause, and you tell me what it means.

There are people who suggest watching movies with subtitles in two languages, English and Russian. That quickly makes you an absolute champion at speed-reading subs in two languages, but contributes little to listening comprehension or to thinking in the language.

Without subs nothing is clear, and with subs your progress in listening comprehension stalls and... things are still not clear.

3. Now what?

In this screenshot from "South Park" there are 7 words. Six of them are familiar to almost every student of English, and can well be recognized and understood even when spoken quickly and with an accent. That leaves one word which (with high probability) will be a problem: weary, meaning tired or worn out.

  • This word is not that common, and chances are good that you will not recognize it by ear. It would be right to show its translation on the screen. Otherwise you either have to get distracted and look it up in a dictionary, or simply give up and stop watching.
  • The rest of the words can be thrown out. Almost everyone knows them, so there is no need to show them on the screen.

If we apply this logic to the rest of the scene, we get subs that contain only the difficult words, while everything else we have to listen to and understand.

As it turned out, this idea is not new. A quick round of googling showed that at least a few bloggers have written articles with the same idea, but they suggested adapting the subtitles by hand. And we, being geeks, will adapt the subs automatically, in software!

4. Building the bicycle
The problem comes down to finding the difficult words in the text, the ones that need to be translated.

The basic idea is that you can analyze a very large number of English texts, compute word usage statistics, and see that some words are used much less frequently than others. These rare words are what falls under the notion of a "difficult word": they are rare, so you are unlikely to know their translation or spelling.
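As a rough illustration of this idea, here is a minimal sketch that counts word occurrences in a tiny made-up corpus and flags words below a frequency threshold. The corpus, the `min_count` threshold, and the function names are assumptions for the example; the real project uses far more data and its own decision rule.

```python
import re
from collections import Counter

# Toy corpus standing in for the large text collection (invented for illustration).
CORPUS = [
    "I am tired and I want to go home",
    "We are tired of the long road home",
    "I want to see the road",
    "The weary traveller wants to go home",
]

WORD_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return WORD_RE.findall(text.lower())


def build_frequencies(texts):
    """Count how many times each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def is_difficult(word, counts, min_count=2):
    """Treat a word as difficult when it occurs fewer than min_count times.

    The threshold is an arbitrary assumption; unseen words count as zero.
    """
    return counts[word.lower()] < min_count


counts = build_frequencies(CORPUS)
for word in ("home", "tired", "weary", "jeopardize"):
    print(word, counts[word], is_difficult(word, counts))
```

With a corpus this small the threshold is meaningless in practice; the point is only that "rare in the statistics" is used as a stand-in for "probably unknown to the learner".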

I worked on all of this as a hobby after work (by the way, here's an article about how it all started). It all grew into the Bamboo Ninja project, which can analyze books in English, find the difficult words in them, insert translations, and assemble the book again. Subtitles are text too, so I will take that work and apply it to subtitles.

We open the subs, split them into pieces, then into individual words, and start the analysis. For each word we need to solve a binary classification problem: run the word through an algorithm that returns 1 or 0, that is, whether the word is simple or difficult for a learner of English. The classifier makes its decision based on statistics obtained from analyzing ~40 GB of text from various sources (in general, the data really ought to be collected from very different sources: chat logs, news, song lyrics. But I was too lazy and used mostly the text of books, more on that later).
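The pipeline described above can be sketched roughly as follows, assuming the subs are in SRT format. The sample subtitle text, the `SIMPLE_WORDS` set, and the `GLOSSARY` are invented stand-ins: the actual classifier relies on corpus statistics and a database, not a hard-coded word list.

```python
import re

# Hypothetical stand-ins for the statistical classifier and the dictionary.
SIMPLE_WORDS = {"i", "am", "so", "of", "all", "this", "can", "not", "my", "success"}
GLOSSARY = {"weary": "tired", "jeopardize": "put at risk"}

# Made-up sample in SRT format (an assumption about the input format).
SRT_SAMPLE = """1
00:00:01,000 --> 00:00:03,000
I am so weary of all this

2
00:00:04,000 --> 00:00:06,500
I can not jeopardize my success
"""


def parse_srt(text):
    """Split SRT text into (index, timing, lines-of-text) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3:
            cues.append((lines[0], lines[1], lines[2:]))
    return cues


def classify(word):
    """Return 1 if the word is considered difficult, 0 if it is simple."""
    return 0 if word.lower() in SIMPLE_WORDS else 1


def adapt_line(line):
    """Keep only difficult words, each followed by its translation if known."""
    kept = []
    for word in re.findall(r"[A-Za-z']+", line):
        if classify(word):
            hint = GLOSSARY.get(word.lower())
            kept.append(f"{word} ({hint})" if hint else word)
    return ", ".join(kept)


def adapt_srt(text):
    """Rebuild the SRT, dropping cues that contain no difficult words."""
    out = []
    for index, timing, lines in parse_srt(text):
        adapted = [a for a in (adapt_line(line) for line in lines) if a]
        if adapted:
            out.append("\n".join([index, timing, *adapted]))
    return "\n\n".join(out) + "\n"


print(adapt_srt(SRT_SAMPLE))
```

Keeping the original timing lines means the adapted file can be loaded by a player in place of the full subtitles; cues without difficult words are simply omitted.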

Then comes a certain amount of fiddling with the database and writing code, and we get subs that look like this

5. Riding the bicycle we built
I ran 3-4 dozen subs through the program and evaluated the metric values reported by the analyzer. I tried watching movies with the result. I showed it to friends, acquaintances, and visitors.

To evaluate the results, I used two classical metrics for machine learning tasks:
  • Precision: the ability to classify a word correctly
  • Recall: the ability to find all the words that need to be translated

It turned out that the metric values tend to jump from film to film. In some films, recall and precision reached the desired 85-90%, while in others they were around 55%. Digging into the problem, I found the cause: a large part of the data for the statistical analysis came from fiction of the last 300 years, and some words occur in it more often than they do in modern English. For example, the word bayonet was far more common back then than it is now, so our classifier says the word is not that rare.

Colin, my friend from Britain, laughed for a long time and said that the expression "beef bayonet" is very common among the military, but we will not count that case.
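For reference, the two metrics used above can be computed along these lines. This sketch uses the standard textbook definitions over the set of words flagged as difficult, which may differ in detail from how the analyzer computes them; the word sets are made up for the example.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for the set of words flagged as difficult.

    precision = correctly flagged / all flagged
    recall    = correctly flagged / all words that should have been flagged
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: words the classifier flagged vs. words a human marked.
flagged = {"weary", "jeopardize", "company", "gauntlet"}
marked = {"weary", "jeopardize", "gauntlet", "bayonet", "shenanigans"}

p, r = precision_recall(flagged, marked)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this invented example "company" is a false alarm, which lowers precision, and "bayonet" is missed, which lowers recall, much like the skew from old literature described above.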

I decided to roll back to the old version of the classifier, which I had used a few months earlier. It was built back in the summer from just 500 large books, but the books in that sample were more diverse: "Harry Potter", "A Song of Ice and Fire", technical documentation for programmers, books on psychology, medicine, and more. The classifier with a smaller but more diverse body of data proved to be an order of magnitude better than the one built on English literature alone. The word recognition algorithm now makes far fewer mistakes.

This result is broadly in line with the goal, but the algorithm still produces subs suited to someone with considerable experience in English. You need a certain skill in understanding speech by ear and a solid vocabulary of several thousand basic words. In that case, the subs will serve you well in improving your English.

I wrapped up all my experiments and attached them to my hobby site, and added a small library of subs for those who want to try the thing on the spot.

6. Outro
Turning TV series watching into a learning process instead of mindless reading from the screen seems like a worthwhile task. And improving the algorithm will usefully fill many more evenings.

Thanks, everyone! Enjoy your movies, and good luck with your English.


