# HW4rec v1.3

'''
With this program, we want to help a Dutch archeologist. She has recently found
  a collection of precious inscriptions in Ancient Greek and valuable texts in
  Italian. She wants to find passages that are in common between pairs of
  texts in different languages. She is fluent in Latin and English but
  not in Ancient Greek and Italian. However, she knows she can rely on our help!

To pursue her objective, the archeologist has retrieved two CSV files. In the
  first one, "lexicon_gr_en", some Ancient Greek words are translated into
  one or more English expressions (single words or short clauses),
  whenever a translation is available.

  For instance:
    "ἀραρίσκω;join;fit together"
  is a line in the file indicating that "ἀραρίσκω" translates to "join" or
  "fit together". Another line,
    "ἀπορρήσσω;[unavailable]"
  suggests the absence of a reliable translation.

  In the second CSV file, "lexicon_en_it", every English expression is
  translated into an Italian one: "join" translates to "unirsi" and "fit
  together" translates to "aderire". The correspondence between English and
  Italian expressions is one-to-one. Moreover, all English expressions in
  "lexicon_gr_en" occur in "lexicon_en_it", except those marked as
  "[unavailable]".

  In both CSV files, expressions are separated by a semicolon.

Notice that the Ancient Greek inscriptions are written in a rather particular
  way. The flow of the text is boustrophedon, that is, alternating
  lines of writing are flipped: first left-to-right, then right-to-left,
  then left-to-right again, and so on. The good news is, the glyphs of the
  characters are not mirrored. Furthermore, paragraphs are separated by multiple
  line-feeds (two or more). Single line-feeds are kept only to wrap lines.
  The end of the file also denotes the end of the last paragraph.
  For simplicity, (1) all letters are reported in lower case and (2) the
  punctuation symbols used are only line-feeds and the following:
    '.' (full stop) ',' (comma) ':' (colon) ' ' (white space) "'" (apostrophe)

  For example, a paragraph like:
    ἄνδρα μοι ἔννεπε, μοῦσα, πολύτροπον, ὃς μάλα πολλὰ
    πλάγχθη, ἐπεὶ τροίης ἱερὸν πτολίεθρον ἔπερσεν:
    πολλῶν δ' ἀνθρώπων ἴδεν ἄστεα καὶ νόον ἔγνω,
    πολλὰ δ' ὅ γ' ἐν πόντῳ πάθεν ἄλγεα ὃν κατὰ θυμόν,

  reads as follows (see the "odyssey.txt" file):

    ἄνδρα μοι ἔννεπε, μοῦσα, πολύτροπον, ὃς μάλα πολλὰ
    :νεσρεπἔ νορθείλοτπ νὸρεἱ ςηίορτ ὶεπἐ ,ηθχγάλπ
    πολλῶν δ' ἀνθρώπων ἴδεν ἄστεα καὶ νόον ἔγνω,
    ,νόμυθ ὰτακ νὃ αεγλἄ νεθάπ ῳτνόπ νἐ 'γ ὅ 'δ ὰλλοπ


The archeologist wants to find sequences of at least k > 0 words in
  the Ancient Greek text such that (1) the Ancient Greek words are in a
  single paragraph and (2) they correspond to sequences of at least k words
  in a paragraph of the Italian text, based on the given CSV files and
  ignoring punctuation marks. Notice that the Italian text is written
  left to right only and, for convenience, all letters are lowercase.
  Paragraphs in the Italian text are also separated by two or more line-feeds.

Design a function

    ex1(k, lexicon_gr_en_f, lexicon_en_it_f, greek_txt_f, italian_txt_f)

  that, given:
  - k: the minimum number of consecutive Ancient Greek words to be found
      in paragraphs of "greek_txt_f" whose translation in English corresponds
      to sequences of words in paragraphs of "italian_txt_f" (with k > 0)
  - lexicon_gr_en_f: the path to the lexicon text file translating Ancient Greek
      into English, as described above
  - lexicon_en_it_f: the path to the lexicon text file translating English into
      Italian, as described above
  - greek_txt_f: the path to the text file with an inscription in Ancient
      Greek, written according to the rules described above
  - italian_txt_f: the path to the text file with a text in Italian
  returns:
  - a set of pairs of tuples; the first tuple refers to the Ancient Greek text;
    the second tuple refers to the corresponding excerpt in the Italian one;
    each tuple indicates:
    1) the excerpt of the text containing the sequence of words whose
       translation in English matches the translation from the other language
       (with line-feeds replaced by white spaces, written only from left to
       right),
    2) the paragraph number (starting from 1) where the excerpt lies.

For example,
  ex1(2, "lexicon-GR-EN.csv", "lexicon-EN-IT.csv", "odyssey.txt", "proemio.txt")
  should return
  {(("ἔννεπε, μοῦσα", 1),
    ("dissi io, o musa", 1)),
   (("τῶν ἁμόθεν γε, θεά, θύγατερ διός", 2),
    ("di ciò, da qualunque principio, ad ogni costo, dea figlia di zeus", 3))
  }

  Notice that, in "lexicon-GR-EN.csv", the following lines occur (among others):
    ἔννεπε;said i
    μοῦσα;o muse
    τῶν;of these things
    ἁμόθεν;beginning at any stage
    γε;indeed;at least;at any rate
    θεά;goddess
    θύγατερ;daughter
    διός;of zeus
  in "lexicon-EN-IT.csv", we have:
    said i;dissi io
    o muse;o musa
    of these things;di ciò
    beginning at any stage;da qualunque principio
    at any rate;ad ogni costo
    goddess;dea
    daughter;figlia
    of zeus;di zeus
  the first paragraph of "odyssey.txt" is reported above, whereas the second
  one ends as follows:
    ἤσθιον: αὐτὰρ ὁ τοῖσιν ἀφείλετο νόστιμον ἦμαρ.
    ,εγ νεθόμἁ νῶτ
    θεά θύγατερ,
    .νῖμἡ ὶακ ὲπἰε ,ςόιδ
  the first paragraph of "proemio.txt" reads as follows:
    di donarmi il diluvio ti dissi
    io, o musa, scorgendo il destino.
  and the third paragraph of "proemio.txt" reads as follows:
    imperterrita irrefrenabile poiché
    memore di ciò, da qualunque principio,
    ad ogni costo, dea figlia di zeus,
    narrane cagione e spirito.

  Concluding remark: if two or more sequences as described above occur in a
  paragraph, they should all appear in the result. We are, however, not
  interested in inner subsequences. In the example above, for instance,
    (("θεά, θύγατερ διός", 2), ("dea figlia di zeus", 3))
  is not included in the solution.


NOTE: the timeout for this exercise is 3 seconds for each test.

WARNING: Make sure that the uploaded file is UTF-8 encoded
    (to that end, we recommend you edit the file with Spyder).
    No other files may be opened and no libraries may be imported.
'''


"""The idea suddenly came to me that I could try solving this problem by
creating a class Paragraph. The more logical part of me then asked
'Why would you hurt yourself like that?'
And as I'm not a masochist, I decided that he was right, and that I wouldn't
hurt myself any more than I have to."""

#Used inside the gr_paragraph_divider function
def gr_to_it(tr_paragraph, sentence, para, gr_to_eng, eng_to_it):
    '''For every greek word of sentence that has a usable translation, appends
    the tuple (greek_word, it_translation, ...) to tr_paragraph and the word
    itself, without trailing punctuation, to the plain-text paragraph para'''

    unavailable = ['[unavailable]']

    for elem in sentence:
        #look the word up as written first, then without its trailing mark;
        #testing the exact form first avoids a KeyError when a lexicon entry
        #itself ends with one of the punctuation marks
        if gr_to_eng.get(elem, unavailable) != unavailable:
            key, mark = elem, ''
        elif elem[-1] in ".,:'" and gr_to_eng.get(elem[:-1], unavailable) != unavailable:
            key, mark = elem[:-1], elem[-1]
        else:
            continue #not in the lexicon, or no reliable translation

        #first we add the greek word to the untranslated text
        para = (para + ' ' + key).lstrip()

        #then we add it to the translated text with all translations, keeping
        #the trailing mark; eng_to_it values are plain strings, so no
        #str()/strip() clean-up is needed
        translated_word = (elem,)
        for eng_trans in gr_to_eng[key]:
            translated_word += (eng_to_it[eng_trans] + mark,)
        tr_paragraph.append(translated_word)

    return tr_paragraph, para
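

# A minimal sketch of the two-step lookup on its own, assuming the lexicons
# are already loaded as dictionaries shaped like gr_to_eng and eng_to_it in
# ex1. The name italian_options is hypothetical and not part of the
# assignment; unlike gr_to_it, it only strips trailing punctuation and
# returns the Italian candidates, without touching any paragraph state.
def italian_options(word, gr_to_eng, eng_to_it):
    """Return the list of Italian expressions for one Greek token."""
    key = word if word in gr_to_eng else word.rstrip(".,:'")
    options = []
    for eng_trans in gr_to_eng.get(key, []):
        # '[unavailable]' is assumed to have no entry in the second lexicon,
        # so it is skipped here
        if eng_trans in eng_to_it:
            options.append(eng_to_it[eng_trans])
    return options

# Usage, with entries quoted in the assignment text:
#   italian_options('θεά,', {'θεά': ['goddess']}, {'goddess': 'dea'})
#   gives ['dea'], while a word mapped to ['[unavailable]'] gives [].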


def gr_paragraph_divider(greek_txt_f, gr_to_eng, eng_to_it):
    """Translates the text from greek to english to italian and divides it
    into paragraphs. Returns the translated paragraphs and the plain greek
    paragraphs"""

    translated_txt = [] #each elem will be a tr_paragraph
    counter = 0 #to see if sentence has to be reversed (never reset)
    tr_paragraph = [] #translated paragraph
    tr_paragraph_num = 1
    has_lines = False #True once the current paragraph has at least one line

    #used in identification function, to see if expression is in another paragraph
    gr_text = [] #the greek text without translations, divided in paragraphs
    para = '' #each paragraph will be a string

    with open(greek_txt_f, encoding='utf8') as txt:
        lines = txt.readlines()

    #the extra blank line closes the last paragraph like any other
    for line in lines + ['\n']:

        #division into tr_paragraphs
        if line.strip() == '':
            if has_lines: #ignores the extra blank lines between paragraphs
                if tr_paragraph != []:
                    tr_paragraph.append(tr_paragraph_num)
                    translated_txt.append(tr_paragraph)
                #paragraphs without translations still count in the numbering,
                #so gr_text[n - 1] is always paragraph number n
                gr_text.append(para)
                tr_paragraph = []
                para = ''
                tr_paragraph_num += 1
                has_lines = False

        else:
            #every second line is written right-to-left, so flip it back
            if counter % 2 == 1:
                line = line[::-1]
            sentence = line.split() #gives you each word WITH punctuation
            tr_paragraph, para = gr_to_it(tr_paragraph, sentence, para, gr_to_eng, eng_to_it)
            counter += 1
            has_lines = True

    return translated_txt, gr_text
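

# The un-flipping step can also be sketched in isolation. This is only an
# illustration under two assumptions: the hypothetical helper
# unflip_boustrophedon receives the lines of the text without their
# line-feeds, and the first line it receives is written left-to-right.
def unflip_boustrophedon(lines):
    """Return the lines with every second one reversed back to left-to-right."""
    return [line if index % 2 == 0 else line[::-1]
            for index, line in enumerate(lines)]

# Usage: unflip_boustrophedon(['abc def', 'ihg', 'jk']) gives
# ['abc def', 'ghi', 'jk']. Reversing character by character is enough here
# because the glyphs are not mirrored; it would scramble text that stores
# accents as separate combining characters.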


def it_paragraph_divider(italian_txt_f):
    '''Divides the italian text into paragraphs. That's all there is to it.
    It used to do more, but more turned out to be a royal pain in the ass to
    make. So now it doesn't. It does just as advertised: divides into
    paragraphs.'''

    it_text = []
    paragraph = ''

    with open(italian_txt_f, encoding='utf8') as txt:
        for line in txt:

            if line.strip() == '':
                if paragraph != '':
                    it_text.append(paragraph)
                    paragraph = ''

            else:
                paragraph = (paragraph + ' ' + line.strip()).strip()

    #the end of the file closes the last paragraph, unless it is empty
    if paragraph != '':
        it_text.append(paragraph)
    return it_text
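

# Matching ignores punctuation, so each paragraph eventually has to be reduced
# to its bare words. A possible sketch, assuming the only marks are the ones
# listed in the assignment and that an apostrophe inside a token belongs to
# the word; words_only is a hypothetical name.
def words_only(paragraph):
    """Return the words of a paragraph without surrounding punctuation."""
    words = []
    for token in paragraph.split():
        word = token.strip(".,:'")
        if word: # a token made only of punctuation leaves nothing behind
            words.append(word)
    return words

# Usage: words_only('io, o musa, scorgendo il destino.') gives
# ['io', 'o', 'musa', 'scorgendo', 'il', 'destino'].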


def identifier(k, gr_text, it_text):
    """Identifies the passages in common. Not implemented yet; the plan is:

    Step 1: We build an expression of k words of the greek paragraph, and
    check if it is in another paragraph, in which case we abort mission.
    Step 2: We pick the italian translations, and check if they are in the
    italian text. If they aren't, abort.
    Step 3: If they are, see if the translation of the next word is in too,
    and in that case extend the expression. Otherwise, keep only the
    previous expression.

    Good, now on to the substeps
    """

    it_expression = ''
    gr_expression = ''

    pass
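

# One way the three steps above could be realised is sketched below. It is
# not the finished identifier and it rests on assumptions: every Greek word
# comes with a list of Italian options, each option already split into
# words, and the Italian paragraph is a list of bare words. The names
# longest_run and common_runs are hypothetical, and no effort is made to be
# fast.
def longest_run(gr_options, it_words, i, p):
    """Return (greek_words_matched, italian_end) for the longest match that
    starts at Greek word i and Italian position p."""
    best = (0, p)
    if i < len(gr_options):
        for option in gr_options[i]:
            end = p + len(option)
            if option and it_words[p:end] == option:
                count, reached = longest_run(gr_options, it_words, i + 1, end)
                if count + 1 > best[0]:
                    best = (count + 1, reached)
    return best


def common_runs(k, gr_options, it_words):
    """Return the (gr_start, gr_end, it_start, it_end) spans, end excluded,
    of at least k Greek words, dropping spans nested inside a longer one."""
    found = []
    for i in range(len(gr_options)):
        for p in range(len(it_words)):
            count, end = longest_run(gr_options, it_words, i, p)
            if count >= k:
                found.append((i, i + count, p, end))
    return [r for r in found
            if not any(o != r and o[0] <= r[0] and r[1] <= o[1]
                       and o[2] <= r[2] and r[3] <= o[3] for o in found)]

# Usage: with gr_options = [[['o', 'musa']], [['dea']]] and
# it_words = ['o', 'musa', 'dea'], common_runs(2, gr_options, it_words)
# gives [(0, 2, 0, 3)].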


def ex1(k, lexicon_gr_en_f, lexicon_en_it_f, greek_txt_f, italian_txt_f):

    #let's dictionarize the lexicons
    gr_to_eng = {}
    eng_to_it = {}

    #dictionary for greek to english: word -> list of english expressions
    with open(lexicon_gr_en_f, encoding='utf8') as f:
        for line in f:
            fields = line.strip('\n').split(';')
            gr_to_eng[fields[0]] = fields[1:]
    #dictionary for english to italian: expression -> italian expression
    with open(lexicon_en_it_f, encoding='utf8') as f:
        for line in f:
            fields = line.strip('\n').split(';')
            eng_to_it[fields[0]] = fields[-1]

    #translates the greek text and divides it into paragraphs
    tr_gr_text, gr_text = gr_paragraph_divider(greek_txt_f, gr_to_eng, eng_to_it)

    #opens the italian file and divides it into paragraphs
    it_text = it_paragraph_divider(italian_txt_f)

    #and now for the showstopper, the king of the ring, the star of the movie:
    #the identifier!

    #TODO: identifier is still a stub, so for now this returns the italian
    #paragraphs instead of the set of matching excerpts described above
    return it_text
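

# The two lexicon loops in ex1 follow the same pattern, which could be
# factored out roughly as below. parse_lexicon is a hypothetical helper; it
# takes any iterable of lines (an open file works) and, as an assumption,
# skips blank lines instead of storing an empty key.
def parse_lexicon(lines):
    """Map the first field of each 'a;b;c' line to the list of the others."""
    lexicon = {}
    for line in lines:
        fields = line.strip('\n').split(';')
        if fields[0]:
            lexicon[fields[0]] = fields[1:]
    return lexicon

# Usage: parse_lexicon(['γε;indeed;at least;at any rate\n']) gives
# {'γε': ['indeed', 'at least', 'at any rate']}. For the English-Italian
# file every list holds a single item, so the Italian expression for expr
# would be parse_lexicon(f)[expr][0].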


#ex1('k', 'lexicon-GR-EN_.csv', 'lexicon-EN-IT_.csv', 'text_GR--4_4-4.txt', 'text_IT--4_4-4.txt')


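#run the example only when the file is executed directly, not when imported
if __name__ == '__main__':
    ex1(2, "lexicon-GR-EN.csv", "lexicon-EN-IT.csv", "odyssey.txt", "proemio.txt")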

#expression_finder(['favorably', 'going,', 'lost', 'in'], ('favorably going', 'lost in'), set())