Identifier function

# -*- coding: utf-8 -*-
"""
Created on Mon Jan 25 16:54:35 2021

@author: Ntsoa
"""

def identifier(k, tr_gr_text, gr_text, it_text):
    """Identifies the passages in common"""

    raw_greek_expression = []
    it_expression = ''
    gr_expression = ''
    wordcount = 0


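    # Step 0.9 is complete; now on to step 1.
    for para_index, tr_gr_para in enumerate(tr_gr_text):

        # Shallow copy of the paragraph list, minus the current paragraph.
        # enumerate() gives the position directly; list.index() would return
        # the first match, which is the wrong one if two paragraphs are equal.
        other_paragraphs = list(gr_text)
        other_paragraphs.pop(para_index)
        # Build a superstring made of all the other paragraphs
        # (this assumes the paragraphs in gr_text are strings).
        o_p = ' '.join(other_paragraphs).lstrip()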

        # Slide a window of k words over the paragraph, starting from each
        # Greek word in turn; the last window starts at len(tr_gr_para) - k.
        for wordcount in range(len(tr_gr_para) - k + 1):
            raw_greek_expression = tr_gr_para[wordcount : wordcount + k]
            # We are looking for expressions of k or more words, not exactly
            # k, so the window has to keep growing for as long as the
            # conditions are right.

    """

    Step 1: We build an expression of k words from the Greek paragraph and
    check whether it also occurs in another paragraph, in which case we
    discard it.
    Step 2: We pick the Italian translations and check whether they occur in
    the Italian text. If they do not, we discard the expression.
    Step 3: If they do, we check whether the translation of the next word is
    there too, and in that case we extend the expression. Otherwise, we keep
    only the previous expression.

    Good, now on to the substeps.

                                                                            """

    '''
    Step 0.9: How do we check whether the expression is in another paragraph?
    Obviously we'll have to make use of tr_gr_text.
    We will have to go through each paragraph. Do we make a superstring made
    up of only the other paragraphs? Is there enough time for that?
    Regardless, it does seem like the most efficient solution. The text made
    only of Greek words (those that can be translated) should be pre-built,
    so we don't have to build it on every run here.
    In other words, it will be an output of gr_para_div.
    Also, disregard the punctuation when making that text.

    Step 1.9: We now have an expression of k words. We will have to build
    each possible translation of it, will we not? Actually, no, the moment we
    find a suitable translation we stop, right? Things are never that easy,
    are they? What a pain. Well, we'll go through each possibility then.
    Do keep in mind that there won't actually be that many iterations in the
    end, since most candidates will be rejected right away for not being
    present in the Italian text.
    That check is possibly the most difficult part of this homework, so how
    do we plan on dealing with it?
    The difficult part is that we 1) have to find the parts that match,
    punctuation notwithstanding, and 2) have to report those same parts, but
    with their punctuation.
    Instinctively, I want to say that this will be easier if we have the
    Italian text divided into sublists: each paragraph is a list, and each
    of these lists is made up of words as elements.
    e.g.: [['io', 'vado'], ['ciao,', 'sono', 'andry.']]
    If we have that, things will be much easier.
    Say we have, from the translated Greek text: 'atena, figlia di zeus.'
    if 'atena' in italian paragraph:
        starting from it_para.index('atena'), check whether the next elements
        match 'figlia di zeus'; if an element has punctuation, we disregard
        it temporarily.
    Seems like a good deal, we'll go with it.
                                                                            '''
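

# A minimal sketch of the matching idea from the Step 0.9 and Step 1.9 notes
# above, not the final implementation. Assumptions: each Italian paragraph is
# a list of words, e.g. ['ciao,', 'sono', 'andry.']; punctuation only has to
# be ignored at the edges of a word; and only ASCII punctuation occurs (marks
# such as guillemets would need to be added). The names strip_punct,
# find_phrase and occurs_elsewhere are hypothetical helpers introduced here
# for illustration.
import string


def strip_punct(word):
    """Return word without leading or trailing ASCII punctuation."""
    return word.strip(string.punctuation)


def find_phrase(phrase_words, para_words):
    """Return the slice of para_words that matches phrase_words, or None.

    Words are compared with their punctuation stripped, but the returned
    words are taken from para_words unchanged, so they keep it.
    """
    target = [strip_punct(w) for w in phrase_words]
    cleaned = [strip_punct(w) for w in para_words]
    n = len(target)
    if n == 0:
        return None
    # Try every position where a window of n words still fits.
    for start in range(len(cleaned) - n + 1):
        if cleaned[start:start + n] == target:
            return para_words[start:start + n]
    return None


def occurs_elsewhere(expression_words, other_text):
    """Return True if the expression appears, word for word, in other_text.

    other_text is assumed to be the superstring built from the other
    paragraphs, with words separated by whitespace.
    """
    return find_phrase(expression_words, other_text.split()) is not None


# Example use, with the phrase from the notes:
#     find_phrase(['atena', 'figlia', 'di', 'zeus'],
#                 ['canta', 'atena,', 'figlia', 'di', 'zeus.'])
# returns ['atena,', 'figlia', 'di', 'zeus.'].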