- Python code:
- from urllib.request import urlopen
- from bs4 import BeautifulSoup
- from nltk.tokenize import sent_tokenize
- # Requires the NLTK Punkt sentence tokenizer data to be installed.
- sentences_pl = []
- sentences_en = []
- # Download the Polish and English versions of the page.
- response_pl = urlopen("http://www.staff.amu.edu.pl/~rjawor/index.php")
- response_en = urlopen("http://www.staff.amu.edu.pl/~rjawor/index_en.php")
- page_pl = BeautifulSoup(response_pl, 'html.parser')
- page_en = BeautifulSoup(response_en, 'html.parser')
- # Split every text fragment into sentences, keeping all sentences of a fragment.
- for s in page_pl.stripped_strings:
-     sentences_pl.extend(sent_tokenize(s, language='polish'))
- for s in page_en.stripped_strings:
-     sentences_en.extend(sent_tokenize(s, language='english'))
- # Write one sentence per line; UTF-8 preserves the Polish characters.
- with open('sentences_pl.txt', 'w', encoding='utf-8') as f_pl:
-     for s in sentences_pl:
-         f_pl.write(s + "\n")
- with open('sentences_en.txt', 'w', encoding='utf-8') as f_en:
-     for s in sentences_en:
-         f_en.write(s + "\n")
- hunalign:
- Command run from the hunalign root directory, after running make and copying the sentence files generated by the Python script into the examples folder:
- src/hunalign/hunalign data/english.dic examples/sentences_pl.txt examples/sentences_en.txt -text > ./align.txt
- The align.txt file: https://pastebin.com/c5w9VA9H
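To inspect the alignment from Python, the output file can be read back into sentence pairs. The sketch below is not part of the original paste. It assumes that `-text` output has three tab-separated columns per line (source segment, target segment, confidence score) and that sentences merged into one segment are joined with `~~~`. The function name `parse_alignment` and the sample lines are hypothetical.

```python
def parse_alignment(lines):
    """Parse hunalign -text output into (source, target, score) tuples.

    Assumed layout: source<TAB>target<TAB>score, with merged
    sentences inside a segment separated by "~~~".
    """
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        source, target, score = fields
        try:
            score = float(score)
        except ValueError:
            continue  # skip lines whose last column is not a number
        # A segment may hold several sentences joined by "~~~".
        src = [s.strip() for s in source.split("~~~") if s.strip()]
        tgt = [t.strip() for t in target.split("~~~") if t.strip()]
        pairs.append((src, tgt, score))
    return pairs


# Hypothetical sample lines standing in for the contents of align.txt.
sample = [
    "Ala ma kota.\tAla has a cat.\t0.5\n",
    "Pierwsze. ~~~ Drugie.\tFirst and second.\t1.25\n",
]
for src, tgt, score in parse_alignment(sample):
    print(score, src, tgt)
```

For the real file, pass an open file object instead of the sample list, for example `parse_alignment(open('align.txt', encoding='utf-8'))`. Filtering the result by score is one way to drop low-confidence pairs.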