\chapter{Our Contribution}
In the previous chapters, we covered the basics of reading comprehension and question answering. We also described two main models designed for this task -- BiDAF \ref{bidaf} and BERT \ref{bert}. They were trained on one of the most popular datasets, called SQuAD \ref{squad}. Especially the second model has achieved very good results. Unfortunately, this dataset is in English. We would like to be able to train such models for QA in Czech as well. In this chapter, we describe how the previously described dataset and models can be reused to reach this goal.

\section{Introduction}
First, we describe the basic tools and technologies that we used to reach the goal of our work, and we describe the dataset structure in detail.

\subsection{Dataset}
We have downloaded the SQuAD 1.0 dataset from \cite{squadsource}. We have chosen SQuAD version 1.0 because the answer to each question is guaranteed to exist in the text, so the answers are easier to predict. In version 2.0, questions can be unanswerable, which makes the QA task even more challenging. More about the SQuAD dataset can be found in \ref{chapter02}.

The structure of the dataset is the following. There are two .json files. The first one is \textit{train-v1.1.json} and it contains all data for training. This means there are context paragraphs with several questions, and each question has one answer. The second one is \textit{dev-v1.1.json}, which is used for evaluation. The structure of this file is almost the same, with one difference: it was annotated manually by several crowdworkers, so there can be several answers for one question. The best-matching answer is always chosen to be compared with the predicted one to reach the highest accuracy. This also allows small deviations in answering, which can be useful, as the predicted answer is not always 100\% the same as the original one and can still be correct. The training set contains 87,599 questions and the development set contains 10,570 questions.

The structure of both data files looks as follows. There is a \textit{data} tag containing a list of all articles. Inside this tag, there is always the title of an article in a \textit{title} tag, together with a list of single paragraphs containing the context related to this title. They are stored in \textit{paragraphs} tags. Each paragraph has its own list of answers and questions in a \textit{qas} tag, which furthermore consists of three tags. The first one is the \textit{question} tag, which contains the text of the question. The second one is the \textit{id} tag, as each question has its own id for easier identification. The last one is the \textit{answers} tag, containing the text of the answer in a \textit{text} tag and also the starting index of the answer in the text in the \textit{answer\_start} tag.

Basically, the structure looks like this:

\lstset{language=C}
\begin{lstlisting}
{ data [
    title
    paragraphs [{
        context
        qas [{
            answers [{
                text
                answer_start
            }]
            question
            id
        }]
    }]
]
version }
\end{lstlisting}
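
For illustration, this structure can be traversed with a few lines of Python (the file name is the one distributed with SQuAD):

\lstset{language=Python}
\begin{lstlisting}
import json

with open("train-v1.1.json", encoding="utf-8") as f:
    squad = json.load(f)

print(squad["version"])
for article in squad["data"]:
    title = article["title"]
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            question, qid = qa["question"], qa["id"]
            for answer in qa["answers"]:
                text, start = answer["text"], answer["answer_start"]

# 87,599 questions in the training file
print(sum(len(p["qas"])
          for a in squad["data"] for p in a["paragraphs"]))
\end{lstlisting}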

\subsection{Translation of the data}
We have performed several translations of the SQuAD dataset to Czech and eventually back to English. For that, we have used the LINDAT Translator, which is the best freely available translator between Czech and English. It is developed at the Faculty of Mathematics and Physics of Charles University by the Institute of Formal and Applied Linguistics. More about this translator can be found in \cite{lindat}.

The translation of the dataset introduces noise into it. In the English dataset, the answer in the text was the same as the text in the answer tag. Unfortunately, after translation the answer in the text and the answer in the \textit{answers} tag may differ. We need to handle this problem. Moreover, the \textit{answer\_start} tag value must be recomputed, as the order of the words may have changed after translation.

As mentioned above, when we had translated all of the paragraphs, answers and questions, the start index of each answer in the text had to be recomputed, as we need it during training. The problem is that we cannot use exact match, for several reasons.

The first one is that the answer may not fit the text exactly. Therefore, we cannot use exact match and we need to go character by character and find the longest common substring by the following algorithm; see Algorithm \ref{Alg:lcs}. We start with the whole text and compute the match between it and the translated answer. Then we systematically delete the first character until the string is empty, and measure which of the resulting common substrings is the longest one.

\begin{algorithm}[H]
\caption{Finding the start index of the translated answer in the translated text}
\label{Alg:lcs}
\hspace*{\algorithmicindent} \textbf{Input:} \\
$text$ = translated text \\
$answer$ = translated answer \\
$idx$ = original index of the middle of the answer \\

\hspace*{\algorithmicindent} \textbf{Output:} \\
$bestMatch$ = start index of the best answer\\

\begin{algorithmic}
\FOR{$i$ = 0 to len($text$)}
\STATE $lcs[i]$ = longestCommonSubstring($text[i{:}\mathrm{len}(text)]$, $answer$)
\ENDFOR
\STATE $maxLcs$ = all items of $lcs$ with the maximal match
\IF{$maxLcs$ contains only one item}
\STATE return $maxLcs[0]$
\ELSE
\FOR{$i$ = 0 to len($maxLcs$)}
\STATE $maxPos[i] = maxLcs[i].match \cdot (1 - |idx - maxLcs[i].idx| / \mathrm{len}(text))$
\ENDFOR
\ENDIF
\STATE return $maxLcs[\arg\max(maxPos)]$
\end{algorithmic}

\end{algorithm}

The other problem is that there can be more occurrences of the answer in the text and only one of them is the correct one. Therefore, we assume that the sentences of the text are in approximately the same order in English and in Czech, and we use the original answer position to find the correct position of the translated answer. For each possible answer we compute its final score as the value of its longest common substring match multiplied by a weight derived from the distance from the position of the original answer in the original text. To simplify the computation, the middle position of the answer is taken. The nearer the actual position is to the original one and the more similar the answer is, the higher the score. Finally, the answer with the highest score is chosen and its starting index is taken as the correct one. Obviously, if the starting index points to the middle of a word, it is moved so that it points to the beginning of it.
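
The following Python sketch is a simplified variant of this procedure: instead of first collecting all maxima and only then tie-breaking by distance, it folds the distance weight directly into a single score (all names are ours):

\lstset{language=Python}
\begin{lstlisting}
def longest_common_substring(a, b):
    """Length of the longest common substring (classic DP)."""
    prev, best = [0] * (len(b) + 1), 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def find_answer_start(text, answer, orig_idx):
    """Best start index of `answer` in `text`, plus its match score.

    orig_idx is the middle position of the answer in the original
    (English) text; closer candidates are preferred."""
    best_start, best_score = 0, -1.0
    for i in range(len(text)):
        window = text[i:i + len(answer)]
        match = longest_common_substring(window, answer) / len(answer)
        score = match * (1 - abs(orig_idx - i) / len(text))
        if score > best_score:
            best_start, best_score = i, score
    # move the index to the beginning of the word it points into
    while best_start > 0 and not text[best_start - 1].isspace():
        best_start -= 1
    return best_start, best_score
\end{lstlisting}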

To facilitate our work with the translated data, we have modified the final .json file after translation a bit. Two new tags were added into the \textit{answers} tag. The first one is \textit{answer\_end}, which is computed during the recomputation of the starting index. It points to the end of the last word of the answer in the text, and it was added to allow easier visualization of the answer in the context paragraph. The other one is \textit{answer\_match} and it holds the value of the score of the match. See the new structure below.

\lstset{language=C}
\begin{lstlisting}
{ data [
    title
    paragraphs [{
        context
        qas [{
            answers [{
                text
                answer_start
                answer_end
                answer_match
            }]
            question
            id
        }]
    }]
]
version }
\end{lstlisting}
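
On top of the find\_answer\_start sketch above, the augmentation could then look like this (the file names are placeholders, and the middle of the original answer is approximated from the original \textit{answer\_start}):

\lstset{language=Python}
\begin{lstlisting}
import json

# placeholder file name for the translated dataset
with open("train-v1.1-cs.json", encoding="utf-8") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for ans in qa["answers"]:
                # approximate middle of the answer in the original text
                middle = ans["answer_start"] + len(ans["text"]) // 2
                start, match = find_answer_start(context,
                                                 ans["text"], middle)
                ans["answer_start"] = start
                # approximation; the real value is aligned
                # to the end of the last word of the answer
                ans["answer_end"] = start + len(ans["text"])
                ans["answer_match"] = match  # score of the occurrence

with open("train-v1.1-cs-fixed.json", "w", encoding="utf-8") as f:
    json.dump(squad, f, ensure_ascii=False)
\end{lstlisting}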

Unfortunately, every machine translation system makes some mistakes. We will now describe the most common ones. One of them is a change of word order, which confuses the system while recomputing the start index of the answer in the text. An example can be seen in \ref{Fig:img-order}. Another common mistake is caused by synonyms: the translator chooses two different Czech words in the question and in the answer for the same English word. See \ref{Fig:img-synonyms}. Another problem is caused by different language properties -- Czech words are declined. See \ref{Fig:img-declination}. The last common mistake we will mention is the translation of numbers. They can be written as words, and then they are translated, or written as digits, and then the algorithm recomputing the index is confused. See \ref{Fig:img-numbers}. The same problem occurs with names. See \ref{Fig:img-names}. Some of these deviations in translation can be observed in the images below.

\begin{figure}[H]
\centering
\includegraphics[width=140mm]{../img/example-order.png}
\caption{Example of the answer selected by the algorithm with changed word order between text and answer.}
\label{Fig:img-order}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=140mm]{../img/example-declination.png}
\caption{Example of the answer selected by the algorithm with different declination in text and answer.}
\label{Fig:img-declination}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=140mm]{../img/example-synonyms.png}
\caption{Example of the answer selected by the algorithm with synonyms in text and answer.}
\label{Fig:img-synonyms}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=140mm]{../img/example-numbers.png}
\caption{Example of the answer selected by the algorithm with non-translated numbers in text and answer.}
\label{Fig:img-numbers}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=140mm]{../img/example-names.png}
\caption{Example of the answer selected by the algorithm with partially translated names in text and answer.}
\label{Fig:img-names}
\end{figure}


After having translated all the data and successfully recomputed the start and end indices, we had to analyze them and observe how good or bad the translation was. For that, we have created the graphs \ref{Fig:img-devmatch} and \ref{Fig:img-trainmatch}, which show how successful the translation was. We can also observe that the numbers are very similar for both sets, which is good: the development set and the training set do not differ too much, they were similarly difficult for the translator, and there is not too much deviation between these two datasets.
\begin{figure}[H]
\centering
\includegraphics[width=140mm]{../img/train-match.png}
\caption{Plot of how well the translated answers match the answers in the text, for the training set.}
\label{Fig:img-trainmatch}
\end{figure}


\begin{figure}[H]
\centering
\includegraphics[width=140mm]{../img/dev-match.png}
\caption{Plot of how well the translated answers match the answers in the text, for the development set.}
\label{Fig:img-devmatch}
\end{figure}

According to the graphs \ref{Fig:img-trainmatch} and \ref{Fig:img-devmatch}, we can see that almost all translations have a match of more than 50\% in both sets. Moreover, there are only a few translations with a match below 80\%. If we threw them away, we would still preserve almost 90\% of the data, which would still be enough to make good predictions. We can therefore throw them away so that they do not cause too much noise during training; these answers were probably found at a wrong place in the text or translated incorrectly. The exact percentages of data preserved for different values of the match can be observed in table \ref{tab:match-after-trans}.
\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Match} & \textbf{Train set size} & \textbf{Test set size} \\
\hline
\textbf{100\%} & 57.8\% & 58.0\% \\
\hline
\textbf{$\geq$ 90\%} & 78.9\% & 78.2\% \\
\hline
\textbf{$\geq$ 80\%} & 89.0\% & 88.6\% \\
\hline
\textbf{$\geq$ 70\%} & 94.7\% & 94.2\% \\
\hline
\textbf{$\geq$ 60\%} & 98.2\% & 98.0\% \\
\hline
\textbf{$\geq$ 50\%} & 99.8\% & 99.8\% \\
\hline
\end{tabular}
\caption{Preservation of the original datasets [in \%], where the match is higher than the given value.}
\label{tab:match-after-trans}
\end{table}

For better observation of the influence of translation mistakes on the training process, we create several datasets. Each of them contains only the answers with a match greater than or equal to a certain value. We found that the best partitioning is in steps of 5\%, so we create datasets with match 100\%, $\geq$95\%, $\geq$90\%, $\geq$85\% and $\geq$80\%, for both the training and the development set, and we use them for training. This process is described in detail in the following part. The percentage of the original data preserved in the newly created files is shown in table \ref{tab:match-after-trans-files}. The numbers differ slightly from the previous ones because of rounding deviations. We still have more than 85\% of the data preserved, which is enough.
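
A possible implementation of this filtering step, assuming the \textit{answer\_match} tag introduced above (file names are placeholders):

\lstset{language=Python}
\begin{lstlisting}
import json

def filter_by_match(in_path, out_path, threshold):
    """Keep only questions whose answers reach at least
    `threshold` match between answer and text."""
    with open(in_path, encoding="utf-8") as f:
        squad = json.load(f)
    for article in squad["data"]:
        for par in article["paragraphs"]:
            par["qas"] = [qa for qa in par["qas"]
                          if all(a["answer_match"] >= threshold
                                 for a in qa["answers"])]
        article["paragraphs"] = [p for p in article["paragraphs"]
                                 if p["qas"]]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(squad, f, ensure_ascii=False)

# one file per 5% step, as described above
for t in (1.00, 0.95, 0.90, 0.85, 0.80):
    filter_by_match("train-v1.1-cs-fixed.json",
                    "train-v1.1-cs-%d.json" % int(t * 100), t)
\end{lstlisting}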

\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Match} & \textbf{Train set size} & \textbf{Test set size} \\
\hline
\textbf{100\%} & 50.6\% & 51.0\% \\
\hline
\textbf{$\geq$ 95\%} & 59.1\% & 59.2\% \\
\hline
\textbf{$\geq$ 90\%} & 71.2\% & 71.2\% \\
\hline
\textbf{$\geq$ 85\%} & 80.0\% & 78.9\% \\
\hline
\textbf{$\geq$ 80\%} & 85.1\% & 84.5\% \\
\hline
\end{tabular}
\caption{Preservation of the original datasets [in \%] in the newly created files, where the match is higher than the given value.}
\label{tab:match-after-trans-files}
\end{table}


\subsection{Machine for training}
For training and testing, we have used the Artificial Intelligence Cluster (AIC) \cite{aic}:
\begin{quote}
AIC (Artificial Intelligence Cluster) is a computational grid with sufficient computational capacity for research in the field of deep learning using both CPU and GPU. It was built on top of SGE scheduling system. MFF students of Bc. and Mgr. degrees can use it to run their experiments and learn the proper ways of grid computing in the process.
\end{quote}
%4 processors Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz.

\section{BiDAF Model}
The Bidirectional Attention Flow (BiDAF) model was trained on the SQuAD dataset. It is a multi-level hierarchical process. With the help of an attention mechanism, it represents the context of words at several levels of granularity. It also limits the information loss in the context and relation representation. More about this model can be found in \ref{chapter03}.

As mentioned above, the goal of this work is to reuse this model to solve the QA task in Czech as well. We have used the model from \cite{biattflowsource}. There are two main approaches, both relying on machine translation of data between English and Czech. The first one is to take the whole SQuAD dataset, translate it into Czech and then train the model in Czech. The second one is to train the model in English and translate the Czech input to English, let the model produce the answer and finally translate it back to Czech.

\subsection{Translation of English data to Czech}
We have taken the whole SQuAD dataset and translated it from English to Czech with the LINDAT translator. Then we have trained the model in Czech and tested its accuracy. Depending on the combination of train and test files, it took from 62 to 213 hours on CPU.

For the training process, we had to replace the English embeddings used for the English dataset with Czech embeddings. We have used Czech embeddings created by RNDr. Milan Straka, PhD. on 4 billion Czech words using the word2vec model. There are 1.5 million embeddings.
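
Assuming the vectors are stored in the standard word2vec text format, they can be loaded and inspected, for example, with the gensim library (the file name is a placeholder):

\lstset{language=Python}
\begin{lstlisting}
from gensim.models import KeyedVectors

# placeholder path to the Czech word2vec vectors
vectors = KeyedVectors.load_word2vec_format("cs-word2vec.txt",
                                            binary=False)
print(vectors.vector_size)                  # embedding dimension
print(vectors.most_similar("Praha", topn=3))
\end{lstlisting}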

After the translation, we have obtained 5 data files for training and 5 data files for testing, with different matches between the answer in the text and the translated answer, as described above. To find the best combination of training and testing dataset, we have trained the model on all possible combinations. In total, we have trained 25 models with different train and test files, having answers with matches from $\geq$80\% to 100\%. The resulting accuracies and F1 scores of these training processes can be observed in tables \ref{tab:cz-results-acc} and \ref{tab:cz-results-f1}.
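
Such a grid can be launched by a simple loop; the following sketch assumes a hypothetical command-line interface of the BiDAF implementation from \cite{biattflowsource}, so the entry point, flags and file names are placeholders:

\lstset{language=Python}
\begin{lstlisting}
import itertools
import subprocess

thresholds = ["100", "95", "90", "85", "80"]
for train_t, test_t in itertools.product(thresholds, thresholds):
    # hypothetical flags; the real entry point of the code may differ
    subprocess.run(["python", "-m", "basic.cli",
                    "--train_file", "train-v1.1-cs-%s.json" % train_t,
                    "--test_file", "dev-v1.1-cs-%s.json" % test_t,
                    "--out_dir", "out/bidaf-%s-%s" % (train_t, test_t)],
                   check=True)
\end{lstlisting}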

\begin{table}[h]
\begin{tabular}{|l|l|l|l|l|l|}
\hline
\textbf{Test/Train} & \textbf{100\%}& \textbf{$\geq$95\%}& \textbf{$\geq$90\%}& \textbf{$\geq$85\%}& \textbf{$\geq$80\%} \\
\hline
\textbf{100\%}& 57.03\% & 57.34\% & 58.84\% & 59.23\% & 59.6\% \\ \hline
\textbf{$\geq$95\%}& 53.14\% & 55.7\% & 57.17\% & 57.69\% & 57.71\% \\ \hline
\textbf{$\geq$90\%}& 48.25\% & 51.31\% & 54.78\% & 56.12\% & 53.14\% \\ \hline
\textbf{$\geq$85\%}& 45.72\% & 47.47\% & 49.3\% & 52.86\% & 54.56\% \\ \hline
\textbf{$\geq$80\%}& 43.86\% & 48.1\% & 50.76\% & 52.34\% & 53.43\% \\ \hline
\end{tabular}
\caption{Exact match after translating SQuAD to Czech and then training and testing the models on data files with the corresponding matching values.}
\label{tab:cz-results-acc}
\end{table}

\begin{table}[h]
\begin{tabular}{|l|l|l|l|l|l|}
\hline
\textbf{Test/Train} & \textbf{100\%}& \textbf{$\geq$95\%}& \textbf{$\geq$90\%}& \textbf{$\geq$85\%}& \textbf{$\geq$80\%} \\
\hline
\textbf{100\%}& 65.01\% & 65.75\% & 67.49\% & 67.95\% & 67.89\% \\ \hline
\textbf{$\geq$95\%}& 62.35\% & 64.86\% & 66.27\% & 67.18\% & 67.37\% \\ \hline
\textbf{$\geq$90\%}& 58.56\% & 61.85\% & 65.19\% & 66.4\% & 65.99\% \\ \hline
\textbf{$\geq$85\%}& 56.57\% & 60.83\% & 62.02\% & 64.64\% & 65.35\% \\ \hline
\textbf{$\geq$80\%}& 55.01\% & 59.88\% & 62.09\% & 63.88\% & 64.79\% \\ \hline
\end{tabular}
\caption{F1 score after translating SQuAD to Czech and then training and testing the models on data files with the corresponding matching values.}
\label{tab:cz-results-f1}
\end{table}

If we compare all these results in the graphs of exact match \ref{Fig:graph-acc} and F1 score \ref{Fig:graph-f1}, we can see that we got the best results with training data containing answers with a match greater than or equal to 80\% and a testing set with a 100\% match between the translated answer and the answer in the text. The results are logical: a match above 80\% does not bring as much noise into the dataset as lower values do, and we also have much more data for training than with a match of 100\%, so the model can then answer more questions. The reason why the best testing set is the one with a 100\% match is that the other development sets are noisy and it is harder for the model to predict them correctly.

\begin{figure}[H]
\centering
\includegraphics[width=140mm]{../img/transl_cz_accuracy.png}
\caption{Exact match for all combinations of training and testing sets.}
\label{Fig:graph-acc}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=140mm]{../img/transl_cz_f1.png}
\caption{F1 score for all combinations of training and testing sets.}
\label{Fig:graph-f1}
\end{figure}

\subsection{Translation of Czech input into English}
We have trained the BiDAF model in English. It took 124 hours, 29 minutes and 31 seconds on CPU. Then we have taken the Czech testing dataset, translated it into English and run it on the pretrained English model. Subsequently, we have measured the exact match and got the following results; see table \ref{tab:cz-en-results}.
\begin{table}[h]
\centering
\begin{tabular}{| l | l |}
\hline
\textbf{Exact match} & \textbf{F1} \\
\hline
54.39\% & 67.58\% \\
\hline
\end{tabular}
\caption{Results after translation of the Czech input to English for evaluation. Exact match and F1 were computed on this dataset without translating the answers back to Czech.}
\label{tab:cz-en-results}
\end{table}
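
For reference, both metrics follow the official SQuAD v1.1 evaluation script; a per-answer Python sketch is shown below (the article-stripping step is English-specific and has little effect on Czech answers):

\lstset{language=Python}
\begin{lstlisting}
import re
import string
from collections import Counter

def normalize(s):
    # lowercase, drop punctuation and English articles, squeeze spaces
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    pred, gold = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
\end{lstlisting}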

To make the evaluation complete, we have taken the English answers and translated them back to Czech. Then we have measured the similarity between the original and the newly obtained Czech answers; see table \ref{tab:cz-en-cz-results}.

\begin{table}[h]
\centering
\begin{tabular}{| l | l |}
\hline
\textbf{Exact match} & \textbf{F1} \\
\hline
62.08\% & 70.36\% \\
\hline
\end{tabular}
\caption{Results after translation of the Czech input to English for evaluation. The answers were then translated back to Czech, and exact match and F1 were computed on the translated and the original Czech answers.}
\label{tab:cz-en-cz-results}
\end{table}
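
The whole CZ--EN--CZ evaluation loop can be sketched as follows. The \texttt{translate} helper stands for a call to the LINDAT translation service \cite{lindat}; the exact endpoint and parameter name are our assumption, and \texttt{qa\_model} stands for the pretrained English model:

\lstset{language=Python}
\begin{lstlisting}
import requests

# assumed endpoint of the public LINDAT translation service
LINDAT = "https://lindat.mff.cuni.cz/services/translation/api/v2/models"

def translate(text, direction):
    # direction is e.g. "cs-en" or "en-cs"
    r = requests.post("%s/%s" % (LINDAT, direction),
                      data={"input_text": text})
    r.raise_for_status()
    return r.text.strip()  # the service returns the translated text

def answer_in_czech(question_cs, context_cs, qa_model):
    question_en = translate(question_cs, "cs-en")
    context_en = translate(context_cs, "cs-en")
    answer_en = qa_model(question_en, context_en)  # English BiDAF
    return translate(answer_en, "en-cs")
\end{lstlisting}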

As we do not have any original Czech training or testing dataset, we have translated the whole testing dataset from English to Czech. From that, we have chosen only the questions and answers having a match between the translated answer and the answer in the text of more than 95\%. The match was computed in the same way as mentioned in the previous section, by the algorithm described in \ref{Alg:lcs}.

\subsection{Summary of results}

If we compare the best result of the model trained in Czech with the best result of the model trained in English, we can see that the overall best results were obtained from the English model trained on the English dataset, with the Czech input translated into English and the English output translated back into Czech. See table \ref{tab:all-results}.

We think that the English model is better because we have more data and there is no noise caused by translation. Moreover, the LINDAT translator was built so that if we translate a sentence from English to Czech and then back to English, it tries to return the original sentence, so the loss is minimal.

\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Model} & \textbf{Exact match} & \textbf{F1} \\
\hline
\textbf{English} & 64.22\% & 75.29\% \\
\hline
\textbf{CZ} & 59.6\% & 67.89\% \\
\hline
\textbf{CZ-EN} & 54.39\% & 67.58\% \\
\hline
\textbf{CZ-EN-CZ} & 62.08\% & 70.36\% \\
\hline
\end{tabular}
\caption{Comparison of results from all models.}
\label{tab:all-results}
\end{table}


To sum it up, with the best model for Czech we have an exact match of 62.08\% and an F1 score of 70.36\%, and for the English model we have an exact match of 64.22\% and an F1 score of 75.29\%. The exact match of the better model is thus only about 2\% lower and the F1 score only about 5\% lower, which is not such a big difference.

However, the original English BiDAF model is generally not that strong: it has an exact match of only 64\% and an F1 score of 75\%. Therefore, we have tried to reuse another model, which gives an exact match of 80.69\% and an F1 score of 88.16\% for English. It is called the Transformer model.


\section{Transformer model}
We have downloaded the Transformer model from \cite{bertsource}. It is a universal language representation model which was originally pretrained on a huge unlabeled corpus. We have fine-tuned it on the SQuAD dataset to create a model called BERT for our QA task. The fine-tuning took from 3 to 6 hours on GPU, depending on the train and test datasets.

Firstly, we have trained the Transformer on the original English SQuAD dataset. We have tried different numbers of epochs. We obtained the best results with 2 epochs, so we have trained all the other models with 2 epochs only. See table \ref{tab:bert-epochs}.
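
The fine-tuning itself can be launched with the \texttt{run\_squad.py} script from \cite{bertsource}. The following sketch uses placeholder paths; the flags follow the SQuAD recipe documented in that repository:

\lstset{language=Python}
\begin{lstlisting}
import subprocess

subprocess.run(["python", "run_squad.py",
                "--vocab_file", "bert_base/vocab.txt",
                "--bert_config_file", "bert_base/bert_config.json",
                "--init_checkpoint", "bert_base/bert_model.ckpt",
                "--do_train", "True", "--train_file", "train-v1.1.json",
                "--do_predict", "True", "--predict_file", "dev-v1.1.json",
                "--num_train_epochs", "2.0",  # the best value found
                "--max_seq_length", "384", "--doc_stride", "128",
                "--output_dir", "out/bert-squad-en"],
               check=True)
\end{lstlisting}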

\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{BERT (Train EN, Test EN)} & \textbf{Exact match} & \textbf{F1} \\
\hline
\textbf{1 epoch} & 79.2\% & 87.35\% \\
\hline
\textbf{2 epochs} & 80.69\% & 88.16\% \\
\hline
\textbf{3 epochs} & 80.03\% & 87.8\% \\
\hline
\end{tabular}
\caption{Comparison of results of BERT using different numbers of epochs.}
\label{tab:bert-epochs}
\end{table}


Having found the ideal number of epochs, we have trained BERT also on SQuAD translated to Czech. For that, we had to use the Multilingual BERT model, which was designed for other languages. There are two versions of this model. The first one is Multilingual BERT cased, which uses the cased dataset. The second one is Multilingual BERT uncased, which converts the dataset to its uncased version. The uncased model generally gives better results, which is logical, as the dataset is simpler. We have trained it on two different combinations of Czech datasets.

At first, we have chosen the Czech training dataset with match $\geq$80\% and the testing dataset with match 100\%. The reason was simple -- this combination has given the best results with the BiDAF model. Moreover, we have also tested the Czech training and testing datasets both with match 100\%, because we wanted to be sure that the first combination is really the best one for BERT too.

Surprisingly, we have found that training on the dataset with match 100\% gives us 2\% better results than training on the dataset with match $\geq$80\%. It is quite surprising, as the more accurate dataset contains less data for training. On the other hand, BERT is already pretrained on a huge unlabeled corpus, so the training dataset with matching value $\geq$80\% brings more noise into the results. Therefore, the best combination for BERT is the training and testing datasets both with match 100\%. See table \ref{tab:bert-cs}.


\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Multilingual BERT (Train CZ, Test CZ)} & \textbf{Exact match} & \textbf{F1} \\
\hline
\textbf{Cased (Train 80\%, Test 100\%)} & 64.55\% & 73.66\% \\
\hline
\textbf{Uncased (Train 80\%, Test 100\%)} & 66.58\% & 75.6\% \\
\hline
\textbf{Cased (Train 100\%, Test 100\%)} & 66.15\% & 74.28\% \\
\hline
\textbf{Uncased (Train 100\%, Test 100\%)} & 68.45\% & 76.0\% \\
\hline
\end{tabular}
\caption{Comparison of results of Multilingual BERT trained and evaluated on Czech using different training and testing sets.}
\label{tab:bert-cs}
\end{table}


Moreover, we have also tried to train Multilingual BERT for English. Firstly, we have evaluated it on English to see how much the multilingual model differs from the original BERT; it gives approximately the same results. Then we have evaluated the model on Czech in the same way as with the BiDAF model: we have translated the Czech inputs into English and the answers back to Czech, and evaluated them. We got the results shown in table \ref{tab:multibert-csen}.


\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Multilingual BERT (Train EN)} & \textbf{Exact match} & \textbf{F1} \\
\hline
\textbf{Cased (Test EN)} & 81.5\% & 88.75\% \\
\hline
\textbf{Uncased (Test EN)} & 81.86\% & 89.12\% \\
\hline
\textbf{Cased (Test CZ)} & 72.53\% & 83.93\% \\
\hline
\textbf{Uncased (Test CZ)} & 72.7\% & 84.23\% \\
\hline
\end{tabular}
\caption{Comparison of results of Multilingual BERT trained on English and evaluated on English and Czech.}
\label{tab:multibert-csen}
\end{table}

We can see that Multilingual BERT evaluated on Czech is a bit worse than Multilingual BERT evaluated on English. This makes sense, as Czech is much harder for the Transformer and BERT than English. The results are approximately 9\% worse in exact match and 5\% worse in F1 score.


\subsection{Conclusion}
We have again observed that training Multilingual BERT on the Czech dataset gives results about 5\% worse in exact match and 8\% worse in F1 than training it on English and only translating the Czech input into English and then translating the answer back to Czech. See table \ref{tab:multibert-cs}.


\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Multilingual BERT} & \textbf{Exact match} & \textbf{F1} \\
\hline
\textbf{Cased (Trained CZ)} & 66.15\% & 74.28\% \\
\hline
\textbf{Uncased (Trained CZ)} & 68.45\% & 76.0\% \\
\hline
\textbf{Cased (Trained EN)} & 72.53\% & 83.93\% \\
\hline
\textbf{Uncased (Trained EN)} & 72.7\% & 84.23\% \\
\hline
\end{tabular}
\caption{Comparison of all the Multilingual BERT models for Czech QA.}
\label{tab:multibert-cs}
\end{table}


\section{Summary}
To sum everything up, we have to compare the results of the BiDAF and BERT models. For both of them, we have observed that training in English gives better results than training in Czech.

Firstly, it is important to mention that BERT has reached much better results in the QA task than BiDAF. See table \ref{tab:both-only-en}.

\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Model} & \textbf{Exact match} & \textbf{F1} \\
\hline
\textbf{BiDAF} & 64.22\% & 75.29\% \\
\hline
\textbf{BERT} & 80.69\% & 88.16\% \\
\hline
\end{tabular}
\caption{Comparison of results of BiDAF and BERT trained and evaluated on English SQuAD.}
\label{tab:both-only-en}
\end{table}

Secondly, we can observe the results of training both models in Czech. We can see that BiDAF is better with the training dataset of matching value 80\% and BERT is better with the training dataset of matching value 100\%. If we compare the best results of both models for Czech, we can see that the best model, Multilingual BERT uncased, is about 9\% better in exact match and 8\% better in F1 score than the best BiDAF model. See table \ref{tab:both-cs}.

\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Model (trained in CZ)} & \textbf{Exact match} & \textbf{F1} \\
\hline
\textbf{BiDAF (Train 80\%, Test 100\%)} & 59.6\% & 67.89\% \\
\hline
\textbf{BiDAF (Train 100\%, Test 100\%)} & 57.03\% & 65.01\% \\
\hline
\textbf{Multi BERT cased (Train 80\%, Test 100\%)} & 64.55\% & 73.66\% \\
\hline
\textbf{Multi BERT uncased (Train 80\%, Test 100\%)} & 66.58\% & 75.6\% \\
\hline
\textbf{Multi BERT cased (Train 100\%, Test 100\%)} & 66.15\% & 74.28\% \\
\hline
\textbf{Multi BERT uncased (Train 100\%, Test 100\%)} & 68.45\% & 76.0\% \\
\hline
\end{tabular}
\caption{Comparison of Multilingual BERT and BiDAF trained on Czech.}
\label{tab:both-cs}
\end{table}


Thirdly, we have compared the models trained in English and evaluated on the Czech dataset, where we have translated the inputs into English, let the model predict the answer, translated the answer back to Czech and compared it with the original answer. We can see that both BERTs are about 10\% better in exact match and 13--14\% better in F1 score even in Czech. See table \ref{tab:both-en}.

\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Model (trained in EN)} & \textbf{Exact match} & \textbf{F1} \\
\hline
\textbf{BiDAF} & 62.08\% & 70.36\% \\
\hline
\textbf{Multi BERT cased} & 72.53\% & 83.93\% \\
\hline
\textbf{Multi BERT uncased} & 72.7\% & 84.23\% \\
\hline
\end{tabular}
\caption{Comparison of Multilingual BERT and BiDAF trained on English and evaluated on Czech.}
\label{tab:both-en}
\end{table}


Finally, the best model for Czech QA is definitely Multilingual BERT uncased, which has reached 72.7\% exact match and 84.23\% F1 score. In comparison with the best model for QA in English, BERT, it is 8\% worse in exact match and 4\% worse in F1. See table \ref{tab:best-comp}.

\begin{table}[h]
\centering
\begin{tabular}{| l | l | l |}
\hline
\textbf{Model} & \textbf{Exact match} & \textbf{F1} \\
\hline
\textbf{BERT for English} & 80.69\% & 88.16\% \\
\hline
\textbf{Multi BERT for Czech} & 72.7\% & 84.23\% \\
\hline
\end{tabular}
\caption{Comparison of the best models for English QA and Czech QA.}
\label{tab:best-comp}
\end{table}