- I have to believe that this has been addressed by now, but here's the workaround I used to get EMR Spark working with BigQuery.
- Since a Spark job can pick any of the available cluster nodes to be the Spark driver for an app, each node needs the BigQuery JSON credentials file in its /home/hadoop/.gcp dir as bq.json. Here are the steps to get that done:
- Start ssh-agent on your laptop so that it can proxy your PEM credentials through the remote EMR master node (slave nodes are not directly reachable).
- ```
- $ eval "$(ssh-agent)"
- […]
- ```
- Then add in the .pem file required:
- ```
- $ ssh-add ~/.ssh/me.pem
- ```
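A quick way to sanity-check the agent before you ssh out is `ssh-add -l`, which lists the loaded keys. The snippet below is a self-contained sketch that uses a throwaway key in /tmp (the key path and name are illustrative, not part of the original setup):

```shell
# Start an agent, load a throwaway demo key, and confirm it is listed
eval "$(ssh-agent -s)" > /dev/null
ssh-keygen -t ed25519 -N '' -f /tmp/demo_key -q
ssh-add /tmp/demo_key 2>/dev/null
ssh-add -l   # lists the fingerprint of the loaded key
```

If `ssh-add -l` reports "The agent has no identities", the agent environment variables didn't make it into your current shell.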
- Next, ssh using "-A" to use the ssh agent:
- ```
- $ ssh -A -i ~/.ssh/me.pem hadoop@<master-node-ip>
- ```
- Then find the cluster ID, which you'll need to enumerate the slave node IPs:
- ```
- $ aws emr list-clusters --region us-west-2 --active | grep Id
- "Id": "j-3UJBBJ07DDEEF"
- ```
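If you want just the bare cluster ID (e.g. to drop straight into the next command), the same field-splitting trick used later in the copy loop works here too. The echoed JSON fragment is an illustrative sample of the `list-clusters` output, not real data:

```shell
# Extract the bare cluster ID from a sample "Id" line of list-clusters output
echo '    "Id": "j-3UJBBJ07DDEEF",' | grep '"Id"' | awk -F'"' '{print $4}'
# → j-3UJBBJ07DDEEF
```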
- And finally copy the bq.json file from the master to each core node:
- ```
- $ for n in `aws emr list-instances --cluster j-3UJBBJ07DDEEF --region us-west-2 \
- --instance-group-types CORE | grep IpAddress | awk -F"\"" '{print $4}'`; \
- do echo $n; ssh $n "mkdir -p .gcp"; scp ./.gcp/bq.json $n:.gcp/; done
- ```
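To see what the `grep IpAddress | awk` pipeline in that loop actually extracts, here it is run against a sample line of `aws emr list-instances` JSON output (the line and IP are illustrative):

```shell
# The pipeline matches any *IpAddress key and prints the quoted value,
# which is field 4 when splitting on double quotes
echo '    "PrivateIpAddress": "10.0.1.23",' | grep IpAddress | awk -F'"' '{print $4}'
# → 10.0.1.23
```

Note that `grep IpAddress` matches both `PrivateIpAddress` and `PublicIpAddress` lines if both appear in the output, so check which one your loop is iterating over.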