Lessons learned from deploying Cloudera Data Platform for IBM Cloud Pak for Data – IBM Developer


In this final blog post of our series, we focus on lessons learned from installing, maintaining, and verifying the connectivity of Cloudera Data Platform and IBM Cloud Pak for Data. If you haven't read the first two posts — A technical deep-dive on integrating Cloudera Data Platform and IBM Cloud Pak for Data and Installing Cloudera's CDP Private Cloud Base on IBM Cloud with Ansible — then I'd invite you to go back and read them for additional context.

In this installment, we'd like to share some helpful tips and tricks and show you how to avoid common mistakes made by first-time installers.

Lesson 1: Use a bastion host

Our Cloudera cluster had a total of 8 VMs (3 master nodes, 3 worker nodes, and 2 edge nodes). We wanted easy access to each node and wanted to limit public network traffic to the Cloudera cluster as much as possible. Fortunately, there's already a well-known solution to this problem: using a bastion host.

We spun up a small VM on the same subnet as our Cloudera cluster and could then easily communicate over private network interfaces (10.x.y.z IP addresses). For the installation process, this choice offered the benefit of not dropping connections during long-running Ansible playbooks.
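One way to make the bastion transparent is an OpenSSH client config using ProxyJump, so that every ssh and scp connection (and anything built on them) hops through the bastion automatically. This is a sketch only — the host names, IP placeholder, and user are illustrative, not from our actual setup:

```
# ~/.ssh/config on your workstation -- names and user are illustrative
Host bastion
    HostName <bastion public ip>
    User root

# Any cluster node is reached by hopping through the bastion first
Host cid-vm-*
    ProxyJump bastion
    User root
```

Running the playbooks from the bastion itself (inside tmux or screen) gives the same protection against dropped connections on long-running tasks.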

Figure 1. The architecture of our Cloudera for Cloud Pak for Data environment

Lesson 2: Use VS Code's Remote Development extension

When installing Cloudera Data Platform with Ansible playbooks, you're likely going to need to change a few config options and values in the playbooks. We're not against using Vim, but we opted to use the Visual Studio Code Remote Development Extension Pack. This made browsing through the files, modifying values, and uploading and downloading files much easier.

Figure 2. VS Code's Remote Development extension proved handy for editing files and running commands against our remote machines

Lesson 3: Stick to private networks

This point may seem obvious, but it's really about being consistent. Anywhere an IP address had to be entered, we always made sure to use the private network IP address. This ensured that any traffic would stay on the IBM Cloud network and off the public Internet.
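A cheap guard for that consistency rule is to scan whatever host list feeds the playbooks and flag any address outside the private 10.0.0.0/8 range before running anything. The inventory addresses below are made up for illustration:

```shell
# Warn about any address in a host list that is not a 10.x private address.
hosts="10.93.0.11 10.93.0.12 10.93.0.13 169.48.22.7"

for h in $hosts; do
  case "$h" in
    10.*) ;;                                         # private: stay quiet
    *) echo "WARNING: public address in inventory: $h" ;;
  esac
done
```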

Lesson 4: Eliminate all inbound traffic except RDP on the Windows Active Directory server

Here's a subtle lesson that can otherwise be a little tricky to pin down. After a few days of uptime, the health checks on our Cloudera Data Platform were indicating that the hosts couldn't reach our Active Directory (AD) server. Sure enough, we discovered that our AD had hung. When we rebooted the AD server, things would return to normal for a day or so, and then the cycle would repeat.

We looked over the capacity and performance of the server. When we examined network utilization, we noticed a high level of traffic going to and from the system on the Internet-facing interface. After looking at the server configuration and the traffic, we were able to determine that the vast majority of it was over the LDAP port.

Since our only use of LDAP is internal, the solution to this problem was to limit inbound traffic to the AD by creating a rule that only allowed traffic over the RDP protocol, which is used for remote desktop administration. On IBM Cloud, we created a custom security group permitting inbound TCP on port 3389 for RDP.
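The security group does the filtering at the network edge; as a defense-in-depth complement (not what we actually configured), the same posture can be sketched on the AD server itself with Windows firewall rules, scoping the default-deny to the Internet-facing (Public) profile so internal LDAP from the cluster is unaffected. The rule name is our own invention:

```
# Default-deny inbound on the Internet-facing (Public) profile only,
# leaving the domain/private networks (where cluster LDAP lives) alone
Set-NetFirewallProfile -Profile Public -DefaultInboundAction Block
New-NetFirewallRule -DisplayName "Allow RDP only" -Direction Inbound `
  -Protocol TCP -LocalPort 3389 -Action Allow
```

Test this from a console session first — a mis-scoped inbound block can lock you out of remote management entirely.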

Lesson 5: Mount secondary drives to /data/dfs automatically

The storage requirements for installing Cloudera required us to purchase additional drives to go along with our virtual machines. These drives had to be mounted before running any playbooks. We used a little bit of Bash and SSH to do it in an automated way. In our case, we chose to mount the drives to /data/dfs:

for i in {1..8}; do
  ssh cid-vm-0$i mkfs.ext4 -m0 -O sparse_super,dir_index,extent,has_journal /dev/xvdc
  ssh cid-vm-0$i mkdir -p /data/dfs
  ssh cid-vm-0$i mount /dev/xvdc /data/dfs
  ssh cid-vm-0$i 'echo "/dev/xvdc  /data/dfs   ext4  defaults,noatime 1 2" | tee -a /etc/fstab'
done

Lesson 6: Update the OpenShift DNS operator so it knows the Cloudera node hostnames

We wanted our IBM Cloud Pak for Data instance, which runs on OpenShift, to be able to communicate with our newly deployed Cloudera Data Platform cluster. We stuck to our "always use private network interfaces" rule, but that resulted in 404s since OpenShift didn't know how to resolve those hostnames. To get around this, we needed to edit the DNS operator on our OpenShift instance. It's documented in the OpenShift DNS documentation, but for brevity, we've included what worked for us.

Edit the default DNS operator CR with oc edit dns.operator/default and update it by adding the following to the spec section:

  servers:
  - forwardPlugin:
      upstreams:
      - <your private ip>
      - <your public ip>
    name: cdplab-server
    zones:
    - cdplab.local

Then verify that the configmap for CoreDNS was updated: oc get configmap/dns-default -n openshift-dns -o yaml

apiVersion: v1
data:
  Corefile: |
    # cdplab-server
    cdplab.local:5353 {
        forward . <your private ip> <your public ip>
    }

Then create a pod and try to access CDP from it; HTML should be returned, not a 404 error message.

bash-4.4$ curl -k https://cid-vm-01.cdplab.local:7183/cmf/home

Lesson 7: Ensure the AD self-signed certificate can be used as a certificate authority

This lesson can be broadly applied to other LDAP and AD scenarios. In our case, we could successfully connect to the Impala service running on Cloudera via Kerberos, but not via LDAP. After double-checking that our LDAP-specific Impala configuration was correct, we were still getting a not-so-helpful "Can't contact LDAP server" error.

We slowly started to peel back the layers of the problem. We managed to isolate it to our LDAP configuration, and we realized this was the case because when we ran ldapsearch in an attempt to bind the user, it gave us the same error message. Ah-ha! Impala was using an OpenLDAP library under the covers.

$ ldapsearch -H ldaps://cid-adc.cdplab.local:636 -D "stevemar@CDPLAB.LOCAL" -b "dc=cdplab,dc=local" '(uid=stevemar)' -W
Enter LDAP Password:
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)

After double-checking that the Windows firewall wasn't the culprit, we narrowed the problem down to a missing bit of information in the self-signed certificate we had created for the AD. We needed to add the -TextExtension "{text}CA=true" flag to the Windows New-SelfSignedCertificate command. Our new command looked like this (previously, it was missing the last parameter):

New-SelfSignedCertificate -Subject *.$dnsName `
  -NotAfter $lifetime.AddDays(365) -KeyUsage DigitalSignature, KeyEncipherment `
  -Type SSLServerAuthentication -DnsName *.$dnsName, $dnsName `
  -TextExtension "{text}CA=true"
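To see why that flag matters from the client side: a self-signed certificate can only act as its own issuer if it carries the Basic Constraints CA:TRUE extension, which the OpenLDAP library checks when validating the ldaps:// connection. A rough OpenSSL equivalent of the fixed command — file paths and the CN are illustrative, and -addext needs OpenSSL 1.1.1 or later — looks like:

```shell
# Issue a self-signed cert marked as a CA, then confirm the extension stuck.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/ad-test.key -out /tmp/ad-test.crt \
  -days 365 -subj "/CN=cid-adc.cdplab.local" \
  -addext "basicConstraints=critical,CA:TRUE"

# If this prints nothing, LDAPS clients will refuse to trust the cert.
openssl x509 -in /tmp/ad-test.crt -noout -text | grep "CA:TRUE"
```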

There's no single piece of advice here, other than: if you're going to use Kerberos to secure your Cloudera cluster, get familiar with Kerberos concepts, like keytabs, and tools like ktutil and ktpass.

Summary and next steps

We hope you enjoyed reading about some of the pitfalls we encountered and remember some of the tips we shared the next time you're deploying a data and AI platform. You can learn more about the Cloudera Data Platform for IBM Cloud Pak for Data joint offering.

