Raspberry Pi cluster for Big Data frameworks
Motivation
My personal motivation for starting this project is twofold. First, I had a short introductory course on Hadoop while at Centrale Lille, but not enough to claim being a Big Data Engineer or anything of the sort. Second, at Crédit Agricole Consumer Finance we have a small MapR cluster for storing e.g. web log files, which nobody understands properly except for the preprocessed data that lands in structured Vertica tables. My motivation for writing this guide is that the tutorials I found on the subject were either outdated, had missing steps, or went into great lengths on things that interest me yet stopped at a working install, without any real-world use case. Nevertheless, I obviously borrowed lots of material from them, which you'll find in the "Sources" section at the end of the post.
Tools and prerequisites
I wanted this to be really cost-efficient, since it's only to play with for a few days and I'll have to think about ways to recycle the Pis afterwards (maybe IoT stuff?), but I didn't want to compromise on performance or versatility (e.g. the Ethernet port), so I went with the Raspberry Pi 3B+ (affiliate link), which I got in January 2019 for 32.80 euros each! With the Raspberry Pi 4 coming out around these days, there are already tons of deals on the 3B+ for a very good performance / price ratio. I needed a switch as well, because I wanted everything to be wired: I went for the TP-Link TL-SG105 (affiliate link) for 17.84 euros on Amazon.fr. I considered the PoE (Power over Ethernet) option, but it made no sense cost-wise: the switch would have cost a lot more and each Pi would have needed a special PoE board. The total cost would have shot through the roof, for the same computational power, just to remove 4 USB cables and a USB port. For those interested in PoE, this tutorial implements it well. So in the end, I needed a powered USB multiport charger, which can be had very cheaply. However, I had power problems with cheap adaptors on a first-generation Raspberry Pi, so I wanted something that could deliver at least 2A per port. I'd heard lots of good things about Anker, so I bought their PowerPort 10 (affiliate link) for 29.99 euros. This is probably the least cost-effective part of the project, but again, power was the weak point of a previous build and I want it to serve for other things as well... I ordered the USB and Ethernet cables from AliExpress (affiliate links) for 0.79 and 0.64 euros each, respectively. Make sure you get USB cables that can withstand 2.4A and Ethernet cables rated CAT6 or better (for the price difference: better safe than sorry).
Setting up the Pis individually
For obvious reasons (versatility, potential debugging via Google, simplicity), I've decided to go with the "standard" Pi distro: Raspbian, in its Lite version (see here). Most micro SD cards come with an SD adaptor, which was the case for me. To install the distro on all 4 Pis, I went with the terminal approach (since I already use it a lot and we're going to use it a lot here), but there's also a GUI approach. For both approaches, refer to the installation instructions. Here is the first tricky part: for a few years now, Raspbian has not enabled ssh login at first boot. You either need to plug a monitor into each Pi (or do them sequentially if you don't have 4 monitors :-)) or put a file named ssh at the root of each Pi's boot partition (which is what I did). Tip: at this point, you should mark either the SD cards or the Pis so you know which is which. I've gone with small stickers on the Ethernet port.
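For reference, here is a minimal sketch of that terminal approach on Linux; the image file name and the /dev/sdX device are placeholders, so double-check the device with lsblk before writing anything:
- unzip raspbian_lite_latest.zip # placeholder file name for the downloaded image
- lsblk # identify the SD card, e.g. /dev/sdX
- sudo dd if=20XX-XX-XX-raspbian-buster-lite.img of=/dev/sdX bs=4M conv=fsync status=progress
- sudo mount /dev/sdX1 /mnt # the small FAT boot partition
- sudo touch /mnt/ssh # enables ssh at first boot
- sudo umount /mnt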
Configuring the network
Once everything is hooked up, let's fire them up one by one! Check e.g. your router for each Pi's IP address, then from your command line type:
- ssh pi@IP_address
- passwd
- sudo apt-get update
- sudo apt-get upgrade
Then give each Pi a fixed IP address by appending something like this to /etc/dhcpcd.conf (shown here for pi1; use 192.168.1.22, .23 and .24 for the others), and reboot or restart dhcpcd for it to take effect:
- interface eth0
- static ip_address=192.168.1.21/24
- static routers=192.168.1.1
Next, map the hostnames in /etc/hosts on every Pi (each Pi's own hostname, e.g. in /etc/hostname, should match its entry):
- 192.168.1.21 pi1
- 192.168.1.22 pi2
- 192.168.1.23 pi3
- 192.168.1.24 pi4
Then generate an SSH key on each Pi and copy the public key to every Pi (itself included), so everything can ssh around without passwords:
- ssh-keygen # no passphrase!
- ssh-copy-id pi@pi1
- ssh-copy-id pi@pi2
- ssh-copy-id pi@pi3
- ssh-copy-id pi@pi4
Now add the following helper functions to ~/.bashrc and source it (we'll sanity-check them right after the definitions):
- function otherpis { # this lists the other Pis' hostnames
- grep "pi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
- }
- function clustercmd { #this will forward our command to all Pi's
- for pi in $(otherpis); do ssh $pi "source ~/.bashrc;$@"; done
- $@
- }
- function clusterscp { #same with file transfer
- for pi in $(otherpis); do
- cat $1 | ssh $pi "sudo tee $1" > /dev/null 2>&1
- done
- }
- function clusterreboot { #reboot the cluster
- stop-yarn.sh && stop-dfs.sh && \
- clustercmd sudo shutdown -r now
- }
- function clustershutdown { #shutdown the cluster
- stop-yarn.sh && stop-dfs.sh && \
- clustercmd sudo shutdown now
- }
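Once ~/.bashrc is sourced, a quick sanity check of these helpers (my suggestion, not a required step) could be:
- clustercmd hostname # prints the other Pis' hostnames, then the local one
- clusterscp ~/.bashrc # pushes this same ~/.bashrc to every other Pi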
And comment out these lines (so that we can use environment variables on each Pi when sshing into it non-interactively, see here):
- # If not running interactively, don't do anything
- #case $- in
- # *i*) ;;
- # *) return;;
- #esac
One clumsy thing that was a source of errors before I found this tutorial was the absence of date synchronisation (which is of foremost importance if we want all Pis to communicate, keep consistent logs, etc.). The following commands install htpdate and synchronise every Pi's clock against the same server.
- clustercmd "sudo apt install htpdate"
- clustercmd sudo htpdate -a -l time.nist.gov
- clustercmd date
A crucial point is to ensure that Java, Hadoop, Spark, and H2O Sparkling Water have compatible versions. At the time of this writing, Java 11, Hadoop 3.3.4, Spark 3.3.1 and "Sparkling Water for Spark 3.3" seem to be mutually compatible. To get Hadoop working, we need Java 11, which is not installed by default since we have the Lite version of Raspbian. I installed OpenJDK by typing:
- sudo apt-get install openjdk-11-jre-headless -y
Return to your ~/.bashrc file to add Java's location:
- export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
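A quick check that everything resolves as expected (just a suggestion of mine; the exact JVM path depends on your image):
- source ~/.bashrc
- java -version # should report OpenJDK 11
- echo $JAVA_HOME # should point at the directory containing bin/java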
Installing Hadoop and Spark on Pi 1: single-machine cluster
Now we can get to the real stuff by first making Hadoop and Spark work on a single Pi. At the time of writing, Hadoop is at version 3.3.4, so be careful to use the latest version and update the following links if need be.
- cd && wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
- sudo tar -xvf hadoop-3.3.4.tar.gz -C /opt/
- rm hadoop-3.3.4.tar.gz
- cd /opt
- sudo mv hadoop-3.3.4 hadoop
- sudo chown pi:pi -R /opt/hadoop
We follow the same process for Spark:
- cd && wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
- sudo tar -xvf spark-3.3.1-bin-hadoop3.tgz -C /opt/
- rm spark-3.3.1-bin-hadoop3.tgz && cd /opt
- sudo mv spark-3.3.1-bin-hadoop3 spark
- sudo chown pi:pi -R /opt/spark
NOTE: some system administrators recommend having a separate user for all things related to Hadoop. Here, pi (the standard user) is the user we'll use to launch Hadoop and its related processes, but also tons of other stuff, which might get confusing. We still need to edit a few Hadoop configuration files to get it working. In /opt/hadoop/etc/hadoop/core-site.xml, between the <configuration> tags, insert:
- <property>
- <name>fs.defaultFS</name>
- <value>hdfs://pi1:9000</value>
- </property>
In /opt/hadoop/etc/hadoop/hdfs-site.xml, between the <configuration> tags as well, insert:
- <property>
- <name>dfs.datanode.data.dir</name>
- <value>file:///opt/hadoop_tmp/hdfs/datanode</value>
- </property>
- <property>
- <name>dfs.namenode.name.dir</name>
- <value>file:///opt/hadoop_tmp/hdfs/namenode</value>
- </property>
- <property>
- <name>dfs.replication</name>
- <value>1</value>
- </property>
In /opt/hadoop/etc/hadoop/mapred-site.xml, between the <configuration> tags as well, insert:
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- </property>
In /opt/hadoop/etc/hadoop/yarn-site.xml, between the <configuration> tags as well, insert:
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
- <value>org.apache.hadoop.mapred.ShuffleHandler</value>
- </property>
In /opt/hadoop/etc/hadoop/hadoop-env.sh, we also need to tell Hadoop where Java is located:
- export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
Then create the directories HDFS will use and hand them over to the pi user:
- sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
- sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
- sudo chown pi:pi -R /opt/hadoop_tmp
On all Pis (or later through rsync), modify ~/.bashrc again (don't forget to source it, at least on the Pi you're installing the single-machine cluster on):
- export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
- export HADOOP_HOME=/opt/hadoop
- export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
- export SPARK_HOME=/opt/spark
- export PATH=$PATH:$SPARK_HOME/bin
- export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
- export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
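With the PATH updated and ~/.bashrc sourced, both installs should now be callable from anywhere; a quick sanity check I'd suggest (not in the original write-ups):
- hadoop version | grep Hadoop
- spark-shell --version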
We can now format the HDFS, start the service, verify that it works well, and stop it again:
- hdfs namenode -format -force
- start-dfs.sh
- hdfs dfsadmin -report
- stop-dfs.sh
Hadoop and Spark, orchestrated by Yarn, on the Pi cluster
Now we need to re-configure Hadoop on pi1 (the master) so that it knows it has to talk to the other Pis. We go back to core-site.xml in /opt/hadoop/etc/hadoop:
- <property>
- <name>fs.defaultFS</name>
- <value>hdfs://pi1:9000</value>
- </property>
Then in /opt/hadoop/etc/hadoop/hdfs-site.xml, update the datanode / namenode directories and raise the replication factor now that we have several datanodes:
- <property>
- <name>dfs.datanode.data.dir</name>
- <value>/opt/hadoop_tmp/hdfs/datanode</value>
- </property>
- <property>
- <name>dfs.namenode.name.dir</name>
- <value>/opt/hadoop_tmp/hdfs/namenode</value>
- </property>
- <property>
- <name>dfs.replication</name>
- <value>2</value>
- </property>
In /opt/hadoop/etc/hadoop/mapred-site.xml, keep YARN as the framework and cap the MapReduce memory so everything fits in the Pi's 1 GB of RAM:
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- </property>
- <property>
- <name>yarn.app.mapreduce.am.resource.mb</name>
- <value>256</value>
- </property>
- <property>
- <name>mapreduce.map.memory.mb</name>
- <value>128</value>
- </property>
- <property>
- <name>mapreduce.reduce.memory.mb</name>
- <value>128</value>
- </property>
And in /opt/hadoop/etc/hadoop/yarn-site.xml, point everything at pi1 as the ResourceManager and cap YARN's memory as well:
- <property>
- <name>yarn.acl.enable</name>
- <value>0</value>
- </property>
- <property>
- <name>yarn.resourcemanager.hostname</name>
- <value>pi1</value>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
- <value>org.apache.hadoop.mapred.ShuffleHandler</value>
- </property>
- <property>
- <name>yarn.nodemanager.resource.memory-mb</name>
- <value>900</value>
- </property>
- <property>
- <name>yarn.scheduler.maximum-allocation-mb</name>
- <value>900</value>
- </property>
- <property>
- <name>yarn.scheduler.minimum-allocation-mb</name>
- <value>64</value>
- </property>
- <property>
- <name>yarn.nodemanager.vmem-check-enabled</name>
- <value>false</value>
- </property>
Now create the same directories on the other Pis, hand them to the pi user, and copy the Hadoop and Spark installs over with rsync:
- clustercmd sudo mkdir -p /opt/hadoop_tmp/hdfs
- clustercmd sudo chown pi:pi -R /opt/hadoop_tmp
- clustercmd sudo mkdir -p /opt/hadoop
- clustercmd sudo chown pi:pi /opt/hadoop
- clustercmd sudo mkdir -p /opt/spark
- clustercmd sudo chown pi:pi /opt/spark
- rsync -a /opt/hadoop/ pi@pi2:/opt/hadoop/
- rsync -a /opt/hadoop/ pi@pi3:/opt/hadoop/
- rsync -a /opt/hadoop/ pi@pi4:/opt/hadoop/
- rsync -a /opt/spark/ pi@pi2:/opt/spark/
- rsync -a /opt/spark/ pi@pi3:/opt/spark/
- rsync -a /opt/spark/ pi@pi4:/opt/spark/
Then wipe whatever the single-node experiment left in HDFS, re-format the namenode, and start both HDFS and YARN:
- clustercmd rm -rf /opt/hadoop_tmp/hdfs/datanode/*
- clustercmd rm -rf /opt/hadoop_tmp/hdfs/namenode/*
- hdfs namenode -format -force
- start-dfs.sh && start-yarn.sh
You should be able to browse HDFS's dashboard at pi1:9870 and Yarn's dashboard at pi1:8088.
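If you want to see YARN actually schedule work across the Pis, a quick optional check is to submit the example jar that ships with Hadoop (adjust the version in the path to the release you downloaded):
- yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 4 1000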
So that's where most tutorials end, leaving you with a nice command-line tool and a nice logging web interface, which, if you're in a Data Scientist role, is usually what Software Engineers and co. hand over to you. Doing analytics / machine learning / etc. will require additional tools and ad hoc knowledge. I'm an R aficionado, so I will illustrate this with RStudio Server and a small example. For an RStudio Server equivalent covering various languages, including Python, check out code-server (or JupyterLab for notebooks).
Installing RStudio Server
If you're not familiar with R, there are tons of great resources out there. The language has really been modernised, and although I don't want to get into the endless R vs Python debate, it's not the clumsy, buggy, not-production-ready language I'd been told about... It also integrates, in a point-and-click fashion, lots of useful tools: Git(hub), ODBC connectors to many kinds of databases, help and vignette pages, Markdown functionality, "equivalents" of Python's virtual environments (packrat, renv)...
So we're going to install RStudio Server on pi1, since its resources are less used than the other Pis', in the same fashion as explained on RStudio's website.
Until recently, you had to build RStudio Server from source. Binaries are now available from this GitHub repo, although they're a bit outdated. I document both approaches.
From binaries
- wget https://github.com/ArturKlauser/raspberrypi-rstudio/releases/download/v1.5/rstudio-server-1.2.5033-1.r2r.buster_armhf.deb
- sudo apt --fix-broken install ./rstudio-server-1.2.5033-1.r2r.buster_armhf.deb
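Assuming the package registered the usual systemd service, you can then check that it's up and head to the web interface:
- sudo systemctl status rstudio-server # then browse to http://pi1:8787 and log in as the pi user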
From source
For this part, I relied heavily on this tutorial.
- sudo apt-get update
- sudo apt-get upgrade
- sudo apt-get install r-base-dev postgresql libpq-dev postgresql-client postgresql-client-common
- cd /usr/local/lib/R/
- sudo chmod 7777 site-library/
- sudo apt-get install git libpam0g-dev uuid-dev ant libssl-dev cmake
- sudo curl https://sh.rustup.rs -sSf | sh
- source $HOME/.cargo/env
- cd && git clone https://github.com/getsentry/sentry-cli.git
- cd sentry-cli
- sudo service dphys-swapfile stop
- sudo nano /etc/dphys-swapfile # in which we set:
- CONF_SWAPSIZE=3072
- sudo service dphys-swapfile restart
- cargo build
- cd && rm -rf sentry-cli
- git clone https://github.com/rstudio/rstudio.git
- cd rstudio/dependencies/common
- nano install-common
- ./install-common
- cd ../linux
- wget http://snapshot.debian.org/archive/debian-security/20220210T093340Z/pool/updates/main/o/openjdk-8/openjdk-8-jdk-headless_8u322-b06-1~deb9u1_arm64.deb
- wget http://snapshot.debian.org/archive/debian-security/20220210T093340Z/pool/updates/main/o/openjdk-8/openjdk-8-jdk_8u322-b06-1~deb9u1_arm64.deb
- wget http://snapshot.debian.org/archive/debian-security/20220210T093340Z/pool/updates/main/o/openjdk-8/openjdk-8-jre-headless_8u322-b06-1~deb9u1_arm64.deb
- wget http://snapshot.debian.org/archive/debian-security/20220210T093340Z/pool/updates/main/o/openjdk-8/openjdk-8-jre_8u322-b06-1~deb9u1_arm64.deb
- sudo dpkg -i openjdk-8-jre-headless_8u322-b06-1~deb9u1_arm64.deb
- sudo dpkg -i openjdk-8-jre_8u322-b06-1~deb9u1_arm64.deb
- sudo dpkg -i openjdk-8-jdk-headless_8u322-b06-1~deb9u1_arm64.deb
- sudo dpkg -i openjdk-8-jdk_8u322-b06-1~deb9u1_arm64.deb
- sudo apt --fix-broken install
- nano install-dependencies-jammy # comment out (or delete) the platform check below so the script runs on Raspbian
- platform_codename=$(lsb_release -sc)
- if [ $platform_codename != "jammy" ] ; then
- echo Error: This script is only for use on Ubuntu Jammy
- exit 1
- fi
- ./install-dependencies-jammy
- cd ../.. && mkdir build && cd build
- cmake .. -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release
- sudo ln -sf "$(which node)" /usr/bin/node
- sudo NODE_OPTIONS="--max-old-space-size=8196" make install # this takes 1-2 days!
- sudo useradd -r rstudio-server
- sudo cp src/cpp/server/extras/init.d/debian/rstudio-server /etc/init.d/rstudio-server
- sudo chmod +x /etc/init.d/rstudio-server
- sudo ln -f -s /usr/local/bin/rstudio-server /usr/sbin/rstudio-server
- sudo ln -f -s /usr/local/extras/systemd/rstudio-server.service /etc/systemd/system/rstudio-server.service
- sudo rstudio-server start
H2O on the cluster
I wanted to add H2O to my configuration since I used it for some time at work, and we'll use it in the next section. On pi1, first, a few environment variables:
- export MASTER="yarn"
- export _JAVA_OPTIONS="-Xmx512M"
- wget https://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.3/3.38.0.2-1-3.3/sparkling-water-3.38.0.2-1-3.3.zip
- unzip sparkling-water-3.38.0.2-1-3.3.zip
- rm sparkling-water-3.38.0.2-1-3.3.zip
- sudo mv sparkling-water-3.38.0.2-1-3.3 /opt/sparkling-water
- cd /opt/sparkling-water
- bin/sparkling-shell --num-executors 3 --executor-memory 512m --driver-memory 512m --master yarn --deploy-mode client
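From a second terminal on pi1 (the shell above keeps the first one busy), you can confirm that YARN accepted the application; this is just a sanity check of mine, not a required step:
- yarn application -list # the Spark shell launched above should be listed as RUNNING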
Machine Learning on the cluster
For a real-world example, let's load the French 2018 road accident open data (from data.gouv.fr) into HDFS:
- hadoop fs -mkdir /tmp
- curl https://www.data.gouv.fr/fr/datasets/r/b4aaeede-1a80-4d76-8f97-543dad479167 | hdfs dfs -put - /tmp/vehicules-2018.csv
- hadoop fs -ls /tmp
Found 2 items
-rw-r--r--   4 pi supergroup          0 2020-05-03 17:40 /tmp/-
-rw-r--r--   4 pi supergroup          0 2020-05-03 17:41 /tmp/vehicules-2018.csv
- curl https://www.data.gouv.fr/fr/datasets/r/72b251e1-d5e1-4c46-a1c2-c65f1b26549a | hdfs dfs -put - /tmp/usagers-2018.csv
- curl https://static.data.gouv.fr/resources/base-de-donnees-accidents-corporels-de-la-circulation/20191014-111741/caracteristiques-2018.csv | hdfs dfs -put - /tmp/caracteristiques-2018.csv
- curl https://www.data.gouv.fr/fr/datasets/r/d9d65ca1-16a3-4ea3-b7c8-2412c92b69d9 | hdfs dfs -put - /tmp/lieux-2018.csv
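All four datasets should now be sitting in HDFS, which you can confirm with:
- hdfs dfs -ls /tmp
- hdfs dfs -du -h /tmp # file sizes should be non-zero once the uploads finish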
Let's go back to RStudio Server (normally at pi1:8787) and install sparklyr, h2o, and rsparkling like so:
- install.packages('sparklyr')
- install.packages("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R/src/contrib/h2o_3.38.0.2.tar.gz", type="source")
- install.packages("/opt/sparkling-water/rsparkling_3.38.0.2-1-3.3.tar.gz", repos = NULL, type="source")
- library(sparklyr)
- options(sparklyr.log.console = TRUE)
- library(rsparkling)
- library(h2o)
- Sys.setenv(SPARK_HOME="/opt/spark/")
- Sys.setenv(HADOOP_HOME="/opt/hadoop/")
- Sys.setenv(HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop")
- Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-1.11.0-openjdk-arm64")
- Sys.setenv("_JAVA_OPTIONS"="-Xmx512M")
- conf <- spark_config()
- conf$spark.executor.memory <- "512M"
- conf$spark.sparklyr.gateway.start.timeout <- 180
- conf$spark.executor.cores <- 2
- conf$spark.dynamicAllocation.enabled <- "false"
- sc <- spark_connect(master = "yarn", method = "shell", config = conf)
- caracteristiques <- spark_read_csv(sc=sc, name="caracteristiques", path="hdfs:///tmp/caracteristiques-2018.csv")
- h2o.init(ip = "localhost", nthreads = 1, max_mem_size = "512M")
- caracteristiques_hf <- as.h2o(caracteristiques)
Watercooling the cluster
OK, this is where things go out of control! I saw a few YouTube videos of people watercooling Raspberry Pis and I thought it would be great to do the same... Also, I didn't know anything about watercooling, so I learnt a lot along the way.
Parts
Basically, you need:
1. A pump that will run the water through the circuit.
2. A radiator, just like the ones you have in your home or your car, in which a very long canal yields a very big exchange surface with the surrounding air, which in turn cools the liquid. Mine is extremely big and overkill for Raspberry Pis, so you might want a much smaller one if size matters.
3. Fans, one on each side of the radiator, to further enhance the heat exchange: without fans you risk having "hot" air sitting around the radiator that no longer cools efficiently. To be compatible with my radiator, I bought big 180mm fans (Phobya G-Silent 18).
4. Waterblocks, which are put on the parts you want to cool. Typically you would cool the CPU first, then the network chip (if you intend to use your Raspberry Pis intensively for their networking capabilities) and / or the memory chip (RAM) on the other side of the board (I went for that option since I thought it was cool to have waterblocks on both sides).
5. Various connectors and tubes: 1/4'' plugs, 1/4'' threads, 5/16'' compression fittings, coolant, big tubes for the pump and the radiator, small tubes for the Pis, splitters / dividers, seal plugs.
6. Optional: an enclosure. I've made a custom one using these acrylic cases, additional screws for the cases, additional screws for the Pis, a USB 5V to 12V adapter for the pump (now unavailable) and a 3-to-1 3-pin connector for the pump and the two fans.
Total cost: a little under 200 €. To be honest, I did try to overclock my Pis, which I managed ridiculously easily, but there is no real point for my use case.
The final beast
Sources
https://dev.to/awwsmm/building-a-raspberry-pi-hadoop-spark-cluster-8b2
https://eltechs.com/overclock-raspberry-pi-3/
https://developer.ibm.com/recipes/tutorials/building-a-hadoop-cluster-with-raspberry-pi/
https://dqydj.com/raspberry-pi-hadoop-cluster-apache-spark-yarn/
https://web.archive.org/web/20170221231927/http://www.becausewecangeek.com/building-a-raspberry-pi-hadoop-cluster-part-1/
https://medium.com/@oliver_hu/build-a-hadoop-3-cluster-with-raspberry-pi-3-f451b7f93254
https://community.rstudio.com/t/setting-up-your-own-shiny-server-rstudio-server-on-a-raspberry-pi-3b/18982
http://herb.h.kobe-u.ac.jp/raspiinfo/rstudio_en.html
https://weidongzhou.wordpress.com/2017/11/16/access-sparkling-water-via-r-studio/