<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://dogac.dev/feed.xml" rel="self" type="application/atom+xml"/><link href="https://dogac.dev/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-05T17:46:11+00:00</updated><id>https://dogac.dev/feed.xml</id><title type="html">blank</title><subtitle>Personal website of Doğaç Eldenk. </subtitle><entry><title type="html">Setting up Ephemeral GPU workspaces in Modal for SGLang Development</title><link href="https://dogac.dev/blog/2026/modal-vscode-development/" rel="alternate" type="text/html" title="Setting up Ephemeral GPU workspaces in Modal for SGLang Development"/><published>2026-04-03T00:00:00+00:00</published><updated>2026-04-03T00:00:00+00:00</updated><id>https://dogac.dev/blog/2026/modal-vscode-development</id><content type="html" xml:base="https://dogac.dev/blog/2026/modal-vscode-development/"><![CDATA[<style>div.code-display-wrapper pre,pre.highlight,pre code,.highlight code{line-height:1.4}</style> <h2 id="introduction">Introduction</h2> <p>I’ve recently started doing development on SGLang, which is an LLM inference engine. However, I need GPU access to actively test my changes. I have access to GPUs through SLURM, but jobs can sit in the queue for a long time (sometimes days), and the GPU sits idle for most of a development session, which I don’t like. The selection of GPUs in that cluster is also limited. Therefore, I wanted an ephemeral environment where I can quickly test on different GPUs without keeping them idle while I develop. I first came across Modal during a GPU kernel programming contest, using <code class="language-plaintext highlighter-rouge">flash-infer</code> to deploy and test kernel performance. 
I really liked the ephemeral GPU containers: they are easy and fast to deploy, and you only pay for the time they are running.</p> <blockquote> <p>The code is available at: <a href="https://github.com/Dogacel/modal-workspaces-vscode-sglang">github.com/Dogacel/modal-workspaces-vscode-sglang</a></p> </blockquote> <h2 id="modal-architecture">Modal Architecture</h2> <p>Modal is a serverless runtime platform for AI inference that has low startup times. You can define your entire infrastructure dynamically in Python, without any other configuration file. Modal runs your code in isolated containers. Deployments are called “Apps”, and each App bundles one or more “Functions”. Functions are serverless endpoints, meaning no container runs when there are no incoming requests. Modal also provides “Sandboxes” that allow you to run containers with arbitrary dependencies and scripts, which we will use to create our development environment.</p> <h2 id="setting-up-a-docker-image">Setting up a Docker Image</h2> <p>Modal lets you define your image declaratively using a Python DSL. For my environment, I needed Python 3.11 with CUDA 13.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">modal</span><span class="p">.</span><span class="n">Image</span><span class="p">.</span><span class="nf">from_registry</span><span class="p">(</span>
  <span class="sh">"</span><span class="s">nvidia/cuda:13.0.0-devel-ubuntu24.04</span><span class="sh">"</span><span class="p">,</span> <span class="n">add_python</span><span class="o">=</span><span class="sh">"</span><span class="s">3.11</span><span class="sh">"</span>
<span class="p">)</span>
</code></pre></div></div> <p>Next, I installed some tools that help me work in the CLI for short tasks, such as tmux and vim. I also installed the build tools needed for SGLang and an SSH server to allow connecting directly. Note that I figured out what to install by first starting a relatively empty container and then installing packages manually inside it as I hit errors. Rebuilding the image after every change would be time-consuming.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">.</span><span class="nf">apt_install</span><span class="p">(</span>
    <span class="sh">"</span><span class="s">git</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">build-essential</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">cmake</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">ninja-build</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">vim</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">tmux</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">htop</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">wget</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">curl</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">openssh-client</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">openssh-server</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">libnuma1</span><span class="sh">"</span><span class="p">,</span>
<span class="p">)</span>
<span class="p">.</span><span class="nf">env</span><span class="p">({</span><span class="sh">"</span><span class="s">CUDA_HOME</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">/usr/local/cuda</span><span class="sh">"</span><span class="p">})</span>
</code></pre></div></div> <h3 id="project-setup">Project Setup</h3> <p>We will clone <code class="language-plaintext highlighter-rouge">sglang</code> into our image and install its dependencies. I’ve created a secret on Modal’s dashboard for my GitHub token so that the build can access my private fork. Note that every time we add a new dependency, we have to rebuild the image.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">.</span><span class="nf">run_commands</span><span class="p">(</span>
    <span class="sa">f</span><span class="sh">"</span><span class="s">git clone --branch </span><span class="si">{</span><span class="n">SGLANG_BRANCH</span><span class="si">}</span><span class="s"> https://$GITHUB_TOKEN@github.com/</span><span class="si">{</span><span class="n">GITHUB_USER</span><span class="si">}</span><span class="s">/sglang.git /opt/sglang</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">cd /opt/sglang/python &amp;&amp; pip install -e </span><span class="sh">'</span><span class="s">.[all]</span><span class="sh">'"</span><span class="p">,</span>
    <span class="n">secrets</span><span class="o">=</span><span class="p">[</span><span class="n">github_secret</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div></div> <p>I also had to set my <code class="language-plaintext highlighter-rouge">LD</code> path for SGLang; otherwise it couldn’t find the required <code class="language-plaintext highlighter-rouge">.so</code> files, as they were installed as Python packages.</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-c</span> <span class="s2">"
import nvidia, os, glob;
paths = glob.glob(os.path.join(os.path.dirname(nvidia.__file__), '*/lib'));
open('/etc/ld.so.conf.d/nvidia-python.conf','w').write('</span><span class="se">\n</span><span class="s2">'.join(paths))
"</span> <span class="o">&amp;&amp;</span> ldconfig
</code></pre></div></div> <h2 id="persisting-the-development-environment">Persisting the development environment</h2> <p>I’ve created volumes to persist my workspace and the Hugging Face cache, so I don’t have to push or pull my work every time. We will mount those volumes to our app or the sandbox later on.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hf_cache_vol</span> <span class="o">=</span> <span class="n">modal</span><span class="p">.</span><span class="n">Volume</span><span class="p">.</span><span class="nf">from_name</span><span class="p">(</span><span class="sh">"</span><span class="s">hf-cache</span><span class="sh">"</span><span class="p">,</span> <span class="n">create_if_missing</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">workspace_vol</span> <span class="o">=</span> <span class="n">modal</span><span class="p">.</span><span class="n">Volume</span><span class="p">.</span><span class="nf">from_name</span><span class="p">(</span><span class="sh">"</span><span class="s">sglang-workspace</span><span class="sh">"</span><span class="p">,</span> <span class="n">create_if_missing</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">HF_CACHE_PATH</span> <span class="o">=</span> <span class="sh">"</span><span class="s">/root/.cache/huggingface</span><span class="sh">"</span>
<span class="n">WORKSPACE_PATH</span> <span class="o">=</span> <span class="sh">"</span><span class="s">/workspace</span><span class="sh">"</span>

<span class="n">VOLUMES</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">HF_CACHE_PATH</span><span class="p">:</span> <span class="n">hf_cache_vol</span><span class="p">,</span>
    <span class="n">WORKSPACE_PATH</span><span class="p">:</span> <span class="n">workspace_vol</span><span class="p">,</span>
<span class="p">}</span>

<span class="bp">...</span>
<span class="p">.</span><span class="nf">env</span><span class="p">({</span>
    <span class="sh">"</span><span class="s">HF_HUB_CACHE</span><span class="sh">"</span><span class="p">:</span> <span class="n">HF_CACHE_PATH</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">SGLANG_CACHE_DIR</span><span class="sh">"</span><span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">WORKSPACE_PATH</span><span class="si">}</span><span class="s">/.sglang_cache</span><span class="sh">"</span><span class="p">,</span>
<span class="p">})</span>
</code></pre></div></div> <h2 id="ssh-access">SSH access</h2> <p>To get SSH access without a password prompt, we can copy our public key directly into the image.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="n">_SSH_KEY_NAMES</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">id_ed25519.pub</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">id_rsa.pub</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">id_ecdsa.pub</span><span class="sh">"</span><span class="p">]</span>
<span class="n">SSH_PUB_KEY</span> <span class="o">=</span> <span class="nf">next</span><span class="p">(</span>
    <span class="p">(</span><span class="n">Path</span><span class="p">.</span><span class="nf">home</span><span class="p">()</span> <span class="o">/</span> <span class="sh">"</span><span class="s">.ssh</span><span class="sh">"</span> <span class="o">/</span> <span class="n">name</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">_SSH_KEY_NAMES</span> <span class="nf">if </span><span class="p">(</span><span class="n">Path</span><span class="p">.</span><span class="nf">home</span><span class="p">()</span> <span class="o">/</span> <span class="sh">"</span><span class="s">.ssh</span><span class="sh">"</span> <span class="o">/</span> <span class="n">name</span><span class="p">).</span><span class="nf">exists</span><span class="p">()),</span>
    <span class="bp">None</span><span class="p">,</span>
<span class="p">)</span>

<span class="bp">...</span>
<span class="p">.</span><span class="nf">run_commands</span><span class="p">(</span><span class="sh">"</span><span class="s">mkdir -p /run/sshd /root/.ssh &amp;&amp; chmod 700 /root/.ssh</span><span class="sh">"</span><span class="p">)</span>
<span class="p">.</span><span class="nf">add_local_file</span><span class="p">(</span><span class="nf">str</span><span class="p">(</span><span class="n">SSH_PUB_KEY</span><span class="p">),</span> <span class="sh">"</span><span class="s">/root/.ssh/authorized_keys</span><span class="sh">"</span><span class="p">,</span> <span class="n">copy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div> <h2 id="setting-up-vscode--claude-code">Setting up VSCode &amp; Claude Code</h2> <p>I prefer working in VSCode using the “Remote Access” plugin. Therefore, I’ve installed the VSCode CLI directly into the image to persist my workspace settings and session connection.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">.</span><span class="nf">run_commands</span><span class="p">(</span>
    <span class="sh">"</span><span class="s">curl -fsSL </span><span class="sh">'</span><span class="s">https://code.visualstudio.com/sha/download?build=stable&amp;os=cli-alpine-x64</span><span class="sh">'</span><span class="s"> -o /tmp/vscode-cli.tar.gz</span><span class="sh">"</span>
    <span class="sh">"</span><span class="s"> &amp;&amp; tar -xzf /tmp/vscode-cli.tar.gz -C /usr/local/bin</span><span class="sh">"</span>
    <span class="sh">"</span><span class="s"> &amp;&amp; rm /tmp/vscode-cli.tar.gz</span><span class="sh">"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div> <p>Also make sure it persists your plugins, preferences and auth after your first login.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">VSCODE_EXTENSIONS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="sh">"</span><span class="s">ms-python.python</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">ms-python.pylint</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">ms-python.debugpy</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">ms-toolsai.jupyter</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">Anthropic.claude-code</span><span class="sh">"</span><span class="p">,</span>
<span class="p">]</span>

<span class="k">def</span> <span class="nf">start_vscode_tunnel</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Start a VS Code tunnel, persisting auth and extensions on the volume.</span><span class="sh">"""</span>

    <span class="n">vscode_data</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">WORKSPACE_PATH</span><span class="si">}</span><span class="s">/.vscode-cli</span><span class="sh">"</span>
    <span class="n">os</span><span class="p">.</span><span class="nf">makedirs</span><span class="p">(</span><span class="n">vscode_data</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">env</span> <span class="o">=</span> <span class="p">{</span><span class="o">**</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">,</span> <span class="sh">"</span><span class="s">VSCODE_CLI_DATA_DIR</span><span class="sh">"</span><span class="p">:</span> <span class="n">vscode_data</span><span class="p">}</span>

    <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">code</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">tunnel</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">--accept-server-license-terms</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">--name</span><span class="sh">"</span><span class="p">,</span> <span class="n">name</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">ext</span> <span class="ow">in</span> <span class="n">VSCODE_EXTENSIONS</span><span class="p">:</span>
        <span class="n">cmd</span><span class="p">.</span><span class="nf">extend</span><span class="p">([</span><span class="sh">"</span><span class="s">--install-extension</span><span class="sh">"</span><span class="p">,</span> <span class="n">ext</span><span class="p">])</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Starting VS Code tunnel </span><span class="sh">'</span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="sh">'</span><span class="s">...</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">env</span><span class="o">=</span><span class="n">env</span><span class="p">,</span> <span class="n">check</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div> <p>I want to persist my Claude Code session too, so I’ve pointed the Claude config directory at the mounted volume as well.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">.</span><span class="nf">env</span><span class="p">({</span>
    <span class="sh">"</span><span class="s">CLAUDE_CONFIG_DIR</span><span class="sh">"</span><span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">WORKSPACE_PATH</span><span class="si">}</span><span class="s">/.claude</span><span class="sh">"</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">IS_SANDBOX</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">1</span><span class="sh">"</span><span class="p">,</span> <span class="c1"># Ensures --dangerously-skip-permissions works
</span><span class="p">})</span>
</code></pre></div></div> <h1 id="usage">Usage</h1> <p>Now that we have our Modal environment ready, we can create a local entrypoint that launches VSCode directly, with an option to attach a GPU when we need to debug or run GPU code.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@app.local_entrypoint</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">vscode</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="bp">...</span>
    <span class="n">sb</span> <span class="o">=</span> <span class="n">modal</span><span class="p">.</span><span class="n">Sandbox</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="sh">"</span><span class="s">python</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-c</span><span class="sh">"</span><span class="p">,</span>
        <span class="sa">f</span><span class="sh">"</span><span class="s">from tunnel import start_vscode_tunnel; start_vscode_tunnel(</span><span class="sh">'</span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="sh">'</span><span class="s">)</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">image</span><span class="o">=</span><span class="n">dev_image</span><span class="p">,</span>
        <span class="n">gpu</span><span class="o">=</span><span class="n">use_gpu</span><span class="p">,</span>
        <span class="n">cpu</span><span class="o">=</span><span class="n">use_cpu</span><span class="p">,</span>
        <span class="n">memory</span><span class="o">=</span><span class="n">use_memory</span><span class="p">,</span>
        <span class="n">volumes</span><span class="o">=</span><span class="n">VOLUMES</span><span class="p">,</span>
        <span class="n">secrets</span><span class="o">=</span><span class="n">SECRETS</span><span class="p">,</span>
        <span class="n">timeout</span><span class="o">=</span><span class="mi">6</span> <span class="o">*</span> <span class="mi">3600</span><span class="p">,</span>
        <span class="n">app</span><span class="o">=</span><span class="n">app</span><span class="p">,</span>
    <span class="p">)</span>
</code></pre></div></div> <p>Usage is simple: you can request a development environment with any spec you need.</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>modal run dev.py                                      <span class="c"># no GPU, default resources</span>
modal run dev.py <span class="nt">--gpu</span> H100 <span class="nt">--cpu</span> 16 <span class="nt">--memory</span> 131072  <span class="c"># H100 + 16 CPU + 128 GiB</span>
</code></pre></div></div> <p>This will automatically create a VSCode tunnel that you can access from your VSCode desktop as long as it is connected to your account.</p> <div class="row mt-3"> <div style="max-width: 100%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/modal-vscode/vscode_tunnel-480.webp 480w,/assets/img/posts/modal-vscode/vscode_tunnel-800.webp 800w,/assets/img/posts/modal-vscode/vscode_tunnel-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/modal-vscode/vscode_tunnel.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <h1 id="final-words">Final Words</h1> <p>Having low-overhead GPU access is crucial for GPU development and cost efficiency. Modal achieves this by letting us spin up instances quickly with any desired spec while our workspace persists in volumes. In this example, I’ve only set up a VSCode tunnel to get live editing and terminal access; however, one can create custom functions rather than executing code manually every time. 
Using functions means</p>]]></content><author><name></name></author><category term="ml"/><category term="ai"/><category term="development"/><summary type="html"><![CDATA[How to setup a persistent workspace in Modal for doing development on GPU required apps.]]></summary></entry><entry><title type="html">Training language models on TPUs shouldn’t be scary</title><link href="https://dogac.dev/blog/2026/migrating-to-tpu/" rel="alternate" type="text/html" title="Training language models on TPUs shouldn’t be scary"/><published>2026-02-03T00:00:00+00:00</published><updated>2026-02-03T00:00:00+00:00</updated><id>https://dogac.dev/blog/2026/migrating-to-tpu</id><content type="html" xml:base="https://dogac.dev/blog/2026/migrating-to-tpu/"><![CDATA[<script src="https://d3js.org/d3.v7.min.js" defer=""></script> <script src="/assets/js/posts/speculative_viz.js" defer=""></script> <style>div.code-display-wrapper pre,pre.highlight,pre code,.highlight code{line-height:1.4}</style> <h2 id="background">Background</h2> <p>Recently, I’ve been working on training speculative decoding <d-cite key="leviathan2023fast"></d-cite> models to speed up LLM inference. These models are small LLMs, distilled from a target model. The variant I train, EAGLE <d-cite key="li2024eagle"></d-cite>, <d-cite key="li2024eagle2"></d-cite>, operates on the hidden states of the target model rather than relying solely on input tokens to predict next tokens. As drafter models are smaller, they can predict multiple tokens much faster than the bigger model (verifier), and the verifier model can verify multiple tokens in a single forward pass. The training process for drafter models is pretty similar to training LLMs, but they train faster as their input is hidden features coming from the verifier. 
However, the pipeline is much more complicated, as we have to run both the verifier LLM and the model that’s being trained together* to generate hidden states on the go.</p> <aside> <p>*Generating hidden states offline and using them during training is hard: they take up tens of terabytes, and loading that much data during training poses significant challenges of its own, especially with network disks.</p> </aside> <div id="speculative-viz" style="padding-bottom: 1em;"></div> <p>I’m experimenting with the training pipeline and the model architecture to get better performance, so I have to re-train those models many times. I’ve chosen Llama 3.1 8B as the target model, and my drafter model is a single-layer transformer with 450M parameters. Even though these models are small, the compute required to train them is still very high. On a single H100, it takes around 4 days to train on a 1.4M row dataset for 3 epochs. Even though I have access to up to 8 H100s through the university’s resources, scheduling those GPUs might take days, and running multiple experiments still takes a lot of time. They help greatly during development and for running small experiments without the full training, but I needed more compute to scale up my experiments, especially for bigger models.</p> <p>The training pipeline is open-source and available at <a href="https://github.com/sgl-project/SpecForge">SpecForge</a> and the architecture follows <a href="https://github.com/SafeAILab/EAGLE">EAGLE 3</a>.</p> <h2 id="getting-access-to-tpus">Getting Access to TPUs</h2> <p>I stumbled upon <a href="https://sites.research.google/trc/about/">Google’s TRC program</a> while I was looking for GPU grants. Google lends its spare TPUs to researchers while they are not being used. The application process was pretty easy: I just filled out a short form, and two days later I got access to hundreds of TPUs (in different regions). 
These limits may vary from user to user; the one I was most interested in was 64 spot* Cloud TPU v6e chips.</p> <aside> <ul> <li><strong>spot</strong>: preemptible / interruptible instances that can be reclaimed without prior notice.</li> </ul> </aside> <table> <thead> <tr> <th style="text-align: left">Feature</th> <th style="text-align: left">Google TPU v6e (Trillium)</th> <th style="text-align: left">NVIDIA H100 (SXM)</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>Compute (BF16)</strong></td> <td style="text-align: left">918 TFLOPs</td> <td style="text-align: left">990 TFLOPs (1,979 Sparse)</td> </tr> <tr> <td style="text-align: left"><strong>HBM Capacity</strong></td> <td style="text-align: left">32 GB</td> <td style="text-align: left">80 GB</td> </tr> <tr> <td style="text-align: left"><strong>HBM Bandwidth</strong></td> <td style="text-align: left">1.6 TB/s</td> <td style="text-align: left">3.35 TB/s</td> </tr> <tr> <td style="text-align: left"><strong>Interconnect</strong></td> <td style="text-align: left">800 GB/s</td> <td style="text-align: left">900 GB/s (NVLink)</td> </tr> </tbody> </table> <div class="caption"> <p>Source: <a href="https://docs.cloud.google.com/tpu/docs/v6e">TPU v6e Documentation</a>, <a href="https://www.civo.com/blog/comparing-nvidia-b200-and-h100">Comparing NVIDIA’s B200 and H100</a>.</p> </div> <p>The amount of compute I would get by moving from 8 H100s to 64 v6e chips would help me a lot with scaling my experiments, so I decided to migrate my codebase to support TPUs. To start, I chose a 4-chip v6e VM. This should give us performance roughly comparable to 4 H100 GPUs and a total of 128 GB of HBM. 
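</p> <p>As a rough sanity check on those numbers, the sketch below simply multiplies the per-chip figures from the table above by the chip counts mentioned in this post. The names <code class="language-plaintext highlighter-rouge">Chip</code>, <code class="language-plaintext highlighter-rouge">aggregate</code> and <code class="language-plaintext highlighter-rouge">bf16_weights_gb</code> are my own, made up for illustration, and the 8e9 parameter count is a round-number assumption for the 8B target model. Peak TFLOPs are an upper bound, not a prediction of training throughput.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Per-chip peak figures copied from the comparison table."""
    name: str
    bf16_tflops: float  # dense BF16 peak
    hbm_gb: int         # HBM capacity in GB


V6E = Chip("TPU v6e", 918, 32)
H100 = Chip("H100 SXM", 990, 80)


def aggregate(chip, count):
    """Return (total peak BF16 TFLOPs, total HBM in GB) for `count` chips."""
    return chip.bf16_tflops * count, chip.hbm_gb * count


def bf16_weights_gb(num_params):
    """Approximate weight footprint in GB at 2 bytes per parameter (bf16)."""
    return num_params * 2 / 1e9


if __name__ == "__main__":
    for chip, count in [(V6E, 4), (H100, 4), (H100, 8), (V6E, 64)]:
        tflops, hbm = aggregate(chip, count)
        print(f"{count:2d} x {chip.name}: {tflops:,.0f} TFLOPs, {hbm} GB HBM")
    # Assumption: roughly 8e9 parameters for the 8B target model.
    print(f"8B weights in bf16: about {bf16_weights_gb(8e9):.0f} GB")
</code></pre></div></div> <p>With these figures, four v6e chips offer 3,672 peak TFLOPs and 128 GB of HBM, against 3,960 TFLOPs and 320 GB for four H100s: comparable compute, but 2.5x less memory. Under the 8e9 assumption, the bf16 weights of the target model alone (about 16 GB) would fill half of a single v6e chip’s HBM.</p> <p>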
Notice that TPUs have significantly less memory, so we need to be more careful with how we distribute our models across chips.</p> <h2 id="initial-migration">Initial Migration</h2> <p>I used <code class="language-plaintext highlighter-rouge">torch==2.9.0</code> and <code class="language-plaintext highlighter-rouge">torch_xla[tpu]==2.9.0</code>, which are the latest stable releases as of Feb 2, 2026. First, we need to remove all <code class="language-plaintext highlighter-rouge">.cuda()</code> calls and use a utility class to call <code class="language-plaintext highlighter-rouge">to_device(...)</code>. Next, we should ensure our models run in <em>bfloat16</em>, as TPUs are optimized for <code class="language-plaintext highlighter-rouge">bf16</code> operations. Since we don’t use <code class="language-plaintext highlighter-rouge">torch.distributed</code> anymore, we hide all tensor parallel logic behind a flag that is disabled when training runs on TPUs. However, multi-node <em>SPMD</em> still requires initializing <code class="language-plaintext highlighter-rouge">torch.distributed</code> for coordination, and it can’t use the <em>xla</em> backend. Thus, you must use <em>gloo</em> to initialize it. Note that if you forget to opt out of <code class="language-plaintext highlighter-rouge">dist.all_gather</code> or other operations, your training might run very slowly, because those tensors are being broadcast over the <em>gloo</em> backend.</p> <p><strong>Tip:</strong> I’ve experienced some <a href="https://github.com/pytorch/xla/issues/9735">hard-to-debug issues due to a race condition</a> when SPMD isn’t initialized early enough.</p> <h2 id="spmd-and-fsdp">SPMD and FSDP</h2> <p>Since I mentioned <em>SPMD</em> (Single Program Multiple Data), let me explain what it is. The GPU code runs using <em>Fully Sharded Data Parallelism</em> (FSDP). 
The weights of each <em>FSDP Unit</em> are sharded amongst the devices and gathered on each device just before they are needed to compute the next layer. This reduces the memory footprint of models, as those weights are lazily loaded. For more information, check the <a href="https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html">official PyTorch tutorial on FSDP</a>.</p> <aside> <p>For more information on TPU architecture, check <a href="https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm">TPU architecture docs from Google</a>.</p> </aside> <div class="row mt-3"> <div style="max-width: 60%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/fsdp_overview-480.webp 480w,/assets/img/posts/tpu/fsdp_overview-800.webp 800w,/assets/img/posts/tpu/fsdp_overview-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/fsdp_overview.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption" style="text-align: left;"> Overview of the FSDP algorithm, figure from <d-cite key="zhao2023pytorchfsdpexperiencesscaling"/>. Each device holds a shard $(1/N)$ of the weights of each layer. During execution, devices perform an all-gather to reconstruct the full weights within the FSDP group. Immediately after the computation, borrowed weights are freed to minimize memory usage. </div> <p>I wanted to keep using the same strategy for multi-TPU training. Even though <code class="language-plaintext highlighter-rouge">torch_xla</code> provides an <code class="language-plaintext highlighter-rouge">FSDP</code> class, I prefer to shard the model manually, as I found this wrapper ineffective for our training pipeline: a single-layer transformer yields a single FSDP unit. 
Moreover, each of our TPUs (<em>v6e</em>) has only 32 GB of HBM, whereas our H100 GPU had 80 GB of VRAM. This discrepancy forces us to shard our model more aggressively using <em>SPMD</em>, which is an automatic parallelization system for common ML workloads.</p> <div class="row mt-3"> <div style="max-width: 80%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/spmd_overview-480.webp 480w,/assets/img/posts/tpu/spmd_overview-800.webp 800w,/assets/img/posts/tpu/spmd_overview-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/spmd_overview.png" class="img-fluid rounded z-depth-1 bg-white" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption" style="text-align: left"> <p>Overview of SPMD, source: <a href="https://pytorch.org/blog/pytorch-xla-spmd">pytorch.org/blog/pytorch-xla-spmd</a>. The user defines a logical mesh that maps to the physical TPU topology. By applying sharding annotations to tensors (weights and inputs), the user guides the XLA compiler on how to distribute data across the mesh. The compiler then automatically partitions the computation graph, inserting the necessary collective operations such as all-reduces and all-gathers to ensure mathematical correctness across all devices.</p> </div> <p>SPMD allows us to shard tensors among devices without explicitly specifying communication and collective operations such as all-gather or all-reduce <d-cite key="xu2021gspmd"></d-cite>. It automatically inserts those operations when the computation graph needs the full view of the tensor. The user only needs to tell the compiler which weights to shard and which to replicate, and the compiler handles the rest.</p> <p>Developers specify how tensors are sharded using sharding specs. For example, 
if you use the sharding spec <code class="language-plaintext highlighter-rouge">(None,)</code> on <code class="language-plaintext highlighter-rouge">RMSNorm</code>, you will replicate the weights of <code class="language-plaintext highlighter-rouge">RMSNorm</code> on all devices. Assume your mesh has the axis names <code class="language-plaintext highlighter-rouge">('fsdp', 'model')</code> and the shape <code class="language-plaintext highlighter-rouge">(4,1)</code>. If you use <code class="language-plaintext highlighter-rouge">('fsdp',)</code>, you will shard the weights among all 4 devices, meaning each holds 1/4 of the weights. For more details, check <a href="https://docs.pytorch.org/xla/release/r2.8/perf/spmd_basic.html">PyTorch XLA’s SPMD guide</a>.</p> <p>To migrate our models to SPMD, I’ve created some helpers to automatically shard the model using <code class="language-plaintext highlighter-rouge">xm.mark_sharding(...)</code>. The initial version sharded two-dimensional weights along their first dimension, while leaving the others, such as <code class="language-plaintext highlighter-rouge">RMSNorm</code>, replicated to prevent excessive communication.</p> <p>I’ve decided to fully utilize the TPUs by using a (4,1) topology: split all weights and inputs along the first dimension, so that each device holds roughly 1/4 of the model weights and performs roughly 1/4 of the computation. Also make sure to use <code class="language-plaintext highlighter-rouge">mp.MpDeviceLoader</code> and set the input sharding to <code class="language-plaintext highlighter-rouge">('fsdp', None)</code> to split the batch among devices.</p> <h2 id="first-run">First Run</h2> <p>In the first run, I got <code class="language-plaintext highlighter-rouge">torch._inductor.exc.InductorError: LoweringException</code> coming from compiled kernels. 
Thus, I replaced every <code class="language-plaintext highlighter-rouge">@torch.compile</code> with <code class="language-plaintext highlighter-rouge">@maybe_compile</code>, which disables dynamic compilation when running on a TPU target, and we got our first successful run.</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TPU Runtime Utilization
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Chip ┃ HBM Usage (GiB)      ┃ Duty cycle ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ 0    │ 9.92 GiB / 31.25 GiB │ 0.88%      │
│ 1    │ 9.92 GiB / 31.25 GiB │ 0.89%      │
│ 2    │ 9.92 GiB / 31.25 GiB │ 0.89%      │
│ 3    │ 9.92 GiB / 31.25 GiB │ 0.89%      │
└──────┴──────────────────────┴────────────┘
</code></pre></div></div> <p>Looking at the <code class="language-plaintext highlighter-rouge">tpu-info</code> output, it seems we have successfully split our model across the 4 TPU cores. But we have an issue:</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TensorCore Utilization
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Core ID ┃ TensorCore Utilization ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0       │ 0.00%                  │
│ 1       │ 0.00%                  │
│ 2       │ 0.00%                  │
│ 3       │ 0.00%                  │
└─────────┴────────────────────────┘
</code></pre></div></div> <p>We see our TensorCore utilization is really low and our runs are very slow.</p> <h2 id="recompilations">Recompilations</h2> <p>The first thing we notice when we run our TPU program is that it is really slow. Unlike GPUs, TPUs do not run in eager mode by default, so they first need to trace a forward pass to build the computation graph, and then compile that computation graph to be optimized for the <em>MXUs</em> (matrix multiplication units). This compilation process is really expensive and adds a big overhead to starting our training pipeline. In our case, it took several minutes to run a single iteration. By setting <code class="language-plaintext highlighter-rouge">PT_XLA_DEBUG_LEVEL=2</code>, we can inspect those compilations. The first thing we notice is that it complains about graph breaks. As TPUs are really good at fusing operations (automatic fused kernels), breaking the graph with operations that require moving data to the CPU triggers extra compilations and produces less efficient computation graphs. In addition to graph breaks, we see that a recompilation is triggered after each forward pass:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Compilation Analysis: Compilation Cause
Compilation Analysis:   torch_xla.sync in parallel loader at step end
Compilation Analysis: Graph Info:
Compilation Analysis:   Graph Hash: ad9b7364e3d7a77aa6178b3269100fd6
Compilation Analysis:   Number of Graph Inputs: 360
Compilation Analysis:   Number of Graph Outputs: 41
</code></pre></div></div> <p>The graph hash keeps changing, so we are forced to recompile, which is quite slow. The metrics also show 4 compilations that took about four and a half minutes in total:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Metric: CompileTime
  TotalSamples: 4
  Accumulator: 04m32s050ms558.590us

Metric: ExecuteReplicatedTime
  TotalSamples: 4
  Accumulator: 01s390ms444.752us
...

Counter: MarkStep
  Value: 1
</code></pre></div></div> <p>Let’s try to understand why the graph is forced to recompile on each iteration. We will set the following environment variables to save the IR (Intermediate Representation) of our TPU compilations and see how the graphs differ from one another.</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">XLA_IR_DEBUG</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">XLA_SAVE_TENSORS_FMT</span><span class="o">=</span>text
<span class="nb">export </span><span class="nv">XLA_SAVE_TENSORS_FILE</span><span class="o">=</span>/tmp/save.ir
</code></pre></div></div> <p>Our first couple of graphs are pretty small. Having such small graphs is not ideal, but we will return to this later. There is one graph that is much bigger than the others and keeps recompiling:</p> <div class="l-page"> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## BEGIN_GRAPH
IR {
  %0 = s64[] xla::device_data(), location=_compute_metric_acc@eagle3.py:570, xla_shape=s64[]
  %1 = s64[] xla::device_data(), location=_compute_metric_acc@eagle3.py:570, xla_shape=s64[]
  %2 = s64[4,473]{1,0} xla::device_data(), scope=cpu_data_to_xla_device.2, location=convert_fn@xla_model.py:1294, xla_shape=s64[4,473]{1,0}
  %3 = s64[4,473,1]{2,1,0} aten::view(%2), location=generate_eagle3_data@eagle3_target_model.py:600, xla_shape=s64[4,473,1]{2,1,0}
  %4 = (s64[4,473,1]{2,1,0}) aten::split(%3), location=get_dp_data_shard_from_tp@train_eagle3.py:799, xla_shape=(s64[4,473,1]{2,1,0})
  %5 = s64[] aten::sum(%4), location=_compute_metric_acc@eagle3.py:570, xla_shape=s64[]
  %6 = s64[] aten::clamp(%5, %1, %0), location=_compute_metric_acc@eagle3.py:570, xla_shape=s64[]
  %7 = f32[] xla::cast(%6), location=_compute_metric_acc@eagle3.py:568, xla_shape=f32[]
  %8 = bf16[] prim::Constant(), location=padding@utils.py:51, xla_shape=bf16[]
  %9 = bf16[] aten::expand(%8), location=padding@utils.py:51, xla_shape=bf16[]
    ...
</code></pre></div> </div> </div> <p>As this graph is huge, it is hard to understand what is going on. It contains all the computations for our target LLM’s forward pass and the newly trained model’s forward and backward passes, combined with the optimizer steps. If we search for other compilations that are triggered for the same line <code class="language-plaintext highlighter-rouge">location=_compute_metric_acc@eagle3.py:570</code>, we find a prime suspect:</p> <div class="l-page"> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
    %6 = s64[4,1244,1]{2,1,0} aten::view(%5), location=generate_eagle3_data@eagle3_target_model.py:570, xla_shape=s64[4,1244,1]{2,1,0}
    ...

    %6 = s64[4,491,1]{2,1,0} aten::view(%5), location=generate_eagle3_data@eagle3_target_model.py:570, xla_shape=s64[4,491,1]{2,1,0}
    ...

</code></pre></div> </div> </div> <p>The shape of one of the inputs to the graph changes each time, causing a recompilation. The shape <code class="language-plaintext highlighter-rouge">(4, 491, 1)</code> is oddly familiar: it comes from <code class="language-plaintext highlighter-rouge">generate_eagle3_data</code> and it corresponds to <code class="language-plaintext highlighter-rouge">(batch_size, seq_len, 1)</code>, meaning these are our input ids. So the varying sequence length of our inputs is causing the recompilations. The best solution is to group inputs of similar length together (which we already do using <a href="https://github.com/huggingface/transformers/blob/379ec6b9529becf0a464fa58ee783b6bdef4034a/src/transformers/trainer_pt_utils.py#L494"><code class="language-plaintext highlighter-rouge">DistributedLengthGroupedSampler</code></a>), but that still yields different sequence lengths. The best approach I can think of is to pad those inputs up to the next multiple of 128 (as MXUs operate on 128x128 tiles). So we update the <code class="language-plaintext highlighter-rouge">DataLoader</code> with this logic:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">max_length</span> <span class="o">=</span> <span class="nf">max</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="sh">"</span><span class="s">input_ids</span><span class="sh">"</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">features</span><span class="p">)</span>
<span class="n">max_length</span> <span class="o">=</span> <span class="nf">min</span><span class="p">(</span><span class="n">max_length</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">max_seq_len</span><span class="p">)</span>

<span class="k">if</span> <span class="nf">is_tpu</span><span class="p">():</span>
    <span class="n">padding_lookup</span> <span class="o">=</span> <span class="p">[</span><span class="mi">128</span> <span class="o">*</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">20</span><span class="p">)]</span>
    <span class="n">max_length</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">padding_lookup</span> <span class="k">if</span> <span class="n">x</span> <span class="o">&gt;=</span> <span class="n">max_length</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>

<span class="n">batch_input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">cat</span><span class="p">(</span>
    <span class="p">[</span><span class="n">self</span><span class="p">.</span><span class="nf">paddingtensor2D</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="sh">"</span><span class="s">input_ids</span><span class="sh">"</span><span class="p">],</span> <span class="n">max_length</span><span class="p">)</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">features</span><span class="p">]</span>
<span class="p">)</span>
</code></pre></div></div> <p>This still causes frequent compilations during the initial stages of our training pipeline. However, as time passes, the recompilations stop and we keep using cached computation graphs. For simplicity, predictability and rapid development, I’ve decided to pad everything to 2048 for the rest of this debugging session. I will get back to comparing results for fixed length and dynamic length later on. After running the code again with a max sequence length of 2048 and waiting a couple of minutes and a couple of iterations for the initial compilations to occur, we see our speed has increased from minutes to seconds per iteration.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TPU Runtime Utilization                      TensorCore Utilization
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Chip ┃ HBM Usage (GiB)      ┃ Duty cycle ┃ ┃ Core ID ┃ TensorCore Utilization ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ ┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0    │ 8.93 GiB / 31.25 GiB │ 92.95%     │ │ 0       │ 28.14%                 │
│ 1    │ 8.93 GiB / 31.25 GiB │ 92.95%     │ │ 1       │ 28.32%                 │
│ 2    │ 8.93 GiB / 31.25 GiB │ 92.95%     │ │ 2       │ 28.50%                 │
│ 3    │ 8.93 GiB / 31.25 GiB │ 92.95%     │ │ 3       │ 28.55%                 │
└──────┴──────────────────────┴────────────┘ └─────────┴────────────────────────┘
</code></pre></div></div> <p>Also, this is what happens if you forget to replace <code class="language-plaintext highlighter-rouge">dist</code> calls such as <code class="language-plaintext highlighter-rouge">dist.all_reduce(accuracies, op=dist.ReduceOp.AVG)</code> with <code class="language-plaintext highlighter-rouge">xm.all_reduce</code>. Because we use the <code class="language-plaintext highlighter-rouge">gloo</code> backend, <code class="language-plaintext highlighter-rouge">dist.all_reduce</code> moves tensors to the CPU and runs the operation over the standard network. In contrast, <code class="language-plaintext highlighter-rouge">xm.all_reduce</code> utilizes the high-speed TPU interconnect and keeps the operation within the HLO computation graph, allowing XLA to optimize the execution lazily without breaking the trace. So be careful!</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Training Epoch 0:   0%|             | 85/322442 [06:38&lt;212:42:52,  2.38s/it]
TPU Runtime Utilization                      TensorCore Utilization
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Chip ┃ HBM Usage (GiB)      ┃ Duty cycle ┃ ┃ Core ID ┃ TensorCore Utilization ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ ┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0    │ 8.93 GiB / 31.25 GiB │ 34.47%     │ │ 0       │ 11.65%                 │
│ 1    │ 8.93 GiB / 31.25 GiB │ 34.46%     │ │ 1       │ 11.91%                 │
│ 2    │ 8.93 GiB / 31.25 GiB │ 34.46%     │ │ 2       │ 13.17%                 │
│ 3    │ 8.93 GiB / 31.25 GiB │ 34.46%     │ │ 3       │ 12.81%                 │
└──────┴──────────────────────┴────────────┘ └─────────┴────────────────────────┘
</code></pre></div></div> <h2 id="reading-the-metrics-report">Reading the Metrics Report</h2> <p>However, we are still severely under-utilizing our TPU cores. We can inspect the metrics by calling <code class="language-plaintext highlighter-rouge">met.short_metrics_report()</code> and <code class="language-plaintext highlighter-rouge">met.clear_metrics()</code> after each batch. Let’s take a look,</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Training Epoch 0:   0%|              | 74/322442 [04:25&lt;84:17:51,  1.06it/s]
Counter: CachedCompile
  Value: 438
Metric: ExecuteReplicatedTime
  TotalSamples: 6
  Accumulator: 01s039ms247.686us
  ValueRate: 01s473ms641.554us / second
  Rate: 8.50216 / second
Metric: TransferToDeviceTime
  TotalSamples: 3
  Accumulator: 206.900us
  ValueRate: 026ms712.659us / second
  Rate: 372.827 / second
Metric: TransferFromDeviceTime
  TotalSamples: 6
  Accumulator: 860ms592.740us
  ValueRate: 01s160ms288.419us / second
  Rate: 8.09887 / second
Counter: MarkStep
  Value: 75
Counter: aten::_local_scalar_dense
  Value: 223
Counter: aten::nonzero
  Value: 76
</code></pre></div></div> <p>As you can see, our counter <code class="language-plaintext highlighter-rouge">CachedCompile</code> keeps increasing, meaning we are reusing cached graphs instead of recompiling the computation graph. However:</p> <ol> <li>There are 4 times more <code class="language-plaintext highlighter-rouge">CachedCompile</code> than our <code class="language-plaintext highlighter-rouge">MarkStep</code>, meaning we are splitting our single training loop into 4 computation graphs.</li> <li>There are lots of device transfers <code class="language-plaintext highlighter-rouge">TransferToDeviceTime</code> and <code class="language-plaintext highlighter-rouge">TransferFromDeviceTime</code> per batch.</li> <li>There is an unlowered <code class="language-plaintext highlighter-rouge">aten::nonzero</code> call and <code class="language-plaintext highlighter-rouge">aten::_local_scalar_dense</code> call that force data movement between CPU and device.</li> </ol> <h3 id="logging-without-breaking-the-graph">Logging Without Breaking the Graph</h3> <p>Transfers between CPU and device happen with calls such as <code class="language-plaintext highlighter-rouge">.cpu()</code> or <code class="language-plaintext highlighter-rouge">.item()</code>, and our code makes such calls explicitly:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">accuracies</span> <span class="o">=</span> <span class="n">accuracies</span><span class="p">.</span><span class="nf">cpu</span><span class="p">().</span><span class="nf">tolist</span><span class="p">()</span>
<span class="n">plossess</span> <span class="o">=</span> <span class="n">plossess</span><span class="p">.</span><span class="nf">cpu</span><span class="p">().</span><span class="nf">tolist</span><span class="p">()</span>
</code></pre></div></div> <p>Our initial logging frequency was every step. We could surely optimize this quickly by logging less frequently, such as every 100 steps. However, PyTorch XLA provides a better solution, <code class="language-plaintext highlighter-rouge">xm.add_step_closure</code>, which runs the given function at the end of the step, once its arguments have already been computed by the graph. It is designed for things such as logging metrics.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">logging_closure</span><span class="p">(</span><span class="n">acces_stacked</span><span class="p">,</span> <span class="n">plosses_stacked</span><span class="p">,</span> <span class="n">current_step</span><span class="p">):</span>
    <span class="n">acces_list</span> <span class="o">=</span> <span class="n">acces_stacked</span><span class="p">.</span><span class="nf">tolist</span><span class="p">()</span>
    <span class="n">plosses_list</span> <span class="o">=</span> <span class="n">plosses_stacked</span><span class="p">.</span><span class="nf">tolist</span><span class="p">()</span>

    <span class="n">avg_loss</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="n">plosses_list</span><span class="p">)</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">plosses_list</span><span class="p">)</span>
    <span class="n">avg_acc</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="n">acces_list</span><span class="p">)</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">acces_list</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Accuracy: </span><span class="si">{</span><span class="n">avg_acc</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">, loss: </span><span class="si">{</span><span class="n">avg_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    <span class="c1"># Or log to external providers such as WandB...
</span>
<span class="k">if</span> <span class="n">global_step</span> <span class="o">%</span> <span class="p">(</span><span class="n">args</span><span class="p">.</span><span class="n">log_interval</span> <span class="o">*</span> <span class="n">args</span><span class="p">.</span><span class="n">draft_accumulation_steps</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
    <span class="n">xm</span><span class="p">.</span><span class="nf">add_step_closure</span><span class="p">(</span>
        <span class="n">logging_closure</span><span class="p">,</span>
        <span class="n">args</span><span class="o">=</span><span class="p">(</span>
            <span class="n">acces_tensor</span><span class="p">,</span>
            <span class="n">plosses_tensor</span><span class="p">,</span>
            <span class="n">current_step_val</span><span class="p">,</span>
        <span class="p">)</span>
    <span class="p">)</span>
</code></pre></div></div> <p>This small change increases our throughput from 2.4 it/s to 4.3 it/s!</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Training Epoch 0:   0%|        | 97/322442 [02:34&lt;20:57:43,  4.27it/s]
Counter: CachedCompile
  Value: 192
Metric: ExecuteReplicatedTime
  TotalSamples: 2
  Accumulator: 178ms749.887us
  ValueRate: 01m16s143ms388.265us / second
  Rate: 856.748 / second
...

Metric: TransferToDeviceTime
  TotalSamples: 2
Metric: TransferFromDeviceTime
  TotalSamples: 4

Counter: aten::_local_scalar_dense
  Value: 98
Counter: aten::nonzero
  Value: 99
</code></pre></div></div> <p>And our utilization also looks much better.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TPU Runtime Utilization                      TensorCore Utilization
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Chip ┃ HBM Usage (GiB)      ┃ Duty cycle ┃ ┃ Core ID ┃ TensorCore Utilization ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ ┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0    │ 8.37 GiB / 31.25 GiB │ 77.69%     │ │ 0       │ 24.15%                 │
│ 1    │ 8.37 GiB / 31.25 GiB │ 77.69%     │ │ 1       │ 23.83%                 │
│ 2    │ 8.37 GiB / 31.25 GiB │ 77.70%     │ │ 2       │ 23.64%                 │
│ 3    │ 8.37 GiB / 31.25 GiB │ 77.70%     │ │ 3       │ 23.51%                 │
└──────┴──────────────────────┴────────────┘ └─────────┴────────────────────────┘
</code></pre></div></div> <h3 id="preventing-graph-breaks">Preventing Graph Breaks</h3> <p>We’re now down to 2 graphs from 4. There is one small graph that takes around 382us and one big graph that takes 177ms, which is our forward + backwards pass. Now, let’s try to figure out the remaining transfers, <code class="language-plaintext highlighter-rouge">aten::_local_scalar_dense</code> and <code class="language-plaintext highlighter-rouge">aten::nonzero</code> calls. Let’s look at unintended re-compilations to merge our loop into a single big computation graph.</p> <div class="l-page"> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Compilation Analysis: Compilation Cause
Compilation Analysis:   most likely user code trying to access tensor value before torch_xla.sync
Compilation Analysis: Graph Info:
Compilation Analysis:   Graph Hash: dd86da112d0f633a545849a2943f2f29
Compilation Analysis:   Number of Graph Inputs: 1
Compilation Analysis:   Number of Graph Outputs: 1
Compilation Analysis: Python Frame Triggered Execution:
Compilation Analysis:   _ignore_causal_mask_sdpa (/home/dogac/miniconda/envs/specforge-tpu/lib/python3.11/site-packages/transformers/masking_utils.py:255)
Compilation Analysis:   sdpa_mask_recent_torch (/home/dogac/miniconda/envs/specforge-tpu/lib/python3.11/site-packages/transformers/masking_utils.py:374)
Compilation Analysis:   create_causal_mask (/home/dogac/miniconda/envs/specforge-tpu/lib/python3.11/site-packages/transformers/masking_utils.py:825)
Compilation Analysis:   forward (/home/dogac/specforge/specforge/modeling/target/custom_backend/llama.py:323)
</code></pre></div> </div> </div> <p>This is an optimization inside the transformers library that allows skipping some heavy calculations for the causal mask.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">padding_mask</span><span class="p">.</span><span class="nf">all</span><span class="p">()</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">_is_torch_xpu_available</span> <span class="ow">or</span> <span class="n">query_length</span> <span class="o">==</span> <span class="mi">1</span>
<span class="k">else</span> <span class="n">padding_mask</span><span class="p">[:,</span> <span class="p">:</span><span class="n">query_length</span><span class="p">].</span><span class="nf">all</span><span class="p">()</span>
</code></pre></div></div> <p>However, the <code class="language-plaintext highlighter-rouge">.all()</code> call forces the tensor to materialize and breaks our computation graph. As this does not work well on TPUs, I will override the <code class="language-plaintext highlighter-rouge">allow_is_causal_skip</code> flag by monkey-patching it, so that it always calculates the causal mask. After this change, I did not see a significant benefit in runtime, but now there is only 1 computation graph, the <code class="language-plaintext highlighter-rouge">CachedCompile</code> count matches the <code class="language-plaintext highlighter-rouge">MarkStep</code> count, and the <code class="language-plaintext highlighter-rouge">aten::_local_scalar_dense</code> calls disappear.</p> <p>Now let’s take a look at the <code class="language-plaintext highlighter-rouge">aten::nonzero</code> calls:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">target_head</span> <span class="o">=</span> <span class="n">target_head</span><span class="p">[...,</span> <span class="n">t2d</span><span class="p">]</span>
</code></pre></div></div> <p>Here, <code class="language-plaintext highlighter-rouge">t2d</code> is a mapping from the verifier model’s token IDs to the drafter’s. The size of <code class="language-plaintext highlighter-rouge">target_head</code> is dynamic because it depends on the number of ones in the <code class="language-plaintext highlighter-rouge">t2d</code> tensor. Since these IDs are fixed, we can pre-compute the indices and avoid the expensive <code class="language-plaintext highlighter-rouge">nonzero</code> operation. This also lets the compiler know the exact dimensions during the forward pass.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">self</span><span class="p">.</span><span class="n">t2d_indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">nonzero</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">t2d</span><span class="p">).</span><span class="nf">squeeze</span><span class="p">()</span> <span class="c1"># Pre-compute once at the start
</span>
<span class="n">target_head</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">index_select</span><span class="p">(</span><span class="n">target_head</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">t2d_indices</span><span class="p">)</span>
</code></pre></div></div> <p>After this change, the <code class="language-plaintext highlighter-rouge">aten::nonzero</code> calls also disappear.</p> <h3 id="batch-size">Batch Size</h3> <p>Next, increasing our batch size from 4 to 16 raised our throughput from 4.3 it/s to the equivalent of 5.2 it/s (measured in batches of 4), while pushing the duty cycle up to 99%. From now on, I will report the speed in batches of 16 elements, which is 1.30 it/s (5.2 / 4).</p> <h2 id="using-the-tensorboard-profiler">Using the Tensorboard Profiler</h2> <p>However, we are still not quite there yet: we run at around 25% efficiency. Metrics have only taken us this far, so let’s enable the TensorBoard profiler. In the profiler, we can inspect memory, see the slowest HLO ops, do roofline analysis and inspect computation graphs.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">torch_xla.debug.profiler</span> <span class="k">as</span> <span class="n">xp</span>
<span class="n">server</span> <span class="o">=</span> <span class="n">xp</span><span class="p">.</span><span class="nf">start_server</span><span class="p">(</span><span class="mi">9012</span><span class="p">)</span>
<span class="c1"># On another shell, run `tensorboard --logdir ~/tensorboard --port 6006`
</span></code></pre></div></div> <p>Let’s take a look at the roofline analysis to understand how close we are to our hardware limits. Currently, we are using 50% of our HBM bandwidth and only 22.6% of our FLOPs.</p> <div class="l-page"> <table> <thead> <tr> <th style="text-align: left">Step</th> <th style="text-align: left">Total Time per core (us)</th> <th style="text-align: left">Normalized FLOP Rate (GFLOP/s)</th> <th style="text-align: left">Bound by</th> <th style="text-align: left">HBM BW (GiB/s)</th> <th style="text-align: left">Roofline efficiency (%)</th> <th style="text-align: left">FLOP Rate / Peak (%)</th> <th style="text-align: left">Max memory BW utilization (%)</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>Total</strong></td> <td style="text-align: left">1,006,078</td> <td style="text-align: left">207,779.58</td> <td style="text-align: left">HBM</td> <td style="text-align: left">771.74</td> <td style="text-align: left">50.6%</td> <td style="text-align: left">22.6%</td> <td style="text-align: left">50.6%</td> </tr> </tbody> </table> </div> <p>One optimization I wanted to test was removing the custom causal mask we pass to the model and letting the underlying SDPA kernel handle it. My intuition was that disabling <code class="language-plaintext highlighter-rouge">allow_is_causal_skip</code> caused additional computation. By setting <code class="language-plaintext highlighter-rouge">causal_mask</code> to <code class="language-plaintext highlighter-rouge">None</code> instead of creating it myself, I saw a significant improvement, from 1.30 it/s to 1.89 it/s. 
Once you take a look at the roofline analysis, you will see our HBM utilization dropped and FLOPs utilization has increased.</p> <div class="l-page"> <table> <thead> <tr> <th style="text-align: left">Step</th> <th style="text-align: left">Total Time per core (us)</th> <th style="text-align: left">Normalized FLOP Rate (GFLOP/s)</th> <th style="text-align: left">Bound by</th> <th style="text-align: left">HBM BW (GiB/s)</th> <th style="text-align: left">Roofline efficiency (%)</th> <th style="text-align: left">FLOP Rate / Peak (%)</th> <th style="text-align: left">Max memory BW utilization (%)</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong></td> <td style="text-align: left">1,003,930</td> <td style="text-align: left">298,421.26</td> <td style="text-align: left">HBM</td> <td style="text-align: left">622.86</td> <td style="text-align: left">40.8%</td> <td style="text-align: left">32.5%</td> <td style="text-align: left">40.8%</td> </tr> </tbody> </table> </div> <h3 id="optimizing-communication">Optimizing Communication</h3> <p>When we look at the HLO Op stats, we notice some operations create very expensive all-to-all or all-reduce operations. Those operations are triggered because weights are sharded among TPUs hence their outputs are too. 
When moving on to the next calculation, the computation might need another view of the tensor, which requires fetching all shards and re-sharding (all-to-all), or it might need to aggregate results from other TPUs, which requires all-reduce operations such as averaging gradients across different batches.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/before_vocab_shard-480.webp 480w,/assets/img/posts/tpu/before_vocab_shard-800.webp 800w,/assets/img/posts/tpu/before_vocab_shard-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/before_vocab_shard.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption"> HLO Op Stats with sharded Vocabulary </div> <div class="l-page"> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%all-to-all.2 = s32[2,2048,4,32064]{1,3,0,2:T(8,128)} all-to-all(s32[2,2048,4,32064]{1,3,0,2:T(8,128)} %multiply_add_fusion), channel_id=485, replica_groups=0, dimensions={2}

%all-reduce.2 = bf16[8,2048,32000]{1,2,0:T(8,128)(2,1)} all-reduce(bf16[8,2048,32000]{1,2,0:T(8,128)(2,1)} %fusion.151), channel_id=479, replica_groups=0, use_global_device_ids=true, to_apply=%add.2.clone
</code></pre></div> </div> </div> <p>Those dimensions are the vocabulary sizes: 32064 * 4 = 128,256 for the target model and 32000 for the draft model. Since we are way below our per-chip maximum memory, we can sacrifice some memory and replicate those big weights on each chip. This eliminates the need for those expensive communications. After replicating the LM head weights, training speed increased to 2.13 it/s.</p> <div class="l-page"> <table> <thead> <tr> <th style="text-align: left">Step</th> <th style="text-align: left">Total Time per core (us)</th> <th style="text-align: left">Normalized FLOP Rate (GFLOP/s)</th> <th style="text-align: left">Bound by</th> <th style="text-align: left">HBM BW (GiB/s)</th> <th style="text-align: left">Roofline efficiency (%)</th> <th style="text-align: left">FLOP Rate / Peak (%)</th> <th style="text-align: left">Max memory BW utilization (%)</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong></td> <td style="text-align: left">970,951</td> <td style="text-align: left">343,915.37</td> <td style="text-align: left">HBM</td> <td style="text-align: left">652.71</td> <td style="text-align: left">42.8%</td> <td style="text-align: left">37.5%</td> <td style="text-align: left">42.8%</td> </tr> </tbody> </table> </div> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/after_vocab_shard-480.webp 480w,/assets/img/posts/tpu/after_vocab_shard-800.webp 800w,/assets/img/posts/tpu/after_vocab_shard-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/after_vocab_shard.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption"> HLO Op Stats with replicated Vocabulary </div> <p>Moreover, I’ve tried switching the <code class="language-plaintext highlighter-rouge">down_proj</code> and <code class="language-plaintext 
highlighter-rouge">o_proj</code> layers to row parallelism similar to Tensor Parallel (TP) workloads. However it made no change in performance, hinting that its communication is already optimized by the compiler.</p> <h2 id="automatic-mixed-precision">Automatic Mixed Precision</h2> <p><a href="https://docs.pytorch.org/xla/release/r2.8/perf/amp.html">Automatic Mixed Precision (AMP)</a> is a technique used to train models more efficiently by using lower precision FP16 or BF16 for operations that are safer for lower precision, while automatically switching to FP32 for operations requiring higher precision. Note that our TPU specs have performance reported as BF16, because MXU can only operate on BF16 precision floats. In order to process FP32, you need to use <em>Vector Processing Units</em> (VPU) which has significantly less compute power, since they are not arranged as a 2D grid. We can enable automatic casting of operations to the right numerical type using the torch AMP autocast,</p> <aside><p>In the CUDA world, Tensor Cores do matrix multiplications, and CUDA Cores do FP32 calculations. For H100, CUDA cores only have 67 TFLOPs.</p></aside> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="nf">autocast</span><span class="p">(</span><span class="n">device_type</span><span class="o">=</span><span class="sh">"</span><span class="s">xla</span><span class="sh">"</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span><span class="p">):</span>
    <span class="p">...</span>  <span class="c1"># forward pass and loss computation go here, inside the autocast context</span>
</code></pre></div></div> <p>After the change, our speed has increased to 2.17it/s.</p> <h2 id="compiled-models">Compiled Models</h2> <p>Torch XLA automatically compiles your model code lazily into computation graphs. However, we can also utilize <em>TorchDynamo</em>, a Python-level JIT compiler, to optimize for the bytecode.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">compile</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="sh">"</span><span class="s">openxla</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div> <p>This has increased our speed up to 2.38it/s. However, note that not all compilations are optimal. For example trying to compile our loss function like we have done in CUDA case results in a slowdown and excessive memory usage. In this case, our speed has dropped to 2.15it/s after compiling.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@torch.compile</span><span class="p">(</span><span class="n">backend</span><span class="o">=</span><span class="sh">"</span><span class="s">openxla</span><span class="sh">"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_compute_loss</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">target_p</span><span class="p">,</span> <span class="n">position_mask</span><span class="p">):</span>
    <span class="n">logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">.</span><span class="nf">float</span><span class="p">()</span>
    <span class="n">out_logp</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="nc">LogSoftmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">2</span><span class="p">)(</span><span class="n">logits</span><span class="p">)</span>
    <span class="n">plogp</span> <span class="o">=</span> <span class="n">target_p</span> <span class="o">*</span> <span class="n">out_logp</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="o">-</span><span class="n">torch</span><span class="p">.</span><span class="nf">sum</span><span class="p">(</span><span class="n">position_mask</span> <span class="o">*</span> <span class="n">plogp</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="nf">mean</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">loss</span>
</code></pre></div></div> <p>Looking into the memory metrics using our profiler, forcing the compilation of a specific function creates memory fragmentation. Torch XLA compiler can successfully merge some of the computations in forwards and backwards passes, however forcing an eager compilation could prevent those fusions for a more efficient computation graph. Therefore <code class="language-plaintext highlighter-rouge">@torch.compile</code> is not always good for XLA workloads.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/with_loss_compile-480.webp 480w,/assets/img/posts/tpu/with_loss_compile-800.webp 800w,/assets/img/posts/tpu/with_loss_compile-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/with_loss_compile.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/without_loss_compile-480.webp 480w,/assets/img/posts/tpu/without_loss_compile-800.webp 800w,/assets/img/posts/tpu/without_loss_compile-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/without_loss_compile.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Memory usage of torch compiled loss &amp; not compiled loss. </div> <h2 id="results">Results</h2> <p>Let’s take a look at our performance and compare our v6e-4 node with a single H100 and a 4 H100 node. The first benchmark is a simple case of next token prediction, which I will denote with TTT=1*. 
We used two different attention kernels in CUDA, note that this attention kernel is only changing the drafter model, not the verifier. The verifier is hard-coded to use SDPA. The speedup reported on the TPU is relative to the best performing 4-GPU equivalent.</p> <aside> <p>*TTT stands for <em>Training-Time Test</em>. The drafter model makes multiple predictions during training, and we call number of consecutive predictions it makes <em>TTT length</em>.</p> </aside> <p><strong>TTT=1, Sequence Length = 2048 (it/s)</strong></p> <table> <thead> <tr> <th> </th> <th><strong>H100 (sdpa)</strong></th> <th><strong>4xH100 (sdpa)</strong></th> <th><strong>H100 (flex)</strong></th> <th><strong>4xH100 (flex)</strong></th> <th><strong>v6e-4</strong></th> </tr> </thead> <tbody> <tr> <td><strong>bs=32</strong></td> <td>OOM</td> <td>40.96</td> <td>OOM</td> <td>42.88</td> <td>37.76 <strong>(0.88x)</strong></td> </tr> <tr> <td><strong>bs=16</strong></td> <td>OOM</td> <td>40.16</td> <td>10.95</td> <td>41.76 <strong>(3.81x)</strong></td> <td>38.40 <strong>(0.91x)</strong></td> </tr> <tr> <td><strong>bs=4</strong></td> <td>10.4</td> <td>31.2 <strong>(3.08x)</strong></td> <td>10.84</td> <td>32.8</td> <td>32.36 <strong>(0.98x)</strong></td> </tr> <tr> <td><strong>bs=1</strong></td> <td>8.85</td> <td>N/A</td> <td>9.20</td> <td>N/A</td> <td>N/A</td> </tr> </tbody> </table> <p><strong>TTT=1, Sequence Length = Dynamic (it/s)</strong></p> <table> <thead> <tr> <th> </th> <th><strong>H100 (sdpa)</strong></th> <th><strong>4xH100 (sdpa)</strong></th> <th><strong>H100 (flex)</strong></th> <th><strong>4xH100 (flex)</strong></th> <th><strong>v6e-4</strong></th> </tr> </thead> <tbody> <tr> <td><strong>bs=32</strong></td> <td>OOM</td> <td>115.52</td> <td>OOM</td> <td>119.04</td> <td>112.0 <strong>(0.94x)</strong></td> </tr> <tr> <td><strong>bs=16</strong></td> <td>OOM</td> <td>104</td> <td>38.08</td> <td>106.5 <strong>(2.79x)</strong></td> <td>101.7 <strong>(0.95x)</strong></td> </tr> <tr> 
<td><strong>bs=4</strong></td> <td>30.4</td> <td>63.2 <strong>(2.07x)</strong></td> <td>29.4</td> <td>63.72 <strong>(2.16x)</strong></td> <td>92.00 <strong>(1.44x)</strong></td> </tr> <tr> <td><strong>bs=1</strong></td> <td>19.20</td> <td>N/A</td> <td>22.0</td> <td>N/A</td> <td>N/A</td> </tr> </tbody> </table> <p>We observe that:</p> <ul> <li>Flex attention becomes more efficient on larger batches and longer sequences (both memory &amp; performance).</li> <li>Dynamic sequence length gives us a free-lunch effect on speeding up training.</li> <li>4-GPU nodes do not scale perfectly, especially on smaller input sequences and batches.</li> <li>Efficiency of v6e-4 TPUs is very close to 4-GPU nodes that use Distributed Data Parallel (DDP).*</li> </ul> <aside><p>*We actually use <em>FSDP</em> with the <code class="language-plaintext highlighter-rouge">NO_SHARD</code> option for CUDA, which is the most communication-efficient option. Still, TPUs are so efficient at optimizing the computation graph that their true sharding can get as fast as DDP.</p></aside> <p>Since we did not use a custom attention kernel for our TPUs (unlike our CUDA code), the TPU model can’t catch up with the efficiency of the flex attention baseline.</p> <p>The next benchmark is more complicated. According to the EAGLE-3 <dcite key="li2025eagle3"></dcite> paper, running multiple forward passes using the drafter model’s own hidden states during training yields better accuracy for the drafter model. Therefore, instead of predicting only the next token, we predict the next 8 tokens by calling our model 8 times in sequence. This approach takes a heavy toll on memory, as we have to store activations and gradients from each of the 8 forward passes, because the loss is computed as $\mathcal{L} = \sum_{i=1}^{N} 0.8^{(i-1)} \mathcal{L}_i$. 
Therefore we can’t increase our batch size as high as 32 like we have done for the next-token prediction case.</p> <p><strong>TTT=8, Sequence Length = 2048 (it/s)</strong></p> <table> <thead> <tr> <th> </th> <th><strong>H100 (sdpa)</strong></th> <th><strong>4xH100 (sdpa)</strong></th> <th><strong>H100 (flex)</strong></th> <th><strong>4xH100 (flex)</strong></th> <th><strong>v6e-4</strong></th> </tr> </thead> <tbody> <tr> <td><strong>bs=8</strong></td> <td>OOM</td> <td>17.36</td> <td>OOM</td> <td>23.68</td> <td>16.00 <strong>(0.67x)</strong></td> </tr> <tr> <td><strong>bs=4</strong></td> <td>4.68</td> <td>15.32 <strong>(3.27x)</strong></td> <td>6.64</td> <td>20.60 <strong>(3.10x)</strong></td> <td>15.72 <strong>(0.76x)</strong></td> </tr> <tr> <td><strong>bs=1</strong></td> <td>4.11</td> <td>N/A</td> <td>5.51</td> <td>N/A</td> <td>N/A</td> </tr> </tbody> </table> <aside><p>The efficiency of flex attention on longer sequences is more visible, hence our speedup is smaller. Comparing to the SDPA, our TPU code performs pretty closely.</p></aside> <p><strong>TTT=8, Sequence Length = Dynamic (it/s)</strong></p> <table> <thead> <tr> <th> </th> <th><strong>H100 (sdpa)</strong></th> <th><strong>4xH100 (sdpa)</strong></th> <th><strong>H100 (flex)</strong></th> <th><strong>4xH100 (flex)</strong></th> <th><strong>v6e-4</strong></th> </tr> </thead> <tbody> <tr> <td><strong>bs=8</strong></td> <td>OOM</td> <td>43.84</td> <td>OOM</td> <td>48.88</td> <td>43.63 <strong>(0.89x)</strong></td> </tr> <tr> <td><strong>bs=4</strong></td> <td>15.2</td> <td>33.00 <strong>(2.17x)</strong></td> <td>15.28</td> <td>34.2 <strong>(2.23x)</strong></td> <td>41.60 <strong>(1.21x)</strong></td> </tr> <tr> <td><strong>bs=1</strong></td> <td>10.70</td> <td>N/A</td> <td>11.75</td> <td>N/A</td> <td>N/A</td> </tr> </tbody> </table> <p>Here we see that TPUs are less efficient. 
This can be attributed to three factors:</p> <ol> <li>Since we are doing more forward passes on the smaller model, we are more memory-bound. The v6e’s HBM bandwidth is half of the H100’s, thus it is not as fast.</li> <li>The smaller model’s matrices can’t be sharded very efficiently, as they are small and the MXU works best on big matrices whose dimensions are multiples of 128.</li> <li>Flex attention starts to shine for cases where attention calculation dominates runtime.</li> </ol> <h2 id="roofline-analysis">Roofline Analysis</h2> <p>Let’s take a look at our roofline analysis. Most of our operations are pretty close to <em>Pareto optimality</em>, however we still have operations that are inefficient. In this graph, the yellow dots (loop fusion) are memory-bound, whereas the blue dots (convolution fusion) are compute-bound. I suspect those yellow dots come from attention, whereas the blue dots are other operations such as MLP layers. Let’s try to validate that.</p> <div class="row mt-3"> <div style="max-width: 60%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/roofline-480.webp 480w,/assets/img/posts/tpu/roofline-800.webp 800w,/assets/img/posts/tpu/roofline-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/roofline.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>We can take a look at the HLO Op Profiler to see which operations took the most time.</p> <div class="row mt-3"> <div style="max-width: 100%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/profiler-480.webp 480w,/assets/img/posts/tpu/profiler-800.webp 800w,/assets/img/posts/tpu/profiler-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/profiler.png" class="img-fluid rounded 
z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>Here we see how much time is spent on compute (FLOPs), how much on memory (HBM), and how much is wasted. Let’s inspect the HLO graph of an operation that has a high ratio of wasted time.</p> <div class="row mt-3"> <div style="max-width: 100%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/attention_padding-480.webp 480w,/assets/img/posts/tpu/attention_padding-800.webp 800w,/assets/img/posts/tpu/attention_padding-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/attention_padding.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>The input shape <code class="language-plaintext highlighter-rouge">4, 32, 2048, 64</code> validates our claim that this loop fusion is a part of attention. 32 is our number of heads and 64 comes from splitting the head dimension 128 into two during <em>RoPE</em>. Also note that our last dimension 64 is padded to 128, which means our MXU could be under-utilized. Therefore an efficient attention implementation for those small models is crucial to get better performance.</p> <h2 id="final-words">Final Words</h2> <p>In this post, I walked through how speculative decoding models work and how to train one. I then migrated the training code to run on TPUs, optimizing it step by step until performance nearly matched multi-GPU setups. That said, there’s still room for improvement, particularly by using an attention kernel optimized for XLA. 
PyTorch XLA has experimental flash attention kernels, but integrating them into our architecture requires some additional work.</p> <p>Looking at the remaining inefficiencies in the convolution fusion, I noticed that some computation graphs include many <code class="language-plaintext highlighter-rouge">u32[]</code> and <code class="language-plaintext highlighter-rouge">s32[]</code> dependencies. I’m not yet sure why these appear, but removing them would likely improve performance further.</p> <div class="row mt-3"> <div style="max-width: 100%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/tpu/inefficient_params-480.webp 480w,/assets/img/posts/tpu/inefficient_params-800.webp 800w,/assets/img/posts/tpu/inefficient_params-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/tpu/inefficient_params.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Computation graph with many unknown dependencies. </div> <p>In the future, I will also train drafters for much larger models. Hopefully their performance will be closer to optimal: since their weights are bigger, they should have fewer parallelization issues.</p> <p>If you have any questions or comments, please feel free to <a href="mailto:dogacel@gmail.com">reach out to me</a>.</p> <h2 id="bibtex">BibTeX</h2> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">eldenk2026migratingtotpu</span><span class="p">,</span>
  <span class="na">title</span><span class="p">=</span><span class="s">{Training language models on {TPUs} shouldn't be scary}</span><span class="p">,</span>
  <span class="na">author</span><span class="p">=</span><span class="s">{Doğaç Eldenk}</span><span class="p">,</span>
  <span class="na">year</span><span class="p">=</span><span class="s">{2026}</span><span class="p">,</span>
  <span class="na">month</span><span class="p">=</span><span class="nv">feb</span><span class="p">,</span>
  <span class="na">howpublished</span><span class="p">=</span><span class="s">{\url{https://dogac.dev/blog/2026/migrating-to-tpu/}}</span><span class="p">,</span>
  <span class="na">note</span><span class="p">=</span><span class="s">{Blog post}</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name></name></author><category term="ml"/><category term="ai"/><category term="research"/><category term="optimization"/><summary type="html"><![CDATA[A practical guide to migrating your PyTorch training code to TPUs, with step-by-step debugging and optimization tips.]]></summary></entry><entry><title type="html">The Case Against Dependency Injection</title><link href="https://dogac.dev/blog/2025/the-case-against-dependency-injection/" rel="alternate" type="text/html" title="The Case Against Dependency Injection"/><published>2025-06-11T00:00:00+00:00</published><updated>2025-06-11T00:00:00+00:00</updated><id>https://dogac.dev/blog/2025/the-case-against-dependency-injection</id><content type="html" xml:base="https://dogac.dev/blog/2025/the-case-against-dependency-injection/"><![CDATA[<p>I first encountered Dependency Injection when I onboarded onto a large backend project that used Scala and the Play framework. Over time, I convinced myself that dependency injection was a good way of managing dependencies, but recently I have come to the conclusion that, most of the time, it hurts more than it helps.</p> <h2 id="1---interfaces-objects-and-classes-1interfaces-objects-and-classes">1 - Interfaces, Objects and Classes</h2> <p>One argument made for Dependency Injection frameworks is that your implementation is decoupled from the interface. Let me ask you: how many times have your interfaces had multiple implementations? Moreover, in how many of those cases did you actually want to abstract away which implementation was being passed? I have seen many occasions where developers pre-emptively created interfaces that provided no value, simply because the standard way of doing things is "Dependency Injection" and interfaces are how you decouple the implementation. 
Hand on heart, did that interface achieve anything real, or is it just a pattern we have been following without thinking much, because we have been advertised this framework allows us to "separate concerns" by allowing us to <em>inject</em> interfaces?</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">interface</span> <span class="nc">IBookRepository</span> <span class="p">{</span>
  <span class="k">fun</span> <span class="nf">getBooks</span><span class="p">():</span> <span class="nc">List</span><span class="p">&lt;</span><span class="nc">Book</span><span class="p">&gt;</span>
<span class="p">}</span>

<span class="kd">class</span> <span class="nc">BookRepository</span> <span class="p">:</span> <span class="nc">IBookRepository</span> <span class="p">{</span>
  <span class="k">override</span> <span class="k">fun</span> <span class="nf">getBooks</span><span class="p">()</span> <span class="p">=</span> <span class="nc">Books</span><span class="p">.</span><span class="nf">selectAll</span><span class="p">().</span><span class="nf">toList</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div> <p>Do you really need the interface <code class="language-plaintext highlighter-rouge">IBookRepository</code> when you only have one data source that holds books? Even if you had multiple sources, why would you inject different implementations in your code? One possibility is choosing a different implementation for local, testing and production environments, but I think that just makes testing less effective, as you now have different behavior in different environments.</p> <p>Let's remember that objects are still a cool alternative to <code class="language-plaintext highlighter-rouge">@Singleton</code> injection.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">object</span> <span class="nc">BookRepository</span> <span class="p">{</span>
  <span class="k">fun</span> <span class="nf">getBooks</span><span class="p">()</span> <span class="p">=</span> <span class="nc">Books</span><span class="p">.</span><span class="nf">selectAll</span><span class="p">().</span><span class="nf">toList</span><span class="p">()</span>
<span class="p">}</span>

<span class="kd">object</span> <span class="nc">BookController</span> <span class="p">{</span>
  <span class="k">fun</span> <span class="nf">getBooks</span><span class="p">()</span> <span class="p">=</span> <span class="nc">BookRepository</span><span class="p">.</span><span class="nf">getBooks</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div> <p>It isn't clear to me why this is less acceptable than the dependency-injected implementation. Moreover, dependency injection spreads like a plague, because you can no longer easily access the <code class="language-plaintext highlighter-rouge">BookRepository</code> instance from a class or object that was not created via the dependency injection framework. So anything that depends on something dependency-injected needs to be dependency-injected itself.</p> <h2 id="2---testing-2testing">2 - Testing</h2> <p>Testing is not easier if you have a constructor with a bunch of unrelated dependencies. Some argue that seeing dependencies explicitly forces you not to miss them while writing tests, so you get fully intended behavior. I don't see it; on the contrary, I would argue they shift the focus away from the thing that is actually being tested. You write a bunch of boilerplate you did not really need in order to test a method that depends on a single dependency, yet you still initialize the class with 10+ dependencies.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">BookController</span><span class="p">(</span>
  <span class="n">authenticator</span><span class="p">:</span> <span class="nc">Authenticator</span><span class="p">,</span>
  <span class="n">bookRepository</span><span class="p">:</span> <span class="nc">BookRepository</span><span class="p">,</span>
  <span class="n">userRepository</span><span class="p">:</span> <span class="nc">UserRepository</span><span class="p">,</span>
  <span class="n">libraryRepository</span><span class="p">:</span> <span class="nc">LibraryRepository</span><span class="p">,</span>
<span class="p">)</span> <span class="p">{</span>
  <span class="k">fun</span> <span class="nf">getPublicBooks</span><span class="p">()</span> <span class="p">=</span> <span class="n">bookRepository</span><span class="p">.</span><span class="nf">getPublicBooks</span><span class="p">()</span>

  <span class="o">..</span><span class="p">.</span>
<span class="p">}</span>

<span class="c1">// While testing</span>

<span class="kd">class</span> <span class="nc">BookControllerTest</span> <span class="p">{</span>

  <span class="nd">@Test</span>
  <span class="k">fun</span> <span class="nf">`should</span> <span class="k">get</span> <span class="k">public</span> <span class="nf">books`</span><span class="p">()</span> <span class="p">{</span>

    <span class="kd">val</span> <span class="py">mockBookRepository</span> <span class="p">=</span> <span class="n">mockk</span><span class="p">&lt;</span><span class="nc">BookRepository</span><span class="p">&gt;()</span>

    <span class="c1">// Initialization gets longer and longer over time</span>
    <span class="kd">val</span> <span class="py">sut</span> <span class="p">=</span> <span class="nc">BookController</span><span class="p">(</span>
        <span class="n">authenticator</span> <span class="p">=</span> <span class="nf">mockk</span><span class="p">(),</span>
        <span class="n">bookRepository</span> <span class="p">=</span> <span class="n">mockBookRepository</span><span class="p">,</span>
        <span class="n">userRepository</span> <span class="p">=</span> <span class="nf">mockk</span><span class="p">(),</span>
        <span class="n">libraryRepository</span> <span class="p">=</span> <span class="nf">mockk</span><span class="p">(),</span>
    <span class="p">)</span>

    <span class="nf">every</span> <span class="p">{</span> <span class="n">mockBookRepository</span><span class="p">.</span><span class="nf">getPublicBooks</span><span class="p">()</span> <span class="p">}</span> <span class="n">returns</span> <span class="n">listOfBooks</span>

    <span class="n">sut</span><span class="p">.</span><span class="nf">getPublicBooks</span><span class="p">()</span> <span class="n">should</span> <span class="n">be</span> <span class="n">listOfBooks</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>Controllers and Services in particular see their responsibilities grow quickly over time. Their constructors bloat as a result, and developers end up juggling a bunch of test code to make things work. One way out is to use property injection rather than constructor injection, which is why I prefer property injection, especially for those complicated classes with multiple responsibilities (yes, I think it is perfectly normal to have them in real life). The alternative, however, is that mocking libraries can handle the testing aspect pretty well, if your language supports it.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">BookControllerTest</span> <span class="p">{</span>

  <span class="nd">@Test</span>
  <span class="k">fun</span> <span class="nf">`should</span> <span class="k">get</span> <span class="k">public</span> <span class="nf">books`</span><span class="p">()</span> <span class="p">{</span>
    <span class="nf">mockkObject</span><span class="p">(</span><span class="nc">BookRepository</span><span class="p">)</span>

    <span class="nf">every</span> <span class="p">{</span> <span class="nc">BookRepository</span><span class="p">.</span><span class="nf">getPublicBooks</span><span class="p">()</span> <span class="p">}</span> <span class="n">returns</span> <span class="n">listOfBooks</span>

    <span class="n">sut</span><span class="p">.</span><span class="nf">getPublicBooks</span><span class="p">()</span> <span class="n">should</span> <span class="n">be</span> <span class="n">listOfBooks</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>I am not sure why this is a lot worse than spinning up the DI framework or initializing classes through their constructors during a test run. Is it a real advantage that specifying all dependencies manually ensures there is no unintended behavior?</p> <p>From personal experience: we deliberately created instances of those classes using the framework-provided builders rather than calling the constructor by hand, because calling it by hand created a huge overhead while writing tests. As a result, our constructors were not called at all during test initialization. We deliberately gave up that guarantee because it was such a pain to manage those long dependency lists by hand.</p> <h2 id="3---named-injection-3named-injection">3 - Named Injection</h2> <p>Named injection is even worse. Why undermine your statically typed language by selecting classes with strings? If you have multiple implementations, just use the desired implementation, upcast to the interface; don't use named injection to pull a specific implementation.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">val</span> <span class="py">RestClient</span> <span class="p">=</span> <span class="n">named</span><span class="p">&lt;</span><span class="nc">IClient</span><span class="p">&gt;(</span><span class="s">"rest"</span><span class="p">)</span>
<span class="kd">val</span> <span class="py">GrpcClient</span> <span class="p">=</span> <span class="n">named</span><span class="p">&lt;</span><span class="nc">IClient</span><span class="p">&gt;(</span><span class="s">"grpc"</span><span class="p">)</span>

<span class="c1">// Instead...</span>

<span class="kd">val</span> <span class="py">RestClient</span><span class="p">:</span> <span class="nc">IClient</span> <span class="p">=</span> <span class="nc">RestClientImpl</span>
<span class="kd">val</span> <span class="py">GrpcClient</span><span class="p">:</span> <span class="nc">IClient</span> <span class="p">=</span> <span class="nc">GrpcClientImpl</span>
</code></pre></div></div> <p>I can't think of an example that requires multiple instances of the same class (not a singleton), but you can easily create multiple instances like so:</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">object</span> <span class="nc">ClientPool</span> <span class="p">{</span>
  <span class="kd">val</span> <span class="py">client1</span> <span class="p">=</span> <span class="nc">Client</span><span class="p">.</span><span class="nf">new</span><span class="p">()</span>
  <span class="kd">val</span> <span class="py">client2</span> <span class="p">=</span> <span class="nc">Client</span><span class="p">.</span><span class="nf">new</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div> <p>Or maybe all you need is an <code class="language-plaintext highlighter-rouge">ObjectPool</code> to begin with. Alternatively, leverage your language's type features and just extend the base interface with no modifications to save yourself from some headaches.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">interface</span> <span class="nc">Logger</span> <span class="p">{</span>
    <span class="k">fun</span> <span class="nf">log</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nc">String</span><span class="p">)</span>
<span class="p">}</span>

<span class="kd">interface</span> <span class="nc">PrettyLogger</span> <span class="p">:</span> <span class="nc">Logger</span>
<span class="kd">interface</span> <span class="nc">RegularLogger</span> <span class="p">:</span> <span class="nc">Logger</span>

<span class="kd">object</span> <span class="nc">PrettyLoggerImpl</span> <span class="p">:</span> <span class="nc">PrettyLogger</span> <span class="p">{</span> <span class="o">..</span><span class="p">.</span> <span class="p">}</span>
<span class="kd">object</span> <span class="nc">RegularLoggerImpl</span> <span class="p">:</span> <span class="nc">RegularLogger</span> <span class="p">{</span> <span class="o">..</span><span class="p">.</span> <span class="p">}</span>
</code></pre></div></div> <h2 id="4---cross-compatability-4cross-compatability">4 - Cross Compatibility</h2> <p>If you have ever imported a library that uses a dependency injection framework and tried to fit it into your own dependency injection system, good luck with that. You are bringing in a bunch of dependencies without really understanding how they work under the hood, and you now have to make them cooperate with your own DI system. That system is an abstraction meant to spare you from managing dependencies yourself, yet to your surprise you now have to know how both DI systems work under the hood and how they interact.</p> <p>So if you ever create a reusable library, please don't use the modern DI frameworks; rely on your language features as much as possible. If you ever feel stuck, build something that works standalone rather than as part of a DI system such as Spring or Dagger.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">object</span> <span class="nc">LoggerProvider</span> <span class="p">{</span>
  <span class="kd">val</span> <span class="py">logger</span><span class="p">:</span> <span class="nc">Logger</span> <span class="p">=</span> <span class="k">when</span><span class="p">(</span><span class="nc">LoggerConfig</span><span class="p">.</span><span class="n">type</span><span class="p">)</span> <span class="p">{</span>
    <span class="s">"pretty"</span> <span class="p">-&gt;</span> <span class="nc">PrettyLoggerImpl</span>
    <span class="k">else</span> <span class="p">-&gt;</span> <span class="nc">RegularLoggerImpl</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>Don't be afraid of creating your own abstractions to fit your own needs. I think it is perfectly normal and a common pattern in many different libraries.</p> <h2 id="5---final-remarks-5final-remarks">5 - Final Remarks</h2> <p>I'm not against dependency injection, but I think we make excuses to convince ourselves that it is the best way of managing dependencies and instances. I wanted to show you how it creates "<em>self-fulfilling prophecies</em>", and when it makes sense and when it doesn't. I can see dependency injection making sense where implementations change quickly or differ from platform to platform and environment to environment. However, it is important to understand when we <em>really</em> need it, versus when it just looks cool.</p> <hr/> <p>If you are new to dependency injection, this article probably doesn't make a lot of sense on its own. Therefore I decided to move this "mini-introduction" to the end of the article, as a footnote for those readers.</p> <h2 id="0---types-of-dependency-injection-0types-of-dependency-injection">0 - Types of Dependency Injection</h2> <p>The default approach to dependency injection is Constructor Injection. With this type of injection your dependencies are not lazily evaluated, so, unlike with property injection, your class can be created only if its dependencies have already been initialized successfully. A constructor-injected class might look like this:</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="kd">class</span> <span class="nc">BookController</span><span class="p">(</span>
  <span class="k">private</span> <span class="kd">val</span> <span class="py">authenticator</span><span class="p">:</span> <span class="nc">Authenticator</span><span class="p">,</span>
  <span class="k">private</span> <span class="kd">val</span> <span class="py">bookRepository</span><span class="p">:</span> <span class="nc">BookRepository</span><span class="p">,</span>
<span class="p">)</span> <span class="p">{</span>
  <span class="k">fun</span> <span class="nf">getBooks</span><span class="p">()</span> <span class="p">=</span> <span class="n">authenticator</span><span class="p">.</span><span class="nf">protect</span> <span class="p">{</span>
    <span class="n">bookRepository</span><span class="p">.</span><span class="nf">getBooks</span><span class="p">()</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>Whereas, a property injection might look like so,</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="kd">class</span> <span class="nc">BookController</span> <span class="p">{</span>
  <span class="k">lateinit</span> <span class="kd">var</span> <span class="py">authenticator</span><span class="p">:</span> <span class="nc">Authenticator</span>
  <span class="k">lateinit</span> <span class="kd">var</span> <span class="py">bookRepository</span><span class="p">:</span> <span class="nc">BookRepository</span>

  <span class="o">..</span><span class="p">.</span>
<span class="p">}</span>
</code></pre></div></div> <p>Since you can create an instance of <code class="language-plaintext highlighter-rouge">BookController</code> without providing the dependencies and set them later, nothing prevents you from calling a method such as <code class="language-plaintext highlighter-rouge">getBooks</code> before <code class="language-plaintext highlighter-rouge">bookRepository</code> is set. Therefore it is seen as a less desirable form of dependency injection. However, it provides some flexibility that is useful during testing, and it lets the application start in a partial state, which in some cases is preferable to a total blackout.</p> <p>With some frameworks such as <code class="language-plaintext highlighter-rouge">Koin</code>, you can use language-specific features such as <code class="language-plaintext highlighter-rouge">lateinit</code> or lazy initialization in Kotlin, together with methods provided by the Koin framework, to initialize properties automatically.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="kd">class</span> <span class="nc">BookController</span> <span class="p">:</span> <span class="nc">KoinComponent</span> <span class="p">{</span>
  <span class="k">private</span> <span class="kd">val</span> <span class="py">authenticator</span> <span class="p">:</span> <span class="nc">Authenticator</span> <span class="k">by</span> <span class="nf">inject</span><span class="p">()</span>
  <span class="k">private</span> <span class="kd">val</span> <span class="py">bookRepository</span> <span class="p">:</span> <span class="nc">BookRepository</span> <span class="k">by</span> <span class="nf">inject</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div> <p>However, this couples your classes directly to Koin, so avoid it in shared code if possible; otherwise you force users to adopt Koin just to ensure your classes are initialized properly.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I first met with Dependency Injection when I on-boarded myself on a large backend project that used Scala and Play framework. Over time, I have convinced myself that dependency injection is a good way of managing dependencies, but recently, I have come to the conclusion most of the time, it hurts more than it helps.]]></summary></entry><entry><title type="html">Rethinking Modern Asynchronous Paradigms</title><link href="https://dogac.dev/blog/2025/modern-asynchronous-paradigms/" rel="alternate" type="text/html" title="Rethinking Modern Asynchronous Paradigms"/><published>2025-05-14T00:00:00+00:00</published><updated>2025-05-14T00:00:00+00:00</updated><id>https://dogac.dev/blog/2025/modern-asynchronous-paradigms</id><content type="html" xml:base="https://dogac.dev/blog/2025/modern-asynchronous-paradigms/"><![CDATA[<p>Most developers deal with some sort of <em>asynchronous</em> operation day to day. For most of us, it is I/O (Input &amp; Output). A web developer makes network calls and a systems developer might do file operations; both are based on a <em>submit and wait</em> model, where the program waits until some operation is completed. 
Different programming languages provide different ways to write asynchronous code, because developers want to put the processor to use during the "wait" phase, either by doing more work or by yielding CPU cycles back to the host until the async operation finishes, so other processes can keep running.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/async-0-480.webp 480w,/assets/img/posts/async-0-800.webp 800w,/assets/img/posts/async-0-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/async-0.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">It takes time for a request to reach the server, be processed, and for the response to arrive back at the client.</figcaption> </figure> </div> <p>For reference, if you have a 4 GHz CPU and the fastest NVMe SSDs, it takes about 0.01 milliseconds to read something from the disk. At 4 billion cycles per second, that is a wait of about 40,000 CPU cycles, just to read something from a disk that sits inside your computer. 
Moreover, if you live in New York City and the servers are located in Chicago, a round trip alone takes around 20 milliseconds without any additional processing, which amounts to about 80,000,000 idle cycles.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/async-1-480.webp 480w,/assets/img/posts/async-1-800.webp 800w,/assets/img/posts/async-1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/async-1.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>If your code runs on an operating system, it normally runs sequentially inside the main thread of a process. The OS handles concurrency by switching between threads super fast. If your CPU has only 1 core, it can run only 1 thread at any given moment. From a user's perspective this doesn't sound right, as you can run multiple programs at the same time on your OS while using your keyboard and mouse. This magical effect is achieved by pausing and resuming threads so quickly that the user can't perceive the micro pauses.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/async-2-480.webp 480w,/assets/img/posts/async-2-800.webp 800w,/assets/img/posts/async-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/async-2.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>From an application developer's perspective, how do you know your code is waiting for something to finish? 
Let's start with an explicit wait, <code class="language-plaintext highlighter-rouge">Thread.sleep(milliseconds)</code>. Assume you are sending some notifications, but you don't want to annoy the user by sending them too quickly. So let's wait 2 seconds after each notification is sent. Assume for now that sending a notification is instantaneous.</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">sendNotifications</span><span class="o">(</span><span class="nc">List</span><span class="o">&lt;</span><span class="nc">Notification</span><span class="o">&gt;</span> <span class="n">notifications</span><span class="o">)</span> <span class="kd">throws</span> <span class="nc">InterruptedException</span> <span class="o">{</span>
  <span class="k">for</span> <span class="o">(</span><span class="nc">Notification</span> <span class="n">notification</span> <span class="o">:</span> <span class="n">notifications</span><span class="o">)</span> <span class="o">{</span>
    <span class="n">notification</span><span class="o">.</span><span class="na">send</span><span class="o">();</span>
    <span class="nc">Thread</span><span class="o">.</span><span class="na">sleep</span><span class="o">(</span><span class="mi">2000</span><span class="o">);</span>
  <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div> <p>When you call <code class="language-plaintext highlighter-rouge">Thread.sleep(2000)</code>, your program tells the OS that the current thread doesn't want to run for the next 2000 milliseconds. The thread is therefore <strong>blocked</strong> for the next 2 seconds and runs no other code. The OS suspends it until the given time has passed and, in the meantime, runs other important work, such as rendering things on screen or processing background messages.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/async-3-480.webp 480w,/assets/img/posts/async-3-800.webp 800w,/assets/img/posts/async-3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/async-3.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">A non-blocked thread can pick up other work while it is free</figcaption> </figure> </div> <p>Instead, suppose you wrote a naive busy-wait loop like this:</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">now</span> <span class="o">=</span> <span class="nc">System</span><span class="o">.</span><span class="na">currentTimeMillis</span><span class="o">();</span>
<span class="k">while</span> <span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="na">currentTimeMillis</span><span class="o">()</span> <span class="o">&lt;=</span> <span class="n">now</span> <span class="o">+</span> <span class="mi">2000</span><span class="o">)</span> <span class="o">{}</span>
</code></pre></div></div> <p>This loop keeps burning CPU cycles even though it performs no useful computation. The OS will probably still preempt your thread to run other work, but it cannot schedule that work as efficiently, because you are not leaving any spare CPU cycles: background tasks might run slower and your computer might feel less responsive.</p> <p>In this scenario we looked at only one thread, but most applications spawn additional threads, called <em>"background threads"</em>, to run things concurrently.</p> <p>Let's say you receive messages from an outside source. You have a web application that constantly receives messages from users, and you need to send notifications to the respective targets. In this case, you need a background thread that receives those messages. When a message arrives, you can send its notifications in a separate thread, so you don't block other messages from being received and processed.</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Thread</span> <span class="n">worker</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Thread</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
      <span class="k">while</span> <span class="o">(!</span><span class="nc">Thread</span><span class="o">.</span><span class="na">currentThread</span><span class="o">().</span><span class="na">isInterrupted</span><span class="o">())</span> <span class="o">{</span>
        <span class="nc">List</span><span class="o">&lt;</span><span class="nc">Message</span><span class="o">&gt;</span> <span class="n">messages</span> <span class="o">=</span> <span class="n">pollMessages</span><span class="o">();</span>
        <span class="n">messages</span><span class="o">.</span><span class="na">forEach</span><span class="o">((</span><span class="n">message</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="o">{</span>
          <span class="nc">Thread</span> <span class="n">sender</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Thread</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
            <span class="n">sendNotifications</span><span class="o">(</span><span class="n">message</span><span class="o">.</span><span class="na">notifications</span><span class="o">);</span>
          <span class="o">});</span>
          <span class="c1">// Start sending but don't wait until it finishes</span>
          <span class="n">sender</span><span class="o">.</span><span class="na">start</span><span class="o">();</span>
        <span class="o">});</span>
        <span class="c1">// Rate limit poll messages to prevent self DDoS</span>
        <span class="nc">Thread</span><span class="o">.</span><span class="na">sleep</span><span class="o">(</span><span class="mi">1000</span><span class="o">);</span>
      <span class="o">}</span>
  <span class="o">});</span>

<span class="c1">// Start the thread</span>
<span class="n">worker</span><span class="o">.</span><span class="na">start</span><span class="o">();</span>

<span class="c1">// Wait until the worker thread exits (it runs until it is interrupted)</span>
<span class="n">worker</span><span class="o">.</span><span class="na">join</span><span class="o">();</span>
</code></pre></div></div> <p>At first glance this looks fine: we create a separate thread for each send operation, so the operating system handles concurrency for us. However, creating a thread is not cheap. It allocates a lot of OS-level resources, so it is a relatively slow operation.</p> <p>Another idea is to use <em>Thread Pools</em>, where we create the threads up front and reuse them, so we avoid paying the resource and time cost of thread creation for every task.</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">ExecutorService</span> <span class="n">notificationPool</span> <span class="o">=</span> <span class="nc">Executors</span><span class="o">.</span><span class="na">newFixedThreadPool</span><span class="o">(</span><span class="mi">10</span><span class="o">);</span>

<span class="nc">Thread</span> <span class="n">worker</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Thread</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
    <span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Background listener thread started."</span><span class="o">);</span>
    <span class="k">while</span> <span class="o">(!</span><span class="nc">Thread</span><span class="o">.</span><span class="na">currentThread</span><span class="o">().</span><span class="na">isInterrupted</span><span class="o">())</span> <span class="o">{</span>
        <span class="nc">List</span><span class="o">&lt;</span><span class="nc">Message</span><span class="o">&gt;</span> <span class="n">messages</span> <span class="o">=</span> <span class="n">pollMessages</span><span class="o">();</span>
        <span class="n">messages</span><span class="o">.</span><span class="na">forEach</span><span class="o">((</span><span class="n">message</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="o">{</span>
            <span class="n">notificationPool</span><span class="o">.</span><span class="na">submit</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">sendNotifications</span><span class="o">(</span><span class="n">message</span><span class="o">.</span><span class="na">notifications</span><span class="o">));</span>
        <span class="o">});</span>

        <span class="c1">// Rate limit poll messages to prevent self DDoS</span>
        <span class="nc">Thread</span><span class="o">.</span><span class="na">sleep</span><span class="o">(</span><span class="mi">1000</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">});</span>

<span class="c1">// Start the thread</span>
<span class="n">worker</span><span class="o">.</span><span class="na">start</span><span class="o">();</span>

<span class="c1">// Wait until the worker thread exits (it runs until it is interrupted)</span>
<span class="n">worker</span><span class="o">.</span><span class="na">join</span><span class="o">();</span>

<span class="c1">// Shutdown thread pool after use</span>
<span class="n">notificationPool</span><span class="o">.</span><span class="na">shutdown</span><span class="o">();</span>
<span class="n">notificationPool</span><span class="o">.</span><span class="na">awaitTermination</span><span class="o">(</span><span class="mi">30</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">);</span>
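
// ---------------------------------------------------------------------------
// Sketch (not from the original post): a fixed pool caps how many blocking
// tasks run at the same time. PoolCap, blockingTask and peakWith are
// hypothetical names. Assumes the usual imports: java.util.concurrent.* and
// java.util.concurrent.atomic.AtomicInteger.
// ---------------------------------------------------------------------------
final class PoolCap {
    private static final AtomicInteger running = new AtomicInteger();
    private static final AtomicInteger peak = new AtomicInteger();

    // Stands in for a slow network call: the pool thread is blocked for 50 ms.
    static void blockingTask() {
        int now = running.incrementAndGet();
        peak.accumulateAndGet(now, Math::max); // remember the highest overlap seen
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            running.decrementAndGet();
        }
    }

    // Submits the given number of blocking tasks and reports the largest overlap.
    static int peakWith(int poolSize, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i != tasks; i++) {
            pool.submit(PoolCap::blockingTask);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return peak.get(); // never exceeds poolSize, however many tasks are queued
    }
}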
</code></pre></div></div> <p>Here, we have set a size for the thread pool. This size is effectively our maximum concurrency limit: with this setup we can't send notifications to more than 10 users concurrently. So let's think about how we can handle this.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/async-4-480.webp 480w,/assets/img/posts/async-4-800.webp 800w,/assets/img/posts/async-4-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/async-4.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>The core issue with a concurrency of 10 users is how long each call to send notifications takes. If sending notifications took only a couple of CPU cycles, running 10 threads would be more than enough! But that assumption is wrong. In reality, those calls usually happen over the network and take a long time, as we discussed. During those network calls, our threads are <em>blocked</em>.</p> <p><em><strong>Note:</strong> </em>If you want to run CPU-bound work with minimal overhead, you could set the number of threads to 2 times the number of physical CPU cores. Many modern CPUs expose 2 logical cores per physical core, so they can run two threads at the same time on each core.</p> <p>So how can we make sending notifications run only instructions that never wait? It is important that we move everything that waits outside this thread pool. Why? Because anything that waits occupies a thread and keeps it from executing other code, even though it is technically doing nothing. So, here comes the idea of <em>Event Loops</em>. 
An event loop runs only <em>non-blocking</em> code, which means its thread is never blocked on a wait operation or kept busy by something super CPU-intensive, such as a cryptographic calculation. On this loop we poll and emit events, which signal other code to be executed, potentially on another thread. For example, anything that performs a blocking operation can run on a separate thread pool with a bunch of spare threads and a lower OS priority, which keeps it from getting in the way of the precious event loop and its low-latency code.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/async-5-480.webp 480w,/assets/img/posts/async-5-800.webp 800w,/assets/img/posts/async-5-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/async-5.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>Let's think about how we can implement sleeps and waits. Calling <code class="language-plaintext highlighter-rouge">Thread.sleep</code> delegates scheduling to the operating system by blocking the thread until the given time has passed. Instead of blocking a thread, let's build an event-loop system: rather than calling <code class="language-plaintext highlighter-rouge">Thread.sleep</code>, we submit a job to a queue with a given delay. This gives us a pub-sub model, where jobs are scheduled by a publisher thread and then consumed and executed on a consumer thread when their time comes.</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Schedule</span> <span class="n">schedule</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Schedule</span><span class="o">();</span>

<span class="nc">Thread</span> <span class="n">publisher</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Thread</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
    <span class="k">while</span> <span class="o">(!</span><span class="nc">Thread</span><span class="o">.</span><span class="na">currentThread</span><span class="o">().</span><span class="na">isInterrupted</span><span class="o">())</span> <span class="o">{</span>
        <span class="nc">List</span><span class="o">&lt;</span><span class="nc">Message</span><span class="o">&gt;</span> <span class="n">messages</span> <span class="o">=</span> <span class="n">pollMessages</span><span class="o">();</span>
        <span class="n">messages</span><span class="o">.</span><span class="na">forEach</span><span class="o">((</span><span class="n">message</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="o">{</span>
          <span class="n">schedule</span><span class="o">.</span><span class="na">queue</span><span class="o">(</span><span class="nl">message:</span><span class="o">:</span><span class="n">sendNotification</span><span class="o">,</span> <span class="mi">2000</span><span class="o">);</span>
        <span class="o">});</span>

        <span class="nc">Thread</span><span class="o">.</span><span class="na">sleep</span><span class="o">(</span><span class="mi">1</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">});</span>

<span class="nc">Thread</span> <span class="n">consumer</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Thread</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
    <span class="kt">long</span> <span class="n">lastRunAt</span> <span class="o">=</span> <span class="nc">System</span><span class="o">.</span><span class="na">currentTimeMillis</span><span class="o">();</span>
    <span class="k">while</span> <span class="o">(!</span><span class="nc">Thread</span><span class="o">.</span><span class="na">currentThread</span><span class="o">().</span><span class="na">isInterrupted</span><span class="o">())</span> <span class="o">{</span>
        <span class="kt">long</span> <span class="n">now</span> <span class="o">=</span> <span class="nc">System</span><span class="o">.</span><span class="na">currentTimeMillis</span><span class="o">();</span> <span class="c1">// read the clock once so no job falls between two reads</span>
        <span class="nc">List</span><span class="o">&lt;</span><span class="nc">Job</span><span class="o">&gt;</span> <span class="n">jobs</span> <span class="o">=</span> <span class="n">schedule</span><span class="o">.</span><span class="na">getJobBetween</span><span class="o">(</span><span class="n">lastRunAt</span><span class="o">,</span> <span class="n">now</span><span class="o">);</span>

        <span class="n">jobs</span><span class="o">.</span><span class="na">forEach</span><span class="o">((</span><span class="n">job</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="n">job</span><span class="o">.</span><span class="na">run</span><span class="o">());</span>

        <span class="n">lastRunAt</span> <span class="o">=</span> <span class="n">now</span><span class="o">;</span>
        <span class="k">try</span> <span class="o">{</span>
            <span class="nc">Thread</span><span class="o">.</span><span class="na">sleep</span><span class="o">(</span><span class="mi">1</span><span class="o">);</span> <span class="c1">// 1 millisecond precision</span>
        <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">InterruptedException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
            <span class="nc">Thread</span><span class="o">.</span><span class="na">currentThread</span><span class="o">().</span><span class="na">interrupt</span><span class="o">();</span> <span class="c1">// restore the flag so the loop exits</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">});</span>
</code></pre></div></div> <p>This is better now, as we are running only 2 threads and no major blocking code that affects our performance. Of course, it is possible to improve this by using OS-level calls, which can rely on hardware timers or hardware-level interrupts to trigger events. However, I wanted to show you how we can achieve something similar without relying on OS internals. This logic is actually similar to how asynchronous frameworks such as <a href="https://netty.io/" rel="noreferrer">Netty</a> are built. A key distinction is that they use asynchronous triggers and low-level parking mechanisms instead of <code class="language-plaintext highlighter-rouge">Thread.sleep</code>, allowing for more efficient CPU utilization and better responsiveness. Also, in this example our <em>Schedule</em> object acts similarly to a message queue, which is a more popular choice for event loops, where different messages are passed around to perform different actions.</p> <p>Inside this event loop, we are currently calling <code class="language-plaintext highlighter-rouge">getJobBetween</code> over and over to check whether a new job has arrived. This is not very efficient. Instead, we could use something like <code class="language-plaintext highlighter-rouge">epoll_wait</code>, a kernel call that blocks the thread until something changes on a watched file descriptor, or the newer <code class="language-plaintext highlighter-rouge">io_uring</code> interface. Alternatively, if you are waiting for messages to arrive in your message queue, you can use <code class="language-plaintext highlighter-rouge">pthread_cond_signal</code> with <code class="language-plaintext highlighter-rouge">pthread_cond_wait</code>, which allows a thread to wait until a signal is given. In this case, our event loop can wait once all messages are processed, and whenever we add a message to the queue we call signal to wake up the event loop. 
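</p> <p>As a rough JVM-side sketch of this signal-and-wait idea, the block below swaps the pthread calls for <code class="language-plaintext highlighter-rouge">java.util.concurrent.locks.Condition</code>. The <code class="language-plaintext highlighter-rouge">SignalingJobQueue</code> class, its <code class="language-plaintext highlighter-rouge">submit</code>, <code class="language-plaintext highlighter-rouge">take</code> and <code class="language-plaintext highlighter-rouge">demo</code> methods and the fixed capacity of 64 are hypothetical names and values chosen for illustration; they are not part of the <em>Schedule</em> example above.</p>

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical job queue: the event loop parks on a Condition
// instead of polling with Thread.sleep(1).
final class SignalingJobQueue {
    private final Runnable[] slots = new Runnable[64]; // assumed fixed capacity, keeps the sketch short
    private int head, tail, count;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final Condition notFull = lock.newCondition();

    // Publisher side: enqueue a job, then signal the waiting event loop.
    void submit(Runnable job) throws InterruptedException {
        lock.lock();
        try {
            while (count == slots.length) {
                notFull.await(); // queue is full: wait until the loop frees a slot
            }
            slots[tail] = job;
            tail = (tail + 1) % slots.length;
            count++;
            notEmpty.signal(); // plays the role of pthread_cond_signal
        } finally {
            lock.unlock();
        }
    }

    // Event-loop side: park until a job arrives, then hand it out.
    Runnable take() throws InterruptedException {
        lock.lock();
        try {
            while (count == 0) {
                notEmpty.await(); // plays the role of pthread_cond_wait; the loop guards against spurious wakeups
            }
            Runnable job = slots[head];
            slots[head] = null;
            head = (head + 1) % slots.length;
            count--;
            notFull.signal();
            return job;
        } finally {
            lock.unlock();
        }
    }

    // Pushes three jobs through a consumer thread and returns what they logged.
    static String demo() {
        SignalingJobQueue queue = new SignalingJobQueue();
        StringBuilder log = new StringBuilder();
        Thread eventLoop = new Thread(() -> {
            try {
                for (int done = 0; done != 3; done++) {
                    queue.take().run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        eventLoop.start();
        try {
            queue.submit(() -> log.append("a"));
            queue.submit(() -> log.append("b"));
            queue.submit(() -> log.append("c"));
            eventLoop.join(); // after join(), the loop's writes to log are visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log.toString();
    }
}
```

<p>In this sketch, <code class="language-plaintext highlighter-rouge">SignalingJobQueue.demo()</code> returns <code class="language-plaintext highlighter-rouge">"abc"</code>: the three jobs run in submission order, and the event-loop thread stays parked whenever the queue is empty instead of waking up every millisecond.</p> <p>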
These calls wait efficiently, so you are not wasting CPU cycles while waiting.</p> <p>So far we have only considered a static blocking call, <code class="language-plaintext highlighter-rouge">sleep(...)</code>. However, most of the blocking calls we typically use are I/O related, for example network I/O, where you send a request and wait for a response to come back. To write fully non-blocking code, you have to spin up a thread for each step that has blocking logic (a wait). You also need to write schedulers and coordinators to manage those jobs and make sure they run with high concurrency and low latency. So the developers of Java decided that concurrency is really hard to manage manually and invented a construct that lets developers write asynchronous code, and that's how <code class="language-plaintext highlighter-rouge">Future</code> was born.</p> <h2 id="javas-future">Java's Future</h2> <p>With a <em>Future</em>, the developer doesn't have to worry about blocking calls as often, because a Future is basically a chain of callbacks. When you construct a future, you register callbacks in your event loop. Whenever the code executed inside the Future has finished, the event loop calls your registered callback. This paradigm decouples task submission from thread management.</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">CompletableFuture</span><span class="o">&lt;</span><span class="nc">List</span><span class="o">&lt;</span><span class="nc">Message</span><span class="o">&gt;&gt;</span> <span class="n">messagesF</span> <span class="o">=</span> <span class="n">pollMessages</span><span class="o">();</span>
<span class="nc">List</span><span class="o">&lt;</span><span class="nc">Message</span><span class="o">&gt;</span> <span class="n">messages</span> <span class="o">=</span> <span class="n">messagesF</span><span class="o">.</span><span class="na">join</span><span class="o">();</span>
</code></pre></div></div> <figcaption> <p><span style="white-space: pre-wrap;">A simple example to convert a future to a blocking call</span></p> </figcaption> <p>A Future is a wrapper whose value can be set by another source at some <em>future</em> time. For example, when you call <code class="language-plaintext highlighter-rouge">.join()</code>, your current thread waits until the result inside the Future object is available. The result is usually set from another thread. As long as you don't call <code class="language-plaintext highlighter-rouge">.join()</code>, you can pass those Future objects around safely in your code without blocking your current thread.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/async-6-480.webp 480w,/assets/img/posts/async-6-800.webp 800w,/assets/img/posts/async-6-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/async-6.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">CompletableFuture</span><span class="o">&lt;</span><span class="nc">Object</span><span class="o">&gt;</span> <span class="n">future</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">CompletableFuture</span><span class="o">&lt;&gt;();</span>

<span class="c1">// Spawn a thread to do calculation in the background</span>
<span class="k">new</span> <span class="nf">Thread</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
  <span class="nc">Object</span> <span class="n">result</span> <span class="o">=</span> <span class="n">longRunningCalculation</span><span class="o">();</span>
  <span class="n">future</span><span class="o">.</span><span class="na">complete</span><span class="o">(</span><span class="n">result</span><span class="o">);</span>
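<span class="o">}).</span><span class="na">start</span><span class="o">();</span> <span class="c1">// without start() the thread never runs</span>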

<span class="c1">// Wait until complete() is called and the result is available.</span>
<span class="n">future</span><span class="o">.</span><span class="na">join</span><span class="o">();</span>
</code></pre></div></div> <p>Moreover, you can transform and chain Futures together to do more complex operations such as,</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">CompletableFuture</span><span class="o">.</span><span class="na">supplyAsync</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">calculateString</span><span class="o">())</span>
            <span class="o">.</span><span class="na">thenApply</span><span class="o">(</span><span class="nl">String:</span><span class="o">:</span><span class="n">toUpperCase</span><span class="o">)</span>
            <span class="o">.</span><span class="na">thenApply</span><span class="o">(</span><span class="n">s</span> <span class="o">-&gt;</span> <span class="n">s</span> <span class="o">+</span> <span class="s">" world"</span><span class="o">)</span>
            <span class="o">.</span><span class="na">thenAccept</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">::</span><span class="n">println</span><span class="o">);</span>
</code></pre></div></div> <p>Futures can also be composed, so that one future's execution depends on another's result.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/async-7-480.webp 480w,/assets/img/posts/async-7-800.webp 800w,/assets/img/posts/async-7-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/async-7.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">CompletableFuture</span><span class="o">&lt;</span><span class="nc">String</span><span class="o">&gt;</span> <span class="n">f1</span> <span class="o">=</span> <span class="nc">CompletableFuture</span><span class="o">.</span><span class="na">supplyAsync</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="s">"hello"</span><span class="o">);</span>
<span class="n">f1</span><span class="o">.</span><span class="na">thenCompose</span><span class="o">(</span><span class="n">s</span> <span class="o">-&gt;</span> <span class="nc">CompletableFuture</span><span class="o">.</span><span class="na">supplyAsync</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">s</span> <span class="o">+</span> <span class="s">" world"</span><span class="o">));</span>
</code></pre></div></div> <p>As you can see, Futures take some getting used to: you can't write code sequentially as before, you have to rewrite it using a special syntax. For example, blocking code for polling and sending notifications can be written as,</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">List</span><span class="o">&lt;</span><span class="nc">Message</span><span class="o">&gt;</span> <span class="n">messages</span> <span class="o">=</span> <span class="n">pollMessages</span><span class="o">();</span>
<span class="n">messages</span><span class="o">.</span><span class="na">forEach</span><span class="o">((</span><span class="n">message</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="o">{</span>
  <span class="nc">Result</span> <span class="n">result</span> <span class="o">=</span> <span class="n">sendNotification</span><span class="o">(</span><span class="n">message</span><span class="o">.</span><span class="na">notification</span><span class="o">);</span>
  <span class="n">persistResult</span><span class="o">(</span><span class="n">result</span><span class="o">);</span>
<span class="o">});</span>
</code></pre></div></div> <p>But polling, sending and persisting are usually waiting operations. To avoid blocking calls, we first modify <code class="language-plaintext highlighter-rouge">pollMessages</code>, <code class="language-plaintext highlighter-rouge">sendNotification</code> and <code class="language-plaintext highlighter-rouge">persistResult</code> to return Futures, and then compose them in the following way.</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pollMessages</span><span class="o">()</span>
    <span class="o">.</span><span class="na">thenComposeAsync</span><span class="o">(</span><span class="n">messages</span> <span class="o">-&gt;</span> <span class="o">{</span>
        <span class="nc">List</span><span class="o">&lt;</span><span class="nc">CompletableFuture</span><span class="o">&lt;</span><span class="nc">Void</span><span class="o">&gt;&gt;</span> <span class="n">futures</span> <span class="o">=</span> <span class="n">messages</span><span class="o">.</span><span class="na">stream</span><span class="o">()</span>
            <span class="o">.</span><span class="na">map</span><span class="o">(</span><span class="n">message</span> <span class="o">-&gt;</span>
                <span class="n">sendNotification</span><span class="o">(</span><span class="n">message</span><span class="o">.</span><span class="na">notification</span><span class="o">)</span>
                    <span class="o">.</span><span class="na">thenComposeAsync</span><span class="o">(</span><span class="n">result</span> <span class="o">-&gt;</span> <span class="n">persistResult</span><span class="o">(</span><span class="n">result</span><span class="o">),</span> <span class="n">executor</span><span class="o">)</span>
            <span class="o">)</span>
            <span class="o">.</span><span class="na">toList</span><span class="o">();</span>

        <span class="k">return</span> <span class="nc">CompletableFuture</span><span class="o">.</span><span class="na">allOf</span><span class="o">(</span><span class="n">futures</span><span class="o">.</span><span class="na">toArray</span><span class="o">(</span><span class="k">new</span> <span class="nc">CompletableFuture</span><span class="o">[</span><span class="mi">0</span><span class="o">]));</span>
    <span class="o">},</span> <span class="n">executor</span><span class="o">);</span>
</code></pre></div></div> <p>As you can see, simple sequential code has become something obscure pretty quickly. We are not even trying to run things in parallel here, we just want to run asynchronous operations without blocking.</p> <h2 id="scalas-way-of-sequentialism">Scala's Way of Sequentialism</h2> <p>By using futures, we have the flexibility to keep running more async code without waiting for each piece to finish. However, an application developer's code is usually written in a sequential way, so that each operation happens back to back. Therefore, futures are usually composed in a nested way. This nesting creates a readability and maintainability issue. So Scala came up with a clever way to manage that nesting, a <em>for comprehension</em>.</p> <div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="o">{</span>
  <span class="n">messages</span> <span class="k">&lt;-</span> <span class="nf">pollMessages</span><span class="o">()</span>
  <span class="n">result</span> <span class="k">&lt;-</span> <span class="nf">sendNotifications</span><span class="o">(</span><span class="nv">messages</span><span class="o">.</span><span class="py">notifications</span><span class="o">)</span>
  <span class="k">_</span> <span class="k">&lt;-</span> <span class="nf">persistResult</span><span class="o">(</span><span class="n">result</span><span class="o">)</span>
<span class="o">}</span> <span class="nf">yield</span> <span class="o">(</span><span class="n">result</span><span class="o">)</span>
</code></pre></div></div> <p>Unlike Java's traditional Future chaining, this approach tries to create a sequential syntax for writing asynchronous code. However, it comes with several limitations,</p> <ol> <li>You still need to write code using a special syntax.</li> <li>Early returns are not possible.</li> <li>Error handling is still nested.</li> <li>Iterative code doesn't translate directly.</li> </ol> <p>Those limitations also apply to Java's Future, but demonstrating them would require a different syntax. I found Scala's syntax to be slightly more friendly, but I will show you why it is still limiting. For example, you can't run code conditionally without nesting.</p> <div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="o">{</span>
  <span class="n">result</span> <span class="k">&lt;-</span> <span class="nf">sendNotifications</span><span class="o">(</span><span class="nv">messages</span><span class="o">.</span><span class="py">notifications</span><span class="o">)</span>

  <span class="c1">// This is not a valid syntax</span>
  <span class="nf">if</span> <span class="o">(</span><span class="n">result</span> <span class="o">==</span> <span class="nv">Result</span><span class="o">.</span><span class="py">ERROR</span><span class="o">)</span> <span class="o">{</span>
    <span class="k">_</span> <span class="k">&lt;-</span> <span class="nf">reportErrors</span><span class="o">(</span><span class="n">result</span><span class="o">)</span>
    <span class="k">return</span> <span class="kc">false</span>
  <span class="o">}</span>

  <span class="k">_</span> <span class="k">&lt;-</span> <span class="nf">reportSuccess</span><span class="o">(</span><span class="n">result</span><span class="o">)</span>

<span class="o">}</span> <span class="nf">yield</span> <span class="o">(</span><span class="kc">true</span><span class="o">)</span>
</code></pre></div></div> <p>You have to write it using nested for comprehensions, so each decision point in your comprehension tree needs to branch out.</p> <div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="o">{</span>
  <span class="n">result</span> <span class="k">&lt;-</span> <span class="nf">sendNotifications</span><span class="o">(</span><span class="nv">messages</span><span class="o">.</span><span class="py">notifications</span><span class="o">)</span>
  <span class="n">innerResult</span> <span class="k">&lt;-</span> <span class="n">result</span> <span class="k">match</span> <span class="o">{</span>
    <span class="k">case</span> <span class="nv">Result</span><span class="o">.</span><span class="py">ERROR</span> <span class="k">=&gt;</span> <span class="k">for</span> <span class="o">{</span>
      <span class="k">_</span> <span class="k">&lt;-</span> <span class="nf">reportErrors</span><span class="o">(</span><span class="n">result</span><span class="o">)</span>
    <span class="o">}</span> <span class="k">yield</span> <span class="kc">false</span>

    <span class="c1">// For comprehension is not recommended for single futures.</span>
    <span class="k">case</span> <span class="k">_</span> <span class="k">=&gt;</span> <span class="nf">reportSuccess</span><span class="o">(</span><span class="n">result</span><span class="o">).</span><span class="py">map</span><span class="o">(</span><span class="k">_</span> <span class="k">=&gt;</span> <span class="kc">true</span><span class="o">)</span>
  <span class="o">}</span>
<span class="o">}</span> <span class="k">yield</span> <span class="n">innerResult</span>
</code></pre></div></div> <p>Error handling is similar: you have to write <code class="language-plaintext highlighter-rouge">recover</code> (or <code class="language-plaintext highlighter-rouge">recoverWith</code>) blocks instead of your daily tool of <code class="language-plaintext highlighter-rouge">try { .. } catch { .. }</code>.</p> <div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="o">{</span>
  <span class="n">result</span> <span class="k">&lt;-</span> <span class="nf">sendNotifications</span><span class="o">(</span><span class="nv">messages</span><span class="o">.</span><span class="py">notifications</span><span class="o">).</span><span class="py">recoverWith</span> <span class="o">{</span> <span class="k">case</span> <span class="n">err</span> <span class="k">=&gt;</span>
    <span class="k">for</span> <span class="o">{</span>
      <span class="k">_</span> <span class="k">&lt;-</span> <span class="nf">reportError</span><span class="o">(</span><span class="n">err</span><span class="o">)</span>
      <span class="k">_</span> <span class="k">&lt;-</span> <span class="nf">rollback</span><span class="o">(</span><span class="nv">messages</span><span class="o">.</span><span class="py">notifications</span><span class="o">)</span>
    <span class="o">}</span> <span class="k">yield</span> <span class="kc">false</span>
  <span class="o">}</span>
<span class="o">}</span> <span class="k">yield</span> <span class="n">result</span>
</code></pre></div></div> <p>Nesting also forces you to unify the type of <code class="language-plaintext highlighter-rouge">result</code>. Normally, you could assign the error result to a different variable and call <code class="language-plaintext highlighter-rouge">return</code> early to keep the rest of the code from running sequentially. Moreover, all those limitations apply to Java's traditional Futures as well.</p> <h2 id="sequentialism-as-first-class-citizen">Sequentialism as First Class Citizen</h2> <p>We now know why Java's <em>Futures</em> exist and how Scala's <em>for comprehension syntax</em> tries to solve some fundamental issues with them. However, it is easy to see that Java wasn't designed with first-class support for asynchronous programming, and Scala tried to patch some of its inherent issues. Scala never tried to replace Java, but rather tried to extend it. For comprehensions have been a big deal, and Scala brought a lot of other benefits as well. Kotlin, on the other hand, directly targets Java as its contender and tries to replace it. One of Kotlin's distinctive features is <code class="language-plaintext highlighter-rouge">coroutines</code>.</p> <p>Instead of relying on threads, which are expensive operating-system level constructs, Kotlin introduces <em>coroutines</em>, which are lightweight runtime-level constructs. Coroutines do still run on threads, but their execution is not strictly tied to a single thread, so they can switch threads at runtime. This flexibility makes them lightweight, similar to the jobs submitted to thread pools that we have shown in the first chapter. 
However, Kotlin has first-class support for coroutines using its language features, most importantly <code class="language-plaintext highlighter-rouge">suspend</code>.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/async-8-480.webp 480w,/assets/img/posts/async-8-800.webp 800w,/assets/img/posts/async-8-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/async-8.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>Unlike threads, coroutines are not paused at arbitrary points to let other coroutines run. Note that the thread running the coroutines can still be paused by the OS at any time, which is not possible to prevent, but the coroutine scheduler itself doesn't pause coroutines. On the contrary, coroutines take a cooperative approach. They <code class="language-plaintext highlighter-rouge">yield</code> execution at suspension points. Most importantly, they yield during asynchronous operations, where they wait for the result of an operation. Therefore, underlying libraries should expose those asynchronous operations as <code class="language-plaintext highlighter-rouge">suspend</code> functions so that callers can benefit from Kotlin's coroutine features.</p> <p>The best thing about suspend functions is that they are written in the traditional sequential way. 
Sequential asynchronism is the first-class citizen, while explicitly controlled asynchronism is still available through other interfaces, such as <code class="language-plaintext highlighter-rouge">Future</code> or Kotlin's <code class="language-plaintext highlighter-rouge">Deferred</code> construct.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">suspend</span> <span class="k">fun</span> <span class="nf">processMessages</span><span class="p">()</span> <span class="p">{</span>
  <span class="kd">val</span> <span class="py">messages</span> <span class="p">=</span> <span class="nf">pollMessages</span><span class="p">()</span>
  <span class="n">messages</span><span class="p">.</span><span class="nf">forEach</span> <span class="p">{</span> <span class="n">message</span> <span class="p">-&gt;</span>
    <span class="nf">sendNotification</span><span class="p">(</span><span class="n">message</span><span class="p">.</span><span class="n">notification</span><span class="p">)</span>
    <span class="nf">delay</span><span class="p">(</span><span class="mi">2000</span><span class="p">)</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>Wait, that must be blocking, right? No, there is no blocking code here! The methods <code class="language-plaintext highlighter-rouge">pollMessages</code>, <code class="language-plaintext highlighter-rouge">sendNotification</code> and <code class="language-plaintext highlighter-rouge">delay</code> are actually <code class="language-plaintext highlighter-rouge">suspend</code> methods. For example, polling messages happens asynchronously and the coroutine yields while the poll is in progress, so it doesn't block the running thread. The same goes for send and delay. <code class="language-plaintext highlighter-rouge">delay</code> is built into the coroutine library: a scheduler suspends the coroutine in the background and resumes it when the given time has elapsed. So we were able to benefit from an event loop without writing nested futures and executors. If you are curious about how event loops are implemented, check the <a href="https://github.com/JetBrains/kotlin/blob/a0dcf483dc9cebfe9a67fd2260fb2b7498bcaf50/kotlin-native/runtime/src/main/cpp/Worker.cpp" rel="noreferrer">C++ Worker implementation for Kotlin</a>.</p> <p>Kotlin's suspension language feature solves almost all of our pain points as developers writing asynchronous code. Most importantly, it lets us write code that does asynchronous work without introducing any parallelism. A developer doesn't necessarily care how those futures are chained and handled, especially if they are writing data-intensive applications. 
If a developer needs explicit parallelism, they can use Kotlin's provided <code class="language-plaintext highlighter-rouge">Deferred</code> variables.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">val</span> <span class="py">messagesD</span><span class="p">:</span> <span class="nc">Deferred</span><span class="p">&lt;</span><span class="nc">List</span><span class="p">&lt;</span><span class="nc">Message</span><span class="p">&gt;&gt;</span> <span class="p">=</span> <span class="nf">async</span> <span class="p">{</span> <span class="nf">pollMessages</span><span class="p">()</span> <span class="p">}</span>
<span class="kd">val</span> <span class="py">messages</span> <span class="p">=</span> <span class="n">messagesD</span><span class="p">.</span><span class="nf">await</span><span class="p">()</span> <span class="c1">// calling await is "suspend"</span>
<span class="nf">sendNotifications</span><span class="p">(</span><span class="n">messages</span><span class="p">)</span>
</code></pre></div></div> <p>Moreover, a user might dispatch the given suspend call in a different coroutine context or thread pool. This is especially important when old-school blocking code needs to be executed inside a suspend function.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">val</span> <span class="py">messages</span> <span class="p">=</span> <span class="nf">withContext</span><span class="p">(</span><span class="nc">Dispatchers</span><span class="p">.</span><span class="nc">IO</span><span class="p">)</span> <span class="p">{</span>
    <span class="nf">pollMessages</span><span class="p">()</span>
<span class="p">}</span>
<span class="nf">sendNotifications</span><span class="p">(</span><span class="n">messages</span><span class="p">)</span>
</code></pre></div></div> <h2 id="implicit-parallelism-know-where-to-go">Implicit Parallelism: Know Where to Go</h2> <p>A step forward from sequential asynchronism can be thought of as implicit parallelism, where code is written sequentially but independent operations run in parallel. How? It is only possible with support from the programming language. Let's assume that when you call,</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">val</span> <span class="py">messages</span> <span class="p">=</span> <span class="nf">pollMessages</span><span class="p">()</span>
<span class="kd">val</span> <span class="py">users</span> <span class="p">=</span> <span class="nf">fetchUsers</span><span class="p">()</span>
</code></pre></div></div> <p>the code <code class="language-plaintext highlighter-rouge">fetchUsers()</code> starts executing before <code class="language-plaintext highlighter-rouge">pollMessages()</code> has finished, because the two calls are independent of each other. Traditionally, this is done with a futures approach.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">val</span> <span class="py">messagesF</span> <span class="p">=</span> <span class="nf">pollMessages</span><span class="p">()</span>
<span class="kd">val</span> <span class="py">usersF</span> <span class="p">=</span> <span class="nf">fetchUsers</span><span class="p">()</span>

<span class="kd">val</span> <span class="p">(</span><span class="py">messages</span><span class="p">,</span> <span class="py">users</span><span class="p">)</span> <span class="p">=</span> <span class="nf">awaitAll</span><span class="p">(</span><span class="n">messagesF</span><span class="p">,</span> <span class="n">usersF</span><span class="p">)</span>
</code></pre></div></div> <p>However, having this as a native construct of the programming language cuts both ways: it can help users write performant code, but it can also lead them to write buggy code easily, as the default assumption is sequentialism. Therefore, I think a paradigm with implicit parallelism is possible, but it should be used very carefully, as there is no way to prevent unintentional race conditions without formal verification. Even at runtime you might see flakiness, as you are starting to build a distributed-by-default environment. We already know that it is hard to ensure the correctness of distributed systems without formal verification, and here we are pushing this complexity into our code.</p> <p>That's why I think Kotlin deserves some praise for how it handles parallelism: it is explicit, and it is easy to shift between paradigms.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">val</span> <span class="py">messages</span> <span class="p">=</span> <span class="nf">async</span> <span class="p">{</span> <span class="nf">pollMessages</span><span class="p">()</span> <span class="p">}</span>
<span class="kd">val</span> <span class="py">users</span> <span class="p">=</span> <span class="nf">async</span> <span class="p">{</span> <span class="nf">fetchUsers</span><span class="p">()</span> <span class="p">}</span>

<span class="nf">sendNotifications</span><span class="p">(</span><span class="n">messages</span><span class="p">.</span><span class="nf">await</span><span class="p">(),</span> <span class="n">users</span><span class="p">.</span><span class="nf">await</span><span class="p">())</span>
</code></pre></div></div> <p>I hope to see a language feature where calling a second <code class="language-plaintext highlighter-rouge">await</code> is unnecessary because the value has already been awaited, similar to smart casting, where a nullable type is automatically cast to a non-null type once a null check has been performed.</p> <h2 id="final-remarks">Final Remarks</h2> <p>There is still a lot to talk about. There are a bunch of other languages and frameworks that handle asynchronous execution in various ways, such as Go's goroutines, JavaScript's async/await, Python's asyncio, Rust's tokio etc. There is also more to cover in Java (Future, Mono, Flux), Scala (execution contexts, Cats, Akka) and Kotlin (coroutine contexts, dispatchers, Flows, Channels), and many more if you are interested in reading about them.</p> <p>We have seen how programming languages have evolved to catch up with developers' needs. Our hardware has improved and our CPUs have many spare cycles, so now we usually spend a larger share of our time waiting on tasks such as disk or network I/O. Initially we wrote code sequentially, later we built Futures, executors and event loops. Finally, we have seen how syntax evolved to support asynchronous programming in an easier and more readable way. I do believe asynchronous programming is still open to improvements: the frameworks and languages we use will keep improving, and sequential asynchronism will grow in popularity.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Most developers deal with some sort of asynchronous operation day to day. For most of us, it is I/O (Input &amp; Output). A web developer does network calls, a systems developer could do some file operations, both are based on a submit and wait system, where program waits until some operation is completed. 
Different programming languages provide different ways to write code that is asynchronous, as developer wants to utilize the processor during the "wait" phase, by either doing more operations or yielding some CPU cycles back to the host until the async operation finishes, so other processes continue running.]]></summary></entry><entry><title type="html">Start with a clean slate: Integration testing with PostgreSQL</title><link href="https://dogac.dev/blog/2025/pg-test-table-track/" rel="alternate" type="text/html" title="Start with a clean slate: Integration testing with PostgreSQL"/><published>2025-04-22T00:00:00+00:00</published><updated>2025-04-22T00:00:00+00:00</updated><id>https://dogac.dev/blog/2025/pg-test-table-track</id><content type="html" xml:base="https://dogac.dev/blog/2025/pg-test-table-track/"><![CDATA[<p>We have been using PostgreSQL as our primary database in production for over 4 years. Over time, as our database grew and reached over 500 tables in a single monolithic application, we had to come up with smart ways to manage it. PostgreSQL is capable of handling hundreds of tables and billions of rows, but that doesn't necessarily mean it is easy to develop applications in such a setting. In this post, I am going to write down how I tackled some bottlenecks in the integration testing pipeline at Carbon Health by making our integration tests faster and better isolated. The solution has powered our CI/CD pipelines for the last 2 years.</p> <hr/> <p><em>This blog post’s topic is my upcoming presentation at </em><a href="https://postgresql.us/events/pgdaychicago2025/schedule/session/1891-start-with-a-clean-slate-setting-up-integration-tests-with-postgresql/"><em>PGDay Chicago 2025</em></a><em>. 
Conference slides are accessible at </em><a href="https://pgday.dogac.dev/"><em>https://pgday.dogac.dev/</em></a><em>.</em></p> <p><strong><em>Link to the tool: </em></strong><a href="https://github.com/Dogacel/pg_test_table_track"><strong><em>github.com/Dogacel/pg_test_table_track</em></strong></a></p> <h1 id="problem">Problem</h1> <p><strong>A short anecdote on monoliths: </strong>Microservices are something we often hear about, but they are usually a distant reality for many of us. Monoliths (<em>monos: single/one, lithos: stone</em>) still work pretty great in many real-world settings and they only bear a subset of the management problems microservices have. One of the core problems of a monolith is its huge codebase and slow build times. You don't need to open 3 pull requests just to do some CRUD operations on a basic database table, and jump back a couple PRs later, because you forgot to add a field to your proto definitions and you gotta open 3 more PRs to add that. That sounds neat, however most likely, the total CI/CD runtime of those 6 PRs will still be less than a single PR check in the monolith, where you wait 45 minutes just to see your linter fail because you forgot to define a constant for a magic number, yikes. If you want to have a productive and effective development environment with your monolith, you have to do some optimizations in your CI/CD and testing environment.</p> <p><strong>Background:</strong> So a little background about our company before we start,</p> <ul> <li>Our tech-stack consists of a monolithic server supported by 30+ micro-services.</li> <li>We host our services on cloud, our primary choice of database is PostgreSQL.</li> <li>We have over 500 tables serving more than 10 TB of data.</li> <li>We have about 6 distinct development teams.</li> </ul> <p>As you might have guessed, those 500 tables are causing big trouble in our CI/CD pipelines. 
More than half of the 9000+ tests in our monolith are integration tests, meaning they use a PostgreSQL instance to run queries. Over time, our pipeline has become painfully slow and annoying to work with, which led me to come up with a solution.</p> <h2 id="integration-tests">Integration tests</h2> <p>Integration testing checks whether different parts of a system work together correctly as a whole. Unit testing focuses on testing individual components.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/psql-0-480.webp 480w,/assets/img/posts/psql-0-800.webp 800w,/assets/img/posts/psql-0-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/psql-0.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>Even though unit tests are far superior in terms of isolation and speed, they are not as good at covering end-to-end flows and detecting real-life failures. That's why we have written integration tests extensively to ensure our monolith is tested well before release. Based on our experience, setting up scenarios and running the actual DB queries in tests really helps catch bugs early on.</p> <p>So, what's the catch? Writing integration tests is painfully hard, as your data dependencies, such as foreign key constraints, make initialization a hassle for developers. Moreover, your database keeps state, therefore you need to ensure it doesn't leak between tests. So let's explore our options in order to achieve a fast and isolated environment.</p> <h3 id="wrapping-every-test-with-transactions">Wrapping every test with transactions</h3> <p>At first, it sounds like a good idea. In reality, it is a terrible idea. 
PostgreSQL supports some sort of nested transactions, also called <code class="language-plaintext highlighter-rouge">SAVEPOINTS</code>. However, a failure inside a transaction aborts the rest of it. Therefore, it is not possible to truly wrap every test inside a transaction and run it, as some errors might cause the parent transaction to abort. Moreover, wrapping tests in additional transactions would alter their runtime behavior. This is not something we want, as it might result in hard-to-debug errors that only show up during tests, as well as behavioral differences from the actual production environment, which might cause some bugs not to be caught early on.</p> <h3 id="fresh-db-for-each-start">Fresh DB for each start</h3> <p>If you want to maximize isolation, go ahead and create a fresh DB instance for each of your tests. This worked fine in our microservices, where the number of tables and tests was lower. However, in monoliths you will quickly realize this is a slow process. We have thousands of migration files, but we can always use a schema dump. In our case, we used <code class="language-plaintext highlighter-rouge">rake:schema:dump</code>. I highly encourage readers to experiment with <code class="language-plaintext highlighter-rouge">TEMPLATE</code> databases as well. 
However, initialization takes around 400 milliseconds, which adds up to a little over 1 hour of DB initialization time alone for our 9000+ integration tests.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/psql-1-480.webp 480w,/assets/img/posts/psql-1-800.webp 800w,/assets/img/posts/psql-1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/psql-1.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">A very simple implementation of a DB provider for running isolated tests.</figcaption> </figure> </div> <h3 id="cleaning-all-tables">Cleaning all tables</h3> <p>This was the initial approach in our codebase, maintaining a hand-curated list of <code class="language-plaintext highlighter-rouge">DELETE FROM</code> queries. However it has some drawbacks,</p> <ol> <li>The order of deletions matters, as there are foreign keys.</li> <li>Sometimes tables were missing from the list, resulting in flakiness.</li> <li>Sequences and Materialized Views require special attention.</li> </ol> <p>For number 3, our codebase doesn't truly benefit from either, so we didn't care. However, this approach was still too slow and maintaining the list was super annoying. Adding a new table to this list was very hard: you would see weird foreign key errors, random test failures and so on. Also, there is no guarantee that your hand-crafted list contains all the necessary tables in the right order. Therefore, a developer might randomly run into flakiness while writing a test without knowing it is caused by artifacts left over from earlier tests.</p> <p>Actually I have had this issue once, and it was super annoying to fix. Updating our build tools changed the execution order of tests, which ultimately led to flakiness. 
It took me an enormous amount of time to figure out that the test order had changed and the bug was caused by a state leak between tests.</p> <blockquote> <p>We also experimented with <code class="language-plaintext highlighter-rouge">TRUNCATE</code> over <code class="language-plaintext highlighter-rouge">DELETE</code>. However it slowed down our pipelines even more. I think it is because our test tables held only small amounts of data, which made truncate less effective and caused overhead.</p> </blockquote> <h3 id="final-solution">Final Solution</h3> <p>So I decided that our final goals should be</p> <ol> <li>Make each table fresh before each test</li> <li>Clean the state as fast as possible</li> <li>Get rid of hand-crafted lists as entropy always wins</li> </ol> <p>So I built a solution that uses <strong>PL/pgSQL</strong> to automatically clean, between tests, all tables that were used.</p> <h4 id="storing-access">Storing Access</h4> <p>If there are a bunch of tables, trying to clean up all of them would generate a big overhead. So instead of that, what about only cleaning the ones that contain data? To do that, we need to store the tables that are used during testing somewhere.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="n">test_access</span><span class="p">(</span><span class="k">table_name</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">256</span><span class="p">)</span> <span class="k">not</span> <span class="k">null</span> <span class="k">primary</span> <span class="k">key</span><span class="p">);</span>
</code></pre></div></div> <p>Next, create a trigger function that adds a given table name to the list.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="n">add_table_to_accessed_list</span><span class="p">()</span> <span class="k">RETURNS</span> <span class="k">TRIGGER</span> <span class="k">AS</span> <span class="err">$$</span>
<span class="k">BEGIN</span>
  <span class="c1">--- Assuming that the table name is passed as the first argument to the function.</span>
  <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">test_access</span> <span class="k">VALUES</span> <span class="p">(</span><span class="n">TG_ARGV</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="k">ON</span> <span class="n">CONFLICT</span> <span class="k">DO</span> <span class="k">NOTHING</span><span class="p">;</span>
  <span class="k">RETURN</span> <span class="k">NEW</span><span class="p">;</span>
  <span class="k">END</span> <span class="err">$$</span> <span class="k">LANGUAGE</span> <span class="n">PLPGSQL</span><span class="p">;</span>
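</code></pre></div></div> <p>To see the function in action before automating anything, a trigger can be attached to a single table by hand. This is only a minimal sketch: <code class="language-plaintext highlighter-rouge">demo_users</code> and <code class="language-plaintext highlighter-rouge">demo_users_access_trigger</code> are hypothetical names used for illustration, and the snippet assumes everything lives in the <code class="language-plaintext highlighter-rouge">public</code> schema.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical table, not part of the original schema.
CREATE TABLE IF NOT EXISTS demo_users(id serial PRIMARY KEY, name text);

-- Attach the tracking trigger manually; the argument is the quoted table name to record.
DROP TRIGGER IF EXISTS demo_users_access_trigger ON demo_users;
CREATE TRIGGER demo_users_access_trigger
  BEFORE INSERT ON demo_users
  FOR EACH STATEMENT
  EXECUTE PROCEDURE add_table_to_accessed_list('"public"."demo_users"');

-- Any insert should now leave a matching row in test_access.
INSERT INTO demo_users(name) VALUES ('alice');
SELECT table_name FROM test_access;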
</code></pre></div></div> <h4 id="spying-on-tables">Spying on tables</h4> <p>In order to spy on tables that are modified, we can use triggers. This trigger will be executed before every insert, which ensures we capture all tables that are altered during the test run.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="n">setup_access_triggers</span><span class="p">(</span><span class="n">schemas</span> <span class="nb">text</span><span class="p">[])</span> <span class="k">RETURNS</span> <span class="nb">int</span> <span class="k">AS</span> <span class="err">$$</span>
<span class="k">DECLARE</span> <span class="n">tables</span> <span class="k">CURSOR</span> <span class="k">FOR</span>
  <span class="k">SELECT</span> <span class="k">table_name</span><span class="p">,</span> <span class="n">table_schema</span> <span class="k">FROM</span> <span class="n">information_schema</span><span class="p">.</span><span class="n">tables</span>
    <span class="k">WHERE</span> <span class="n">table_schema</span> <span class="o">=</span> <span class="k">ANY</span><span class="p">(</span><span class="n">schemas</span><span class="p">)</span>
      <span class="k">AND</span> <span class="n">table_type</span> <span class="o">=</span> <span class="s1">'BASE TABLE'</span> <span class="c1">--- Exclude views.</span>
      <span class="k">AND</span> <span class="k">table_name</span> <span class="k">NOT</span> <span class="k">IN</span> <span class="p">(</span><span class="s1">'test_access'</span><span class="p">,</span> <span class="s1">'schema_migrations'</span><span class="p">);</span>
      <span class="c1">--- Prevent recursion when an insertion happens to 'test_access' table.</span>
<span class="k">BEGIN</span>
  <span class="c1">--- Create a table to store the list of tables that have been accessed.</span>
  <span class="k">EXECUTE</span> <span class="s1">'CREATE TABLE IF NOT EXISTS test_access(table_name varchar(256) not null primary key);'</span><span class="p">;</span>
  <span class="k">FOR</span> <span class="n">stmt</span> <span class="k">IN</span> <span class="n">tables</span> <span class="n">LOOP</span>
    <span class="c1">--- If the trigger exists, first drop it so we can re-create.</span>
    <span class="k">EXECUTE</span> <span class="s1">'DROP TRIGGER IF EXISTS "'</span> <span class="o">||</span> <span class="n">stmt</span><span class="p">.</span><span class="k">table_name</span> <span class="o">||</span> <span class="s1">'_access_trigger" ON "'</span> <span class="o">||</span>
          <span class="n">stmt</span><span class="p">.</span><span class="n">table_schema</span> <span class="o">||</span> <span class="s1">'"."'</span><span class="o">||</span> <span class="n">stmt</span><span class="p">.</span><span class="k">table_name</span> <span class="o">||</span> <span class="s1">'"'</span><span class="p">;</span>
    <span class="c1">--- Create the on insert trigger.</span>
    <span class="c1">--- This calls `add_table_to_accessed_list` once per INSERT statement on the table, passing the table name.</span>
    <span class="c1">--- The table name also includes the table schema.</span>
    <span class="k">EXECUTE</span> <span class="s1">'CREATE TRIGGER "'</span> <span class="o">||</span> <span class="n">stmt</span><span class="p">.</span><span class="k">table_name</span> <span class="o">||</span> <span class="s1">'_access_trigger"'</span> <span class="o">||</span>
            <span class="s1">' BEFORE INSERT ON "'</span> <span class="o">||</span> <span class="n">stmt</span><span class="p">.</span><span class="n">table_schema</span> <span class="o">||</span><span class="s1">'"."'</span><span class="o">||</span> <span class="n">stmt</span><span class="p">.</span><span class="k">table_name</span> <span class="o">||</span> <span class="s1">'"'</span> <span class="o">||</span>
            <span class="s1">' FOR EACH STATEMENT '</span> <span class="o">||</span>
            <span class="s1">' EXECUTE PROCEDURE public.add_table_to_accessed_list (</span><span class="se">''</span><span class="s1">"'</span><span class="o">||</span>
            <span class="n">stmt</span><span class="p">.</span><span class="n">table_schema</span> <span class="o">||</span><span class="s1">'"."'</span><span class="o">||</span> <span class="n">stmt</span><span class="p">.</span><span class="k">table_name</span> <span class="o">||</span><span class="s1">'"</span><span class="se">''</span><span class="s1">)'</span><span class="p">;</span>
  <span class="k">END</span> <span class="n">LOOP</span><span class="p">;</span>
<span class="k">RETURN</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">END</span> <span class="err">$$</span> <span class="k">LANGUAGE</span> <span class="n">plpgsql</span><span class="p">;</span>
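</code></pre></div></div> <p>A quick way to check that the setup worked is to call the function and then list the triggers it created. This is a small sketch, assuming the tables under test live in the <code class="language-plaintext highlighter-rouge">public</code> schema; pass whichever schemas your own tests use.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Install (or re-install) the tracking triggers for every base table in the given schemas.
SELECT setup_access_triggers(ARRAY['public']);

-- List the triggers that were created; there should be one row per tracked table.
SELECT event_object_schema, event_object_table, trigger_name
  FROM information_schema.triggers
 WHERE trigger_name LIKE '%_access_trigger'
 ORDER BY event_object_schema, event_object_table;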
</code></pre></div></div> <h4 id="cleaning-the-tables">Cleaning the tables</h4> <p>As a last step, we need to create a function that allows us to clean all tables that are accessed during the last test execution cycle. We disable foreign keys before deleting to ensure deletion order doesn't matter as our final goal is to clean all tables.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">FUNCTION</span> <span class="n">delete_from_accessed_tables</span><span class="p">()</span> <span class="k">RETURNS</span> <span class="nb">int</span> <span class="k">AS</span> <span class="err">$$</span>
<span class="k">DECLARE</span> <span class="n">tables</span> <span class="k">CURSOR</span> <span class="k">FOR</span>
  <span class="k">SELECT</span> <span class="k">table_name</span> <span class="k">FROM</span> <span class="n">test_access</span><span class="p">;</span>
<span class="k">BEGIN</span>
<span class="c1">--- Disable foreign key constraints temporarily. Without this, we need to clear tables in a specific order.</span>
<span class="c1">--- But it is very hard to find this order and this trick makes the process even faster.</span>
<span class="c1">--- Because we clear every table, we don't care about any foreign key constraints.</span>
<span class="k">EXECUTE</span> <span class="s1">'SET session_replication_role = </span><span class="se">''</span><span class="s1">replica</span><span class="se">''</span><span class="s1">;'</span><span class="p">;</span>
<span class="c1">--- Clear all tables that have been accessed.</span>
<span class="k">FOR</span> <span class="n">stmt</span> <span class="k">IN</span> <span class="n">tables</span> <span class="n">LOOP</span>
  <span class="k">BEGIN</span>
    <span class="k">EXECUTE</span> <span class="s1">'DELETE FROM '</span><span class="o">||</span> <span class="n">stmt</span><span class="p">.</span><span class="k">table_name</span><span class="p">;</span>
    <span class="c1">--- If we accessed a table that is dropped, an exception will occur. This ignores the exception.</span>
    <span class="n">EXCEPTION</span> <span class="k">WHEN</span> <span class="n">OTHERS</span> <span class="k">THEN</span>
  <span class="k">END</span><span class="p">;</span>
<span class="k">END</span> <span class="n">LOOP</span><span class="p">;</span>
<span class="c1">--- Clear the list of accessed tables because those tables are now empty.</span>
<span class="k">EXECUTE</span> <span class="s1">'DELETE FROM test_access'</span><span class="p">;</span>
<span class="c1">--- Turn foreign key constraints back on.</span>
<span class="k">EXECUTE</span> <span class="s1">'SET session_replication_role = </span><span class="se">''</span><span class="s1">origin</span><span class="se">''</span><span class="s1">;'</span><span class="p">;</span>
<span class="k">RETURN</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">END</span> <span class="err">$$</span> <span class="k">LANGUAGE</span> <span class="n">plpgsql</span><span class="p">;</span>
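</code></pre></div></div> <p>Putting the pieces together, a full track-and-clean cycle could look like the sketch below. <code class="language-plaintext highlighter-rouge">demo_orders</code> is a hypothetical table created only for this walkthrough, and the session is assumed to be allowed to change <code class="language-plaintext highlighter-rouge">session_replication_role</code>, which typically means a superuser in a throwaway test database.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical table, created only for this walkthrough.
CREATE TABLE IF NOT EXISTS demo_orders(id serial PRIMARY KEY, note text);
SELECT setup_access_triggers(ARRAY['public']);

-- Simulate a test writing some data; the trigger records the table in test_access.
INSERT INTO demo_orders(note) VALUES ('first'), ('second');
SELECT table_name FROM test_access;

-- Simulate the cleanup that runs between tests.
SELECT delete_from_accessed_tables();

-- Both the tracked table and the tracking list should now be empty.
SELECT count(*) FROM demo_orders;
SELECT count(*) FROM test_access;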
</code></pre></div></div> <h4 id="embedding-into-tests">Embedding into Tests</h4> <p>We have developed an interface / trait called <code class="language-plaintext highlighter-rouge">CleanDBBetweenTests</code> and every integration test in our system extends this trait. Inside this trait, we have set up before and after hooks to ensure our tables are cleaned.</p> <div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">clearAccessedTables</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
  <span class="nf">finishOperation</span><span class="o">(</span><span class="n">sql</span><span class="s">"""SELECT public.delete_from_accessed_tables()"""</span><span class="o">.</span><span class="py">as</span><span class="o">[</span><span class="kt">Int</span><span class="o">])</span>
<span class="o">}</span>

<span class="k">def</span> <span class="nf">setupTestTriggers</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
  <span class="nf">finishOperation</span><span class="o">(</span><span class="n">sql</span><span class="s">"""SELECT public.setup_access_triggers(array['test_schema'])"""</span><span class="o">.</span><span class="py">as</span><span class="o">[</span><span class="kt">Int</span><span class="o">])</span>
<span class="o">}</span>

<span class="k">trait</span> <span class="nc">CleanDBBetweenTests</span> <span class="k">extends</span> <span class="nc">BeforeAndAfterEach</span> <span class="k">with</span> <span class="nc">BeforeAndAfterAll</span> <span class="o">{</span> <span class="k">this:</span> <span class="kt">Suite</span> <span class="o">=&gt;</span>
  <span class="k">override</span> <span class="k">def</span> <span class="nf">beforeAll</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
    <span class="nf">setupTestTriggers</span><span class="o">()</span>
    <span class="nf">clearAccessedTables</span><span class="o">()</span>
  <span class="o">}</span>
  <span class="k">override</span> <span class="k">def</span> <span class="nf">beforeEach</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
    <span class="nf">clearAccessedTables</span><span class="o">()</span>
  <span class="o">}</span>
  <span class="k">override</span> <span class="k">def</span> <span class="nf">afterAll</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
    <span class="nf">clearAccessedTables</span><span class="o">()</span>
  <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div> <h2 id="results">Results</h2> <p>Using this approach, we were able to cut our CI/CD times by 30%. The speed increase and better isolation greatly improved our developer experience. We have never had issues with our table cleaning approach since we first rolled out this tool. As our codebase keeps growing, without this change, our current CI runtime would be more than 1.5 hours by now. Speeding up our CI times didn’t only decrease our bills, it also motivated people to write more code and tests, as the PR feedback cycle became much quicker.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/psql-2-480.webp 480w,/assets/img/posts/psql-2-800.webp 800w,/assets/img/posts/psql-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/psql-2.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p><strong>Future Work: </strong>Exploring strategies to support constant rows that would stay during all execution cycles, as well as setting up scenarios. Moreover, <code class="language-plaintext highlighter-rouge">UNLOGGED TABLE</code>s can potentially speed up the execution even further.</p> <p><strong>Last words… </strong>I have decided that I should open-source this tool so everyone can benefit from it. Your feedback is very valuable, please let me know what you think.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[We have been using PostgreSQL as our primary database in production for over 4 years, however over time, as our database grew bigger and reached over 500 tables in a single monolithic application, we had to come up with smart ways to manage it. 
PostgreSQL is a database that is capable of handling hundreds of tables and billions of rows, however it doesn't necessarily mean it will be easy to develop applications in such a setting. In this post, I am going to write down how I have tackled some bottlenecks in the integration testing pipeline at Carbon Health by speeding up and increasing isolation of our integration test pipelines. The solution has powered our CI/CD pipelines for the last 2 years.]]></summary></entry><entry><title type="html">Behind the 6-digit code: Building HOTP and TOTP from scratch</title><link href="https://dogac.dev/blog/2025/how-do-one-time-passwords-work/" rel="alternate" type="text/html" title="Behind the 6-digit code: Building HOTP and TOTP from scratch"/><published>2025-04-11T00:00:00+00:00</published><updated>2025-04-11T00:00:00+00:00</updated><id>https://dogac.dev/blog/2025/how-do-one-time-passwords-work</id><content type="html" xml:base="https://dogac.dev/blog/2025/how-do-one-time-passwords-work/"><![CDATA[<p>A while ago, I started working on authorization and authentication at work. This taught me a lot about how modern authentication systems work. However, I have always thought One-Time Password logins are the most mystical ones: a six-digit code that changes every time and can be used to verify your identity. How does the server know the newly generated one, and how is it really secure? 
In this post, I will explain what HOTP and TOTP are and how they work by sharing my own implementation from scratch.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/6digit-0-480.webp 480w,/assets/img/posts/6digit-0-800.webp 800w,/assets/img/posts/6digit-0-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/6digit-0.png" class="img-fluid rounded z-depth-1" width="500px" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <h1 id="what-are-otps"><strong>What Are OTPs</strong>?</h1> <p>One-Time Passwords (OTPs) are a widely-used form of authentication. You’ve likely encountered them when using a “Secure Login” app like Google Authenticator, or during a “Forgot Password” flow where a temporary code is sent to your email or phone.</p> <p>Unlike traditional passwords, OTPs are only valid for a single use or a limited time window. This greatly reduces the risk of password replay attacks, where someone captures the password used to log in and tries to reuse it.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/6digit-1-480.webp 480w,/assets/img/posts/6digit-1-800.webp 800w,/assets/img/posts/6digit-1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/6digit-1.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 600px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Passwords can be used repeatedly. When leaked, malicious actors can impersonate the user and access critical information.</figcaption> </figure> </div> <p>Like the traditional password authentication approach, the user and the authority (server) still need to agree on a common secret key. 
During regular password authentication, this secret key is communicated directly to the authority. There are many ways of doing this safely, such as hashing the password or sending it over an encrypted network. However, the risk still exists because the password itself never changes: as long as we type our passwords on our devices, malicious actors have some way to watch and capture them before they reach the network.</p> <p>So instead of using a constant secret key, we can use something dynamic that changes over time. As a simple example, assume that when those two people first met, they set their secretly hidden clocks to a random time together.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/6digit-2-480.webp 480w,/assets/img/posts/6digit-2-800.webp 800w,/assets/img/posts/6digit-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/6digit-2.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 600px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Using secret clocks as a basic OTP implementation</figcaption> </figure> </div> <p>In some cases, such as password recovery, we can also use a secret clock. This secret clock is not shared with the user directly; instead, the one-time password generated by the server is sent to the user via a trusted medium, such as email.</p> <p><em><strong>Edit</strong>: Several readers have warned me that it is much easier to generate random numbers instead. The server also has to store the number of attempts to make sure the code is not brute forced.</em></p> <p>Obviously a clock on its own is not secure, as in this example Plankton could have predicted the time-shift of the secret clock based on the real time. 
However, for the sake of this example, I wanted to show how copying the "password" is not enough on its own. Let's take a look at some strategies to build this "secret clock" and make sure it is not possible to predict the time just by knowing a single code at some point in time.</p> <p>There are two common types of OTP algorithms:</p> <ul> <li><strong>HOTP (HMAC-based One-Time Password)</strong> – based on a counter that increments every time an OTP is requested.</li> <li><strong>TOTP (Time-based One-Time Password)</strong> – based on the current time, typically using 30-second intervals.</li> </ul> <p>These methods are standardized in <a href="https://www.rfc-editor.org/rfc/rfc4226" rel="noreferrer">RFC 4226</a> (for HOTP) and <a href="https://www.rfc-editor.org/rfc/rfc6238" rel="noreferrer">RFC 6238</a> (for TOTP), and are used in many modern 2FA (two-factor authentication) implementations.</p> <p>A counter-based password method is easier to understand. Imagine two people met and generated a totally random series of numbers. They both start counting from 0, and on each attempt the user sends the server the number at the current index. However, this comes with several problems:</p> <ol> <li>Clients need to keep their counter in sync with the server; if there is a skew, they might get temporarily locked out.</li> <li>Malicious actors can collect upcoming login codes by phishing the user, and those codes stay usable for a long time.</li> </ol> <p>Therefore, instead of storing a counter, we can use the current time as the counter. That's how TOTP works. 
Using time makes synchronization easier, as most modern machines already sync their clocks using technologies such as NTP. It also prevents malicious actors from harvesting codes, as each code is valid only for the next 30 seconds or so, not for a long sequence of future login attempts.</p> <h1 id="how-to-generate-totps">How to Generate TOTPs?</h1> <p>The analogy of two people who met and agreed on a totally random series of numbers is partially realistic. However, it is not feasible to keep such a huge list; you would potentially need millions of secret numbers to support OTPs for a reasonable time. Therefore we should use a cryptographically secure algorithm that generates values based on a secret key. It is important that this algorithm is deterministic rather than random, as both the user and the authority hold a copy of the secret key and must be able to generate the same value given the same time.</p> <p>We introduced HOTP first because TOTP is actually built on top of HOTP. Instead of using a stored counter, TOTP uses the time as the current counter. We can write the following formula to find the counter at any given time,</p> \[c(t) = \left\lfloor \frac{t - t_0}{X} \right\rfloor\] <p>Here $t_0$ is the starting time; in most systems this is the UNIX epoch, 1 January 1970. $X$ is the period after which you want the code to rotate. For example, if you want the login code to change every 30 seconds, $X$ should be 30 seconds.</p> <h1 id="how-to-actually-generate-hotps">How to <em>Actually</em> Generate HOTPs?</h1> <p>In order to generate an HOTP, you need to decide on three things:</p> <ol> <li>A secret key</li> <li>A hash function</li> <li>Number of digits you will output</li> </ol> <p>First, we need to bring our secret key to the block size of the hash function. For example, if we have chosen <code class="language-plaintext highlighter-rouge">SHA-1</code> as our hashing algorithm, the block size is 64 bytes (the digest itself is only 20 bytes). 
If the secret key is shorter than 64 bytes, we can just pad it with zeroes. If it is longer than 64 bytes, given $K$ is our secret key and $H$ is our hashing algorithm, we hash it first and then zero-pad the digest to 64 bytes,</p> \[K_{pad} = H(K)\] <p>Next we XOR the padded key with two pre-defined magic constants, $I_{pad}$ and $O_{pad}$.</p> \[\begin{align} I_{pad} &amp;= [\texttt{0x36}, \dots] \\ O_{pad} &amp;= [\texttt{0x5c}, \dots] \end{align}\] <p>Those numbers were originally chosen by the HMAC designers and any pair where $I_{pad} \neq O_{pad}$ could have been chosen. Their length should also be 64 bytes, same as our hashing algorithm’s block size. Then we define the famous $\text{HMAC}$, Hash-based Message Authentication Code, function as in <a href="https://www.rfc-editor.org/rfc/rfc2104">RFC 2104</a>. It outputs a cryptographic hash calculated using the given key and message, where $+$ denotes byte concatenation.</p> \[\text{HMAC}(K, M) = H((K_{pad} \oplus O_{pad}) + H((K_{pad} \oplus I_{pad}) + M))\] <p>This cryptographic hash function is secure, so an attacker can’t infer the secret key $K_{pad}$ even if they know $M$ and the resulting hash.</p> <p>Next we define a new function to generate a 4-byte result. Here is the definition of DT from the original RFC,</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    DT(String) // String = String[0]...String[19]
     Let OffsetBits be the low-order 4 bits of String[19]
     Offset = StToNum(OffsetBits) // 0 &lt;= OffSet &lt;= 15
     Let P = String[OffSet]...String[OffSet+3]
     Return the Last 31 bits of P
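
     // Worked example, using the sample HMAC-SHA-1 value from RFC 4226, Section 5.4:
     //   String     = 1f 86 98 69 0e 02 ca 16 61 85 50 ef 7f 19 da 8e 94 5b 55 5a
     //   String[19] = 0x5a, so OffsetBits = 0xa and Offset = 10
     //   P          = String[10]...String[13] = 0x50ef7f19
     //   Last 31 bits of P = 0x50ef7f19 = 1357872921 (the top bit is already 0)
     //   1357872921 mod 10^6 = 872921, which is the 6-digit code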
</code></pre></div></div> <p>This function allows us to shrink our 20-byte input to 4 bytes dynamically: the low-order 4 bits of the last byte give an offset, and we take the 4 bytes starting at that offset. The outputs of DT on distinct counter inputs are uniformly and independently distributed.</p> <p>Finally, we can define our HOTP function as,</p> \[\text{HOTP}(K,C) = \text{DT}(\text{HMAC}(K,C)) \bmod 10^{\text{digits}}\] <p>Here we can replace our counter $C$ with $c(t)$ to get a TOTP code.</p> <h1 id="final-remarks">Final Remarks</h1> <p>There are many online resources about TOTP and HOTP, however I struggled to find a website that helped me check my implementation, as their secret-key representations were not standardized. Thus, I have published my own short demo app.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/6digit-3-480.webp 480w,/assets/img/posts/6digit-3-800.webp 800w,/assets/img/posts/6digit-3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/6digit-3.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p><img src="/assets/img/posts/favicon.ico" alt=""/></p> <p>I have published this app on my website and also on GitHub; the implementation uses Kotlin.</p> <ul> <li>Link to the app: <a href="https://otp.dogac.dev/">https://otp.dogac.dev/</a></li> <li>Link to the GitHub repository: <a href="https://github.com/dogacel/otp-server">github.com/Dogacel/otp-server</a></li> </ul> <p><strong>To recap:</strong> We’ve looked at how HOTP and TOTP work, explored how they're derived from HMAC, and seen how the server and client can generate matching codes without ever transmitting the secret key itself.</p> <p>Working on this project helped me understand how OTPs work at a much deeper 
level. What once felt like magic now feels like elegant design.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A while ago, I have started working on authorization and authentication at work. This taught me a lot about how modern authentication systems work. However I have always thought One-Time Password logins are the most mystical ones. A six-digit code that changes every time and can be used to verify your identity. How does the server know the newly generated one, and how is it really secure? In this post, I will explain what HOTP and TOTP are and how they work by sharing my own implementation from scratch.]]></summary></entry><entry><title type="html">On Decidability of Our Jobs and AI Replacing Software Engineers</title><link href="https://dogac.dev/blog/2025/view-on-ai-in-2025/" rel="alternate" type="text/html" title="On Decidability of Our Jobs and AI Replacing Software Engineers"/><published>2025-04-03T00:00:00+00:00</published><updated>2025-04-03T00:00:00+00:00</updated><id>https://dogac.dev/blog/2025/view-on-ai-in-2025</id><content type="html" xml:base="https://dogac.dev/blog/2025/view-on-ai-in-2025/"><![CDATA[<p>Among all the occupations AI could replace, why are we focusing so much on <em>Engineering</em> jobs that require such expertise? I'm well aware of the quality of the code AI writes; it is beyond useful, but I don't see a world where that piece of code can find its way into the real world without the help of a software engineer. First, I would like to talk about the kind of jobs that I think AI will replace first and how AI can't replace Software Engineers any time soon.</p> <p>I would like to use the Turing-completeness analogy to describe jobs (a system that can simulate any computation). I think there are two categories of jobs, <strong><em>decidable</em></strong> and <strong><em>undecidable</em></strong>. 
In the traditional sense, a <em>decidable problem</em> can be solved by a well-defined algorithm that always halts with a correct answer. Translating this to the world of work, a <em>decidable job</em> has well-defined inputs and a finite set of outputs, so it can be fully automated or scripted. Jobs that are mostly <em>decidable</em> are most prone to being replaced by AI. On the other hand, an <em>undecidable job</em> is open-ended. Given a problem, there is no guaranteed algorithm that always gives you a solution, or even tells you if a solution exists. An example of a <em>decidable job</em> could be a customer support agent. Even though your input set is not well-defined, as it is usually text written by a human (probably the only reason why this job still exists today), your possible actions are all documented. On the other hand, an engineering job can be described as <em>undecidable</em>: build a system that scales and supports X features under Y constraints. Arguably most of your job is to figure out <em>how</em> and to plan the process rather than the actual implementation. Take construction engineers for example: their primary duty is to come up with a plan rather than to carry the material that is required to build.</p> <p>One might say, in this context every job can be defined as either <em>decidable</em> or <em>undecidable</em> based on the job description. It's also fair to say each job has some <em>decidable</em> factor and some <em>undecidable</em> factor. For example, a customer support agent might have creating new workflows as a part of their duty rather than only using pre-existing procedures, which makes it less <em>decidable</em>. On the other hand, an engineer can have a job where the only expectation is to transform some data from one format to another. Therefore it is not always possible to classify an occupation entirely as <em>decidable</em> or <em>undecidable</em>. 
Here is a key take-away: <strong>the definition of a job ultimately determines its decidability</strong>. We create jobs to solve problems and there are infinitely many ways to define those jobs. If we are able to define those jobs with a clear separation of <em>decidable</em> and <em>undecidable</em>, we can easily replace the <em>decidable</em> part with AI.</p> <p>However this will give birth to new jobs where the person's primary function is to split a job into a <em>decidable</em> and an <em>undecidable</em> factor. We can think of them as <strong><em>AI-integration engineers</em></strong>. Their primary function is to extract the <em>decidable</em> factor from the <em>undecidable</em> factor. Since the process of extracting <em>decidable</em> from <em>undecidable</em> is <em>undecidable</em> (the classic halting problem), it is fair to say their jobs are secure. I do believe software engineering overlaps quite a lot with the definition of extracting the <em>decidable</em>. Not just software engineers, but most engineering jobs have this function, where the engineer's primary function is to create individual units of work, each of which can be executed autonomously. It's almost like we have defined what engineering is…</p> <p>As programming languages have a formal spec, their syntax is <em>decidable</em>. I think this is one of the pitfalls that make people think software engineering is going to vanish. However, as we have discussed, a software engineer's function is absolutely not <em>decidable</em>. Furthermore, it is often said that only around one third of a Software Engineer's duties consist of writing code. So, it is fair to say <em>most</em> software engineers' jobs are safe. Only a small portion of their job can be replaced by AI: the syntax of their programming languages. Note that it is not possible to generalize this to all software engineers, as some engineers might find their jobs to be more <em>decidable</em> than others. 
But you don't really need AI to replace those types of Software Engineers; software can replace developers on its own. We have been seeing this shift for a while, from "punch card punchers" being replaced by terminal emulators to static website generators allowing non-programmers to create websites. However this didn't end the Web Developers' jobs but rather pushed them towards building more advanced frameworks and tackling harder problems. So ultimately, my takeaway is that AI will help us eliminate the <em>decidable</em> part of our jobs faster than ever, which is usually the most boring and uninspiring part anyway. It will allow us to spend more time on tinkering and building more advanced tools.</p> <h2 id="final-remarks">Final Remarks</h2> <p>I have intentionally tried to keep this article short, as there is much more to say about software engineering. The article "<a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;arnumber=10705752">AI Over-Hype: A Dangerous Threat (and How to Fix It)</a>" motivated me to write this post, as it urges professionals to rally against remarks like "AI will write all the code" (another shoutout to Anthropic's CEO). It dives much deeper into the topic of software and AI, and supports its arguments with empirical data. Also a great blog post from Alperen, "<a href="https://alperenkeles.com/posts/verifiability-is-the-limit/">Verifiability is the Limit</a>", dives much deeper into software engineering. It discusses the pitfalls of AI on correctness and verifiability in relation to software engineering, which inspired me to come up with the analogy of <em>Turing Completeness</em> in terms of job functions. Finally, David Graeber's "Bullshit Jobs" is a must-read for broader context. 
It's not meaningful to discuss which jobs can be replaced and which can't without really understanding their functions and why they exist in the first place.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Among all the occupations AI could replace, why are we focusing so much on Engineering jobs that require such expertise? I'm well aware of the quality of the code AI writes; it is beyond useful, but I don't see a world where that piece of code can find its way into the real world without the help of a software engineer. First, I would like to talk about the kind of jobs that I think AI will replace first and how AI can't replace Software Engineers any time soon.]]></summary></entry><entry><title type="html">Supercharge Your Home Cluster Using Cloudflare Tunnel</title><link href="https://dogac.dev/blog/2025/cloudflare-tunnel/" rel="alternate" type="text/html" title="Supercharge Your Home Cluster Using Cloudflare Tunnel"/><published>2025-03-29T00:00:00+00:00</published><updated>2025-03-29T00:00:00+00:00</updated><id>https://dogac.dev/blog/2025/cloudflare-tunnel</id><content type="html" xml:base="https://dogac.dev/blog/2025/cloudflare-tunnel/"><![CDATA[<p>I'm a big fan of self-hosting and DIY. Since writing my previous <a href="/about-self-hosting/" rel="noreferrer">blog post about my self-hosting journey</a>, I have learned some exciting new things that I want to share with you. First, I’ll explain my initial server setup. Then, I’ll discuss why I looked for an alternative, and finally, I’ll show how Cloudflare Tunnels helped me achieve my goals.</p> <h2 id="the-problem">The Problem</h2> <p>If you are like me and you are hosting your website on your own home-cluster, there is some configuration you have to do to ensure you are not exposing your devices to the internet insecurely. 
Secondly, you will soon realize your home-cluster is not accessible from inside your home (private network) in the same way as it is for someone outside your network.</p> <h2 id="initial-setup">Initial Setup</h2> <p>My home-server runs on Proxmox and it exposes a lightweight Alpine LXC (Linux Container) to handle external traffic. I’ve deliberately disabled SSH on this container for extra security; I can only connect to its TTY via the Proxmox console. Instead of port-forwarding the services I want to expose one-by-one, I deliberately placed that container in a DMZ, so I don't need to configure a port-forward every time I need one (shout-out to Xfinity for making port forwards extra difficult).</p> <p>I currently host many things in my home-cluster. Some of the applications run on a Kubernetes cluster and some of them run as standalone Docker containers because I was too lazy to move them. I have my Blog, a FreshRSS instance to manage my RSS subscriptions, a general-purpose PostgreSQL instance to collect data from experiment runs for research projects, a Minecraft server to play with my friends, a Grafana dashboard to visualize different kinds of data and set various alerts, such as SSL certificate expiration of my website, an InfluxDB instance to collect sensor data from my house and many more.</p> <p>My main Kubernetes cluster runs on MicroK8s. It has MetalLB in front, and I can set up ingress rules to forward traffic to different applications. However this is not enough on its own, because there are other applications outside Kubernetes that I need to expose. Therefore I have decided to put everything behind HAProxy.</p> <pre><code class="language-cfg">defaults
  mode http
  timeout client 60s
  timeout connect 30s
  timeout server 60s
  timeout http-request 60s

frontend .dogac.dev
  mode http
  bind :443 ssl crt /root/haproxy/all.pem

  acl is_freshrss hdr(host) -i freshrss.dogac.dev
  use_backend fresh-rss if is_freshrss

  acl is_blog hdr(host) -i blog.dogac.dev
  use_backend ghost-blog if is_blog

  acl is_grafana hdr(host) -i grafana.dogac.dev
  use_backend main-cluster if is_grafana

  acl is_healthcheck hdr(host) -i health.dogac.dev
  use_backend main-cluster if is_healthcheck

  acl is_otp hdr(host) -i otp.dogac.dev
  use_backend main-cluster if is_otp

  default_backend ghost-blog


backend fresh-rss
  mode http
  option forwardfor
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  server container-master 10.0.0.X:X

backend ghost-blog
  mode http
  option forwardfor
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  server container-master 10.0.0.X:X

backend main-cluster
  mode http
  option forwardfor
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  server main-cluster 10.0.0.X:X ssl verify none

frontend psql-fe
  mode tcp
  bind :X
  default_backend psql-be

backend psql-be
  mode tcp
  server psql 10.0.0.X:X

frontend minecraft-server
  mode tcp
  bind :X
  default_backend minecraft-sv

backend minecraft-sv
  mode tcp
  server game-server 10.0.0.X:X
</code></pre> <p>By using this configuration, I am able to forward traffic coming from different sub-domains to different applications and potentially to Kubernetes. Just to make everything extra secure, I did not forward unknown domains / subdomains to Kubernetes, so I wouldn’t accidentally expose something.</p> <p>I still had a couple of additional issues,</p> <ol> <li>I don't own a static IP address</li> <li>SSL certificates rotate every 3 months</li> <li>I expose my home IP address directly to the domain provider</li> </ol> <p>For number 1, I used <a href="https://ddclient.net/">DDClient</a> to regularly push my current IP address to my domain provider, Porkbun. As I don't provide any availability SLAs for my personal website and my IP address doesn't change often, it works fine.</p> <p>For number 2, I have created a small script to automatically update my HAProxy certificates from Porkbun, which I want to share with you. I ran this script a week before my SSL certs expired (my Grafana instance reminds me). Alternatively I could have scheduled this to be a weekly job.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/ash</span>

<span class="nb">set</span> <span class="nt">-eo</span> pipefail

<span class="nv">apikey</span><span class="o">=</span>X
<span class="nv">secretapikey</span><span class="o">=</span>X
<span class="nv">domainname</span><span class="o">=</span>X

<span class="nv">resp</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-s</span> <span class="nt">-X</span> POST https://api.porkbun.com/api/json/v3/ssl/retrieve/<span class="nv">$domainname</span> <span class="nt">-d</span> <span class="s2">"{</span><span class="se">\"</span><span class="s2">secretapikey</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="nv">$secretapikey</span><span class="se">\"</span><span class="s2">, </span><span class="se">\"</span><span class="s2">apikey</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="nv">$apikey</span><span class="se">\"</span><span class="s2">}"</span> | jq<span class="si">)</span>

<span class="nv">result</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$resp</span><span class="s2">"</span> | jq <span class="nt">-r</span> <span class="s1">'.status'</span><span class="si">)</span>

<span class="k">if</span> <span class="o">[[</span> <span class="s2">"SUCCESS"</span> <span class="o">!=</span> <span class="s2">"</span><span class="nv">$result</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"Not successful result: </span><span class="nv">$resp</span><span class="s2">"</span>
    <span class="nb">exit </span>1
<span class="k">fi

</span><span class="nv">chain</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$resp</span><span class="s2">"</span> | jq <span class="nt">-r</span> <span class="s1">'.certificatechain'</span><span class="si">)</span>
<span class="nv">privatekey</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$resp</span><span class="s2">"</span> | jq <span class="nt">-r</span> <span class="s1">'.privatekey'</span><span class="si">)</span>

<span class="c"># Keep the previous bundle as a backup; don't abort under `set -e` if there is none yet</span>
<span class="nb">mv </span>all.pem old.pem 2&gt;/dev/null <span class="o">||</span> <span class="nb">true</span>

<span class="nb">echo</span> <span class="s2">"</span><span class="nv">$chain</span><span class="s2">"</span> <span class="o">&gt;&gt;</span> all.pem
<span class="nb">echo</span> <span class="s2">"</span><span class="nv">$privatekey</span><span class="s2">"</span> <span class="o">&gt;&gt;</span> all.pem

<span class="nb">echo</span> <span class="s2">"Done!"</span>
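
<span class="c"># Sketch, not part of the original script: to run this weekly instead of by hand,</span>
<span class="c"># a crontab entry like the one below would do. The script path is hypothetical.</span>
<span class="c">#   0 3 * * 1 /root/haproxy/renew-certs.sh</span>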
</code></pre></div></div> <p>For number 3, I did not have many options with my current setup. As far as I know, Porkbun doesn't have a direct way to hide my IP address and I don't want to pay monthly for a proxy server.</p> <h2 id="access-from-home">Access from Home</h2> <p>My ISP and router don't allow NAT Loopback, meaning that I can't access my own network using its external IP while I am on the internal network. You might ask, why do I need that? For example, when I try to visit my website while I am at home, I can't access it because its domain name resolves to my external IP and my router doesn't allow it. There are a couple of ways around it but none of them are perfect,</p> <ol> <li><strong>Change your router to support NAT Loopback</strong>. Firstly, it is not guaranteed that it will work and secondly, it costs money and additional maintenance.</li> <li><strong>Update <code>/etc/hosts</code></strong>. This technique works locally but you have to remember to update your hosts file every time you switch between an external network and the internal network. Also, it needs to be configured per device. I am not sure if there is an equivalent way for my iPhone for example. Also you might face SSL certificate issues.</li> <li><strong>Update Router's DNS records</strong>. As I have stated before, I don't think it is possible in my case and I don't want to deal with the complexity of an additional DNS server.</li> <li><strong>Use server IP directly</strong>. My LB forwards traffic based on domain names and some services are configured to only listen to those domains. Also I have to switch to the domain name when I am on an external network. I can't change the server address each time for every application I use, such as my RSS reader, <em>NetNewsWire</em>.</li> <li><strong>Use a Proxy server</strong>. My motivation is to not pay for an additional server and maintain it. However, Cloudflare provides a free solution that you can use. 
Let's explore it.</li> </ol> <h2 id="where-cloudflare-tunnels-shine">Where Cloudflare Tunnels Shine</h2> <p>In the previous sections, I have explained why a Proxy Server increases the security of your home cluster and helps you unify access from internal and external networks. So I started searching for a free proxy alternative, however you shouldn't really trust a free proxy server. Previously I had set up my own proxy using Squid on AWS Lightsail, but I wasn't happy with its performance. Even if you don't mind paying for an additional server you still have to maintain it.</p> <p>After a careful investigation, I found Cloudflare Tunnel. From <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/">Cloudflare's website,</a></p> <blockquote> <p>Cloudflare Tunnel provides you with a secure way to connect your resources to Cloudflare without a publicly routable IP address. With Tunnel, you do not send traffic to an external IP — instead, a lightweight daemon in your infrastructure (<code class="language-plaintext highlighter-rouge">cloudflared</code>) creates outbound-only connections to Cloudflare's global network</p> </blockquote> <p>This seemed like the perfect solution for me. First of all, I followed their docs to move my DNS nameservers from Porkbun to Cloudflare. Then I downloaded the <code class="language-plaintext highlighter-rouge">cloudflared</code> binary onto my server.</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-L</span> https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 <span class="nt">-o</span> /usr/bin/cloudflared
<span class="nb">chmod</span> +x /usr/bin/cloudflared
cloudflared <span class="nt">--help</span>
</code></pre></div></div> <p>Next, I created a cloudflared tunnel and installed it as a service. The service keeps the tunnel running, so my certificates stay up to date and traffic keeps reaching my server even when my external IP changes.</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cloudflared tunnel create home-server
cloudflared tunnel route dns home-server <span class="k">*</span>.dogac.dev
cloudflared tunnel route dns home-server dogac.dev
cloudflared service <span class="nb">install</span>
</code></pre></div></div> <p>Initially I thought about replacing HAProxy with cloudflared's ingress configuration. However, I concluded that it would be a lot of effort and a step backward from my current setup. So instead, I have decided to forward all traffic coming to my domain directly to HAProxy without any additional configuration.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">tunnel</span><span class="pi">:</span> <span class="s">home-server</span>
<span class="na">credentials-file</span><span class="pi">:</span> <span class="s">/root/.cloudflared/XXX.json</span>

<span class="na">ingress</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">service</span><span class="pi">:</span> <span class="s">http://localhost:8443</span>
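
# Sketch only (an assumption, not the configuration used here): replacing HAProxy
# with cloudflared's own ingress rules would mean one rule per hostname, e.g.
#
# ingress:
#   - hostname: freshrss.dogac.dev
#     service: http://10.0.0.X:X
#   - service: http_status:404   # the last rule must be a catch-all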
</code></pre></div></div> <p>I also had to change my HAProxy config. I removed the SSL certificate from the frontend, as Cloudflare now handles SSL automatically, and switched the bind port from 443 to 8443, which is the port the tunnel forwards to.</p> <pre><code class="language-cfg">frontend .dogac.dev
  mode http
  bind :8443

  ...
</code></pre> <p>Also when you run <code class="language-plaintext highlighter-rouge">dig</code> queries on my domain now, you will see that my home IP address is hidden and instead it shows Cloudflare's IP addresses. Now I can also benefit from Cloudflare's analytics on top of Google Analytics.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/cloudflare-dashboard-480.webp 480w,/assets/img/posts/cloudflare-dashboard-800.webp 800w,/assets/img/posts/cloudflare-dashboard-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/cloudflare-dashboard.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>And most importantly, I am able to visit my website directly from its domain, <code class="language-plaintext highlighter-rouge">dogac.dev</code>, without needing any extra configuration. This allowed me to configure all my devices to directly use the domain address no matter which network I am connected to.</p> <h2 id="conclusion">Conclusion</h2> <p>Cloudflare Tunnel is a free and secure solution for hosting your home-server. The setup was pretty straightforward and it helped me secure my home-cluster while providing unified access both from my internal network and from external networks. Cloudflare has many other free features that I haven't had a chance to explore yet. I recommend it for any hobbyist that has a home-cluster. Let me know what you think about this post, and feel free to share any recommendations for my home cluster.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I'm a big fan of self-hosting and DIY. Since writing my previous blog post about my self-hosting journey, I have learned some exciting new things that I want to share with you. 
First, I’ll explain my initial server setup. Then, I’ll discuss why I looked for an alternative, and finally, I’ll show how Cloudflare Tunnels helped me achieve my goals.]]></summary></entry><entry><title type="html">I Like Self-Hosting</title><link href="https://dogac.dev/blog/2024/about-self-hosting/" rel="alternate" type="text/html" title="I Like Self-Hosting"/><published>2024-06-10T00:00:00+00:00</published><updated>2024-06-10T00:00:00+00:00</updated><id>https://dogac.dev/blog/2024/about-self-hosting</id><content type="html" xml:base="https://dogac.dev/blog/2024/about-self-hosting/"><![CDATA[<p>As I mentioned in my previous posts, I love open-source software. How do I prefer my open-source projects? Of course, self-hosted. I love self-hosting not because I save a ton of money on hosting or SaaS fees, but because it offers a fun and educational experience. However, in my professional life, I usually avoid self-hosting due to the responsibilities it entails and the total cost that accumulates over time. I understand it depends on individual circumstances, so it’s not right to generalize, but this has been my general experience.</p> <h2 id="where-it-all-began">Where It All Began</h2> <p>So, where did it all start? Probably, many of us have gone through a similar journey. I was just learning how to code and created a basic website using <em>Bootstrap</em> (thank you for teaching me what responsive design is). I had the webpage ready and needed to share it with my friends and family. "OK, here you go <code class="language-plaintext highlighter-rouge">C:/Users/Dogac Eldenk/Desktop/awesome_website.html</code>". You're saying it doesn't work? I installed something called <em>Apache</em>, now try <code class="language-plaintext highlighter-rouge">localhost:8000</code>. It still doesn't work? Oh, there are <em>internal</em> and <em>external</em> IP addresses. I found mine; here you go: <code class="language-plaintext highlighter-rouge">100.0.0.255</code>. 
It doesn't work either. Let's learn about <em>NAT</em>, <em>firewalls</em> and <em>port forwarding</em>. Oh, I have to pay my ISP just to forward my port? Fine, a couple bucks per month must be worth it (it was really hard to convince my dad back then). Finally, my website was <em>online</em>!</p> <p>But who could remember those digits to visit my website? I wanted a cool <em>domain name</em>, so I paid for one. OK, I finally got everything in place, I thought. My server goes offline when I turn off my computer? Do people really keep their computers on all day? I don't think so. I thought the magical answer to this question was <em>CPanel</em> hosting. It was super cheap to host my own website on CPanel; however, it had several limitations. It was constrained to PHP, HTML and MySQL in my case. So, what is the alternative? I found <em>Digital Ocean</em>. This website allowed me to host a <em>VPS</em> (Virtual Private Server) for $5 per month. As you might have guessed, I was broke because I paid yearly for hosting and a domain name, and could never afford a proper VPS for more than a month.</p> <p>Of course, things have changed quite fast in the last 10 years. When I went through this, I knew basically <em>nothing</em> about how servers work. I was taking the <em>shortest path</em> to getting a website online, only to get 3 clicks per month. Why am I telling this story? Because I have learned so many things just to showcase my website to my friends. I know this could have been a screenshot or screen recording, but where is the fun in that?</p> <h2 id="what-about-my-own-server">What About My Own Server?</h2> <p>So, at this point, I had a brief idea about how websites work. I also questioned how game servers worked when I wanted to spin up my own <em>Minecraft</em> server locally to play with my friends. Then I found out about <em>SBCs</em>, Single Board Computers! 
I bought my first <em>Raspberry Pi</em>, plugged it in and started tinkering. I learned so much when I first used it: Linux, Python, hardware, networking etc… I previously had an <em>Arduino</em> and an ESP-8266; I wrote C code to blink some LEDs and display some text on an old-school display. However, none of them were as capable as real computers. So, when I met the Raspberry Pi, I realized I could connect everything together and connect it to the internet.</p> <p>My first significant project was creating a <em>weather station</em>. I had some electronic components that measured temperature and humidity. I also had an RF radio module which communicated with the Raspberry Pi I set up in my room. Using this setup, I was running my weather station on an AA battery, which I prematurely optimized to run for years on a single battery even though I was going to use it for only a week and leave it. The station was reporting to my Raspberry Pi using RF and the Pi was hosting the data on my website online!</p> <p>Of course, this was a fun little project that taught me a lot. I was still unable to do more complex tasks, for example run a Minecraft server. Also, the ARM architecture was not so popular back then, so occasionally I hit a hard wall of "x86 only" applications. So, I never actually had a server that I could use generally; it was always limited in some way.</p> <p>When the COVID pandemic began, I had to suspend school and go back to live with my parents. I was bored for months; the only fun I had was playing Counter-Strike with my friends. So as a fun project, I spun up my own game server to play with my friends. To do this, I used my old laptop, which wasn't being used. I kept it on all day. However, it only lasted until summer because my room would get super hot, and the fan of that computer was giving me a hard time sleeping.</p> <p>Half a year later, I returned to college and started working part-time. 
With my first month's salary, I upgraded my desktop computer. During my internship in the UK, one of my colleagues, Aaron, showed me his home setup. He had a separate computer running <em>Unraid</em>. So I got inspired and thought I could use the remaining parts to create a home server like he did. I bought a wooden crate to put my computer parts in because I did not have an extra case.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/homeserver-480.webp 480w,/assets/img/posts/homeserver-800.webp 800w,/assets/img/posts/homeserver-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/homeserver.jpeg" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>I did not use <em>Unraid</em> for some reason. I installed Ubuntu on my computer (Pop OS!) and occasionally used the HDMI output to connect to my TV. For the first time ever, I had a computer at home that was capable of almost anything I could imagine, except for machine learning training for my coursework. I used this computer to host our final year project, game server, media station and even a personal cloud. I was so happy to be using this computer for my daily life.</p> <h2 id="what-about-now">What About Now?</h2> <p>Fast forward two and a half years: in the meantime, I had moved to the US. I had to leave a lot of stuff I owned back in my home in Turkey. During this period, I learned a lot about the cloud. However, I was still excited about having my own hardware. My first months in the US were rough, so I bought an Orange Pi 5 Pro and a Raspberry Pi 5 to spend some time with. 
I probably should have only bought the Raspberry Pi; however, I wanted the 16GB of RAM on my server and thought 8GB was not enough.</p> <p>It felt pretty nostalgic to be working with an SBC again after 10 years. The community was much bigger than before, and I knew much more about computers and software. I used those two SBCs to learn about <em>Kubernetes</em> and created my own Kubernetes cluster with them. I still have some fun projects in mind involving <em>Pi Hole</em>, personal <em>Grafana</em> dashboards, <em>CasaOS</em> and so on.</p> <p>Later on, I was frustrated by the fact that the tutorials I followed <em>still</em> did not publish Docker images for the ARM architecture. I wanted to do something more professional with x86, something I can use more generally, something with more memory and more importantly, disk space.</p> <p>I surfed through eBay to find refurbished computers. Then I found the <em>Dell OptiPlex 3060</em>. This computer offers so much for only $135: the specs are an i5 8500T, 32GB RAM and a 512GB SSD. It had twice the performance, double the memory, and much more disk capacity and speed.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/homeserver2-480.webp 480w,/assets/img/posts/homeserver2-800.webp 800w,/assets/img/posts/homeserver2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/homeserver2.jpeg" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>This time, I was not going to run barebones Linux on my server, so I installed <em>Proxmox</em>. Because I had so much memory and disk space, I could create multiple VMs for various tasks. First, I created 3 Ubuntu Server virtual machines. 
I used those virtual machines to learn about Kubernetes and created a cluster using <em>microk8s</em>.</p> <div class="d-flex justify-content-center"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/proxmox-480.webp 480w,/assets/img/posts/proxmox-800.webp 800w,/assets/img/posts/proxmox-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/proxmox.png" class="img-fluid rounded z-depth-1" width="auto" height="auto" style=" max-height: 400px; " loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <p>One day at work, I had to do some testing with an external integration and I needed a public HTTP server. Unfortunately, I was not able to forward ports to devices by IP address because the Xfinity app sucks. I went ahead and created a lightweight container using Proxmox without any VM and set up <em>HAProxy</em>. This was an opportunity for me to learn about HAProxy. I put this container in the DMZ and let it handle the traffic and forward it to my laptop. I also configured a Minecraft server and put it behind the load balancer.</p> <p>Moving forward, I decided to start a blog, the one that you are reading right now. I created this blog using <a href="https://ghost.org/">Ghost</a>, an open source blogging platform. I am self-hosting it in a virtual machine on the Dell machine. As always, it is a fun experience to go through the hassle of initial setup with software. You always learn something during the process.</p> <h2 id="getting-serious">Getting Serious</h2> <p>Now I want to get serious with my blog. As I post more stuff, I hope to get more traffic (more than 2 people per day). As I do this, availability is a concern. For example, yesterday, the electricity went off and my server shut down. The website was inaccessible for a whole day, and I did not realize it.</p> <p>I was at home, so I was able to recover my blog. 
However, I am planning to travel this summer to visit home, so I won't be able to do any emergency recovery in that case. Also, I am not backing up my blog yet. So I am thinking about moving my blog temporarily during summer and observing how it goes.</p> <p>I have a couple of alternatives: either keep self-hosting in the cloud with higher availability, or use a hosted version of Ghost. And I chose, of course, to keep self-hosting. I have done some cost analysis. To keep things simple, I have used AWS as the baseline. The server I need should be fairly minimal with a couple of gigs of RAM.</p> <ul> <li>ECS: Too expensive, not even close to EC2 for the smallest instance. 1 vCPU + 1 GB RAM for $30 per month.</li> <li>EC2: My choice of instance is t4g.micro with 2 vCPUs and 1GB RAM. It costs about $7 for a public IP, and I choose spot instances, so $3, adding up to about $10 per month.</li> <li>Amazon Lightsail: This is the most traditional approach, a VPS. The pricing for this class is also much more predictable as there aren't many moving pieces around. The same setup with 2 vCPUs and 1GB RAM is only $5 per month using IPv6.</li> </ul> <p>Currently, my choice is Amazon Lightsail. My website is super lightweight; I only need a couple of features. I am not even thinking about using AWS managed MySQL to manage my data. I am hosting Ghost and MySQL on the same instance. Note that MySQL can run on instances with 512MB RAM out of the box.</p> <p>For backing up and alerting, I am planning to set up some alerts that would send me an email regarding downtime on my server. I am also planning to back up my MySQL database to S3 daily or weekly. This seems like the cheapest and easiest option.</p> <h2 id="conclusion">Conclusion</h2> <p>Self-hosting is fun; it teaches you a lot. It doesn't matter if the thing you do is the most efficient or productive way possible. What matters is how you get there and what value it brings. 
I feel like trying to do things on my own was an important part of my early life, and it taught me much of what I know today. In this post, I have focused on <em>servers</em>; however, it doesn't have to be just servers. It can be writing your own X, where X is already a solved problem such as writing your own serialization format, implementing compression with <em>Huffman Encoding</em>, writing a Chess Engine, implementing Neural Networks from scratch, a custom JSON parser, and so on.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[As I mentioned in my previous posts, I love open-source software. How do I prefer my open-source projects? Of course, self-hosted. I love self-hosting not because I save a ton of money on hosting or SaaS fees, but because it offers a fun and educational experience. However, in my professional life, I usually avoid self-hosting due to the responsibilities it entails and the total cost that accumulates over time. I understand it depends on individual circumstances, so it’s not right to generalize, but this has been my general experience.]]></summary></entry><entry><title type="html">Building an Authorization Framework with Armeria - a Case Study</title><link href="https://dogac.dev/blog/2024/building-an-authorization-framework-with-armeria/" rel="alternate" type="text/html" title="Building an Authorization Framework with Armeria - a Case Study"/><published>2024-06-03T00:00:00+00:00</published><updated>2024-06-03T00:00:00+00:00</updated><id>https://dogac.dev/blog/2024/building-an-authorization-framework-with-armeria</id><content type="html" xml:base="https://dogac.dev/blog/2024/building-an-authorization-framework-with-armeria/"><![CDATA[<p>I was introduced to <em>Armeria</em> two years ago, in 2022. Since then, it has been my go-to framework for <em>JVM</em>-based projects. 
Recently, I had the chance at work to build some shared authorization code for our system, and I want to share how we built our authorization framework with Armeria by applying it to a theoretical scenario.</p> <h2 id="case-study-blog-application">Case Study: Blog Application</h2> <p>Let's start by describing a theoretical scenario. We have a blog website with members and authors. Members can subscribe to authors. Authors can write blog posts. Authors can change the visibility of each blog post to public, members-only or subscribers-only.</p> <p>The first issue is authentication. By today's standards, <em>OAuth2</em> tokens are a pretty common way to authenticate. Let's assume our application uses OAuth2 <em>JWT</em> tokens. Armeria lets us wrap our services using <em>Decorators</em>. Let's create a decorator that requires a valid OAuth2 token.</p> <figure class="kg-card kg-code-card"> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">val</span> <span class="py">ACCESS_TOKEN_KEY</span><span class="p">:</span> <span class="nc">AttributeKey</span><span class="p">&lt;</span><span class="nc">Token</span><span class="p">&gt;</span> <span class="p">=</span> <span class="nc">AttributeKey</span><span class="p">.</span><span class="nf">valueOf</span><span class="p">(</span><span class="s">"access_token"</span><span class="p">)</span>

<span class="kd">class</span> <span class="nc">RequireAccessToken</span> <span class="p">:</span> <span class="nc">DecoratingHttpServiceFunction</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">serve</span><span class="p">(</span>
        <span class="n">delegate</span><span class="p">:</span> <span class="nc">HttpService</span><span class="p">,</span>
        <span class="n">ctx</span><span class="p">:</span> <span class="nc">ServiceRequestContext</span><span class="p">,</span>
        <span class="n">req</span><span class="p">:</span> <span class="nc">HttpRequest</span><span class="p">,</span>
    <span class="p">):</span> <span class="nc">HttpResponse</span> <span class="p">{</span>
        <span class="c1">// Reject requests without an Authorization header instead of dereferencing null</span>
        <span class="kd">val</span> <span class="py">token</span><span class="p">:</span> <span class="nc">String</span> <span class="p">=</span> <span class="n">ctx</span><span class="p">.</span><span class="nf">request</span><span class="p">()</span>
                                <span class="p">.</span><span class="nf">headers</span><span class="p">()</span>
                                <span class="p">.</span><span class="k">get</span><span class="p">(</span><span class="s">"Authorization"</span><span class="p">)</span>
                                <span class="p">?.</span><span class="nf">removePrefix</span><span class="p">(</span><span class="s">"Bearer "</span><span class="p">)</span>
                                <span class="p">?:</span> <span class="k">return</span> <span class="nc">HttpResponse</span><span class="p">.</span><span class="nf">of</span><span class="p">(</span><span class="nc">HttpStatus</span><span class="p">.</span><span class="nc">UNAUTHORIZED</span><span class="p">)</span>

        <span class="kd">val</span> <span class="py">claims</span> <span class="p">=</span> <span class="nc">MyJWTVerifier</span><span class="p">.</span><span class="nf">validate</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>

        <span class="k">return</span> <span class="k">if</span> <span class="p">(</span><span class="n">claims</span><span class="p">.</span><span class="n">isValid</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">ctx</span><span class="p">.</span><span class="nf">setAttr</span><span class="p">(</span><span class="nc">ACCESS_TOKEN_KEY</span><span class="p">,</span> <span class="n">claims</span><span class="p">)</span>
            <span class="n">delegate</span><span class="p">.</span><span class="nf">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="nc">HttpResponse</span><span class="p">.</span><span class="nf">of</span><span class="p">(</span><span class="nc">HttpStatus</span><span class="p">.</span><span class="nc">UNAUTHORIZED</span><span class="p">)</span>
        <span class="p">}</span>
    <span class="p">}</span>

<span class="p">}</span>

</code></pre></div> </div> <figcaption> <p><span style="white-space: pre-wrap;">A decorator that mandates an access token</span></p> </figcaption> </figure> <p>This decorator does two things:</p> <ol> <li>Ensures a valid JWT token is present (the implementation of <code class="language-plaintext highlighter-rouge">MyJWTVerifier</code> is up to you).</li> <li>Injects the claims parsed from the JWT token into the request context.</li> </ol> <p>Number 1 is obviously required to make an endpoint protected. Number 2 will be used to authorize using actors and relations (<em>ABAC, RBAC</em>…) in an upcoming section. So, let's go ahead and apply this decorator.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Decorator</span><span class="p">(</span><span class="nc">RequireAccessToken</span><span class="o">::</span><span class="k">class</span><span class="p">)</span>
<span class="kd">class</span> <span class="nc">BlogPostController</span> <span class="p">{</span>

    <span class="nd">@Get</span><span class="p">(</span><span class="s">"/blog_posts"</span><span class="p">)</span>
    <span class="k">suspend</span> <span class="k">fun</span> <span class="nf">listBlogPosts</span><span class="p">():</span> <span class="nc">List</span><span class="p">&lt;</span><span class="nc">BlogPost</span><span class="p">&gt;</span> <span class="p">{</span> <span class="o">..</span><span class="p">.</span> <span class="p">}</span>

    <span class="nd">@Post</span><span class="p">(</span><span class="s">"/blog_posts"</span><span class="p">)</span>
    <span class="k">suspend</span> <span class="k">fun</span> <span class="nf">createBlogPost</span><span class="p">(</span><span class="n">body</span><span class="p">:</span> <span class="nc">CreateBlogPostBody</span><span class="p">):</span> <span class="nc">BlogPost</span> <span class="p">{</span> <span class="o">..</span><span class="p">.</span> <span class="p">}</span>
<span class="p">}</span>
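
<span class="c1">// A minimal wiring sketch, added for illustration and not part of the original post.</span>
<span class="c1">// Assumptions: the armeria and armeria-kotlin modules are on the classpath, and Server,</span>
<span class="c1">// HttpService and HttpResponse are imported from com.linecorp.armeria.server and</span>
<span class="c1">// com.linecorp.armeria.common. Port 8080 and the "/health" route are arbitrary choices.</span>
<span class="k">fun</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="kd">val</span> <span class="py">server</span> <span class="p">=</span> <span class="nc">Server</span><span class="p">.</span><span class="nf">builder</span><span class="p">()</span>
        <span class="p">.</span><span class="nf">http</span><span class="p">(</span><span class="mi">8080</span><span class="p">)</span>
        <span class="c1">// The class-level @Decorator on BlogPostController is applied automatically</span>
        <span class="p">.</span><span class="nf">annotatedService</span><span class="p">(</span><span class="nc">BlogPostController</span><span class="p">())</span>
        <span class="c1">// The same decorator can wrap a plain, non-annotated HttpService as well</span>
        <span class="p">.</span><span class="nf">service</span><span class="p">(</span><span class="s">"/health"</span><span class="p">,</span> <span class="nc">HttpService</span> <span class="p">{</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="p">-&gt;</span> <span class="nc">HttpResponse</span><span class="p">.</span><span class="nf">of</span><span class="p">(</span><span class="s">"ok"</span><span class="p">)</span> <span class="p">}.</span><span class="nf">decorate</span><span class="p">(</span><span class="nc">RequireAccessToken</span><span class="p">()))</span>
        <span class="p">.</span><span class="nf">build</span><span class="p">()</span>
    <span class="n">server</span><span class="p">.</span><span class="nf">start</span><span class="p">().</span><span class="nf">join</span><span class="p">()</span>
<span class="p">}</span>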
</code></pre></div></div> <p>Now, using the Access Token decorator, we require every request reaching our controller to carry a valid JWT token. Note that this is an annotated service; however, the same decorator will work for other <em>HTTP</em> services, including <em>gRPC</em> services.</p> <p>In this implementation, some requirements are not met. For example, there are blog posts that are publicly visible. To fix it, we need a graceful way to inject token metadata into the request context.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">MaybeAccessToken</span> <span class="p">:</span> <span class="nc">DecoratingHttpServiceFunction</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">serve</span><span class="p">(</span>
        <span class="n">delegate</span><span class="p">:</span> <span class="nc">HttpService</span><span class="p">,</span>
        <span class="n">ctx</span><span class="p">:</span> <span class="nc">ServiceRequestContext</span><span class="p">,</span>
        <span class="n">req</span><span class="p">:</span> <span class="nc">HttpRequest</span><span class="p">,</span>
    <span class="p">):</span> <span class="nc">HttpResponse</span> <span class="p">{</span>
        <span class="kd">val</span> <span class="py">token</span><span class="p">:</span> <span class="nc">String</span><span class="p">?</span> <span class="p">=</span> <span class="n">ctx</span><span class="p">.</span><span class="nf">request</span><span class="p">()</span>
                                <span class="p">.</span><span class="nf">headers</span><span class="p">()</span>
                                <span class="p">.</span><span class="k">get</span><span class="p">(</span><span class="s">"Authorization"</span><span class="p">)</span>
                                <span class="p">?.</span><span class="nf">removePrefix</span><span class="p">(</span><span class="s">"Bearer "</span><span class="p">)</span>

        <span class="c1">// No header means an anonymous request, so verification is skipped</span>
        <span class="kd">val</span> <span class="py">claims</span> <span class="p">=</span> <span class="n">token</span><span class="p">?.</span><span class="nf">let</span> <span class="p">{</span> <span class="nc">MyJWTVerifier</span><span class="p">.</span><span class="nf">validate</span><span class="p">(</span><span class="n">it</span><span class="p">)</span> <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">claims</span> <span class="p">!=</span> <span class="k">null</span> <span class="p">&amp;&amp;</span> <span class="n">claims</span><span class="p">.</span><span class="n">isValid</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">ctx</span><span class="p">.</span><span class="nf">setAttr</span><span class="p">(</span><span class="nc">ACCESS_TOKEN_KEY</span><span class="p">,</span> <span class="n">claims</span><span class="p">)</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="n">delegate</span><span class="p">.</span><span class="nf">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>This slightly modified decorator does not respond with <em>Unauthorized</em> when no valid token is present; it simply leaves the context without claims. Let's modify our controller to accommodate this change.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">BlogPostController</span> <span class="p">{</span>

    <span class="nd">@Get</span><span class="p">(</span><span class="s">"/blog_posts"</span><span class="p">)</span>
    <span class="nd">@Decorator</span><span class="p">(</span><span class="nc">MaybeAccessToken</span><span class="o">::</span><span class="k">class</span><span class="p">)</span>
    <span class="k">suspend</span> <span class="k">fun</span> <span class="nf">listBlogPosts</span><span class="p">():</span> <span class="nc">List</span><span class="p">&lt;</span><span class="nc">BlogPost</span><span class="p">&gt;</span> <span class="p">{</span>
        <span class="kd">val</span> <span class="py">token</span> <span class="p">=</span> <span class="nc">ServiceRequestContext</span><span class="p">.</span><span class="nf">current</span><span class="p">().</span><span class="nf">attr</span><span class="p">(</span><span class="nc">ACCESS_TOKEN_KEY</span><span class="p">)</span>

        <span class="c1">// Anonymous callers only see public posts</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">token</span> <span class="p">==</span> <span class="k">null</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="nc">BlogRepository</span><span class="p">.</span><span class="nf">listPublicBlogPosts</span><span class="p">()</span>
        <span class="p">}</span>

        <span class="kd">val</span> <span class="py">subscribedAuthors</span> <span class="p">=</span> <span class="nc">Subscriptions</span><span class="p">.</span><span class="nf">getForUser</span><span class="p">(</span><span class="n">token</span><span class="p">.</span><span class="n">userId</span><span class="p">).</span><span class="nf">map</span> <span class="p">{</span> <span class="n">it</span><span class="p">.</span><span class="n">authorId</span> <span class="p">}</span>

        <span class="k">return</span> <span class="nc">BlogRepository</span><span class="p">.</span><span class="nf">listAllBlogPosts</span><span class="p">(</span><span class="n">subscribedAuthors</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="nd">@Post</span><span class="p">(</span><span class="s">"/blog_posts"</span><span class="p">)</span>
    <span class="nd">@Decorator</span><span class="p">(</span><span class="nc">RequireAccessToken</span><span class="o">::</span><span class="k">class</span><span class="p">)</span>
    <span class="k">suspend</span> <span class="k">fun</span> <span class="nf">createBlogPost</span><span class="p">(</span><span class="n">body</span><span class="p">:</span> <span class="nc">CreateBlogPostBody</span><span class="p">):</span> <span class="nc">BlogPost</span> <span class="p">{</span> <span class="o">..</span><span class="p">.</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>Now, let's add a slight twist to this scenario. Let's assume this blog website was created long ago and it also has a mobile app. In the mobile app, instead of using JWT tokens, we were sending a username and password header (Yikes!). Even though this is not desirable, some real-world applications need to keep supporting legacy code for various reasons. In this example, the application was created long ago and it did not have <em>OTA</em> updates. So even if we migrate to JWT in the mobile app, we need to keep supporting the old way of authenticating to keep serving our old users.</p> <p>Let's modify the decorator to take the username and password header into account.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">val</span> <span class="py">USER_ID_KEY</span><span class="p">:</span> <span class="nc">AttributeKey</span><span class="p">&lt;</span><span class="nc">UUID</span><span class="p">&gt;</span> <span class="p">=</span> <span class="nc">AttributeKey</span><span class="p">.</span><span class="nf">valueOf</span><span class="p">(</span><span class="s">"user_id"</span><span class="p">)</span>

<span class="kd">class</span> <span class="nc">MaybeUser</span> <span class="p">:</span> <span class="nc">DecoratingHttpServiceFunction</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">serve</span><span class="p">(</span>
        <span class="n">delegate</span><span class="p">:</span> <span class="nc">HttpService</span><span class="p">,</span>
        <span class="n">ctx</span><span class="p">:</span> <span class="nc">ServiceRequestContext</span><span class="p">,</span>
        <span class="n">req</span><span class="p">:</span> <span class="nc">HttpRequest</span><span class="p">,</span>
    <span class="p">):</span> <span class="nc">HttpResponse</span> <span class="p">{</span>
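        <span class="c1">// Without an Authorization header the caller stays anonymous</span>
        <span class="kd">val</span> <span class="py">jwtToken</span><span class="p">:</span> <span class="nc">String</span> <span class="p">=</span> <span class="n">ctx</span><span class="p">.</span><span class="nf">request</span><span class="p">()</span>
                                <span class="p">.</span><span class="nf">headers</span><span class="p">()</span>
                                <span class="p">.</span><span class="k">get</span><span class="p">(</span><span class="s">"Authorization"</span><span class="p">)</span>
                                <span class="p">?:</span> <span class="k">return</span> <span class="n">delegate</span><span class="p">.</span><span class="nf">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>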

        <span class="k">if</span> <span class="p">(</span><span class="n">jwtToken</span><span class="p">.</span><span class="nf">startsWith</span><span class="p">(</span><span class="s">"Bearer"</span><span class="p">))</span> <span class="p">{</span>
            <span class="kd">val</span> <span class="py">claims</span> <span class="p">=</span> <span class="nc">MyJWTVerifier</span><span class="p">.</span><span class="nf">validate</span><span class="p">(</span><span class="n">jwtToken</span><span class="p">.</span><span class="nf">removePrefix</span><span class="p">(</span><span class="s">"Bearer "</span><span class="p">))</span>
            <span class="c1">// Only trust the user id of a token that passed verification</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">claims</span><span class="p">.</span><span class="n">isValid</span><span class="p">)</span> <span class="n">ctx</span><span class="p">.</span><span class="nf">setAttr</span><span class="p">(</span><span class="nc">USER_ID_KEY</span><span class="p">,</span> <span class="n">claims</span><span class="p">.</span><span class="n">userId</span><span class="p">)</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">jwtToken</span><span class="p">.</span><span class="nf">startsWith</span><span class="p">(</span><span class="s">"Basic"</span><span class="p">))</span> <span class="p">{</span>
            <span class="kd">val</span> <span class="py">userNameAndPassword</span> <span class="p">=</span> <span class="n">jwtToken</span><span class="p">.</span><span class="nf">removePrefix</span><span class="p">(</span><span class="s">"Basic "</span><span class="p">).</span><span class="nf">split</span><span class="p">(</span><span class="s">":"</span><span class="p">)</span>
            <span class="kd">val</span> <span class="py">userId</span><span class="p">:</span> <span class="nc">UUID</span><span class="p">?</span> <span class="p">=</span> <span class="nc">Users</span><span class="p">.</span><span class="nf">check</span><span class="p">(</span><span class="n">userNameAndPassword</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">userNameAndPassword</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
            <span class="n">ctx</span><span class="p">.</span><span class="nf">setAttr</span><span class="p">(</span><span class="nc">USER_ID_KEY</span><span class="p">,</span> <span class="n">userId</span><span class="p">)</span>
        <span class="p">}</span>


        <span class="k">return</span> <span class="n">delegate</span><span class="p">.</span><span class="nf">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>
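
<span class="c1">// Sketch, not from the original post: how a handler could read the injected user.</span>
<span class="c1">// currentUserId is a hypothetical helper; attr() yields null when no decorator set the key.</span>
<span class="k">fun</span> <span class="nf">currentUserId</span><span class="p">():</span> <span class="nc">UUID</span><span class="p">?</span> <span class="p">=</span> <span class="nc">ServiceRequestContext</span><span class="p">.</span><span class="nf">current</span><span class="p">().</span><span class="nf">attr</span><span class="p">(</span><span class="nc">USER_ID_KEY</span><span class="p">)</span>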
</code></pre></div></div> <p>Now, with this new decorator, we have moved the logic for finding out which user made the call out of the controller. We basically wrap our controller with a single annotation and it injects the calling user into the context.</p> <h2 id="suspend-calls-in-decorators">Suspend Calls in Decorators</h2> <p>We most likely need to make suspending calls from decorators to do certain checks, such as database calls or network calls. This includes the user login check and maybe JWT verification. As you might have noticed, it is currently not possible to create suspend decorators (<a href="https://github.com/line/armeria/issues/4725">issue to track</a>). So to achieve this, we can use the event loop as our dispatcher.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">val</span> <span class="py">future</span> <span class="p">=</span> <span class="nc">CoroutineScope</span><span class="p">(</span><span class="n">ctx</span><span class="p">.</span><span class="nf">eventLoop</span><span class="p">().</span><span class="nf">asCoroutineDispatcher</span><span class="p">()).</span><span class="nf">future</span> <span class="p">{</span>
    <span class="c1">// Can call suspend functions here</span>
    <span class="nc">UserRepository</span><span class="p">.</span><span class="nf">login</span><span class="p">(</span><span class="o">..</span><span class="p">.)</span>
    <span class="nc">HttpResponse</span><span class="p">.</span><span class="nf">of</span><span class="p">(</span><span class="nc">HttpStatus</span><span class="p">.</span><span class="nc">OK</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">return</span> <span class="nc">HttpResponse</span><span class="p">.</span><span class="nf">of</span><span class="p">(</span><span class="n">future</span><span class="p">)</span>
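</code></pre></div></div> <p>Put together, a complete decorator built around this snippet could look like the sketch below. <code class="language-plaintext highlighter-rouge">SuspendingCheck</code> and <code class="language-plaintext highlighter-rouge">isAllowed</code> are made-up names for illustration, and the header check only stands in for a real database or network call.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import com.linecorp.armeria.common.HttpRequest
import com.linecorp.armeria.common.HttpResponse
import com.linecorp.armeria.common.HttpStatus
import com.linecorp.armeria.server.DecoratingHttpServiceFunction
import com.linecorp.armeria.server.HttpService
import com.linecorp.armeria.server.ServiceRequestContext
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.future.future

// Hypothetical suspend check; a real one would hit a database or another service.
suspend fun isAllowed(req: HttpRequest): Boolean =
    req.headers().contains("authorization")

class SuspendingCheck : DecoratingHttpServiceFunction {
    override fun serve(
        delegate: HttpService,
        ctx: ServiceRequestContext,
        req: HttpRequest,
    ): HttpResponse {
        // Launch the coroutine on the request's event loop and expose it as a CompletableFuture.
        val future = CoroutineScope(ctx.eventLoop().asCoroutineDispatcher()).future {
            if (isAllowed(req)) {
                delegate.serve(ctx, req)
            } else {
                HttpResponse.of(HttpStatus.UNAUTHORIZED)
            }
        }

        // Armeria completes the response once the future completes.
        return HttpResponse.of(future)
    }
}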
</code></pre></div></div> <p>Note that when using the event loop, you should ensure your suspend functions follow <a href="https://developer.android.com/kotlin/coroutines/coroutines-best-practices#main-safe">the best practice</a> of being main-safe, i.e. they can be safely called from the event loop thread without blocking it. Otherwise, you should use another dispatcher, e.g. the blocking task executor or <code class="language-plaintext highlighter-rouge">Dispatchers.IO</code>.</p> <h2 id="handling-dependency-injection">Handling Dependency Injection</h2> <p>As you might have noticed, our repositories were assumed to be <em>object</em>s for simplicity in the first examples. However, in real-world applications, dependency injection frameworks such as <em>Koin</em> are widely adopted. For example, with Koin, we can mark a decorator as a <code class="language-plaintext highlighter-rouge">KoinComponent</code>.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">MaybeAccessToken</span> <span class="p">:</span> <span class="nc">DecoratingHttpServiceFunction</span><span class="p">,</span> <span class="nc">KoinComponent</span> <span class="p">{</span>
    <span class="k">private</span> <span class="kd">val</span> <span class="py">jwtVerifier</span> <span class="k">by</span> <span class="n">inject</span><span class="p">&lt;</span><span class="nc">JWTVerifier</span><span class="p">&gt;()</span>

    <span class="k">override</span> <span class="k">fun</span> <span class="nf">serve</span><span class="p">(</span>
        <span class="n">delegate</span><span class="p">:</span> <span class="nc">HttpService</span><span class="p">,</span>
        <span class="n">ctx</span><span class="p">:</span> <span class="nc">ServiceRequestContext</span><span class="p">,</span>
        <span class="n">req</span><span class="p">:</span> <span class="nc">HttpRequest</span><span class="p">,</span>
    <span class="p">):</span> <span class="nc">HttpResponse</span> <span class="p">{</span> <span class="o">..</span><span class="p">.</span> <span class="p">}</span>
<span class="p">}</span>
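</code></pre></div></div> <p>Armeria instantiates the decorator itself, so Koin has to be started before <code class="language-plaintext highlighter-rouge">jwtVerifier</code> is first resolved (<code class="language-plaintext highlighter-rouge">by inject</code> is lazy). A minimal sketch of that wiring follows; the <code class="language-plaintext highlighter-rouge">JWTVerifier</code> interface and <code class="language-plaintext highlighter-rouge">StaticJWTVerifier</code> here are hypothetical stand-ins for whatever verifier your application uses.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.koin.core.context.startKoin
import org.koin.dsl.module

// Hypothetical stand-in for the JWTVerifier type injected by the decorator.
interface JWTVerifier {
    fun verify(token: String): Boolean
}

// Toy implementation that accepts one fixed token; this is not real JWT verification.
class StaticJWTVerifier(private val expected: String) : JWTVerifier {
    override fun verify(token: String): Boolean = token == expected
}

// Koin module that tells the container how to build a JWTVerifier.
val authModule = module {
    single&lt;JWTVerifier&gt; { StaticJWTVerifier("dev-token") }
}

fun main() {
    // Start Koin first so that decorators can resolve their dependencies later.
    startKoin { modules(authModule) }
    // Building and starting the Armeria server would follow here.
}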
</code></pre></div></div> <h2 id="custom-annotations-and-parameters">Custom Annotations and Parameters</h2> <p>Sometimes a decorator is generic and needs to take parameters. For example, let's say the <code class="language-plaintext highlighter-rouge">MaybeUser</code> annotation should be constrained to certain types of users, such as <code class="language-plaintext highlighter-rouge">subscriber</code>, <code class="language-plaintext highlighter-rouge">member</code> or <code class="language-plaintext highlighter-rouge">visitor</code>. We want something like the following:</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@RequireUser</span><span class="p">(</span><span class="n">allow</span> <span class="p">=</span> <span class="p">[</span><span class="s">"subscriber"</span><span class="p">,</span> <span class="s">"member"</span><span class="p">])</span>
<span class="nd">@Post</span><span class="p">(</span><span class="s">"/blog_post/{id}/like"</span><span class="p">)</span>
<span class="k">fun</span> <span class="nf">likeBlogPost</span><span class="p">(</span><span class="nd">@Param</span> <span class="n">id</span><span class="p">:</span> <span class="nc">String</span><span class="p">)</span> <span class="p">{</span> <span class="o">..</span><span class="p">.</span> <span class="p">}</span>
</code></pre></div></div> <p>To achieve this functionality, we can't use the <code class="language-plaintext highlighter-rouge">@Decorator(...)</code> approach because it does not accept parameters. Instead, we should use the <code class="language-plaintext highlighter-rouge">@DecoratorFactory</code> annotation backed by a <code class="language-plaintext highlighter-rouge">DecoratorFactoryFunction</code>.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@DecoratorFactory</span><span class="p">(</span><span class="nc">RequireUserDecoratorFactory</span><span class="o">::</span><span class="k">class</span><span class="p">)</span>
<span class="k">annotation</span> <span class="kd">class</span> <span class="nc">RequireUser</span><span class="p">(</span><span class="kd">val</span> <span class="py">allow</span><span class="p">:</span> <span class="nc">Array</span><span class="p">&lt;</span><span class="nc">String</span><span class="p">&gt;</span> <span class="p">=</span> <span class="p">[])</span>

<span class="kd">class</span> <span class="nc">RequireUserDecorator</span><span class="p">(</span><span class="n">delegate</span><span class="p">:</span> <span class="nc">HttpService</span><span class="p">,</span> <span class="k">private</span> <span class="kd">val</span> <span class="py">allow</span><span class="p">:</span> <span class="nc">Array</span><span class="p">&lt;</span><span class="nc">String</span><span class="p">&gt;):</span> <span class="nc">SimpleDecoratingHttpService</span><span class="p">(</span><span class="n">delegate</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">serve</span><span class="p">(</span>
        <span class="n">ctx</span><span class="p">:</span> <span class="nc">ServiceRequestContext</span><span class="p">,</span>
        <span class="n">req</span><span class="p">:</span> <span class="nc">HttpRequest</span><span class="p">,</span>
    <span class="p">):</span> <span class="nc">HttpResponse</span> <span class="p">{</span>
        <span class="kd">val</span> <span class="py">token</span> <span class="p">=</span> <span class="nc">MyJWTVerifier</span><span class="p">.</span><span class="nf">verify</span><span class="p">(</span><span class="n">ctx</span><span class="p">.</span><span class="nf">request</span><span class="p">()</span>
                                            <span class="p">.</span><span class="nf">headers</span><span class="p">()</span>
                                            <span class="p">.</span><span class="k">get</span><span class="p">(</span><span class="s">"Authorization"</span><span class="p">))</span>


        <span class="k">if</span> <span class="p">(</span><span class="n">token</span> <span class="p">==</span> <span class="k">null</span> <span class="p">||</span> <span class="n">token</span><span class="p">.</span><span class="n">groups</span><span class="p">.</span><span class="nf">none</span> <span class="p">{</span> <span class="n">it</span> <span class="k">in</span> <span class="n">allow</span> <span class="p">})</span> <span class="p">{</span>
            <span class="k">return</span> <span class="nc">HttpResponse</span><span class="p">.</span><span class="nf">of</span><span class="p">(</span><span class="nc">HttpStatus</span><span class="p">.</span><span class="nc">UNAUTHORIZED</span><span class="p">)</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="nf">unwrap</span><span class="p">().</span><span class="nf">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kd">class</span> <span class="nc">RequireUserDecoratorFactory</span><span class="p">:</span> <span class="nc">DecoratorFactoryFunction</span><span class="p">&lt;</span><span class="nc">RequireUser</span><span class="p">&gt;</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">newDecorator</span><span class="p">(</span><span class="n">parameter</span><span class="p">:</span> <span class="nc">RequireUser</span><span class="p">):</span> <span class="nc">Function</span><span class="p">&lt;</span><span class="k">in</span> <span class="nc">HttpService</span><span class="p">,</span> <span class="k">out</span> <span class="nc">HttpService</span><span class="p">&gt;</span> <span class="p">{</span>
        <span class="k">return</span> <span class="nc">Function</span> <span class="p">{</span> <span class="nc">RequireUserDecorator</span><span class="p">(</span><span class="n">it</span><span class="p">,</span> <span class="n">parameter</span><span class="p">.</span><span class="n">allow</span><span class="p">)</span> <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>With this in place, Armeria detects whenever the <code class="language-plaintext highlighter-rouge">@RequireUser</code> annotation is applied to a controller / service and decorates it with <code class="language-plaintext highlighter-rouge">RequireUserDecorator</code>.</p> <h2 id="authorized-by-default">Authorized By Default</h2> <p>Let's add authorization-by-default semantics to our application. Requiring auth by default helps ensure sensitive applications do not leak data by mistake. The challenge with this approach is that the authorization decorator will be the topmost decorator, and overriding its behavior at the method level is tough. So, we should slightly modify our decorators to be more aware of each other. Let's define our syntax as follows:</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@NeedsAuthentication</span>
<span class="kd">class</span> <span class="nc">MembersController</span> <span class="p">{</span>

    <span class="nd">@PublicEndpoint</span>
    <span class="k">fun</span> <span class="nf">getMemberCount</span><span class="p">():</span> <span class="nc">Int</span> <span class="p">{</span> <span class="o">..</span><span class="p">.</span> <span class="p">}</span>

    <span class="nd">@Get</span><span class="p">(</span><span class="s">"/members"</span><span class="p">)</span>
    <span class="k">fun</span> <span class="nf">getMembers</span><span class="p">():</span> <span class="nc">List</span><span class="p">&lt;...&gt;</span> <span class="p">{</span> <span class="o">..</span><span class="p">.</span> <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// Or alternatively...</span>

<span class="nc">Server</span><span class="p">.</span><span class="nf">builder</span><span class="p">().</span><span class="nf">decorator</span><span class="p">(</span><span class="nc">NeedsAuthentication</span><span class="p">.</span><span class="nf">newDecorator</span><span class="p">())</span>
</code></pre></div></div> <p>Here, the <code class="language-plaintext highlighter-rouge">@NeedsAuthentication</code> decorator will be applied first. However, we want to override its behavior for specific endpoints using <code class="language-plaintext highlighter-rouge">@PublicEndpoint</code>. So, let's define our annotations.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@DecoratorFactory</span><span class="p">(</span><span class="nc">NeedsAuthenticationDecoratorFactory</span><span class="o">::</span><span class="k">class</span><span class="p">)</span>
<span class="nd">@Target</span><span class="p">(</span><span class="nc">AnnotationTarget</span><span class="p">.</span><span class="nc">FUNCTION</span><span class="p">,</span> <span class="nc">AnnotationTarget</span><span class="p">.</span><span class="nc">CLASS</span><span class="p">)</span>
<span class="k">annotation</span> <span class="kd">class</span> <span class="nc">NeedsAuthentication</span>

<span class="nd">@DecoratorFactory</span><span class="p">(</span><span class="nc">PublicEndpointDecoratorFactory</span><span class="o">::</span><span class="k">class</span><span class="p">)</span>
<span class="nd">@Target</span><span class="p">(</span><span class="nc">AnnotationTarget</span><span class="p">.</span><span class="nc">FUNCTION</span><span class="p">,</span> <span class="nc">AnnotationTarget</span><span class="p">.</span><span class="nc">CLASS</span><span class="p">)</span>
<span class="k">annotation</span> <span class="kd">class</span> <span class="nc">PublicEndpoint</span>
</code></pre></div></div> <p>Let's define our services. Here, the public endpoint service is only a dummy service used as a marker.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">NeedsAuthService</span><span class="p">(</span><span class="n">delegate</span><span class="p">:</span> <span class="nc">HttpService</span><span class="p">):</span> <span class="nc">SimpleDecoratingHttpService</span><span class="p">(</span><span class="n">delegate</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">serve</span><span class="p">(</span>
        <span class="n">ctx</span><span class="p">:</span> <span class="nc">ServiceRequestContext</span><span class="p">,</span>
        <span class="n">req</span><span class="p">:</span> <span class="nc">HttpRequest</span><span class="p">,</span>
    <span class="p">):</span> <span class="nc">HttpResponse</span> <span class="p">{</span>
        <span class="kd">val</span> <span class="py">token</span> <span class="p">=</span> <span class="nc">ServiceRequestContextAuthChecker</span><span class="p">.</span><span class="nf">getAccessToken</span><span class="p">()</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">token</span> <span class="p">==</span> <span class="k">null</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="nc">HttpResponse</span><span class="p">.</span><span class="nf">of</span><span class="p">(</span><span class="nc">HttpStatus</span><span class="p">.</span><span class="nc">UNAUTHORIZED</span><span class="p">)</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="nf">unwrap</span><span class="p">().</span><span class="nf">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// A dummy service used as a marker</span>
<span class="kd">class</span> <span class="nc">PublicEndpointService</span><span class="p">(</span><span class="n">delegate</span><span class="p">:</span> <span class="nc">HttpService</span><span class="p">):</span> <span class="nc">SimpleDecoratingHttpService</span><span class="p">(</span><span class="n">delegate</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">serve</span><span class="p">(</span>
        <span class="n">ctx</span><span class="p">:</span> <span class="nc">ServiceRequestContext</span><span class="p">,</span>
        <span class="n">req</span><span class="p">:</span> <span class="nc">HttpRequest</span><span class="p">,</span>
    <span class="p">):</span> <span class="nc">HttpResponse</span> <span class="p">{</span>
        <span class="k">return</span> <span class="nf">unwrap</span><span class="p">().</span><span class="nf">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>So, why the marker? We need a way to figure out whether a service is annotated with <code class="language-plaintext highlighter-rouge">@PublicEndpoint</code>. If so, we should not apply the auth decorator. The factory function below also eliminates duplicate auth checks by trying to downcast the delegate to <code class="language-plaintext highlighter-rouge">NeedsAuthService</code> as well.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">NeedsAuthenticationDecoratorFactory</span> <span class="p">:</span> <span class="nc">DecoratorFactoryFunction</span><span class="p">&lt;</span><span class="nc">NeedsAuthentication</span><span class="p">&gt;</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">newDecorator</span><span class="p">(</span><span class="n">parameter</span><span class="p">:</span> <span class="nc">NeedsAuthentication</span><span class="p">):</span> <span class="nc">Function</span><span class="p">&lt;</span><span class="k">in</span> <span class="nc">HttpService</span><span class="p">,</span> <span class="k">out</span> <span class="nc">HttpService</span><span class="p">&gt;</span> <span class="p">{</span>
        <span class="k">return</span> <span class="nc">Function</span> <span class="p">{</span> <span class="n">delegate</span> <span class="p">-&gt;</span>
            <span class="kd">val</span> <span class="py">maybePublic</span><span class="p">:</span> <span class="nc">PublicEndpointService</span><span class="p">?</span> <span class="p">=</span> <span class="n">delegate</span><span class="p">.</span><span class="nf">`as`</span><span class="p">(</span><span class="nc">PublicEndpointService</span><span class="o">::</span><span class="k">class</span><span class="p">.</span><span class="n">java</span><span class="p">)</span>
            <span class="kd">val</span> <span class="py">maybeAuthenticated</span><span class="p">:</span> <span class="nc">NeedsAuthService</span><span class="p">?</span> <span class="p">=</span> <span class="n">delegate</span><span class="p">.</span><span class="nf">`as`</span><span class="p">(</span><span class="nc">NeedsAuthService</span><span class="o">::</span><span class="k">class</span><span class="p">.</span><span class="n">java</span><span class="p">)</span>

            <span class="k">if</span> <span class="p">(</span><span class="n">maybePublic</span> <span class="p">!=</span> <span class="k">null</span> <span class="p">||</span> <span class="n">maybeAuthenticated</span> <span class="p">!=</span> <span class="k">null</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">return</span><span class="nd">@Function</span> <span class="n">delegate</span>
            <span class="p">}</span>

            <span class="nc">NeedsAuthService</span><span class="p">(</span><span class="n">delegate</span><span class="p">)</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kd">class</span> <span class="nc">PublicEndpointDecoratorFactory</span> <span class="p">:</span> <span class="nc">DecoratorFactoryFunction</span><span class="p">&lt;</span><span class="nc">PublicEndpoint</span><span class="p">&gt;</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">newDecorator</span><span class="p">(</span><span class="n">parameter</span><span class="p">:</span> <span class="nc">PublicEndpoint</span><span class="p">):</span> <span class="nc">Function</span><span class="p">&lt;</span><span class="k">in</span> <span class="nc">HttpService</span><span class="p">,</span> <span class="k">out</span> <span class="nc">HttpService</span><span class="p">&gt;</span> <span class="p">{</span>
        <span class="k">return</span> <span class="nc">Function</span> <span class="p">{</span> <span class="n">delegate</span> <span class="p">-&gt;</span> <span class="nc">PublicEndpointService</span><span class="p">(</span><span class="n">delegate</span><span class="p">)</span> <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
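</code></pre></div></div> <p>To make the marker lookup more concrete, here is a toy model of a decorator chain in plain Kotlin. None of these types are Armeria classes: <code class="language-plaintext highlighter-rouge">Service</code>, <code class="language-plaintext highlighter-rouge">find</code>, <code class="language-plaintext highlighter-rouge">PublicMarker</code> and <code class="language-plaintext highlighter-rouge">AuthCheck</code> are made up to mimic how the factory above searches the wrapped delegates before deciding whether to add the auth check.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Toy model of a decorator chain; each service optionally wraps a delegate.
open class Service(val delegate: Service? = null) {
    // Returns the first service in the chain that is an instance of [type], or null.
    fun &lt;T : Service&gt; find(type: Class&lt;T&gt;): T? {
        var current: Service? = this
        while (current != null) {
            if (type.isInstance(current)) return type.cast(current)
            current = current.delegate
        }
        return null
    }
}

class Endpoint : Service()
class PublicMarker(delegate: Service) : Service(delegate)
class AuthCheck(delegate: Service) : Service(delegate)

// Mirrors the factory: skip wrapping when a marker or an auth check is already in the chain.
fun needsAuth(delegate: Service): Service {
    val isPublic = delegate.find(PublicMarker::class.java) != null
    val alreadyChecked = delegate.find(AuthCheck::class.java) != null
    return if (isPublic || alreadyChecked) delegate else AuthCheck(delegate)
}

fun main() {
    // A plain endpoint gets wrapped with the auth check.
    println(needsAuth(Endpoint()) is AuthCheck) // true

    // An endpoint carrying the public marker is left untouched.
    println(needsAuth(PublicMarker(Endpoint())) is AuthCheck) // false

    // An endpoint that is already guarded is not wrapped a second time.
    val guarded = AuthCheck(Endpoint())
    println(needsAuth(guarded) === guarded) // true
}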
</code></pre></div></div> <h2 id="bonus-open-policy-agent-opa">Bonus: Open Policy Agent (OPA)</h2> <p>As a bonus, let's use the popular policy engine <em>OPA</em> to authorize requests to our system. The recommended way to authorize with OPA is to use an <em>Envoy</em> sidecar with an external authorization filter. However, this setup might not be sufficient, or might not be available at all if you are not using Envoy.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/opa-envoy-480.webp 480w,/assets/img/posts/opa-envoy-800.webp 800w,/assets/img/posts/opa-envoy-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/opa-envoy.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Source: https://www.openpolicyagent.org/docs/latest/envoy-introduction/</figcaption> </figure> <p>So we can create a decorator that intercepts all requests coming to our service at the top level and checks access.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">Authorize</span> <span class="p">:</span> <span class="nc">DecoratingHttpServiceFunction</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">serve</span><span class="p">(</span>
        <span class="n">delegate</span><span class="p">:</span> <span class="nc">HttpService</span><span class="p">,</span>
        <span class="n">ctx</span><span class="p">:</span> <span class="nc">ServiceRequestContext</span><span class="p">,</span>
        <span class="n">req</span><span class="p">:</span> <span class="nc">HttpRequest</span><span class="p">,</span>
    <span class="p">):</span> <span class="nc">HttpResponse</span> <span class="p">{</span>
        <span class="kd">val</span> <span class="py">path</span> <span class="p">=</span> <span class="n">ctx</span><span class="p">.</span><span class="nf">routingContext</span><span class="p">().</span><span class="nf">path</span><span class="p">()</span>
        <span class="kd">val</span> <span class="py">method</span> <span class="p">=</span> <span class="n">ctx</span><span class="p">.</span><span class="nf">routingContext</span><span class="p">().</span><span class="nf">method</span><span class="p">()</span>
        <span class="kd">val</span> <span class="py">token</span> <span class="p">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">authorizationHeader</span><span class="p">.</span><span class="nf">removePrefix</span><span class="p">(</span><span class="s">"Bearer "</span><span class="p">)</span>

        <span class="c1">// More contextual data can be added as desired</span>
        <span class="kd">val</span> <span class="py">result</span> <span class="p">=</span> <span class="nc">OPAClient</span><span class="p">.</span><span class="nf">check</span><span class="p">(</span><span class="nf">mapOf</span><span class="p">(</span><span class="s">"path"</span> <span class="n">to</span> <span class="n">path</span><span class="p">,</span> <span class="s">"method"</span> <span class="n">to</span> <span class="n">method</span><span class="p">,</span> <span class="s">"bearer_token"</span> <span class="n">to</span> <span class="n">token</span><span class="p">))</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">authorized</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">delegate</span><span class="p">.</span><span class="nf">serve</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">req</span><span class="p">)</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="nc">HttpResponse</span><span class="p">.</span><span class="nf">of</span><span class="p">(</span><span class="nc">HttpStatus</span><span class="p">.</span><span class="nc">UNAUTHORIZED</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>
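</code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">OPAClient</code> used above is not part of Armeria or OPA. One possible shape for it is a thin wrapper over OPA's REST Data API, sketched here under a few assumptions: an OPA server listening on <code class="language-plaintext highlighter-rouge">localhost:8181</code>, a policy exposing a boolean rule at the made-up path <code class="language-plaintext highlighter-rouge">httpapi/authz/allow</code>, and Jackson on the classpath. Note that <code class="language-plaintext highlighter-rouge">send</code> blocks the calling thread, so a real decorator would want to run it off the event loop or use a non-blocking client.</p> <div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import com.fasterxml.jackson.databind.ObjectMapper
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Hypothetical result type matching the `result.authorized` access in the decorator.
data class OPAResult(val authorized: Boolean)

// Hypothetical client; the URL and policy path below are assumptions.
object OPAClient {
    private val mapper = ObjectMapper()
    private val http = HttpClient.newHttpClient()
    private val endpoint = URI.create("http://localhost:8181/v1/data/httpapi/authz/allow")

    fun check(input: Map&lt;String, Any?&gt;): OPAResult {
        // OPA's Data API expects the query document under the "input" key.
        val body = mapper.writeValueAsString(mapOf("input" to input))
        val request = HttpRequest.newBuilder(endpoint)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build()

        // Blocking call; see the note about the event loop above.
        val response = http.send(request, HttpResponse.BodyHandlers.ofString())

        // The decision comes back under "result"; treat a missing value as "deny".
        val allowed = mapper.readTree(response.body()).path("result").asBoolean(false)
        return OPAResult(allowed)
    }
}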
</code></pre></div></div> <h2 id="wrap-up">Wrap Up</h2> <p>In this blog post, I have covered how decorators can be useful building blocks for your application's authorization framework. They are flexible and customizable, and they let you move concerns such as authentication &amp; authorization into a separate layer. I hope this inspires you to use decorators. Please let me know if you liked this article and if so please subscribe to be notified about future articles.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I was introduced to Armeria two years ago, in 2022. Since then, it has been my go-to framework for JVM-based projects. Recently, I built some shared authorization code at work, and I wanted to share how we built our authorization framework using Armeria by applying it to a theoretical scenario.]]></summary></entry></feed>