<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>DataEng on Blog | Émilien F. </title>
    <link>https://emilien-foissotte.github.io/tags/dataeng/</link>
    <description>Recent content in DataEng on Blog | Émilien F. </description>
    <generator>Hugo -- 0.140.2</generator>
    <language>en</language>
    <lastBuildDate>Sat, 11 Jan 2025 09:04:03 +0100</lastBuildDate>
    <atom:link href="https://emilien-foissotte.github.io/tags/dataeng/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Going Offline Efficiently</title>
      <link>https://emilien-foissotte.github.io/posts/2025/01/going-offline-efficiently/</link>
      <pubDate>Sat, 11 Jan 2025 09:04:03 +0100</pubDate>
      <guid>https://emilien-foissotte.github.io/posts/2025/01/going-offline-efficiently/</guid>
      <description>&lt;h1 id=&#34;tldr&#34;&gt;TL;DR&lt;/h1&gt;
&lt;p&gt;This post will show some tips on how to work efficiently as a Data Engineer 🚀, either navigating through documentation
or using a local LLM to ease your development experience (having a Mac chip will be mandatory for this one). 👨‍💻&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s go !&lt;/p&gt;
&lt;h2 id=&#34;intro&#34;&gt;Intro&lt;/h2&gt;
&lt;p&gt;Nowadays, you may have to work with a limited internet connection, and the gap with your usual workstation setup is huge 🦾&lt;/p&gt;
&lt;p&gt;In those situations, the connection might be very slow, with unreliable bandwidth. This makes it very difficult to work in those environments, but with a little
preparation you can be as effective as before ! 💥&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="tldr">TL;DR</h1>
<p>This post will show some tips on how to work efficiently as a Data Engineer 🚀, either navigating through documentation
or using a local LLM to ease your development experience (having a Mac chip will be mandatory for this one). 👨‍💻</p>
<p>Let&rsquo;s go !</p>
<h2 id="intro">Intro</h2>
<p>Nowadays, you may have to work with a limited internet connection, and the gap with your usual workstation setup is huge 🦾</p>
<p>In those situations, the connection might be very slow, with unreliable bandwidth. This makes it very difficult to work in those environments, but with a little
preparation you can be as effective as before ! 💥</p>
<p><img alt="sloth" loading="lazy" src="https://media1.giphy.com/media/v1.Y2lkPTc5MGI3NjExbzN1eTJlY2Y5cndiaGdydDdhZW9iZjE3bzhpZ3VtczZwODhyY3p5NyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/3oz8xOu5Gw81qULRh6/giphy.gif#center"></p>
<p>Get ready to boost your productivity, on a train, on a bus or enjoying a family trip in the countryside ! 🚜</p>
<h2 id="keep-a-browsable-documentation-everywhere">Keep a browsable documentation, everywhere</h2>
<p>Ever been told to RTFM after raising an issue ? Do yourself a favor, read the documentation when approaching a
new concept or feature of a package ! 📖</p>
<h3 id="unix-troubles">Unix troubles</h3>
<p>A very good tool to use when working on UNIX computers is <code>man</code>, a simple documentation tool that gives you
every detail about the installed software.</p>
<p>However, it can be hard to navigate the entire documentation, and it&rsquo;s quite difficult to find the exact usage
you are looking for.</p>
<p>To solve this, a community tool has emerged : <code>tldr</code>.</p>
<p>In a few lines of text, you will find the main usage of the software and its syntax.</p>
<p>For instance, for the <code>xargs</code> command :</p>
<p><img alt="xargs_tldr" loading="lazy" src="/posts/2025/01/going-offline-efficiently/tldr_xargs.png#center"></p>
<p>Pretty neat, isn&rsquo;t it ? 🔥</p>
<p>To install it, have a look at the <a href="https://github.com/tldr-pages/tldr">repository</a>.</p>
<h3 id="browsing-python-documentation">Browsing Python documentation</h3>
<p>As Python is becoming one of the most widely used programming languages, you will surely encounter issues while working with it.
Are you struggling with a built-in Python object ? 🐍</p>
<p>A deep exploration using Pythonic Wizard tricks like <code>yourobject.__dict__</code> hasn&rsquo;t given you any useful information ?</p>
<p>Fire up the built-in documentation server associated with your Python version :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>pydoc -p <span style="color:#ae81ff">0</span>
</span></span></code></pre></div><p>and type <code>b</code> to automatically open a browser page. 🌐</p>
<p>All the installed packages will have their documentation available there, with docstrings and examples.</p>
<p>For instance, here is the docstring of the <code>agg</code> function from <code>pyspark</code> :</p>
<p><img alt="pyspark_agg" loading="lazy" src="/posts/2025/01/going-offline-efficiently/pyspark_agg.png#center"></p>
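<p>The same docstrings can also be read without any browser, straight from a Python script or prompt. Here is a minimal sketch using only the standard library <code>pydoc</code> module; <code>json.dumps</code> is just an arbitrary example target, swap in whatever object puzzles you :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import json
import pydoc

# Render the documentation of an object as plain text (no pager, no browser).
text = pydoc.render_doc(json.dumps, renderer=pydoc.plaintext)

# Print only the first lines to keep the output short.
print("\n".join(text.splitlines()[:10]))
</code></pre></div>
<p>It only relies on what is already installed on your machine, so it keeps working with no connection at all.</p>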
<h2 id="browsing-offline-duckdb-documentation">Browsing DuckDB documentation offline</h2>
<p>Sometimes, with the right preparation, a Data Engineer can safely work on a <code>dbt</code> project,
running <code>unit-tests</code> or <code>data-tests</code> against mock sources. 📊</p>
<p>For instance, I&rsquo;ve been working recently with <a href="https://github.com/carbonfact/lea">lea</a>, a lightweight alternative to
<code>dbt</code> which is plug and play with <code>duckdb</code>.</p>
<p>My staging models were used to load some local data, which was small enough to fit on my hard drive.</p>
<p>I declare views as follows :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>.
</span></span><span style="display:flex;"><span>├── seeds
</span></span><span style="display:flex;"><span>│   ├── inventory.sql
</span></span><span style="display:flex;"><span>│   ├── raw_animals.csv
</span></span><span style="display:flex;"><span>│   └── raw_inventory.parquet
</span></span><span style="display:flex;"><span>├── views
</span></span><span style="display:flex;"><span>│   ├── analytics
</span></span><span style="display:flex;"><span>│   │   └── stats.sql
</span></span><span style="display:flex;"><span>│   ├── core
</span></span><span style="display:flex;"><span>│   │   └── wrangled_inventory.sql
</span></span><span style="display:flex;"><span>│   ├── gold
</span></span><span style="display:flex;"><span>│   │   ├── animals.sql
</span></span><span style="display:flex;"><span>│   │   └── inventory.sql
</span></span><span style="display:flex;"><span>│   └── staging
</span></span><span style="display:flex;"><span>│       ├── animals.py
</span></span><span style="display:flex;"><span>│       └── inventory.py
</span></span><span style="display:flex;"><span>├── wrangling.db
</span></span></code></pre></div><p>The <code>seeds/inventory.sql</code> model contains :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">CALL</span> load_aws_credentials(<span style="color:#e6db74">&#39;my-profile&#39;</span>);
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">DROP</span> <span style="color:#66d9ef">TABLE</span> <span style="color:#66d9ef">IF</span> <span style="color:#66d9ef">EXISTS</span> inventory;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">COPY</span> (
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">SELECT</span> <span style="color:#f92672">*</span> <span style="color:#66d9ef">FROM</span> <span style="color:#e6db74">&#39;s3://mylarge-bucket/inventory.parquet&#39;</span>
</span></span><span style="display:flex;"><span>) <span style="color:#66d9ef">TO</span> <span style="color:#e6db74">&#39;seeds/raw_inventory.parquet&#39;</span> (FORMAT PARQUET);
</span></span></code></pre></div><p>While I&rsquo;m still online, I run :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>duckdb &lt; seeds/inventory.sql
</span></span></code></pre></div><p>It dumps the data into the <em>raw_inventory.parquet</em> file ⚙️</p>
<p>Later on, I can declare the <code>staging</code> model, which contains :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-py" data-lang="py"><span style="display:flex;"><span><span style="color:#f92672">from</span> __future__ <span style="color:#f92672">import</span> annotations
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> pathlib
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> pandas <span style="color:#66d9ef">as</span> pd
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>here <span style="color:#f92672">=</span> pathlib<span style="color:#f92672">.</span>Path(__file__)<span style="color:#f92672">.</span>parent
</span></span><span style="display:flex;"><span>inventory <span style="color:#f92672">=</span> pd<span style="color:#f92672">.</span>read_parquet(here<span style="color:#f92672">.</span>parent<span style="color:#f92672">.</span>parent <span style="color:#f92672">/</span> <span style="color:#e6db74">&#34;seeds&#34;</span> <span style="color:#f92672">/</span> <span style="color:#e6db74">&#34;raw_inventory.parquet&#34;</span>)
</span></span></code></pre></div><p>Need to check the DuckDB docs for an edge case of a function ?</p>
<p>Before your offline trip, download the latest documentation <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>
as a zip file at <a href="https://duckdb.org/duckdb-docs.zip">https://duckdb.org/duckdb-docs.zip</a>.</p>
<p>Then move it to a dedicated folder and unzip it :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>mkdir ~/duckdb-offline <span style="color:#f92672">&amp;&amp;</span> mv ~/Downloads/duckdb-docs.zip ~/duckdb-offline
</span></span><span style="display:flex;"><span>cd ~/duckdb-offline <span style="color:#f92672">&amp;&amp;</span> unzip duckdb-docs.zip
</span></span></code></pre></div><p>Once it&rsquo;s unzipped, start Python&rsquo;s built-in web server from that folder. The documentation will be available
offline, search bar included, very neat ! 🦆</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>python -m http.server
</span></span></code></pre></div><p><img alt="duckdb-docs" loading="lazy" src="/posts/2025/01/going-offline-efficiently/duckdb_docs.png#center"></p>
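<p>If you would rather not depend on the current working directory, the same server can be started from Python. This is only a sketch built on the standard library <code>http.server</code> module; the <code>serve_directory</code> helper and the <code>~/duckdb-offline</code> folder are my own assumptions, adapt them to where you unzipped the docs :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import functools
import http.server
import threading


def serve_directory(directory, port=0):
    """Serve `directory` over HTTP on localhost, in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    # Port 0 lets the OS pick a free port; 127.0.0.1 keeps the server local.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
</code></pre></div>
<p>From an interactive prompt, <code>serve_directory(pathlib.Path.home() / "duckdb-offline", port=8000)</code> serves the docs at <code>http://127.0.0.1:8000/</code> until you call <code>shutdown()</code> on the returned server or quit the prompt.</p>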
<h2 id="facing-a-hard-issue-on-this-one--ask-for-local-llm-help-">Facing a hard issue on this one ? Ask for local LLM help !</h2>
<p>Using managed services like GitHub Copilot is super handy, but it might be
costly (~100$/yr) and not suitable when developing in limited-bandwidth
environments. 🐌</p>
<p>To overcome these challenges, if you are the happy owner of an Apple M2 or M3 chip,
you will have enough computing power to run a local LLM in the 1B to 3B parameter range.</p>
<p>Luckily, the ARM architecture of the chip will also save us from completely draining the
battery. 🔋</p>
<p>To do so, before your first offline session, head to the <a href="https://tabby.tabbyml.com/docs/quick-start/installation/apple/">tabby documentation</a>,
a framework that makes local LLMs available to LSP servers.</p>
<p>Install tabby and launch it with :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>tabby serve --port <span style="color:#ae81ff">8889</span> --device metal --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct
</span></span></code></pre></div><p>Use the plugin for the IDE <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> of your choice, for instance <code>vim-tabby</code> on neovim,
with a bit of setup to avoid conflicts with Copilot :</p>
<p>Lazy.nvim configuration :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-lua" data-lang="lua"><span style="display:flex;"><span><span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>	<span style="color:#e6db74">&#34;TabbyML/vim-tabby&#34;</span>,
</span></span><span style="display:flex;"><span>	lazy <span style="color:#f92672">=</span> <span style="color:#66d9ef">false</span>,
</span></span><span style="display:flex;"><span>	dependencies <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>		<span style="color:#e6db74">&#34;neovim/nvim-lspconfig&#34;</span>,
</span></span><span style="display:flex;"><span>	},
</span></span><span style="display:flex;"><span>	init <span style="color:#f92672">=</span> <span style="color:#66d9ef">function</span>()
</span></span><span style="display:flex;"><span>		vim.g.tabby_agent_start_command <span style="color:#f92672">=</span> { <span style="color:#e6db74">&#34;npx&#34;</span>, <span style="color:#e6db74">&#34;tabby-agent&#34;</span>, <span style="color:#e6db74">&#34;--stdio&#34;</span> }
</span></span><span style="display:flex;"><span>		vim.g.tabby_inline_completion_trigger <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;manual&#34;</span>
</span></span><span style="display:flex;"><span>		vim.g.tabby_inline_completion_keybinding_accept <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&lt;leader&gt;%&#34;</span>
</span></span><span style="display:flex;"><span>	<span style="color:#66d9ef">end</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>and use leader + % to accept the completion suggested by the LLM.</p>
<p>If you are more familiar with a chat interface, head to <a href="http://0.0.0.0:8889/">http://0.0.0.0:8889/</a> to
use the chat web app ! 💬</p>
<p>Here is the result when defining a Fibonacci function :</p>
<p><img alt="tabby_suggestions" loading="lazy" src="/posts/2025/01/going-offline-efficiently/tabby.png#center"></p>
<p>Not so bad !</p>
<h2 id="conclusion">Conclusion</h2>
<p>Going offline is still an opportunity for deep work, even if the network connection is not so great.
It doesn&rsquo;t mean that you have to get stuck on every issue and try to debug it without documentation.</p>
<p>It&rsquo;s a great way of standing on <em>the shoulders of giants</em> <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>: tons of developers have worked to provide the most
efficient documentation, so RTFM. And nowadays, having a copilot is a great way to get straight-to-the-point
code suggestions, do not underestimate it.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>The DuckDB team provides several ways to browse the documentation offline, head to the <a href="https://duckdb.org/docs/guides/offline-copy.html">offline doc page</a> for more.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>The Tabby team also developed plugins for Visual Studio Code and IntelliJ. Have a look at the <a href="https://tabby.tabbyml.com/docs/extensions/troubleshooting/">documentation</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Knowledge is cumulative, see <a href="https://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants">Wikipedia</a>.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Saving Money, End to End DataEng dashboard showcase</title>
      <link>https://emilien-foissotte.github.io/posts/2024/05/streamlit-gas-stations/</link>
      <pubDate>Sun, 09 Jun 2024 18:53:22 +0200</pubDate>
      <guid>https://emilien-foissotte.github.io/posts/2024/05/streamlit-gas-stations/</guid>
      <description>&lt;h1 id=&#34;tldr&#34;&gt;TL;DR&lt;/h1&gt;
&lt;p&gt;This post takes a deep dive into building an end-to-end
data engineering project ⚙️ .&lt;/p&gt;
&lt;p&gt;The idea will be to retrieve a price list of gas stations in France ⛽,
create a job to extract it every day 📅 and craft a dashboard to expose those prices to
logged-in users 📊&lt;/p&gt;
&lt;p&gt;After reading this blog post, you&amp;rsquo;ll have the fundamentals to build a data
dashboard and scrape your own data sources 🚀&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="tldr">TL;DR</h1>
<p>This post takes a deep dive into building an end-to-end
data engineering project ⚙️ .</p>
<p>The idea will be to retrieve a price list of gas stations in France ⛽,
create a job to extract it every day 📅 and craft a dashboard to expose those prices to
logged-in users 📊</p>
<p>After reading this blog post, you&rsquo;ll have the fundamentals to build a data
dashboard and scrape your own data sources 🚀</p>
<p>Just a reader not interested in the technical details ? Have a look at the dashboard,
you&rsquo;ll save on your gas bill 🤑 and can reinvest the savings into the ecological transition towards
<a href="https://green-got.com/">a carbon-free world</a> (get 1 month for free with discount voucher <code>emilien-foissotte</code>) 😇</p>
<p>Let&rsquo;s go !</p>
<h2 id="intro">Intro</h2>
<p>I do not take my car often, but when I do, I always face a dilemma when it comes to filling it up at the gas station.. 🤨</p>
<p>In France 🇫🇷, we have public APIs exposing the prices of gas stations each day. However, the website is very clunky and there is no
way to save your favorite gas stations. 😭</p>
<p>So each time I had to fill up my gas tank, I had to look up the prices of surrounding stations on a website that is unfriendly on mobile.
Not so efficient.. 😓</p>
<p>A few years ago, as I was getting hands-on with Docker, my Raspberry Pi and Flask, I had the idea to expose a minimal web
page with my own stations. The backend was efficient, but in no way extensible. 💀</p>
<p>My friends and relatives had no way to enjoy the dashboard, as it was impossible to add new stations or
users. I was that close to telling them to open a ticket on the project board, just a casual job habit 😇. I wasn&rsquo;t lacking
motivation or time to do it, but the codebase was way too monolithic to make even small changes possible.. 🫠</p>
<p>Hence, my new idea was pretty clear : create a web-exposed dashboard, relying solely on free tiers, so that anyone could create an account,
pick their own stations and save on their gas bill ! ⛽</p>
<p><em>PS : BTW, the cheapest energy will always be the one you will not consume. Take your bike or your legs when you can,
that&rsquo;s better for your health, your wallet, your mind and for the planet!</em> 🌱</p>
<figure>
    <img loading="lazy" src="frontpage.png"
         alt="Landing page of the developed dashboard"/> <figcaption>
            <p>Landing page of the developed dashboard</p>
        </figcaption>
</figure>

<p>Live version available here <a href="https://carburoam.streamlit.app/">https://carburoam.streamlit.app/</a> ! 🚀</p>
<h2 id="extracting-price-data">Extracting price data</h2>
<p>First, as in every data engineering project, the cornerstone is the availability of the data.</p>
<p>To scrape and retrieve all gas station prices, we will use an Open Data platform which makes this
data available every day.</p>
<p>The format is pretty simple, it&rsquo;s not an API but a zipped dump file, containing XML data about all the stations in France, updated
with their prices. On a daily basis, the platform updates it, and the quality is rather good !</p>
<p><img alt="francais" loading="lazy" src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExamRud3FicTkxdnloZHRkMWt1MXo4bms5bXcwcmo1NDQyeHc2aXZ3ZyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/KcFV1Hm2vFMAolcHbu/giphy-downsized.gif#center"></p>
<p>Here is an extract of the file to demonstrate the format :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-xml" data-lang="xml"><span style="display:flex;"><span><span style="color:#75715e">&lt;?xml version=&#34;1.0&#34; encoding=&#34;ISO-8859-1&#34; standalone=&#34;yes&#34;?&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;pdv_liste&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;pdv</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;91190012&#34;</span> <span style="color:#a6e22e">latitude=</span><span style="color:#e6db74">&#34;4870300&#34;</span> <span style="color:#a6e22e">longitude=</span><span style="color:#e6db74">&#34;212800&#34;</span> <span style="color:#a6e22e">cp=</span><span style="color:#e6db74">&#34;91190&#34;</span> <span style="color:#a6e22e">pop=</span><span style="color:#e6db74">&#34;R&#34;</span><span style="color:#f92672">&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;adresse&gt;</span>27 AV DU GENERAL LECLERC<span style="color:#f92672">&lt;/adresse&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;ville&gt;</span>Gif-sur-Yvette<span style="color:#f92672">&lt;/ville&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;horaires</span> <span style="color:#a6e22e">automate-24-24=</span><span style="color:#e6db74">&#34;1&#34;</span><span style="color:#f92672">&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;jour</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;1&#34;</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;Lundi&#34;</span> <span style="color:#a6e22e">ferme=</span><span style="color:#e6db74">&#34;1&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;jour</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;2&#34;</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;Mardi&#34;</span> <span style="color:#a6e22e">ferme=</span><span style="color:#e6db74">&#34;1&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;jour</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;3&#34;</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;Mercredi&#34;</span> <span style="color:#a6e22e">ferme=</span><span style="color:#e6db74">&#34;1&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;jour</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;4&#34;</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;Jeudi&#34;</span> <span style="color:#a6e22e">ferme=</span><span style="color:#e6db74">&#34;1&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;jour</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;5&#34;</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;Vendredi&#34;</span> <span style="color:#a6e22e">ferme=</span><span style="color:#e6db74">&#34;1&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;jour</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;6&#34;</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;Samedi&#34;</span> <span style="color:#a6e22e">ferme=</span><span style="color:#e6db74">&#34;1&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;jour</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;7&#34;</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;Dimanche&#34;</span> <span style="color:#a6e22e">ferme=</span><span style="color:#e6db74">&#34;1&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;/horaires&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;services&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;service&gt;</span>Station de gonflage<span style="color:#f92672">&lt;/service&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;service&gt;</span>Carburant additiv<span style="color:#f92672">&lt;E9&gt;&lt;/service&gt;</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&lt;service&gt;</span>Automate CB 24/24<span style="color:#f92672">&lt;/service&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;/services&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;prix</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;Gazole&#34;</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;1&#34;</span> <span style="color:#a6e22e">maj=</span><span style="color:#e6db74">&#34;2024-05-08 12:30:00&#34;</span> <span style="color:#a6e22e">valeur=</span><span style="color:#e6db74">&#34;1.739&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;prix</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;SP98&#34;</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;6&#34;</span> <span style="color:#a6e22e">maj=</span><span style="color:#e6db74">&#34;2024-05-11 13:15:00&#34;</span> <span style="color:#a6e22e">valeur=</span><span style="color:#e6db74">&#34;1.999&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;rupture</span> <span style="color:#a6e22e">nom=</span><span style="color:#e6db74">&#34;E10&#34;</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#34;5&#34;</span> <span style="color:#a6e22e">debut=</span><span style="color:#e6db74">&#34;2024-05-10 16:16:02&#34;</span> <span style="color:#a6e22e">fin=</span><span style="color:#e6db74">&#34;&#34;</span> <span style="color:#a6e22e">type=</span><span style="color:#e6db74">&#34;temporaire&#34;</span><span style="color:#f92672">/&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;/pdv&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;/pdv_liste&gt;</span>
</span></span></code></pre></div><p>First of all, the good news is that our gas stations, identified by the XML element <code>pdv</code> (which stands for &lsquo;point de vente&rsquo; in French,
a point of sale), are given a unique id. All the objects are well defined; nonetheless, the format could change abruptly, without notice.</p>
<p>That&rsquo;s the downside of relying on a file dump: some REST APIs guarantee that routes will not change drastically over time without a
major version increment. Here, the only way we will catch such a change is when the ETL breaks..</p>
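<p>One way to notice a format change before the ETL blows up halfway through is a small sanity check on each downloaded dump. The sketch below is not part of the actual project: the <code>EXPECTED_ATTRIBUTES</code> contract and the <code>find_format_drift</code> helper are hypothetical, and only cover the attributes visible in the extract above :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import xml.etree.ElementTree as ET

# Hypothetical contract: the attributes the ETL relies on, per element.
EXPECTED_ATTRIBUTES = {
    "pdv": {"id", "latitude", "longitude", "cp"},
    "prix": {"nom", "id", "maj", "valeur"},
}


def find_format_drift(xml_bytes):
    """Return a list of problems; an empty list means the dump looks as expected."""
    root = ET.fromstring(xml_bytes)
    problems = []
    if root.tag != "pdv_liste":
        problems.append(f"unexpected root element: {root.tag}")
    for tag, expected in EXPECTED_ATTRIBUTES.items():
        for element in root.iter(tag):
            # Set difference: attributes we rely on that are absent here.
            missing = expected - set(element.attrib)
            if missing:
                problems.append(f"{tag} is missing attributes: {sorted(missing)}")
    return problems
</code></pre></div>
<p>Running it right after the download lets the job fail early with an explicit message, instead of loading half-parsed data.</p>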
<p>Anyway, there is still more good news. We can see that latitude and longitude are provided for our stations, which makes
them perfect for a nice display on a map. I have always found, from a user perspective, that maps are better for picking locations than
street or town names.</p>
<p>Additionally, a list of <code>prix</code> elements provides the prices and the date of last update for the related gas station.
Some data will not be used, as it is not useful for the exposed dashboard, for instance opening hours, services provided by
the station, etc..</p>
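<p>To make this concrete, here is a rough sketch of how such a dump could be read with the standard library <code>xml.etree.ElementTree</code> module. The <code>parse_stations</code> helper is hypothetical, and the division by 100000 is an assumption on my side, based on how the coordinates look in the extract (<code>4870300</code> for a latitude around 48.703) :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import xml.etree.ElementTree as ET


def parse_stations(xml_bytes):
    """Yield one dict per `pdv` element, keeping only what the dashboard needs."""
    root = ET.fromstring(xml_bytes)
    for pdv in root.iter("pdv"):
        yield {
            "id": pdv.get("id"),
            # Assumption: coordinates are stored as degrees multiplied by 100000.
            "latitude": int(pdv.get("latitude")) / 100_000,
            "longitude": int(pdv.get("longitude")) / 100_000,
            "zip_code": pdv.get("cp"),
            # Opening hours, services and `rupture` elements are skipped on purpose.
            "prices": [
                {
                    "fuel": prix.get("nom"),
                    "updated_at": prix.get("maj"),
                    "value": float(prix.get("valeur")),
                }
                for prix in pdv.findall("prix")
            ],
        }
</code></pre></div>
<p>Passing the raw bytes of the file lets the parser honor the <code>ISO-8859-1</code> encoding declared in the XML header. A real dump may contain stations with empty coordinates, which this sketch does not handle.</p>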
<p>Let&rsquo;s build up some entities based on the information we can gather from this flat file, plus some information about users.</p>
<h3 id="managing-users">Managing users</h3>
<p>To manage users, passwords and accounts, we will use a simple user table holding only email, name and username. All the details of hashed
passwords and JWT are managed by an external Streamlit library : <a href="https://github.com/mkhorasani/Streamlit-Authenticator">Streamlit-Authenticator</a>.</p>
<p>In order to mirror each user loaded by this library, this table will be populated from the library&rsquo;s records, but without any credentials.
Let&rsquo;s apply the <em>least privilege</em> principle: there is absolutely no need to store a hashed password in the DB here, so let&rsquo;s lighten
the data schema on this side.</p>
<p>Unfortunately, for the way I intended to use it, the initial library had a major security flaw: anyone could trigger a
password reset for any user, without any confirmation mechanism.</p>
<p><img alt="password" loading="lazy" src="https://media3.giphy.com/media/v1.Y2lkPTc5MGI3NjExNG1lc2Q0ZjY5azV5aHQzcm9vZWpxZzFsdWdrcHRnMDZiN3dieXh1aCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/l0G17mcoGBEabVgn6/giphy.gif#center"></p>
<p>To ensure that the user triggering the password reset is really the account owner, we will send a confirmation code to the user by email,
which they will have to enter to finalize the password reset.</p>
<p>To store and enable this application logic, a table containing the verification codes sent for password resets has to be
created. Passwords will never be stored; they will only be sent on the fly to users by email, upon reset.</p>
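<p>As an illustration, the verification code logic could look like the sketch below. It is not the code of the dashboard: the helper names, the 6-digit format and the 15-minute validity window are all assumptions, and storing only a hash of the code (rather than the code itself) is my own choice here :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone

CODE_TTL = timedelta(minutes=15)  # hypothetical validity window


def hash_code(code):
    """Hash the code so the plain value never has to sit in the table."""
    return hashlib.sha256(code.encode()).hexdigest()


def issue_code(now=None):
    """Return (code to send by email, row to store in the verification table)."""
    now = now or datetime.now(timezone.utc)
    # `secrets` is meant for security-sensitive randomness, unlike `random`.
    code = f"{secrets.randbelow(10**6):06d}"
    row = {"code_hash": hash_code(code), "expires_at": now + CODE_TTL}
    return code, row


def check_code(submitted, row, now=None):
    """True only if the submitted code matches and has not expired."""
    now = now or datetime.now(timezone.utc)
    if now > row["expires_at"]:
        return False
    # Constant-time comparison, to avoid leaking information through timing.
    return hmac.compare_digest(hash_code(submitted), row["code_hash"])
</code></pre></div>
<p>The row would be linked to the user in the verification table and deleted once used. Limiting the number of attempts is left out of this sketch, but matters with codes this short.</p>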
<h3 id="managing-stations-and-prices">Managing Stations and Prices</h3>
<p>To let users create some customized stations, a table <code>Custom_stations</code> will be derived from the table <code>Stations</code>.
Each instance of a <code>Price</code> item will be associated with a Station. A price will also be associated with a type of
gas, i.e. diesel, unleaded, ethanol-derived fuel..</p>
<p>In order to track which gas types the user would like to subscribe to, an association table will be declared to link a
user to a gas type. This table creates a link at the <code>SQLAlchemy</code> ORM level between a user and a type of fuel from the
table <code>gas_types</code>.</p>
<p>Below is a diagram of all the entities and their associations (1:1 or Many to One are not represented,
but PK and FK are, and should be enough to read it).</p>
<p><img alt="EA" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/EA_GasStations.png#center"></p>
<h3 id="mirroring-theses-entities-using-an-orm">Mirroring these entities using an ORM</h3>
<p>In order to use all these tables conveniently, we will declare <code>SQLAlchemy</code> classes and link them using the version <code>2</code> declarative
implementation.</p>
<p>Here is the declaration of the previously mentioned classes, stored in the <code>models.py</code> module of our application.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> typing <span style="color:#f92672">import</span> List
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> sqlalchemy <span style="color:#66d9ef">as</span> sa
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sqlalchemy.orm <span style="color:#f92672">import</span> DeclarativeBase, Mapped, mapped_column, relationship
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Base</span>(DeclarativeBase):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">pass</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>association_table <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Table(
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;association_table&#34;</span>,
</span></span><span style="display:flex;"><span>    Base<span style="color:#f92672">.</span>metadata,
</span></span><span style="display:flex;"><span>    sa<span style="color:#f92672">.</span>Column(<span style="color:#e6db74">&#34;gastype_id&#34;</span>, sa<span style="color:#f92672">.</span>ForeignKey(<span style="color:#e6db74">&#34;gas_types.id&#34;</span>), primary_key<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>),
</span></span><span style="display:flex;"><span>    sa<span style="color:#f92672">.</span>Column(<span style="color:#e6db74">&#34;user_id&#34;</span>, sa<span style="color:#f92672">.</span>ForeignKey(<span style="color:#e6db74">&#34;users.id&#34;</span>), primary_key<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>),
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">GasType</span>(Base):
</span></span><span style="display:flex;"><span>    __tablename__ <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;gas_types&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># id = sa.Column(sa.Integer, primary_key=True)</span>
</span></span><span style="display:flex;"><span>    id: Mapped[int] <span style="color:#f92672">=</span> mapped_column(primary_key<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    xml_id <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    name <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    users: Mapped[List[<span style="color:#e6db74">&#34;User&#34;</span>]] <span style="color:#f92672">=</span> relationship(
</span></span><span style="display:flex;"><span>        secondary<span style="color:#f92672">=</span>association_table, back_populates<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gastypes&#34;</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __repr__(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;&lt;GasType </span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>name<span style="color:#e6db74">}</span><span style="color:#e6db74">&gt;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">User</span>(Base):
</span></span><span style="display:flex;"><span>    __tablename__ <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;users&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># id = sa.Column(sa.Integer, primary_key=True)</span>
</span></span><span style="display:flex;"><span>    id: Mapped[int] <span style="color:#f92672">=</span> mapped_column(primary_key<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    email <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, unique<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    username <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, unique<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    name <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># add a reference to the stations</span>
</span></span><span style="display:flex;"><span>    stations <span style="color:#f92672">=</span> relationship(<span style="color:#e6db74">&#34;CustomStation&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># add a reference to the gas types followed</span>
</span></span><span style="display:flex;"><span>    gastypes: Mapped[List[<span style="color:#e6db74">&#34;GasType&#34;</span>]] <span style="color:#f92672">=</span> relationship(
</span></span><span style="display:flex;"><span>        secondary<span style="color:#f92672">=</span>association_table, back_populates<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;users&#34;</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __repr__(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;&lt;User </span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>username<span style="color:#e6db74">}</span><span style="color:#e6db74">&gt;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">to_dict</span>(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;id&#34;</span>: self<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;email&#34;</span>: self<span style="color:#f92672">.</span>email,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;username&#34;</span>: self<span style="color:#f92672">.</span>username,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;name&#34;</span>: self<span style="color:#f92672">.</span>name,
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">to_csv</span>(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">,</span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>email<span style="color:#e6db74">}</span><span style="color:#e6db74">,</span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>username<span style="color:#e6db74">}</span><span style="color:#e6db74">,</span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>name<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">VerificationCode</span>(Base):
</span></span><span style="display:flex;"><span>    __tablename__ <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;verification_codes&#34;</span>
</span></span><span style="display:flex;"><span>    id <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Integer, primary_key<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    user_id <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Integer, sa<span style="color:#f92672">.</span>ForeignKey(<span style="color:#e6db74">&#34;users.id&#34;</span>), nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    code <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    created_at <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>DateTime, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __repr__(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;&lt;VerificationCode </span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&gt;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Station</span>(Base):
</span></span><span style="display:flex;"><span>    __tablename__ <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;stations&#34;</span>
</span></span><span style="display:flex;"><span>    id <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Integer, primary_key<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    latitude <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Float, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    longitude <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Float, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    town <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    address <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    zip_code <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    sa<span style="color:#f92672">.</span>Index(<span style="color:#e6db74">&#34;latitude_longitude_index&#34;</span>, latitude, longitude, unique<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __repr__(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;&lt;Station </span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&gt;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">to_dict</span>(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;id&#34;</span>: self<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;latitude&#34;</span>: self<span style="color:#f92672">.</span>latitude,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;longitude&#34;</span>: self<span style="color:#f92672">.</span>longitude,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;town&#34;</span>: self<span style="color:#f92672">.</span>town,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;address&#34;</span>: self<span style="color:#f92672">.</span>address,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;zip_code&#34;</span>: self<span style="color:#f92672">.</span>zip_code,
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">CustomStation</span>(Base):
</span></span><span style="display:flex;"><span>    __tablename__ <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;custom_stations&#34;</span>
</span></span><span style="display:flex;"><span>    id <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Integer, sa<span style="color:#f92672">.</span>ForeignKey(<span style="color:#e6db74">&#34;stations.id&#34;</span>), primary_key<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    user_id <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Integer, sa<span style="color:#f92672">.</span>ForeignKey(<span style="color:#e6db74">&#34;users.id&#34;</span>), nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    custom_name <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>String, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __repr__(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;&lt;CustomStation </span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">-</span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>user_id<span style="color:#e6db74">}</span><span style="color:#e6db74">&gt;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">to_dict</span>(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;id&#34;</span>: self<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;user_id&#34;</span>: self<span style="color:#f92672">.</span>user_id,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;custom_name&#34;</span>: self<span style="color:#f92672">.</span>custom_name,
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Price</span>(Base):
</span></span><span style="display:flex;"><span>    __tablename__ <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;prices&#34;</span>
</span></span><span style="display:flex;"><span>    gastype_id <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Integer, sa<span style="color:#f92672">.</span>ForeignKey(<span style="color:#e6db74">&#34;gas_types.id&#34;</span>), primary_key<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    station_id <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Integer, sa<span style="color:#f92672">.</span>ForeignKey(<span style="color:#e6db74">&#34;stations.id&#34;</span>), primary_key<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    updated_at <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>DateTime, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>    price <span style="color:#f92672">=</span> sa<span style="color:#f92672">.</span>Column(sa<span style="color:#f92672">.</span>Float, nullable<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span></code></pre></div><p>Fair enough !</p>
<p>How do we bind this to our Streamlit app ?</p>
<p>Nothing more complicated than instantiating a <code>Session</code> object !</p>
<p>Let&rsquo;s dive a little deeper into the code, in the <code>session.py</code> module of the app :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> logging
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> functools <span style="color:#f92672">import</span> lru_cache
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> typing <span style="color:#f92672">import</span> Generator
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> sqlalchemy
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> streamlit <span style="color:#66d9ef">as</span> st
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sqlalchemy <span style="color:#f92672">import</span> create_engine
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sqlalchemy.orm <span style="color:#f92672">import</span> scoped_session, sessionmaker
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sqlalchemy_utils <span style="color:#f92672">import</span> create_database, database_exists
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> models <span style="color:#f92672">import</span> Base, GasType
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>logger <span style="color:#f92672">=</span> logging<span style="color:#f92672">.</span>getLogger(<span style="color:#e6db74">&#34;gas_station_app&#34;</span>)
</span></span><span style="display:flex;"><span>engine <span style="color:#f92672">=</span> create_engine(<span style="color:#e6db74">&#34;sqlite:///db.sqlite3&#34;</span>, pool_pre_ping<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@lru_cache</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">create_session</span>() <span style="color:#f92672">-&gt;</span> scoped_session:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Create a session given the url in settings.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    Session <span style="color:#f92672">=</span> scoped_session(
</span></span><span style="display:flex;"><span>        sessionmaker(autocommit<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, autoflush<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, bind<span style="color:#f92672">=</span>engine)
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> Session
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_session</span>() <span style="color:#f92672">-&gt;</span> Generator[scoped_session, <span style="color:#66d9ef">None</span>, <span style="color:#66d9ef">None</span>]:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Retrieve a session.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    Session <span style="color:#f92672">=</span> create_session()
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">yield</span> Session
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">finally</span>:
</span></span><span style="display:flex;"><span>        Session<span style="color:#f92672">.</span>remove()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>database_creation <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>db_session <span style="color:#f92672">=</span> create_session()
</span></span><span style="display:flex;"><span>logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;session created&#34;</span>)
</span></span><span style="display:flex;"><span>created_engine <span style="color:#f92672">=</span> db_session<span style="color:#f92672">.</span>bind
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> database_exists(created_engine<span style="color:#f92672">.</span>url):
</span></span><span style="display:flex;"><span>    logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Database does not exist, creating it&#34;</span>)
</span></span><span style="display:flex;"><span>    create_database(created_engine<span style="color:#f92672">.</span>url)
</span></span><span style="display:flex;"><span>    database_creation <span style="color:#f92672">=</span> <span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>Base<span style="color:#f92672">.</span>metadata<span style="color:#f92672">.</span>bind <span style="color:#f92672">=</span> engine
</span></span><span style="display:flex;"><span>Base<span style="color:#f92672">.</span>metadata<span style="color:#f92672">.</span>create_all(bind<span style="color:#f92672">=</span>created_engine)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">### initialize the database with mandatory data</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">create_gastypes</span>(db_session):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Create the gas types in the database.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Args:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        db_session: sqlalchemy session
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Returns:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        None
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Creating gas types&#34;</span>)
</span></span><span style="display:flex;"><span>    gas_dict <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#34;Gazole&#34;</span>: <span style="color:#ae81ff">1</span>, <span style="color:#e6db74">&#34;SP95&#34;</span>: <span style="color:#ae81ff">2</span>, <span style="color:#e6db74">&#34;SP98&#34;</span>: <span style="color:#ae81ff">6</span>, <span style="color:#e6db74">&#34;E85&#34;</span>: <span style="color:#ae81ff">3</span>, <span style="color:#e6db74">&#34;GPLc&#34;</span>: <span style="color:#ae81ff">4</span>, <span style="color:#e6db74">&#34;E10&#34;</span>: <span style="color:#ae81ff">5</span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> name, xml_id <span style="color:#f92672">in</span> gas_dict<span style="color:#f92672">.</span>items():
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> db_session<span style="color:#f92672">.</span>query(GasType)<span style="color:#f92672">.</span>filter(GasType<span style="color:#f92672">.</span>name <span style="color:#f92672">==</span> name)<span style="color:#f92672">.</span>first():
</span></span><span style="display:flex;"><span>            gas_type <span style="color:#f92672">=</span> GasType(name<span style="color:#f92672">=</span>name, xml_id<span style="color:#f92672">=</span>xml_id)
</span></span><span style="display:flex;"><span>            db_session<span style="color:#f92672">.</span>add(gas_type)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>        db_session<span style="color:#f92672">.</span>commit()
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">except</span> sqlalchemy<span style="color:#f92672">.</span>exc<span style="color:#f92672">.</span>IntegrityError:
</span></span><span style="display:flex;"><span>        db_session<span style="color:#f92672">.</span>rollback()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> database_creation:
</span></span><span style="display:flex;"><span>    create_gastypes(db_session<span style="color:#f92672">=</span>db_session)
</span></span></code></pre></div><p>What&rsquo;s happening here is very simple: when the app loads, we instantiate our session (this module is imported by the
<code>home.py</code> module, the main entry point of the application).</p>
<p>The engine is created first :
<code>engine = create_engine(&quot;sqlite:///db.sqlite3&quot;, pool_pre_ping=True)</code> creates the SQLAlchemy engine, to which the session factory is then bound.</p>
<p>As <code>create_session</code> is wrapped in an LRU cache and takes no argument, its body runs only once.
Every later call, including the ones made through <code>get_session</code>, gets back the same <code>scoped_session</code> object. Note that this cache never expires on its own.</p>
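<p>The effect of <code>lru_cache</code> on a function without arguments can be checked with a tiny standard-library sketch. <code>FakeSessionFactory</code> and the <code>calls</code> counter are hypothetical stand-ins, used here instead of a real <code>scoped_session</code> so that the example runs without a database.</p>

```python
from functools import lru_cache

calls = {"count": 0}  # counts how many times the function body really runs


class FakeSessionFactory:
    """Hypothetical stand-in for the scoped_session built in session.py."""


@lru_cache
def create_session() -> FakeSessionFactory:
    calls["count"] += 1
    return FakeSessionFactory()


first = create_session()
second = create_session()

same_object = first is second  # the cached object is handed back
body_runs = calls["count"]  # the body ran a single time

# The cache does not expire by itself; it has to be cleared explicitly.
create_session.cache_clear()
third = create_session()  # a brand new object is built after the clear
```

<p>In other words, the decorator turns <code>create_session</code> into a lazy singleton: one factory for the whole process, unless <code>cache_clear()</code> is called.</p>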
<p>What is this weird function <code>create_gastypes</code> ?</p>
<p>If the SQLite database file does not exist yet, it is created, and <code>create_all</code>
then creates the empty tables.
But to work properly, our <code>gas_types</code> table has to be populated with data from the specifications of the <a href="https://www.prix-carburants.gouv.fr/rubrique/opendata/">Open Data API</a>.</p>
<p><img alt="specs" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/specs.png#center"></p>
<p>There is no way to build smart logic here, so let&rsquo;s hardcode them, and hope that they will not change over time..</p>
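<p>As a rough illustration of idempotent seeding, here is a sketch with the standard <code>sqlite3</code> module instead of SQLAlchemy. The column names, the <code>seed_gas_types</code> helper and the listed values are assumptions made for the example; the real identifiers must be taken from the Open Data specification.</p>

```python
import sqlite3

# Illustrative values only: check them against the Open Data specification.
GAS_TYPES = [
    (1, "Gazole"),
    (2, "SP95"),
    (3, "E85"),
    (4, "GPLc"),
    (5, "E10"),
    (6, "SP98"),
]


def seed_gas_types(conn: sqlite3.Connection) -> int:
    """Insert the reference rows, skipping those already present."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gas_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    # INSERT OR IGNORE silently skips rows whose primary key already exists.
    conn.executemany("INSERT OR IGNORE INTO gas_type (id, name) VALUES (?, ?)", GAS_TYPES)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM gas_type").fetchone()[0]


conn = sqlite3.connect(":memory:")
first_count = seed_gas_types(conn)
second_count = seed_gas_types(conn)  # running it twice adds nothing
print(first_count, second_count)
```

<p>The SQLAlchemy version shown earlier gets the same effect by rolling back on <code>IntegrityError</code>.</p>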
<p>That&rsquo;s all ! Every other module can import <code>db_session</code> from this module, and it&rsquo;ll do the trick 🚀</p>
<p>Now our whole data warehouse is ready to be filled with data, let&rsquo;s review the ETL process.</p>
<p><em>NB: I won&rsquo;t cover the Streamlit-Authenticator related elements as they are well described in the GitHub documentation of the package,
feel free to have a look at it, it is very convenient.</em></p>
<h2 id="daily-retrieval-of-data">Daily retrieval of data</h2>
<p>Let&rsquo;s sum up what we have so far :</p>
<ul>
<li>A free Streamlit workspace, which can read data from a local SQLite file</li>
<li>A flat file with price data</li>
<li>A UI at <a href="https://carburoam.streamlit.app/">carburoam.streamlit.app</a> which only displays the Streamlit app</li>
</ul>
<p>Where is the ETL in all this ?</p>
<p>Indeed, we are missing a crucial part of a data engineering project : an orchestration tool. If I had
an Airflow instance somewhere, I would definitely go for a simple DAG here. But we do not
have one. So we will build something much simpler.</p>
<h3 id="pure-python-job-orchestrator-implementation">Pure python job orchestrator implementation</h3>
<p>We will only leverage the main Python process of the Streamlit app, and spawn a subprocess to run the whole
update mechanism.
It only contains a thread with a timer, which triggers a task to update the prices.</p>
<p><img alt="update" loading="lazy" src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExY3R5dXFuMjI1MjI5ZXlrb3phazQydmg0cDFleGN3NWpucHJhM3dnbyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/j3WJjjm1OKV73l6E6e/giphy.gif#center"></p>
<p>It&rsquo;s a kind of &ldquo;Hello World&rdquo; of CRON jobs, let&rsquo;s review step by step how it&rsquo;s achieved :</p>
  <div class="mermaid" align="center">
    
flowchart TD
U[User] -->|Load Landing Page| L{Streamlit app}
L -->|pid.txt file <br/> exists| PE[Do not trigger subprocess]
L -->|pid.txt file doesn't <br/> exist| PN[Read last job execution]
PN -->|lastjob.txt file <br/> exists| LE[Check current date]
LE -->|delta since last execution <br/> less than threshold| LT[Do not trigger subprocess]
LE -->|delta since last execution <br/> more than threshold| MT[Trigger subprocess <br/> and delete logs older than 1 day]
PN -->|lastjob.txt file doesn't <br/> exist| LN[Trigger subprocess <br/> and delete logs older than 1 day]

  </div>
<p>Is that all ? Pretty much, yes. Using the database could make things a little
bit more complex, so we will only check whether the subprocess has created a file <code>pid.txt</code>,
containing its PID, and another file, <code>lastjob.txt</code>, containing the date of the last job execution.</p>
<p>This way, it will not hammer the database during development, when we have to redeploy the app often to test
stuff. And if something goes wrong with the ETL, we can kill the previous subprocess using its PID and start a new one by removing
<code>pid.txt</code>.</p>
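<p>The decision described above can be sketched as a small pure function, which makes it easy to test without touching the filesystem. The name <code>should_trigger</code> and its signature are hypothetical; <code>WAIT_TIME_SECONDS</code> mirrors the six-hour threshold used by the app.</p>

```python
from datetime import datetime, timedelta
from typing import Optional

WAIT_TIME_SECONDS = 60 * 60 * 6  # six hours, the threshold used by the app


def should_trigger(pid_exists: bool, last_job: Optional[datetime], now: datetime) -> bool:
    """Return True when a new ETL subprocess should be started."""
    if pid_exists:
        return False  # a subprocess already owns the job
    if last_job is None:
        return True  # no trace of a previous run
    return (now - last_job).total_seconds() > WAIT_TIME_SECONDS


now = datetime(2024, 5, 1, 12, 0, 0)
recent = now - timedelta(hours=1)
stale = now - timedelta(hours=7)

print(should_trigger(False, stale, now))   # stale run and no PID file
print(should_trigger(False, recent, now))  # recent run
print(should_trigger(True, stale, now))    # a subprocess is already registered
```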
<p>Additionally, to help a little bit during debugging, the stdout and stderr of the script
are routed to text files in an <code>outputs</code> folder.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> logging
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> subprocess
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> sys
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> uuid
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> datetime <span style="color:#f92672">import</span> datetime
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> pathlib <span style="color:#f92672">import</span> Path
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> streamlit <span style="color:#66d9ef">as</span> st
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> utils <span style="color:#f92672">import</span> WAIT_TIME_SECONDS
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>logger <span style="color:#f92672">=</span> logging<span style="color:#f92672">.</span>getLogger(<span style="color:#e6db74">&#34;gas_station_app&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">trigger_etl</span>():
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Trigger the ETL process in a subprocess.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># create a new uuid for process opening</span>
</span></span><span style="display:flex;"><span>    str_uuid <span style="color:#f92672">=</span> str(uuid<span style="color:#f92672">.</span>uuid4())
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;outputs/stdout_</span><span style="color:#e6db74">{</span>str_uuid<span style="color:#e6db74">}</span><span style="color:#e6db74">.txt&#34;</span>, <span style="color:#e6db74">&#34;wb&#34;</span>) <span style="color:#66d9ef">as</span> out, open(
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;outputs/stderr_</span><span style="color:#e6db74">{</span>str_uuid<span style="color:#e6db74">}</span><span style="color:#e6db74">.txt&#34;</span>, <span style="color:#e6db74">&#34;wb&#34;</span>
</span></span><span style="display:flex;"><span>    ) <span style="color:#66d9ef">as</span> err:
</span></span><span style="display:flex;"><span>        subprocess<span style="color:#f92672">.</span>Popen([<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#e6db74">{</span>sys<span style="color:#f92672">.</span>executable<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>, <span style="color:#e6db74">&#34;utils.py&#34;</span>], stdout<span style="color:#f92672">=</span>out, stderr<span style="color:#f92672">=</span>err)
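</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">check_etl_status</span>():  <span style="color:#75715e"># hypothetical name, the original def line is missing from this listing</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>exists(<span style="color:#e6db74">&#34;pid.txt&#34;</span>):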
</span></span><span style="display:flex;"><span>        logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;No pid file found, creating one&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># if it doesn&#39;t exist, trigger the subprocess job</span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># delete and remove output files under outputs</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> file <span style="color:#f92672">in</span> Path(<span style="color:#e6db74">&#34;outputs&#34;</span>)<span style="color:#f92672">.</span>glob(<span style="color:#e6db74">&#34;*.txt&#34;</span>):
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># get last modified date</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>                last_modified <span style="color:#f92672">=</span> datetime<span style="color:#f92672">.</span>fromtimestamp(file<span style="color:#f92672">.</span>stat()<span style="color:#f92672">.</span>st_mtime)
</span></span><span style="display:flex;"><span>                <span style="color:#75715e"># if the file is older than 1 day, remove it</span>
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> (datetime<span style="color:#f92672">.</span>now() <span style="color:#f92672">-</span> last_modified)<span style="color:#f92672">.</span>days <span style="color:#f92672">&gt;=</span> <span style="color:#ae81ff">1</span>:
</span></span><span style="display:flex;"><span>                    logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Removing </span><span style="color:#e6db74">{</span>file<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>                    file<span style="color:#f92672">.</span>unlink()
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">FileNotFoundError</span>:
</span></span><span style="display:flex;"><span>                <span style="color:#75715e"># it means another process has deleted the file</span>
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">pass</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>exists(<span style="color:#e6db74">&#34;lastjob.txt&#34;</span>):
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># check the last job date, do not start subprocess if recent</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;lastjob.txt&#34;</span>, <span style="color:#e6db74">&#34;r&#34;</span>) <span style="color:#66d9ef">as</span> file:
</span></span><span style="display:flex;"><span>                date <span style="color:#f92672">=</span> file<span style="color:#f92672">.</span>read()
</span></span><span style="display:flex;"><span>                <span style="color:#75715e"># parse the date (dumped as datetime.now())</span>
</span></span><span style="display:flex;"><span>                date <span style="color:#f92672">=</span> datetime<span style="color:#f92672">.</span>strptime(date, <span style="color:#e6db74">&#34;%Y-%m-</span><span style="color:#e6db74">%d</span><span style="color:#e6db74"> %H:%M:%S.</span><span style="color:#e6db74">%f</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>                st<span style="color:#f92672">.</span>session_state[<span style="color:#e6db74">&#34;lastjob&#34;</span>] <span style="color:#f92672">=</span> date
</span></span><span style="display:flex;"><span>                <span style="color:#75715e"># if the delta from now is greater than WAIT_TIME_SECONDS</span>
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> (datetime<span style="color:#f92672">.</span>now() <span style="color:#f92672">-</span> date)<span style="color:#f92672">.</span>total_seconds() <span style="color:#f92672">&gt;</span> WAIT_TIME_SECONDS:
</span></span><span style="display:flex;"><span>                    logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Last job was not recent, starting new job&#34;</span>)
</span></span><span style="display:flex;"><span>                    trigger_etl()
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>                    logger<span style="color:#f92672">.</span>info(<span style="color:#e6db74">&#34;Last job was recent, skipping&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>            trigger_etl()
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>exists(<span style="color:#e6db74">&#34;lastjob.txt&#34;</span>):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;lastjob.txt&#34;</span>, <span style="color:#e6db74">&#34;r&#34;</span>) <span style="color:#66d9ef">as</span> file:
</span></span><span style="display:flex;"><span>            date <span style="color:#f92672">=</span> file<span style="color:#f92672">.</span>read()
</span></span><span style="display:flex;"><span>            date <span style="color:#f92672">=</span> datetime<span style="color:#f92672">.</span>strptime(date, <span style="color:#e6db74">&#34;%Y-%m-</span><span style="color:#e6db74">%d</span><span style="color:#e6db74"> %H:%M:%S.</span><span style="color:#e6db74">%f</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>            st<span style="color:#f92672">.</span>session_state[<span style="color:#e6db74">&#34;lastjob&#34;</span>] <span style="color:#f92672">=</span> date
</span></span></code></pre></div><p>This way we can get a nice metric to display the last time the ETL ran :</p>
<p><img alt="metric_date" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/metric_date.png#center"></p>
<p><em>NB: Using <code>sys.executable</code> is very important to be sure that we run the same Python executable
as the Streamlit app, with all dependencies installed. Calling <code>python</code> directly could cause unexpected bugs.</em></p>
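<p>Here is a small standalone sketch of that idea: the child process is started with <code>sys.executable</code>, its stdout is routed to a file, and it reports which interpreter is running it. The temporary directory and the one-line child script are only there for the demonstration.</p>

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "stdout.txt"
    with open(out_path, "wb") as out:
        # The child prints the path of the interpreter that runs it.
        proc = subprocess.Popen(
            [sys.executable, "-c", "import sys; print(sys.executable)"],
            stdout=out,
        )
        proc.wait()
    child_executable = out_path.read_text().strip()

print("parent:", sys.executable)
print("child: ", child_executable)
```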
<p>How about the timed thread implementation ?</p>
<p>Pretty simple too :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> os
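</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> signal
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> threading
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> datetime <span style="color:#f92672">import</span> datetime
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> threading <span style="color:#f92672">import</span> Timer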
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>WAIT_TIME_SECONDS <span style="color:#f92672">=</span> <span style="color:#ae81ff">60</span> <span style="color:#f92672">*</span> <span style="color:#ae81ff">60</span> <span style="color:#f92672">*</span> <span style="color:#ae81ff">6</span>  <span style="color:#75715e"># each 6 hours</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ProgramKilled</span>(<span style="color:#a6e22e">Exception</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">pass</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">signal_handler</span>(signum, frame):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">raise</span> ProgramKilled
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Job</span>(threading<span style="color:#f92672">.</span>Thread):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __init__(self, interval, execute, <span style="color:#f92672">*</span>args, <span style="color:#f92672">**</span>kwargs):
</span></span><span style="display:flex;"><span>        threading<span style="color:#f92672">.</span>Thread<span style="color:#f92672">.</span>__init__(self)
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>daemon <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>stopped <span style="color:#f92672">=</span> threading<span style="color:#f92672">.</span>Event()
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>interval <span style="color:#f92672">=</span> interval
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>execute <span style="color:#f92672">=</span> execute
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>args <span style="color:#f92672">=</span> args
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>kwargs <span style="color:#f92672">=</span> kwargs
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">stop</span>(self):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>stopped<span style="color:#f92672">.</span>set()
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>join()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run</span>(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">while</span> <span style="color:#f92672">not</span> self<span style="color:#f92672">.</span>stopped<span style="color:#f92672">.</span>wait(self<span style="color:#f92672">.</span>interval<span style="color:#f92672">.</span>total_seconds()):
</span></span><span style="display:flex;"><span>            self<span style="color:#f92672">.</span>execute(<span style="color:#f92672">*</span>self<span style="color:#f92672">.</span>args, <span style="color:#f92672">**</span>self<span style="color:#f92672">.</span>kwargs)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">main_etl</span>():
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#34;Running ETL job at &#34;</span>, datetime<span style="color:#f92672">.</span>now())
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># print the process pid</span>
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#34;Process ID: &#34;</span>, os<span style="color:#f92672">.</span>getpid())
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;lastjob.txt&#34;</span>, <span style="color:#e6db74">&#34;w&#34;</span>) <span style="color:#66d9ef">as</span> file:
</span></span><span style="display:flex;"><span>        file<span style="color:#f92672">.</span>write(str(datetime<span style="color:#f92672">.</span>now()))
</span></span><span style="display:flex;"><span>    loadXML()
</span></span><span style="display:flex;"><span>    dump_stations()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">etl_job</span>():
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># check if status file exists</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>exists(<span style="color:#e6db74">&#34;pid.txt&#34;</span>):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;pid.txt&#34;</span>, <span style="color:#e6db74">&#34;w&#34;</span>) <span style="color:#66d9ef">as</span> file:
</span></span><span style="display:flex;"><span>            file<span style="color:#f92672">.</span>write(str(os<span style="color:#f92672">.</span>getpid()))
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># start etl at beginning of the thread</span>
</span></span><span style="display:flex;"><span>        main_etl()
</span></span><span style="display:flex;"><span>        signal<span style="color:#f92672">.</span>signal(signal<span style="color:#f92672">.</span>SIGTERM, signal_handler)
</span></span><span style="display:flex;"><span>        signal<span style="color:#f92672">.</span>signal(signal<span style="color:#f92672">.</span>SIGINT, signal_handler)
</span></span><span style="display:flex;"><span>        job <span style="color:#f92672">=</span> Timer(WAIT_TIME_SECONDS, main_etl)
</span></span><span style="display:flex;"><span>        job<span style="color:#f92672">.</span>start()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>                time<span style="color:#f92672">.</span>sleep(<span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">except</span> ProgramKilled:
</span></span><span style="display:flex;"><span>                print(<span style="color:#e6db74">&#34;Program killed: running cleanup code&#34;</span>)
</span></span><span style="display:flex;"><span>                <span style="color:#75715e"># remove the pid file</span>
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>exists(<span style="color:#e6db74">&#34;pid.txt&#34;</span>):
</span></span><span style="display:flex;"><span>                    os<span style="color:#f92672">.</span>remove(<span style="color:#e6db74">&#34;pid.txt&#34;</span>)
</span></span><span style="display:flex;"><span>                job<span style="color:#f92672">.</span>cancel()
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">&#34;PID file already found, job has already started. Exiting...&#34;</span>)
</span></span><span style="display:flex;"><span>        exit(<span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;__main__&#34;</span>:
</span></span><span style="display:flex;"><span>    etl_job()
</span></span></code></pre></div><p>The main routine lives in the <code>etl_job</code> function, which runs when the subprocess executes <code>utils.py</code>.
We use a double check to verify that the PID file has not been created already (with concurrency, multiple users could
try to load the page at the same time).</p>
<p>We register signal handlers to do some cleanup when the process receives a termination signal from the parent process (i.e. the app
is shut down), so that it can remove the PID file and cancel the timer.</p>
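<p>To see the cleanup path in isolation, here is a minimal sketch that sends the termination signal to itself. It assumes Python 3.8+ (for <code>signal.raise_signal</code>) and that it runs in the main thread; the PID file is written to a temporary folder so nothing is left behind.</p>

```python
import os
import signal
import tempfile
import time
from pathlib import Path


class ProgramKilled(Exception):
    pass


def signal_handler(signum, frame):
    raise ProgramKilled


pid_file = Path(tempfile.mkdtemp()) / "pid.txt"
pid_file.write_text(str(os.getpid()))
signal.signal(signal.SIGTERM, signal_handler)

cleaned_up = False
try:
    signal.raise_signal(signal.SIGTERM)  # simulate the parent's termination signal
    time.sleep(1)  # the exception is raised here at the latest
except ProgramKilled:
    # Same cleanup as in the ETL script: remove the PID file before exiting.
    pid_file.unlink(missing_ok=True)
    cleaned_up = True
finally:
    signal.signal(signal.SIGTERM, signal.SIG_DFL)  # restore the default handler

print("cleaned up:", cleaned_up)
```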
<p>Then we start the timer and enter an infinite loop until the signal handler raises an exception.
This way the script removes the PID file before exiting.</p>
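<p>One thing to keep in mind : <code>threading.Timer</code> calls its function a single time, once the delay has elapsed. If a job has to repeat at every interval, a thread looping on <code>Event.wait</code> is one way to do it. The sketch below is only an illustration; <code>RepeatingJob</code> is a hypothetical name and the interval is given in seconds.</p>

```python
import threading
import time


class RepeatingJob(threading.Thread):
    """Call `execute` every `interval` seconds until stop() is called."""

    def __init__(self, interval: float, execute):
        super().__init__(daemon=True)
        self.interval = interval
        self.execute = execute
        self.stop_event = threading.Event()

    def stop(self) -> None:
        self.stop_event.set()
        self.join()

    def run(self) -> None:
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self.stop_event.wait(self.interval):
            self.execute()


ticks = []
job = RepeatingJob(interval=0.05, execute=lambda: ticks.append(time.monotonic()))
job.start()
time.sleep(0.3)
job.stop()
print("number of executions:", len(ticks))
```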
<p><img alt="cleanup" loading="lazy" src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExYzdtM2RvcHc2ZTlza2RkeGk3eDI1ZTVndHRvb2IwNDF4c3ZsZWs5aiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/l2Je9c6EJAuE7mYMM/giphy.gif#center"></p>
<p>Good ! Now we have a full ETL that loads the latest prices exposed by the French API into the database.</p>
<p>Let&rsquo;s craft a nice, user-friendly dashboard that offers :</p>
<ul>
<li>A main page, with a price dashboard showing the price list and links to the other pages of the app</li>
<li>A login page, with automatic recovery of a forgotten username or password</li>
<li>A profile page where users can modify, update and delete their profile, giving them full control over it (GDPR), both on user fields (mail, name..) and gas-related details (gas types)</li>
<li>A way to pick some stations to add to their dashboard</li>
</ul>
<p>As a bonus :</p>
<ul>
<li>An about page, to show help information</li>
<li>A nice sidebar to give the app a professional look, thanks to this <!-- raw HTML omitted -->open source app<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup><!-- raw HTML omitted -->, from which I borrowed the
UI design of the sidebar.</li>
</ul>
<h2 id="designing-the-ui">Designing the UI</h2>
<h3 id="home-page">Home page</h3>
<p>The home page needs to behave differently depending on whether the user is logged in or not, because the goals
are different in each case. Let&rsquo;s review them step by step.</p>
<h4 id="unlogged-users">Unlogged users</h4>
<p>For newcomers, the main ideas are :</p>
<ol>
<li>Provide a way to create an account on the platform
<img alt="welcome" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/welcome.png#center"></li>
<li>Show a demo of the dashboard. I would never create an account for something I can&rsquo;t see beforehand, so adding this option is a real bonus.
<img alt="demo1" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/demo_1.png#center">
<img alt="demo2" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/demo_2.png#center"></li>
<li>Show an about page so that users can learn more about the app
<img alt="about" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/about.png#center"></li>
</ol>
<h4 id="logged-users">Logged users</h4>
<p>For logged users, putting myself in a user&rsquo;s shoes, the main goals are, in order of priority :</p>
<ol>
<li>The ability to get an instant glimpse of the prices at my favorite stations, sorted by ascending price.
<img alt="pricelist" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/pricelist.png#center"></li>
<li>Information about the freshness of the data
<img alt="lastdate" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/lastdate.png#center"></li>
<li>An approximate idea of the expected annual savings from using this dashboard, to act as an incentive
<img alt="savings" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/savings.png#center"></li>
<li>An overview of what the other pages are for and what they can do, to customize the profile and so on
<img alt="pages" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/pages.png#center"></li>
</ol>
<h4 id="admin-user">Admin user</h4>
<p>For the admin user, a landing page shall provide :</p>
<ol start="5">
<li>
<p>Insights into app engagement
<img alt="admin" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/admin.png#center"></p>
</li>
<li>
<p>Some management options to handle operations for users (password resets, ETL refresh if it fails..)
<img alt="admin_actions_1" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/admin_actions_1.png#center">
<img alt="admin_actions_2" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/app_ui/admin_actions_2.png#center"></p>
</li>
</ol>
<p>As this page will be accessible to the <em>admin</em> user only, no sensitive data will be exposed.</p>
<p>By bundling all of this into a single app and deploying it on Streamlit Cloud, we get a live web app !</p>
<p>And everything is managed by Streamlit, no headache !</p>
<p><img alt="tree" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/tree.png#center"></p>
<h2 id="keep-the-app-alive-and-db-on-the-local-filesystem-accross-time">Keep the app alive and the DB on the local filesystem across time</h2>
<p>As explained previously, the DB will bootstrap <code>Users</code> objects from S3 storage. However,
the instances of <code>CustomStation</code>, <code>Price</code> and <code>Stations</code> only exist in the SQLite DB.
There is a risk that we lose the SQLite file if the Streamlit environment reboots : Streamlit explicitly
states that it will not preserve the local filesystem, so it&rsquo;s up to the developer to set up a workaround mechanism.</p>
<p>To ensure that the application doesn&rsquo;t go to sleep, and that the Streamlit orchestrator doesn&rsquo;t remove the
container, VM or server where the app resides, some logic has to be set up
to produce traffic on the app. This way we are safe from that eventuality.</p>
<p><img alt="sleeping" loading="lazy" src="/posts/2024/05/streamlit-gas-stations/sleeping.png#center"></p>
<p>To do this, and ensure that at least one visit is produced on the website, I borrowed a nice GitHub Action from
another very cool app, also made by a French developer, <!-- raw HTML omitted -->Jean Milpied<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup><!-- raw HTML omitted --></p>
<p>Here is the GitHub Action YAML file :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">name</span>: <span style="color:#ae81ff">Trigger Probe of Deployed App on a CRON Schedule</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">on</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">schedule</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">cron</span>: <span style="color:#e6db74">&#34;0 */48 * * *&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e"># Allows you to run this workflow manually from the Actions tab</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">workflow_dispatch</span>:
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">jobs</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">build-and-probe</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">runs-on</span>: <span style="color:#ae81ff">ubuntu-latest</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">steps</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Checkout Repository</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">actions/checkout@v2</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Build Docker Image</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">run</span>: <span style="color:#ae81ff">docker build -t my-probe-image -f probe-action/Dockerfile .</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Run Docker Container</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">run</span>: <span style="color:#ae81ff">docker run --rm my-probe-image</span>
</span></span></code></pre></div><p>The probe itself is a JavaScript script run with Puppeteer:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-js" data-lang="js"><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">puppeteer</span> <span style="color:#f92672">=</span> <span style="color:#a6e22e">require</span>(<span style="color:#e6db74">&#34;puppeteer&#34;</span>);
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">TARGET_URL</span> <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;https://carburoam.streamlit.app/&#34;</span>;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">WAKE_UP_BUTTON_TEXT</span> <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;app back up&#34;</span>;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">PAGE_LOAD_GRACE_PERIOD_MS</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">8000</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">console</span>.<span style="color:#a6e22e">log</span>(<span style="color:#a6e22e">process</span>.<span style="color:#a6e22e">version</span>);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>(<span style="color:#66d9ef">async</span> () =&gt; {
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">const</span> <span style="color:#a6e22e">browser</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">puppeteer</span>.<span style="color:#a6e22e">launch</span>({
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">headless</span><span style="color:#f92672">:</span> <span style="color:#66d9ef">true</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ignoreHTTPSErrors</span><span style="color:#f92672">:</span> <span style="color:#66d9ef">true</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">args</span><span style="color:#f92672">:</span> [<span style="color:#e6db74">&#34;--no-sandbox&#34;</span>],
</span></span><span style="display:flex;"><span>  });
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">const</span> <span style="color:#a6e22e">page</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">browser</span>.<span style="color:#a6e22e">newPage</span>();
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">console</span>.<span style="color:#a6e22e">log</span>(<span style="color:#a6e22e">page</span>); <span style="color:#75715e">// Print the page object to inspect its properties
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">page</span>.<span style="color:#66d9ef">goto</span>(<span style="color:#a6e22e">TARGET_URL</span>);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">console</span>.<span style="color:#a6e22e">log</span>(<span style="color:#a6e22e">page</span>); <span style="color:#75715e">// Print the page object to inspect its properties
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">// Wait a grace period for the application to load
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>  <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">page</span>.<span style="color:#a6e22e">waitForTimeout</span>(<span style="color:#a6e22e">PAGE_LOAD_GRACE_PERIOD_MS</span>);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">const</span> <span style="color:#a6e22e">checkForHibernation</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">async</span> (<span style="color:#a6e22e">target</span>) =&gt; {
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">// Look for any buttons containing the target text of the reboot button
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>    <span style="color:#66d9ef">const</span> [<span style="color:#a6e22e">button</span>] <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">target</span>.<span style="color:#a6e22e">$x</span>(
</span></span><span style="display:flex;"><span>      <span style="color:#e6db74">`//button[contains(., &#39;</span><span style="color:#e6db74">${</span><span style="color:#a6e22e">WAKE_UP_BUTTON_TEXT</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;)]`</span>,
</span></span><span style="display:flex;"><span>    );
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> (<span style="color:#a6e22e">button</span>) {
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">console</span>.<span style="color:#a6e22e">log</span>(<span style="color:#e6db74">&#34;App hibernating. Attempting to wake up!&#34;</span>);
</span></span><span style="display:flex;"><span>      <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">button</span>.<span style="color:#a6e22e">click</span>();
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  };
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">checkForHibernation</span>(<span style="color:#a6e22e">page</span>);
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">const</span> <span style="color:#a6e22e">frames</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">page</span>.<span style="color:#a6e22e">frames</span>();
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">for</span> (<span style="color:#66d9ef">const</span> <span style="color:#a6e22e">frame</span> <span style="color:#66d9ef">of</span> <span style="color:#a6e22e">frames</span>) {
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">checkForHibernation</span>(<span style="color:#a6e22e">frame</span>);
</span></span><span style="display:flex;"><span>  }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">browser</span>.<span style="color:#a6e22e">close</span>();
</span></span><span style="display:flex;"><span>})();
</span></span></code></pre></div><p>The script runs inside the Puppeteer Docker image: it probes the website and clicks the
wake-up button if the app is sleeping:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-Dockerfile" data-lang="Dockerfile"><span style="display:flex;"><span><span style="color:#75715e"># probe-action/Dockerfile</span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span><span style="color:#66d9ef">FROM</span><span style="color:#e6db74"> ghcr.io/puppeteer/puppeteer:17.0.0</span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span><span style="color:#66d9ef">COPY</span> ./probe-action/probe.js /home/pptruser/src/probe.js<span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span><span style="color:#66d9ef">ENTRYPOINT</span> [ <span style="color:#e6db74">&#34;/bin/bash&#34;</span>, <span style="color:#e6db74">&#34;-c&#34;</span>, <span style="color:#e6db74">&#34;node -e \&#34;$(&lt;/home/pptruser/src/probe.js)\&#34;&#34;</span> ]<span style="color:#960050;background-color:#1e0010">
</span></span></span></code></pre></div><p>With this technique, the scheduled workflow probes the app regularly and ensures it stays up. One caveat: the hour field of a cron expression only spans 0 to 23, so <code>*/48</code> matches hour 0 only, and the workflow actually runs once a day at midnight UTC rather than every 48 hours.
However, the downside of the current version is the lack of blue/green deployments, meaning that
if the current Streamlit service fails, the app is deployed again elsewhere without the saved SQLite DB.</p>
<p>This happened a few times during development, but for a side project without any particular ambition, it seems
good enough to me!</p>
<p>Some backup mechanism could be set up, but at this point, using SQLite might not be the most effective choice. I will add a TODO note in the project
to implement some backup logic in the future.</p>
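<p>As a starting point for that TODO, here is a minimal sketch of what a backup step could look like, using the online backup API of Python&rsquo;s standard <code>sqlite3</code> module. The file names and the <code>stations</code> table are hypothetical and only serve the demo. Copying the backup file somewhere outside the Streamlit host (object storage, for instance) is not shown and would still be needed.</p>

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_sqlite(source_path, backup_path):
    """Copy a SQLite database to backup_path with the online backup API."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        # Connection.backup copies the database page by page and yields a
        # consistent snapshot, even if the source is in use.
        source.backup(target)
    finally:
        target.close()
        source.close()


def demo():
    """Create a throwaway database, back it up, and read the copy back."""
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical file names, for illustration only.
        db_file = Path(workdir) / "app.db"
        backup_file = Path(workdir) / "app.backup.db"

        conn = sqlite3.connect(db_file)
        conn.execute("CREATE TABLE stations (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO stations (name) VALUES ('demo station')")
        conn.commit()
        conn.close()

        backup_sqlite(db_file, backup_file)

        restored = sqlite3.connect(backup_file)
        rows = restored.execute("SELECT name FROM stations").fetchall()
        restored.close()
        return rows


if __name__ == "__main__":
    print(demo())
```

<p>In the app, <code>backup_sqlite</code> could be called at the end of the daily ETL run, so that a redeployed instance has a recent copy to restore from.</p>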
<h2 id="conclusion">Conclusion</h2>
<p>Streamlit is a very versatile tool that makes it possible to:</p>
<ul>
<li>craft a small app with a cool and responsive UI</li>
<li>with some hacks, build a small ETL that refreshes the exposed data daily. Do not take it as a
battle-tested production feature, but for fun side projects, it will be enough.</li>
</ul>
<p>Make sure you design your data schema well in order to get the best performance out of your ORM: implementing
the app logic will then be easy and straightforward!</p>
<p>Besides, you will have fancy ORM objects, which is always nice when implementing a backend!</p>
<p>A lot of ❤️ to the various developers who open-sourced their apps (<a href="https://pdfworkdesk.streamlit.app/">pdf-workdesk</a>,
<a href="https://reparatorai.streamlit.app/">reparatorAI</a>) and libraries (<a href="https://github.com/mkhorasani/Streamlit-Authenticator/">Streamlit-Authenticator</a>):
without them the work would have been much harder, or maybe impossible.
Give them a lot of 🌟, it will please them a lot!</p>
<p>My thanks also go to the Streamlit team for making it possible for developers to expose their
crafted dashboards for free. Much appreciated!</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>A nice PDF-editing <a href="https://github.com/SiddhantSadangi/pdf-workdesk">app</a> made by Siddhant Sadangi. Have a look at
his other apps on GitHub, they are amazing!&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Another nice app, showing some BI information about your chances of repairing your devices. Have a look
at the probe action <a href="https://github.com/JeanMILPIED/reparatorAI/tree/main/probe-action">here</a>, and dive into the
blog post by <a href="https://dcyoung.github.io/post-streamlit-keep-alive/">David Young</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Lightning fast, Parquet to CSV</title>
      <link>https://emilien-foissotte.github.io/posts/2023/08/fast-convert/</link>
      <pubDate>Sat, 26 Aug 2023 09:50:14 +0200</pubDate>
      <guid>https://emilien-foissotte.github.io/posts/2023/08/fast-convert/</guid>
      <description>&lt;h1 id=&#34;tldr&#34;&gt;TL;DR&lt;/h1&gt;
&lt;p&gt;This post shows how to convert &lt;code&gt;Apache Parquet&lt;/code&gt; files to &lt;code&gt;CSV&lt;/code&gt;, and vice versa,
in a convenient and fast way 🚀, using DuckDB 🦆, with Pandas 🐍 as a baseline for comparison.&lt;/p&gt;
&lt;p&gt;As a quick bonus, we will embed this tool in a small, convenient CLI script, easily triggered from your favorite
shell 👨‍💻&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s go !&lt;/p&gt;
&lt;h2 id=&#34;intro&#34;&gt;Intro&lt;/h2&gt;
&lt;p&gt;Recently, I&amp;rsquo;ve been working a little bit more on Data Engineering tasks (setting up a data lake, converting data,
designing pipelines, cleaning up some data). 📊&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="tldr">TL;DR</h1>
<p>This post shows how to convert <code>Apache Parquet</code> files to <code>CSV</code>, and vice versa,
in a convenient and fast way 🚀, using DuckDB 🦆, with Pandas 🐍 as a baseline for comparison.</p>
<p>As a quick bonus, we will embed this tool in a small, convenient CLI script, easily triggered from your favorite
shell 👨‍💻</p>
<p>Let&rsquo;s go !</p>
<h2 id="intro">Intro</h2>
<p>Recently, I&rsquo;ve been working a little bit more on Data Engineering tasks (setting up a data lake, converting data,
designing pipelines, cleaning up some data). 📊</p>
<p>From time to time, I had to convert data to .csv, which is perfect for quickly spotting any important info or checking that decryption
worked, and so on... 👀</p>
<p>But CSV files take up a lot of space, and in order to save some costs on AWS S3 storage, it is much
better to store some files in the <code>Apache Parquet</code> format ⚡</p>
<p>Eventually, I found myself typing the same commands again and again in order to convert Parquet to CSV
and vice versa. I tried to find a CLI tool widely adopted by the Data Engineering community, but unfortunately
I couldn&rsquo;t find one!</p>
<p>And the inefficient commands kept coming, again and again...</p>
<p><img alt="cat" loading="lazy" src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExbWVxbHpwc2FzNHh2anFtcDRoaXdianJhOGl5bDFwYXJsdW5pNHVzbyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/JIX9t2j0ZTN9S/giphy.gif#center"></p>
<p>So sad that with Parquet you can&rsquo;t visualize your data directly: the following
command won&rsquo;t help you deal with your data... 😥</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>❯ head file.parquet
</span></span><span style="display:flex;"><span>B%R1x,&lt;I/
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>!
</span></span><span style="display:flex;"><span>b
</span></span><span style="display:flex;"><span>B5
</span></span><span style="display:flex;"><span> u
</span></span><span style="display:flex;"><span>:!-
</span></span><span style="display:flex;"><span>&lt;M<span style="color:#f92672">=</span>
</span></span><span style="display:flex;"><span>P<span style="color:#f92672">(</span>-    bG
</span></span></code></pre></div><p>As always in optimization problems, there is no free lunch: you cannot get both the convenience of
visualizing your data at a glance and a super storage-efficient format. 🤑</p>
<p>But after a while, I remembered a post from a DuckDB advocate showing how DuckDB can handle
this kind of operation. Let&rsquo;s try it on our machine! 🚀</p>
<h2 id="comparing-pandas-and-duckdb">Comparing Pandas and DuckDB</h2>
<p>A few months ago, I came across a LinkedIn <a href="https://www.linkedin.com/posts/motherduck_csv-to-parquet-using-duckdb-cli-activity-7043982478671306752-z2EK?utm_source=share&amp;utm_medium=member_desktop">post</a> from a DuckDB advocate about crafting a one-line script to
efficiently convert a CSV file into a Parquet file.</p>
<p>I decided to give it a try and compare it with a classical tool for the job (like pandas).</p>
<h2 id="setup-a-baseline-for-a-conversion-tool">Set up a baseline for a conversion tool</h2>
<p>Let&rsquo;s first download a medium-sized dataset, for instance the <a href="https://grouplens.org/datasets/movielens/">MovieLens 25M dataset</a>.</p>
<p>Stay tuned, we will be using it in a future post !</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>❯ head -n <span style="color:#ae81ff">3</span> ratings.csv
</span></span><span style="display:flex;"><span>userId,movieId,rating,timestamp
</span></span><span style="display:flex;"><span>1,296,5.0,1147880044
</span></span><span style="display:flex;"><span>1,306,3.5,1147868817
</span></span></code></pre></div><p>Let&rsquo;s see how the baseline tool behaves on this operation, with pandas and pyarrow installed.
For the record, and for the sake of reproducibility, here is the <code>pip freeze</code> snapshot of my whole venv:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>❯ pip freeze
</span></span><span style="display:flex;"><span>numpy<span style="color:#f92672">==</span>1.25.2
</span></span><span style="display:flex;"><span>pandas<span style="color:#f92672">==</span>2.1.0
</span></span><span style="display:flex;"><span>pyarrow<span style="color:#f92672">==</span>13.0.0
</span></span><span style="display:flex;"><span>python-dateutil<span style="color:#f92672">==</span>2.8.2
</span></span><span style="display:flex;"><span>pytz<span style="color:#f92672">==</span>2023.3.post1
</span></span><span style="display:flex;"><span>six<span style="color:#f92672">==</span>1.16.0
</span></span><span style="display:flex;"><span>tzdata<span style="color:#f92672">==</span>2023.3
</span></span></code></pre></div><p>And the performance is as follows:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>❯ /usr/bin/time -l -h -p python -c <span style="color:#e6db74">&#34;import pandas; df=pandas.read_csv(&#39;ratings.csv&#39;); df.to_parquet(&#39;ratings.parquet&#39;)&#34;</span>
</span></span><span style="display:flex;"><span>real 14,43
</span></span><span style="display:flex;"><span>user 9,40
</span></span><span style="display:flex;"><span>sys 2,80
</span></span><span style="display:flex;"><span>          <span style="color:#ae81ff">1774600192</span>  maximum resident set size
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  average shared memory size
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  average unshared data size
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  average unshared stack size
</span></span><span style="display:flex;"><span>             <span style="color:#ae81ff">1037341</span>  page reclaims
</span></span><span style="display:flex;"><span>                <span style="color:#ae81ff">5284</span>  page faults
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  swaps
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  block input operations
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  block output operations
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  messages sent
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  messages received
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  signals received
</span></span><span style="display:flex;"><span>                 <span style="color:#ae81ff">864</span>  voluntary context switches
</span></span><span style="display:flex;"><span>               <span style="color:#ae81ff">25149</span>  involuntary context switches
</span></span><span style="display:flex;"><span>         <span style="color:#ae81ff">77809521962</span>  instructions retired
</span></span><span style="display:flex;"><span>         <span style="color:#ae81ff">42836239895</span>  cycles elapsed
</span></span><span style="display:flex;"><span>          <span style="color:#ae81ff">2854670336</span>  peak memory footprint
</span></span></code></pre></div><p>Now let&rsquo;s use DuckDB 🦆</p>
<p><a href="https://duckdb.org/docs/installation/">Installing</a> it is very simple.</p>
<p>For macOS:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>brew install duckdb
</span></span></code></pre></div><p>and for Linux (be sure to pick the right architecture):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>curl -SL https://github.com/duckdb/duckdb/releases/download/v0.8.1/duckdb_cli-linux-amd64.zip -o /tmp/duckdb.zip
</span></span><span style="display:flex;"><span>unzip /tmp/duckdb.zip -d /tmp/duckdb
</span></span><span style="display:flex;"><span>mv /tmp/duckdb/* /usr/local/bin/
</span></span><span style="display:flex;"><span>chmod +x /usr/local/bin/duckdb
</span></span></code></pre></div><p>Let&rsquo;s convert that <code>.csv</code> file 🚀 :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>❯ /usr/bin/time -l -h -p duckdb -c <span style="color:#e6db74">&#34;COPY (select * from read_csv_auto(&#39;ratings.csv&#39;)) TO &#39;ratings.parquet&#39; (FORMAT PARQUET)&#34;</span>
</span></span><span style="display:flex;"><span>100% ▕████████████████████████████████████████████████████████████▏
</span></span><span style="display:flex;"><span>real 9,12
</span></span><span style="display:flex;"><span>user 24,58
</span></span><span style="display:flex;"><span>sys 2,36
</span></span><span style="display:flex;"><span>           <span style="color:#ae81ff">603959296</span>  maximum resident set size
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  average shared memory size
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  average unshared data size
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  average unshared stack size
</span></span><span style="display:flex;"><span>              <span style="color:#ae81ff">551447</span>  page reclaims
</span></span><span style="display:flex;"><span>                <span style="color:#ae81ff">1741</span>  page faults
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  swaps
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  block input operations
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  block output operations
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  messages sent
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  messages received
</span></span><span style="display:flex;"><span>                   <span style="color:#ae81ff">0</span>  signals received
</span></span><span style="display:flex;"><span>                 <span style="color:#ae81ff">761</span>  voluntary context switches
</span></span><span style="display:flex;"><span>               <span style="color:#ae81ff">34911</span>  involuntary context switches
</span></span><span style="display:flex;"><span>        <span style="color:#ae81ff">135618259891</span>  instructions retired
</span></span><span style="display:flex;"><span>         <span style="color:#ae81ff">96669139421</span>  cycles elapsed
</span></span><span style="display:flex;"><span>           <span style="color:#ae81ff">645996544</span>  peak memory footprint
</span></span></code></pre></div><p>Let&rsquo;s compare the data between those runs !</p>
<p><img alt="data" loading="lazy" src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExNXNlbDl1dnZnMzVyMHN5MTF5cHQ3MnN1ZXowNXc4NmEzYW9kbnhxZCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/LaVp0AyqR5bGsC5Cbm/giphy.gif#center"></p>
<p>First impression: I love that progress bar. It is always annoying to wait for an operation without knowing when it is going to end.</p>
<p>However, it seems that the file is less compressed than the pandas one (there might be some tweaks to compress it further
with DuckDB):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>❯ du -h ratings*.parquet
</span></span><span style="display:flex;"><span>225M    ratings_duckdb.parquet
</span></span><span style="display:flex;"><span>168M    ratings_pandas.parquet
</span></span></code></pre></div><p>However, when looking at the peak memory footprint, we can see that DuckDB processes the data in chunks, while
pandas loads the whole object in memory. On low-memory systems or with big objects, this can be limiting.</p>
<p>The peak memory footprint is about <code>4.4</code> times lower with DuckDB (roughly 0.65 GB versus 2.85 GB). Excellent!</p>
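<p>For reference, the ratios can be recomputed from the two <code>/usr/bin/time -l</code> outputs. The small sketch below only copies the figures reported by the two runs; the dictionary and function names are made up for the example.</p>

```python
# Figures copied from the two `/usr/bin/time -l` outputs of the runs
# (pandas run first, DuckDB run second).
pandas_run = {
    "real_s": 14.43,              # wall-clock time, in seconds
    "max_rss_bytes": 1_774_600_192,  # maximum resident set size
    "peak_bytes": 2_854_670_336,     # peak memory footprint
}
duckdb_run = {
    "real_s": 9.12,
    "max_rss_bytes": 603_959_296,
    "peak_bytes": 645_996_544,
}


def ratio(metric):
    """Return how many times larger the pandas figure is than the DuckDB one."""
    return pandas_run[metric] / duckdb_run[metric]


if __name__ == "__main__":
    for metric in ("real_s", "max_rss_bytes", "peak_bytes"):
        print(f"{metric}: pandas / duckdb = {ratio(metric):.1f}x")
```

<p>On these numbers, DuckDB comes out about 1.6 times faster in wall-clock time, with a maximum resident set size about 2.9 times lower and a peak memory footprint about 4.4 times lower. This is a single run on one machine, so take it as an indication rather than a benchmark.</p>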
<p><img alt="duck" loading="lazy" src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExcWUzOGJ0OGpjanJuajg2MTkyemRxY3FqdWV1emRrdmE3cmN5bHNrZCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/x4bgmvMlRSYRVcTm29/giphy.gif#center"></p>
<h2 id="developing-a-small-cli-tool---the-fancy-way">Developing a small CLI tool - the fancy way</h2>
<p>Simply tweak your <code>.zshrc</code> or <code>.bashrc</code> with these incredible functions:</p>
<script src="https://gist.github.com/Emilien-Foissotte/beed79c794db3642830cf149701e27c4.js"></script>

<p>And transform your files with a one-liner command:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>parquet_to_csv file.parquet
</span></span></code></pre></div><p>and boom, you get a <code>file.csv</code> 💥</p>
<p>And do the reverse operation very simply :</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>csv_to_parquet file.csv
</span></span></code></pre></div><p>And here is your <code>file.parquet</code>, so fast and efficient ! 🏎️</p>
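<p>The shell functions themselves live in the gist. As a rough illustration of the idea, here is a Python sketch that builds the same kind of <code>COPY</code> statement as the one used earlier and hands it to the <code>duckdb</code> CLI. The function names <code>build_copy_statement</code> and <code>convert</code> are hypothetical, and this is an assumption about how such a wrapper can be written, not the content of the gist.</p>

```python
import subprocess
from pathlib import Path


def build_copy_statement(source):
    """Return (sql, target) for converting a .parquet or .csv file.

    The target file sits next to the source, with the extension swapped.
    Paths containing single quotes are not handled in this sketch.
    """
    source = Path(source)
    suffix = source.suffix.lower()
    if suffix == ".parquet":
        target = source.with_suffix(".csv")
        sql = (
            f"COPY (SELECT * FROM read_parquet('{source}')) "
            f"TO '{target}' (FORMAT CSV, HEADER)"
        )
    elif suffix == ".csv":
        target = source.with_suffix(".parquet")
        sql = (
            f"COPY (SELECT * FROM read_csv_auto('{source}')) "
            f"TO '{target}' (FORMAT PARQUET)"
        )
    else:
        raise ValueError(f"unsupported extension: {source.suffix!r}")
    return sql, target


def convert(source):
    """Run the conversion with the duckdb CLI, which must be on the PATH."""
    sql, target = build_copy_statement(source)
    # Same invocation style as the one-liner: duckdb -c "<SQL>"
    subprocess.run(["duckdb", "-c", sql], check=True)
    return target
```

<p>Calling <code>convert("file.parquet")</code> would then write <code>file.csv</code> next to the source, and <code>convert("file.csv")</code> would do the reverse.</p>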
<h2 id="conclusion">Conclusion</h2>
<p>We only covered one very handy feature of DuckDB, but with this small example, we have
been able to turn this versatile tool into a very handy CLI utility, which will save you
so much time in your daily Data Engineering life!</p>
<p>Do not hesitate to share some of the smart one-liner commands and functions
you have in your <code>.bashrc</code> and <code>.zshrc</code>.</p>
<p>Lots of ❤️ to the DuckDB team for the incredible work !</p>
<h3 id="links-and-references">Links and references</h3>
<ul>
<li><a href="https://www.linkedin.com/posts/motherduck_csv-to-parquet-using-duckdb-cli-activity-7043982478671306752-z2EK?utm_source=share&amp;utm_medium=member_desktop">LinkedIn post</a></li>
<li><a href="https://www.linkedin.com/posts/mehd-io_csv-to-parquet-using-duckdb-cli-activity-7043984992632229888-7GJr?utm_source=share&amp;utm_medium=member_desktop">Repost from Mehdi Ouazza, great Data Eng advocate to follow !</a></li>
<li><a href="https://duckdb.org/docs/installation/">DuckDB documentation</a></li>
<li><a href="https://duckdb.org/pdf/SIGMOD2019-demo-duckdb.pdf">Discover DuckDB PDF</a></li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
