TL;DR

This post shares some tips for working efficiently as a Data Engineer 🚀, either by browsing documentation locally or by using a local LLM to ease your development experience (an Apple Silicon Mac is required for that one). 👨‍💻

Let's go!

Intro

Working with a limited internet connection still happens nowadays, and the gap with our usual workstation setup is huge 🦾

In those situations, the connection can be very slow and the bandwidth unreliable. That makes work difficult, but with a little preparation you can be as effective as before! 💥


Get ready to boost your productivity on a train, on a bus, or during a family trip in the countryside! 🚜

Keep browsable documentation, everywhere

Never been told to RTFM after raising an issue? Do yourself a favor: read the documentation when approaching a new concept or feature of a package! 📖

Unix troubles

A very good tool on UNIX machines is man, a simple documentation tool that gives you every detail about the installed software.

However, navigating an entire manual page can be hard, and it is often difficult to find the exact usage you are looking for.

To solve this, a community tool has emerged: tldr.

In a few lines of text, it shows the main usages of a command and its syntax.

For instance, for the xargs command:

[Screenshot: tldr output for the xargs command]

Pretty neat, isn't it? 🔥

To install it, have a look at the repository.

Browsing Python documentation

As Python is becoming the number one programming language, you will surely run into issues while working with it. Are you struggling with a built-in Python object? 🐍

Has a deep exploration with Pythonic wizard tricks like yourobject.__dict__ not given you any useful information?

Fire up the built-in documentation server associated with your Python version:

pydoc -p 0

and type b to open a browser page automatically. 🌐

All installed packages have their documentation available here, with docstrings and examples.

For instance, here is the docstring of the pyspark agg function:

[Screenshot: pydoc page showing the docstring of the pyspark agg function]
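The same pages can also be read without a browser. Here is a minimal sketch using the standard library pydoc module; `doc_as_text` is a hypothetical helper name, and `json.dumps` is only an example target chosen because it ships with Python.

```python
import pydoc


def doc_as_text(target):
    """Return the pydoc page of an object (or dotted name) as plain text."""
    # pydoc.plaintext disables the terminal bold markup used by help()
    return pydoc.render_doc(target, renderer=pydoc.plaintext)


# Works offline for anything importable in the current environment
print(doc_as_text("json.dumps"))
```

Any importable dotted name should work the same way, for instance a pyspark function if pyspark is installed in the active environment.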

Browsing offline DuckDB documentation

Sometimes, with the right preparation, a Data Engineer can safely work on a dbt project using unit tests or data tests with mock sources. 📊

For instance, I've recently been working with lea, a lightweight alternative to dbt that is plug-and-play with duckdb.

My staging models were used to read local data that was small enough to fit on my hard drive.

I declare views as follows:

.
├── seeds
│   ├── inventory.sql
│   ├── raw_animals.csv
│   └── raw_inventory.parquet
├── views
│   ├── analytics
│   │   └── stats.sql
│   ├── core
│   │   └── wrangled_inventory.sql
│   ├── gold
│   │   ├── animals.sql
│   │   └── inventory.sql
│   └── staging
│       ├── animals.py
│       └── inventory.py
├── wrangling.db

The seeds/inventory.sql file contains:

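-- Load AWS credentials from the named profile (DuckDB aws extension)
CALL load_aws_credentials('my-profile');
DROP TABLE IF EXISTS inventory;
-- Dump the remote Parquet file into a local seed while still online
COPY (
  SELECT * FROM 's3://mylarge-bucket/inventory.parquet'
) TO 'seeds/raw_inventory.parquet' (FORMAT PARQUET);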

While I'm still online, I run:

duckdb < seeds/inventory.sql

It dumps the data into the raw_inventory.parquet file ⚙️

Later on, I can declare the staging model, which contains:
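Before leaving, it can be worth checking that the dump is complete, since an interrupted download is hard to fix once offline. The sketch below is only a sanity check under stated assumptions: it relies on the fact that an unencrypted Parquet file starts and ends with the 4-byte marker PAR1, it does not validate the content, and `looks_like_parquet` is a hypothetical helper name.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"
# Leading magic (4) + footer length (4) + trailing magic (4)
MIN_PARQUET_SIZE = 12


def looks_like_parquet(path):
    """Return True if the file has the Parquet markers at both ends."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size < MIN_PARQUET_SIZE:
        return False
    with path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, os.SEEK_END)
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Path taken from the project layout above
print(looks_like_parquet("seeds/raw_inventory.parquet"))
```

A truncated file usually misses the trailing marker, so this catches the most common failure of a dump made on a flaky connection.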

from __future__ import annotations

import pathlib

import pandas as pd

here = pathlib.Path(__file__).parent
inventory = pd.read_parquet(here.parent.parent / "seeds" / "raw_inventory.parquet")

Need to check the duckdb docs for an edge case of a function?

Before your offline trip, download the latest documentation[1] as a zip file from https://duckdb.org/duckdb-docs.zip.

Then move it to a dedicated folder and unzip it:

mkdir -p ~/duckdb-offline && mv ~/Downloads/duckdb-docs.zip ~/duckdb-offline
cd ~/duckdb-offline && unzip duckdb-docs.zip

Once it's unzipped, start Python's built-in web server from that folder. The documentation is then available offline anywhere, search bar included. Very neat! 🦆

python -m http.server
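By default this command serves the current directory on port 8000 and listens on all network interfaces. If you would rather pin the folder and only listen on your own machine, the same standard library module can be used from a script. This is a sketch, not a replacement for the command above: `make_docs_server` and `serve_docs` are hypothetical names, and the ~/duckdb-offline folder is the one assumed earlier.

```python
import functools
import http.server
from pathlib import Path


def make_docs_server(directory, port=8000):
    """Build a server for `directory`, reachable only from this machine."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    # 127.0.0.1 keeps the docs private on shared train or hotel networks
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve_docs():
    """Serve the unzipped DuckDB docs until interrupted with Ctrl+C."""
    docs_dir = Path.home() / "duckdb-offline"
    with make_docs_server(docs_dir) as server:
        print(f"Docs available at http://127.0.0.1:{server.server_address[1]}/")
        server.serve_forever()
```

Calling serve_docs() blocks until interrupted, exactly like the one-liner.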

[Screenshot: offline DuckDB documentation served locally]

Facing a hard issue? Ask a local LLM for help!

Using managed services like GitHub Copilot is super handy, but it can be costly (~$100/yr) and is not suitable when developing in limited-bandwidth environments. 🐌

To overcome these challenges, if you are the happy owner of an Apple M2 or M3 chip, you have enough computing power to run a local LLM with 1B to 3B parameters.

Hopefully, the chip's ARM architecture will also save us from completely draining the battery. 🔋

To do so, before your first offline session, head to the tabby documentation. Tabby is a framework that makes local LLMs available to LSP servers.

Install tabby and launch it with:

tabby serve --port 8889 --device metal --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct
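Before relying on it offline, it can help to confirm that something is actually listening on that port. The sketch below only tests the TCP connection, using port 8889 from the command above; it says nothing about whether the models are loaded, and `is_port_open` is a hypothetical helper name.

```python
import socket


def is_port_open(host="127.0.0.1", port=8889, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: nothing usable is listening
        return False


print("tabby reachable:", is_port_open())
```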

Use the plugin for the IDE[2] of your choice, for instance vim-tabby on neovim, with a bit of setup to avoid conflicts with an existing Copilot setup:

Lazy.nvim configuration:

return {
	"TabbyML/vim-tabby",
	lazy = false,
	dependencies = {
		"neovim/nvim-lspconfig",
	},
	init = function()
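		-- Launch tabby-agent through npx, talking over stdio
		vim.g.tabby_agent_start_command = { "npx", "tabby-agent", "--stdio" }
		-- Suggest only when triggered manually
		vim.g.tabby_inline_completion_trigger = "manual"
		-- Accept a suggestion with <leader>%
		vim.g.tabby_inline_completion_keybinding_accept = "<leader>%"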
	end,
}

and use leader + % to accept the completion suggested by the LLM.

If you are more comfortable with a chat interface, head to http://0.0.0.0:8889/ to use the chat web app! 💬

Here is the result when asking for a fibonacci function:

[Screenshot: Tabby suggestions for a fibonacci function]

Not so bad!

Conclusion

Going offline is still an opportunity for deep work, even when the network connection is poor. It doesn't mean that you have to get stuck on every issue and debug it without documentation.

It's a great way of standing on the shoulders of giants[3]: tons of developers have worked to provide the most effective documentation, so RTFM. And nowadays, having a copilot is a great way to get straight-to-the-point code suggestions. Do not underestimate it.


  1. The DuckDB team provides several ways to browse the documentation offline; head to the offline doc page for more.

  2. The Tabby team also developed plugins for Visual Studio Code and IntelliJ. Have a look at the documentation.

  3. Knowledge is cumulative, see [wikipedia](https://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants).