local stuff for local (gov) people

Data Thing Part 2: Lakes and Warehouses and Lakehouses...

Now that we've started to work out what we don't know, we can get to work on knowing things!
Spoiler alert: I end up even more confused.

Bringing our datasets together

I start learning about Data Lakes and Warehouses and Lakehouses. It all sounds very big and technical and expensive.
A warehouse sounds like a MEGADATABASE that you copy data from your other structured databases into for analysis. A lake is the same thing, but you can also include unstructured data! I am picturing a huge heap of fly-tipped CSV files and SQL databases and PDFs. This sounds right up our street.

Then we have on-premises vs cloud. It seems counterproductive to bring everything into an on-premises MEGADATABASE when we are meant to be going cloud first. But how do I turn this into something I can try out without running up a fortune in cloud computing costs?

Using what we already have

We have servers with SQL and Postgres databases in our data centre. I start looking into open source things like Apache Airflow and dbt Core, which are snazzy. I soon realise I could spend many months going down a rabbit hole with this stuff because I find it interesting, but do we have the knowledge and resources to support this setup as a team?
I need a thing where we can spin up a working example easily...

Looking at this problem from the other end: we have users writing reports in software bundled with applications, and a couple of users starting to get interested in Power BI. We've just upgraded our Microsoft licences to E5, so perhaps it is time to talk to Microsoft about what they offer? I start looking into Fabric and Azure storage and realise I need someone with a PhD in Microsoft. So, through our account manager, I book an introductory meeting with an Azure specialist.

We start thinking about the balance between cost and ease of use, maintenance and familiarity.

We need a clearer goal

It needs to be tangible and ideally not include the terms "data driven", "unlocking our data" or any vague references to insights, because by this point I am over hearing them! Don't get me started on "chat to your data"... 😣
It needs to be specific and focused enough that we can actually deliver it and it doesn't become another service plan mega rollover. Let's come up with some outcomes:

Using a demo project to build momentum

At the start of the year I was fiddling with the OpenAI API and wrote a script to automate the summarisation of comments on planning applications. It's still going through rounds of changes and tweaking. It currently has to be run manually by me from some CSV extracts.
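For anyone wondering what that kind of script roughly looks like, here is a minimal sketch. It is not the actual script. The column names `application_ref` and `comment` are made up for illustration, and the `summarise` argument is a stand-in for whatever function sends a prompt to the OpenAI API and returns the reply.

```python
import csv
from collections import defaultdict
from typing import Callable, Dict, List, TextIO


def load_comments(csv_file: TextIO) -> Dict[str, List[str]]:
    """Group comment text by planning application reference.

    Assumes (hypothetically) that the extract has 'application_ref'
    and 'comment' columns. Blank comments are skipped.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        text = (row.get("comment") or "").strip()
        if text:
            grouped[row["application_ref"]].append(text)
    return dict(grouped)


def build_prompt(ref: str, comments: List[str]) -> str:
    """Turn one application's comments into a single prompt string."""
    bullets = "\n".join(f"- {c}" for c in comments)
    return (
        f"Summarise the main themes in these public comments "
        f"on planning application {ref}:\n{bullets}"
    )


def summarise_all(
    grouped: Dict[str, List[str]],
    summarise: Callable[[str], str],
) -> Dict[str, str]:
    """Call the supplied summariser once per application.

    'summarise' is a placeholder: in real use it would wrap the
    API call; here it can be any function from prompt to text.
    """
    return {
        ref: summarise(build_prompt(ref, comments))
        for ref, comments in grouped.items()
    }


if __name__ == "__main__":
    import io

    # Tiny in-memory extract standing in for a real CSV file.
    extract = io.StringIO(
        "application_ref,comment\n"
        "A1,Too tall for the street\n"
        "A1,Worried about traffic\n"
        "A2,Fully support this\n"
    )

    # Fake summariser so the sketch runs without any API key.
    def fake(prompt: str) -> str:
        return f"({prompt.count(chr(10))} comments read)"

    for ref, summary in summarise_all(load_comments(extract), fake).items():
        print(ref, summary)
```

Keeping the summariser as a plain function argument means the CSV handling can be tried out with a fake before any API costs are involved.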
Sounds like I've got a good area to focus on:

