Skip to main content

Data Overview

Understanding the Dataset

The workshop uses two main data sources: Prishtina Places Dataset:
  • Restaurants: Individual JSON files (r1.json, r2.json, etc.)
  • Coffee Shops: Individual JSON files (c1.json, c2.json, etc.)
  • Data Fields: Name, location, rating, reviews, coordinates, address
  • Source: Scraped using SerpAPI from Google Places
Revenue Data:
  • Format: PostgreSQL database
  • Content: Synthetic daily revenue data for each establishment
  • Time Range: Historical data for trend analysis

Data Structure

Each JSON file contains a raw SERPApi response with a JSON object with metadata and local results of place objects with the following structure:
{
    "search_metadata": {
        "id": "6867f58ea289710dcf3ea48f",
        "status": "Success",
        "json_endpoint": "https://serpapi.com/searches/e3f0d0cce2c8410a/6867f58ea289710dcf3ea48f.json",
        "created_at": "2025-07-04 15:38:54 UTC",
        "processed_at": "2025-07-04 15:38:54 UTC",
        "google_local_url": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw&hl=en&gl=de&tbm=lcl",
        "raw_html_file": "https://serpapi.com/searches/e3f0d0cce2c8410a/6867f58ea289710dcf3ea48f.html",
        "total_time_taken": 3.92
    },
    "search_parameters": {
        "engine": "google_local",
        "q": "Coffee",
        "location_requested": "Prishtina, Pristina District, Kosovo",
        "location_used": "Prishtina,Pristina District,Kosovo",
        "google_domain": "google.com",
        "hl": "en",
        "gl": "de",
        "device": "desktop"
    },
    "local_results": [
        {
            "position": 1,
            "rating": 4.9,
            "reviews": 158,
            "reviews_original": "(158)",
            "lsig": "AB86z5U-pSAyfo30lztdUc25lx9e",
            "thumbnail": "https://serpapi.com/searches/6867f58ea289710dcf3ea48f/images/fcea8ffca5da861b96c24707d286def2eb8a768ad27776066a661b1623ff9519.jpeg",
            "place_id": "4333347444629616077",
            "place_id_search": "https://serpapi.com/search.json?device=desktop&engine=google_local&gl=de&google_domain=google.com&hl=en&location=Prishtina%2C+Pristina+District%2C+Kosovo&ludocid=4333347444629616077&q=Coffee",
            "gps_coordinates": {
                "latitude": 42.65882,
                "longitude": 21.159815
            },
            "title": "Newborn Brew",
            "type": "Coffee shop",
            "address": "M555+GW, 1000 Johan V. Hahn"
        }
    ],
    "pagination": {
        "current": 1,
        "next": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw&hl=en&gl=de&start=20&tbm=lcl",
        "other_pages": {
            "2": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw&hl=en&gl=de&start=20&tbm=lcl",
            "3": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw&hl=en&gl=de&start=40&tbm=lcl",
            "4": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw&hl=en&gl=de&start=60&tbm=lcl",
            "5": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw&hl=en&gl=de&start=80&tbm=lcl"
        }
    },
    "serpapi_pagination": {
        "current": 1,
        "next_link": "https://serpapi.com/search.json?device=desktop&engine=google_local&gl=de&google_domain=google.com&hl=en&location=Prishtina%2C+Pristina+District%2C+Kosovo&q=Coffee&start=20&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw",
        "next": "https://serpapi.com/search.json?device=desktop&engine=google_local&gl=de&google_domain=google.com&hl=en&location=Prishtina%2C+Pristina+District%2C+Kosovo&q=Coffee&start=20&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw",
        "other_pages": {
            "2": "https://serpapi.com/search.json?device=desktop&engine=google_local&gl=de&google_domain=google.com&hl=en&location=Prishtina%2C+Pristina+District%2C+Kosovo&q=Coffee&start=20&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw",
            "3": "https://serpapi.com/search.json?device=desktop&engine=google_local&gl=de&google_domain=google.com&hl=en&location=Prishtina%2C+Pristina+District%2C+Kosovo&q=Coffee&start=40&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw",
            "4": "https://serpapi.com/search.json?device=desktop&engine=google_local&gl=de&google_domain=google.com&hl=en&location=Prishtina%2C+Pristina+District%2C+Kosovo&q=Coffee&start=60&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw",
            "5": "https://serpapi.com/search.json?device=desktop&engine=google_local&gl=de&google_domain=google.com&hl=en&location=Prishtina%2C+Pristina+District%2C+Kosovo&q=Coffee&start=80&uule=w+CAIQICIiUHJpc2h0aW5hLFByaXN0aW5hIERpc3RyaWN0LEtvc292bw"
        }
    }
}

JSON Data Exploration and Processing

Explore the JSON Data

Start by exploring the structure of your JSON files:
-- Explore the structure of restaurant files
DESC 'datasets/places/*.json';
SELECT * FROM 'datasets/places/*.json';

Clean the JSON

The JSON is somewhat unstructured because it contains a lot of unncessary metadata. Find a way to parse out the individual restaurant data.
-- Parse the data with the unnest function
SELECT unnest(local_results, max_depth:=5)
FROM 'datasets/_places/*.json';

Create a Table named ‘places’from the JSON

CREATE TABLE places 
AS SELECT unnest(local_results, max_depth:=5)
  FROM 'datasets/places/*.json';

Query aggregations

Query the count of places

SELECT count(*)
FROM places;

Query the average rating of places

SELECT avg(rating)
FROM places;

Analyze the data further on your own

Query the data on your own with other types of interesting aggregations / analytics. Write down some interesting queries and insights.

Find Top 10 Rated Places in Prishtina

Analyze the highest-rated and lowest-rated places with at least 50 reviews.
-- Top 10 by rating (minimum 50 reviews)
SELECT title, rating, reviews, address
FROM places
WHERE reviews >= 50
ORDER BY rating DESC, reviews DESC
LIMIT 10;

Geospatial Analytics

Install and Explore DuckDB Geospatial Extension

Enable geospatial capabilities:
-- Install spatial extension
INSTALL spatial;
LOAD spatial;

-- Create geometry points from coordinates
SELECT 
    title,
    ST_Point(longitude, latitude) as location,
    rating,
    reviews
FROM places
LIMIT 5;

Calculate Distances Between Places

Find the nearest place to ICK (Innovation Centre Kosovo)

-- Find restaurants within 1km of ICK (Innovation Centre Kosovo)
SELECT 
    title,
    ST_Distance_Sphere(
        ST_Point(latitude, longitude),
        ST_Point(42.65507999973656, 21.16488168321288)
    ) as distance_meters
FROM places
ORDER BY distance_meters;

Task Summary

Congratulations! You’ve completed the batch analytics portion of the workshop. You’ve learned to: ✅ Explore and clean JSON data
✅ Analyze places data from Google ✅ Implement geospatial analytics
Next Step: Ready for real-time analytics? Move on to Task 2: Real-Time Analytics to learn about ClickHouse and streaming data processing.