adam-gligor monologue about stuff, brain dump

Sqlalchemy complex queries and subqueries

Here’s how I put together a complex query in sqlalchemy using subqueries.

Approach

My brain already understands sql syntax so I choose something that reads like sql, it’s just my personal preference not the only syntax.

query or subquery = session.query(
  [select fields]
).select_from(
  [left_side]
).join(
  [right side or subquery]
).join(
  ....
).[other clauses]
[.subquery()]

Example

The following exercise showcases:

  • aggregation
  • filtering, sorting
  • table aliasing
  • column aliasing
  • sub queries

The model:

class Vendor(Base):
    __tablename__ = 'vendor'
    id = Column(Integer, primary_key=True)
    name = Column(String(100))
    status = Column(Integer)

class Domain(Base):
    __tablename__ = 'domain'
    id = Column(Integer, primary_key=True)
    fqdn = Column(String(255))
    vendor_id = Column(Integer, ForeignKey(Vendor.id))

class Revenue(Base):
    __tablename__ = 'revenue'
    id = Column(Integer, primary_key=True)
    date = Column(Date())
    domain_id = Column(Integer, ForeignKey(Domain.id))
    total_revenue = Column(Integer)

One vendor can have one or more domains and revenue tracks the daily revenues per domain

Problem

Find the vendors and the domains with highest revenue that contributed to 90% of the revenue of the past 7 days

session = DBSession()

# first subquery to calculate 90% of revenue of last 7 days
sub_query = session.query(
    0.9 * func.sum(Revenue.total_revenue)
).select_from(
    Revenue
).filter(
    func.datediff(func.now(), Revenue.date) <= 7
).subquery()

# second subquery will return only those domains that are in the top 90% revenue 
# (using join >= and sum to calculate partial totals)
revenue_a = aliased(Revenue)
revenue_b = aliased(Revenue)
sub_query2 = session.query(
    revenue_a.domain_id
).select_from(
    revenue_a
).join(
    revenue_b,
    revenue_b.total_revenue >= revenue_a.total_revenue
).filter(
    and_(revenue_a.total_revenue > 0, revenue_b.total_revenue > 0)
).filter(
    func.datediff(func.now(), revenue_a.date) <= 7
).group_by(
    revenue_a.domain_id
).having(
    func.sum(revenue_b.total_revenue) <= sub_query
).distinct().subquery()

# everything together
query = session.query(
    Vendor.id,
    Vendor.name,
    Domain.fqdn,
    func.sum(Revenue.total_revenue).label('revenue')
).select_from(
    Vendor
).join(
    Domain, Vendor.id == Domain.vendor_id
).join(
    Revenue, Revenue.domain_id == Domain.id
).join(
    sub_query2, sub_query2.c.domain_id == Domain.id
).group_by(
    Vendor.id, Domain.id, Domain.fqdn
)

Generated sql:

SELECT
    vendor.id AS vendor_id,
    vendor.name AS vendor_name,
    domain.fqdn AS domain_fqdn,
    sum(revenue.total_revenue) AS revenue
FROM
    vendor
    JOIN domain ON vendor.id = domain.vendor_id
    JOIN revenue ON revenue.domain_id = domain.id
    JOIN (
        SELECT DISTINCT
            revenue_1.domain_id AS domain_id
        FROM
            revenue AS revenue_1
            JOIN revenue AS revenue_2 ON revenue_2.total_revenue >= revenue_1.total_revenue
        WHERE
            revenue_1.total_revenue > ?
            AND revenue_2.total_revenue > ?
            AND datediff(CURRENT_TIMESTAMP, revenue_1.date) <= ?
        GROUP BY revenue_1.domain_id
        HAVING
            sum(revenue_2.total_revenue) <= (
                SELECT
                    ? * sum(revenue.total_revenue) AS anon_2
                FROM revenue
                WHERE
                    datediff(CURRENT_TIMESTAMP, revenue_1.date) <= ?
            )
    ) AS anon_1 ON anon_1.domain_id = domain.id
    GROUP BY
        vendor.id, domain.id, domain.fqdn

This problem can be solved in more than one way and this not necessary the best way, but it’s complex enough to serve as an example for the syntax.

Resources

Sqlalchemy docs