David Xie

Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation

2025-10-17T00:00:00+00:00

We are excited to announce that our paper Imaginarium has been accepted by ACM Transactions on Graphics (TOG)!

Overview

Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships.

Our Approach

In this paper, we present a novel vision-guided 3D layout generation system:

High-quality Asset Library: We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts.
Image Generation Fine-tuning: We employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library.
Robust Image Parsing: We develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information.
Scene Graph Optimization: We optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images.

Results

Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality.

Paper: arXiv:2510.15564
Code: GitHub

GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models

2025-10-09T00:00:00+00:00

We are proud to release GTR-Bench, a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network.

Motivation

Spatial-temporal intelligence of Vision-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI, and General Artificial Intelligence. Existing benchmarks mainly focus on egocentric perspective reasoning or geographic perspective reasoning with graphics context, failing to assess VLMs’ geographic spatial-temporal intelligence with both images/video and graphics context.

Key Challenges

GTR-Bench is more challenging as it requires:

Multiple perspective switches between maps and videos
Joint reasoning across multiple videos with non-overlapping fields of view
Inference over spatial-temporal regions unobserved by any video context

Key Findings

Evaluations of more than 10 popular VLMs reveal three primary deficiencies:

VLMs’ reasoning is impaired by an imbalanced utilization of spatial-temporal context
VLMs are weak in temporal forecasting, performing worse on temporal-emphasized tasks
VLMs lack proficiency in comprehending or aligning map data with multi-view video inputs

Even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%).

Paper: arXiv:2510.07791
Code & Data: GitHub

DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding

2025-04-11T00:00:00+00:00

We introduce the Diverse Semantic Map (DSM) framework, a novel scene representation designed to enhance deep reasoning in 3D Visual Grounding tasks.

Problem

Existing methods for 3D Visual Grounding are often constrained — they either only focus on geometric and visual cues, or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning.

Our Framework

The DSM framework enriches robust geometric models with a spectrum of VLM-derived semantics, including:

Appearance attributes: color, patterns, texture
Physical attributes: weight, material, surface properties
Affordance attributes: functional aspects and operational methods

We construct the DSM online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model.
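As a rough illustration of sliding-window fusion, the sketch below keeps only the most recent observations of each object and merges their attributes by majority vote. The class name `SlidingWindowFuser`, the window size, and the majority-vote rule are all assumptions made for this example; the actual DSM fusion procedure is described in the paper and may differ.

```python
from collections import Counter, deque


class SlidingWindowFuser:
    """Hypothetical helper: fuses per-object attribute observations
    over the last `window` frames using a simple majority vote."""

    def __init__(self, window=3):
        self.window = window
        self.history = {}  # object id -> deque of attribute dicts

    def observe(self, object_id, attributes):
        # deque(maxlen=...) silently evicts the oldest observation
        buf = self.history.setdefault(object_id, deque(maxlen=self.window))
        buf.append(dict(attributes))

    def fused(self, object_id):
        # Count votes per attribute across the observations still in the window
        votes = {}
        for obs in self.history.get(object_id, ()):
            for key, value in obs.items():
                votes.setdefault(key, Counter())[value] += 1
        return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


fuser = SlidingWindowFuser(window=2)
fuser.observe("chair_1", {"color": "red", "material": "wood"})
fuser.observe("chair_1", {"color": "brown"})
fuser.observe("chair_1", {"color": "brown", "material": "wood"})
print(fuser.fused("chair_1"))
```

With a window of two, the first (red) observation has already been evicted by the time the map is queried, so only the two most recent views contribute to the fused attributes.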

DSM-Grounding

Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to a structured reasoning process over the semantic-rich map, markedly improving accuracy and interpretability.

Results

ScanRefer benchmark: 59.06% overall accuracy (IoU@0.5), surpassing prior methods by 10%
Semantic segmentation: 67.93% F-mIoU, outperforming all baselines including privileged ones
Successfully deployed on physical robots for navigation and grasping tasks
Paper: arXiv:2504.08307
Project Page: binicey.github.io/DSM
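For readers unfamiliar with the IoU@0.5 metric quoted in the results above, here is a minimal sketch of how such an accuracy is typically computed. It assumes axis-aligned 3D boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and counts a prediction as correct when its IoU with the ground truth is at least the threshold; the function names are illustrative and this is not the benchmark's official evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, box[axis + 3] - box[axis])
    return volume


def iou_3d(a: Box, b: Box) -> float:
    # Intersection volume is the product of the per-axis overlaps
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        inter *= max(0.0, hi - lo)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5) -> float:
    # Fraction of predictions whose IoU with the ground truth reaches the threshold
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
shifted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)  # overlaps half of gt along x
print(iou_3d(gt, shifted))
print(accuracy_at_iou([gt, shifted], [gt, gt]))
```

In the usage lines, the shifted box shares a volume of 4 with the ground truth out of a union of 12, so it falls below the 0.5 threshold and only one of the two predictions counts as a hit.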

Google Gemini updates: Flash 1.5, Gemma 2 and Project Astra

2024-05-14T00:00:00+00:00

We’re introducing a series of updates across the Gemini family of models, including the new 1.5 Flash, our lightweight model for speed and efficiency, and Project Astra, our vision for the future of AI assistants.

In December, we launched our first natively multimodal model Gemini 1.0 in three sizes: Ultra, Pro and Nano. Just a few months later we released 1.5 Pro, with enhanced performance and a breakthrough long context window of 1 million tokens.

Developers and enterprise customers have been putting 1.5 Pro to use in incredible ways and finding its long context window, multimodal reasoning capabilities and impressive overall performance incredibly useful.

We know from user feedback that some applications need lower latency and a lower cost to serve. This inspired us to keep innovating, so today, we’re introducing Gemini 1.5 Flash: a model that’s lighter-weight than 1.5 Pro, and designed to be fast and efficient to serve at scale.

Both 1.5 Pro and 1.5 Flash are available in public preview with a 1 million token context window in Google AI Studio and Vertex AI. And now, 1.5 Pro is also available with a 2 million token context window via waitlist to developers using the API and to Google Cloud customers.

We’re also introducing updates across the Gemini family of models, announcing our next generation of open models, Gemma 2, and sharing progress on the future of AI assistants, with Project Astra.

Figure: Context lengths of leading foundation models compared with Gemini 1.5’s 2 million token capability.

1.5 Flash is the newest addition to the Gemini model family and the fastest Gemini model served in the API. It’s optimized for high-volume, high-frequency tasks at scale, is more cost-efficient to serve and features our breakthrough long context window.

While it’s a lighter weight model than 1.5 Pro, it’s highly capable of multimodal reasoning across vast amounts of information and delivers impressive quality for its size.

1.5 Flash excels at summarization, chat applications, image and video captioning, data extraction from long documents and tables, and more. This is because it’s been trained by 1.5 Pro through a process called “distillation,” where the most essential knowledge and skills from a larger model are transferred to a smaller, more efficient model.

Read more about 1.5 Flash in our updated Gemini 1.5 technical report, on the Gemini technology page, and learn about 1.5 Flash’s availability and pricing.

Over the last few months, we’ve significantly improved 1.5 Pro, our best model for general performance across a wide range of tasks.

Beyond extending its context window to 2 million tokens, we’ve enhanced its code generation, logical reasoning and planning, multi-turn conversation, and audio and image understanding through data and algorithmic advances. We see strong improvements on public and internal benchmarks for each of these tasks.

1.5 Pro can now follow increasingly complex and nuanced instructions, including ones that specify product-level behavior involving role, format and style. We’ve improved control over the model’s responses for specific use cases, like crafting the persona and response style of a chat agent or automating workflows through multiple function calls. And we’ve enabled users to steer model behavior by setting system instructions.

We added audio understanding in the Gemini API and Google AI Studio, so 1.5 Pro can now reason across image and audio for videos uploaded in Google AI Studio. And we’re now integrating 1.5 Pro into Google products, including Gemini Advanced and in Workspace apps.

Read more about 1.5 Pro in our updated Gemini 1.5 technical report and on the Gemini technology page.

Gemini Nano is expanding beyond text-only inputs to include images as well. Starting with Pixel, applications using Gemini Nano with Multimodality will be able to understand the world the way people do, not just through text, but also through sight, sound and spoken language.

Read more about Gemini 1.0 Nano on Android.

Today, we’re also sharing a series of updates to Gemma, our family of open models built from the same research and technology used to create the Gemini models.

We’re announcing Gemma 2, our next generation of open models for responsible AI innovation. Gemma 2 has a new architecture designed for breakthrough performance and efficiency, and will be available in new sizes.

The Gemma family is also expanding with PaliGemma, our first vision-language model inspired by PaLI-3. And we’ve upgraded our Responsible Generative AI Toolkit with LLM Comparator for evaluating the quality of model responses.

Read more on the Developer blog.

As part of Google DeepMind’s mission to build AI responsibly to benefit humanity, we’ve always wanted to develop universal AI agents that can be helpful in everyday life. That’s why today, we’re sharing our progress in building the future of AI assistants with Project Astra (advanced seeing and talking responsive agent).

To be truly useful, an agent needs to understand and respond to the complex and dynamic world just like people do, and take in and remember what it sees and hears to understand context and take action. It also needs to be proactive, teachable and personal, so users can talk to it naturally and without lag or delay.

While we’ve made incredible progress developing AI systems that can understand multimodal information, getting response time down to something conversational is a difficult engineering challenge. Over the past few years, we’ve been working to improve how our models perceive, reason and converse to make the pace and quality of interaction feel more natural.

Building on Gemini, we’ve developed prototype agents that can process information faster by continuously encoding video frames, combining the video and speech input into a timeline of events, and caching this information for efficient recall.

By leveraging our leading speech models, we also enhanced how they sound, giving the agents a wider range of intonations. These agents can better understand the context they’re being used in, and respond quickly, in conversation.

With technology like this, it’s easy to envision a future where people could have an expert AI assistant by their side, through a phone or glasses. And some of these capabilities are coming to Google products, like the Gemini app and web experience, later this year.

We’ve made incredible progress so far with our family of Gemini models, and we’re always striving to advance the state-of-the-art even further. By investing in a relentless production line of innovation, we’re able to explore new ideas at the frontier, while also unlocking the possibility of new and exciting Gemini use cases.

Learn more about Gemini and its capabilities.


Displaying External Posts on Your al-folio Blog

2022-04-23T23:20:09+00:00

External Posts on Your al-folio Blog

If you prefer publishing blog posts on medium.com or other external sources, starting with version v0.5.0, al-folio lets you display your external posts in the blog feed of your website! 🎉🎉

Configuring external sources is super simple. After upgrading to v0.5.0, just add the following section to your _config.yml:

external_sources:
  - name: medium.com  # name of the source (arbitrary string)
    rss_url: https://medium.com/@/feed

The example above adds your medium.com blog post feed as an external source. But you can add arbitrary RSS feeds as sources.
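To give a sense of what an RSS source provides, the sketch below parses a small, made-up RSS 2.0 feed and extracts the title, link, and publication date of each item. It is only an illustration using the Python standard library, not al-folio's own code; the sample feed, the `example.com` URLs, and the function name `parse_rss_items` are assumptions for this example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up feed content for illustration; a real feed would be fetched from rss_url
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post</link>
      <pubDate>Sat, 23 Apr 2022 23:20:09 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second-post</link>
      <pubDate>Sun, 24 Apr 2022 08:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text):
    """Return a list of dicts with the title, link, and ISO date of each <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate", default="")
        # RSS 2.0 dates use the RFC 822 style, which email.utils can parse
        published = parsedate_to_datetime(raw_date).isoformat() if raw_date else ""
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": published,
        })
    return items


for post in parse_rss_items(SAMPLE_FEED):
    print(post["published"], post["title"], post["link"])
```

Any feed that exposes these fields per item can, in principle, be listed alongside local posts in the same way.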

Any questions or suggestions? 👉 Start a discussion on GitHub!