I'm actually really excited for this!
I noticed recently there weren't any good open source hardware projects for voice assistants with a focus on privacy. There's another project I've been thinking about where I think the privacy aspect is Important, and figuring out a good hardware stack has been a Process. The project I want to work on isn't exactly a voice assistant, but same ultimate hardware requirements
Something I'm kinda curious about: it sounds like they're planning on a sorta batch manufacturing by resellers type of model. Which I guess is pretty standard for hardware sales. But why not do a sorta "group buy" approach? I guess there's nothing stopping it from happening in conjunction
I've had an idea floating around for a site that enables group buys for open source hardware (or 3d printed items), that also acts like or integrates with github wrt forking/remixing
We need more projects like home assistant. I started using it recently and was amazed. They sell their own hardware but the whole setup is designed to work on any other hardware. There are detailed docs for installation on your own hardware. And, it works amazingly well.
Same for their voice assistant. You can buy their hardware and get started right away or you can place your own mics and speakers around home and it will still work. You can buy your own beefy hardware and run your own LLM.
The possibilities with home assistant are endless. Thanks to this community for breaking the barriers created by big tech
> We need more projects like home assistant
Isn't openHAB an existing popular alternative?
HA long ago blew past OpenHAB in functionality and community.
Unless you have a hard-on for JVM services, HA is the better XP these days.
When I was evaluating both projects about 5 years ago, I went with openHAB because they had native apps with native controls (and thus nicer design imo). At the time, HA was still deep in YML config files and needed validation before saving etc etc. Not great UX.
Nowadays, HA has more of the features I would want and other external projects exist to create your own dashboards that take advantage of native controls.
Today I’m using Homey because I’m still a sucker for design and UX after a long day of coding boring admin panels in the day job, but I think in another few years when the hardware starts to show its age that I will move to home assistant. Hell, there exists an integration to bring HA devices into Homey but that would require running two hubs and potentially duplicating functionality. We shall see.
I keep it simple, I use the HomeKit bridge integration to expose the Home Assistant devices that I want in iOS. I don’t expose everything, though, some of the more advanced or obscure devices I purposely keep hidden in Home Assistant. It strikes a nice balance in my opinion.
i’m assuming you can do something similar with Google home, etc.
but like you said, you could always build your own dashboard from scratch if you wanted to.
> HA long ago blew past OpenHAB in [...] community.
Home Assistant seems insurmountable to beat at that specific metric, seems to be the single biggest project in terms of contributions from a wide community. Makes sense, Home Assistant tries to do a lot of things, and succeeds at many of them.
I think they meant "projects with a culture and mindset like homeassistant", not just a competitor to the existing project.
It’s a great project overall, but I’ve been frustrated by how anti-engineer it has been trending.
Install the Node-RED add on. I use that to do the tricky stuff.
Install the whole thing on top of stock Debian "supervised" then you get a full OS to use.
You get a fully integrated MQTT broker with full provisioning - you don't need a webby API - you have an IoT one instead!
This is a madly fast moving project with a lot of different audiences. You still have loads of choice all tied up in the web interface.
+1 on installing supervised on stock debian. It feels like any other software and I still get to keep full control of my system.
I’m currently running HA, Frigate and pihole on the same machine
Or the Digital Alchemy addon. Lets you write your automations using TypeScript
Do you mean the move away from YAML first configs?
I was originally somewhat frustrated, but overall, it's much better (let's be honest, YAML sucks) and more user friendly (by that I mean having a form with pre-filled fields is easier than having to copy paste YAML).
Yes, config is a major part of it. But also a lack of good APIs, very poor dev documentation, not great logging. A general “take it or leave it” attitude, not interested in enabling engineers to build.
I don’t think that’s true. Their docs are great and the community is active and responsive in forums and github
It's worse though when you need to add a ton of custom sensors at once, e.g., for properly automating a Solar PV + Battery solution.
Yes - for now. I think the ultimate end-goal is to get rid of the YAML config files, which makes sense for the median user, but not for power users.
For example, I have my config on GitHub and share various YAML blueprints with a friend who also has the same Solar+Battery system as I do.
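For the "ton of custom sensors at once" case, one option is to generate the definitions from a script instead of clicking through forms or hand-copying YAML. A minimal sketch, assuming the sensor values are already being published to the MQTT broker and that Home Assistant's MQTT discovery is enabled with its default `homeassistant` prefix. The sensor list, the `solar` node name and the `solar/<id>/state` topic layout are all made up for illustration.

```python
import json

# Hypothetical sensors for a PV + battery setup:
# (object_id, friendly name, unit, device_class)
SENSORS = [
    ("pv_power", "PV Power", "W", "power"),
    ("battery_power", "Battery Power", "W", "power"),
    ("battery_soc", "Battery State of Charge", "%", "battery"),
    ("grid_import_energy", "Grid Import Energy", "kWh", "energy"),
]


def discovery_messages(sensors, prefix="homeassistant", node="solar"):
    """Return a list of (topic, payload_json) MQTT discovery messages."""
    messages = []
    for object_id, name, unit, device_class in sensors:
        # Discovery topic layout: <prefix>/<component>/<node_id>/<object_id>/config
        topic = f"{prefix}/sensor/{node}/{object_id}/config"
        payload = {
            "name": name,
            "unique_id": f"{node}_{object_id}",
            "state_topic": f"{node}/{object_id}/state",  # assumed topic layout
            "unit_of_measurement": unit,
            "device_class": device_class,
        }
        messages.append((topic, json.dumps(payload)))
    return messages


if __name__ == "__main__":
    for topic, payload in discovery_messages(SENSORS):
        print(topic, payload)
```

Each pair would then be published as a retained message with whatever MQTT client you already use. The script itself is easy to keep on GitHub and share, much like the YAML blueprints.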
For "integrated" stuff, their stance is "UI Must Work". Tracing down the requirements, here:
https://design.home-assistant.io/#concepts/home
https://developers.home-assistant.io/docs/configuration_yaml_index
https://github.com/home-assistant/architecture/blob/master/adr/0010-integration-configuration.md
...usually there's YAML kicking around the backend, but for normal usage, normal users, the goal is to be able to configure all (most) things via UI. I've had to drop to YAML to configure (eg) writing stats to InfluxDB/Grafana vs. SQLite (or something), or maybe to drop in or update an API_KEY or non-standard host/port, but 99% of the time the config is baroque, but usable via the web-app.
> But like, isn't YAML still available for configuring things?
For most, yes. But for some included integrations it's UI-only (for all of those I've had to migrate, it's been a single click + comment out lines, and the config has been a breeze: stuff like just an api key/IP address + 1-2 optional params).
Oh thank god. Just started using HA a few months ago and all this YAML is so confusing when I try to code it with ChatGPT, constant syntax or some other random errors.
> when I try to code it with ChatGPT
so don't do that... just rtfm and it's easy
How so?
I’m a different user, but I can say I’ve been frustrated with their refusal to support OIDC/oauth/literally any standard login system. There is a very long thread on their forums documenting the many attempts by people to contribute this feature.[0] The devs simply shut it down every time, with little to no explanation.
I run many self hosted applications on my local network. Homeassistant is the only one I’m running that has its own dedicated login. Everything else I’m using has OIDC support, or I can at least unobtrusively stick a reverse proxy in front to require OIDC login.
[0] https://community.home-assistant.io/t/open-letter-for-improv...
Edit: things like this [1] don’t help either. Where one of the HA devs threatens to relicense a dependency so that NixOS can’t use it, because… he doesn’t want them to? The license permits them to. Seemed very against the spirit of open source to me.
I am working on automation of phones (open source) - https://github.com/BandarLabs/clickclickclick
I haven't quite been able to get the Llama vision models working, but I suppose with new releases in future, it should work as well as Gemini in finding bounding boxes of UI elements.
Completely agree! Home Assistant feels like a breath of fresh air in a space dominated by big tech's walled gardens.
If it's possible for the hardware to facilitate a use case, the employees working on the product will try to push the limits as far as they possibly can in order to manufacture interesting and challenging problems that will get them higher performance ratings and promotions. They will rationalize away privacy violations by appealing to their "good intentions" and their amazing ability to protect information from nefarious actors. In their minds they are working for "the good guys" who will surely "do the right thing."
At various times in the past, the teams involved in such projects have at least prototyped extremely invasive features with those in-home devices. For example, one engineer I've visited with from a well-known in-home device manufacturer worked on classifiers that could distinguish between two people having sex and one person attacking another in audio captured passively by the microphones.
As the corporate culture and leadership shifts over time I have marginal confidence that these prototypes will perpetually remain undeveloped or on-device only. Apple, for instance, has decided to send a significant amount of personal data to their "Private Cloud" and is taking the tactic of opening "enough" of its infrastructure for third-party audit to make an argument that the data they collect will only be used in a way that the user is aware and approves of. Maybe Apple can get something like that to a good enough state, at least for a time. However, they're inevitably normalizing the practice. I wonder how many competitors will be equally disciplined in their implementations.
So my takeaway is this: If there exists a pathway between a microphone and the Internet that you are not in 100% control over, it's not at all unreasonable to expect that anything and everything that microphone picks up at any time will be captured and stored by someone else. What happens with that audio will -- in general -- be kept out of your knowledge and control so long as there is insufficient regulatory oversight.
Open source
Yeah, OP is comparing this to Google/Amazon/Apple/etc devices but this is being developed by the nonprofit that manages development on Home Assistant and in cooperation with their large community of users. It's a very different attitude driving development of voice remotes for Home Assistant vs. large corporations. They've been around for a while now and have a proven track record of being actual, serious advocates for data privacy and user autonomy. Maybe they won't be forever, but then this thing is open source.
The whole point is that you control what these things do, and that you can run these things fully locally if you want with no internet access, and run your own custom software on them if that's what you want to do. This is a product for the Home Assistant community that will probably never turn much of a profit, nor do I expect it is intended to.
It's too bad it's sold out everywhere. I've tried the ESP32 projects (little cube guy) for voice assistants in HA but its mic/speaker weren't good enough. When it did hear me (and I heard it) it did an amazing job. For the first time I talked to a voice assistant that understood "Turn off office lights" to mean "Turn off all the lights in the office" without me giving it any special grouping (like I have to do in Alexa and then it randomly breaks). It handled a ton of requests that are easy for any human but Alexa/Siri trip up on.
I cannot wait to buy 5 or more of these to replace Alexa. HA is the brain of my house and up till now Alexa provided the best hardware to interact with HA (IMHO) but I'd love something first-party.
How did you find it for music tasks?
I didn’t test that. I normally just manually play through my Sonos speaker groups on my phone. I don’t like the sound from the Echos so I’m not in the habit of asking them to do anything related to music.
Right now I only use Alexa for smart house control and setting timers
I'm definitely buying one for robotics, having a dedicated unit for both STT and TTS that actually works and integrates well would make a lot of social robots more usable and far easier to set up and maintain. Hopefully there's a ROS driver for it eventually too.
That's a pretty timely release considering Alexa and the Google assistant devices seem to have plateaued or are on the decline.
Curious what you mean by that.
For me the Alexa devices I own have gotten worse. Can't do simple things (setting a timer used to be instant, now it takes 10-15 seconds of thinking assuming it heard properly), playing music is a joke (will try to play through Deezer even though I disabled that integration months ago, and then will default to Amazon Music instead of Spotify which is set as the default).
And then even simple skills can't understand what I'm asking 60% of the time. The first maybe 2 years after launch it seemed like everything worked pretty good but since then it's been a frustrating decline.
Currently they are relegated to timers and music, and they can't even manage those half the time anymore.
It is, I think, a common feeling among Echo/Alexa users. Now that people are getting used to the amazing understanding capabilities of ChatGPT and the likes, it probably increases the frustration level because you get a hint of how good it could be.
I believe it boils down to two main issues:
- The narrow AI systems used for intent inference have not scaled with the product features.
- Amazon is stuck and can't significantly improve it using general AI due to costs.
The first point is that the speech-to-intent algorithms currently in production are quite basic, likely based on the state of the art from 2013. Initially, there were few features available, so the device was fairly effective at inferring what you wanted from a limited set of possibilities. Over time, Amazon introduced more and more features to choose from, but the devices didn't get any smarter. As a result, mismatches between actual intent and inferred intent became more common, giving the impression that the device is getting dumber. In truth, it’s probably getting somewhat smarter, but not enough to compensate for the increasing complexity over time.
The second point is that, clearly, it would be relatively straightforward to create a much smarter Alexa: simply delegate the intent detection to an LLM. However, Amazon can’t do that. By 2019, there were already over 100 million Alexa devices in circulation, and it’s reasonable to assume that number has at least doubled by now. These devices are likely sold at a low margin, and the service is free. If you start requiring GPUs to process millions of daily requests, you would need an enormous, costly infrastructure, which is probably impossible to justify financially—and perhaps even infeasible given the sheer scale of the product.
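To put a rough number on that scale, here is a back-of-the-envelope sketch. The 200 million device count follows the "at least doubled" guess above; the five requests per device per day is purely an assumption for illustration, not a published figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_requests_per_second(devices, requests_per_device_per_day):
    """Average load if requests were spread evenly over the day."""
    return devices * requests_per_device_per_day / SECONDS_PER_DAY


devices = 200_000_000      # "at least doubled" from 100 million
requests_per_device = 5    # assumption, not a published figure

rps = average_requests_per_second(devices, requests_per_device)
print(f"{rps:,.0f} requests/second on average")
```

With these inputs that is about 11,574 requests per second on average, before accounting for daily peaks, each of which would need LLM inference rather than a cheap intent classifier.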
My prediction is that Amazon cannot save the product, and it will die a slow death. It will probably keep working for years but will likely be relegated by most users to a "dumb" device capable of little more than setting alarms, timers, and providing weather reports.
If you want Jarvis-like intelligence to control your home automation system, the vision of a local assistant using local AI on an efficient GPU, as presented by HA, is the one with the most chance of succeeding. Beyond the privacy benefits of processing everything locally, the primary reason this approach may become common is that it scales linearly with the installation.
If you had a cloud-based solution using Echo-like devices, the problem is that you’d need to scale your cloud infrastructure as you sell more devices. If the service is good, this could become a major challenge. In contrast, if you sell an expensive box with an integrated GPU that does everything locally, you deploy the infrastructure as you sell the product. This eliminates scaling issues and the risks of growing too fast.
I didn't downvote it, but claiming that Echo/Alexa are behind because of financial reasons is misguided at best.
Amazon is one of the richest companies on the planet, with vast datacenters that power large parts of the internet. If they wanted to improve their AI products they certainly have the resources to do so.
You are right, but that's not my point. The point is that it's difficult to scale cloud products that require lots of AI workloads.
Here, home assistant is telling you: you can use your own infra (most people won't) or you can use our cloud.
It works because most likely the user base will be rather small and home assistant can get cloud resources as if it was infinite on that scale.
If their product was amazing, and suddenly millions of people wanted to buy the cloud version, they would have a big problem: cloud infrastructure is never infinite at scale. They would be limited by how much compute their cloud provider is able/willing to sell them, rather than how many of those small boxes they could sell, possibly losing the opportunity to corner the market with a great product.
If you package everything, you don't have that problem (you only have the problem of being able to make the product, which I agree is also not small). But in terms of energy efficiency, it also does not have to be that bad: the Apple silicon line has shown that you can have very efficient hardware with significant AI capabilities; if you design a SoC for that purpose, it can be energy efficient.
Maybe I'm wrong that the approach will become common, but the fact that scaling AI services to millions of users is hard stands.
This is very well thought out but I think your premise is a bit wrong. I have about a dozen Echos of various generations in my house. The oldest one is the very original from the preview stage. They still do everything I want them to and my entire family still uses them daily with zero frustration.
Local GPU doesn’t make sense for some of the same reasons you list. First, hardware requirements are changing rapidly. Why would I spend say $500 on a local GPU setup when in two years the LLM running on it will slow to a crawl due to limited resources? Probably would make more sense to rent a GPU on the cloud and upgrade as new generations come out.
Amazon has the opposite situation: their hardware and infra is upgraded en masse so different economies. Also while your GPU is idling at 20-30W while you aren’t home they can have 100% utilization of their resources because their GPUs are not limited to one customer at a time. Plus they can always offload the processing by contracting OpenAI or similar. Google is in an even better position to do this. Running a local LLM today doesn’t make a lot of sense, but it probably will at some point in like 10 years. I base this on the fact that the requirements for a device like a voice assistant are limited so at some point the hardware and software will catch up. We saw this with smartphones: you can now go 5 years without upgrading and things still work fine. But that wasn’t the case 10 years ago.
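The idle-power point is easy to quantify. A small sketch, assuming the GPU box idles around the clock at the midpoint of the 20-30W range; the $0.15/kWh electricity price is an assumption and varies a lot by region.

```python
HOURS_PER_YEAR = 24 * 365


def annual_idle_cost(idle_watts, price_per_kwh):
    """Return (kWh per year, cost per year) for a device idling 24/7."""
    kwh = idle_watts * HOURS_PER_YEAR / 1000
    return kwh, kwh * price_per_kwh


kwh, cost = annual_idle_cost(idle_watts=25, price_per_kwh=0.15)  # price is assumed
print(f"{kwh:.0f} kWh/year, about ${cost:.2f}/year")
```

That works out to 219 kWh a year, or roughly $33 at the assumed price, on top of the hardware cost.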
Second, Amazon definitely goofed. They thought people would use the Echos for shopping. They didn’t. Literally the only uses for them are alarms and timers, controlling lights and other smart home devices, and answering trivia questions. That’s it. What other requirements do you have that don’t fall in this category? And the Echos do this stuff incredibly well. They can do complex variations too, including turning off the lights after a timer goes off, scheduling lights, etc. Amazon is basically giving these devices away but the way to pivot this is to release a line of smart devices that connect to the Echos: smart bulbs and switches, smart locks, etc. They do have TVs which you can control with an Echo fairly well (and it is getting better). An ecosystem of smart devices that seamlessly interoperate will dwarf what HA has to offer (and I say this as someone who is firmly on HA’s side). And this is Amazon’s core competency: consumer devices and sales.
If your requirement is that you want Jarvis, it’s not the voice device part of it that you want. You want what it is connected to: a self driving car you can summon, DoorDash you can order by saying “I want a pizza”, a phone line so it can call your insurance company and dispute a claim on your behalf.
Now the last piece here is privacy and it’s a doozy. The only way to solve this for Amazon is to figure out some form of encrypted computation that allows for your voice prompts to be processed without them ever hearing clear voice versions. Mathematically possible, practically not so much. But clearly consumers don’t give a fuck whatsoever about it. They trust Amazon. That’s why there are hundreds of millions of these devices. So in effect while people on HN think they are the target market for these devices, they are clearly the opposite. We aren’t the thought leaders, we are the Luddites. And again I say this as someone who wishes there was a way to avoid the privacy issue, to have more control over my own tech, etc. I run an extensive HA setup but use Echos for the voice control because at least for now they are the best value. I am excited about TFA because it means there might be a better choice soon. But even here a $59 device is going to have a hard time competing with one that routinely goes on sale for $19.
That’s interesting because I have a bunch of Echos of various types in my house and my timers and answers are instant. Is it possible your internet connection is wonky or you have a slow DNS server or congested Wi-Fi? I don’t have the absolute newest devices but the one in my bedroom is the very original Echo that I got during their preview stage, the one in my kitchen is the Echo Show 7” and I have a bunch of puck ones and spherical ones (don’t remember the generations) around the house. One did die at one point after years of use and got replaced but it was in my kids room so I suspect it was subject to some abuse.
I too get pretty consistent response and answers from Alexa these days. There has been some vague decline in the quality of answers (I think sometime back they removed the ability to ask for Wikipedia data), but have no trouble with timers and the few linked wemo switches I have.
I’m also the author of an Alexa skill for a music player (basic “transport” control mostly) that i use every day, and it still works the same as it always did.
Occasionally I’ll get some freakout answer or abject failure to reply, but it’s fairly rare. I did notice it was down for a whole weekend once; that’s surely related to staffing or priorities.
Amazon also fired a large number of people from the Alexa team last year. I don't really think Alexa is a major priority for Amazon at this point.
I don't blame them. Sure, there are millions of devices out there, but some people might own five devices. So there aren't as many users as there are devices, and the devices aren't making Amazon any money once bought, not like the Kindle.
Frankly I know shockingly few people who use Siri/Alexa/Google Assistant/Bixby. It's not that voice assistants don't have a use, but it is a much, much smaller use case than initially envisioned and there's no longer the money to fund the development; the funds went into blockchain and LLMs. Partly the decline is because it's not as natural an interface as we expected; partly it's because, to be actually useful, the assistants need access to control things that we may not be comfortable with, or which may pose a liability to the manufacturers.
That aligns with some of the frustration I’ve heard from others. It’s surprising (and disappointing) how these platforms, which seemed to have so much potential early on, have started to feel more like a liability
I was an early adopter of google home, have had several generations (including the latest). I quite like the devices, but the voice recognition seems to be getting worse not better. And the Pandora integration crashes frequently.
In addition, it's a moron. I'm not sure it's actually gotten dumber, but in the age of chatgpt, asking google assistant for information is worse than asking my 2nd grader. Maybe it will be able to quote part of a relevant web page, but half the time it screws that up. I just want it to convert my voice to text, submit it to chatgpt or claude, and read the response back to me.
All that said, the audio quality is good and it shows pictures of my kid when idle. If they suddenly disappeared I would replace them.
GH is basically abandonware at this stage it seems. They just seem to break random things, and there haven’t been any major updates / features for ages (and Gemini is still a way off for most).
Google Home's Nest integration is recent and top-notch though.
Hopefully in a year they'll have rolled out the Gemini integration and things will be back on track.
Google and Amazon refuse to put GenAI into their existing speakers (which barely function). No doubt they want a new product launch to charge more.
On the Google side it's become basically useless for anything beyond interacting with local devices and setting timers and reminders (in other words, the things that FOSS should be able to do very easily). Its only edge over other options used to be answering questions quickly without having to pull out a screen, but now it refuses to answer anything (likely because Google Search has removed their old quick answers in favor of Gemini answers).
Had to laugh a bit at the caveat about powerful hardware. Was bracing myself for GPU and then it says N100 lol
I mean, comparatively, many people are hosting their Home Assistant on a Raspberry Pi so it is relatively powerful :D
And the CM5 is nearly equivalent in terms of the small models you run. Latency is nearly the same, though you can get a little more fancy if you have an N100 system with more RAM, and "unlocked" thermals (many N100 systems cap the power draw because they don't have the thermal capacity to run the chip at max turbo).
If we're being fair you can more like, walk models, not run them :)
A 125H box may be three times the price of an N100 box, but the power draw is about the same (6W idle, 28W max, with turbo off anyway) and with the Arc iGPU the prompt processing is in the hundreds, so near instant replies to longer queries are doable.
I don't fully understand the cloud upsell. I have a beefy GPU. I would like to run the "more advanced" models locally.
By "I don't fully understand," I mean just that. There's a lot of marketing copy, but there's a lot I'd like to understand better before plopping down $$$ for a unit. The answers might be reasonable.
Ideally, I'd be able to experiment with a headset first, and if it works well, upgrade to the $59 unit.
I'd love to just have a README, with a getting started tutorial, play, and then upgrade if it does what I want.
Again: None of this is a complaint. I assume much of this is coming once we're past preview edition, or is perhaps there and my search skills are failing me.
You can do exactly that - set up an Assist pipeline that glues together services running wherever you want, including a GPU node for faster-whisper. The HA interface even has a screen where you can test your pipeline with your computer’s microphone.
It’s not exactly batteries-included, and doesn’t exercise the on-device wake word detection that satellite hardware would provide, but it’s doable.
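The "glue together services running wherever you want" idea is roughly the shape below. This is only an illustrative sketch, not Home Assistant's actual code: the `VoicePipeline` class and the three stand-in stages are hypothetical, and in a real setup each stage would call out to something like a Whisper server on a GPU node, the intent handler, and a TTS engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Three pluggable stages; each could be a local or a remote service."""
    speech_to_text: Callable[[bytes], str]
    handle_intent: Callable[[str], str]
    text_to_speech: Callable[[str], bytes]

    def run(self, audio: bytes) -> bytes:
        text = self.speech_to_text(audio)
        reply = self.handle_intent(text)
        return self.text_to_speech(reply)


# Stand-in stages so the sketch runs without any models installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def fake_intent(text: str) -> str:
    if text.lower() == "turn off office lights":
        return "Turned off the office lights"
    return "Sorry, I didn't understand that"


def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")  # pretend these bytes are synthesized audio


pipeline = VoicePipeline(fake_stt, fake_intent, fake_tts)
print(pipeline.run(b"Turn off office lights"))
```

Because each stage is just a callable, swapping the STT stage for one that talks to a GPU box does not touch the other two, which is the property that makes experimenting with a headset first feasible.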
But I don’t know that the unit will be an “upgrade” over most headsets. These devices are designed to be cheap, low-power, and have to function in tougher scenarios than speaking directly into a boom mic.
It's an upgrade mostly because putting on a headset to talk to an assistant means it's not worth using the assistant.
Does it use Node-RED for the pipeline?
No, all of the voice parts are either inbuilt or direct addons.
Finding microphones that look nice, can pick up voice at high enough quality to extract commands and that cover an entire room is surprisingly hard.
If this device delivers on audio quality it's totally worth it at $59.
I've found it quite hard to find decent hardware with both the input capability needed for wakeword and audio capture at a distance, whilst also having decent speaker quality for music playback.
I started using the Box-3 with heywillow which did amazing input and processing using ML on my GPU, but the speaker is awful. I built a speaker of my own using a raspberry pi Z2W, dac and some speakers in a 3d printed enclosure I designed, and added a shim to the server so that responses came from my speaker rather than the cheap/tiny speaker in the box-3. I'll likely do the same now with the Voice PE, but I'm hoping that the grove connector can be used to plonk it on top of a higher quality speaker unit and make it into a proper music player too.
As soon as I have it in my hands, I intend to get straight to work looking at a way to modify my speaker design to become an addon "module" for the PE.
In many cases the issue isn't the microphone but the horrid amount of reflections that the sound produces before reaching it. A quite good microphone can be built using cheap, yet very clean, capsules like the AOM-5024L-HD-F-R (80 dB S/N), which is ~$3 at Mouser. Room acoustics are a lot more important, though, and also a real pain in the ass, when not also a bank account drain if done professionally, although usually carpets, wood furniture, curtains to cover glass and sound panels on concrete walls can be more than enough.
100%. For a lot of users that have WAF and time available to contend with, this is a steal.
Bear in mind that a $50 google home or Alexa mini(?) is always going to be whatever google deem it to be. This is an open device which can be whatever you want it to be. That’s a lot of value in my eyes.
This device is just the mic/speaker/wakeword part. It connects to home-assistant to do the decoding and automation. You can test it right now by downloading home-assistant and running it on a pi or a VM. You can run all the voice assist stuff locally if you want. There are services for the voice to text, text to voice and what they call intents which are simple things like "turn off the lights in the office". The cloud offering from Nabu Casa not only funds the development of Home Assistant but also gives remote access if you want it. As part of that you can choose to offload some of the voice/text services to their cloud so that if you are just running it on a Pi it will still be fast.
I can't speak to home assistant specifically, but the last time I looked at voice models, supporting multiple languages and doing it Really Well just happens to require a model with a massive amount of RAM, especially to run at anything resembling real-time.
It'd be awesome if they open sourced that model though, or published what models they're using. But I think it unlikely to happen because home assistant is a sorta funnel to nabu casa
That said, from what I can find, it sounds like Assist can be run without the hardware, either with or without the cloud upgrade. So you could definitely use your own hardware, headset, speakers, etc. to play with Assist
shrug whisper seems to do well on my GPU, and faster than realtime.
Found what I was thinking of [1]
Part of my misremembering is I was thinking of smaller/iot usecase which, alongside the 10GB VRAM requirements for the large multilingual model, felt infeasible -shrug-
[1] https://git.acelerex.com/automation/opcua.ts/-/project_membe...
I've been using it to generate subtitles for home movies, for an aging family member who is losing their hearing, and it's phenomenal
The cloud sale is easy if you are an HA user already. If you don’t use Home Assistant right now, you probably aren’t the target audience. I purchase the yearly cloud service as it’s an easy way to support HA development. It also gives you remote access to your system without having to do any setup. It provides an https connection which allows you to program esp32 devices through Chrome. And now they added the ability to do TTS and STT on someone else’s hardware. HA even allows you to set up a local LLM for house control commands but route other queries directly to the cloud.
I don't mind paying for hardware. I do mind my privacy, and don't want that kind of information in the cloud, or even traces from encryption I haven't audited myself.
I wonder how this compares to the Respeaker 2 https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/
The respeaker has 4 mics and can easily cancel out the noise introduced by a custom external speaker
It's worth noting that product is listed in the "Discontinued Products" section of the linked wiki.
Both of the ReSpeaker products in the non-discontinued section (ReSpeaker Lite, ReSpeaker 2-Mics Pi HAT) have only 2 mics, so it appears that things are converging in that direction.
The S3-Box-3 also only has two mics, and I found I can talk to that from another room of the house and it detects what I said perfectly fine.
I don't just want the hardware, I want the software too. I want something that will do STT on my speech, send the text to an API endpoint I control, and be able to either speak the text I give it, or live stream an audio response to the speakers.
That's the part I can't do on my own, and then I'll take care of the LLMs myself.
All of these components are available separately or as add-ons for Home Assistant.
I currently do STT with heywillow[0] and an S3-Box-3 which uses an LLM running on a server I have to do incredibly fast, incredibly accurate STT. It uses Coqui XTTS for TTS, with very high quality LLM based voice; you can also clone a voice by supplying it with a few seconds of audio (I tested cloning my own with frightening results).
Playback to a decent speaker can be done in a bunch of ways; I wrote a shim that captures the TTS request to Coqui and forwards it to a Pi based speaker I built, running MPD which then requests the audio from the TTS server (Coqui) and plays it back on my higher quality speaker rather than the crappy ones built in to the voice-input devices.
If you just want to use what's available in HA, there's all of the Wyoming stuff, openWakeword (not necessary if you're using this new Voice PE because it does on-device wakeword), Piper for TTS, or MaryTTS (or others) and Whisper (faster-whisper) for STT, or hook in something else you want to use. You can additionally use the Ollama integration to hook it into an Ollama model running on higher end hardware for proper LLM based reasoning.
[0]heywillow.io
I do the same, Willow has been unmaintained for close to a year, and calling it "incredibly fast" and "incredibly accurate" tells me that we have very different experiences.
It's a shame it's been getting no updates, I noticed that, but their secret sauce is all open stuff anyway so just replace them with the upstream components; their box-3 firmware and the application server is really the bit they built (as well as the "correction" service).
If it wasn't fast or accurate for you, what were you running it on? I'm using the large model on a Tesla GPU in a Ryzen 9 server, using the XTTS-2 (Coqui) branch.
The thing about ML based STT/TTS and the reasoning/processing is that you get better performance the more hardware you throw at it; I'm using nearly £4k worth of hardware to do it; is it worth it? No, is it reasonable? Also no, but I already had the hardware and it's doing other things.
I'll switch over to Assist and run Ollama instead now there's some better hardware with on-device wake-word from Nabu.
My wife and I have been very happy with Home Assistant so far. The one thing we're missing is voice control, and until now it seemed like there just wasn't a clean solution for HA voice control. You were stuck doing some hobbyist shenanigans and hand-writing boatloads of YAML, or you were hooking up a HomeKit/Alexa which defeats the purpose of HA. This is a game-changer.
They recommend an N100 in the blog post, but I might buy one anyway to see if my HA box's Celeron J3455 will do the job.
I had great trouble simply connecting a Bluetooth speaker to use it as voice input and for sound output. The overall state of the sound subsystem for DIY voice assistants feels third-class at best.
As someone not that familiar with haas, can someone explain why there's not a clear path to replace Alexa or Google home? I considered using haas recently to get a gpt like response after being frustrated with Google home, but it seems this is a complete mess. is there a way to get this yet?
> explain why there's not a clear path to replace Alexa or Google home?
There is. I've used HA with their default assist pipeline (Cloud HA STT, Cloud HA LLM, Cloud HA TTS) and I've also plugged in different providers at each step (both remote and local for each part: STT/LLM/TTS) and it's super cool. Their default LLM isn't great but it works, plugging in OpenAI made it work way better. My local models weren't great in speed but I don't have hardware dedicated for this purpose (currently), seeing an entire local pipeline was amazing for the promise of it in the future. It's too slow (on my hardware) but we are so close to local models (STT/TTS could be improved as well but they are much easier to do already locally).
If this new HA hardware comes even close to performing as well as the Echo's in my house (low bar) I'll replace them all.
What does it use LLMs for?
Taking the text of what you said and figuring out what you want to do. It sends what you said plus a list of devices/states and a list of functions (to turn off/on, set temp, etc of devices). The LLM takes "Turn off basement lights" and turns that into "{function: "call_service", args: ['lights.off', 'entity-id-123']}" (<- Completely made up but it's something like that) that it passes back to HA along with what to say back to the user ("Lights turned off" or whatever) and HA will run the function and then do TTS to respond to you.
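A rough sketch of that round trip in Python, for anyone who wants to see the shape of it. Everything here is hypothetical: the function names, the entity id and the stand-in `fake_llm` are invented for illustration and are not Home Assistant's actual API or payload format.

```python
import json

# Hypothetical device functions; a real setup would call into the automation hub.
def turn_off(entity_id):
    return f"turned off {entity_id}"

def turn_on(entity_id):
    return f"turned on {entity_id}"

TOOLS = {"turn_off": turn_off, "turn_on": turn_on}

def build_request(utterance, devices):
    """Bundle what the user said with device states and the callable functions."""
    return {"utterance": utterance, "devices": devices, "functions": sorted(TOOLS)}

def fake_llm(request):
    """Stand-in for a real model: returns the kind of JSON an LLM might emit."""
    return json.dumps({
        "function": "turn_off",
        "args": ["light.basement"],
        "speech": "Lights turned off",
    })

def handle(utterance, devices, llm=fake_llm):
    """Ask the LLM what to do, run the chosen function, return (result, speech)."""
    reply = json.loads(llm(build_request(utterance, devices)))
    func = TOOLS.get(reply["function"])
    if func is None:
        # Never run a function the model made up.
        return None, "Sorry, I can't do that."
    return func(*reply["args"]), reply["speech"]

result, speech = handle("Turn off basement lights", {"light.basement": "on"})
print(result, "|", speech)
```

The speech string is what would be handed to TTS; the lookup in `TOOLS` is the guard against the model naming a function that does not exist.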
Looks great! The biggest issue I see is music. 90% of my use is "play some music" but none of the major streaming music providers offer APIs for obvious reasons. I'm not sure how you can get around that really.
To do this in Home Assistant, you'd probably want to run Music Assistant and integrate it in. Looks like they manage to support some streaming providers, not entirely sure how: https://music-assistant.io/music-providers/
Getting it to play the right thing from voice commands is a bit of a rabbit hole: https://music-assistant.io/integration/voice/
Are there any MacOS software versions of this? I've been looking for opensource wake-word for a "Hey Siri"-like integration, but I'm very apprehensive of anything, malicious or not, monitoring the sound input for a specific word in an efficient way.
OpenWakeWord has worked well for me especially using well-trained models like “Hey, Mycroft”.
Nice. A totally local voice assistant.
This makes sense for cars, where there's much local stuff to control. But for a home unit, what do you want to do that is entirely local? Turning the heat up and down gets boring after a while. If it does entertainment selection or shopping, it needs outside world connections.
(Today's rant: I recently purchased a humidifier. It's just a little unit with a water tank, a water-softening filter, and an ultrasonic vaporizer. That part works fine. Then there are the controls.
All this thing really needs is an on-off switch and a humidity knob, and maybe lights for power, humidification, and water tank empty. But no. It has five touch buttons and a round display about four inches across. The display is on even if the unit is off. Pressing the on/off button turns it on. If it's humidifying, there's a whole light show. The tank lights up purple. Swooping arcs of blue run up both edges of the round display. It's very impressive, especially in a dark bedroom. If you press and hold the second button for two seconds, about half the light show is suppressed.
There are three fan speeds, and a button for that. Only the highest one will propel the water vapor high enough to avoid it hitting the floor and uselessly condensing before it mixes with the air. So that feature was not necessary.
The display shows one number. It's usually the current humidity, but if you press the humidity set button, the number displayed becomes the setting, which is changed upwards by successive presses until it wraps around. After a few seconds, the display reverts to current humidity.
Turning the unit off or removing the water tank resets all settings to the default.
This is the low-end unit. The next step up comes with an IR remote. It's one way - the remote has buttons but no display. Since you have to be close to the display to use the buttons effectively, that doesn't help much. The step up after that is, inevitably, a cloud-based phone app.
So this thing could potentially be interfaced to a voice assistant. That's only useful if there's enough information coming back from the device that the assistant software knows what the device is doing, and the assistant software understands that device status. If all it does is send remote button pushes, the result will be frustration.
So you need some degree of intelligence at both ends - the end that talks to the human, and the end that talks to the device. If the user says "House, it's too dry in here", the assistant system needs to be able to check the status of the humidifier. Has power? Talking? On? Humidity setting reasonable? Fan running? Tank not empty? If it can't do that, it's part of the problem, not part of the solution.)
> what do you want to do that is entirely local?
Keeping my daily life from becoming public? These companies can't be trusted with the information they have. Why should I give them more that they can leak?
Self hosted isn't always local only. I have a vpn server on my home router and control my home assistant worldwide. No corporation controls my access or data.
Is anyone aware of an effort to repurpose Echo hardware to do HA voice?
I've looked into this, and found nothing. One can surely repurpose the case and speakers, but the microphones are soldered on-board, and the board is not hackable and needs to go. To the best of my knowledge, there are no ways to load a custom firmware on a newer Echo device - they're locked down pretty tight.
My plea / request : Make a home assistant a DROP IN replacement for a standard light switch. It has power, it adds functionality from the get-go (smart lighting), it’s placed in a convenient position for the room and no extra wires etc required.
The now 8-year-old blog post titled "Perfect Home Automation"[1] on the HA website agrees with you from the first heading, and is borne out by my personal experience too. Nobody in your house should need to retrain to do things they are already doing.
1. https://www.home-assistant.io/blog/2016/01/19/perfect-home-a...
Would a zigbee or z wave switch fit your needs? It’s “offline” but does need a hub
Look at Shelly light switches.
You've misunderstood what they're asking for. They're asking for Home Assistant hardware (microphone, speaker, wifi) that, instead of being a standalone box taking up space on the counter/table/etc, fits into the hole in the hall where they currently have a lightswitch.
I guess I did misunderstand, because that request seems strange to me. I’m assuming they have more than one switch. Which one should have Home Assistant on it? Seems like an odd deployment strategy. A pi isn’t that big..
Without trying to digress, but why not make it modular too ? I.e. base model is a smart switch, one unit is the “base” unit and the rest talk to that. Possibly even add further switches, dials (thermostat or dimmer etc). Perfect placement in my opinion.
So my original comment was not a misunderstanding. They are smart switch drop in replacements.
? I have my house packed to the brim with tplink Wi-Fi smart switches, they work fine.
The switches I linked are esp32. They live inside the wall. They get great signal.
Yes - exactly this. If there are multiple needed, then some can be smarter/ more capable than others, but this removes the “just another box and cable(s)” issue.
Agreed.
They sell UL rated models, have an option for cloud connectivity but zero requirement, your switch still works if the Shelly loses connectivity with whatever home automation server you have, and it's a small box that you wire in behind the switch.
They also make drop in replacement dimmer switches. Even easier than the small box style. https://us.shelly.com/products/shelly-plus-wall-dimmer
If it runs fully on premise that would be great. Im still not comfortable buying a device that records everything I say and uploads it to a cloud
Fully on-prem can be done if you've got the LLM compute power in place.
My experience with the Home Assistant voice pipeline is that nothing works and STT is terrible. I'll have to wait and see the reviews.
Genuine question - How hackable is this? Can I have the voice commands redirected to my backend server where I can process it as I please?
This is Home Assistant. Everything is hackable.
Inside Home Assistant the processing is delegated to integrations providing Speech-to-Text, command processing, Text-to-Speech. You can make custom integrations for all of them
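To make the "three swappable stages" idea concrete, here is a minimal Python sketch. It is not Home Assistant's integration API; the `Pipeline` class and the stand-in functions are assumptions made up for illustration, with text passed through as bytes in place of real audio.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pipeline:
    """Three pluggable stages: speech-to-text, command processing, text-to-speech."""
    stt: Callable[[bytes], str]
    agent: Callable[[str], str]
    tts: Callable[[str], bytes]

    def run(self, audio: bytes) -> bytes:
        text = self.stt(audio)      # audio in, transcript out
        reply = self.agent(text)    # your own backend decides what to answer
        return self.tts(reply)      # answer text in, audio out

# Stand-ins so the sketch runs without any models installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")

def my_backend(text: str) -> str:
    return f"You said: {text}"

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

pipeline = Pipeline(stt=fake_stt, agent=my_backend, tts=fake_tts)
print(pipeline.run(b"turn on the lights"))
```

Swapping `my_backend` for a function that posts the transcript to your own server is the "redirect to my backend" case asked about above.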
It's fully open-source. I think the default use-case is to have the voice commands processed locally
Probably as much as any other smart speaker without having to give your data away.
What I don't like is that most voice assistants perform really badly in my native language so I don't use them at all. For English speakers yes, but for everyone else not so much. I guess it will get better.
That is one of the major things that Home Assistant are trying to fix. They have groups working on most languages and are adding them to their open as they improve. https://www.home-assistant.io/voice_control/contribute-voice
Though separate hardware helps - I believe voice and automation can be integrated more seamlessly into our existing devices (phones/laptops) with high compute built in.
Llama and whisper are already public so that should help innovation in this area.
You can use your phone to text or talk to HA's assistant. I've done that a number of times when Alexa fails. Having dedicated hardware is a huge step up for me. I've tried their ESP32 mini cube assistant thing before and it showed a lot of promise but the hardware (speaker and mic, processor was fine) was lacking. This seems to be a good mic and speaker wrapped around a similar core so I'm super excited for it.
The voice input can really be done however you like, the benefit of a device like the Voice PE is the wake word detection on-device.
I have an office-style desk-phone (SNOM) connected to a SIP server and I can pick the receiver up and talk to the Assistant, but you can plug in any way you like to get the audio to/from HA.
With your phone, wake words are usually locked down by Apple/Google so you can't really have it hands-free, and that's the problem this device is solving; not the audio input itself, but the wake-word/hands-free input.
On an Android phone, you can replace the Google Assistant with the Home Assistant one, but you still have to activate it the usual way, press a button or launch the app etc.
With existing phones and laptops, there’s either activation friction (pressing the “listen to me” button) or the device has to be always listening, which requires a lot of trust in your hardware vendors.
With an open source and potentially local-only device, you can have your voice assistant and keep your privacy.
last i checked open source whisper does not support streaming or diarization out of the box. you really need both for a good voice assistant experience
Open as in 3d print files, rpi etc.? If so this is the project I am looking for!
I am very excited for this. One question I couldn’t find an answer for though is whether the hardware is open enough to be usable with other home automation systems. I am using OpenHAB and they too have an integrated voice assistant. I looked into migrating to HA a couple times but eventually gave up, primarily because it felt like such a waste of time to migrate a fully working environment with dozens of rules and scripts to yaml files.
It's all open and so should be able to work with OpenHAB as well but it would need somebody to either write a firmware that's compatible with the OpenHAB endpoints or add ESPHome integration into OpenHAB. Somebody might have already done that for their voice stuff. There is not much yaml in home assistant now unless you want it. I'd give it a go in a VM and see what it finds on your network :)
Moving a fully functional setup with complex rules and scripts is a daunting task
A good emphasis in the summary, that certain other companies will only focus on monetization at the expense of features and functionality.
Here's what I'm looking for in a voice assistant:
- Full privacy: nothing goes to the "cloud"
- Non-shitty microphones and processing: i want to be able to be heard without having to yell, repeat, or correct
- No wake words: it should listen to everything, process it, and understand when it's being addressed. Since everything is private and local, this is now doable
- Conversational: it should understand when I finished talking, have ability to be interrupted, all with low latency
- Non-stupid: it's 2024, and alexa and siri and google are somehow absolutely abysmal at doing even the basics
- Complete: i don't want to use an app to get stuff configured. I want everything to be controlled via voice
> No wake words: it should listen to everything, process it, and understand when it's being addressed
Even humans struggle with this one - that's what names are for!
Yeah, I’m having a hard time imagining how no-wake-word could work in practice.
This is one advantage of a system with a constrained set of commands/grammars, as opposed to the Alexa/Siri model of trying to process all arbitrary text while in active mode. It can simply ignore/discard any invocations which don't match those specific grammars (and no need to wait to confirm that the device is awake).
"Computer, turn lights to 50%" -> "turn lights to fifty percent" -> {action: "lights", value: 50}
"My new computer has a really beefy graphics card" -> "has a really beefy graphics card" -> {action: null}
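A small Python sketch of the constrained-grammar idea from the examples above. The wake word, the regex patterns and the action format are all assumptions picked for illustration, and it only understands digits (so it expects "50%" or "50 percent", not "fifty percent").

```python
import re

WAKE_WORD = "computer"  # assumed wake word

# Each grammar pairs a pattern with a function that builds the action.
GRAMMARS = [
    (re.compile(r"turn (?:the )?lights to (\d{1,3})\s*(?:%|percent)"),
     lambda m: {"action": "lights", "value": int(m.group(1))}),
    (re.compile(r"turn (?:the )?lights (on|off)"),
     lambda m: {"action": "lights", "value": 100 if m.group(1) == "on" else 0}),
]

def parse(transcript):
    """Return an action dict, or {"action": None} if no grammar matches."""
    text = transcript.lower().strip()
    # Keep only what follows the wake word; no wake word means no action.
    _, found, rest = text.partition(WAKE_WORD)
    if not found:
        return {"action": None}
    rest = rest.strip(" ,.!?")
    for pattern, build in GRAMMARS:
        match = pattern.fullmatch(rest)
        if match:
            return build(match)
    # Heard the wake word, but the rest is not a known command: discard it.
    return {"action": None}

print(parse("Computer, turn lights to 50%"))
print(parse("My new computer has a really beefy graphics card"))
```

Using `fullmatch` rather than `search` is the point: anything that is not exactly one of the known commands gets dropped instead of being guessed at.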
Like that really annoying friend who jumps in every other sentence with "Well actually..."
I have a coworker that set up an Alexa a year or so ago, I don't know what the issue was, but it would jump into Teams meetings after every noise in his house.
after setting up the system, if I say "turn the ceiling lights to 20%", who else would be changing the lights?
But also, post-fix wake word would also be natural if it was recording all the time. "turn on the lights, Google", for instance
Sure, if the system is set up to only respond to very specific commands that humans would not respond to, I guess that could work. I was thinking more about the other way around, where a person might speak to someone else in the room and be overheard and acted upon - "turn on the lights!" could be a command for the computer controlling the room, or the human standing next to the Christmas tree, for example.
Someone in a TV show that you're watching?
How much are you willing to pay though? Full privacy means powerful enough hardware to do everything else on the list on-device and _quickly_. I don't know that most people have the budget for that
Looks like you are in the market for a butler.
Especially your last point will, IMO, not be possible for a long time.
I'd imagine with 1-2 TVs constantly talking, general conversations and other random noises it'd get expensive quick. Definitely closer to a rack than a RaspPi or old laptop hardware wise. Also add to that more/better mics for coverage and the complexity of it guessing when you're asking it to remind you to buy toothpaste or your SO... It can probably be done by tracking who's home, who's in the room with the speaker, who the speaker is, etc but it's all cost..
without a wake word that's a lot of compute unless you live alone and don't watch tv or listen to music
they even used a wake word in star trek fwiw
While we are getting shoveled AI keyword everywhere, I'm actually disappointed I don't see it here.
The first thought I had when encountering LLMs was that they can finally make these devices understand you and make them finally useful... and I don't need to know some prescripted keywords.
You can actually integrate LLMs with Assist pipelines, it’s just orthogonal to this hardware announcement. Check out https://www.home-assistant.io/blog/2024/06/05/release-20246/...
It's also really cool. You can make it so that the home assistant itself first tries to understand what you want to do, like turning on the living room lights or setting the bathroom temperature to 21.5 degrees celsius. If the assistant pipeline does not understand what you are asking for, it can send your question to the LLM of your choice. You can also make the LLM control the lights, heat etc, but at least for now ChatGPT is pretty bad at that. So let home assistant do the home automation, and then let ChatGPT answer your questions about the most popular ruler in 19th century France.
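The local-first, LLM-as-fallback ordering can be sketched in a few lines of Python. The matching rules and the `fake_llm` stand-in are hypothetical, just enough to show the control flow rather than how Home Assistant actually matches sentences.

```python
def local_intent(text):
    """Tiny stand-in for the built-in sentence matcher (hypothetical rules)."""
    text = text.lower()
    if "turn on" in text and "lights" in text:
        return "Turned on the lights"
    if "turn off" in text and "lights" in text:
        return "Turned off the lights"
    return None  # not a command we understand locally

def fake_llm(text):
    """Stand-in for whichever LLM you configure as the fallback."""
    return f"(LLM answer to: {text})"

def answer(text, llm=fake_llm):
    """Try the local matcher first; only unknown requests go to the LLM."""
    handled = local_intent(text)
    if handled is not None:
        return ("local", handled)
    return ("llm", llm(text))

print(answer("Turn on the living room lights"))
print(answer("Who was the most popular ruler in 19th century France?"))
```

The upside of this ordering is that home control keeps working, and stays private, even when the LLM is slow or unreachable.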
Home Assistant is such a fantastic project. I've been waiting for something like this for a long time; I just pre-ordered three.
My only remaining wish is that I can replace Siri with this (without needing some workaround)
All I want is a voice assistant that I can call "computer" like Star Trek, I don't want to have to say a brand name thankyou!
If you run openWakeWord, “computer” is one of very many pretrained models the community has made: https://github.com/fwartner/home-assistant-wakewords-collect...
You could’ve always set Alexa to respond to “Computer” instead.
Ah I admit I haven't looked into it for several years, good to see they added the feature - I might have to grab one
The problem is that it will go off every single time you watch Star Trek.
Can confirm, this works fabulously!
And on back order everywhere. I just spent the last 2 weeks getting an esp32-s3-box set up to do this but its lack of audio out really irks me.
And the mic is not all that great either. I have a couple of them but they just weren't reliably picking up my voice and I couldn't hear the reply either (when it did hear me). I figured it would be easy to add a speaker to them but that sent me down a rabbit hole that I gave up on and put them in a drawer. I'll buy this for sure though because when the ESP32 box thing worked it worked really well and I loved being able to swap out parts of the assist pipeline.
To be fair, the issue with the Box-3 is HA's implementation; I used it with heywillow.io and it was incredible, I could speak to it from another room and it would pick up perfectly.
The audio out is terrible so I wrote a shim-server that captures the request to the TTS server for heywillow and sends it to a speaker I built myself running MPD on a Pi with a nice DAC and have it play the responses instead of the box-3's tiny speaker.
I don't expect the audio-out on this to be much better with its tiny speaker, but at least it has a 3.5mm jack.
I'm going to look into what that Grove port can do too and perhaps build a new speaker "module" that the Voice PE can sit on top of to make it a proper music device.
I ended up modding the s3 yaml to turn off the internal speaker and to forward all voice responses to a google hub.
> And on back order everywhere.
I just clicked through to my large country and the first vendor and was able to buy 2 for delivery tomorrow. So it says. So maybe not on back order everywhere.
If it's an ESP32-S3-BOX-3, there is audio out (assuming you mean being able to send arbitrary audio to it to play). Due to the framework used it's not available, but there's an alternative firmware available on GitHub that uses the newer framework and it exposes a media player entity you can send any audio to.
I didn’t have the -3 version. Learned the hard way after loading up that alt framework last week and the screen went blank. I did end up implementing that same solution on my hardware though.
anyone tried https://getleon.ai/ ?
I tried years ago, I don't think I got it working, ended up using Rhasspy/voice2json instead (TIL: the creator of both is now the Voice Eng Lead for Home Assistant).
Looks like the GitHub is still somewhat active, although their roadmap links to a dead Trello: https://github.com/leon-ai/leon
Perfect, will dig more into it. Currently I'd like to have a Spotify client without UI for my kids ;)
sorry if this question takes away from the great strides the team has made but wouldn't it be much easier (hardware wise) to jailbreak one of the existing great hardware thingies like Apple HomePod or the Google one or Alexa?
The fact that it hasn't (widely?) been done yet suggests the answer is "no".
The hardware in those devices is generally better, most of them have much better speakers, but they're locked down, the wake-word detection hardware isn't open or accessible so changing it to do what we need would be difficult, and you're just hoping there's a way in.
Existing examples of opening them (as in freedom) replace the PCB entirely, which puts you back to square one of needing open hardware.
This feels like the right approach to me; I've been building my own devices for this purpose with off-the-shelf parts, and designing enclosures, but this is much sleeker; I just hope an add-on or future version comes with much better audio out (speakers) because that's where it and things like it (e.g. the S3-Box-3) are really lacking.
I picked up an Echo Dot a few years ago when Amazon were practically giving them away, thinking that surely someone would have jailbroken it by now to allow it to be used with Home Assistant.
It was only after researching later that I discovered that this wasn't currently possible and the recommended approach was to buy some replacement internals that cost more than the device itself (and if I recall correctly, more than the new Home Assistant Voice Preview Edition).
I don't think they are that easy to jail break but I may be wrong. I think they wanted to create an open device that people could build from rather than just a hacked up alexa.
or maybe find cheap Chinese smart speaker which is hackable?
It's not clear to me from the description if this is also completely open source hardware. Are the schematics, BoM, firmware published under a permissible license? If so, where are they accessible?
And if not, I would be curious to know why it hasn't been fully open sourced.
I would think so in the end. They talked about the case design being open. The software and firmware are all open already and they said that they really wanted people to be able to take these components and make new devices.
They have released the designs for the yellow so I assume it will all come. https://github.com/NabuCasa/yellow
Can someone describe the use case here? I don't quite understand what its purpose is.
Is this a fully-private, open source alternative to Alexa, that by definition requires a CPU locally to run ?
Is the device supposed to be the nerve center of IoT devices ?
Can it access the Wifi to do web crawls on command (music, google, etc)?
The nerve center would be your Home Assistant instance, which is not this device. You can run Home Assistant on whatever hardware you like, including options sold by Nabu Casa.
This device provides the microphone, speaker, and WiFi to do wake-word detection, capture your input, send it off to your HA instance, and reply to you with HA’s processed response. Whether your HA instance phones out to the internet to produce the response is up to you and how you’ve configured it.
If you have home automation, surely you've run into this situation when Comcast flakes (or similar):
"OK, Google, turn lights on" "Check your connection and try again"
As far as I can tell, if you have Home Assistant + this new device, you've fixed that problem.
You should talk to Sonos about partnering with them. They currently have a very limited Sonos voice assist, plus Google Voice and Alexa, but the latter two are limited pre-LLM assistants.
I’m assuming they eventually want to create their own LLM and something privacy focused would be good match for their customers. I don’t know how they feel about open source though
RIP Mycroft. A tad too early.
Nabu Casa employ one of the Mycroft devs now and i think some of the tech came from that project so it's not all gone :)
What voices do they use?
I think in some ways it could redefine how we think about voice control... taking it from the cloud and putting it back into users' hands, like literally
Well shoot. Now i want to record everything in my house and transcribe it for logs. I already wanted to do that but didn't think there was a sane way.. assuming this lets me create a custom pipeline, that's wicked
It isn't even one year since the press stories about how dumb a product Alexa was and how it makes no money and all the devs are getting laid off. Something changed now?
Well, the various Echo devices were allegedly built as loss leaders in the hope people would use them to make orders on Amazon. This is backed by the most active open source project on GitHub, which already has extensive support for voice pipelines both with and without LLMs, and is likely priced sensibly.
A lot has changed in the open source ecosystem since commercial assistants were first launched. We have reliable open source wakeword detectors, and cheap/free LLMs can do the intent parsing, response generation, and even action calling.
It was a bad product at making money for Amazon, but they are useful for smart homes. Home Assistant is pretty squarely in the smart home category.
I bought two the second they were announced, I already use the software stack with the m5 atoms and they are terrible devices, but the software works well enough for me.
If it's not clear, the Home Assistant business plan is different from the Amazon one for Alexa... and the Home Assistant open source project is even more different.
I've been using the HA cloud voice assistant on my phone for the past few weeks, and it's such a great change from Alexa, because integrating new services and adding sentences is actually possible.
Alexa, on the other hand, won't even allow a third party app to read its shopping list. It's no longer clear to me why Alexa even exists any more except as a kitchen timer.
They must be working on a LLM backend for it so it isn't dumb as a rock.
Nothing makes sense otherwise, agreed.
Amazon lost 25 billion dollars on Alexa (between 2017 and 2021, from WSJ https://archive.is/uMTOB). Selling the hardware at a loss and I imagine a bigger portion was the thousands of people they had working in that division.
So yeah, Alexa is a dumb product... for Amazon. No one uses Alexa to buy anything from Amazon because the only way you can be sure of what you're ordering from Amazon is to be looking at the site. Otherwise you might get cat food from "JOYFUNG BEST Brand 2024" and not Purina.
Voice Assistants for Home Automation, like what Home Assistant is offering, are awesome. And this in particular is exciting exactly because of Alexa's failure as a product. Amazon clearly does not care about Alexa now, it's been getting worse as they try to shoehorn in more and more monetization strategies.
> “We worried we’ve hired 10,000 people and we’ve built a smart timer,” said a former senior employee.
How the hell did Amazon hire that many people to develop such low-tech devices.
> the only way you can be sure of what you're ordering from Amazon is to be looking at the site
Ah . . . an optimist!
Huh? Being able to do things like turn off lights or change the TV volume with your voice is actually quite a nice convenience
Not super convinced the XMOS audio processing chip is really gonna buy a lot. Trying to do audio input processing feels like a dynamic task, requiring such adaption. XMOS is the most well known audio processor and a beast, but not sure it's really gonna help here!
I really hope we see some open-source machine-learned systems emerge.
I saw Insta360 announce their video conferencing solution today. Optics looks pretty medium, nothing wild, but Insta360 is so good at video that I expect it'll be great. But there's a huge 14 microphone array on it, and that's the hard job; figuring out how to get good audio from speakers in a variety of locations around a room. It really made me wish for more open source footing here, some promising start, be it the conference room or open living space. I've given all of 60s to look through this, and was kinda hopeful because heck yeah Home Assistant, but my initial read isn't super promising, isn't that this is starting the proper software base needed to listen well to the world.
https://petapixel.com/2024/12/17/the-insta360-connect-is-a-2...
They showed a video at the end of their broadcast last night comparing what the raw microphone hears and what comes out of the XMOS chip, and you can hear a much clearer voice all the time, even when there is noise or you are far away from the device. It is also used to cancel out the music if you are using its speaker output. I don't think it's doing any voice processing, but it's cleaning up the audio a lot, which makes the job of the wake word processor and the speech-to-text a lot easier. Up until now this was missing from a lot of the homemade voice assistants, and I think it's why Alexa can understand you from the next room but my homemade one struggles in all but quiet conditions.
Alexa Echo Dot has 6 or 7 microphones. I'd expect that makes it much easier to filter out voices directionally than with only the 2 microphones this hardware has. I hope they release a version with more microphones.
how does this compare to ESP32-S3-BOX-3B ?
What is a good GPU to put in a home server that can run the TTS / STT and the local LLM required to make this shine?
A 3090 is too expensive and power hungry. Maybe a 3060 12GB? Is there anything in the "workstation" lineup that is more efficient, especially since I don't need the video outs?
Majel Barrett voice please.
i don't wanna talk to a computer
> i don't wanna talk to a computer
You are in luck. You can get a human butler. But not for $59.
I'm also very excited. I've had some ESP32 microphones before, but they were not really able to understand the wake word, sometimes even when it was quiet and you were sitting next to the mic.
This one looks like it can recognize your voice very well, even when music is playing.
Because... when it works, it's amazing. You get that Star Trek wake word (KHUM-PUTER!), you can connect your favorite LLM to it (ChatGPT, Claude Sonnet, Ollama), you can control your home automation with it and it's as private as you want.
I ordered two of these, if they are great, I will order two more. I've been waiting for this product for years, it's hopefully finally here.
As a side note, it always slightly puzzles me when I see "voice interface" and "private" used together. Maybe it takes living alone to issue voice commands and feel some privacy.
(Yes, I do understand that "privacy" here is mostly about not sending it for processing to third parties.)
Private meaning that a big American corporation is not listening and using my voice to either track me or teach their own AI service with it.
> Yes, I do understand that "privacy" here is mostly about not sending it for processing to third parties.
Then why does it puzzle you?
You wouldn’t ask your partner deeply private questions in front of your mom either. Not sure how you think it’s a dig against voice assistant privacy.
There are levels of privacy. Because I'm not going to ask deeply private questions, it doesn't mean that I want everyone to be snooping into what I'm planning to eat tonight.
I don't like these interfaces because unless they are button activated or something, they must be always listening and sending sound from where you are to a 3rd party server. No thanks. Of course this could be happening with my phone, but at least it would have to be a malicious action to record me 24/7
The developers could do sneaky things with any device that has wifi and a mic.
I mean... That's not true, though.
The main pitch of a tool like this is that I can absolutely verify it's not true.
I'm currently running a slightly different take of this (ESP32-based devices, with Whisper through Willow inference server, with Willow autocorrect, tied into Home Assistant).
For context, it works completely offline. My modem can literally be unplugged and I can control my smart devices just fine, with my voice. Entirely on my local network, with a couple of cheap devices and a ten year old gaming PC as the server.
My data
How these ESP32-systems work is that you send a wake word to the device itself. It can detect the word without an internet connection, the device itself understands it and wakes up. After the device is woken up, it sends your speech to home assistant, which either
I'm planning on building a Proxmox rack server next year, so I'm probably going to just handle all the discussions locally. The Home Assistant cloud is quite private too, at least that's what they say (and they're in the EU, so I think there might be truth in what they say)...
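For anyone who wants the flow described above in concrete terms, here is a minimal sketch in Python. It is not the actual ESPHome or Home Assistant code: the `Satellite` class, the string "chunks" standing in for audio frames, and the `<silence>` end-of-utterance marker are all made up for illustration. The only point it shows is that nothing is forwarded until the wake word has been matched on the device itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Satellite:
    """Hypothetical voice satellite: wake word is checked on-device,
    audio is forwarded to the server only after it has been heard."""
    is_wake_word: Callable[[str], bool]    # stand-in for an on-device detector
    send_to_server: Callable[[str], None]  # stand-in for the link to the local server
    awake: bool = False

    def feed(self, chunk: str) -> None:
        if not self.awake:
            # Nothing leaves the device until the wake word is matched locally.
            if self.is_wake_word(chunk):
                self.awake = True
            return
        if chunk == "<silence>":
            # End of the utterance: go back to local-only listening.
            self.awake = False
            return
        self.send_to_server(chunk)


sent: List[str] = []
sat = Satellite(is_wake_word=lambda c: c == "computer", send_to_server=sent.append)
for chunk in ["chatter", "computer", "turn off", "the lights", "<silence>", "more chatter"]:
    sat.feed(chunk)
```

In this toy run only the two chunks spoken between the wake word and the silence marker end up in `sent`; the background chatter before and after never leaves the device.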
I'm trying to understand. Is there an SDK I can use to enhance this? Or is this a package product?
I'm really hoping it's the former. But I don't see any information about how to develop with this.
Yep, ESPHome SDK. It's all open source and well-documented:
https://esphome.io/
Some notable blog posts, docs and a video on the wake words and voice assistant usage:
https://community.home-assistant.io/t/on-device-wake-word-on...
https://esphome.io/components/voice_assistant.html
https://www.home-assistant.io/voice_control/create_wake_word...
https://www.youtube.com/watch?v=oSKBWtBJyDE
A group buy for an existing product makes sense. Want to buy a 24TB Western Digital hard drive? It’s $350. But if you and your 1000 closest friends get together the price can be $275.
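To put numbers on that example, here is a quick back-of-the-envelope calculation in Python. The two prices are the ones quoted above; the function name and the round figure of 1000 buyers are just assumptions for illustration.

```python
def group_buy_savings(retail_price: float, group_price: float, buyers: int):
    """Return (saving per unit, total saving, saving as a fraction of retail)."""
    per_unit = retail_price - group_price
    return per_unit, per_unit * buyers, per_unit / retail_price


# Prices from the hard drive example; 1000 buyers is an assumed round number.
per_unit, total, fraction = group_buy_savings(350, 275, 1000)
print(f"${per_unit} per drive, ${total} overall, {fraction:.1%} off retail")
```

So each buyer saves $75, which is a bit over a fifth of the retail price.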
But for a first time unknown product? You get a lot fewer interested parties. Lots of people want to wait for tech reviews and blog posts before committing to it. And group buys being the only way to get them means availability will be inconsistent for the foreseeable future. I don’t want one voice assistant. I want 5-20, one for every space in my house. But I am not prepared to commit to 20 devices of a first run and I am not prepared to buy one and hope I’ll get the opportunity to buy more later if it doesn’t flop. Stability of the supply chain is an important signal to consumers that the device won’t be abandoned.
> But for a first time unknown product? You get a lot fewer interested parties. Lots of people want to wait for tech reviews and blog posts before committing to it.
I used to think so too. But then Kickstarter proved that actually, as long as you have a good advertising style, communicate well, and get lucky, you can get people to contribute literal millions for a product that hasn't even reached the blueprints stage yet.
Kickstarter isn't a group buy.
A group buy is when you want to buy a bunch of an existing product at wholesale prices. Kickstarter is about funding new projects that don’t exist yet. Like if the wholesaler refuses to sell you 1000 video cards, just give the money back. If you spend the Kickstarter money and can’t land a product there isn’t much you can do for refunds.
> I am not prepared to buy one and hope I’ll get the opportunity to buy more later
As long as this thing works and there's demand for it, I doubt we'll ever run out of people willing to connect an XU316 and some mics to an ESP32-S3 and sell it to you with HA's open source firmware flashed to it, whether or not HA themselves are still willing to.
I agree! I mean, just look at the market for Meshtastic devices! So many options! Or devices with WLED pre-installed! It'll take a Lot for ESP32 to go out of style
There are two types of "group buy". The one that you illustrated, but also one not only focused on saving bucks but also helping small, independent makers/producers to sell their usually more sustainable or more private product (which is also usually more expensive due to the lack of economies of scale).
Kickstarter shows that a lot of people feel differently.
Kickstarter isn’t a group buy. Similar, but not the same.
>> I want 5-20, one for every space in my house.
I don't have a small house, but I'm trying to think why I would need even 5 of these, let alone 20. The majority of the time my family spends together is in the open layout on our main floor where the kitchen flows into the living room with an adjacent sun room off the living room.
I'm genuinely curious why you need so many of these.
I do agree that if you do have a legit use case for so many, buying so many in essentially a first run is a risky thing. Whether this will be supported for more than a fleeting couple of years is also a huge risk.
I have four bedrooms, living/family room, study, office, rumpus room, garage, workshop, and am trying to build out a basement with three more rooms. Each of these rooms has some form of smart lighting or devices like TVs or thermostats that people have a much easier time controlling with voice than phone apps. Granted this may sound extravagant but I have a large family so all this space is very well utilized, hence the need for a basement expansion. Again, at $25/room and bought over time the Echo Dots are a really simple way to add very easy to use controls that require almost no user training. We pause the living room TV and “set condition two throughout the fleet” at the end of the day with these devices.
What's condition 2?
Just using where I might want it in my childhood home as an example:
- master bedroom
- master bathroom
- grandma's room
- my room
- brother's room
- upstairs bathroom
- upstairs loft?
- office room
- living room/diningroom
- kitchen/kitchentable/familyroom
- garage?
9-14 devices for a 5 person household. May be a stretch since I'm not sure if my grandma could even really use it. Bathroom's a stretch but I'm imagining being in the shower and wanting to note multiple showerthoughts
I invested in Mycroft and it flopped. Here’s hoping some others can go where they couldn’t.
I think Mycroft was unfortunately just ahead of its time. STT was just becoming good enough but NLU wasn’t quite there yet. Add in that you’re up against Apple, Google, and Amazon, who were able to add integrations like music and subsidize the crap out of their products.
I just think this time around is different. OpenAI's Whisper gives them amazing STT, and LLMs can far more easily be adapted for the NLU portion. The hardware is also dirt cheap, which makes it better suited to a narrow use case.
I guess the difference here is that HA has a huge community already. I believe the estimate was around 250k installations running actively. I suspect a huge chunk of the HA users Venn diagram slice fits within the voice users slice.
Our estimates are more than a million active instances https://analytics.home-assistant.io/
> Analytics in Home Assistant are opt-in and do not reflect the entire Home Assistant userbase. We estimate that a third of all Home Assistant users opt in.
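The extrapolation behind that estimate is simple enough to write down. A rough sketch in Python, where the one-third opt-in share comes from the quote above and the opted-in count of 350,000 is a purely hypothetical input (the real figure is whatever the analytics page currently shows):

```python
def estimate_total_installs(opted_in: int, opt_in_share: float = 1 / 3) -> int:
    """Scale the opted-in count up by the assumed share of users who opt in."""
    if not 0 < opt_in_share <= 1:
        raise ValueError("opt_in_share must be in (0, 1]")
    return round(opted_in / opt_in_share)


# 350_000 is a made-up count, used only to show the arithmetic.
estimated = estimate_total_installs(350_000)
print(estimated)
```

Any opted-in count above roughly 333k puts the estimate past a million, and the result is only as good as the one-third guess.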
I'm a big fan of home assistant, and use it to control a LOT of my home, have done for years, have tonnes of hardware dedicated to and for it, and I've also ordered some of these Voice devices.
I'm also opted OUT of the analytics.
IIRC one of the main devs behind this device came from Mycroft.
Yep, Mike Hansen was on the live stream launching the new device. He also notably created Rhasspy [1], which is open-source voice assistant software for Raspberry Pi (when connected to a microphone and speaker).
[1] https://rhasspy.readthedocs.io/en/latest/
OP's username checks out.
I believe Mycroft was killed in part due to a patent troll:
https://www.theregister.com/AMP/2023/02/13/linux_ai_assistan...
Hopefully the troll is no longer around
I think another part is that there is a failure mechanism on their boards that was recently identified: https://community.openconversational.ai/t/sj-201-sj201-failu...
The short version, from the post, is that there are 4 capacitors that are only rated for 6.3 V, but the power supply is 12 V. Eventually one of these capacitors will fail, causing the board to stop working entirely.
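As a sanity check on why that is a problem, here is the kind of rating check a design review would do, sketched in Python. The two voltages are the ones quoted from the post; the function names are made up, and it assumes the capacitors really do see the full supply voltage.

```python
def is_overstressed(rated_v: float, applied_v: float) -> bool:
    """True if the applied voltage exceeds the part's rated voltage."""
    return applied_v > rated_v


def stress_ratio(rated_v: float, applied_v: float) -> float:
    """Applied voltage as a multiple of the rating (above 1.0 means over the rating)."""
    return applied_v / rated_v


# Figures from the post: 6.3 V rated parts on a 12 V supply.
overstressed = is_overstressed(6.3, 12.0)
ratio = stress_ratio(6.3, 12.0)
print(overstressed, round(ratio, 2))
```

Under that assumption the parts would be running at nearly twice their rating, which fits the "eventually one will fail" description.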
It would be hard for a company to stay in business when they are fighting a patent troll lawsuit and having to handle returns on every device they sold through Kickstarter.
Not really sure what the benefit of a group buy would be here. Nabu Casa, the company that supports the development of Home Assistant and developed this product, already has a few products they sell. They had this stocked all over the world for the announcement and it sold out. I assume they had already made a few thousand. They will get more stock now and it will sell just like the other things they make. Any profit from this will go back into development of Home Assistant.
Heh thus far I've been an excited spectator of HomeAssistant, and wasn't aware of Nabu Casa until doing research for a different comment on the thread. I do love and appreciate their model here
I guess the benefits that came to mind are
- alternative crowdsourced route for sourcing hardware, to avoid things like that raspberry pi shortage (although if it's due to broader supply chain issues then this doesn't necessarily help)
- hardware forks! If someone wanted a version with a more powerful ESP32, or a GPS, or another mic, or an enclosure for a battery and charging and all that, took the time to fork the design to add these features, and found X other users interested in the fork to get it produced... (of course I might be betraying my ignorance on how easy it is to set up this sort of alternative manufacturing chain or what unit amounts are necessary to make this kind of forking economical)
Your idea about group buys is really intriguing. I wonder if the community might organically set something like that up once there’s enough interest